git-annex could be useful to anyone looking to manage a large volume of photos. However, it has a steep learning curve, even if you are already a proficient git user (and git itself has a learning curve). This post covers my experience on that learning curve, with a focus on using git-annex for photography. It assumes familiarity with the command line and basic familiarity with git. The git-annex assistant might be useful if you want to synchronize folders without using the command line; however, I haven't tried it, and I'm not convinced it offers a good tradeoff between convenience and handling "gotchas."
Disclaimer: Use this advice at your own risk of data loss.
To be useful, git-annex needs your photos to exist as individual files on a filesystem. By contrast, some image organization programs like Apple Photos, Lightroom, and Capture One will ingest photos into a catalogue managed by the program. We could put the catalogue itself in git-annex, but doing so would not provide many of the benefits that I appreciate.
I decided I wanted my photos on the filesystem even before I decided that I wanted to use git-annex. Photos on the filesystem, once managed by git-annex:

* Can have their content present locally or absent; when absent, you `git annex get` it from a remote. This is the feature that first made me consider git-annex, since I was looking for an easier way to manage this state than the rsync-based scripts that I was using before.
* Are synced explicitly (you run `git annex sync`), not by a background sync.
* Can be queried with `git annex whereis` to check which remotes have the content.

Much like git, you don't need to understand everything about git-annex to get what you want out of it. But you should know some things, maybe even before you dive into the quickstart:
Locked files are not my favorite feature. By default, git-annex adds files in a "locked" state. This does not use the "locked file" feature on macOS (which I also don't use), but it is similar, and it includes making the file read-only. If you `git annex unlock` a file, you now have two copies of the file on disk: an "unlocked" copy in your working directory, which you are free to modify, and an original copy kept by git-annex. If you first configured `git config annex.thin true`, then you will avoid using double the space, but the docs caution that with this configuration, "any modification made to a file will cause the old version of the file to be lost from the local repository."
At this point, we should consider whether we actually want to keep old versions of files. Some applications of git-annex will want to keep old file versions. I just don't think that normal photography is one of those applications. We should be using non-destructive editors, and I personally don't feel the need to apply version control to my edits (other than occasionally keeping multiple versions of the same image, which is a manual form of version control). I also don't feel the need to version-control changes that I make to file metadata (for example, when I use exiftool to add GPS metadata or fix time zones).
So, perhaps we should configure `git config annex.thin true` as well as `git config annex.addunlocked true`. Then, you might think that we would usually be working with unlocked files, and we wouldn't need to worry about them taking double the space, although we wouldn't get to keep old versions of modified files. Alas, that mental model would be too simple. For one thing, the docs said that the old version of the file would be lost from the local repository. Remote repositories will probably still have the old version, even after syncing. For another thing, the docs are not correct on macOS with APFS. We actually do still have the old version of the file, we can restore it, etc. The flip side of that is that if you modify the file, then you will have two copies in your local filesystem. I think this is because macOS with APFS uses reflinks, which enable copy-on-write behavior.
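The copy-on-write behavior is easy to observe with plain `cp`, independent of git-annex. A minimal sketch (on Linux, `cp --reflink=auto` attempts a reflink; the macOS equivalent is `cp -c`, which makes an APFS clone):

```shell
# Create a file and make a reflink-style copy of it.
# --reflink=auto falls back to a normal copy on filesystems
# without reflink support (on macOS, use `cp -c` instead).
echo "original contents" > original.txt
cp --reflink=auto original.txt clone.txt

# Modifying the clone triggers copy-on-write: the original is
# untouched, even if the two copies were sharing data blocks.
echo "modified contents" > clone.txt

cat original.txt   # still "original contents"
cat clone.txt      # now "modified contents"
```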
Reflinks are great. It would also be great if git-annex used this behavior without needing `git config annex.thin true` and `git config annex.addunlocked true`, or maybe even if git-annex just didn't have the locked files concept.

Recommendation: If it is not important for you to keep old versions of files, then set `git config annex.thin true` and `git config annex.addunlocked true` to make your life easier. Set these from within the repo's working directory, immediately after you create or clone a new repo. But you will probably still be keeping old versions in some cases. To remove these old versions (reclaiming storage space), you can periodically run `git annex unused` followed by `git annex dropunused <range>` on each of your repos.
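To make the recommendation concrete, here is a sketch of the setup in a scratch repo (the directory name is arbitrary; in practice you would run the two `git config` lines inside each of your photo repos, right after creating or cloning it):

```shell
# Demo in a scratch repo; in a real workflow, cd into your
# existing photo repo instead of creating one.
git init --quiet scratch-repo
cd scratch-repo

# The two settings recommended above:
git config annex.thin true        # unlocked files share storage with the annex
git config annex.addunlocked true # `git annex add` leaves files unlocked

git config annex.thin         # prints "true"
git config annex.addunlocked  # prints "true"

# Later, to reclaim space from old versions (requires git-annex):
#   git annex unused
#   git annex dropunused <range>
```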
This is a potentially confusing behavior, just something to be aware of.
git-annex will deduplicate identical files within a repo. Suppose you run the following:

```shell
cd your_repo
cp ~/Pictures/photo.jpg .
mkdir subdir
cp ~/Pictures/photo.jpg subdir/
git annex add photo.jpg subdir/
git commit -am "added photo twice"
```
Your git-annex repo will only have a single copy of photo.jpg. On macOS with APFS, I believe both `photo.jpg` and `subdir/photo.jpg` will be reflinks that share the same data blocks (at least if you're using `annex.thin` and unlocked files as I advised above). If you modify just `photo.jpg`, then you trigger a copy-on-write, and the files are no longer duplicates, at least on macOS with APFS; you may need to be careful with this pattern on other filesystems. So far, so good. But if you never modified either of the files, and you run `git annex drop photo.jpg` while they are still duplicates, then the content would be entirely dropped from your local repo. Neither `photo.jpg` nor `subdir/photo.jpg` would have content. If you `git annex get photo.jpg` to get the content from a remote, then the files at both paths would again have content.
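The deduplication falls out of how git-annex names content: by default, annexed files are stored under a key derived from their SHA-256 checksum (the SHA256E backend), so identical files map to the same key and the same single stored object. A quick illustration of the underlying idea using plain `sha256sum` (file names and contents are made up):

```shell
# Two identical files, as in the example above.
mkdir -p demo/subdir
echo "pretend this is a photo" > demo/photo.jpg
cp demo/photo.jpg demo/subdir/photo.jpg

# Identical content yields identical checksums, so git-annex
# would store both files under one key, i.e. one copy on disk.
# (On macOS, use `shasum -a 256` instead of `sha256sum`.)
sha256sum demo/photo.jpg demo/subdir/photo.jpg
```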
`git annex sync` is a key command, and it's worth understanding what it does. In practice, you should make sure to sync before and after making any changes in your working directory.
You probably want each of your repos to have up-to-date working directories, rather than just up-to-date synced/main branches; that way, they're still useful even if you stop using git-annex. If you want each of your repos to have up-to-date working directories, then you need to go into each of the repos and run `git annex sync` from there. If you moved a lot of files around, this can take a long time (potentially hours), so it is best to do this semi-regularly. If I've organized a lot of bird photos, I'll usually invoke a script to run the sync on my NAS overnight.
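A script for this can be a simple loop; the sketch below is a hypothetical version of mine (the repo paths are made up, and `--content` additionally transfers annexed file contents rather than just the git branches):

```shell
#!/bin/sh
# Sync every listed git-annex repo in turn. Paths are hypothetical;
# replace them with your own repos.
sync_all() {
  for repo in "$@"; do
    echo "syncing $repo"
    # --content also transfers annexed contents, not just branches.
    git -C "$repo" annex sync --verbose --debug --content \
      || echo "sync failed for $repo" >&2
  done
}

sync_all /share/AaronDataFolder/photos /share/AaronDataFolder/videos
```

Run overnight (for example from cron), this keeps each working directory current without tying up your workstation.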
There is a good guide in the official docs.
Prerequisite: Just don't use exFAT for any of your filesystems, since it supports neither hardlinks nor symlinks. Any other common filesystem is probably fine, including APFS, HFS+, ext4, ext3, Btrfs, and NTFS.
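If you're not sure what filesystem a directory lives on, you can check before initializing a repo there. A sketch using GNU `df` (macOS lacks `df -T` in this form; there, `diskutil info /` reports the filesystem instead):

```shell
# Print the filesystem type of the current directory (GNU df).
# If this prints "exfat", pick a different location for the repo.
df -PT . | awk 'NR==2 {print $2}'
```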
Prerequisite (optional): You may want to set `git config --global init.defaultBranch main`.
When you get to #8 removing files, be sure to learn the difference between `git annex drop <filename>` and `rm <filename>` (or `git rm <filename>`; I rarely bother with `git rm`, since the result is the same as `rm` or `mv <filename> <out_of_repo>` after commit). If you actually need the data to be gone from the repo and unrecoverable, you may need to use `git annex unused` and then `git annex dropunused` after `rm`.
You can probably skip #9 modifying annexed files if you're using `git config annex.addunlocked true` and `git config annex.thin true`.
I haven't felt the need to use #15 using tags and branches, nor #19 git annex numcopies, nor #20 automatically managing content.
For #17 fsck: verifying your data, note that `git annex fsck` takes an optional `--fast` flag, which will skip slower operations.
* When running `git annex sync`, it's a good habit to always provide the flags `--verbose --debug`. Although it is a bit noisy, you may be grateful to have the logs if the command ends up taking a long time or appearing to get stuck.
* If `git annex sync --verbose --debug` appears to be stuck, and if you feel the need to interrupt it, then `git status` or `git annex status` may show a lot of modified files in your working directory afterward. `git annex restage --debug` might fix this up, but if it exits without appearing to do anything, then it might be due to the `.git/index.lock` file. If that file exists and `ps aux | grep git` shows no git processes running, then you can delete the `.git/index.lock` file and re-run `git annex restage --debug`. In a similar situation, I felt the need to run `git annex fsck --fast`, which fixed things up, although I don't remember how I got into that situation.
* When running `git annex add`, I usually provide the `-Jcpus` flag to parallelize the checksumming; however, this occasionally crashes, in which case I simply retry. Even a `-J2` flag will usually result in a speedup.
* Run `git annex sync` on a repo before moving/modifying files in that repo. (And run `git remote -v` to make sure all the upstream repos are accessible remotes to the repo you are modifying.) Only consider skipping `git annex sync` if you know no other repos have committed changes since you last synced. If you do accidentally get into a situation where you need to resolve merge conflicts, it might be easiest to resolve those conflicts by using `git reflog` to discard non-essential commits.

I'm using an entry-level network-attached storage device from QNAP. I'm storing data at `/share/AaronDataFolder/`, I can ssh in with `ssh nas`, and I can ssh in with sudo privileges by running `ssh nasadmin`.
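The `nas` and `nasadmin` shorthands come from entries in `~/.ssh/config`; something like the following (the hostname, IP address, and user names here are assumptions to illustrate the pattern, not my actual values):

```
Host nas
    HostName 192.168.1.50
    User aaron

Host nasadmin
    HostName 192.168.1.50
    User admin
```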
First, we need to install git-annex on the NAS. For QNAP devices, the easiest way is to download the appropriate tarball (arm64 in this example), verify the hash if you like, and then:

```shell
scp git-annex-standalone-arm64.tar.gz nas:/share/AaronDataFolder/
ssh nasadmin
cd /share/AaronDataFolder/
tar -zxvf git-annex-standalone-arm64.tar.gz
```
That created `/share/AaronDataFolder/git-annex.linux/`. Now we have to set up some symlinks:

```shell
sudo ln -s /share/AaronDataFolder/git-annex.linux/git /usr/bin/git
sudo ln -s /share/AaronDataFolder/git-annex.linux/git-annex /usr/bin/git-annex
sudo ln -s /share/AaronDataFolder/git-annex.linux/git-receive-pack /usr/bin/git-receive-pack
sudo ln -s /share/AaronDataFolder/git-annex.linux/git-upload-pack /usr/bin/git-upload-pack
sudo ln -s /share/AaronDataFolder/git-annex.linux/git-annex-shell /usr/bin/git-annex-shell
sudo ln -s /share/AaronDataFolder/git-annex.linux/git-shell /usr/bin/git-shell
```
Unfortunately, the QNAP NAS appears to clear those symlinks on restart. It doesn't restart often, so I just have a script that I use to quickly re-create the symlinks if I ever get failures like `git-annex: command not found`. I keep the script in a ZSH shell function:

```shell
function redoNasSymlinks() {
  ssh nasadmin "sudo ln -s /share/AaronDataFolder/git-annex.linux/git /usr/bin/git && sudo ln -s /share/AaronDataFolder/git-annex.linux/git-annex /usr/bin/git-annex && sudo ln -s /share/AaronDataFolder/git-annex.linux/git-receive-pack /usr/bin/git-receive-pack && sudo ln -s /share/AaronDataFolder/git-annex.linux/git-upload-pack /usr/bin/git-upload-pack && sudo ln -s /share/AaronDataFolder/git-annex.linux/git-annex-shell /usr/bin/git-annex-shell && sudo ln -s /share/AaronDataFolder/git-annex.linux/git-shell /usr/bin/git-shell"
}
```
git-annex commands:

* `git annex add`
* `git annex sync --verbose --debug`
* `git annex whereis`
* `git annex copy --to` and `git annex copy --from`
* `git annex drop`
* `git annex unused` and `git annex dropunused` (not to be confused with `git annex drop`!)
* `git annex info --fast --in here .`

git commands:

* `git status` (and sometimes `git annex status` and `git annex restage` if the `git status` output looks unexpected)
* `git commit -am "message"`
TODO