Aaron's website

git-annex for Photography

Introduction

git-annex could be useful to anyone looking to manage a large volume of photos. However, it has a steep learning curve, even if you are already a proficient git user (and git itself has a learning curve). This post covers my experience on that learning curve, with a focus on using git-annex for photography. It assumes familiarity with the command line and basic familiarity with git. The git-annex assistant might be useful if you want to synchronize folders without using the command line; however, I haven't tried it, and I'm not convinced that it would provide a good tradeoff between convenience and handling "gotchas."

Disclaimer: Use this advice at your own risk of data loss.

Prerequisite: Photos on the filesystem

To be useful, git-annex needs your photos to exist as individual files on a filesystem. By contrast, some image organization programs like Apple Photos, Lightroom, and Capture One will ingest photos into a catalogue managed by the program. We could put the catalogue itself in git-annex, but doing so would not provide many of the benefits that I appreciate.

I decided I wanted my photos on the filesystem even before I decided that I wanted to use git-annex. Photos on the filesystem:

Benefits of git-annex

Required concepts and gotchas

Much like git, you don't need to understand everything about git-annex to get what you want out of it. But you should know some things, maybe even before you dive into the quickstart:

Locked and unlocked files

Locked files are not my favorite feature. By default, git-annex adds files in a "locked" state. This does not use the "locked file" feature on macOS (which I also don't use), but it is similar, and includes making the file read-only. If you git annex unlock a file, you now have two copies of the file on disk: an "unlocked" copy in your working directory, which you are free to modify, and an original copy kept by git-annex. If you first set git config annex.thin true, then you will avoid using double the space, but the docs caution that with this configuration, "any modification made to a file will cause the old version of the file to be lost from the local repository."

At this point, we should consider whether we actually want to keep old versions of files. Some applications of git-annex will want to keep old file versions. I just don't think that normal photography is one of those applications. We should be using non-destructive editors, and I personally don't feel the need to apply version control to my edits (other than occasionally keeping multiple versions of the same image, which is a manual form of version control). I also don't feel the need to version-control changes that I make to file metadata (for example, when I use exiftool to add GPS metadata or fix time zones).

So, perhaps we should configure git config annex.thin true as well as git config annex.addunlocked true. Then, you might think that we would usually be working with unlocked files, and we wouldn't need to worry about them taking double the space, although we wouldn't get to keep old versions of modified files. Alas, that mental model would be too simple. For one thing, the docs said that the old version of the file would be lost from the local repository. Remote repositories will probably still have the old version, even after syncing. For another thing, the docs are not correct on macOS with APFS. We actually do still have the old version of the file, we can restore it, etc. The flip side is that if you modify the file, then you will have two copies in your local filesystem. I think this is because macOS with APFS uses reflinks, which enable a copy-on-write behavior.

Reflinks are great. It would also be great:

Recommendation: If it is not important for you to keep old versions of files, then set git config annex.thin true and git config annex.addunlocked true to make your life easier. Set these from within the repo's working directory, immediately after you create or clone a new repo. But you will probably still be keeping old versions in some cases. To remove these old versions (reclaiming storage space), you can periodically run git annex unused followed by git annex dropunused <range> on each of your repos.
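The recommendation above could be wrapped in two small shell functions. This is a minimal sketch, not a definitive setup: the function names configure_photo_repo and reclaim_unused are hypothetical, and both assume you run them from inside the repo's working directory. The default range of all tells dropunused to drop everything that git annex unused listed, so pass a narrower range if you want to be selective.

```shell
# Sketch only: run these from inside the repo's working directory.
# The function names are hypothetical; the commands are the ones named above.

# Apply the two recommended settings to the current repo.
configure_photo_repo() {
  git config annex.thin true &&
  git config annex.addunlocked true
}

# List unused content, then drop it from this repo.
# With no argument, the range defaults to "all"; pass e.g. 1-10 to be selective.
reclaim_unused() {
  local range="${1:-all}"
  git annex unused &&
  git annex dropunused "$range"
}
```

I would run configure_photo_repo once, right after git annex init or a clone, and reclaim_unused only after reading the list that git annex unused prints.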

The Deduplication Gotcha

This is a potentially confusing behavior, just something to be aware of.

git-annex will deduplicate identical files within a repo. Suppose you run the following:

cd your_repo
cp ~/Pictures/photo.jpg .
mkdir subdir
cp ~/Pictures/photo.jpg subdir/
git annex add photo.jpg subdir/
git commit -am "added photo twice"

Your git-annex repo will only have a single copy of photo.jpg. On macOS with APFS, I believe both photo.jpg and subdir/photo.jpg will be reflinks that share the same data blocks (at least if you're using annex.thin and unlocked files as I advised above). If you modify just photo.jpg, then you trigger a copy-on-write, and the files are no longer duplicates, at least on macOS with APFS; you may need to be careful with this pattern on other filesystems. So far, so good. But if you never modify either of the files, and you run git annex drop photo.jpg while they are still duplicates, then the content will be entirely dropped from your local repo. Neither photo.jpg nor subdir/photo.jpg will have content. If you then git annex get photo.jpg to get the content from a remote, the files at both paths will again have content.
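Before dropping a file, it can help to check whether another path shares its content. Here is a minimal sketch, assuming both paths are already annexed and committed: it compares the keys printed by git annex lookupkey, a plumbing command that outputs the key for a file and exits nonzero if the file has none. The helper name shares_annex_key is hypothetical.

```shell
# Sketch: succeed if two annexed paths point at the same git-annex key.
# shares_annex_key is a hypothetical helper name.
# Returns 0 if the keys match, 1 if they differ, 2 if a key lookup fails.
shares_annex_key() {
  local key_a key_b
  key_a="$(git annex lookupkey "$1")" || return 2
  key_b="$(git annex lookupkey "$2")" || return 2
  [ -n "$key_a" ] && [ "$key_a" = "$key_b" ]
}
```

For the example above, shares_annex_key photo.jpg subdir/photo.jpg should succeed while the two files are still duplicates, which is the situation where dropping one path empties both.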

git annex sync

This is a key command, and it's worth understanding what it does. In practice, you should make sure to sync before and after making any changes in your working directory.

You probably want each of your repos to have up-to-date working directories, rather than just up-to-date synced/main branches; that way, they're still useful even if you stop using git-annex. To get that, you need to go into each of the repos and run git annex sync from there. If you moved a lot of files around, this can take a long time (potentially hours), so it is best to do this semi-regularly. After organizing a lot of bird photos, I will usually invoke a script to run it on my NAS overnight.
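Going into each repo in turn is easy to script. This is a minimal sketch of that loop, not the script I mentioned: the name sync_all_repos is hypothetical, and the repo paths are whatever you pass in. It only covers repos reachable as local paths on the machine where it runs.

```shell
# Sketch: run `git annex sync` inside each repo so every working directory is updated.
# sync_all_repos is a hypothetical helper; pass the repo paths as arguments.
sync_all_repos() {
  local repo failed=0
  for repo in "$@"; do
    echo "== syncing $repo"
    # Use a subshell so the cd does not leak into the caller's shell.
    ( cd "$repo" && git annex sync ) || {
      echo "sync failed in $repo" >&2
      failed=1
    }
  done
  return "$failed"
}
```

The loop keeps going after a failure and returns nonzero at the end, so one unreachable repo does not stop the rest of an overnight run.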

Quickstart

There is a good guide in the official docs.

Prerequisite: Don't use exFAT for any of your filesystems, since it supports neither hard links nor symlinks. Any other common filesystem is probably fine, including APFS, HFS+, ext4, ext3, Btrfs, and NTFS.
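If you aren't sure what a drive is formatted with, you can probe it directly. This is a minimal sketch, assuming you have write access to the directory: it creates a scratch directory, tries to make one symlink and one hard link, reports the result, and cleans up. The name check_link_support is hypothetical.

```shell
# Sketch: probe whether the filesystem holding a directory supports symlinks and hard links.
# check_link_support is a hypothetical helper; it creates and removes a scratch directory.
check_link_support() {
  local dir="${1:-.}" scratch rc=0
  scratch="$(mktemp -d "$dir/linktest.XXXXXX")" || return 2
  : > "$scratch/target"
  if ln -s target "$scratch/symlink" 2>/dev/null; then
    echo "symlinks: yes"
  else
    echo "symlinks: no"
    rc=1
  fi
  if ln "$scratch/target" "$scratch/hardlink" 2>/dev/null; then
    echo "hard links: yes"
  else
    echo "hard links: no"
    rc=1
  fi
  rm -rf "$scratch"
  return "$rc"
}
```

Point it at the mount point of the drive in question, for example check_link_support /Volumes/SomeDrive, where the path is only an illustration.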

Prerequisite (optional): You may want to set git config --global init.defaultBranch main.

When you get to #8 removing files, be sure to learn the difference between git annex drop <filename> and rm <filename> (or git rm <filename>; I rarely bother with git rm, since the result is the same as rm or mv <filename> <out_of_repo> after commit). If you actually need the data to be gone from the repo and unrecoverable, you may need to use git annex unused and then git annex dropunused after rm.

You can probably skip #9 modifying annexed files if you're using git config annex.addunlocked true and git config annex.thin true.

I haven't felt the need to use #15 using tags and branches, nor #19 git annex numcopies, nor #20 automatically managing content.

For #17 fsck: verifying your data, note that git annex fsck takes an optional --fast flag, which will skip slower operations.

Tips

QNAP NAS Installation

I'm using an entry-level network-attached storage device from QNAP. I'm storing data at /share/AaronDataFolder/, I can ssh in with ssh nas, and I can ssh in with sudo privileges by running ssh nasadmin.

First, we need to install git-annex on the NAS. For QNAP devices, the easiest way is to download the appropriate tarball (arm64 in this example), verify the hash if you like, and then:

scp git-annex-standalone-arm64.tar.gz nas:/share/AaronDataFolder/
ssh nasadmin
cd /share/AaronDataFolder/
tar -zxvf git-annex-standalone-arm64.tar.gz

That created /share/AaronDataFolder/git-annex.linux/. Now we have to set up some symlinks:

sudo ln -s /share/AaronDataFolder/git-annex.linux/git /usr/bin/git
sudo ln -s /share/AaronDataFolder/git-annex.linux/git-annex /usr/bin/git-annex
sudo ln -s /share/AaronDataFolder/git-annex.linux/git-receive-pack /usr/bin/git-receive-pack
sudo ln -s /share/AaronDataFolder/git-annex.linux/git-upload-pack /usr/bin/git-upload-pack
sudo ln -s /share/AaronDataFolder/git-annex.linux/git-annex-shell /usr/bin/git-annex-shell
sudo ln -s /share/AaronDataFolder/git-annex.linux/git-shell /usr/bin/git-shell

Unfortunately, the QNAP NAS appears to clear those symlinks on restart. It doesn't restart often, so I just have a script that I use to quickly re-create the symlinks if I ever get failures like git-annex: command not found. I keep the script in a ZSH shell function:

function redoNasSymlinks() {
  # Re-create the git-annex symlinks in /usr/bin on the NAS.
  # Links that already exist are skipped, so re-running is safe.
  local src=/share/AaronDataFolder/git-annex.linux
  local bins="git git-annex git-receive-pack git-upload-pack git-annex-shell git-shell"
  ssh nasadmin "for b in $bins; do [ -L /usr/bin/\$b ] || sudo ln -s $src/\$b /usr/bin/\$b || exit 1; done"
}
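For the optional hash check mentioned at the start of this section, here is a minimal sketch. It assumes you have a SHA-256 for the tarball from a source you trust; where to get that hash is not covered here. The name verify_sha256 is hypothetical. It uses sha256sum where available and falls back to shasum -a 256, which is what macOS ships.

```shell
# Sketch: compare a file's SHA-256 against a hash you obtained from a source you trust.
# verify_sha256 is a hypothetical helper; you supply the expected hash yourself.
verify_sha256() {
  local file="$1" expected="$2" actual
  if command -v sha256sum >/dev/null 2>&1; then
    actual="$(sha256sum "$file" | cut -d ' ' -f 1)"      # GNU coreutils
  else
    actual="$(shasum -a 256 "$file" | cut -d ' ' -f 1)"  # macOS
  fi
  if [ -n "$actual" ] && [ "$actual" = "$expected" ]; then
    echo "OK: $file"
  else
    echo "MISMATCH: $file" >&2
    return 1
  fi
}
```

Run it on the machine where you downloaded the tarball, before the scp step, so a bad download never reaches the NAS.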

All the commands I use

git-annex commands:

git commands:

* git status (and sometimes git annex status and git annex restage if the git status output looks unexpected)
* git commit -am "message"

Helpful scripts

TODO