Using Duplicity backup with Amazon Glacier storage

Andrew Todd

2018-10-12T08:00:00Z

Update, 2020-07-05: Duplicity now supports Glacier and Glacier Deep Archive natively; that's a much better choice than using these scripts.

Duplicity's not a bad choice for making secure backups on Linux. It uses GnuPG to encrypt data and integrates well with typical Unix workflows. Best of all, it has support for many storage backends; the same tool can be used to back up to a USB stick or to Amazon S3.

I also use Tarsnap. Tarsnap has a smarter model than Duplicity for incremental backups that allows for deletion of old data. However, it's also tightly tied to the most reliable form of Amazon's S3 storage, which can make it relatively expensive.

Therefore, I use a hybrid model, where critical, extremely security-sensitive data is stored in Tarsnap, and the bulk of my personal data is backed up to a USB drive and cloud storage via Duplicity.

Even so, as I uploaded more Duplicity files into Amazon S3, I wanted to save more money. Duplicity doesn't have direct support for Amazon's super-cheap, super-slow Glacier service, but it's possible to ship objects in S3 buckets to Glacier without too much difficulty. Now, I spend less than a dollar a month on remote backup.

The rest of this article assumes familiarity with Duplicity and the S3 backend.

I'm posting my work here more as a proof-of-concept than a general solution; that's why I haven't placed it in Git. It's also lacking comments, but should be fairly intuitive to read. One obvious optimization would be to back up the cache itself to S3, so that cache reconstruction on a new machine would be much faster. There's no reason that this support couldn't be integrated into Duplicity itself, either.
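As a rough illustration of the cache-backup idea, the sketch below copies the script's local cache of already-tagged keys to S3 and fetches it again on a new machine. Everything here is an assumption rather than part of the script: the function names, the `cache_key` parameter, and the choice to tag the uploaded copy `glacierable=false` so the lifecycle policy leaves it in S3. The `client` argument is expected to be a `boto3.client('s3')` object.

```python
def upload_cache(client, bucket_id, cache_path, cache_key):
    """Copy the local cache file to S3 (hypothetical helper)."""
    with open(cache_path, 'rb') as f:
        client.put_object(
            Bucket=bucket_id,
            Key=cache_key,
            Body=f.read(),
            # Tag the copy so that it is never moved to Glacier.
            Tagging='glacierable=false',
        )


def download_cache(client, bucket_id, cache_path, cache_key):
    """Fetch the cache file from S3 onto a new machine (hypothetical helper)."""
    response = client.get_object(Bucket=bucket_id, Key=cache_key)
    with open(cache_path, 'wb') as f:
        f.write(response['Body'].read())
```

If the cache copy lives in the same bucket as the backup, both the tagging script and Duplicity will see it when they list the bucket, so a separate bucket may be the safer place for it.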

It's important to note that there's another, unimplemented side to this: recovery. Retrieving an object from Glacier storage can take hours. If I were to naively try to restore my backup right now, it wouldn't complete for months. Duplicity would try to serially retrieve each object, and AWS would repeatedly block while transparently restoring that object from Glacier to S3. Since a restore could require thousands of objects, this could take tens of thousands of hours. So, to make this complete, a restore script needs to be written that will instruct AWS to pull all of the objects out of Glacier at once, before running Duplicity. I haven't written that script yet.
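A minimal, untested sketch of what the first half of such a restore script might look like follows. It lists the bucket, picks out the objects whose storage class is GLACIER, and issues a restore request for each one up front. The function names and the `days` and `tier` defaults are my own assumptions; `client` is expected to be a `boto3.client('s3')` object.

```python
def glacier_keys(client, bucket_id):
    """Yield the key of every object in the bucket stored in Glacier."""
    paginator = client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket_id):
        # An empty bucket or page has no 'Contents' entry at all.
        for entry in page.get('Contents', []):
            if entry.get('StorageClass') == 'GLACIER':
                yield entry['Key']


def request_restores(client, bucket_id, days=7, tier='Bulk'):
    """Request a temporary S3 copy of every Glacier object.

    Returns the number of restore requests issued. The restored copies
    remain available for `days` days (an assumed default).
    """
    requested = 0
    for key in glacier_keys(client, bucket_id):
        client.restore_object(
            Bucket=bucket_id,
            Key=key,
            RestoreRequest={
                'Days': days,
                'GlacierJobParameters': {'Tier': tier},
            },
        )
        requested += 1
    return requested
```

The requests only start the restores; the objects become readable hours later, so Duplicity would still have to wait until they have all completed. The sketch does not handle the error S3 returns when a restore is already in progress for an object, which a real script would need to catch before being run a second time.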

You will need to install boto3 somehow on your system; I just use the Ubuntu package, but others may wish to use virtualenv, etc. By default, boto3 uses Amazon's usual credentials file, at ~/.aws/credentials.

There are still some manual steps required for each Duplicity backup -- this is only demo code, after all:

1. You'll need to have already set up your S3 bucket and created a successful Duplicity backup to it. Doing so is outside the scope of this article. As part of this, you should set up AWS credentials for a user that has permission to write the backup to the bucket, as well as to read and create tags on its objects.
2. Use the S3 console to manually add a lifecycle policy to your bucket that moves all objects tagged with key glacierable and value true to Glacier after 24 hours.
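For anyone who would rather not click through the console, the same policy can probably be set from boto3. This is a sketch under assumptions, not something I have run against a real bucket: the rule ID and function names are made up, and `client` is expected to be a `boto3.client('s3')` object.

```python
def glacier_lifecycle_config(days=1):
    """Build a lifecycle configuration that moves objects tagged
    glacierable=true to Glacier after `days` days (1 day = 24 hours)."""
    return {
        'Rules': [
            {
                'ID': 'glacierable-to-glacier',  # hypothetical rule name
                'Status': 'Enabled',
                # Apply the rule only to objects carrying this exact tag.
                'Filter': {'Tag': {'Key': 'glacierable', 'Value': 'true'}},
                'Transitions': [
                    {'Days': days, 'StorageClass': 'GLACIER'},
                ],
            },
        ],
    }


def apply_glacier_lifecycle(client, bucket_id, days=1):
    """Install the configuration on the bucket."""
    client.put_bucket_lifecycle_configuration(
        Bucket=bucket_id,
        LifecycleConfiguration=glacier_lifecycle_config(days),
    )
```

Be aware that this call replaces whatever lifecycle configuration the bucket already has, so any existing rules would need to be merged into the `Rules` list first.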

Finally, this is not an endorsement of Glacier's reliability or lack thereof. I have not restored backups nor tested their integrity. It is best to ensure that your most critical data is backed up using multiple methods, on multiple storage backends. If you have the resources, performing periodic integrity tests on your backups is also an excellent idea.

#!/usr/bin/env python3

import boto3
import pathlib
import sys
import time

start_time = time.time()

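# The bucket name is the only command-line argument.
if len(sys.argv) != 2:
    sys.exit('usage: ' + sys.argv[0] + ' BUCKET_NAME')

bucket_id = sys.argv[1]

# Local cache of keys already known to be tagged, one key per line, so that
# later runs can skip a get_object_tagging call for each of them.
cache_file = '/path/to/cache/directory/tagged_keys_' + bucket_id
pathlib.Path(cache_file).touch()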

glacierable_tag = {
    'Key': 'glacierable',
    'Value': 'true'
}

non_glacierable_tag = {
    'Key': 'glacierable',
    'Value': 'false'
}

total_obj_count = 0
cached_obj_count = 0
prev_tagged_obj_count = 0
non_glacierable_count = 0
glacierable_count = 0

with open(cache_file, encoding='utf_8', mode='rt', newline='\n') as f:
    tagged_keys = set(f.read().splitlines())

s3 = boto3.resource('s3')
client = boto3.client('s3')

bucket = s3.Bucket(bucket_id)

for obj in bucket.objects.all():
    total_obj_count += 1

    if obj.key in tagged_keys:
        cached_obj_count += 1
        continue

    tags = client.get_object_tagging(
        Bucket=bucket_id,
        Key=obj.key
    )

    if glacierable_tag in tags['TagSet'] or non_glacierable_tag in tags['TagSet']:
        prev_tagged_obj_count += 1
        tagged_keys.add(obj.key)
        continue

    # Manifest and signature files are tagged so that they stay in S3,
    # where Duplicity can still read them without a Glacier restore.
    if 'manifest' in obj.key or 'signatures' in obj.key:
        non_glacierable_count += 1

        client.put_object_tagging(
            Bucket=bucket_id,
            Key=obj.key,
            Tagging={'TagSet': [non_glacierable_tag]}
        )

        tagged_keys.add(obj.key)
        continue

    # Everything else is bulk backup data, which the bucket's lifecycle
    # policy will move to Glacier once it carries this tag.
    glacierable_count += 1

    client.put_object_tagging(
        Bucket=bucket_id,
        Key=obj.key,
        Tagging={'TagSet': [glacierable_tag]}
    )

    tagged_keys.add(obj.key)

sorted_writable_lines = sorted(['%s\n' % line for line in tagged_keys])
with open(cache_file, encoding='utf_8', mode='wt', newline='\n') as f:
    f.writelines(sorted_writable_lines)

end_time = time.time()

print('Execution time, fractional seconds: ' + str((end_time - start_time)))

print('total_obj_count: ' + str(total_obj_count))
print('cached_obj_count: ' + str(cached_obj_count))
print('prev_tagged_obj_count: ' + str(prev_tagged_obj_count))
print('non_glacierable_count: ' + str(non_glacierable_count))
print('glacierable_count: ' + str(glacierable_count))