Honing the craft.

The dup-composer dry run tryout

Last week, I managed to add some missing bits, chiefly docstrings, to the dup-composer prototype, but I didn't finish and share the post with you then, as we were moving out of our weekend house, which took two days of work. I am sharing it today, however. You can now do dry runs, which print the Duplicity commands to be executed with the given configuration. In this blog post I demonstrate how the tool can be configured, which features are covered, the caveats, and finally the next features I will be rolling out.

Features

The core functionality of this command line utility can be described in a single sentence: read the backup configuration from a file and execute the Duplicity backup tool based on that information. Until the command execution code and the functional tests that go with it are done, the utility only performs dry runs, meaning that the commands are simply printed to the console.

Check out the dup-composer repo to look at the source.

To configure the locations you want to back up, called sources, you have to create one or more backup groups. These groups share the general characteristics of the backup (backup storage provider, encryption, volume size, prefixes), and you add your locations to these groups so that they can share these properties.

When I was deciding on the core features, I concentrated on the ones that I want to use myself. The functionality is rather limited, but it will grow eventually if there is interest. The existing features will be polished further as well.

Supporting all the storage interfaces (providers) that I use in my environment was essential:

  • SCP for backing up clients on the local network.
  • Local filesystem to back up server data to external drives.
  • AWS S3+Glacier for offsite backups.

GPG signing and encryption of the backup files is a core feature of Duplicity, so it is supported by dup-composer as well, although with some limitations, as you will see in the caveats section. Proper volume sizing has proven to be useful, especially for S3 backups. Last, but not least, I also move my archives to Glacier automatically, and the files need to be specifically prefixed for the bucket rules to work, so file name prefixing has also been added to the core feature list.

Configuration

To use dup-composer, first of all, you need to set up the configuration file. The configuration should follow YAML 1.1 syntax, but to also understand the semantics, let me walk you through the elements and organization. Elements of different kinds (scalars, sequences and mappings) are collectively called nodes in the YAML spec, so let's use this terminology going forward. Let's start at the top, with the list of backup groups. I have marked the places where child nodes and scalar values will go with three dots (...):

backup_groups:
  my_first_backup_group:
    ...
  my_second_backup_group:
    ...

The parent node of the groups is called backup_groups, which is currently the root of the configuration structure, but further configuration nodes might be added on the top level in the future.

For each backup group, you have to have the following structure in place:

my_first_backup_group:
  encryption:
    ...
  backup_provider:
    ...    
  backup_file_prefixes:
    ...
  volume_size: ...
  sources:
    ...

The encryption node (mandatory) is the parent of the encryption related configuration, while the children of backup_provider (mandatory) specify all the provider related properties. backup_file_prefixes is optional and contains the child nodes for the archive, signature and manifest file prefix configuration. The volume_size (mandatory) node determines the size of the backup archive file chunks in MB.

Let's take a closer look at each of these nodes and how they can be configured. There are primarily two ways to set up the encryption node at the moment.

Encryption is turned off:

encryption:
  enabled: no

Encryption is turned on:

encryption:
  enabled: yes
  gpg_key: 123456789ABCDEFF
  gpg_passphrase: examplepassphrase123

If the enabled node is set to no, encryption is disabled and there is no need to configure the gpg_key and gpg_passphrase nodes. When encryption is enabled, however, they are mandatory.

The backup_provider configuration largely depends on the type of the provider, which is determined by the URL scheme:

backup_provider:
  url: file://

This configuration sets dup-composer up to save the backup files on the local filesystem. There is no need to specify a concrete path here, as that will be determined by the sources section of the configuration. The URL will just set the context for those paths.
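To make the composition concrete, here is a sketch of how a file:// provider resolves together with a source entry (the paths are just illustrative; sources are covered in detail further below):

```yaml
backup_provider:
  url: file://
sources:
  /var/www/html:
    backup_path: /root/backups/var/www/html
    restore_path: /root/restored/var/www/html
# The backup destination is the url joined with backup_path,
# i.e. file:///root/backups/var/www/html in this case.
```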

For a remote SCP backup, you need a slightly different configuration:

backup_provider:
  url: scp://myscpuser@host.example.com/
  password: examplepassword123

In this case, you need to specify the username of the remote SCP host in the first part of the SCP URL, just as you would when using Duplicity directly. Use the password node to specify the password.

Finally, you have to configure AWS S3 like this:

backup_provider:
  url: s3://s3.sa-east-1.amazonaws.com/my-backup-bucket
  aws_access_key: EXAMPLEACCESSKEY
  aws_secret_key: ExAmPlESeCrEtKeY

The S3 bucket URL is configured as the url node value, while aws_access_key and aws_secret_key need to contain the AWS generated keys for your bucket. As with the rest of the providers, the actual path (folder) within the bucket shouldn't be added to the URL.

The next feature comes in handy if you want to prefix the generated backup file names in a specific way. I use this to set up bucket rules in S3 that move my archive files to Glacier. Here is an example of the configuration:

backup_file_prefixes:
  manifest: manifest_
  archive: archive_
  signature: signature_

The prefixes can be set up individually for each file type generated at the backup location. Set these up as needed; you can leave the backup_file_prefixes node out altogether if you don't need this feature.

The volume_size node is rather simple: its value is a number that determines the archive chunk size in megabytes.
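For example, to have the backup split into 200 MB volumes:

```yaml
volume_size: 200
```

This maps to the --volsize 200 option in the generated Duplicity commands.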

Under the sources node in the configuration hierarchy, you can specify a list of locations (paths) you want to back up, where to back them up and where the restored data should go. You can set up multiple sources within a single group. Here is an example set of two sources configured:

sources:
  /var/www/html:
    backup_path: /root/backups/var/www/html
    restore_path: /root/restored/var/www/html
  /home/tommy:
    backup_path: /root/backups/home/tommy
    restore_path: /root/restored/home/tommy

The source child nodes /var/www/html and /home/tommy determine the directories you want to back up, and backup_path prescribes the location the backup files will be saved to. In practice, the value of backup_path is appended to the value of the provider url node discussed earlier; these two fragments together give the true backup location. restore_path is not used during the backup step, but specifying it is mandatory at the moment. I will remove this requirement very soon, as it doesn't make any sense until an actual restore has to happen.
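Putting all the pieces together, a complete group entry, assembled from the example fragments above (the key and credential values are the same placeholders used earlier), looks like this:

```yaml
backup_groups:
  my_s3_backups:
    encryption:
      enabled: yes
      gpg_key: 123456789ABCDEFF
      gpg_passphrase: examplepassphrase123
    backup_provider:
      url: s3://s3.sa-east-1.amazonaws.com/my-backup-bucket
      aws_access_key: EXAMPLEACCESSKEY
      aws_secret_key: ExAmPlESeCrEtKeY
    backup_file_prefixes:
      manifest: manifest_
      archive: archive_
      signature: signature_
    volume_size: 50
    sources:
      /home/shared:
        backup_path: /home/shared
        restore_path: /root/restored/home/shared
```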

Caveats

This is still a work in progress; I have just the seed of the implementation, which also means that there are quite a few caveats that I will get rid of eventually.

  • The path of the configuration file is hard coded as tests/fixtures/dupcomposer-config.yml relative to the Git repo root; you can change this file after cloning the GitHub repository with: git clone https://github.com/cruizer/dup-composer.git
  • Passwords and other sensitive credentials have to be stored in clear text for now.
  • Even though you want to run a backup, you still have to specify the restore_path.
  • Generic file prefixing, where you set the same prefix for all kinds of files, is not supported. You can work around this by providing the same prefix for all three file types.
  • I haven't tested, nor considered, paths, file and directory names with special characters, like whitespace or other characters that have a special meaning in the shell. This is high on my priority list, however.
  • You have to use the same key for GPG signing and encryption.

What does the output look like at the moment?

Here is a snippet of what the output of a backup looks like, using the example configuration in the tests/fixtures/ directory of the GitHub repository:

(dup-composer) cruizer@barbwire-laptop:~/Dev/python/dup-composer (master)$ ./dupcomp.py backup
Generating commands for group my_local_backups:

duplicity --no-encryption --volsize 200 /var/www/html file:///root/backups/var/www/html
duplicity --no-encryption --volsize 200 home/tommy file://backups/home/tommy


Generating commands for group my_s3_backups:

duplicity --encrypt-key xxxxxx --sign-key xxxxxx --volsize 50 --file-prefix-archive archive_ --file-prefix-manifest manifest_ --file-prefix-signature signature_ /home/shared s3://s3.sa-east-1.amazonaws.com/my-backup-bucket/home/shared
duplicity --encrypt-key xxxxxx --sign-key xxxxxx --volsize 50 --file-prefix-archive archive_ --file-prefix-manifest manifest_ --file-prefix-signature signature_ etc s3://s3.sa-east-1.amazonaws.com/my-backup-bucket/etc


Generating commands for group my_scp_backups:

duplicity --no-encryption --volsize 200 /home/katy scp://myscpuser@host.example.com//home/katy
duplicity --no-encryption --volsize 200 home/fun scp://myscpuser@host.example.com/home/fun

What's next?

My plan for this week is to fix any issues I find and resolve most of the items in the following list:

  • Handle a reasonably wide set of characters in file names. Research what that set would be, then implement it. Aim to have tests in place for 100% of the cases.
  • Implement the code that executes the required duplicity commands with functional testing in mind. Design and write the functional tests.
  • Implement the execution of specific backup groups and sources when running dup-composer.