Make your own Wayback Machine or Time Machine in GNU/Linux with rsnapshot

A good backup system can help you recover from a lot of different kinds of situations: a botched upgrade (requiring re-installation), a hard drive crash, or even thumb-fingered users deleting the wrong file. In practice, though I've experienced all of these, it's the last sort of problem that causes me the most pain. Sometimes you just wish you could go back a few days in time and grab that file. What you want is something like the Internet Archive's "Wayback Machine", but for your own system. Here's how to set one up using the rsnapshot package (included in the Debian and Ubuntu distributions).

What you can expect

In practice, you can expect to need a backup drive approximately twice the size of the system you are backing up. Once fully established, the backup system will not continue to grow, because old data will eventually be retired and deleted automatically.

With the configuration options in rsnapshot it will be possible to create just about any tiered backup arrangement you want. For my system, I keep daily snapshots for 30 days, weeklies for just over a year, and finally, monthly backups for ten years. Presumably, I will have set up a new system long before any of the monthlies get deleted.

Using a script, I have set up my system to work like the Wayback Machine in that each backup incorporates the date in its file path. So, I can browse the available dates and pick the one I think is most likely to have the data I want. Thus if today is August 20, 2009, and I've accidentally deleted my Mozilla bookmarks file (I've known SeaMonkey to destroy this file if it crashes badly), I can recover it by copying from yesterday's backup:

$ cp /backup/auto/date/2009-08-19/myclient/home/terry/.mozilla/Terry/8ufwbbkq.slt/bookmarks.html /home/terry/.mozilla/Terry/8ufwbbkq.slt/

$

If I want to see for what dates I have backups, I can simply list the date directory:

myclient:/backup/auto/date$ ls

2009-06-23  2009-08-04  2009-08-12  2009-08-20  2009-08-28

2009-06-29  2009-08-05  2009-08-13  2009-08-21  2009-08-29

2009-07-08  2009-08-06  2009-08-14  2009-08-22  2009-08-30

2009-07-17  2009-08-07  2009-08-15  2009-08-23  2009-08-31

2009-07-25  2009-08-08  2009-08-16  2009-08-24

2009-08-01  2009-08-09  2009-08-17  2009-08-25

2009-08-02  2009-08-10  2009-08-18  2009-08-26

2009-08-03  2009-08-11  2009-08-19  2009-08-27

Installation of rsnapshot

Installing rsnapshot couldn't be much easier in Debian or Ubuntu, where the package is part of the main package repository (and has been for several releases):

myserver:~# apt-get install rsync rsnapshot

If you're going to keep your backups on the same computer (from one disk drive to another, for example), then this is all you will need. For a local area network (LAN) installation, though, you'll most likely have a "backup server" on which the backup disk is mounted, and several "backup clients" whose data you want to back up onto the server.

For this you will need to set up remote login access for the backup user. In this article, we'll set up the backup user to be root. Although this may be a security risk on an exposed commercial network, it's good enough for a home or small-business LAN where you can trust the people with physical access to your computers. It also has the advantage of preserving correct user identities and permissions on the backed-up files.

Password-less SSH access

Probably the trickiest part of the install is correctly setting up password-less SSH logins on your systems. This is a potential security hole on an open or untrusted local area network, although on most home networks it should be no problem. In any case, it does have to be done right for rsnapshot to work.

We will set up two-way password-less SSH access for the root users on both client and server machines. For some configurations, this is not strictly necessary, but it is very useful for intermittently-connected machines, as I'll soon show.

Client configuration for SSH access

First of all, you should use your real host names. In my examples, I will use myclient for a client computer and myserver for my backup server. On each client computer and on the server, you will need to generate a public/private key pair, using OpenSSH:

$ su

myclient:~# cd /root

myclient:~# ssh-keygen

Generating public/private rsa key pair.

Enter file in which to save the key (/root/.ssh/id_rsa): .ssh/myclient

Enter passphrase (empty for no passphrase):

Enter same passphrase again:

Your identification has been saved in /root/.ssh/myclient.

Your public key has been saved in /root/.ssh/myclient.pub.

The key fingerprint is:

        . . .

Don't type anything for the passphrase! For password-less access, you will need an empty passphrase.

The ellipsis (...) stands in for the key fingerprint that ssh-keygen prints for confirmation.

If you look in the .ssh directory, you will now see your two keys:

myclient:/root/.ssh# ls

myclient myclient.pub  known_hosts
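One wrinkle worth noting: because the key was saved under a non-default file name (myclient rather than id_rsa), ssh will not offer it automatically. You can either pass it explicitly with ssh -i /root/.ssh/myclient, or, more conveniently for rsnapshot, point ssh at it with an entry in /root/.ssh/config. A minimal sketch, using the host names from this article:

Host myserver
    User root
    IdentityFile /root/.ssh/myclient

A corresponding entry on the server (pointing at its own myserver key, for Host myclient) covers connections in the other direction.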

Use scp ("secure copy" from OpenSSH) to transfer the myclient.pub file to the backup server (and also copy the server's myserver.pub file to the client machine). When you have all of the necessary keys, you will then create (or append to) a file called /root/.ssh/authorized_keys2 in order to make the machines accessible:

myserver:/root/.ssh# cat *.pub >> authorized_keys2

Of course, you will do the same thing on each client to set up two-way access:

myclient:/root/.ssh# cat myserver.pub >> authorized_keys2
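At this point it's worth verifying that the password-less login actually works in both directions; each of these should print the remote host name without prompting for a password (the very first connection to each host will also ask you to confirm its host key):

myclient:~# ssh root@myserver hostname
myserver

myserver:~# ssh root@myclient hostname
myclient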

Configuring rsnapshot on the server

The default configuration file for rsnapshot is, unsurprisingly, /etc/rsnapshot.conf. We'll configure this file to handle local backups for filesystems that are mounted on the server, plus one reliably-connected client (for example, you might use this for a separate file server, called myfileserver here). This is also appropriate for a stand-alone computer.

There are lots of comments in the /etc/rsnapshot.conf that comes with the standard install on Debian, but I'll omit those in the listings I show here. One thing to keep in mind while editing: rsnapshot requires the fields on each line to be separated by tabs, not spaces.

First, let me show the final result for my example setup:

config_version  1.2

snapshot_root   /backup/auto/

cmd_cp          /bin/cp

cmd_rm          /bin/rm

cmd_rsync       /usr/bin/rsync

cmd_ssh         /usr/bin/ssh

cmd_logger      /usr/bin/logger

cmd_du          /usr/bin/du

interval        daily   30

interval        weekly  52

interval        monthly 120

verbose         2

loglevel        3

logfile         /var/log/rsnapshot

lockfile        /var/run/rsnapshot.pid

one_fs          1

sync_first      1

backup          root@myfileserver:/project/     myfileserver/

backup          /home/                          myserver/

backup          /etc/                           myserver/

Now let's break that down. The first line (config_version) is basically boilerplate: it just tells rsnapshot what format to expect in this file (copy it exactly). The next line, snapshot_root, tells rsnapshot where the backup volume is mounted on the server. I have a large hard drive mounted as /backup, but I use parts of it for other kinds of backup jobs, so I let rsnapshot use the /backup/auto directory on it.

The next few lines (cmd_*) are essentially more boilerplate. They tell rsnapshot what sort of utilities it has available to do its work (without them, it can fall back on internal methods -- but on any GNU/Linux system, all of these are available).

The next lines (interval) are much more interesting:

interval        daily   30

interval        weekly  52

interval        monthly 120

This is where I decide what sort of backup schedule to keep. Many configurations are possible here, but in my case I've chosen to keep one backup per day for a month, weekly backups for a year before that, and monthly backups "indefinitely", here defined as ten years (plenty of time to make permanent archives of anything I'm going to need longer than that!).

By the way, the names "daily", "weekly", and "monthly" used here have no special meaning to rsnapshot; you can use any interval name you find useful. You'll give these names meaning yourself when you set up the cron (or anacron) jobs that call rsnapshot. You're perfectly entitled to set up a "fortnightly" or "every10days" job if you want to.

On the other hand, it is critical that the intervals are listed in order from most-frequent to least-frequent. This is because rsnapshot will actually make the longer-interval backups by simply rotating an appropriate shorter-interval backup (so for example, the last daily backup will become the first weekly backup when the time comes, and the last weekly backup will become the first monthly one). This may seem a little confusing at first, but it really does work out correctly.
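To make the rotation concrete: once the system has been running for a while, the snapshot root contains one directory per retained snapshot, plus the .sync staging directory and (as we'll see later) the date index. An abbreviated, purely illustrative listing might look like this:

myserver:/backup/auto# ls -A
.sync      daily.0    daily.1    daily.2   ...   daily.29
date       weekly.0   weekly.1   weekly.2  ...   weekly.51
monthly.0  monthly.1  ...

daily.0 is always the most recent snapshot; when the weekly job runs, the oldest daily is promoted to weekly.0, and so on up the chain, as described above.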

Following this are some more minor details: how much detail to include in my logs (verbose, loglevel), where to put them (logfile), and where to keep a lockfile to avoid collisions if I should (accidentally) attempt to run two instances of the program at once (lockfile).

The next line (one_fs) keeps each backup point on a single filesystem: rsync will not cross mount points, so a backup won't wander onto other mounted drives. This could be a problem if the data you care about is spread across several mount points (each would need its own backup line), but it is otherwise fairly harmless.

Somewhat more arcanely, the sync_first line tells rsnapshot to operate a little differently than it normally does. Instead of making each backup and rotating backups all in one operation, it will do them in two operations: first synchronizing its backup copy, and then rotating into the daily backup set. This is extremely useful for backing up computers which are not reliably available to your network, as we will see a little later.
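In practice, with sync_first enabled, a complete backup cycle becomes two separate invocations, which you can also run by hand for testing:

myserver:~# rsnapshot sync     # pull fresh data into /backup/auto/.sync
myserver:~# rsnapshot daily    # rotate: .sync is promoted to daily.0, daily.0 to daily.1, and so on

We'll come back to who issues which of these calls when setting up the cron and anacron jobs below.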

Finally, the backup lines tell what data should be backed up:

backup          root@myfileserver:/project/     myfileserver/

backup          /home/                          myserver/

backup          /etc/                           myserver/

In this case, I'm making backups of my /home directory (user data) and the /etc directory (configuration data) on my backup server (which also happens to be a workstation in this example). These are local backups, so no login information is required on the source column (the first argument).

For myfileserver, I have set up password-less SSH login (see the previous section), and the source column includes the login ID and hostname as well as the directory.

In all of the lines, I include a target column which is a relative path. This will be interpreted relative to the snapshot_root defined earlier. So for example, a user named joe on myserver will find yesterday's backup data beneath /backup/auto/daily.0/myserver/home/joe.

On myfileserver, I need to make sure that rsync is installed and that I have correctly set up password-less SSH login for the root user on myserver. Nothing else is required on the backup client (myfileserver).

The cron job

The final step is to set up a job in the "crontab" on the backup server (myserver) to run the rsnapshot program at the appropriate intervals to make the backups.

Cron tables are handled differently on different GNU/Linux distributions, so you may encounter a different approach on your system. On Debian-derived distributions, however, there is a directory, /etc/cron.d, for this information, which makes it easier to keep separate tables for different purposes. The Debian rsnapshot package creates a file here for rsnapshot, predictably named /etc/cron.d/rsnapshot.

It is in this file that you will define the meaning of your rsnapshot backup intervals. Omitting comments, here's my sample setup:

30  4   * * *           root    /usr/bin/rsnapshot daily

0   4   * * 1           root    /usr/bin/rsnapshot weekly

30  3   1 * *           root    /usr/bin/rsnapshot monthly

The syntax at the beginning of each line is unique to cron. For a more complete understanding, I'll just refer you to the man page (type man 5 crontab to find "crontab(5)" which documents this format). In this case, the lines say to run the program with the argument daily at 4:30 AM every day, to run with weekly at 4:00 AM each Monday morning, and to run with monthly at 3:30 AM on the first day of each calendar month.

The reason the times are spaced as they are is to avoid collisions between the different invocations (e.g. whenever Monday falls on the first of the month -- rsnapshot will be called three times in succession in that case, and you want it to have enough time to finish each rotation before beginning the next).
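One caveat here: with sync_first enabled, the rsnapshot documentation notes that calling rsnapshot with an interval name only rotates existing snapshots; the actual data transfer happens on the separate sync command. The intermittent clients described below issue their own sync calls, but if you also have backup points in the main /etc/rsnapshot.conf (as I do), the daily entry should run a sync first, along these lines:

30  4   * * *           root    /usr/bin/rsnapshot sync && /usr/bin/rsnapshot daily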

Dealing with intermittent clients

Now, I mainly use rsnapshot to back up personal workstations, not servers. These machines are not necessarily always turned on or always connected to my LAN (a couple of them are laptops, for example). If rsnapshot attempts to run when one of them is down, it will simply fail partway through the backup process and no further backups will be made on that run. This is ungraceful and results in missed daily backups (not just for the machine that is off, but for any others that follow it in /etc/rsnapshot.conf).

So for these machines, I use the two-way SSH password-less login that I have set up, in combination with the anacron package, to create a call-back system. Instead of having the backup server try to back up these machines on its own schedule, these machines run a script that allows them to ask for a backup whenever they are ready. The resulting backups get collected along with the others into the daily/weekly/monthly backup rotation.

Setting up the server for intermittent clients

For each intermittent client, we'll use a separate /etc/rsnapshot_myclient.conf configuration file. These are essentially just copies of the main configuration file, with a few minor changes. Here's an example for a host named myclient:

config_version  1.2

snapshot_root   /backup/auto/

cmd_cp          /bin/cp

cmd_rm          /bin/rm

cmd_rsync       /usr/bin/rsync

cmd_ssh         /usr/bin/ssh

cmd_logger      /usr/bin/logger

cmd_du          /usr/bin/du

interval        daily   30

verbose         2

loglevel        3

logfile         /var/log/rsnapshot

lockfile        /var/run/rsnapshot.pid

one_fs          1

sync_first      1

backup          root@myclient:/home/     myclient/

backup          root@myclient:/etc/      myclient/

Note that for the intermittent case, the "sync_first" option is mandatory. Since the backup synchronization will be asynchronous to the rotation process, they need to happen as separate steps. This file obviously only deals with the backups from this one client. Any other clients will have their own files.
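Before wiring this into anacron, it's worth checking the new file by hand from the server. The configtest argument makes rsnapshot verify the configuration syntax (it should report "Syntax OK"), and a manual sync confirms that the SSH access to the client actually works:

myserver:~# rsnapshot -c /etc/rsnapshot_myclient.conf configtest
myserver:~# rsnapshot -c /etc/rsnapshot_myclient.conf sync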

Setting up the anacron job on intermittent clients

The anacron package was made with intermittently operating computers in mind. It's the usual choice for laptop or desktop computers that aren't always running. Instead of being invoked at rigid, regular intervals, a job is run according to a set of constraints: so many minutes after the computer has been turned on, provided that the specified period has passed since the job last ran. Again, if you want to know more about how anacron works, I'll just refer you to its man page.

The invocation of the backup is run from the client computer, using its anacron table. A command will be executed which in turn makes a remote call to the backup server and runs rsnapshot there. After that, the job runs essentially as it does for normal server-initiated backups.

To do this, we simply add the following line to /etc/anacrontab:

1       60      backup  ssh root@myserver /usr/bin/rsnapshot -c /etc/rsnapshot_myclient.conf sync

This command will be executed no more than once per day, no sooner than one hour after the computer is turned on. The command is called backup, and the actual command to run is a secure-shell connection to the server, which remotely executes rsnapshot, with a switch telling it to use the correct configuration file for this client, and the argument sync.

The sync argument for rsnapshot does the data synchronization step only. It does not initiate backup rotations. This is because we simply want this script to get the client's data into the correct directory (which, by the way, is /backup/auto/.sync in this example) along with the other daily backups.

The rsnapshot daily command (which will actually move this backup data into the correct backup directory, /backup/auto/daily.0) will be run by the server's normal server-driven cron job (the client requires no special treatment for this to work, except that sync_first must be set to 1 for all of the rsnapshot configurations used).

Some caveats about backups for intermittent clients

It's worth pointing out a couple of details about how the intermittent system will work in practice.

First of all, note that no backups will get made if the machines are not left on longer than the specified limit in /etc/anacrontab (60 minutes in the example above). This is handy, because it means that if you just turn on your laptop to check something quickly, it won't spontaneously get tied up with the backup right away. On the other hand, it means that if you want backups to happen, you need to leave your system on now and then so that the scripts will have a chance to run.

Second, unless the system is left on for backup every day, you won't really get daily backups. This is generally harmless, since if it's not on, the data is not being changed, so you aren't missing anything.

Nevertheless, daily snapshots will still appear in the backup for days when the machine was off or disconnected; they will simply be (hardlinked) copies of the last real daily backup. So the datestamps on directories in the backup may be misleading. If you look closely, however, you'll see that the modification dates on the files themselves are correct.

Creating a date index

There's really only one thing missing from this setup to make it truly usable, and that is a better index. For me at least, trying to figure out which backup to check for a missing file would be a lot easier if I could look things up by date. Trying to do the math and figure out whether that was daily.29 or weekly.2 is just way too much trouble.

The rsnapshot package unfortunately does not provide for a date-based index. However, it's not that hard to make one, and I wrote my own Python script to do it. To use the following script, save it as /usr/local/bin/datestamp_backups.py on the myserver machine:

#!/usr/bin/env python
# Copyright (C) 2009 by Terry Hancock
#---------------------------------------------------------------------------------------
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
#---------------------------------------------------------------------------------------

#-- Datestamp Index for Backups --

from datetime import date
import os, glob

# Adjust these two settings to match your rsnapshot configuration:
RSNAP_ROOT = '/backup/auto'
RSNAP_INTS = 'daily', 'weekly', 'monthly'

date_dir = os.path.join(RSNAP_ROOT, 'date')

# Get today's date as an ISO-format stamp (e.g. "2009-08-31")
stamp = date.today().isoformat()

# Record the date in the top level of the current backup (daily.0).
# The old DATE file is removed first: it is hardlinked into the older
# snapshots, so rewriting it in place would change their dates as well.
datefile = os.path.join(RSNAP_ROOT, 'daily.0', 'DATE')
if os.path.exists(datefile):
    os.remove(datefile)
open(datefile, 'wt').write(stamp + '\n')

# Delete the old index entries and start over
os.system('rm -rf %s' % date_dir)
os.mkdir(date_dir)

# Read the dates recorded in all backups and write out symlinks for them
# as the new index.  If two snapshots carry the same date, a letter is
# appended to keep the index names unique.
append = 'bcdefghijklmnopqrstuvwxyz'
for interval in RSNAP_INTS:
    for directory in glob.glob(os.path.join(RSNAP_ROOT, '%s.*' % interval)):
        datefile = os.path.join(directory, 'DATE')
        if not os.path.exists(datefile):
            # Snapshot made before this script was installed -- skip it
            continue
        datestamp = open(datefile, 'rt').read().strip()
        target = os.path.join(date_dir, datestamp)
        i = 0
        while os.path.exists(target):
            target = os.path.join(date_dir, datestamp + append[i])
            i += 1
        os.symlink(directory, target)

I have (somewhat lazily, I admit) simply defined the snapshot root and the interval names as constants in this script. You'll have to change these two definitions (RSNAP_ROOT and RSNAP_INTS) to suit your system.
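Remember to make the script executable, since cron will call it directly by its path:

myserver:~# chmod +x /usr/local/bin/datestamp_backups.py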

You will also need to add a line to the /etc/cron.d/rsnapshot file on myserver to make use of this script, resulting in this configuration:

30  4   * * *           root    /usr/bin/rsnapshot daily

0   4   * * 1           root    /usr/bin/rsnapshot weekly

30  3   1 * *           root    /usr/bin/rsnapshot monthly

0   5   * * *           root    /usr/local/bin/datestamp_backups.py

This causes the program to be run daily, 30 minutes after the backups are rotated. It simply builds a directory of symlinks, with date-based names, which point to the appropriate backups. Listing the resulting directory shows which backups are available:

myserver:/backup/auto/date# ls

2009-06-23  2009-08-04  2009-08-12  2009-08-20  2009-08-28

2009-06-29  2009-08-05  2009-08-13  2009-08-21  2009-08-29

2009-07-08  2009-08-06  2009-08-14  2009-08-22  2009-08-30

2009-07-17  2009-08-07  2009-08-15  2009-08-23  2009-08-31

2009-07-25  2009-08-08  2009-08-16  2009-08-24

2009-08-01  2009-08-09  2009-08-17  2009-08-25

2009-08-02  2009-08-10  2009-08-18  2009-08-26

2009-08-03  2009-08-11  2009-08-19  2009-08-27

An abbreviated long listing shows more explicitly how the symlinks relate to the actual directories created by rsnapshot.

myserver:/backup/auto/date# ls -l

total 0

lrwxrwxrwx 1 root root 21 2009-08-31 05:00 2009-06-23 -> /backup/auto/weekly.5

lrwxrwxrwx 1 root root 21 2009-08-31 05:00 2009-06-29 -> /backup/auto/weekly.4



   . . .



lrwxrwxrwx 1 root root 20 2009-08-31 05:00 2009-08-28 -> /backup/auto/daily.3

lrwxrwxrwx 1 root root 20 2009-08-31 05:00 2009-08-29 -> /backup/auto/daily.2

lrwxrwxrwx 1 root root 20 2009-08-31 05:00 2009-08-30 -> /backup/auto/daily.1

lrwxrwxrwx 1 root root 20 2009-08-31 05:00 2009-08-31 -> /backup/auto/daily.0

Note that this script doesn't interfere with the function of rsnapshot at all. It simply creates a new directory of symbolic links which refer to the directories rsnapshot makes.

Making the backups available for recovery

In order for users to make use of the backup volume, it's a good idea to export it as a read-only NFS filesystem. It's important to limit the ability to write to the backups because, due to the nature of hardlinks, it would be very easy to create unexpected problems (editing one version of a file which appears in many backup sets would edit them all!).
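Setting up the NFS server itself is outside the scope of this article, but as a minimal sketch: assuming the nfs-kernel-server package is installed and your LAN uses the 192.168.1.0/24 address range (adjust to match your own network), a read-only export of the backup volume is a single line in /etc/exports on myserver:

/backup    192.168.1.0/24(ro,no_subtree_check)

followed by running exportfs -ra on the server to activate the export. Mounting myserver:/backup at /backup on each client keeps the paths consistent with the examples shown earlier.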

I simply made /backup available in this way. So anyone on the LAN who wants to recover a missing or corrupted file can simply visit /backup/auto/date/ and look for an appropriate date to find a saved copy of the file, just like the "Wayback Machine" does for the internet.

How the magic works

This technique is originally due to Mike Rubel, who suggested using the rsync mirroring tool and hardlinks to create incremental backups with a minimum of space used. The magic here is that only files which have changed are stored in each backup, and yet from the user's point of view, it's exactly like browsing complete, independent filesystems for each backup.

The rsync program was designed to save bandwidth when making a mirror of a remote site. As such, it tries to complete the copy while transferring the minimum amount of data. The client first asks the server which files have changed (by default, file sizes and modification times are compared; checksums can be used instead). Only the files that have changed are copied from the source to the mirror target.

That's the first part of the trick.

The second part is the use of "hardlinks". When you initially create a file on a Unix or Linux filesystem, you are storing data, but also a directory entry which links to that data. There is, however, no fundamental reason why there can't be more than one such link, possibly appearing in multiple directories.

Each of these links has equal claim to the file data. The filesystem software will keep the data as long as at least one link still points to it (when the last link is removed, the system will "garbage collect" the data blocks on disk, adding them to the available free space).
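If you've never experimented with hardlinks, a quick test in a scratch directory makes this concrete. The -i option to ls shows the inode number, which identifies the underlying data (the inode numbers and dates below are just illustrative):

$ echo "hello" > original
$ ln original copy
$ ls -li
total 8
524291 -rw-r--r-- 2 terry terry 6 2009-08-31 10:00 copy
524291 -rw-r--r-- 2 terry terry 6 2009-08-31 10:00 original
$ rm original
$ cat copy
hello

Both names share one inode and the link count is 2; after removing one of them, the data is still there under the other name.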

What rsnapshot does on each backup is first to create an entire directory tree consisting of such links to the previous backup's data. At the beginning of the backup, there are therefore two directory structures pointing to the same data.

Then it runs rsync on the copied directory. Every time a file is encountered which has changed on the source filesystem (the directory you are backing up), a new copy of that file is created in the new backup, replacing the link to the old data (which remains, still referenced by the previous snapshot). Similarly, any new file appearing on the source system will be created in the new backup. However, any unchanged files will be untouched (because of the lazy way that rsync conducts its copy) and, as a result, will still be hardlinked to the original data.

In the end, you wind up with two directory systems which appear to be independent copies of the source at different times. In fact, however, you are only using the space of a single copy, plus the space taken up by the files that have changed. In practice, this is usually only a few percent more than the size of the single copy.
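You can check this on your own backup volume with du: when given several directories in a single invocation, GNU du counts hard-linked data only once (against the first directory that mentions it), so later snapshots show only what actually changed. The sizes below are made up for illustration:

myserver:~# du -sh /backup/auto/daily.1 /backup/auto/daily.0
12G     /backup/auto/daily.1
140M    /backup/auto/daily.0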

Of course, all of this would break down if you needed to write to the system (you'd appear to change both files at once, because actually they are the same file). However, when accessing a backup system, we aren't interested in writing, just reading. And for that, the rsnapshot approach is perfect!

Your own Wayback Machine

This is a capability I've been wanting to set up ever since I read Mike Rubel's original paper on the idea. The rsnapshot package took care of everything except the date index, and that wasn't that hard to script.

I've been running this system for about six weeks now, without a hitch, so I think it's pretty solid. It should work on just about any GNU/Linux system, although I've only provided instructions for Debian-derived systems (including Ubuntu, for example).

I hope it improves your peace of mind as much as it did mine!

Licensing Notice

This work may be distributed under the terms of the Creative Commons Attribution-ShareAlike License, version 3.0, with attribution to "Terry Hancock, first published in Free Software Magazine".

License

Verbatim copying and distribution of this entire article are permitted worldwide, without royalty, in any medium, provided this notice is preserved.