Backups and Removing Duplicates

I regularly run backups of my data. I have specific Raspberry Pis connected to external USB drives and my house network, and they back each other up. I have two drives that are each 1 terabyte and one that is 5 terabytes. I’ve got the 5 terabyte drive backing up each of the other two drives. That way all of my data is on at least two separate drives, besides being on whichever device created or downloaded it. I’ve also put many of my important documents, especially things like family photographs, somewhere online as an extra backup. I really don’t like the idea of losing any of it! 🙂

I followed an article I found on Wikipedia on how to set up incremental backups that would scale very nicely, though I had to modify it slightly to match how I’m doing things. The very first thing I need to do every time is run screen so that the job will keep running in the background even if I disconnect. Since I only log into these machines via SSH, and the backups can be very long running (multiple days for the first run), I don’t want to have to leave my laptop on just to hold a connection open.
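
When I do kick one off by hand, it looks something like this (just a sketch; backup.sh stands in for wherever the script below is saved, and the session name is whatever you like):

screen -S backup        # start a named screen session
./backup.sh             # run the backup inside it
# detach with Ctrl-a d and log out; later, reattach to check progress:
screen -r backup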

#!/bin/sh

# Date stamp used to name today's snapshot folder
date=$(date "+%Y-%m-%d")

# Backup 1: local drive to local drive, hard-linking unchanged files
# against the previous snapshot pointed to by the "current" symlink
sudo rsync -aPH --info=progress2 --no-inc-recursive \
--link-dest=/*destination1*/current/ /*source*/ \
/*destination1*/back-$date/
rm -f /*destination1*/current
ln -s /*destination1*/back-$date /*destination1*/current

# Backup 2: mount the Samba share from the media center Pi, back it up
# the same way, then unmount it
sudo mount -t cifs //*host/folder* /mnt/source-temp -o username=*username*,password=*password*
sudo rsync -aPH --info=progress2 --no-inc-recursive \
--link-dest=/*destination2*/current/ /mnt/source-temp/ \
/*destination2*/back-$date/
rm -f /*destination2*/current
ln -s /*destination2*/back-$date /*destination2*/current
sudo umount /mnt/source-temp

The code above tells rsync to do an incremental backup. With those switches it does a recursive backup from the source to the destination, uses the destination’s current folder as a source of hard links for files that already exist, and keeps partial files if it gets interrupted. If I have to run it again it will just pick up where it left off. (Though it does re-scan the folders.) Because of the hard links it does not use up any extra space for files that have not changed, and yet each dated folder is self-contained. If I’m sure one of the older folders is no longer needed, I can get rid of it without having any effect at all on the backup folders from any other day.
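
An easy way to convince yourself the hard linking is working is to compare inode numbers for the same unchanged file across two snapshot folders, or to look at the actual disk usage. This is just a quick sanity check I’m sketching here, not part of the backup script; the dates and somefile are placeholders:

# the same inode number means the two snapshots share one copy on disk
stat -c '%i %h %n' /*destination1*/back-2024-01-01/somefile \
/*destination1*/back-2024-01-08/somefile

# du counts each hard-linked file only once, so the total comes out far
# smaller than the sum of the individual snapshot folders
du -sh /*destination1*/back-*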

I have two different styles here for two different use cases. The first is where both drives are connected to the same Raspberry Pi. This is an older first-generation model, so it’s slower than the newer ones, and it only has two USB ports. That’s no problem for something being used just as a network storage drive, though, and it works great for this. The second is where I mount a Samba folder that lives on my media center Raspberry Pi. That one has all my media on its external drive for easy access, but I also want to back it up, so I use the mount command to simplify the connection. It might not be the most optimized way of doing it, but so far it has worked well, and when it runs at 1 or 2 am there’s no one else using it anyway, so a little inefficiency is not a worry.

I have all of this automated via a crontab entry so that it runs on its own and I don’t have to think about it. One of these days I will code up a little script that removes old folders, keeping something like one per week for a few months, and then one per month indefinitely, or something like that. I don’t have enough months of backups yet to worry about that, and so far it hasn’t been too bad to just rm -rf the old backup folders individually, but it does take a long time. I think removing all those hard links takes some processing, and there are many tens to hundreds of thousands of files for it to work through every time I do that.
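
For reference, the crontab entry is nothing fancy; something along these lines (the path and log file are just examples, not my exact setup, and since the script calls sudo it has to run from a crontab that is allowed to do that, such as root’s):

# run the backup script at 1 am every night, logging output for later review
0 1 * * * /home/pi/backup.sh >> /home/pi/backup.log 2>&1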

One VERY nice program that I have found on Linux is rdfind to remove duplicates. I set it so that it makes hard links to duplicate files instead of removing them completely. (The default setting is to keep the first one found and discard the rest. The search order is very configurable, so you can finely control which one counts as found first.) My preference is instead to keep the files in their separate folders, but make them hard links so that each file can still be found where it’s expected. Especially with both my wife and me creating files and storing pictures and home movies in ways that make sense to us, many duplicates have at times accumulated across the different devices. Using this program I can first just copy everything straight to the backup drive from every device, and then run it to remove all the duplicates. (Well, I first went through by hand and organized what I could. With my wife’s backup folders I didn’t want to change them too much, for fear of making anything hard to find.)

rdfind goes through and checks the file size of every file in the folder(s) to be compared. For any files that have the same size it then compares the first X bytes. (I don’t know how many.) If the first bytes match, it compares some number of bytes from the end of the file. If those also match, it finally compares the entire files, byte by byte. If everything matches it picks one, and the other(s) get replaced by a hard link to the first. In this way I can have total confidence that the files are in fact exact duplicates, with no worries about things going missing. It’s a wonderful thing for peace of mind, as well as a huge reduction in waste and duplication across my drive.


rdfind -makehardlinks true /*folder-to-be-checked*

I do not automate the rdfind run, since the rsync setup already creates hard links of its own. (And on a folder this size rdfind is also a very long process; it took a full day or two.) I’m sure on a more powerful computer it would be a night-and-day difference, but this is what I have handy, and it’s something that only needs to be run infrequently.
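
If you’re nervous about pointing it at a big folder for the first time, rdfind also has a dry-run mode that only writes out its results file without touching anything. The exact option names are from my reading of its man page, so double-check them on your version:

# preview what would be hard-linked, without changing anything
rdfind -dryrun true -makehardlinks true /*folder-to-be-checked*

# the findings are written to results.txt in the current directory
less results.txt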

I love being able to let the computer keep track of the filesystem for me, so that even if we create extra cruft, it can deal with it and keep it from actually being a problem. It also means that if we have a favorite image that needs a bit of editing, that edit shows up in every place the file is found, once they’re all hard-linked to the same base file. Of course, it also means we need to be aware of that if we’re counting on copies in two separate locations to serve as backups of each other. That should generally be safe, since I only rarely run rdfind, and on top of that the older versions are kept on the backup drive. Overall we’re very well covered. The only thing better would be a separate drive that I only connect to a machine once a week or so to update and that otherwise sits in a fireproof vault, or maybe one of those encrypted cloud services. At least for now this level of effort and expense matches my comfort level for risk.
