amXor: Arbor Redundancy Manager

** UPDATE! This code can create false hard links between files that are very similar. I suddenly have many iTunes album covers that are from the wrong artist. I'm not sure if this is a result of how cavalier Apple is about messing with low-level UNIX conventions, or if my code is faulty. Beware! **

We all have redundant data on our computers, especially if we have multiple computers. My new tool is aimed at making backups of this kind of data simple.

I'll take the simplest practical example: You have two computers and one backup hard drive. You want full backup images of both of these systems. Even if you're just backing up the 'Documents' folder, there is a good chance that there are a lot of redundant files. My little program checks the contents of these files, and if two are the same, it creates a hard link between them.

So, if your backup folder looks like this:

Backup
Backup > System 1 > Documents > somefile.txt
Backup > System 2 > Documents > somefile_renamed.txt

It will look exactly the same afterwards, but if the contents are identical, there will be only one copy of the file on disk.
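To make "one copy under the hood" concrete, here is a minimal sketch of what a hard link looks like from Python. It is not part of arbor.py; the function name demo_hard_link and the temporary folder are just for illustration, and it assumes a filesystem that supports hard links.

```python
import os
import tempfile

def demo_hard_link():
    """Create one file, hard-link a second name to it, and report what the OS sees."""
    folder = tempfile.mkdtemp()
    first = os.path.join(folder, "somefile.txt")
    second = os.path.join(folder, "somefile_renamed.txt")

    with open(first, "w") as f:
        f.write("same contents\n")

    # Add a second directory entry that points at the same data on disk
    os.link(first, second)

    info_first = os.stat(first)
    info_second = os.stat(second)
    return {
        "same_inode": info_first.st_ino == info_second.st_ino,
        "link_count": info_first.st_nlink,
        "same_file": os.path.samefile(first, second),
    }

if __name__ == "__main__":
    print(demo_hard_link())
```

Two things follow from this: hard links only work within a single filesystem, so both backups have to live on the same drive, and editing the file through one name changes what you see through the other.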

I have dozens of backup CDs that contain a lot of the same information. Now I don't need to sort through them and reorganize or delete duplicate files. I can leave them just as they are, and any duplicate files will be linked under the hood. Here is the current state of the code, discussion to follow:

arbor.py - v. 0.02

#!/usr/bin/env python
import os, sys, hashlib

# Expect exactly one argument: the directory to scan
if len(sys.argv) != 2 or not os.path.isdir(sys.argv[1]):
   print("Usage: arbor [directory]")
   sys.exit(1)

srcfolder = os.path.abspath(sys.argv[1])

backupfolder = os.path.join(srcfolder, ".arbor")
if not os.path.isdir(backupfolder):
   os.mkdir(backupfolder)

skipped_directories = [".Trash", ".arbor"]
skipped_files = [".DS_Store"]
size_index = {}
MAX_READ = 10485760  # 10 MB; larger files only get a partial checksum

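def addsha1file(filename, size):
   # Hash the file's contents, then hard-link it to .arbor/<sha1>
   try:
       with open(filename, 'rb') as f:
           if size > MAX_READ:
               # Partial checksum: half of MAX_READ from the start, half from the middle
               data = f.read(MAX_READ // 2)
               f.seek(size // 2)
               data += f.read(MAX_READ // 2)
           else:
               data = f.read()
       sha1 = hashlib.sha1(data).hexdigest()
       backupfile = os.path.join(backupfolder, sha1)
       # Always link the file that was just hashed (filename). Using the loop's
       # global srcfile here would link a different file of the same size.
       if os.path.exists(backupfile):
           os.unlink(filename)
           os.link(backupfile, filename)
       else:
           os.link(filename, backupfile)
   except Exception:
       print("Unexpected error: %s %s" % (sys.exc_info()[0], sys.exc_info()[1]))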

fcount = 0
for root, dirs, files in os.walk(srcfolder):
   # Prune skipped directories in place so os.walk does not descend into them
   for item in skipped_directories:
       if item in dirs:
           dirs.remove(item)
   for name in files:
       fcount += 1
       if fcount % 500 == 0:
           print("%d  files scanned" % fcount)
       if name in skipped_files:
           continue

       srcfile = os.path.join(root, name)
       size = os.stat(srcfile).st_size
       if size in size_index:
           if size_index[size] == '':
               # This size has already been seen twice or more: hash the new file
               addsha1file(srcfile, size)
           else:
               # Second file of this size: hash the remembered first file too
               addsha1file(size_index[size], size)
               addsha1file(srcfile, size)
               size_index[size] = ''
       else:
           # First file of this size: remember it, no checksum needed yet
           size_index[size] = srcfile

Here are the added features of this version:

Folder to back up is passed as a command-line argument
Backup files are placed at the top level of that folder in the .arbor directory. These are just hard links, so they don't really add anything to the size of the directory.
Ability to skip named folders or files
Only calculates a checksum if two files are the same size.
Only calculates a partial checksum if a file is over 10 MB. Checks 5 MB from the start and 5 MB from the middle.
Prints a running tally of files checked (every 500 files)
Doesn't choke on errors: some files don't like to be stat'ed or unlinked because of permissions issues.
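The "only checksum files that share a size" idea can be shown on its own. This is a rough sketch rather than the code arbor.py actually runs: the function name candidate_groups is made up for illustration, and it collects all paths in memory first instead of hashing as it walks.

```python
import os
from collections import defaultdict

def candidate_groups(folder,
                     skipped_directories=(".Trash", ".arbor"),
                     skipped_files=(".DS_Store",)):
    """Return lists of paths that share a size; only these need a checksum."""
    by_size = defaultdict(list)
    for root, dirs, files in os.walk(folder):
        # Prune in place so os.walk skips these directories entirely
        dirs[:] = [d for d in dirs if d not in skipped_directories]
        for name in files:
            if name in skipped_files:
                continue
            path = os.path.join(root, name)
            by_size[os.path.getsize(path)].append(path)
    # A file with a unique size cannot have a duplicate, so drop it
    return [sorted(paths) for paths in by_size.values() if len(paths) > 1]
```

Getting a file's size is a cheap metadata lookup, while a checksum has to read the file's contents, so most files never need to be opened at all.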

I'm not sure about the partial checksum option, but it was really bogging down on larger inputs. It's not really practical to do a SHA1 checksum on a bunch of large files, and I think it's safe to say that two very large files can be assumed to be the same if the first 5 MB, 5 MB from the middle, and the overall size are exactly the same. Perhaps I will add an option later for strict checking, if someone is highly concerned about data integrity. But the practical limitations are there: I'm 30,000 files into a scan of my ~200 GB backup folder, and I certainly wouldn't have gotten that far without the file size limiting.
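For anyone weighing that trade-off, here is a small sketch of what the partial checksum can miss, next to one possible shape for a strict option. The names partial_sha1, full_sha1 and demo_tail_difference are hypothetical, and the demo uses a tiny max_read in place of the 10 MB limit so it runs instantly.

```python
import hashlib
import os
import tempfile

def partial_sha1(path, max_read):
    """Hash max_read/2 bytes from the start and max_read/2 from the middle."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        data = f.read(max_read // 2)
        f.seek(size // 2)
        data += f.read(max_read // 2)
    return hashlib.sha1(data).hexdigest()

def full_sha1(path, chunk_size=1024 * 1024):
    """Hash the whole file in fixed-size chunks so memory use stays small."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def demo_tail_difference():
    """Compare two same-size files that differ only in their last byte."""
    folder = tempfile.mkdtemp()
    a = os.path.join(folder, "a.bin")
    b = os.path.join(folder, "b.bin")
    with open(a, "wb") as f:
        f.write(b"x" * 1000)
    with open(b, "wb") as f:
        f.write(b"x" * 999 + b"y")
    # max_read=100 stands in for the 10 MB limit so the demo stays tiny
    partial_match = partial_sha1(a, 100) == partial_sha1(b, 100)
    full_match = full_sha1(a) == full_sha1(b)
    return partial_match, full_match

if __name__ == "__main__":
    print(demo_tail_difference())
```

The partial check never looks at the end of the file, so two large files that differ only near the end would be treated as identical. The chunked version reads everything, which is slower but keeps memory use flat however big the file is.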

Update: The scan was almost done when I wrote this. Here is the tail end of the log:

31000  files scanned
31500  files scanned
32000  files scanned
32500  files scanned

real 43m40.531s
user 7m25.593s
sys 3m35.540s

So 233 GB over 32,500 items took about 45 minutes to check, and it looks like I've saved about 4 GB. Upon further inspection, it seems that most media files store their metadata inside the file contents, so the checksums differ even when the media itself is the same. Hmmm....
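One way to put a number on the savings is to count every extra name that points at data already seen. This is only a sketch under my own assumptions, not how the 4 GB figure above was measured; the function name bytes_saved_by_links is made up, and it will also count files that were hard-linked before arbor ever ran.

```python
import os

def bytes_saved_by_links(folder, skipped_directories=(".arbor",)):
    """Sum the sizes of extra names that point at already-counted data."""
    seen = set()
    saved = 0
    for root, dirs, files in os.walk(folder):
        # Skip .arbor so its bookkeeping links are not counted as savings
        dirs[:] = [d for d in dirs if d not in skipped_directories]
        for name in files:
            info = os.lstat(os.path.join(root, name))
            key = (info.st_dev, info.st_ino)
            if key in seen:
                # Another name for data we've already counted once
                saved += info.st_size
            else:
                seen.add(key)
    return saved
```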

3.08.2010
