amXor

Blogging about our lives online.

3.06.2010

File Management Tool

I ended up sketching out the details of how my file managing tool will work. It's kind of like a virtual librarian that removes redundant files while leaving the file hierarchy intact. My method is a bit of a mashup of how other tools work, so I'll give credit where credit is due.

This is how it works so far:

  • All files in a tree have their SHA1 hash value computed (Git)
  • A hard link is created in the backup folder with the SHA1 as its name (Time Machine) ...
  • unless the file already exists there, in which case the source file is hard-linked to the existing SHA1 copy (...)

There is no copying or moving of files, simply linking and unlinking, so 99.9999% of the time is spent computing the hash values of the files. Here's the Python version of this:

import os
import hashlib

# Folder that will hold one copy of every unique file, named by its SHA1
backupfolder = os.path.abspath('./backup')
# Folder being deduplicated; its hierarchy is left untouched
srcfolder = os.path.abspath('./working')

for root, dirs, files in os.walk(srcfolder):
    for name in files:
        if name == '.DS_Store':
            continue
        srcfile = os.path.join(root, name)
        # Hash the contents in binary mode so the digest reflects the
        # exact bytes of the file
        with open(srcfile, 'rb') as f:
            sha1 = hashlib.sha1(f.read()).hexdigest()
        backupfile = os.path.join(backupfolder, sha1)
        if os.path.exists(backupfile):
            # Duplicate: replace the source file with a hard link to the
            # existing backup copy
            os.unlink(srcfile)
            os.link(backupfile, srcfile)
        else:
            # First time this content has been seen: hard-link it into the backup
            os.link(srcfile, backupfile)
        # print backupfile

This folder contains about 5 GB of data, and I thought the SHA1 calculations might take a couple of weeks, but as it turns out they only take a couple of minutes. What you end up with is a backup folder containing every unique file in the tree, named by its SHA1 hash, while the source folder looks exactly as it did when you started, except that every file is now a hard link.
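
One quick way to convince yourself that the working tree really is all hard links is to look at the link counts. This little check isn't part of the tool itself (the folder name is just the one used above); it prints the inode and link count for each file:

import os

workingfolder = os.path.abspath('./working')

for root, dirs, files in os.walk(workingfolder):
    for name in files:
        path = os.path.join(root, name)
        st = os.stat(path)
        # a deduplicated file has at least two links:
        # one in the working tree and one in the backup folder
        print("%s: inode %d, %d links" % (path, st.st_ino, st.st_nlink))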

So, what are the benefits?

Filenames are not important

Because the SHA1 is computed only from the contents of a file, filenames don't matter. This is important in two ways: if a file has been renamed in one tree but its contents are unchanged, you store only one copy yet both names are preserved. And more importantly, if you have two files in separate trees with the same name (e.g. 'Picture 1.png') but different contents, the naming is kept and both files survive.
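
To make that concrete, here's a toy illustration (the paths in the comments are invented) of why only the contents decide whether two files collapse into one backup copy:

import hashlib

def sha1_of(data):
    return hashlib.sha1(data).hexdigest()

# same contents under different names -> same hash, so only one copy is kept
print(sha1_of(b'annual report'))    # working/2009/report.txt
print(sha1_of(b'annual report'))    # working/2010/report-final.txt

# same name with different contents -> different hashes, so both survive
print(sha1_of(b'screenshot A'))     # working/2009/Picture 1.png
print(sha1_of(b'screenshot B'))     # working/2010/Picture 1.png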

If you have trees of highly redundant data, this is the archive method for you. My test case was a folder of 15 direct copies of backup CDs that I have made over the years, and it saved about 600 MB out of 5 GB. The original file hierarchies look exactly the same as they did before running the backup.

What is wrong with it?

As it stands, it messes with Spotlight's and Finder's heads a little bit. Finder doesn't compute correct size values for the two folders. du prints the same usage whether I include both folders or one at a time, which is pretty clever: total: 5.1 GB, working: 5.1 GB, backup: 5.1 GB. Finder, on the other hand, reports total: 5.1 GB, working: 5.1 GB, backup: 4.22 GB.
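
du gets the number right because it counts each inode only once per run. A rough Python equivalent of that bookkeeping (folder names as above, keyed on device plus inode to be safe) looks like this:

import os

seen = set()
total = 0
for folder in ('./working', './backup'):
    for root, dirs, files in os.walk(folder):
        for name in files:
            st = os.stat(os.path.join(root, name))
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                # count every hard-linked file only once
                seen.add(key)
                total += st.st_size
print("deduplicated total: %.2f GB" % (total / 1e9))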

Spotlight

Some very weird stuff happens with Spotlight.

A Spotlight search in the working directory mostly shows files from the backup directory, which isn't convenient. The files in the backup dir have no file extension, so they're essentially unopenable from Finder. Here's what I found using the command-line mdfind:

mdfind -onlyin ./working "current"
/Users/.../backup/5f5b587eb07ee61f15ab0a032ca564a17ff461e9
/Users/.../backup/0f3f769000f164b2e30bb7b3f09482e8cc244135
and so on ...

mdfind -onlyin ./backup "current"
nothing found

For some reason, when searching the working directory it finds the content, yet always resolves the file name to a directory it's not supposed to be searching. And if you search the backup directory, it doesn't even bother reading the files, because it assumes from their names that it can't read them.

I'm starting to wish that Steve Jobs hadn't caved and given in to the file extension system.

Time Machine

Okay, it's useful, but how is it similar to Time Machine? Time Machine creates a full copy of the tree when it first backs up the system. From then on it recreates the full hierarchy of directories, but every file that hasn't changed is a hard link to the original backup. Each unique file is a new inode created in time, whereas in my system each unique file is a new inode created in space: all duplicates in time are flattened by Time Machine, and all duplicates in space are flattened by my system.
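
For comparison, here's a stripped-down sketch of that Time Machine idea (the function and paths are my own invention, not Apple's code): walk the source, hard-link anything that is unchanged since the previous snapshot, and copy anything that is new or modified.

import os
import shutil
import filecmp

def snapshot(src, prev_snap, new_snap):
    for root, dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        dest_dir = os.path.join(new_snap, rel)
        if not os.path.isdir(dest_dir):
            os.makedirs(dest_dir)
        for name in files:
            srcfile = os.path.join(root, name)
            prevfile = os.path.join(prev_snap, rel, name)
            destfile = os.path.join(dest_dir, name)
            if os.path.exists(prevfile) and filecmp.cmp(srcfile, prevfile, shallow=False):
                os.link(prevfile, destfile)      # unchanged: share the old inode
            else:
                shutil.copy2(srcfile, destfile)  # changed or new: a new inode in time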

Note: To copy folders from the command line while preserving as much metadata as possible, use `cp -Rp`.
