Blogging about our lives online.


From Command-line to Xcode


I'm a command-line guy at heart when it comes to programming. It seems much more organic to know what every project file is doing and how each one contributes to the whole: which files might break things, which are optional, and which can be refactored or optimized.

"The Pragmatic Programmer" challenged this notion. Its authors recommend diversity: if you only know an IDE, try the command line, and if you only know the command line, try an IDE. At first I was skeptical of this advice. I've come to know my command-line toolset fairly well, and IDEs feel like they add too much baggage and inflexibility. I've built Xcode projects before, and there is a lot of behind-the-scenes voodoo attached to them.

But, after pushing through my initial concerns, I've come to really enjoy using Xcode for my projects. Here are some command-line programmer concerns, and how to alleviate them in Xcode. Some may be obvious to those who are at home in their IDE, but they weren't to me.

  • Problem: I'm going to have to fill my project folder with unnecessary Xcode-specific files.
  • Solution: Make a workspace (e.g. Documents > Web_development.xcworkspace). Add your project folders to this workspace (without copying the files).

  • Problem: My project is unique, won't I have to use the command line anyway?

  • Solution: This is where command-line knowledge really pays off. Just create a new "Scheme" and you can run scripts pre/post-Build and pre/post-Run. For a simple HTML site, I just set the browser as the executable and my index.html file as the argument. The build scheme is your hook into all the command-line utilities at your disposal.

  • Problem: I prefer editing with [vim, emacs, ...].

  • Solution: Well, I can't really help you there. Personally, I like Vim, but I'm not a total fanboy. I'm quite happy using the standard OS X keybindings, with a few tweaks for ones I use a lot. Find & Replace works well, split view and file navigation work well, and syntax highlighting is fully configurable and works with every file I've thrown at it.

  • Problem: I can't manage my git (svn, whatever) repositories from there.

  • Solution: This was my own ignorance. I use git for all my projects, and Xcode's support for git is actually pretty awesome. It automatically knows if a file is in a git repo, and puts an M beside the file if it's been modified.

All in all, I've really enjoyed programming in Xcode. It does take a small investment of time and imagination to build custom projects, but I think it's worth it.


Skratch Beta - Call for Testers


I've finally got my online coloring book ready to unleash into the wild and I would like to ask you to try it out!

If you're willing, I would ask that you answer all/some/none of the following questions and e-mail them to me (1andyvanee _at_ gmail _dot_ com) or just leave a comment.

  1. What hardware/browser(s) did you try it on?
  2. What was your overall impression of the app?
  3. Have you used any comparable applications? If yes, what did you like/dislike about them?
  4. Were there any features that seemed unexpected or strange to you?
  5. If you could add one killer feature, what would it be?

Thanks guys, have fun with Skratch!

Download it Here

P.S. If you answer Internet Explorer to the first question, you probably won't get to the rest...

P.P.S. Confirmed, Absolutely NO IE SUPPORT.


Endangered Species and UX


I've just picked up Gene Wolfe's collection of short stories, "Endangered Species" for a second read. I stopped after the introduction to think about it for a while. It's that good.

I'll outline his points first for context. Wolfe starts by saying that the most important aspect of a story is that it have a reader. He explains that he wrote these stories for me, the reader. He then goes on to describe me, the reader, in great detail. My hopes and fears, my character flaws, virtues, vices and deepest thoughts. He also tells me the reason he wrote a few of the stories particularly for me.

His examples are obviously generic enough to cover a lot of people, but it's truly remarkable how specifically he targets his audience. Here's an excerpt to give you the flavour.

"...At certain times you have feared that you are insane, at others that you are the only sane person in the world. You are patient, and yet eager..."

The reason I have found this so refreshing is that I've read a lot on User Experience (UX) lately. There is a fair amount of hype and self-importance in a lot of UX articles. The stated goal of UX is providing a pleasant experience for "The User", but most of the literature drives home the point that the average user is ill-tempered, ignorant and just plain stupid.

Gene Wolfe truly cares about his readers. He thinks highly of them. He also expects creative engagement and effort from his readers.

UX design seems to be focused on placating unwilling participants. I would argue that it should be focused on delighting the willing participants. Valuing them.


SVG Explorations


SVG seems to have got buried under all the hype about the HTML5 canvas. But it's really quite cool and I'm surprised more sites aren't using it for dynamic content.

Part of the problem seems to be early and inconsistent implementation. A lot of the demos that are floating about the web are old, uninspiring and bland. Some good examples can be found on the raphael.js site and a decent example of a full page built with SVG is

I've built a small example using CSS styling and JavaScript mouseover events to randomly rotate each path in an image I "traced" using Illustrator.

Take a look. (Hint: Try to click the pink element!)
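The effect itself is small enough to sketch. This is a rough reconstruction of the idea, not the exact code behind the demo: it assumes an inline `<svg>` whose `<path>` elements should each spin to a random angle on mouseover. The transform-string helper is pure; the event wiring only runs in a browser.

```javascript
// Build an SVG rotate() transform string about a given centre point.
function rotateAbout(angle, cx, cy) {
  return "rotate(" + angle + " " + cx + " " + cy + ")";
}

// Browser-only wiring (assumes an inline <svg> in the page).
if (typeof document !== "undefined") {
  var paths = document.querySelectorAll("svg path");
  Array.prototype.forEach.call(paths, function (path) {
    path.addEventListener("mouseover", function () {
      var box = path.getBBox();
      var angle = Math.floor(Math.random() * 360);
      path.setAttribute("transform",
        rotateAbout(angle, box.x + box.width / 2, box.y + box.height / 2));
    });
  });
}
```

Rotating about each path's own bounding-box centre keeps the elements spinning in place instead of swinging around the SVG origin.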

Update: Apparently Google Graphs (a spreadsheet option) are rendered with SVG. Cool!


Chaining in JavaScript


It's been bugging me for a while that I didn't understand the method behind jQuery's function chaining. So I built the smallest functional version of it. I call it aQuery (for andy, or awesome).

This is the calling code:

window.onload = function(){
    aQ("#content").height().background().elem.innerHTML = "hello";
};

I have made two functions, .height() and .background(), which do pretty much what their names suggest. The elem property is the original DOM object, so I use it to call "native" code. Maybe I should have called it "target" to be consistent.

Here's the meat and potatoes of the "library":

(function() {
    var aQ = function(selector) {
        return new aQuery.fn(selector);
    };

    var aQuery = {};
    aQuery.fn = function(selector) {
        this.elem = document.getElementById(selector.split("#")[1]);
        this.cssText = "";
        this.height = function(){
            this.cssText = this.cssText + "height: " + window.innerHeight + "px;";
   = this.cssText;
            return this;
        };
        this.background = function(){
            this.cssText = this.cssText + "background-color: #444;";
   = this.cssText;
            return this;
        };
        return this;
    };

    return (window.aQ = aQ);
})();

Each call to aQ makes a new object, primed with the selector string. This object has several functions and attributes, and all the functions return "this", so they can be chained together. As it stands, it's not dynamic at all but you can use your imagination!
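Stripped of the DOM entirely, the whole trick fits in a few lines: every method does its work and returns `this`, so calls string together left to right.

```javascript
// The chaining pattern in isolation: each method records what it did
// and returns `this`, so calls can be chained.
function Chain() {
  this.calls = [];
}
Chain.prototype.height = function () {
  this.calls.push("height");
  return this;
};
Chain.prototype.background = function () {
  this.calls.push("background");
  return this;
};

var c = new Chain().height().background().height();
// c.calls is ["height", "background", "height"]
```

The only rule is that every chainable method must end with `return this`; break that once and the chain dies at that link.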


jQuery Mobile Data Attributes

In testing out the newly released jQuery Mobile library, I'm left with strong but mixed feelings about it.

First, it's a surprisingly unique solution. The jQuery team is very good at simplifying hard problems, but in some ways you come to expect that the solutions will be quite similar. I expected a bunch of jQuery-type functions that would transform your markup into the mobile look and feel. I expected jQueryUI for mobile, and was ready to start moaning about how inflexible and unsemantic it was.

Instead, their solution is to add HTML syntax and automatically wire up all of the functionality for you. Not what I was expecting. You don't have to write a single line of JavaScript for the default functionality. This has got to be the easiest solution for porting HTML content to a device. Your "pages" can all live inside one HTML file or be loaded dynamically with AJAX. All the URLs get dynamically referenced to the main page and resolve correctly. And no handwritten JavaScript or CSS in sight.

A lot of this functionality comes through the use of custom data attributes in your HTML. In my perusing of the HTML5 features I have to admit: this one seemed fairly boring. So what if I can add custom attributes and my page still validates? Why should I care?

But the guys at jQuery Mobile have capitalized heavily on this notion. By using custom attributes quite liberally to alter the semantic meaning of lists, headers and anchors, they can automate all the behind the scenes JavaScript and CSS necessary to make it work.
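The mechanism is easy to imitate in miniature. This is a hypothetical sketch of the idea, nothing like jQuery Mobile's real internals (the "ui-btn" class name is just for illustration): scan for data-role attributes and decorate the matching elements. It works on any object exposing getAttribute, so it can be exercised without a browser.

```javascript
// Hypothetical enhancer: walk a list of elements and wire up anything
// marked with data-role="button" by adding an illustrative class.
function enhance(elements) {
  elements.forEach(function (el) {
    if (el.getAttribute("data-role") === "button") {
      el.className = (el.className + " ui-btn").trim();
    }
  });
  return elements;
}

// Exercise it with a stand-in element (no browser needed):
var fake = { className: "", getAttribute: function (name) {
  return name === "data-role" ? "button" : null;
} };
enhance([fake]);
// fake.className is now "ui-btn"

// In a real page you would call something like:
// enhance(Array.prototype.slice.call(document.querySelectorAll("[data-role]")));
```

All the declarative heavy lifting reduces to this: markup states intent, a startup pass turns intent into behavior and styling.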

It's a strange reversal of the web technology stack: the withering old HTML document is front and center again, while the CSS and JavaScript hide in the back room, pulling the invisible strings.

For a "Navigate to Text Content" site, this is a dead-simple solution. I think a 'Project Gutenberg' site is at the top of my list. Or maybe an RSS reader.

Some negatives that I see:

  1. Building or altering UI components is going to be a pain.
  2. Really only scales to iPhone-size devices. (They support iPad, but who uses full-width buttons on iPad?)
  3. The UI is a tad bland. (Erring on the side of caution, I guess)
  4. I'm undecided on all of those data-role tags.
  5. It's new and strange. I don't like new things. I'm scared of it. What if it doesn't like me?

My real takeaway is not the tool itself, but understanding how it works and how I can incorporate those ideas into my own projects. There's a lot of smart people working on projects like this, and you really only get to know them by reading their code.


Boxeddet - Another Wireframing Tool


After reading the article on about wireframing I thought to myself: "What the world needs now is another wireframing tool!"

Yes, the options are staggering. But in a way they all seemed very similar. My feeling is that if you want to sell a tool to a designer, the primary goal is not adding new functionality but simply adding a bit of unique visual flair to make it "pop".

Last night I hacked up a bit of code for a different type of tool. This one produces actual HTML layouts in the grey-box style. You just draw your divs inside the browser, and the DOM is dynamically updated.

I've put the code up on github, so go ahead and clone/fork/prod it and let me know what you think!

Get The Code!

Or you can Test drive it!


jQuery Extensions?


I've been working with jQuery a fair bit lately. Although it gives you a fair number of handy shortcuts, I think its true value is as an educational tool on how to use JavaScript effectively.

This means diving into the jQuery sources. It means learning what anonymous functions and closures are all about, letting it settle, and then learning what they're about again. It also means taking 98% of all the JavaScript tutorials on the internets and throwing them out. Well, maybe not throwing them out, but putting them into context.

What I'm digging into right now is programming in the jQuery paradigm. What this means is bolting all my own functions onto the jQuery object through an anonymous function. I can then call these functions just as I would call the jQuery functions, making for a cleaner and more reusable codebase.

The basic pattern is this: first the jQuery library itself, then a closure where I attach my functions to the $ object, and finally a ready handler that uses them.

This is where I attach my functions to the $ object. The template goes like this:

(function ($) {
    var private_var = 1;
    var private_function = function() {
        return "Only this closure can call me, I'm special.";
    };
    $.cool_function = function(){
        return "Hello jQuery.";
    };
})(jQuery);

$().ready(function() {
    // ...call $.cool_function() and friends from here...
});

That's it in a nutshell. For more info on wrapping your brain around why this works, see this article. I wish I had read it earlier.

You'll note I'm not handling input from jQuery selectors. I'm going to have to let that one bounce around in my brain a bit longer...
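For what it's worth, the jQuery idiom for selector-aware extensions is to attach to `$.fn` (an alias for the prototype of a jQuery selection) and iterate with `each()`. The sketch below uses a tiny stand-in for jQuery so it's self-contained; the stand-in and the `shout` plugin are mine, not jQuery's actual code.

```javascript
// Tiny stand-in for jQuery, just enough to show the $.fn idiom:
// a selection wraps an array of elements, and $.fn is its prototype.
function $(elems) {
  if (!(this instanceof $)) return new $(elems);
  this.elems = elems;
}
$.fn = $.prototype;
$.fn.each = function (fn) {
  this.elems.forEach(function (el) { fn.call(el); });
  return this;
};

// The plugin pattern: selection-aware methods go on $.fn, operate on
// each element (bound as `this`), and return the selection to chain.
$.fn.shout = function () {
  return this.each(function () {
    this.text = this.text.toUpperCase();
  });
};

var sel = $([{ text: "hello" }, { text: "world" }]).shout();
// sel.elems[1].text is "WORLD"
```

Inside `each()`, `this` is the individual element, while inside the plugin body `this` is the whole selection: that double meaning is most of what makes the pattern confusing at first.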

Happy coding!


Call For Web Standard


Developing native apps is a pain. Yes, you have the luxury of fixed screen dimensions and access to hardware features that may not be available elsewhere, but when you're done, you have an app that only works on one platform.

Web apps, on the other hand, work on a lot more platforms straight out of the box. The performance may not be the greatest, but the technology is advancing at breakneck speed.

The single largest drawback of web apps is working with files. The browser has no access to the filesystem, which makes sense in a lot of ways, but I'm surprised that this hasn't been addressed more thoroughly.

What we need is a new web standard and API for remote filesystems. Unlike popular storage solutions like Dropbox and MobileMe, this would be a remote hard drive that you and your apps have access to. Apps would need to authenticate to access zones of your storage space; you could make some zones public and others accessible only via login.

Again, this goes back to my idea of content control. I would far sooner use an app that writes to a simple, accessible file type than one that uses a locked-down and proprietary format. The same standard applies to web apps, but the current assumption is that your web-app files live and die with the service that created them. Even Google Docs, probably the best of its breed, only operates ideally if you access and edit the files exclusively through their service (downloading/uploading strips date, author, etc.). Interoperability with other web apps? Well, that's just crazy talk!


But why is it crazy? What is it about web services that makes us freely give up the rights to our data? If I downloaded a word processor that password-encrypted every file so that only that app could read it, I certainly wouldn't use it! But we are so blinded by the sharing and publishing features of these apps that we throw interoperability and future-proofing out the window.

In fact, I think there's some merit to this idea even in the desktop computing world. By zoning the filesystem, you could download and run apps in a sandboxed environment without worrying that they are messing with your files.

Maybe something like this exists? It seems like the IETF group working on WebDAV has something like this in mind, but it still seems pretty system-level, like Samba and AFP.


Divide And Conquer


School has started and I've been so busy with projects that I haven't had time to write. And in fact there's not a lot to write about. Sure, I've been working on a lot of cool stuff, but the big picture hasn't come into focus.

Along with getting a grip on the Adobe tools (Photoshop, Illustrator, Flash/AS), there's the programming.

One of the best insights I've gained so far is how powerful pencil and paper can be. Storyboarding, brainstorming and sketching use cases are great ways to think about a project from all angles. If you sit down and start coding without any clear direction, you waste a lot of thought on design and abstract principles, when you need to focus that brainpower on writing sensible code. Working this way, a lot of design patterns start to emerge before you even begin coding.

And maybe that's the one bigger picture that is coming into view: learning to divide and conquer is vital to any project, even if it's just a one man show.




Technology is advancing, but is it really advancing as fast as it seems?

I might come across as a bit of a Luddite here, but I think that a lot of the emerging software and hardware is ephemeral. And maybe that's the way consumer electronics has to be. In the pre-dawn of the internet, software was developed for aerospace, medicine and government. It had to be robust, provable and enduring.

The landscape has changed completely. Today, developers have embraced the ephemeral nature of apps, iPads and cell phones. They're not worried about building enduring artifacts and code that will be solid and transparent for a generation.

Perhaps hindsight is always myopic. There are plenty of criticisms that judge the UNIX philosophy as hackish: a virus that spread the "Worse is Better" philosophy of the open source movement in general. But today, the tools have been honed, pruned and documented so well that they work together as an organic whole. An organic whole that can be extended or repurposed in ever new ways.

It is this sense of growth that seems lacking in the software and hardware landscape of today. Code and hardware now go into the landfill in a few years by default. Without the organic usefulness of command-line programs, standalone apps have no value beyond a very short life-cycle.

In a way, I think it's what had to happen. For most consumers, the only interface that will get used is the one that is blatantly obvious. They feel smart when the computer doesn't make them feel stupid. And so user interface must appeal to the lowest common denominator. I just feel a tiny shiver of remorse when I see the amount of energy poured into technology that will be junked and forgotten in months or years.


Git Whyto


There are plenty of Git howtos on the internet; what many people need instead is a Git whyto. This guide is specifically focused on using Git and GitHub for personal projects like writing, publishing and creative work.

1. Complete Content Control

Git basically tracks any folder you tell it to. As you make changes to files, you can commit the changes, roll back the changes or create branches to try out alternate paths without messing with the main branch. You then push these changes to GitHub, which is simply a mirror of the files on-disk.

2. Responsible Collaboration

It's easy to add others as collaborators on your project, and easy to see what, when and how much they changed. Any changes that you don't agree with are easily rolled back to previous versions.

3. Tracking Intentions

Each time you commit a change, you write a short description of the change. This helps you to get a sense of where the project is at, what you were thinking, and what still needs work.

4. Scalable Projects

Personal projects start out small, but there is a chance one of them may catch fire and become a big idea. Git works well with a single contributor or thousands of contributors. And if the project takes off, you will have the track record of the entire project right from the first commit.

5. Universal Appeal

Git was designed for software projects, but it has many compelling features for any project, from thesis to portfolio to presentation. System-wide backup plans like Time Machine are great on the large scale; Git allows the same kind of snapshots on a per-folder basis, but adds the ability to merge, clone and modify these snapshots across many computers.

One slight problem with GitHub is that the free version only allows public projects (view and download).

If your project is private you can always just use git on your personal computer, zip the folder and back it up to Google Docs. You can then unzip the folder to any other computer and it will work with git exactly as it should. Or, alternatively, you could keep your master on an external hard drive and merge changes from your laptop, desktop and work computers without a problem.


Photo Archiving


Backups and Archives are two different things. Ideally you should have a good archiving system, and your backup would be a redundant copy of that.

In photography, a robust archiving system has between two and four copies of any file. These copies are:

  • Camera Card
  • Working Copy
  • Source
  • Backup

In this case, Source and Backup are read-only copies that are exact duplicates of the file off the camera. The Backup copy should be on separate physical media and ideally in a separate physical location.

The Working Copy is the one you edit, view and share. Once a file is edited, it is important to make Source and Backup copies of the edited version as well.

It is good to view these copies as layers of mutability.

Camera Card: Changes the most. Every shooting session will create and erase content from the card.

Working Copy: Can change with each editing session and these files can be safely erased once backups are made.

Source Copy: Should never change; it is the copy that is accessed for viewing and for making Working Copies. Any Source copy that is deleted is deleted forever.

Backup Copy: A mirror image of your Source copy. It will only be accessed in the rare event that your source copies are compromised (HD failure, fire, theft, etc.).


The way to implement this structure will vary with the tools that you are using. The first step might be to disable automatic importing when you connect your camera. A better solution when connecting your camera is to make source and backup copies immediately.

If you are using DVDs as your Backup copy, it probably makes sense to use a USB thumb drive as an intermediate backup until you have enough to fill a DVD.

It is important, though, to do regular integrity checks on your Backup copies. You need to make sure they aren't compromised without you knowing. If you use an external hard drive and are command-line savvy, you might run "diff -r PictureSource/ PictureBackup/" to compare the entire contents of the two folders. If you are using DVDs, you either have to trust the quality of your media or do periodic checks of it.

A typical import session might look like this:

  1. Connect Camera.
  2. Make Source copy, check file count.
  3. Organize Source photos into events, deleting any Absolute Garbage.
  4. Make Backup copies, check file count.
  5. Make Working Copy from Source copy by importing into your editor of choice.

Resist the temptation to organize your Source folder too much. Use only large, time-based, linear chunks.

  • DO: Pete's Wedding, Reception, Saloman Bay Beach...
  • DON'T: Flowers, Trees, Close-ups, Saloman Bay Beach...

The attentive reader will notice that the beach name is in both the do and don't lists. This is because in the first, I've assumed it was an afternoon of shooting at the beach; in the second, it's a couple of shots of that subject interspersed with shots of other subjects.

On a related note, don't rename your files. The filenames from your camera are a very good record of your shooting. If you want to attach descriptions and tags, do it in the file's metadata.

Now you are ready to edit your files. A typical editing session might look like this:

  1. Make a Working Copy from Source copy (if it's not already done).
  2. Edit the file(s).
  3. Make a new Source copy next to the original (e.g. DSC1255-edit1.jpg).
  4. Make a Backup of what has changed in the Source.

If you are using DVDs, you probably want to make a new folder for any edited files, because the original folder might already be burned to disc. Again, use the original name with a standard suffix (Pete's Wedding-edited).

Generalized Strategy

Photos are actually quite easy to organize compared to the myriad of other file types out there. But the same systematic logic can be transferred to other types of files as well.

Being systematic avoids a lot of confusion. It also makes it possible to work toward automating the process entirely. But as the number of file types increases, so does the tendency to use different sorting methods or to mix methods.

What The Internet Needs Now


I've been blogging a fair bit lately about metadata, archiving and dealing with the masses of information that we produce and consume today.

It all comes from an idea that has been alternately percolating and distilling in my mind: the idea of stability.

The internet age is highly energized and highly transient. The next big thing comes along, is adopted by huge masses of people, and is discarded in a matter of months or years. Examples abound, but my working example is waning interest in Facebook.

I recently printed a PDF of my entire Facebook history because I'm a dork and find that type of thing interesting. I just kept scrolling down and clicking "Older Posts" until I reached the "Andy joined Facebook" post. Here are my findings:

  • It's only a 36-page PDF (small font, though). Thought it would be more.
  • I joined in June 2007.
  • My first friend is someone I haven't spoken to since.
  • My first status update is still one of my favorite quotes: "The chief enemy of creativity is good taste. - Pablo Picasso"
  • I was really into "Graffiti" in the early days. I printed those off too.
  • I "connected" with many people that I had forgotten about for years. Then promptly forgot about them again! Some I genuinely want to keep in touch with though, so that's good.
  • I wish there were a graph of friend numbers over time. I think it would resemble a logarithmic curve: rapidly rising, but plateauing at around 220.

Okay, it's a silly game, but I find it alarming that we invest so much time into something that is probably doomed to extinction or, more likely, habituation. Facebook has become an e-mail replacement and when is the last time you were truly excited about e-mail?

The tool has become stable: less exciting, but stable. The end result is a tool that is far less robust and stable technically than the one it replaced. E-mail may be getting archaic, but it is a completely open and robust mechanism that is largely independent of any specific implementation. Facebook, on the other hand, is completely tied to corporate interest and a locked-down API.

I think we must all make a conscious effort to aim for stability in the turbulent age we live in. To research the tools we use and what the long term strategy is for them. And for those who are in the business of creating the tools, to think long and hard about how robust the system is in the long term. Is a locked and site-specific format necessary? (no, never!) Is your user policy going to limit the long-term viability of the system?

In the case of this blog, I've made a conscious decision to keep control of the content. The site is just the publishing medium, the actual product, for me, is just a folder of plain-text files on my hard drive in chronological order. It may not look flashy with fixed-width fonts and markdown formatting, but it's simple, stable and guaranteed to work long into the future.


Auto Metadata


One interesting part of the weekend photographic seminar I just got back from was the emphasis on metadata.

It's important to take great photos, but it's just as important to know how you got those results. In the days of film, a photographer would keep a shooting log, recording aperture, speed, lens, etc. A decent digital camera will record all that information for you, as long as you know where to look for it.

Photography is one of the best examples of how useful metadata can be. The reason it is so useful is that it is unambiguous and immutable. Any metadata that has these qualities is easy to work with.

Pure Metadata

This is the simplest kind of metadata to work with and often the most useful. It usually has a direct connection to something tangible and concrete. Some types of pure metadata:

  • File size
  • File Attributes (resolution, color space)
  • Physical Settings (ISO, aperture, shutter speed)
  • Date*

Dates are very useful but only if they are relevant. If the date is not the actual shooting date it is actually less help than no date at all. This can happen with downloaded files or files that are saved to a new location.

Impure Metadata

File names are probably the least valuable indexing tool out there. They don't have to have any relevance to the content: an identical file can have different names, and totally different files can have the same name.

Tags are good, but I don't find them as useful as they could be. Again, there isn't necessarily any correlation with the content, and there is no standard set of tags or naming conventions enforced. But the primary reason they don't work for me is that tagging is a manual process. Any metadata that isn't automatically applied to all content is too much work.

Folder hierarchies are actually metadata, and we use them every day. It's important to understand this when backing up or rearranging files, because the hierarchy probably has meaning that would be lost if you moved those files.

The biggest problem with impure metadata, like the kinds I've mentioned, is that it can be very hard to normalize. There may be information attached as file names, tags and folder structures, but it's all been manually assigned by what made sense at the time. This manual metadata will always be very fragmented and incomplete.


Meota Summer Photographic Seminar


Just finishing up the weekend photographic seminar at the lake. Got some good shots and some good practice at composition, lighting and rhythm.


Chaotic Data


Data is chaotic. Our attempts to tame it are largely attempts at dehumanizing ourselves. Here's a transcript of a piece of cardboard on my Grandfather's wall:

|                         |
|  WALLET    LIST         |
|            ------       |
|   EARS                  |
|             CANE        |
|  WATCH                  |
|                         |
|   SWIM                  |
|                         |

Even by typing it here, I am imposing a fair bit more order than the original had. The original was written with a Sharpie and, although it was very legible, the intent and structure were difficult to parse. "List" may or may not have been the title, or one of the items to remember. "Ears", I assume, meant his hearing aid. All the items seemed like a general checklist for leaving the house until "Swim", which makes the list seem very specific to a certain day, or day of the week. He has had this list posted by the door for quite some time.

Grandpa's list was not very Search Engine Optimized, but he didn't seem too bothered by the shortcoming.

This list has a few other interesting qualities that don't transfer well to the digital domain:

  1. It's taped to the door. Location and size mean a lot in the physical world.
  2. It has no definite structure. Anything could be added or crossed off the list with ease.
  3. Its author is obvious. The handwriting is an echo of the mind of the author.

My point is not that my Grandpa has a habit of making strange artifacts, my point is that we are more unique, creative and human when we're not forced to order our data along the way.

Our attempts to structure and control our data are effectively dehumanizing us. Yes, parsing the messy world of human language and thought is not a simple task for computer systems. It is much easier to build rules and frameworks and force humans to fit their ideas into them; to meet the computer halfway. This has a few effects:

  1. It allows the less creative minds to be efficiently less creative and feel organized.
  2. It encourages independent thinkers to constantly sabotage the system for entertainment.
  3. It erodes the capacity for beautiful, unstructured and creative enterprises.

I think we too often forget the distinction between the technologies that are built for enterprise and those built for personal use. This is probably because all of the tools are built for enterprise. All of them. The ones that aren't are garbage. Web 2.0 or 3.0 or whatever only really makes sense for businesses and consumers. Not for people.


Champions Of Order


Software Engineers assume that your data can be ordered logically. This assumption is built in to the file system. They assume that it's an easy task, but that it's best left to the user. Better than imposing an order that doesn't make sense for some users.

But the software engineers are wrong and here's why:

Case Study: Plain Text Library

This is the simplest of all systems to represent digitally. All the items are in the same format, and each is a distinct entity. So you start ordering your items by Author into a hierarchy. The index itself is also an item, but not like the other items, so you decide to put it at the top of the hierarchy.

It is simple and unambiguous. But, like all libraries, there will come a day when you want to extend it a bit. You might want to add a scientific paper, a scientific journal, a DVD box-set or an untitled and anonymous poem. You might want to separate fiction from non-fiction or keep a commentary together with its source. At every step, you must decide how to incorporate these new elements, and whatever decision you make must be applied unerringly in the future.

From the start, the hierarchy must be robust enough to handle any future additions. The end result: it never works. It's hard enough for libraries with full-time staff, training and documented processes to keep things in order. The average computer user has no chance of making a logical hierarchy that will make sense now or into the future.


UNIX and the Emerging Web


Reading "The Art Of UNIX Programming" again has got me thinking about the nature of the technological explosion that has happened since the quaint days of mainframes and text-based consoles.

The two ideas that these legendary computer scientists cared greatly about, and that seem to be dissolving today, are ownership and simple formats.


Web-based services require a certain level of trust: the user hands control of their data over to a service provider. In doing so, the user subtly loses control. We end up with bits of content all over the web that may or may not be accessible or editable by us.

There have been some recent developments toward building a unified "user space", like OpenID, but in my alternate history of the internet, user profiles would have expanded beyond the workstation into a worldwide user space. Up until now, the user space has been determined by email accounts, but with the shifting landscape of email services this was never really a practical solution.

In my alternate history, instead of email accounts, user profiles would be generated like IP addresses, and could be linked to email accounts or any other service. Maybe this just seems too big brother for the average user. It would enable much easier tracking of the web services and content that is linked to your profile. I think this would be incredibly convenient for managing your online history; the services you've used and the content you've generated.

Simple Formats

In the UNIX days, plain text was the format of choice for almost anything. Today, raw emails remain more-or-less human readable, but that's about it. Every new web service and application introduces their own formats and interfaces.

Maybe I'm just dreaming, but I think that with some work, text formats could work in the context of most popular web services. Twitter, for example, is simply a collection of text notes with the date as metadata. So why not integrate it with the desktop workflow?

The reason is that web browsers have been designed to be very isolated from the desktop, and the only way to break this isolation is with downloadable formats or task-specific applications that can bridge the gap. Again, email clients are the classic example of a task-specific application that grabs a stream of information with a secret handshake and delivers content to the user.

What I envision here is a two-way RSS folder using a universal keychain and managed with something like Git. But what this requires is that the web service and browser be given read/write access to the folder. I think this model is fairly incompatible with the current state of web technology.


In some ways, I think that the values of the UNIX crowd are coming full circle with emerging web trends. ReadWriteWeb has been blogging for years about the potential of a web that's as easy to write as it is to read. This post is a good summary of emerging trends like structured data. But I think that to really achieve this flexibility, internet and desktop architecture needs to adapt and build robust standards to allow it to happen.

The problem isn't access to services, it's the completely sandboxed environment of the web browser. If we truly want to see the potential of web applications and services, we need a way to let the web in.


Browser Review: Arora


The Browser War is an ongoing battle with no obvious winners, although there is a seemingly never ending stream of competitors.

The big players at the moment seem to be converging on their basic feature set while edging out their opponents in a certain area. Firefox has extensibility, Chrome has speed, Safari has stability, Opera has security, and IE has users (yes, IE bashing is mandatory in any browser debate!).

Beyond these, there is an interesting subset of purpose-built browsers: Flock does social browsing in an interesting way. SeaMonkey includes many tools specific to developers and advanced users. Dillo boasts an extremely small footprint and fast loading.


But the new browser I've been testing lately is Arora. It is still very immature, but I think some of the features are really well done. First of all, it looks really nice on Mac. This may seem like a silly comment, but most apps that have been developed with a "cross-platform" toolkit come out looking like cheap knockoffs with cartoon icons and ugly screen proportions. Even Chrome, with its oh-so-clever tab bar, and Firefox with its monster icons. This is fine for Linux, which doesn't have clearly established UI conventions, but for Mac development: don't break the rules unless you have good reason.

In this sense, Arora does it right. Not that they use Mac icons, but they are minimalistic and clear, and the layout doesn't add visual clutter or break established conventions. It doesn't shout "Look at this groundbreaking layout!", and that's a good thing as far as I'm concerned. It lets you focus on the task at hand.

I downloaded the 0.10.1 version, which had some serious bugs on the Mac: many SSL errors, no two-finger scrolling. But this is an alpha version so I gave it the benefit of the doubt. I installed the Qt SDK and built 0.10.2 from source and now it works great. Not really for the faint of heart, so hopefully there will be a Mac binary update soon.

Give me the Keyboard!

The first shock was that it was very keyboard friendly. This is like a +5 million points for me. The way it works is that when you're on a site, if you press the Command key all the links get letter popups beside them. You press the key and it goes to the link. How cool is that! Generally, all the links get letters based on the first letter in the link name, so if a site's top bar has links named "Home", "Reviews" and "Downloads", these will be mapped to the keys H, R and D. All other keyboard shortcuts work as you would expect on a Mac.

These guys have done their homework when it comes to keyboard accessibility. The only exception is using F11 for fullscreen; my MacBook has F11 mapped to Volume-DN and fn-F11 mapped to Exposé. Not a big deal, considering the wealth of things that they got right, even compared to the big browsers. For example, Cmd-L puts the cursor in the Location bar and Cmd-K jumps to the Search bar. In Safari, you have to Cmd-L to the location and then tab to the search for keyboard-only input. Yes, it might be a small thing, but it's good attention to detail.

So far, there's really nothing keeping me from using this as my main browser, which is an amazing feat considering how lightweight and early in development it is. Page loading is fast and application loading is the fastest of any full-featured browser I've used. It renders with qt-webkit, which was a bit rocky to begin with, but has come a long way in a short time.

And you gotta love the polar bear on the globe icon!


Command Line Google


Google is available for the command line!

Install this and the gdata library and you're good to go. This post is my first test of the blogger post command. I'm hoping that HTML publishes correctly. This is, of course, related to my previous post about command line productivity.

Now if I could install this on my Android phone...


New Templates!


If you've been paying attention, which you haven't, I've changed my site layout. Now it kind of looks like my site and my blog are kind of the same thing.

I did this by yanking the guts out of both and using inline HTML for everything. The template headers and layouts weren't very flexible in Google Sites or Blogger, so I finally scrapped them entirely and put my own header in the content section. I also found a little CSS hack to get rid of the Blogger nav bar. From what I've read it's not against the rules, but we'll see.

Also I've generated some nifty images for the headers by playing around with Gimp's IFS Fractal patch. What can't fractals do?

Applescript Mindbender


This is a crazy hack!!!

How to have the same file contain both a shell script and a compiled AppleScript

% /usr/bin/osacompile -o shebang.scpt
on run argv
    return argv
end run

% chmod +x shebang.scpt
% echo \#\!'/usr/bin/osascript' > shebang.scpt
% ./shebang.scpt a b c
a, b, c

This puts a compiled AppleScript in the resource fork and a shell script in the data fork... mind exploded!

Why The Command Line?



The software arms race is heating up. At the forefront are apps, which are small, single purpose programs that are very stylish, but generally have very little functionality. The reason I say that they have very little functionality is because they are built as small standalone, self-contained programs designed for a very specific purpose.

The iPhone needs multitasking because every little bit of functionality is wrapped in its own app. The average computer user has many windows open and many applications installed because we need many apps to complete one task.

In all these things, I believe that software developers have lost their way. The user is forced to multitask extensively and perform every task one step at a time.

The Killer App

For those that do not understand the UNIX command line, it looks like a deviously complex and foreign language. There are no visual clues about how to accomplish a certain task, no tooltips or interactive dialogs to walk you through.

What they don't see is thousands of purpose-built apps all designed to work seamlessly with one another and to be completely automated. Using a command line is not just a bunch of typing to achieve the same result as a mouse click; it is a bit of typing to achieve what a thousand mouse clicks could do.

The standard question I get when people see me working in a fullscreen terminal is "Wow, are you a hacker?" No, I'm not a hacker in the sense you mean. I'm not deciphering encrypted passwords and finding loopholes in government databases or anything like that. I am a hacker in the sense that I'm not constrained by the functionality offered by graphical interfaces. I am a hacker in the sense that if I have repetitive or menial tasks, I'd rather let the computer do the work for me.

Learning the basics of the command line takes some work. Learning how to use the powerful text editors like Vim or Emacs takes some work. Learning how and when to use the versatile processors like Sed and Awk takes work. Learning how and when to tie these commands together into shell scripts or functions takes some work. But in all these things, it is work that pays off in higher productivity and far fewer hours spent repeating tasks that are better automated.

But the true value of the command line comes when you can effectively monotask. When you can focus on the task at hand and automate all the housekeeping. And when I say "housekeeping" I probably mean a lot more than you think.

For example, here's my current workflow for blogging:

  1. blog why_the_command_line? >> opens a new file in my blog folder in the Vim text editor.
  2. Type post using vim and the simple Markdown formatting.
  3. %s/CLI/command line/g >> Replace all occurrences of 'CLI' with 'command line', cause I knew I'd be typing it a few times.
  4. "publish why..." >> run the file through the Markdown to HTML formatter and email it to my blogger page.

I'm using two simple scripts here, one called blog and another called publish, and for most workflows that's all it takes. I now have a local copy in the easy-to-read Markdown format and one on my blog in HTML format.

The value in this is partly the shortcuts, yes, but mostly the lack of context switching. There is no need for multitasking, because everything I need is a few keystrokes away, which allows me to focus. And that's a good thing.

Building Toolchains


A toolchain can be a very subtle thing.

We make them all the time, a lot of times without even realizing that we are doing it and very often without realizing that there might be a better way. In order to really use computers effectively, we must pay attention to the toolchain, or workflow that we are using.

A simple example could be task lists. Your toolchain could be as simple as: "Every todo list goes on a Sticky note". This works but it isn't terribly versatile. You have to manually clean them up when you are done with them, you can't access them from your phone or another computer, and there's no way of sorting them by date unless you sort them visually by placement.

This could be improved by using a spreadsheet or database like Excel or Bento. Then you have the option of adding priority, type and date fields and sorting them accordingly. You could then use Google Docs and make this spreadsheet accessible online. It's definitely better, but in my opinion it's a bit over the top just for a todo list.

My established toolchain is the task list in Google Calendar. It's simple; the only necessary info is a title which gets a checkbox beside it to check when the task is done. You can optionally add a due date, notes about the task and you can move them to another list or delete them when finished.

But the new toolchain that I'm playing around with is even simpler, it's simply a private Twitter feed. Instead of being limited to todo's, I post anything that I want to remember. I can delete the posts if it's a completed task or I can simply let it fade into the timeline. I can retweet if it's something important that might be getting stale and I can include links if it's related to something on the web.

This is a simple example, but I think it highlights a few important ideas. Anything you do with a computer can probably be done tens or hundreds of different ways. With that in mind, is your current toolchain the best for you? Some things to keep in mind when building a toolchain:

  • Is the content properly accessible (over time and space)?
  • Does the content need the draft/publish distinction?
  • Is there a single endpoint (don't repeat yourself!)?
  • Is there a concern about losing your data to a cloud service?
  • Do you need a revision history?
  • Are there other related tasks that can use the same toolchain?
  • Is the endpoint shared, collaborative or private?

Many applications, including most cloud services, manage the data for you. This can be convenient for simple tasks, but can be troublesome if you want to extend or customize your toolchain beyond what the service offers. It's wise to think for a moment whether such programs should be the single home to your content, or whether they should be an endpoint of a larger toolchain.

When I started using blogger, I used the online editor, publishing as I went. After a little while I thought, "Hmm, it would be nice to have a local copy of my blog." It is possible to export your entire blog, but this is not practical on a day to day basis. The better solution is to use the cloud service as an endpoint, not the only point. By working on a local copy of my blog and publishing it to Blogger, I'm able to do some things that I can't with the Blogger tools. I now use Markdown formatting to simplify markup, and I can archive this version with version control using Git.

It's not a complex toolchain, but it's nice to know that I have a platform neutral copy of my blog that I can reuse however I see fit.

Apple's Flash Killer

There has been a lot of buzz lately about Apple's sudden and powerful push for HTML 5 adoption in place of Flash content. I think that the real issue is lost in most of the discussion. Here's my take on why they're doing it.

1. The "Silver Bullet" Toolchain
Apple has been moving a lot of its development tools and platforms towards web-based models. Dashboard widgets are all built with HTML/CSS/JavaScript, and embedded iTunes rich content such as the iTunes LP format is essentially a thinly wrapped web page. The only thing standing in the way of the web-based design flow becoming the ubiquitous silver bullet is rich, interactive content. For that you need Flash, and Apple is not willing or able to integrate Flash development into Dashcode's toolchain.

2. Integrated Advertising
Flash content bogs down page loads, and advertisers have paid big money to produce those bandwidth hogs. The average user doesn't come across enough useful Flash content to warrant its use, especially on mobile devices. This is a big problem for advertisers who spend huge sums of money on dynamic content only to be blocked by a rapidly expanding portion of the market.

Apple's HTML 5 showcase is unambiguously directed towards advertisers. Apple is, in a sense, scratching their own itch here. They are at the forefront of "cool" advertising, and they don't want to rely on a third-party platform to deliver their slick visualizations.

3. Big Content
Apple has emerged as one of the major digital content providers in the world. They have done this by ensuring that content providers can deliver their content through Apple's platform in a safe and very profitable manner. Adobe has not offered content providers equal assurance with Flash content.

YouTube has already announced that they will be offering streaming movie rental options. I'm sure that Apple has plans to do similar things with streaming H.264 content, but they have no intention of doing it within a Flash container.


Thought Experiment: JavaScript POSIX Shell


Okay, this one is a bit of an odd idea. My thinking is that you build a shell that will work in a browser window. The 'filesystem' is created in the browser's localStorage. Only the basic IO system and OS facilities are hardwired, the rest is implemented through executable 'files' in the filesystem.

The general idea is to make a lightweight shell that is programmable and could do some really cool stuff.

echo "Hey, I'm in a browser terminal!" | twitter

I've already hacked together a demo just to see how all the parts would work. Here are my findings.

User Interface

A text area is editable or not, whereas a terminal window is only editable at the cursor point. I puzzled over this one for a bit, wondering how to mimic this behavior in a browser, and then I dropped it and just did what was convenient: one editable textarea for commands, one for output. This is what that looks like:

At first I thought this was kind of a lame hack, but as I use it a bit more it feels quite intuitive. It's like posting on Twitter; you type on top and the command gets pushed onto the stack. But, you might ask, what about programs that take over the full terminal window? Well, in this case you just pass the lower window to the program and keep the upper for commands to the program. In the Vim editor, and most others, you always have a status/command line used for issuing commands anyway, so why not keep it on top?

Calling Convention

This is the one place that I wanted to stay very UNIXy. The commands are called with arguments and have access to stdin, stdout and stderr, and everything is a 'file', meaning everything is a key/value pair in localStorage.


In my first draft, the files are just key/value pairs with the file name being the key and the text being the value. There are a few problems with this.

First, file permissions/types are not implemented, which is mostly a problem because there needs to be a mechanism to determine what are executable files. Remember: everything is a file; all the commands 'echo, emacs (!)' will be JavaScript strings stored in the browser. Any user can make their own executables.

Second, a file hierarchy would be nice. Timestamps would be nice.

I think the best solution is to format the 'name' string with all the metadata, which would get stripped when accessed by the Terminal. So a file like /bin/echo would become < name='echo' path='/bin/echo' mode='-r-x' > if we want to use XMLish formatting.
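Here's a rough, runnable sketch of that metadata-in-the-name idea. The helper names (`writeFile`, `readFile`) are my own invention, not part of the actual demo, and a plain object stands in for localStorage so the sketch also runs outside a browser:

```javascript
// Plain object stands in for localStorage outside the browser.
var fs = (typeof localStorage !== "undefined") ? localStorage : {};

// Pack the metadata into the key, XML-ish, as described above.
function writeFile(path, mode, text) {
    var name = path.split("/").pop(),
        key = "<name='" + name + "' path='" + path + "' mode='" + mode + "'>";
    fs[key] = text;
    return key;
}

// Look a file up by path, stripping the metadata back out.
function readFile(path) {
    for (var key in fs) {
        var m = key.match(/name='([^']*)' path='([^']*)' mode='([^']*)'/);
        if (m && m[2] === path) {
            return { name: m[1], path: m[2], mode: m[3], data: fs[key] };
        }
    }
    return null;
}
```

The Terminal would only ever see the stripped-out name; the mode field is what lets it decide which 'files' are executable.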


I really like this idea. When I first got it up and running I was a bit giddy. It is browser based, but I also want to make it very Internet based. I would like to make it possible to wire in any site that has a public API; Social Networking, Mail, RSS, the works.

Skratch Development Update

"Make one to throw away"

I forget who said this, but it has been good advice for my little web-based sketching app called Skratch. After the first version was up and running I took a step back and had a hard look at the overall structure.

Overall it wasn't too bad. It has been a work in progress to figure out how to modularize and encapsulate. In the alpha stages it was a bunch of global functions and variables going every which way. By version 0.1, I had realized that I needed to introduce some structure and so I grouped the drawing code into objects with attributes. Most of the event handling code was still in global functions. At this point I took a break and did some reading.

First I read Douglas Crockford's JavaScript: The Good Parts, and none of it really sunk in. It's an excellent book for the theory of how to use the language functions and how to avoid bad practices, but it doesn't really give the reader a sense of how they should write JavaScript. All of the examples are very short and specific to a certain language feature.

Then I read the jQuery source code and realized that I hadn't got it at all. It took me a while to really grasp what was going on here, and most of that time was spent trying to figure out what it was that I didn't understand. The thing that I didn't understand was simple but elusive: closure.

I have heard of closures as a programming concept a number of times but I never really grasped why they would be practically useful or how I would recognize one when I saw one. Reading the jQuery source, it took me a while to realize that this was a ~6000 line closure staring me right in the face. So I started coding to see how I could use an anonymous function to my benefit.

It's really quite simple: any vars, functions or objects declared inside a function are only accessible within that function. So if an anonymous function invokes itself, it's totally possible to not add to the global namespace at all. Nothing within that function is accessible unless it is attached somehow to the global namespace. But those functions that ARE attached to the global namespace can make use of all the private variables and state within the anonymous function.

This is the little testing function that I hacked together to prove this to myself:
(function () {
    var obj = {},
        privateData = 42,
        privateMethod = function (arg) {
            return "Private method: called with " + arg +
                   "\n" + "The holy grail: " + privateData + "  sees: " + obj.globalData;
        };
    // Privileged: public methods which rely on private members.
    obj.protectedMethod = function (arg) {
        return "Privileged method: called with " + arg + "\n" +
               privateMethod("really " + arg);
    };
    // Public: members that are publicly accessible, rewritable, etc...
    obj.globalData = "I am the viewable data of the global object.";
    obj.generateString = function () {
        return "Global method: " + obj.globalData + "\n" +
               obj.protectedMethod("do this");
    };
    genericObj = obj;
}());
Only in the last line do I "publish" the object I've created; everything else is encapsulated in the closure and remains private.

Once I grasped this concept, it was immediately apparent how it would be useful to my drawing app. I created an object, called "skratch" and attached only the functions and data that need to be accessible from outside to it; basically just the UI controls and a .setupCanvas method. The drawing engine and event handlers are now completely wrapped inside the closure.

This way of programming seemed kind of ugly to me until I looked at the DOM tree for my app. There was only one global object called "skratch" with a couple methods for triggering UI events and that was it.

So, I now know what closure is and what it can do for me and I think my scripts will be better for it.

Oh, and FYI, after reading the jQuery source, I couldn't help but include it in my app, even though all it does right now is make the toolbar go swoosh! Totally worth it!


Skratch - HTML5 Version


I've rewritten my drawing engine to run entirely in JavaScript and render to an HTML canvas. Since all the control logic was already sketched out and working in Quartz Composer, I simply had to translate it into JavaScript equivalents.

Instead of drawing with a colored image, this version draws individual pixels for every point along the path with a random scattering algorithm. It does this by computing a random angle and a random distance from the drawing position.
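That scattering step might look something like this. This is a sketch of the idea as described, not the actual engine code; the `scatter` name and its parameters are my own, and it assumes a canvas 2D context:

```javascript
// For each point along the path, draw `density` single pixels at a
// random angle and a random distance (up to `radius`) around the
// drawing position.
function scatter(ctx, x, y, radius, density) {
    for (var i = 0; i < density; i += 1) {
        var angle = Math.random() * 2 * Math.PI,
            dist = Math.random() * radius;
        ctx.fillRect(x + Math.cos(angle) * dist,
                     y + Math.sin(angle) * dist,
                     1, 1); // one 1x1 pixel speck
    }
}
```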

The controls are fairly rough at the moment. I think I'll probably scrap all the HTML inputs and integrate the controls into the drawing surface itself; but this works for testing purposes.

The background is drawn using tips from this great tutorial.

The code is in one HTML file, which you can download and run locally. Get it here, In the folder "Web Development" > Pencil_Drawing.html.


Skratch Rendering Engine



Skratch is a Quartz Composer based rendering engine for painting and drawing applications. The primary design considerations are speed, smooth dynamics, a simple interface and extensibility.


  • Quadratic Input Smoothing
  • 3 Mode, Fullscreen Color Chooser
  • Pressure Sensitive Brush
  • Selectable Brush Types and Repeat Rate
  • Transparent Drawing Layer
  • Modular Design


Skratch is designed to be highly modular in order to make modification, automation and extensibility simple.

Smoothing Functions

One of the primary design obstacles was the relatively slow sampling rate of the mouse or tablet (approx. 16.7 ms). This results in jagged lines at even moderate brush speeds. To overcome this limitation I have implemented a smoothing algorithm using a quadratic Bézier curve in JavaScript. The function takes three sampling points (in0, in1, in2) from the mouse. It then calculates three control points: one halfway between in0 and in1, one at in1, and one halfway between in1 and in2. The Bézier curve is then calculated based on these points, and the number of intermediate points is determined by the "Granularity" input. The smoothing function looks like this:

(((1-t)*(1-t)) * p0) + ((2 * (1-t)) * t * p1) + ((t*t) * p2)

Where t is the point to produce in the range 0.0 - 1.0 and p0, p1 and p2 are the control points.
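The smoothing step described above can be sketched in JavaScript like this. The function names are my own, not the actual Skratch source; `n` plays the role of the "Granularity" input:

```javascript
// Evaluate the quadratic Bézier formula above at n+1 values of t,
// where p0, p1, p2 are the control points ({x, y} objects).
function quadraticBezier(p0, p1, p2, n) {
    var points = [];
    for (var i = 0; i <= n; i += 1) {
        var t = i / n,
            a = (1 - t) * (1 - t), // weight of p0
            b = 2 * (1 - t) * t,   // weight of p1
            c = t * t;             // weight of p2
        points.push({ x: a * p0.x + b * p1.x + c * p2.x,
                      y: a * p0.y + b * p1.y + c * p2.y });
    }
    return points;
}

// Derive the control points from three raw mouse samples, as described:
// halfway between in0 and in1, at in1, halfway between in1 and in2.
function smooth(in0, in1, in2, n) {
    var mid = function (p, q) {
        return { x: (p.x + q.x) / 2, y: (p.y + q.y) / 2 };
    };
    return quadraticBezier(mid(in0, in1), in1, mid(in1, in2), n);
}
```

Using midpoints of the samples as the curve endpoints is what keeps consecutive segments joining smoothly as new samples arrive.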

On top of this rapid smoothing, there is also the option of low-speed interpolation which makes it much easier to draw smooth curves and lines by hand. This makes it easy to switch between quick sketching / hatching and smooth deliberate strokes.


The original goal was to develop a simple sketching program, but due to its modular nature, it has been quite simple to connect it to function generators and produce auto-drawings with interesting results. This modularity should make it quite simple to extend it beyond Quartz Composer in Xcode and produce various types of drawing programs.

Color Chooser

The color chooser that I've implemented is somewhat different from most. When you press the 'c' key you are presented with a full-screen CMYK palette to pick a color from. If you use the right arrow key, it will change the palette to a number of alternate palettes. There's also an RGB palette, as well as an option to pick the color right from the current drawing (in a slightly pixelated version). Further images or palettes could be added with minimal effort.


The drawing layer is transparent, allowing full color drawing and erasing and the potential of stacking layers with blending. The background is opaque.

Further Directions

  • Canvas Sizing (Zoom & Pan)
  • Custom Brushes & Brush Selector
  • HUD of all drawing state info


  • Guided Auto-Drawing


Copy From iPod - Mini App


I packaged the previous post into an app. All you have to do is select your iPod folder and iPod Copy will copy it all to a folder on your desktop.

Here's the link:

iPod Copy

It's a pretty trivial Automator script, but it got me thinking about interaction. In the previous post there were about 10 lines of navigation/folder creation and one payoff line. To my friend who had never used a terminal, the 10 lines were mystic voodoo, but they are really basic stuff if you've used the command line at all. And then I thought, "Whatever happened to Midnight Commander?"

Midnight Commander To The iPad

Midnight Commander enabled you to navigate visually in side-by-side views and run commands of your own or from a dropdown menu. All that painful navigation cured, but the command line was still one keystroke away. The problem was, as soon as we had visual interaction there was no going back. The average user cannot shift modes between the padded walls and unlimited undo of GUIs to the underground streetfighting of CLIs.

The announcement of the iPad is one more step into the walled garden of safe and fun computing for the masses, but it worries me that the average user knows less and less about how the technology actually functions. Even developers are becoming much more insulated from the workings of the machine. Like the Galactic Empire in Asimov's Foundation series, if we don't have knowledgeable people working at every level our systems will grow unmanageable and unmaintainable.


Copy From iPod

I've had a number of requests about how to copy music from an iPod back to your computer. It's fairly straightforward if you know how to get around in Terminal. If you don't, here's a walkthrough; if you do, you might just want to skip ahead to the copy command, since it's the only one doing anything special.
1. Open Terminal (Applications/Utilities/Terminal/)
2. Navigate to /Volumes/Your iPod/iPod_Control/Music/
 cd /Volumes/
 >>> Andrew Vanee’s iPod    BOOTCAMP    Backup    My Public Folder

 cd "Andrew Vanee’s iPod"/
 >>> Calendars Contacts  Desktop DB Desktop DF Notes    iPod_Control

 cd iPod_Control/
 >>> Artwork  Device  Music  iPodPrefs iTunes

 cd Music/
 >>> F00 F03 F06 F09 F12 F15 F18 F21 F24 F27 F30 F33 F36 F39 F42 F45 F48
 F01 F04 F07 F10 F13 F16 F19 F22 F25 F28 F31 F34 F37 F40 F43 F46 F49
 F02 F05 F08 F11 F14 F17 F20 F23 F26 F29 F32 F35 F38 F41 F44 F47

3. Make a folder on the desktop to copy to:
 mkdir ~/Desktop/ipod

4. Copy all files in folders, without Mac resource forks (-rX options)
 cp -rX ./* ~/Desktop/ipod/

5. Wait for a while. Terminal just sits there blankly, but you can go to the folder
in Finder and see the copying happening. Once you get a command prompt in 
Terminal, you're done.
Gotchas: Any folder name with spaces or weird characters needs to be surrounded by double-quotes (e.g. "Andrew Vanee's iPod"/). To do: on-the-fly file renaming; mdls -name kMDItemName will display the actual song name. Not sure what to do with that yet.


Arbor Redundancy Manager


** UPDATE! This code will potentially create false hard-links between files that are very similar. I suddenly have many iTunes album covers that are from the wrong artist. I'm not sure if this is a result of how cavalier Apple is about messing with low level UNIX conventions, or if my code is faulty. Beware! **

We all have redundant data on our computers, especially if we have multiple computers. My new tool is aimed at making backups for this kind of thing simple.

I'll take the simplest practical example: you have two computers and one backup hard drive, and you want full backup images of both systems. Even if you're just backing up the 'Documents' folder, there's a good chance the two trees share a lot of redundant files. What my little program does is check the contents of these files, and if two are the same, create a hard link between them.

So, if your backup folder looks like this:

Backup > System 1 > Documents > somefile.txt
Backup > System 2 > Documents > somefile_renamed.txt

It will look exactly the same afterwards, but if the contents are identical there will only be one copy of the file on disk.
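The hard-link behaviour is easy to see from Python. This is a toy demonstration, separate from the backup tool itself: two names end up sharing one inode.

```python
import os, tempfile

# Make two names for the same data, as the backup tool would
folder = tempfile.mkdtemp()
original = os.path.join(folder, "somefile.txt")
renamed = os.path.join(folder, "somefile_renamed.txt")

with open(original, "w") as f:
    f.write("same contents")

os.link(original, renamed)  # a hard link, not a copy

print(os.stat(original).st_ino == os.stat(renamed).st_ino)  # same inode
print(os.stat(renamed).st_nlink)  # both names count toward the link total
```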

I have dozens of backup CDs that contain a lot of the same information; now I don't need to sort through them and reorganize or delete duplicate files. I can leave them just as they are and any duplicate files will be linked under the hood. Here is the current state of the code (v0.02), discussion to follow:

#!/usr/bin/env python
import os, sys, hashlib

arg_error = False
if len(sys.argv) == 2:
    src = sys.argv[1]
    srcfolder = os.path.abspath(src)
    if not os.path.isdir(srcfolder):
        arg_error = True
else: arg_error = True

if arg_error:
    print "Usage: arbor [directory]"
    sys.exit(1)

backupfolder = os.path.join(srcfolder, ".arbor")
if not os.path.isdir(backupfolder):
    os.mkdir(backupfolder)

skipped_directories = [".Trash", ".arbor"]
skipped_files = [".DS_Store"]
size_index = {}
MAX_READ = 10485760

def addsha1file(filename, size):
    try:
        if size > MAX_READ:
            # Over 10mb: hash 5mb from the start and 5mb from the middle
            f = open(filename, 'r')
            data = f.read(MAX_READ / 2)
            f.seek(size / 2)
            data = data + f.read(MAX_READ / 2)
            f.close()
            sha1 = hashlib.sha1(data).hexdigest()
        else:
            sha1 = hashlib.sha1(open(filename, 'r').read()).hexdigest()
        backupfile = os.path.join(backupfolder, sha1)
        if os.path.exists(backupfile):
            # Duplicate contents: replace this file with a hard link
            os.unlink(filename)
            os.link(backupfile, filename)
        else:
            os.link(filename, backupfile)
    except:
        print "Unexpected error: ", sys.exc_info()[0], sys.exc_info()[1]

fcount = 0
for root, dirs, files in os.walk(srcfolder):
    for item in skipped_directories:
        if item in dirs:
            dirs.remove(item)
    for name in files:
        fcount += 1
        if fcount % 500 == 0:
            print fcount, " files scanned"
        if name in skipped_files:
            continue
        srcfile = os.path.join(root, name)
        size = os.stat(srcfile)[6]
        if size_index.has_key(size):
            if size_index[size] == '':
                # Files of this size have already been hashed
                addsha1file(srcfile, size)
            else:
                # First size collision: hash the stored file and this one
                addsha1file(size_index[size], size)
                addsha1file(srcfile, size)
                size_index[size] = ''
        else:
            size_index[size] = srcfile

Here are the added features of this version:

  • Folder to backup is passed as command-line argument
  • Backup files are placed at the top level of that folder, in the .arbor directory. These are just hard links, so they don't really add anything to the size of the directory.
  • Ability to skip named folders or files
  • Only calculates a checksum if two files are the same size.
  • Only calculates a partial checksum if a file is over 10 MB: it checks 5 MB from the start and 5 MB from the middle.
  • Prints a running tally of files checked (per 500 files)
  • Doesn't choke on errors: some files don't like to be stat'ed or unlinked (permissions issues).

I'm not sure about the partial checksum option, but the scan was really bogging down on larger inputs. It's not really practical to do a full SHA1 checksum on a bunch of large files, and I think it's safe to assume that two very large files are the same if the first 5 MB, the middle 5 MB and the overall size match exactly. Perhaps I will add a strict-checking option later for anyone highly concerned about data integrity. But the practical limitations are real: I'm 30,000 files into a scan of my ~200 GB backup folder and I certainly wouldn't have gotten that far without the file size limiting.
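Isolated from the rest of the tool, the partial-checksum idea looks like this. The `max_read` parameter stands in for the 10 MB threshold, made adjustable here only so the behaviour is cheap to verify:

```python
import hashlib

def partial_sha1(filename, size, max_read=10485760):
    # Small files are hashed whole; big ones are sampled: half of
    # max_read from the start plus half of max_read from the middle.
    with open(filename, 'rb') as f:
        if size <= max_read:
            return hashlib.sha1(f.read()).hexdigest()
        data = f.read(max_read // 2)
        f.seek(size // 2)
        data += f.read(max_read // 2)
        return hashlib.sha1(data).hexdigest()
```

Two large files that differ only outside the sampled regions will collide under this scheme, which is exactly the trade-off discussed above.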

Update: The scan was almost done when I wrote this. Here is the tail end of the log:

31000  files scanned
31500  files scanned
32000  files scanned
32500  files scanned

real 43m40.531s
user 7m25.593s
sys 3m35.540s

So 233 GB over 32,500 items took about 45 minutes to check, and it looks like I've saved about 4 GB. Upon further inspection, it seems that most media files store their metadata in the file contents, so otherwise-identical files end up with different checksums. Hmmm....


File Management Tool - Part 2


I have worked with my program a bit more and there are some interesting aspects of this kind of backup.

At the end of the backup all duplicate files point to the same inode, so the SHA1 version can be deleted.


inode   name
1299    workingdir/folder1/file1.txt
1299    workingdir/folder5/file1.txt
1299    workingdir/folderx/file1_renamed.txt
1299    backupdir/54817fa363dc294bc03e4a70f51f5411f4a0e9a9

All these files now point at the same inode, so the backup directory can be erased: no file has executive control over the inode, and all three remaining names would have to be deleted to finally get rid of inode 1299. Generally it seems that Mac programs save files under new inodes (TextEdit, ...), so editing any of the versions breaks the links. UNIXy programs respect the inode better: vim saves to the same inode, so editing any version edits every version.
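The two save styles can be reproduced in a few lines. "vim-style" and "TextEdit-style" here are my labels for in-place rewrite versus save-and-rename, not a claim about either program's exact internals:

```python
import os, tempfile

folder = tempfile.mkdtemp()
a = os.path.join(folder, "a.txt")
b = os.path.join(folder, "b.txt")

with open(a, "w") as f:
    f.write("v1")
os.link(a, b)                # a and b now share one inode

# vim-style save: rewrite the file in place, so the inode survives
with open(a, "w") as f:
    f.write("v2")
print(open(b).read())        # the edit is visible through b as well

# TextEdit-style safe save: write a new file, rename it over the old one
tmp = a + ".new"
with open(tmp, "w") as f:
    f.write("v3")
os.rename(tmp, a)            # a gets a fresh inode; the hard link is broken
print(open(b).read())        # b keeps the old inode and the old contents
```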

Removing the "backup" directory also helps Spotlight resolve the names and filetypes. Deleting that folder and running `mdimport ./workingdir` complained mightily but more or less re-indexed the folder. Here is a quick slice of the errors it produced. I'm not going to try to make sense of them, but I think they're interesting; maybe Spotlight always encounters these kinds of problems and just keeps quiet about them.

$ mdimport ./workingdir
font `F88' not found in document.
font `F82' not found in document.
font `F88' not found in document.
font `F82' not found in document.
font `F88' not found in document.
font `F82' not found in document.
font `F88' not found in document.
encountered unexpected symbol `c'.
encountered unexpected symbol `c'.
encountered unexpected symbol `c'.
encountered unexpected symbol `c'.
encountered unexpected symbol `c'.
encountered unexpected symbol `c'.
encountered unexpected symbol `c'.
choked on input: `144.255.258'.
choked on input: `630.3.9'.
choked on input: `681.906458.747'.
choked on input: `680.335458.626'.
choked on input: `682.932458.507'.
choked on input: `530.3354382.624'.
font `Fw' not found in document.
font `Fw8' not found in document.
encountered unexpected symbol `w6.8'.
encountered unexpected symbol `w0.5'.
font `Fw8' not found in document.
encountered unexpected symbol `w0.5'.
encountered unexpected symbol `w6.8'.
choked on input: `397.67.'.
choked on input: `370.5g'.
choked on input: `370.5g'.
choked on input: `D42.32 m
314.94 742.32 l
306.06 751.2 m
306.06 7...'.
choked on input: `67.l'.
choked on input: `67.l4'
failed to find start of cross-reference table.
missing or invalid cross-reference trailer.

To reiterate the funny trick my program is doing: it builds a tree of SHA1-named hard links from the source directory, and then you just delete the index it made and you're left with all the duplicates hard linked. I think that's pretty cool.


One stated aim of this backup tool was to preserve metadata. So far it preserves the time stamps and metadata of whichever copy it indexes first, and the filename of every file it indexes. I'm not sure how to implement more than this in a transparent way. As far as I can tell from the documentation, you can't give a single inode multiple access and modification times. And an external database of that kind of information would never get used.
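For what it's worth, such an external database would only amount to something like this hypothetical sidecar file; as noted, the problem is that nothing else would ever consult it:

```python
import json, os

def record_times(paths, dbfile="times.json"):
    # Hypothetical sidecar database: remember each path's own times,
    # since one shared inode can only carry a single set of them.
    times = {}
    for path in paths:
        st = os.stat(path)
        times[path] = {"atime": st.st_atime, "mtime": st.st_mtime}
    with open(dbfile, "w") as f:
        json.dump(times, f, indent=2)
    return times
```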

File Management Tool


I ended up sketching out the details of how my file managing tool will work, it's kind of like a virtual librarian that removes redundant files without deleting the file hierarchy. My method is a bit of a mashup of how other tools work, so I'll give credit where credit is due.

This is how it works so far:

  • All files in a tree have their SHA1 hash-value computed (Git)
  • A hard link is created in the backup folder with the SHA1 name (Time Machine) ...
  • unless: the file exists already, then it is hard-linked to the existing SHA1 (...)

There is no copying or moving of files, simply linking and unlinking, so 99.9999% of the time is spent computing the hash values of the files. Here's the Python version:

import os
import hashlib

backupfolder = os.path.abspath('./backup')
srcfolder = os.path.abspath('./working')
srcfile = ''
backupfile = ''

for root, dirs, files in os.walk(srcfolder):
    for name in files:
        if name == '.DS_Store':
            continue
        srcfile = os.path.join(root, name)
        sha1 = hashlib.sha1(open(srcfile, 'r').read()).hexdigest()
        backupfile = os.path.join(backupfolder, sha1)
        if os.path.exists(backupfile):
            # Contents already indexed: re-link this name to that copy
            os.unlink(srcfile)
            os.link(backupfile, srcfile)
        else:
            os.link(srcfile, backupfile)
        # print backupfile

This folder contains about 5 GB of info and I thought the SHA1 calculations might take a couple of weeks, but as it turns out, it only takes a couple of minutes. What you end up with is a backup folder that contains every unique file within the tree, named by its SHA1 tag, and a source folder that looks exactly as it did when you started, except every file is a hard link.

So, what are the benefits?

Filenames are not important

Because the SHA1 is computed from the contents of a file alone, filenames don't matter. This helps in two ways: if a file has been renamed in one tree yet remains physically the same, you store only one copy while both unique names are preserved; and more importantly, if you have two files in separate trees with the same name (ie. 'Picture 1.png') but different contents, you keep the naming yet still have different files.
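Both points boil down to one operation: group files by a hash of their contents and ignore the names entirely. A minimal sketch:

```python
import hashlib
import os

def group_by_content(paths):
    # Map SHA1-of-contents -> list of paths. Renamed duplicates land
    # in the same group; same-named but different files land apart.
    groups = {}
    for path in paths:
        with open(path, 'rb') as f:
            sha1 = hashlib.sha1(f.read()).hexdigest()
        groups.setdefault(sha1, []).append(path)
    return groups
```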

If you have trees of highly redundant data, this is the archive method for you. My test case was a folder of 15 direct copies of backup CDs I have made over the years, and I saved about 600 MB across 5 GB. The original file hierarchies look exactly the same as they did before running the backup.

What is wrong with it?

As it stands, it messes with Spotlight's and Finder's heads a little bit. Finder doesn't compute correct size values for the two folders. du prints the same usage whether I include both folders or one at a time, which is pretty clever: total: 5.1 GB, working: 5.1 GB, backup: 5.1 GB. Finder, on the other hand, reports total: 5.1 GB, working: 5.1 GB, backup: 4.22 GB.


Some very weird stuff happens with Spotlight.

A Spotlight search in the working directory mostly shows files from the backup directory, which isn't convenient. The files in the backup dir have no file extension, so they're essentially unopenable by Finder. Here's what I found using the command-line mdfind:

mdfind -onlyin ./working "current"
and so on ...

mdfind -onlyin ./backup "current"
nothing found

For some reason, searching the working directory finds the information yet always resolves the name of the file to a directory it's not supposed to be searching. And searching the backup directory doesn't even bother reading the files, because it assumes from their names that they are unreadable.

I'm starting to wish that Steve Jobs hadn't caved and given in to the file extension system.

Time Machine

Okay, it's useful but how is it similar to Time Machine? Time Machine creates a full copy of the tree when it first backs up the system. From then on it creates the full hierarchy of directories but all the files that haven't changed are hard links to the original backup. Each unique file is a new inode created in time, whereas in my system each unique file is a new inode created in space. All duplicates in time are flattened by Time Machine and all duplicates in space are flattened by my system.

Note: To copy folders from the command line and preserve as much metadata as possible, use `cp -Rp`.


File Backup And Synchronization

In my previous post I mentioned that I was looking for a backup/file synchronization tool.
I don't think Git is it, and neither is Dropbox. Both are useful in that they are format transparent, which most database software is not. But what they lack is a way to deal with a large variety of file and folder hierarchies and to seamlessly compress them without losing transparency and semantic meaning.
So here is my list of requirements from a backup tool:
  1. Preserves any time-stamp information, even conflicting
  2. Distributed (decentralized)
  3. Minimizes redundant data
  4. Preserves hierarchies for semantic meaning
  5. Hides hierarchy clutter
  6. Preserves every bit of metadata, even if it's not explicit
  7. Accessible and platform neutral
  8. Makes data integrity paramount
It may seem like some of these requirements conflict with each other, but I will try to explain what I mean. I have loaded four of my backup CDs onto my laptop. I know there are duplicate files, and I know there are time-stamps that disagree with one another.
I want to be able to view these files in a number of ways:
  • In their original on-disk hierarchy.
  • By file type, date, tags or physical description.
And I want to be able to synchronize all or part of these folders between machines, in addition to making zip/tar archives of them on a backup machine.
Any suggestions, or shall I start coding?


Git And The Future Of The Internet.

I've recently taken a detour into philosophizing about where technology is going. What does the future look like and what does it mean for humanity, life and the current business models as we know them?

It all started with a bit of research into Linus Torvalds' latest project, Git. I've been thinking about trying some kind of content management system for personal use. I've looked at a lot of personal database type stuff (Bento, FileMaker, MySQL, ...) and they just seem like format-specific black holes to drop your content into. I'm still not sure Git is right for what I'm thinking, but I watched Linus' Google tech talk followed by Kevin Kelly's TED talk and had a vision of a web that is so much more than what it is right now.

They're both pretty long, but I've had a bit of time on my hands lately... Linus brings up two important points in his talk: one is the notion of working in a "network of trust" and the other is the sacredness of one's own data. Both of these are extremely important and often lacking components in the emerging technologies of our day. The network of trust is the only way to do collaborative work on open source development right now.

I think this is hitting a critical mass and will soon be the only way to do any kind of work. Monolithic organizations cannot keep up with the changing landscape of information growth. Git is a very interesting project because it takes this model and implements it in a very practical way. It employs a lot of very technical algorithms to allow software projects to grow very organically in a social environment. A lot of the metaphors that surround software development are hard, physical metaphors like construction, building and engineering, but the emerging metaphors are about growth, evolution and adaptation to environment. 

The benefits of collaborative networked projects are obvious, but the sacredness of one's data is a more veiled concept. Linus outlines the use of the SHA1 algorithm as a means to ensure that the entire history of a project, or set of data, can be verified to be accurate and traceable throughout its lifespan. This has obvious benefits when dealing with buggy network connections or failing hard drives, but it's more interesting to me in its wider application.
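Concretely, Git names every object by the SHA1 of its contents plus a short header, which is what makes a history self-verifying: change one byte and every hash built on top of it changes. The blob case takes a few lines to reproduce:

```python
import hashlib

def git_blob_sha1(data):
    # Git hashes "blob <length>\0" + contents, not the raw bytes alone
    header = ("blob %d\0" % len(data)).encode()
    return hashlib.sha1(header + data).hexdigest()

# The same value `git hash-object` reports for a file containing "hello\n"
print(git_blob_sha1(b"hello\n"))
```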

Where's My Information?

As a person who has used a computer for a number of years, I'm already seeing the breakdown of continuity in my archived information. As data gets moved around, archived to CD-ROM, uploaded to Google Docs, downloaded to PDFs and transferred between operating systems, it all ends up in a soup of data without context or history. I have no idea if the timestamps are accurate, or what the context and related content might be. As soon as you add cloud computing to the mix, the problems amplify greatly.

This very blog post is being submitted to the vast expanse of content controlled and managed by the cloud. I have no simple way of traversing the internet and picking up all the odds and ends that I have put there.
This is the real direction of Git, I think, and I want to figure out how to use it for more than just source code management, because I think it could change the way the internet works. What if this blog was simply a mirror of the "Blog" folder on my hard drive, which was mirrored on every machine I use and was also shareable to other collaborators who mirrored their own unique versions? And what if my photo pages on Flickr and Facebook were simply mirrors of a folder called "Published Photos" on my hard drive, which were mirrors of... and so on.

Vapor Trails

The fundamental problem of cloud computing is the owner's right to content and tracking. This is generally possible with today's technology, but never practical. I have 65 documents in Google Docs at the moment, and I could download all of them in one go as plain text files, but all the metadata would be garbage, and I couldn't easily merge them with the existing contents of my hard drive. Sure, I could spend a bit of time diff-ing them with my files and organizing them into logical places, but imagine if I was talking about the entire contents of my home directory. A quick du | wc -l shows 5,627 entries in my home directory, and I don't even have my music collection on this computer! Yes, the data is basically safe in the cloud, but what if I want to take it with me or move it elsewhere? What if I want to host this blog from my own server; how would I transfer it? The current cloud model only takes uploading and viewing seriously and neglects personal ownership rights. Google Docs has special code written for exporting; Blogger doesn't, Facebook and Flickr don't, YouTube doesn't.

They are all greedy information gathering tools. They are only concerned with gathering your information and storing it on their sites. There are "sync" tools for most platforms, but their only intent is to gather your content with more ease and transparency.

Git looks promising in that it allows you to publish your information, yet still control the source of it.


Assembly Language For Mac

I'm away from my Linux box and want to do some assembly programming. Mac OS X installs GCC with the Developer Tools, but there are enough differences that I haven't bothered to work through them until now. Here's a decent tutorial, although it focuses on PPC assembly and I'm using an Intel Mac. The thing that frightened me about the Mac assembler was the default output of gcc -S: there are some strange optimizations and flags in the resulting assembly code. The key, as the tutorial points out, is in the compiler options. Here's what I used on the ubiquitous "Hello World" program:
gcc -S -fno-PIC -O2 -Wall -o hello.s hello.c
And here's the assembly code it spit out:
    .cstring
LC0:
    .ascii "Hello World!%d\12\0"
    .text
    .align 4,0x90
.globl _main
_main:
    pushl   %ebp
    movl    %esp, %ebp
    subl    $24, %esp
    movl    $12, 4(%esp)
    movl    $LC0, (%esp)
    call    _printf
    xorl    %eax, %eax
    leave
    ret
    .subsections_via_symbols
This is more familiar territory, the only differences being the .cstring directive instead of .section .data, the leading underscore on printf, and the .subsections_via_symbols directive. The general naming of sections is outlined in the Mac Assembler Reference, and the .subsections_via_symbols explanation is interesting. I'm already used to using many labels in my code; does this mean that sections not "called" by any other code get ripped out? I tested this in the previous example by adding a second call to _printf in a labelled section, and the code worked just fine. It seems that labels don't count; a block has to be a declared section, via .globl, .section or whatever. That seems fair, as I haven't yet made a habit of jumping to sections that are supposed to flow naturally into other sections, but maybe there is some instance where this is a useful optimization. I will be looking into Position Independent Code (PIC) a bit more; it seems similar in theory to how the latest Linux kernel runs code at randomized memory locations to prevent hardcoded attacks, but I don't know if that's the extent of it.


RPN Calculator - v0.02

My calculator code was quite easily polished up. Here's the revised code, which stacks operands properly and supports the main arithmetic operators +, -, *, /. If you flush the stack completely, you get a "nan" warning, which seems reasonable. Here's the code: (gas, x86)
.section .data
expr_length:    .int 128
ADD:            .ascii "+"
SUB:            .ascii "-"
MUL:            .ascii "*"
DIV:            .ascii "/"
null:           .ascii "\0"
disp_float:     .ascii "%f\n\n\0"
.section .bss
    .lcomm expr, 128
.section .text
.globl main
main:

1:
    leal    null, %esi          # Clear the expr buffer
    leal    expr, %edi
    movl    expr_length, %ecx
    lodsb                       # AL = '\0', the fill byte
    rep     stosb

    pushl   stdin               # Read an expression
    pushl   $64
    pushl   $expr
    call    fgets
    addl    $12, %esp

    movb    ADD, %ah            # Test for operators
    movb    expr, %bh
    cmp     %ah, %bh
    je      addFloat
    movb    SUB, %ah
    cmp     %ah, %bh
    je      subFloat
    movb    MUL, %ah
    cmp     %ah, %bh
    je      mulFloat
    movb    DIV, %ah
    cmp     %ah, %bh
    je      divFloat

    pushl   $expr               # Must be a number: push it onto the FPU stack
    call    atof
    addl    $4, %esp
    jmp     1b

addFloat:
    faddp                       # st(1) + st(0); result stays stacked
    subl    $8, %esp
    fstl    (%esp)
    jmp     disp_answer

subFloat:
    fsubp                       # st(1) - st(0)
    subl    $8, %esp
    fstl    (%esp)
    jmp     disp_answer

mulFloat:
    fmulp                       # st(1) * st(0)
    subl    $8, %esp
    fstl    (%esp)
    jmp     disp_answer

divFloat:
    fdivp                       # st(1) / st(0)
    subl    $8, %esp
    fstl    (%esp)

disp_answer:                    # Print the result, loop for more input
    pushl   $disp_float
    call    printf
    addl    $12, %esp
    jmp     1b

    movl    $1, %eax            # exit(0)
    movl    $0, %ebx
    int     $0x80
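For readers who don't speak x86, the control flow above boils down to this small loop. This is a Python model of the same stack logic, not a line-by-line translation:

```python
def rpn(tokens):
    # Same logic as the assembly: numbers are pushed, an operator pops
    # two operands, applies itself, and pushes the result back.
    stack = []
    ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
           '*': lambda a, b: a * b, '/': lambda a, b: a / b}
    for tok in tokens:
        if tok in ops:
            b = stack.pop()
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(float(tok))
    return stack[-1]

print(rpn(["3", "4", "+", "2", "*"]))  # (3 + 4) * 2
```

The assembly version works interactively and prints after every operator; the model just returns the final top of the stack.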

8 Geek Tools For Your Mac

Here's a quick roundup of the applications I have found most useful since switching to Mac (from Linux, although I still ssh into my Arch Linux box regularly). Most of these are included but some need the Developer Tools installed and really, even if you're not using Xcode, you should install the Developer Tools and Optional Installs. You want them.
Calculator

Big deal, it comes with a calculator. But this calculator has some barely hidden super powers lurking in the menu options: paper tape to show your calculation history, scientific and programmer modes, an RPN mode, and a whole load of unit conversions. When you start this calculator it gives the impression of being a $10 Toys-R-Us thing, but there's way more than meets the eye.

Grapher

The one thing Calculator doesn't do is graphs, but if you have the Developer Tools installed, you have a program that will do far more than your average graphing calculator. And with style. Located at /Applications/Utilities/Grapher, this app is a graphing machine! From simple parametric equations to complex 3-D differential equations, this thing will graph it. The equation templates and examples are quite nice too if you're not sure where and how to begin.

Spotlight

I could recommend Quicksilver, which does way more than Spotlight could dream of, but when it comes right down to it I only really used Quicksilver for application and file launching anyway. Spotlight does this really well. There are no applications, other than droplets (see Automator), in my dock anymore. Cmd-Space your way to a clean dock.

Icon Composer

I don't know why, but I like to create my own icons for the stuff in my dock. Maybe I'm alone in this. But if you want to give it a go, here's how I do it:
1. Draw the icon using GIMP / Photoshop, at 512 x 512px (transparency and gloss are your friends!)
2. Drop the saved png into Icon Composer (/Developer/Applications/Utilities/)
3. Save the result as an icns file.
4. Drop the icns file onto icns2icon, and the file will become the icon...
5. Cmd-i (Get Info) the icns file and whatever file/folder you want to apply the icon to.
6. Select the source icon in the "Get Info" window and Cmd-C, select the destination icon and Cmd-V.

Automator

If you have a repetitive task, there's probably an Automator script to do it. There are only three that I use regularly, and these are all in my dock as droplet-style applications. They are:
Renamer: This is just the "Rename Finder Items" plugin, with "Show this action when workflow runs" checked so that you can choose the options whenever you run it.
Comment: The "Set Spotlight Comments" plugin, again with "Show..when run" checked so it's completely generic. Once in a while I get in a phase where I feel the need to tag all my files.
Desktop Alias: Rather than store my current project folders on my desktop, I just drop them on this (the "New Aliases" patch) and it creates a desktop shortcut.
I have done some more elaborate Automator scripts, but these are the ones that are general enough for daily use.

Megazoomer

Okay, this one isn't included with your Mac, but it really should be. You have to download Megazoomer as well as SIMBL. What this does is add a "Mega Zoom" option to every app so you can fullscreen anything. As you can see from the screenshot, I'm typing this with vim on a fullscreen terminal with slight transparency (just enough to read a website in the background). This screenshot is not cropped at all. It's kind of like a bit of the 'ratpoison' window manager for your Mac. This is true fullscreen: no dock, no menu bar, no distractions...

Quartz Composer

This is an amazing bit of software located at /Developer/Applications/Quartz Composer. See the examples at /Developer/Examples/Quartz Composer/Compositions/Conceptual for some idea of what it can do. This can create some really cool 2D / 3D visuals. Also see for some cool visual designs using Quartz Composer.

Terminal

Finally, I must recommend that you pimp your Terminal. If you're going to have any geek cred at all, you need a sweet terminal. Here are my preferences; mix to your own tastes. Background color: 91% Black, 96% Opacity. Font: Bitstream Vera Sans Mono 13pt, greyish-green, antialiased. vimrc: ...maybe I'll get into settings in another post. I have done quite a bit of monospace font testing and this is my favorite; very close competitors are BPmono, Terminus and Droid Sans Mono.

And the next step, after you get your Terminal looking really cool, is to learn how and why to use it. Happy Hacking!