babak farhang - 2008-09-15

Hi everyone,

I've been banging out the code for Skwish for a few weeks now. The skeleton has a little meat on it now, but not so much that it can't easily change direction. So now seemed like a good time to release: there's enough code there to show it rather than tell it.

But let me tell you why I started this project. My primary interest is in search (indexing) of both structured (e.g. RDF) and unstructured (e.g. full text) data. This involves processing and indexing a lot of source files, and you typically need to store these files *somewhere*. Usually that *somewhere* ends up being a deep, heavily populated directory structure on the file system. But this directory structure approach doesn't usually scale well, and it becomes especially annoying when the blobs (files) you're storing have no inherent hierarchical structure. Worse, when the blobs are small and many, and access is heavy, all that file I/O begins to take a toll.

A well-known, simple solution to this is to throw all the blobs into a single file and maintain their offset boundaries in another file. This way, you keep only a few files open and you let the file system and the underlying device controller do their work, e.g. paging data in and out of a block device. And if you append related blobs in sequence, you get better locality of reference and hence less paging.
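
To make the two-file idea concrete, here is a minimal sketch in Java. It is not Skwish's actual API: the class name BlobFileSketch, its methods, and the choice of storing one 8-byte end offset per blob are all assumptions made up for illustration. A blob's id is simply its position in the offset file, and its start is the previous blob's end.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical illustration of the two-file layout; not the Skwish API. */
final class BlobFileSketch implements AutoCloseable {
    private final RandomAccessFile blobs;   // blob contents, stored back to back
    private final RandomAccessFile offsets; // one 8-byte end offset per blob

    BlobFileSketch(Path blobPath, Path offsetPath) throws IOException {
        blobs = new RandomAccessFile(blobPath.toFile(), "rw");
        offsets = new RandomAccessFile(offsetPath.toFile(), "rw");
    }

    /** Number of blobs stored so far. */
    long count() throws IOException {
        return offsets.length() / Long.BYTES;
    }

    /** Appends a blob and returns its id (its index in the offset file). */
    long append(byte[] data) throws IOException {
        long id = count();
        blobs.seek(blobs.length());
        blobs.write(data);
        offsets.seek(offsets.length());
        offsets.writeLong(blobs.length()); // record where this blob ends
        return id;
    }

    /** Reads the blob with the given id back into memory. */
    byte[] read(long id) throws IOException {
        if (id < 0 || id >= count()) {
            throw new IndexOutOfBoundsException("no blob with id " + id);
        }
        long start = 0; // the first blob starts at the beginning of the file
        if (id > 0) {
            offsets.seek((id - 1) * Long.BYTES);
            start = offsets.readLong(); // previous blob's end is our start
        }
        offsets.seek(id * Long.BYTES);
        long end = offsets.readLong();
        byte[] data = new byte[(int) (end - start)];
        blobs.seek(start);
        blobs.readFully(data);
        return data;
    }

    @Override
    public void close() throws IOException {
        blobs.close();
        offsets.close();
    }

    public static void main(String[] args) throws IOException {
        Path blobPath = Files.createTempFile("blobs", ".dat");
        Path offsetPath = Files.createTempFile("offsets", ".dat");
        try (BlobFileSketch store = new BlobFileSketch(blobPath, offsetPath)) {
            long first = store.append("hello".getBytes(StandardCharsets.UTF_8));
            long second = store.append("world!".getBytes(StandardCharsets.UTF_8));
            System.out.println(first + ": " + new String(store.read(first), StandardCharsets.UTF_8));
            System.out.println(second + ": " + new String(store.read(second), StandardCharsets.UTF_8));
        }
    }
}
```

However many blobs you append, only two files stay open, and blobs appended one after another sit next to each other on disk. The sketch leaves out everything a real store has to worry about (deletion, concurrent access, crash safety, blobs larger than 2 GB).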

So this blob storage has been a recurring problem for me, and I have come to appreciate that blob storage is (and should be) completely orthogonal to indexing. (It's a good thing that Lucene, for example, does not dictate where and how the source documents are stored.)

The Skwish library, then, is an attempt at a clean and simple store on which some other index can be built.

Hope you try Skwish and find it worthwhile! I know I'll be using it, and I hope you put it to use, too. And hopefully you'll join me in developing this promising little tool.