[sleuthkit-developers] Further work with io-subsystems: compressed image format
From: Michael C. <mic...@ne...> - 2004-02-08 01:44:41
OK, I'll try sending this again; the first time it was blocked by the list server for being too big, so I compressed the patch. Hopefully this is OK. (There are also some small changes since the original patch, namely that decompress now works with sgzip.)

Hi list members,

Thank you for the positive response to the recently proposed IO subsystem patch for the Sleuth Kit. Although discussions are still underway as to how exactly this should be integrated into the sk (the way it is done now may not fit very well in the overall architecture), I have been experimenting with adding more subsystems besides the existing "advanced" subsystem. The latest addition is an sgzip io subsystem.

Background:

In today's forensic work, one regularly needs to work with very large hard disks; many new systems are sold with 80+GB drives, so even the most trivial forensic analysis workstation needs to be able to accommodate huge dd images. This is obviously not practical. Secondly, storage and archiving of such images is difficult, which is why most people compress their images for storage. Currently people need to uncompress their images before the Sleuth Kit can deal with them, which is a major pain in the proverbial.

Solution:

The previous io-subsystem patch allows the easy integration of different io decoders for dealing with image files of various formats. This current version (it's a cumulative patch, replacing the previous patch as well) implements an interface to the sgzip library (also supplied in the patch). sgzip (seekable gzip) is a file format that allows fast seeking within compressed images. The problem with a regular gzip file is that one cannot seek in it in a reasonable time (typically a seek involves decompressing the whole file). sgzip allows very fast seek times at a very small cost in compression (up to about 2%). Interested developers can read the details of the sgzip file format in sgzlib.h.

Basically, after applying the patch to sleuthkit-1.67 and compiling normally, you will get a new binary in bin/ called sgzip. It is very similar to gzip, and you use it like this:

bash$ sgzip honeypot.hda8.dd

This will create a file called honeypot.hda8.dd.sgz. Unlike gzip it does not unlink the original file, but it will overwrite an existing honeypot.hda8.dd.sgz. (You can also pipe into sgzip, just like gzip. This is useful for grabbing images from netcat straight into sgzip.)

The file size of the new compressed image is a little larger than the corresponding gzip file, because it carries more indexes and the compression algorithm is suboptimal when the data is broken across several blocks, but the difference is too small to care about.

You can use the Sleuth Kit to work directly with this new compressed file by calling on the right subsystem:

bash$ fls -r -i sgzip -f linux-ext2 honeypot.hda8.dd.sgz

That's it. Complex, hey? The sgzip subsystem also takes an offset argument, so we can compress a whole hdd image and then work with the individual partitions.

Performance:

For detailed documentation of the sgzlib implementation, read sgzlib.h and the source code. Suffice it to say that sgzip works by breaking the uncompressed file into blocks and compressing each block separately. When we need to uncompress a random piece of data, we find the right block and decompress only that block. Hence, if we want to read very small runs of data, we are better off with smaller blocks, so that we do not decompress data we don't need. A rough sketch of the lookup idea is below.
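To make the block-index idea concrete, here is a minimal C sketch of a random read over a block-compressed image. This is only an illustration of the technique, not the actual sgzlib API: the struct, constant and function names are made up, and the real on-disk layout and interface are documented in sgzlib.h.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCKSIZE 102400          /* uncompressed bytes per block,
                                     e.g. what -B 100000 would give */

struct sgz_index {
    uint64_t *block_offs;         /* file offset of each compressed block */
    uint32_t  nblocks;
};

/* Caller-supplied: decompress the block stored at file offset
   file_off into out (BLOCKSIZE bytes); return 0 on success. */
typedef int (*decomp_fn)(uint64_t file_off, unsigned char *out);

/* Random read without decompressing the whole image: map the
   uncompressed offset to a block number, decompress only the
   blocks that overlap the request, and copy out the slice. */
int sgz_pread(const struct sgz_index *idx, decomp_fn decomp,
              unsigned char *buf, size_t len, uint64_t off)
{
    unsigned char block[BLOCKSIZE];
    size_t done = 0;

    while (done < len) {
        uint64_t pos   = off + done;
        uint64_t blk   = pos / BLOCKSIZE;   /* which block holds pos */
        size_t   skip  = pos % BLOCKSIZE;   /* offset within that block */
        size_t   chunk = BLOCKSIZE - skip;  /* bytes available there */

        if (blk >= idx->nblocks)
            return -1;                      /* read past end of image */
        if (chunk > len - done)
            chunk = len - done;
        if (decomp(idx->block_offs[blk], block) != 0)
            return -1;
        memcpy(buf + done, block + skip, chunk);
        done += chunk;
    }
    return 0;
}

The per-read cost is one block decompression per touched block, which is why small random reads favour small blocks while compression ratio favours large ones.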
The user controls the block size when compressing the file with the -B argument to sgzip. Smaller blocks mean a faster Sleuth Kit but less efficient compression. For example, I took the honeypot.hda8.dd.sgz produced above (btw, this is a disk from the honeynet forensic challenge) and did some benchmarking.

Timing the uncompressed version (filesize=272,465,920=272MB):

bash$ time fls -r -f linux-ext2 honeypot.hda8.dd > /dev/null
real 0m0.050s
user 0m0.010s
sys 0m0.040s

Hardly any time at all. Now I compressed the file with a blocksize of 100kb, using -B 100000 as the argument to sgzip (filesize=25,237,474=25MB):

bash$ time fls -r -i sgzip -f linux-ext2 honeypot.hda8.dd.sgz > /dev/null
real 0m12.600s
user 0m9.590s
sys 0m2.560s

And with the default blocksize (which is 512kb) (filesize=25,008,704=25MB):

real 1m32.119s
user 1m10.190s
sys 0m12.870s

Just for comparison, the size of a normal gzip file is 24,968,226 bytes. So the sgzip file with a 100kb blocksize is 1.07% bigger, and the speed is still acceptable and about 8-10 times faster than with the default 512kb blocksize. This is probably because fls reads lots of very small runs of data scattered all over the whole disk. Just to be ridiculous, I repeated the test with a block size of 10kb:

real 0m1.421s
user 0m1.310s
sys 0m0.000s

However, the file size has now dramatically increased to 27,435,391 bytes (27MB), which is about 10% bigger than pure gzip. That may be OK in some circumstances, but remember that this particular hdd is mostly empty, so it compresses very well. I expect the size expansion to be more noticeable on a fuller disk.

Future developments:

- I am planning to write a python module for sgzip (probably just using swig to access the c library; see the sketch below). If people are interested, I might look at how to make perl modules with swig. (I don't do much perl nowadays, since flag was rewritten in python.)
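For anyone curious what the swig route would look like, here is a rough sketch of an interface file. The declarations are hypothetical placeholders, since the real prototypes would be lifted straight out of sgzlib.h:

/* sgzlib.i -- sketch of a SWIG interface for the planned python
   module.  The function names below are made-up stand-ins for
   whatever sgzlib.h actually exports. */
%module sgzlib
%{
#include "sgzlib.h"
%}

/* Hypothetical API: open an .sgz image, read at an arbitrary
   uncompressed offset, close. */
extern void *sgz_open(const char *filename);
extern int   sgz_read_random(void *handle, char *buf, int len,
                             long long offset);
extern void  sgz_close(void *handle);

Running "swig -python sgzlib.i" then generates the wrapper C file to compile and link against the library, and the same interface file would feed "swig -perl" if anyone wants a perl module.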