[sleuthkit-developers] Further work with io-subsystems: compressed image format
From: Michael C. <mic...@ne...> - 2004-02-08 01:44:41
OK, I'll try sending this again; the first time it was blocked by the list server for being too big, so I compressed the patch. Hopefully this is OK. (There are also some small changes since the original patch, namely that decompress now works with sgzip.)

Hi list members,

Thank you for the positive response to the recently proposed IO subsystem patch for the Sleuth Kit. Although discussions are still underway as to how exactly this should be integrated into the sk (the way it is done now may not fit very well in the overall architecture), I have been experimenting with adding more subsystems besides the existing "advanced" subsystem. The latest addition is an sgzip io subsystem.

Background:

In today's forensic work, one regularly needs to work with very large hard disks; many new systems are sold with 80+GB drives, so even the most trivial forensic analysis workstation needs to be able to accommodate huge dd images. This is obviously not practical. Secondly, storage and archiving of such images is difficult, which is why most people compress their images for storage. Currently people need to uncompress their images before the Sleuth Kit can deal with them, which is a major pain in the proverbial.

Solution:

The previous io-subsystem patch allows the easy integration of different io decoders for dealing with image files of various formats. This current version (it's a cumulative patch, replacing the previous patch as well) implements an interface to the sgzip library (also supplied in the patch). sgzip (seekable gzip) is a file format that allows fast seeking within compressed images. The problem with a regular gzip file is that one cannot seek in it in a reasonable time (typically a seek involves decompressing the whole file). sgzip allows very fast seek times at a very small cost in compression (up to about 2%). Interested developers can read the details of the sgzip file format in sgzlib.h.

Basically, after applying the patch to sleuthkit-1.67 and compiling normally, you will get a new binary in bin/ called sgzip. It is very similar to gzip, and you use it like this:

bash$ sgzip honeypot.hda8.dd

This will create a file called honeypot.hda8.dd.sgz. Unlike gzip it does not unlink the original file, but it will overwrite an existing honeypot.hda8.dd.sgz. (You can also pipe into sgzip, just like gzip. This is useful for grabbing images from netcat straight into sgzip.)

The file size of the new compressed image is a little larger than the corresponding gzip file, because it carries more indexes and the compression algorithm is suboptimal when the data is broken across several blocks, but the difference is too small to care about.

You can use the Sleuth Kit to work directly with this new compressed file by calling on the right subsystem:

bash$ fls -r -i sgzip -f linux-ext2 honeypot.hda8.dd.sgz

That's it. Complex, hey? The sgzip subsystem also takes an offset argument, so we can compress a whole hdd image and then work with the individual partitions.

Performance:

For detailed documentation of the sgzlib implementation, read sgzlib.h and the source code. Suffice it to say that sgzip works by breaking the uncompressed file into blocks and compressing each block separately. When we need to uncompress a random piece of data, we find the right block and decompress only that block. Hence, if we want to read very small runs of data, we are better off with smaller blocks, so that we do not decompress data we don't need. A rough sketch of the lookup idea is below.
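To make the block-index idea concrete, here is a minimal C sketch of a random read over a block-compressed image. This is only an illustration of the technique, not the actual sgzlib API: the struct, constant and function names are made up, and the real on-disk layout and interface are documented in sgzlib.h.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define BLOCKSIZE 102400          /* uncompressed bytes per block,
                                     e.g. what -B 100000 would give */

struct sgz_index {
    uint64_t *block_offs;         /* file offset of each compressed block */
    uint32_t  nblocks;
};

/* Caller-supplied: decompress the block stored at file offset
   file_off into out (BLOCKSIZE bytes); return 0 on success. */
typedef int (*decomp_fn)(uint64_t file_off, unsigned char *out);

/* Random read without decompressing the whole image: map the
   uncompressed offset to a block number, decompress only the
   blocks that overlap the request, and copy out the slice. */
int sgz_pread(const struct sgz_index *idx, decomp_fn decomp,
              unsigned char *buf, size_t len, uint64_t off)
{
    unsigned char block[BLOCKSIZE];
    size_t done = 0;

    while (done < len) {
        uint64_t pos   = off + done;
        uint64_t blk   = pos / BLOCKSIZE;   /* which block holds pos */
        size_t   skip  = pos % BLOCKSIZE;   /* offset within that block */
        size_t   chunk = BLOCKSIZE - skip;  /* bytes available there */

        if (blk >= idx->nblocks)
            return -1;                      /* read past end of image */
        if (chunk > len - done)
            chunk = len - done;
        if (decomp(idx->block_offs[blk], block) != 0)
            return -1;
        memcpy(buf + done, block + skip, chunk);
        done += chunk;
    }
    return 0;
}

The per-read cost is one block decompression per touched block, which is why small random reads favour small blocks while compression ratio favours large ones.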
The user controls the block size when compressing the file with the -B argument to sgzip. Smaller blocks mean a faster Sleuth Kit but less efficient compression. For example, I took the honeypot.hda8.dd.sgz produced above (btw, this is a disk from the honeynet forensic challenge) and did some benchmarking.

Timing the uncompressed version (filesize=272,465,920=272MB):

bash$ time fls -r -f linux-ext2 honeypot.hda8.dd > /dev/null
real 0m0.050s
user 0m0.010s
sys 0m0.040s

Hardly any time at all. Now I compressed the file with a blocksize of 100kb, using -B 100000 as the argument to sgzip (filesize=25,237,474=25MB):

bash$ time fls -r -i sgzip -f linux-ext2 honeypot.hda8.dd.sgz > /dev/null
real 0m12.600s
user 0m9.590s
sys 0m2.560s

And with the default blocksize (which is 512kb) (filesize=25,008,704=25MB):

real 1m32.119s
user 1m10.190s
sys 0m12.870s

Just for comparison, the size of a normal gzip file is 24,968,226 bytes. So the sgzip file with a 100kb blocksize is 1.07% bigger, and the speed is still acceptable and about 8-10 times faster than with the default 512kb blocksize. This is probably because fls reads lots of very small runs of data scattered all over the whole disk. Just to be ridiculous, I repeated the test with a block size of 10kb:

real 0m1.421s
user 0m1.310s
sys 0m0.000s

However, the file size has now dramatically increased to 27,435,391 bytes (27MB), which is about 10% bigger than pure gzip. That may be OK in some circumstances, but remember that this particular hdd is mostly empty, so it compresses very well. I expect the size expansion to be more noticeable on a fuller disk.

Future developments:

- I am planning to write a python module for sgzip (probably just using swig to access the c library; see the sketch below). If people are interested, I might look at how to make perl modules with swig. (I don't do much perl nowadays, since flag was rewritten in python.)
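For anyone curious what the swig route would look like, here is a rough sketch of an interface file. The declarations are hypothetical placeholders, since the real prototypes would be lifted straight out of sgzlib.h:

/* sgzlib.i -- sketch of a SWIG interface for the planned python
   module.  The function names below are made-up stand-ins for
   whatever sgzlib.h actually exports. */
%module sgzlib
%{
#include "sgzlib.h"
%}

/* Hypothetical API: open an .sgz image, read at an arbitrary
   uncompressed offset, close. */
extern void *sgz_open(const char *filename);
extern int   sgz_read_random(void *handle, char *buf, int len,
                             long long offset);
extern void  sgz_close(void *handle);

Running "swig -python sgzlib.i" then generates the wrapper C file to compile and link against the library, and the same interface file would feed "swig -perl" if anyone wants a perl module.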