Re: [sleuthkit-users] handling compressed data in tsk_fs_file_walk callback
From: Simson G. <si...@ac...> - 2011-07-19 15:13:17
> On Jul 17, 2011, at 9:22 PM, Simson Garfinkel wrote:
>
>> I'm having a problem with the file_acct callback function called from
>> tsk_fs_file_walk when processing compressed files. That is, when the
>> flag TSK_FS_BLOCK_FLAG_COMP is set.
>>
>> An existing problem that I've previously noted is that the callback
>> provides the location on the disk of the compressed data, but TSK
>> doesn't provide an exposed function which allows me to provide the
>> on-disk data and decompress it.
>
> The internal function is ntfs_uncompress_compunit() in tsk3/fs/ntfs.c.
> It currently has a static scope in the file, but I could change that.
> Can you look at that method to see if it is what you are looking for?

This is precisely what I am looking for. Being able to call this
function is good, but there needs to be a way to easily create the
NTFS_COMP_INFO structure that it requires.

>> For blocks that are not compressed, I know that a run can be extended
>> if the image offset plus the length of the previous block is equal to
>> the image offset of the following block. But I don't see any obvious
>> way to do that with compressed files.
>>
>> The problem that I am having now has to do with the coalescing of
>> compressed blocks. The callback is called for each 512-byte block in
>> the file. How can I tell if it is safe to combine two blocks into a
>> single run that is decompressed as a whole?
>
> Here is how it currently works:
>
> * NTFS breaks the original data up into compression units (I think 16
>   clusters is the default). It tries to compress them and, if a unit
>   compresses to less than 15 clusters, it stores those clusters
>   compressed. Let's say that it compresses to 10 clusters. The NTFS
>   attribute is saved as having 10 clusters with compressed data and
>   then 6 sparse clusters (which don't point to a specific cluster
>   address).
>
> * The file_walk callback will get called 16 times for this unit. Each
>   time it will return a chunk equal to the block size.
>   The first 10 times the callback will get a cluster address that
>   stored file content. The last 6 times it will continue to get
>   uncompressed file content, but it will have an address of 0. If you
>   are seeing behavior different from this, let me know.
>
> So, your previous approach of saving the previous cluster address and
> making sure that it increments by 1 each time should still work.
> You'll just need to handle the extra 0s in there. You can also
> manually parse the TSK_FS_ATTR structures if you want. Those are
> stored as run lists, so they will give you the extent information.

This is in fact what I am seeing, and I did not know what to do with
the extra 0s.

Here is my problem with the current architecture, however. How do I
distinguish between these two cases:

* I am called for the second cluster in a run of 16 clusters that
  should be decompressed together, or
* I am called for the first cluster in a run of 16 clusters that just
  happens to follow a run of 1 cluster that was decompressed by itself.

That is, can I be assured that the 16 data allocation block callbacks
I get will be for the same run, and not for two runs? Is there a way
to determine the number 16 rather than having it hard-coded?

I'm very excited about getting a solution here, because it seems that
I'm very close to realizing the goal of being able to know where the
data is on the disk, read it from the disk, and then decompress it
myself by calling SleuthKit.

Is there ever a chance that the compressed clusters will themselves be
fragmented? It looks like the DFXML will need to be able to represent
that one or more runs need to be concatenated together and then
decompressed.