Re: [sleuthkit-developers] Re: IO Subsystem patch for fstools

Hi Paul,

> Well I do see advantages.... I already wanted to ask this...
>
> The problem with the current code is that it is not possible to
> "read_random" an image efficiently because it cannot check the current
> offset in the image. This results in unnecessary seeks, and seeks are
> very expensive if they come in millions.

I agree - this is particularly bad if the underlying image is in a compressed
format like EnCase or sgzip, because then each seek/read corresponds to a
decompression of at least one block.

> For Indexed Searching it would be very handy if there were a
> generic fs_read_random() function.
>
> If this function would check the current offset in the image and thus
> not seek if the reads were all in succession, this would be great...

This really depends on the specific subsystem. For example, when reading an
EnCase file you need to decompress at least one chunk for each seek, so if you
read lots of little runs of data all over the file it is going to run slowly.
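
For a plain file-backed image, the offset check you describe could look
roughly like the sketch below. All names here (img_reader,
img_read_random) are made up for illustration and are not part of the
fstools code; it also assumes stdio and a `long` offset, where a real
image would need a 64-bit offset type.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical reader state: remembers where the next read would start. */
typedef struct {
    FILE *fp;
    long pos;             /* offset of the next sequential read, -1 if unknown */
    unsigned long seeks;  /* number of real seeks issued, for measurement */
} img_reader;

/* Read len bytes at offset, seeking only when the read is not sequential. */
static size_t img_read_random(img_reader *r, void *buf, size_t len, long offset)
{
    size_t n;

    assert(r != NULL && buf != NULL);

    if (r->pos != offset) {
        if (fseek(r->fp, offset, SEEK_SET) != 0) {
            r->pos = -1;          /* position is no longer known */
            return 0;
        }
        r->seeks++;
    }

    n = fread(buf, 1, len, r->fp);
    if (n < len && ferror(r->fp))
        r->pos = -1;              /* after an error the position is indeterminate */
    else
        r->pos = offset + (long)n;
    return n;
}
```

This only removes seeks for reads that really are back to back. It does
nothing for a compressed format, where every jump still costs at least
one chunk decompression.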

The solution to this problem, I think, is to implement some kind of caching in
memory. A cache system can solve all those problems very efficiently,
particularly for the case where you make lots of small reads, very close
together (i.e. no seeks). A simple cache (with a simple policy) can be
implemented quite easily, I think, and will be effective for the scenario you
are describing.
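
To make the idea concrete, here is a minimal sketch of such a cache,
assuming a direct-mapped policy (each block number maps to exactly one
slot) and a 32 KB block. The block size, slot count and every name below
(block_cache, cache_read, mem_image, ...) are assumptions for
illustration, not existing fstools code. The in-memory backend is only
there so the sketch can be exercised; a real backend would be the
EnCase or sgzip reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_BLOCK_SIZE 32768u   /* assumed block size */
#define CACHE_SLOTS      4u       /* assumed number of cached blocks */

/* Backend callback: read up to len bytes at offset, return bytes read. */
typedef size_t (*backend_read_fn)(void *ctx, uint64_t offset, void *buf, size_t len);

typedef struct {
    backend_read_fn read;
    void *ctx;
    uint64_t tag[CACHE_SLOTS];    /* block number held in each slot */
    size_t fill[CACHE_SLOTS];     /* valid bytes in each slot */
    int valid[CACHE_SLOTS];
    unsigned long hits, misses;
    unsigned char data[CACHE_SLOTS][CACHE_BLOCK_SIZE];
} block_cache;

static void cache_init(block_cache *c, backend_read_fn read, void *ctx)
{
    assert(c != NULL && read != NULL);
    memset(c, 0, sizeof *c);
    c->read = read;
    c->ctx = ctx;
}

/* Serve a read from cached blocks, loading whole blocks on a miss. */
static size_t cache_read(block_cache *c, uint64_t offset, void *buf, size_t len)
{
    unsigned char *out = buf;
    size_t done = 0;

    while (done < len) {
        uint64_t pos = offset + done;
        uint64_t blk = pos / CACHE_BLOCK_SIZE;
        size_t within = (size_t)(pos % CACHE_BLOCK_SIZE);
        size_t slot = (size_t)(blk % CACHE_SLOTS);
        size_t n;

        if (!c->valid[slot] || c->tag[slot] != blk) {
            /* Miss: one backend read (one decompression) fills the slot. */
            c->fill[slot] = c->read(c->ctx, blk * CACHE_BLOCK_SIZE,
                                    c->data[slot], CACHE_BLOCK_SIZE);
            c->tag[slot] = blk;
            c->valid[slot] = 1;
            c->misses++;
        } else {
            c->hits++;
        }

        if (within >= c->fill[slot])
            break;                /* past the end of the image */
        n = c->fill[slot] - within;
        if (n > len - done)
            n = len - done;
        memcpy(out + done, c->data[slot] + within, n);
        done += n;
    }
    return done;
}

/* Example backend for testing: an image held entirely in memory. */
typedef struct {
    const unsigned char *data;
    size_t size;
} mem_image;

static size_t mem_read(void *ctx, uint64_t offset, void *buf, size_t len)
{
    const mem_image *m = ctx;
    size_t n;

    if (offset >= m->size)
        return 0;
    n = m->size - (size_t)offset;
    if (n > len)
        n = len;
    memcpy(buf, m->data + offset, n);
    return n;
}
```

The hits and misses counters are there so the hit rate can be measured
on a real indexing run, which is what would tell us whether the block
size is sensible.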

What kind of IO do you do for indexing? Is it very localised? If you were to
cache a block in memory, what would be the optimal size of the block (say
1 MB, or more like 32 KB)? If you were to cache 1 MB in memory, how many reads
would you get out of it on average?

Michael