Re: [sleuthkit-users] hashing a file system
From: Simson G. <si...@ac...> - 2014-09-05 01:15:59
On Sep 4, 2014, at 9:11 PM, RB <ao...@gm...> wrote:

> On Thu, Sep 4, 2014 at 7:01 PM, Simson Garfinkel <si...@ac...> wrote:
>> This doesn't work unless you are prepared to buffer the later fragments of a file when they appear on disk before earlier fragments. So in the worst case, you need to hold the entire disk in RAM.
>
> Perhaps I'm being dense, but "dd if=file | md5sum -" in no way holds
> the entire file in RAM, and the process can be slept/interrupted/etc.;
> all this means that md5 can be calculated over a stream.

You are conflating the physical layout of the disk with the logical layout of the files. You have proposed reading the disk in physical block order. If you do that, what happens when you have a 30GB file whose first block is at the end of the disk and the rest of the file is at the beginning? You have to buffer the portions of the file that come first on the disk but logically later in the file; only when you reach the beginning of the file (at the end of the disk) can you start hashing. The problem is that files are fragmented, and the second fragment of a file frequently comes earlier on the disk than the first fragment.

> Looking at the API for Perl & Python MD5 libraries (expected to be the
> simplest), they have standard functionality for adding data to a hash
> object, and I don't expect it holds that in memory either. This would
> mean you should be able to make a linear scan through the disk and, as
> you read blocks associated with a file, append them to the md5 object
> for that file, and move on. You'd have a lot of md5 objects
> in-memory, but it shouldn't be of a size equivalent to the entire
> [used] disk.
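For concreteness, here is a minimal Python sketch of that linear-scan approach, and where the buffering cost sneaks in. This is an illustration, not Sleuth Kit code: the (file_id, logical_offset, data) tuples are a hypothetical stand-in for whatever a TSK-based block walker would emit, and hashlib's md5 objects play the role of the incremental hash objects described above. In-order fragments can be hashed and discarded immediately; out-of-order fragments are the ones that have to be held in RAM:

import hashlib

def hash_files_linear_scan(blocks):
    # blocks: iterable of (file_id, logical_offset, data) tuples in
    # *physical* disk order.  Hypothetical input format; a real tool
    # would derive it from a walk of the allocated blocks.
    hashers = {}    # file_id -> incremental md5 object
    next_off = {}   # file_id -> next logical offset we can hash
    pending = {}    # file_id -> {logical_offset: data} buffered fragments

    for file_id, offset, data in blocks:
        if file_id not in hashers:
            hashers[file_id] = hashlib.md5()
            next_off[file_id] = 0
            pending[file_id] = {}
        h, buf = hashers[file_id], pending[file_id]

        if offset == next_off[file_id]:
            # Fragment is in logical order: hash it and throw it away.
            h.update(data)
            next_off[file_id] += len(data)
            # Drain any buffered fragments that are now contiguous.
            while next_off[file_id] in buf:
                chunk = buf.pop(next_off[file_id])
                h.update(chunk)
                next_off[file_id] += len(chunk)
        else:
            # Fragment arrived before its logical predecessors:
            # it has to sit in RAM until the earlier bytes show up.
            buf[offset] = data

    return {fid: h.hexdigest() for fid, h in hashers.items()}

In the 30GB worst case above, pending ends up holding essentially the entire file before the first byte can be hashed; that buffer, not the md5 objects themselves, is where the memory goes.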