Re: [sleuthkit-developers] First Draft - Layout Hash Database

Hi All,

> > File entry:
> > - sha1
> > - md5
> > - os
> > - application
> > - filename
> > - filesize
>
> It would be nice if each entry had a static size, so that we could jump
> around the text file of the database easily.  Therefore, there would be
> an index that correlates an application type to an integer.  I would
> think that doing integer comparisons would be faster than string
> comparisons though when looking entries up.  That maybe a pain to
> manage though.

I find this is an important requirement, particularly for SQL databases. The
OS and application fields should be short ints so that an index can be built
on them, making searches faster. I also found that building a partial index
on the md5 column itself speeds things up by several orders of magnitude,
while still keeping the index size reasonable so that it fits well in RAM.
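
A minimal sketch of this layout, assuming SQLite driven from Python purely so
the example is self-contained. The table names, column names and the 8
character prefix length are all hypothetical, and the expression index on
substr() is only a stand-in for whatever partial index the database in use
provides:

```python
import hashlib
import sqlite3

# Hypothetical schema: every name here is illustrative, not an agreed format.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE os_type  (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE app_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE file_hash (
        md5      TEXT    NOT NULL,
        filename TEXT    NOT NULL,
        filesize INTEGER NOT NULL,
        os_id    INTEGER NOT NULL REFERENCES os_type(id),
        app_id   INTEGER NOT NULL REFERENCES app_type(id)
    );
    -- Index only the first 8 hex digits of the md5 to keep the index small.
    CREATE INDEX file_hash_md5_prefix ON file_hash (substr(md5, 1, 8));
""")


def lookup_md5(conn, md5):
    """Narrow by the indexed prefix, then compare the full hash."""
    md5 = md5.lower()
    return conn.execute(
        "SELECT filename, filesize FROM file_hash "
        "WHERE substr(md5, 1, 8) = ? AND md5 = ?",
        (md5[:8], md5),
    ).fetchall()


# Stand-in data, not a real NSRL entry.
sample = b"sample file content"
sample_md5 = hashlib.md5(sample).hexdigest()
conn.execute("INSERT INTO os_type VALUES (1, 'WinXP')")
conn.execute("INSERT INTO app_type VALUES (1, 'office tools')")
conn.execute(
    "INSERT INTO file_hash VALUES (?, ?, ?, ?, ?)",
    (sample_md5, "sample.dll", len(sample), 1, 1),
)

print(lookup_md5(conn, sample_md5))
```

The os and application columns hold small integers that point into lookup
tables, so comparisons and indexes on them stay cheap.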

> > Application entry:
Are you suggesting not naming the application product at all, and only
recording the category of the application? So, for example, "msword.exe" in
the table would have office tools as its application, but would not refer to
Microsoft Word as a product? I really think you still need to classify the
hash set by the commercial name of the application; otherwise you would not
know which specific application xyz.dll belongs to.

In general I think the approach taken by NSRL is not a bad one. I sympathise
with the dilemma of not being able to rely on the hashes to get a quick
yes/no answer as to whether a disk contains "bad files". I think the task the
NSRL set out to do is merely to identify the files. Classifying them into
categories is a purely subjective decision, based for the most part on the
circumstances of the case. The NSRL is used to see which
applications/packages/products are installed; the decision about which of
those applications are bad should be made in a separate table altogether. So
I suggest making another table where you classify the products into
categories etc., e.g.:

product_code/application_code/package whatever code is appropriate
product category

So the hash table should have information relating a specific hash to MS Word,
for example, and this new table tells us that MS Word is an office app.
Similarly, if we see a hash matching Back Orifice, we consult this new table
to find that Back Orifice is a hacker app. This is much more effective than
having to redo the entire NSRL.
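
A rough sketch of the two-table idea, again assuming SQLite from Python. The
table names, column names, placeholder hashes and the second filename are all
made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_code INTEGER PRIMARY KEY,
        name         TEXT NOT NULL
    );
    CREATE TABLE file_hash (
        md5          TEXT NOT NULL,
        filename     TEXT NOT NULL,
        product_code INTEGER NOT NULL REFERENCES product(product_code)
    );
    -- Kept apart from the hashes so it can be changed per case.
    CREATE TABLE product_category (
        product_code INTEGER PRIMARY KEY REFERENCES product(product_code),
        category     TEXT NOT NULL
    );
""")

conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [(1, "Microsoft Word"), (2, "Back Orifice")],
)
# Placeholder hashes and filenames, not real entries.
conn.executemany(
    "INSERT INTO file_hash VALUES (?, ?, ?)",
    [("a" * 32, "msword.exe", 1), ("b" * 32, "placeholder.exe", 2)],
)
conn.executemany(
    "INSERT INTO product_category VALUES (?, ?)",
    [(1, "office"), (2, "hacker")],
)


def classify(conn, md5):
    """Return (product name, category) for a hash, or None if unknown."""
    return conn.execute(
        "SELECT p.name, c.category FROM file_hash f "
        "JOIN product p ON p.product_code = f.product_code "
        "LEFT JOIN product_category c ON c.product_code = f.product_code "
        "WHERE f.md5 = ?",
        (md5,),
    ).fetchone()


print(classify(conn, "b" * 32))
```

Reclassifying a product is then a one-row update in product_category, and the
hash table is never touched.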

> MacOS probably shouldn't get a separate category from OSX unless Win
> '98 is also separated from Win XP.  The specific types in BSD should be
> defined (since OS X is actually a variant of BSD).  The Solaris
> category should also include SunOS.
I think that OSs should be broken down as finely as practically possible. So
I would give Win98 a different category than WinXP. Maybe not so far as to
separate the different service packs, but it's often very evident what kind
of OS you are working on, and it would speed things up considerably if the
database could be split into different tables depending on the OS. The same
effect can be achieved by building an index on the OS column; this greatly
lightens the load on the query if we restrict our searches to particular
OSs.
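
One possible reading of this, sketched with SQLite from Python: a composite
index on (os_id, md5), so a lookup restricted to a few OS codes only walks
the matching part of the index. The OS codes, names and sample rows are
hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_hash (
        md5      TEXT    NOT NULL,
        filename TEXT    NOT NULL,
        os_id    INTEGER NOT NULL
    );
    CREATE INDEX file_hash_os_md5 ON file_hash (os_id, md5);
""")

# Hypothetical integer codes for the OS categories.
OS_WIN98, OS_WINXP, OS_SOLARIS = 1, 2, 3

# Placeholder rows: the same made-up hash recorded under two OS categories.
conn.executemany(
    "INSERT INTO file_hash VALUES (?, ?, ?)",
    [("a" * 32, "a.dll", OS_WINXP), ("a" * 32, "a.so", OS_SOLARIS)],
)


def lookup(conn, md5, os_ids):
    """Look up a hash, but only among entries for the given OS codes."""
    os_ids = list(os_ids)
    if not os_ids:
        return []
    marks = ",".join("?" * len(os_ids))
    rows = conn.execute(
        f"SELECT filename FROM file_hash "
        f"WHERE os_id IN ({marks}) AND md5 = ? ORDER BY filename",
        (*os_ids, md5),
    )
    return [row[0] for row in rows]


print(lookup(conn, "a" * 32, [OS_WINXP]))
```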

> > Questions so far:
> > Do we need a separate architecture field for a hashsum entry ? This
> > will
> > require an additional search parameter later.
I think we do, for the reason I mentioned above: no point searching all those
SPARC entries when we are clearly working on an Intel box.

> Is the file size needed?  I'm trying to think of a scenario where that
> would be needed.
Sometimes it's useful to see the filesize if the file is extremely small, e.g.
1 byte or 2 bytes. Many unrelated files of that size have identical content
and therefore identical hashes, so a database match is not reliable for them.
In fact I think hashes should not be taken of such small files, but NSRL is
full of 0 byte files.
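
To illustrate why the filesize helps, here is a small sketch in Python. The
16 byte cut-off and the function names are assumptions made for the example,
not something the database format prescribes:

```python
import hashlib

# Assumed cut-off in bytes; the right value is a judgment call.
MIN_RELIABLE_SIZE = 16


def md5_hex(data):
    """Hex md5 of a bytes object."""
    return hashlib.md5(data).hexdigest()


def is_reliable_match(filesize, min_size=MIN_RELIABLE_SIZE):
    """Treat hash matches on very small files as uninformative."""
    return filesize >= min_size


def filter_matches(matches, min_size=MIN_RELIABLE_SIZE):
    """Keep only (filename, filesize) matches at or above the cut-off."""
    return [m for m in matches if is_reliable_match(m[1], min_size)]


# Every empty file has the same md5, whatever it is called or belongs to.
print(md5_hex(b""))

# Hypothetical lookup results as (filename, filesize) pairs.
print(filter_matches([("empty.log", 0), ("flag", 1), ("big.dll", 4096)]))
```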

> This looks good.  I think more requirements for each app category would
> be useful though.
It would be useful to design the hash database in a way that can leverage off 
NSRL, since NSRL is the richest source of hashes at the moment.

Regards.
Michael.