Re: [sleuthkit-users] Good vs. Bad Hashes

[I was hoping you would be interested in this topic in light of your 
new database :)]

On Thursday, January 22, 2004, at 01:33  PM, Matthias Hofherr wrote:

> Logically, we need
> to maintain a potentially huge amount of data and categorize every single
> hash entry. Furthermore, we have to decide for each entry if it is a
> known-bad or a known-good. I think a useful solution is to maintain a
> global database with both freely available hashsums like NSRL and
> KnownGoods, combined with self-made hash sets (md5sum/graverobber ...).

I'm assuming that you are referring to a global database in the local 
sense: each person has their own "global" database that they create 
and can add hashes to and remove hashes from, not a global database in 
the Solaris Fingerprint DB sense.

> The interface to
> autopsy and sleuthkit should allow querying only certain categories,
> only known bads, or a certain category as known bad or not (e.g. remote
> management tools). The biggest problem here is to manage the category
> mapping table for all the different tools.

I agree, especially when you start merging the homemade hashes with 
those from the NSRL and HashKeeper.  I guess we could have generic 
categories of 'Always Good' and 'Always Bad'.

> The technical problem is to manage such a huge amount of raw data. With
> NSRL alone, we have millions of hash sets. This requires a new query
> mechanism. With an RDBMS, we need persistent connections and the
> possibility to bulk query large data sets very fast. With the current
> sorter|hfind design, sorter calls hfind one time per hash analyzed. This
> is definitely a big bottleneck.

Yea, I have no problem if the end solution requires a redesign of hfind 
and sorter.
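
As a rough illustration of the bulk-query idea, here is a minimal sketch 
using SQLite from Python.  It is not how hfind or sorter work today; the 
table layout (md5, category, status) and the function names open_db and 
bulk_lookup are assumptions made up for this example.  The point is only 
that one persistent connection can answer many hashes per query instead 
of one process call per hash.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open (or create) a hypothetical hash database and keep the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashes ("
        "md5 TEXT PRIMARY KEY, category TEXT, status TEXT)"
    )
    return conn


def bulk_lookup(conn, md5s, chunk=500):
    """Look up many MD5 values with a few queries instead of one per hash.

    Returns a dict mapping each hash that was found to (category, status).
    Hashes that are not in the database are simply absent from the result.
    """
    md5s = [m.lower() for m in md5s]
    found = {}
    # Query in chunks so the number of bound parameters stays small.
    for i in range(0, len(md5s), chunk):
        part = md5s[i:i + chunk]
        marks = ",".join("?" * len(part))
        rows = conn.execute(
            f"SELECT md5, category, status FROM hashes WHERE md5 IN ({marks})",
            part,
        )
        for md5, category, status in rows:
            found[md5] = (category, status)
    return found
```

A sorter-like tool would collect the hashes of all files first and then 
call bulk_lookup once, rather than starting a lookup per file.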

I'm just not sure what the end solution should be.  Some open questions:
- What application categories are needed?  Are the NSRL ones sufficient, 
or are there too many / too few of them?
- How do you specify in the query which categories are bad and which are 
good?
- How do you specify to 'sorter' which categories are bad and which are 
good?
- Do we want to require a real database (i.e. SQL), or should there also 
be an ASCII file version?
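
One possible shape for the "which categories are good or bad" question 
is a small policy table that each investigator can override per case.  
The sketch below is only an assumption to make the questions concrete: 
the category names, the "category = status" ASCII format, and the 
functions load_policy and classify are all hypothetical and are not 
taken from the NSRL, hfind, or sorter.

```python
def load_policy(lines):
    """Parse hypothetical 'category = status' lines into a dict.

    Blank lines and lines starting with '#' are ignored.
    """
    policy = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        category, _, status = line.partition("=")
        policy[category.strip().lower()] = status.strip().lower()
    return policy


def classify(category, policy, overrides=None):
    """Return 'good', 'bad', or 'unknown' for a category name.

    Per-case overrides win over the shared policy, so a category such as
    remote management tools can be treated as bad in one investigation
    and left undecided in another.
    """
    merged = dict(policy)
    if overrides:
        merged.update(overrides)
    return merged.get(category.lower(), "unknown")
```

The same policy could be stored as a table in an SQL database or kept 
as a plain ASCII file; the lookup logic stays the same either way.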

thanks,
brian