Re: [sleuthkit-developers] First Draft - Layout Hash Database

Michael Cohen said:
[...]
> I find this is an important requirement, particularly for sql databases.
> The os and applications should be short ints so that an index may be built
> on them making it faster to search. Also I found that building a partial
> index on the md5 column itself speeds things up several orders of magnitude,
> but still keeps the index size reasonable so it fits well in ram.

Performance will not be one of our bigger problems. Even with, say, 20
million entries (NSRL alone has nearly 18 million), we should get reasonable
search times, provided we use some clever indexing.
Sure, importing 20 million entries will be one problem. But by dropping the
index before the import and rebuilding it afterwards we will save much time.
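
To illustrate the import idea, here is a minimal sketch using Python's
sqlite3 module. The table name "hashes", the columns and the index name
are assumptions made up for this example, not a proposed schema, and the
sample rows are generated rather than taken from NSRL.

```python
import hashlib
import sqlite3


def bulk_import(conn, rows):
    """Load (md5, os_id, app_id) rows while the md5 index is dropped."""
    cur = conn.cursor()
    # Hypothetical table layout, for illustration only.
    cur.execute(
        "CREATE TABLE IF NOT EXISTS hashes ("
        "md5 TEXT NOT NULL, os_id INTEGER NOT NULL, app_id INTEGER NOT NULL)"
    )
    # Drop the index first so the inserts do not have to maintain it.
    cur.execute("DROP INDEX IF EXISTS idx_hashes_md5")
    cur.executemany(
        "INSERT INTO hashes (md5, os_id, app_id) VALUES (?, ?, ?)", rows
    )
    # Rebuild the index once, after all rows are in place.
    cur.execute("CREATE INDEX idx_hashes_md5 ON hashes (md5)")
    conn.commit()


def lookup(conn, md5):
    """Return all (os_id, app_id) pairs stored for one md5 value."""
    return conn.execute(
        "SELECT os_id, app_id FROM hashes WHERE md5 = ?", (md5,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
# Generated sample data standing in for a real hash set.
sample = [
    (hashlib.md5(str(i).encode()).hexdigest(), i % 3, i % 5)
    for i in range(1000)
]
bulk_import(conn, sample)
```

How much time this saves depends on the database engine and the size of
the import, so it should be measured rather than assumed.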

The performance question is not important as long as we do not have a good
data model. Adding performance features is simple textbook work.

>> > Application entry:
> Are you suggesting to not name the application product at all? but rather
> only contain information on the category of the application? So for example
> in the table "msword.exe" will have office tools as application, but not
> refer to microsoft word as a product? I really think that you still need to
> classify the hash set with the commercial name of the application, otherwise
> you would not know which specific application xyz.dll belongs to.

I think we have to decide whether we want a kind of full management database
with every possible kind of information for a hash set, or whether we need a
database with a relatively small number of categories for excluding
knowngoods and alerting on knownbads. For the latter, we do not need to
know whether "msword.exe" is from the package "Microsoft Office 2000 SP 3
Hotfix 2a".
For the former, we need the detailed information.

Which brings us to another problem:
Do we allow duplicate entries for hash sums in the database? The former
solution will allow this; the latter probably doesn't require it.
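
The difference between the two options can be sketched as two table
designs, again with Python's sqlite3 module. Table names, column names
and the example rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Small-category design: each hash sum appears at most once.
conn.execute(
    "CREATE TABLE hash_category (md5 TEXT PRIMARY KEY, category TEXT NOT NULL)"
)
# Full management design: the same hash sum may belong to several products.
conn.execute(
    "CREATE TABLE hash_product ("
    "md5 TEXT NOT NULL, product TEXT NOT NULL, PRIMARY KEY (md5, product))"
)


def try_insert(sql, params):
    """Run one INSERT; return False if a uniqueness constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


md5 = "d41d8cd98f00b204e9800998ecf8427e"
first_cat = try_insert("INSERT INTO hash_category VALUES (?, ?)", (md5, "office"))
second_cat = try_insert("INSERT INTO hash_category VALUES (?, ?)", (md5, "os"))
first_prod = try_insert("INSERT INTO hash_product VALUES (?, ?)", (md5, "Product A"))
second_prod = try_insert("INSERT INTO hash_product VALUES (?, ?)", (md5, "Product B"))
```

In the first table the second insert for the same hash sum is rejected.
In the second table both rows are kept, one per product.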

> In general I think the approach taken by NSRL is not a bad one.
[...]
> This is much more effective than
> having to redo the entire nsrl.

The problem is that it is absolutely no problem to make a database
structure for NSRL. In fact, NSRL already has a full generic database
structure which could be easily adapted.
But this was, so far, not my intention (see above).
Yet, we do not have to redo the NSRL database. We only have to define
a mapping (once) for NSRL categories. Automatic import with a parser script
is no problem. Since NSRL categories do not change too much, maintenance
should be no problem.
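
A rough sketch of such a one-time mapping in Python. The category names
on both sides are invented for this example; the real NSRL category
strings and our own category list still have to be defined.

```python
# Hypothetical mapping from source categories to our small category set.
CATEGORY_MAP = {
    "Operating System": "os",
    "Word Processing": "office",
    "Spreadsheet": "office",
    "Hacker Tool": "knownbad",
}


def map_category(source_category, default="uncategorized"):
    """Translate one source category; unknown names fall back to a default."""
    return CATEGORY_MAP.get(source_category.strip(), default)


def unmapped(source_categories):
    """List source categories that still need an entry in CATEGORY_MAP."""
    return sorted({c.strip() for c in source_categories} - set(CATEGORY_MAP))
```

A parser script would call map_category() once per imported record, and
unmapped() shows which new categories need attention after an update.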

>> MacOS probably shouldn't get a separate category from OSX unless Win
>> '98 is also separated from Win XP.  The specific types in BSD should be
>> defined (since OS X is actually a variant of BSD).  The Solaris
>> category should also include SunOS.
> I think that OSs should be granulated down as much as practically
> possible. So I would give win98 a different category than winXP. Maybe not
> so much as to separate the different service packs, but it's often very
> evident what kind of os you are working on, and it would speed things up
> considerably if the database could be split into different tables,
> depending on the OS. This effect can be achieved by building an index on
> the OS column, this severely lightens the load on the query if we restrict
> our searches to particular os's.

Same problem as above: either we use small categories with a usable
interface or we define huge categories with a VERY large interface.
Agreed, the latter will result in faster performance due to more detailed
constraints in the query. But with good indexing and persistent database
connections, speed should be reasonable with small categories as well.
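
For comparison, a small sketch of a lookup that can optionally be
restricted to one OS, using Python's sqlite3 module. The table, the
index names and the numeric OS codes are assumptions for illustration.

```python
import sqlite3

# Hypothetical short-int codes for OS categories.
OS_WIN98, OS_WINXP = 1, 2

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hashes (md5 TEXT NOT NULL, os_id INTEGER NOT NULL)")
# Composite index for lookups restricted to one OS.
conn.execute("CREATE INDEX idx_hashes_os_md5 ON hashes (os_id, md5)")
# Plain md5 index for lookups across all OS categories.
conn.execute("CREATE INDEX idx_hashes_md5 ON hashes (md5)")
conn.executemany(
    "INSERT INTO hashes (md5, os_id) VALUES (?, ?)",
    [
        ("d41d8cd98f00b204e9800998ecf8427e", OS_WIN98),
        ("d41d8cd98f00b204e9800998ecf8427e", OS_WINXP),
        ("900150983cd24fb0d6963f7d28e17f72", OS_WINXP),
    ],
)


def is_known(conn, md5, os_id=None):
    """Check whether md5 is in the table, optionally only for one OS."""
    if os_id is None:
        row = conn.execute(
            "SELECT 1 FROM hashes WHERE md5 = ? LIMIT 1", (md5,)
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT 1 FROM hashes WHERE os_id = ? AND md5 = ? LIMIT 1",
            (os_id, md5),
        ).fetchone()
    return row is not None
```

Both query forms are answered from an index here, so the interface can
stay small and still offer the OS restriction to those who want it.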

Regards,

Matthias