Thread: [sleuthkit-developers] First Draft - Layout Hash Database

[sleuthkit-developers] First Draft - Layout Hash Database

From: Matthias H. <mat...@mh...> - 2004-01-27 18:13:14

Hi list,

In cooperation with David Barroso, I compiled a first proposal
for the structure of a hash database:

File entry:
- sha1
- md5
- os
- application
- filename
- filesize

Application entry:
- remote management
- office tools
- database
- desktop
- server daemons
- web
- multimedia
- drivers
- development
- sysutils
- security
- known-bad
- other

Operating system entry:
- Linux
- Windows
- BSD
- Mac
- MacOSX
- Solaris
- DOS
- Handheld OS
- AIX
- HP-UX
- Other

The fields per category should be easily manageable with a web-based
analysis GUI (Autopsy). Usually, only one of the categories should be
required for a forensic analysis step ("filter all Linux hashsums from my
image", "identify application xyz on my image" ...).
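
A minimal sketch of the draft layout above, in Python. The class names, the
integer values and the sample file name are assumptions made for
illustration; only the field and OS names come from the proposal.

```python
from dataclasses import dataclass
from enum import IntEnum


class OperatingSystem(IntEnum):
    # Values are arbitrary; only the names come from the proposal.
    LINUX = 1
    WINDOWS = 2
    BSD = 3
    MAC = 4
    MACOSX = 5
    SOLARIS = 6
    DOS = 7
    HANDHELD = 8
    AIX = 9
    HPUX = 10
    OTHER = 11


@dataclass(frozen=True)
class FileEntry:
    sha1: str
    md5: str
    os: OperatingSystem
    application: str  # one of the application categories, e.g. "sysutils"
    filename: str
    filesize: int


def filter_by_os(entries, os):
    """Return the entries recorded for one operating system."""
    return [e for e in entries if e.os == os]


# The two hashes below are those of an empty file; the name is made up.
entries = [
    FileEntry("da39a3ee5e6b4b0d3255bfef95601890afd80709",
              "d41d8cd98f00b204e9800998ecf8427e",
              OperatingSystem.LINUX, "sysutils", "empty.conf", 0),
]
linux_only = filter_by_os(entries, OperatingSystem.LINUX)
```

This corresponds to the "filter all Linux hashsums" step; filtering by
application category would follow the same pattern.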

Questions so far:
Do we need a separate architecture field for a hashsum entry? This will
require an additional search parameter later.
Does anyone need a crc32 entry with the hashsum?
Did we miss important fields?
Did we miss important questions ;-)

Feedback for this proposal is welcome and encouraged.

Regards,

Matthias


-- 
Matthias Hofherr
mail: mat...@mh...
web: http://www.forinsect.de
gpg: http://www.forinsect.de/pubkey.asc

Re: [sleuthkit-developers] First Draft - Layout Hash Database

From: Brian C. <ca...@sl...> - 2004-01-27 23:15:40

> in cooperation with David Barroso I compiled a first  proposal
> for the structure of a hash database:

Great.

I thought about what software I have on my systems and tried to fit it 
in, so there are some questions about what goes where.  Could you maybe 
provide requirements for software to fit into each category?

> File entry:
> - sha1
> - md5
> - os
> - application
> - filename
> - filesize

It would be nice if each entry had a static size, so that we could jump 
around the text file of the database easily.  Therefore, there would be 
an index that correlates an application type to an integer.  I would 
think that integer comparisons would be faster than string comparisons 
when looking entries up.  That may be a pain to manage, though.
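
To illustrate why a static entry size makes jumping around easy, here is a
rough Python sketch. The record layout (32 hex characters of MD5, a space, a
4-digit category code, a newline) is purely an assumption for this example,
as are the function names and the category codes.

```python
import io

# Hypothetical layout: 32 hex chars of MD5, one space, a 4-digit category
# code and a newline. Every record is therefore exactly 38 bytes long.
RECORD_LEN = 32 + 1 + 4 + 1


def make_record(md5_hex, category_code):
    return f"{md5_hex.lower()} {category_code:04d}\n".encode("ascii")


def lookup(db, md5_hex):
    """Binary-search a file of fixed-size records sorted by MD5.

    Returns the integer category code, or None if the hash is absent.
    """
    target = md5_hex.lower().encode("ascii")
    db.seek(0, io.SEEK_END)
    lo, hi = 0, db.tell() // RECORD_LEN
    while lo < hi:
        mid = (lo + hi) // 2
        db.seek(mid * RECORD_LEN)  # jump straight to record number mid
        record = db.read(RECORD_LEN)
        key = record[:32]
        if key == target:
            return int(record[33:37])
        if key < target:
            lo = mid + 1
        else:
            hi = mid
    return None


# Two sample records, sorted by hash, held in memory instead of on disk.
records = sorted([
    make_record("d41d8cd98f00b204e9800998ecf8427e", 10),
    make_record("0cc175b9c0f1b6a831c399e269772661", 3),
])
db = io.BytesIO(b"".join(records))
```

Because record number n always starts at byte n * RECORD_LEN, no separate
offset index is needed; the cost is that every field has to be padded or
coded to a fixed width.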

> Application entry:
> - remote management
> - office tools

Would Adobe Acrobat Reader and calendars fit into this category?

> - database
> - desktop

What are examples of this category?  games?

> - server daemons
> - web

A general name like network may scale better.  Would email tools fit in 
here too?

> - multimedia
> - drivers
> - development
> - sysutils
> - security

Would this include tools that are frequently called "hacker" tools too? 
  This category could be difficult and controversial to maintain, but I 
don't know of a better way to do it...

> - known-bad

Should there be a known-good too?  I can imagine a situation where 
someone hashes his /bin/, /sbin/, /usr/local/bin ... directories and 
doesn't want to have to identify the category of each file.

> - other

Where would child-porn fit into this?  known-bad?  That seems to be one 
of the biggest categories of hashes and may warrant its own category.


> Operation system entry:
> - Linux
> - Windows
> - BSD
> - Mac
> - MacOSX
> - Solaris
> - DOS
> - Handheld OS
> - AIX
> - HP-UX
> - Other

MacOS probably shouldn't get a separate category from OSX unless Win 
'98 is also separated from Win XP.  The specific types in BSD should be 
defined (since OS X is actually a variant of BSD).  The Solaris 
category should also include SunOS.


> Questions so far:
> Do we need a separate architecture field for a hashsum entry ? This will
> require an additional search parameter later.

Probably not.

> Does anyone need a crc32 entry with the hashsum ?

I don't think it is needed.  It is not best practice to use CRC, so 
there isn't much point in including it.

> Did we miss important fields ?

SHA-2 may not be a bad idea.  I recall threads in the past on other 
lists about using SHA-2, so we may want to make a field for it (even 
though the public DBs don't use it yet).  It can take the place of 
CRC32.

Is the file size needed?  I'm trying to think of a scenario where that 
would be needed.

> Did we miss important questions ;-)

This looks good.  I think more requirements for each app category would 
be useful though.

thanks,
brian

Re: [sleuthkit-developers] First Draft - Layout Hash Database

From: Michael C. <mic...@ne...> - 2004-01-28 08:37:37

Hi All,

> > File entry:
> > - sha1
> > - md5
> > - os
> > - application
> > - filename
> > - filesize
>
> It would be nice if each entry had a static size, so that we could jump
> around the text file of the database easily.  Therefore, there would be
> an index that correlates an application type to an integer.  I would
> think that doing integer comparisons would be faster than string
> comparisons though when looking entries up.  That maybe a pain to
> manage though.

I find this is an important requirement, particularly for SQL databases. The 
os and applications should be short ints so that an index may be built on 
them, making it faster to search. Also I found that building a partial index 
on the md5 column itself speeds things up by several orders of magnitude, but 
still keeps the index size reasonable so it fits well in RAM.
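
A small sketch of this idea using Python's sqlite3 module. It reads "partial
index" as an index over only the leading characters of the hash, which SQLite
can express as an index on substr(md5, 1, 8). The table name, column names,
prefix length and integer codes are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hashes (
        md5    TEXT    NOT NULL,
        os_id  INTEGER NOT NULL,  -- small integer code, not a string
        app_id INTEGER NOT NULL
    );
    -- Index only the first 8 hex characters of the MD5 to keep it small.
    CREATE INDEX idx_md5_prefix ON hashes (substr(md5, 1, 8));
    CREATE INDEX idx_os ON hashes (os_id);
""")
conn.executemany(
    "INSERT INTO hashes VALUES (?, ?, ?)",
    [("d41d8cd98f00b204e9800998ecf8427e", 1, 7),
     ("0cc175b9c0f1b6a831c399e269772661", 2, 3)],
)


def find(md5_hex):
    """Narrow by the indexed prefix first, then compare the full hash."""
    return conn.execute(
        "SELECT os_id, app_id FROM hashes"
        " WHERE substr(md5, 1, 8) = substr(?, 1, 8) AND md5 = ?",
        (md5_hex, md5_hex),
    ).fetchall()
```

The query repeats the indexed expression so that the planner has a chance to
use the small index; the full comparison then discards any rows that only
share the prefix.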

> > Application entry:
Are you suggesting not naming the application product at all, and only 
storing the category of the application? So, for example, in the table 
"msword.exe" will have office tools as its application, but will not refer to 
Microsoft Word as a product? I really think that you still need to classify 
the hash set with the commercial name of the application; otherwise you would 
not know which specific application xyz.dll belongs to.

In general I think the approach taken by NSRL is not a bad one. I sympathise 
with the dilemma of not being able to rely on the hashes to get a quick 
yes/no answer as to whether a disk contains "bad files". I think the task the 
NSRL has set itself is merely to identify the files. Classifying them into 
categories is a purely subjective decision, based for the most part on the 
circumstances of the case. The NSRL is used to see what 
applications/packages/products are installed; the decision about which of 
those applications are bad should be made in a separate table altogether. So 
I suggest making another table where you classify the products into 
categories etc., e.g.:

product_code/application_code/package whatever code is appropriate
product category

So the hash table should have information relating a specific hash to MS Word 
for example, and this new table tells us that MS Word is an office app. 
Similarly, if we see a hash matching Back Orifice, we consult this new table 
to find that Back Orifice is a hacker app. This is much more effective than 
having to redo the entire NSRL.
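
The two-table idea could look roughly like this with Python's sqlite3 module.
All table and column names, the product codes, the second file name and the
placeholder hashes are assumptions for the sketch, not the NSRL's actual
schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (
        product_code INTEGER PRIMARY KEY,
        name         TEXT NOT NULL
    );
    CREATE TABLE file_hash (
        md5          TEXT NOT NULL,
        filename     TEXT NOT NULL,
        product_code INTEGER NOT NULL REFERENCES product(product_code)
    );
    -- The subjective part lives here and can be edited per case.
    CREATE TABLE product_category (
        product_code INTEGER NOT NULL REFERENCES product(product_code),
        category     TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO product VALUES (?, ?)",
                 [(1, "Microsoft Word"), (2, "Back Orifice")])
conn.executemany("INSERT INTO file_hash VALUES (?, ?, ?)",
                 [("a" * 32, "msword.exe", 1),   # placeholder hashes
                  ("b" * 32, "example.exe", 2)])
conn.executemany("INSERT INTO product_category VALUES (?, ?)",
                 [(1, "office tools"), (2, "hacker tools")])


def categorize(md5_hex):
    """Return (product name, category) for a hash, or None if unknown."""
    return conn.execute(
        "SELECT p.name, c.category"
        " FROM file_hash f"
        " JOIN product p ON p.product_code = f.product_code"
        " JOIN product_category c ON c.product_code = f.product_code"
        " WHERE f.md5 = ?",
        (md5_hex,),
    ).fetchone()
```

Reclassifying a product then means changing one row in product_category,
while the hash data itself stays untouched.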

> MacOS probably shouldn't get a separate category from OSX unless Win
> '98 is also separated from Win XP.  The specific types in BSD should be
> defined (since OS X is actually a variant of BSD).  The Solaris
> category should also include SunOS.
I think that OSs should be granulated down as much as practically possible. So 
I would give Win98 a different category than WinXP. Maybe not so much as to 
separate the different service packs, but it's often very evident what kind 
of OS you are working on, and it would speed things up considerably if the 
database could be split into different tables, depending on the OS. The same 
effect can be achieved by building an index on the OS column, which greatly 
lightens the load on the query if we restrict our searches to particular 
OSs.

> > Questions so far:
> > Do we need a separate architecture field for a hashsum entry ? This will
> > require an additional search parameter later.
I think we do, for the reason I mentioned above: there is no point searching 
all those SPARC entries when we are clearly working on an Intel box.

> Is the file size needed?  I'm trying to think of a scenario where that
> would be needed.
Sometimes it's useful to see the filesize if the file is extremely small, e.g. 
1 byte or 2 bytes. It's very easy to get coincidental hash matches on these 
files, since unrelated files that small often have identical content, so the 
database is not reliable for them. In fact I think hashes should not be taken 
of such small files, but NSRL is full of 0 byte files.
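
A short Python sketch of how a stored file size could be used to skip such
entries. The 16-byte cut-off and the function names are arbitrary choices
for this example, not something proposed in the thread.

```python
import hashlib

MIN_SIZE = 16  # bytes; an arbitrary cut-off chosen for this sketch


def should_look_up(data, min_size=MIN_SIZE):
    """Skip files too small for a hash match to say anything about origin."""
    return len(data) >= min_size


def md5_hex(data):
    return hashlib.md5(data).hexdigest()


# Every empty file has this same MD5, whatever product shipped it.
EMPTY_MD5 = md5_hex(b"")
```

With a filesize column in the database, the same check can be applied on the
database side by ignoring rows below the chosen size.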

> This looks good.  I think more requirements for each app category would
> be useful though.
It would be useful to design the hash database in a way that can leverage off 
NSRL, since NSRL is the richest source of hashes at the moment.

Regards.
Michael.

Re: [sleuthkit-developers] First Draft - Layout Hash Database

From: Brian C. <ca...@sl...> - 2004-01-28 14:41:33

>>> Application entry:
> So I suggest to make another table where you classify the products
> into categories etc. e.g.:
>
> product_code/application_code/package whatever code is appropriate
> product category
>
> So the hash table should have information relating a specific hash to
> MSword for example, and this new table tells us that msword is an
> office app. Similarly if we see a hash matching back orifice, we
> consult this new table to find that back orific is a hacker app. This
> is much more effective than having to redo the entire nsrl.

That is a really good point.  The only problem we are trying to solve 
is the number of application categories.  We could even use all of the 
fields that the NSRL uses and write a program to read in the NSRL and 
output the NSRL with the new categories.


With regard to separating by platform and more granular OS, I think 
that is useful for the operating system binaries.  But for applications 
that could be harder.  Many Windows apps run on different versions.  If 
each one has to be tied to every new Windows version, then it might be 
a pain to maintain.

thanks,
brian

Re: [sleuthkit-developers] First Draft - Layout Hash Database

From: Matthias H. <mat...@mh...> - 2004-01-28 18:12:03

Michael Cohen said:
[...]
> I find this is an important requirement, particularly for sql databases.
> The os and applications should be short ints so that an index may be
> built on them making it faster to search. Also I found that building a
> partial index on the md5 column itself speeds things up several orders
> of maginitude, but still keeps the index size reasonable so it fits
> well in ram.

Performance will not be one of our bigger problems. Even with, say, 20
million entries (NSRL alone has nearly 18 million), we should get reasonable
search times, provided we use some clever indexing.
Sure, one problem will be importing 20 million entries. But by dropping the
index and recreating it after the import we will save much time.
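
The drop-and-recreate approach can be sketched as follows with Python's
sqlite3 module. The table, index and function names are assumptions, and
the generated rows are placeholders rather than real hashes.

```python
import sqlite3


def bulk_import(conn, rows):
    """Load (md5, source) rows without paying for index upkeep per row."""
    with conn:  # commits once when the block exits
        conn.execute("DROP INDEX IF EXISTS idx_md5")
        conn.executemany(
            "INSERT INTO hashes (md5, source) VALUES (?, ?)", rows)
        conn.execute("CREATE INDEX idx_md5 ON hashes (md5)")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hashes (md5 TEXT NOT NULL, source TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_md5 ON hashes (md5)")

rows = ((f"{i:032x}", "example") for i in range(1000))  # placeholder hashes
bulk_import(conn, rows)
count = conn.execute("SELECT COUNT(*) FROM hashes").fetchone()[0]
```

The index is built once over the finished table instead of being updated for
every inserted row; how much time this saves depends on the database engine
and the data volume.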

The performance question is not important as long as we do not have a good
data model. Adding performance features is simple textbook work.


>> > Application entry:
> Are you suggesting to not name the application product at all? but rather
> only contain information on the category of the application? So for example
> in the table "msword.exe" will have office tools as application, but not
> refer to microsoft word as a product? I really think that you still need to
> classify the hash set with the commercial name of the application, otherwise
> you would not know which specific application xyz.dll belongs to.

I think we have to decide whether we want a kind of full management database
with all possible kinds of information for a hash set, or whether we need a
database with a relatively small number of categories for excluding
knowngoods and alerting on knownbads. For the latter, we do not need to
know if "msword.exe" is from the package "Microsoft Office 2000 SP 3
Hotfix 2a".
For the former, we need the detailed information.

Which brings us to another problem:
Do we allow duplicate entries for hashsums in the database? The former
solution will allow this; the latter probably doesn't require it.


> In general I think the approach taken by NSRL is not a bad one.
[...]
> This is much more effective than
> having to redo the entire nsrl.

The thing is, it is absolutely no problem to make a database
structure for NSRL. In fact, NSRL already has a full generic database
structure which could be easily adapted.
But this was, so far, not my intention (see above).
Still, we do not have to redo the NSRL database. We only have to define
a mapping (once) for NSRL categories. Automatic import with a parser script
is no problem. Since NSRL categories do not change too much, maintenance
should be no problem.

>> MacOS probably shouldn't get a separate category from OSX unless Win
>> '98 is also separated from Win XP.  The specific types in BSD should be
>> defined (since OS X is actually a variant of BSD).  The Solaris
>> category should also include SunOS.
> I think that OSs should be granulated down as much as practically
> possible. So I would give win98 a different category than winXP. Maybe
> not so much as to seperate the different service packs, but  its often
> very evident what kind of os you are working on, and it would speed
> things out considerably if the database could be split into different
> tables, depending on the OS. This effect can be achieved by building an
> index on the OS column, this severely lightens the load on the query if
> we restrict our searches to particular os's.

Same problem as above: either we use small categories with a usable
interface or we define huge categories with a VERY large interface.
Agreed, the latter will result in faster performance due to more detailed
constraints in the query. But with good indexing and persistent database
connections, speed should be reasonable with small categories as well.

Regards,

Matthias

Re: [sleuthkit-developers] First Draft - Layout Hash Database

From: Brian C. <ca...@sl...> - 2004-01-31 05:14:01

[the list server is so slow this week.  I forwarded a message this 
morning and it still hasn't been posted].

So, after thinking about this thread some more, I see two problems 
being addressed at the same time.  I think they can be treated more 
independently, and merging them has caused some confusion.

1.  A small set of application categories for any hash database.

2.  An implementation of a database that can import hashes from 
multiple sources.

As I mentioned before, the categories are a problem with all databases 
and I think it would be useful if we could publish a list with 
requirements for each category.  From Doug's email, it sounds like NIST 
would be interested in such categories (assuming that they are 
comprehensive and make sense).

For the implementation, it seems that we need to have a clear goal for 
the DB.  Is it for a comprehensive DB or is it just for quick good vs 
bad lookups?  Both are needed, but can we satisfy both goals with one 
DB?  Or could that be an option at install time?  Users could choose the 
quick / dirty / less data version or the full version.  I'm not a DB 
guy, so I have no clue what the answers for this are.

It has occurred to me that there should be a 'source' column in the 
database, so that the entry can be attributed to the NSRL, hashkeeper, 
custom etc.  A version may also be useful.  The source column would also 
make it possible to remove those hashes from the DB at a later point.
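
As a rough illustration of the 'source' column, here is a sketch with
Python's sqlite3 module. The table and column names, the version labels and
the placeholder hashes are assumptions made for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hashes (
        md5            TEXT NOT NULL,
        source         TEXT NOT NULL,  -- e.g. 'NSRL', 'hashkeeper', 'custom'
        source_version TEXT            -- optional release label
    )
""")
conn.executemany(
    "INSERT INTO hashes VALUES (?, ?, ?)",
    [("a" * 32, "NSRL", "example-1"),  # placeholder hashes and versions
     ("b" * 32, "NSRL", "example-2"),
     ("c" * 32, "custom", None)],
)


def remove_source(conn, source, version=None):
    """Delete all hashes from one source, optionally one version only."""
    with conn:
        if version is None:
            cur = conn.execute(
                "DELETE FROM hashes WHERE source = ?", (source,))
        else:
            cur = conn.execute(
                "DELETE FROM hashes WHERE source = ? AND source_version = ?",
                (source, version),
            )
    return cur.rowcount


removed = remove_source(conn, "NSRL", "example-1")
```

Here only the rows from one release are removed; calling remove_source
without a version would drop everything attributed to that source.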

thanks,
brian

Re: [sleuthkit-developers] First Draft - Layout Hash Database

From: Matthias H. <mat...@mh...> - 2004-01-30 14:25:55

Brian Carrier said:
[...]
> So, after thinking about this thread some more, there are two problems
> that are being addressed at the same time and I think they can be more
> independent and I think the merging has caused some confusion.
>
> 1.  A small set of application categories for any hash database.
>
> 2.  An implementation of a database that can import hashes from
> multiple sources.
>
> As I mentioned before, the categories are a problem with all databases
> and I think it would be useful if we could publish a list with
> requirements for each category.  From Doug's email, it sounds like NIST
> would be interested in such categories (assuming that they are
> comprehensive and make sense).

Ok, then let's treat the list of applications separately. We can later
decide if/how we want to implement this in our database. I'll compile a
list with examples out of our recent discussion and post it this weekend
for further discussion.

> For the implementation, it seems that we need to have a clear goal for
> the DB.  Is it for a comprehensive DB or is it just for quick good vs
> bad lookups.  Both are needed, but can we satisfy both goals with one
> DB?  Or, could that be an option at install time.  They can chose the
> quick / dirty / less data version or the full version.  I'm not a DB
> guy, so I have no clue what the answers for this are.

After thinking about the recent discussion and your comments, I would
prefer not to separate the database but instead the interface:

- we use a comprehensive database with a large set of information for each
hash set
- upon importing, everybody can decide for himself how much data to
include in the database
- we provide a mapping table in order to map the very detailed categories
to a small set of super-categories
- we provide 2 interfaces: "quick&dirty" (->super-categories) and
"long&detailed"

The biggest part of the database is the hashsets themselves. The
organization of comprehensive add-on information doesn't use many
resources; it only requires a good data model. So we would not gain much
by using two different database models.
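
The two-interface idea on top of one data set might be sketched like this in
Python. The detailed category names, product names, placeholder hashes and
function names are all hypothetical.

```python
# Hypothetical detailed categories mapped onto super-categories.
SUPER_CATEGORY = {
    "word processor": "office tools",
    "spreadsheet": "office tools",
    "rootkit": "known-bad",
}

# Hypothetical hash records: md5 -> (product, detailed category).
RECORDS = {
    "a" * 32: ("Example Writer 1.0", "word processor"),
    "b" * 32: ("Example Rootkit", "rootkit"),
}


def lookup_detailed(md5_hex):
    """'long&detailed' interface: everything stored for the hash."""
    return RECORDS.get(md5_hex)


def lookup_quick(md5_hex):
    """'quick&dirty' interface: only the super-category."""
    record = RECORDS.get(md5_hex)
    if record is None:
        return None
    return SUPER_CATEGORY.get(record[1], "other")
```

Both functions read the same records; only the amount of detail returned
differs, which is the point of separating the interface rather than the
database.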


> It has occurred to me that there should be a 'source' column in the
> database, so that the entry can be attributed to the NSRL, hashkeeper,
> custom etc.  A version may also be useful.  This is also useful so that
> you can remove the hashes from the DB at a later point.

Good idea, I do use this already (without a version) in my forensic hash
database.


Regards,

Matthias

Re: [sleuthkit-developers] First Draft - Layout Hash Database

From: Matthias H. <mat...@mh...> - 2004-01-28 17:38:43

Brian Carrier said:
[...]
> I thought about what software I have on my systems and tried to fit it
> in, so there are some questions about what goes where.  Could you maybe
> provide requirements for software to fit into each category?

Ok, I'll fill the categories with descriptions.

> It would be nice if each entry had a static size, so that we could jump
> around the text file of the database easily.

How about this: we use fields with dynamic length in the database and use
an export tool for exporting with static sizes? We could set the maximum
length with datatypes like "varchar(40)".

> Therefore, there would be
> an index that correlates an application type to an integer.  I would
> think that doing integer comparisons would be faster than string
> comparisons though when looking entries up.  That maybe a pain to
> manage though.

Sure, we need integer identifiers for performance. I deliberately didn't
mention them because I think we first have to agree on the data model.
Things like primary keys, foreign keys, indices etc. should follow once
we have found a good data model.

>> Application entry:
>> - remote management
>> - office tools
>
> Would adobe acrobat reader and calendars fit into this category?

I would place Adobe Acrobat and calendars in the desktop category.

>> - database
>> - desktop
>
> What are examples of this category?  games?
>
>> - server daemons
>> - web
>
> A general name like network may scale better.  Would email tools fit in
> here too?
>
>> - multimedia
>> - drivers
>> - development
>> - sysutils
>> - security
>
> Would this include tools that are frequently called "hacker" tools too?
>   This category could be difficult and controversial to maintain, but I
> don't know of a better way to do it...

Sure, the problem we have is with tools like nmap, nemesis, hping etc.
(tools used for both good and bad things).
I like Matt McMillon's idea to search categories both as knowngood and
knownbad. So everybody can decide for himself at search time how
to handle this.
I think operating system categories should be known-good by default.
Each application category should get an individual default setting for
knowngood/knownbad.
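
A minimal Python sketch of per-category defaults that can be overridden at
search time. The default assignments, the "unknown" fallback and the names
used here are illustrative assumptions only.

```python
# Default verdict per category; the assignments are illustrative only.
DEFAULT_VERDICT = {
    "operating system": "known-good",
    "office tools": "known-good",
    "security": "known-good",
    "known-bad": "known-bad",
}


def verdict(category, overrides=None):
    """Return the verdict for a category.

    overrides lets the examiner flip categories for a single search.
    """
    if overrides and category in overrides:
        return overrides[category]
    return DEFAULT_VERDICT.get(category, "unknown")


# An intrusion case: treat scanners such as nmap as suspicious this time.
intrusion_case = {"security": "known-bad"}
```

For example, verdict("security") uses the default, while
verdict("security", intrusion_case) applies the examiner's choice for that
one search without changing the stored defaults.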

>> - known-bad
>
> Should there be a known-good too?  I can imagine a situation where
> someone hashes his /bin/, /sbin/, /usr/local/bin ... directories and
> doesn't want to have to identify the category of each file.

Known-bad was meant as a kind of catch-all for all possible known-bad files.
The problem is, if we segment known-bad, we'll get dozens of subcategories.
While this is no problem in the database, it will be difficult to handle
for Autopsy.


>> - other
>
> Where would child-porn fit into this?  known-bad?  That seems to be one
> of the biggest categories of hashes and may warrant its own category.

Yes, I thought it should be known-bad. During my forensic analyses, my
main objectives so far were hacking-related, not child-porn. So it may be
that I have kind of a blind spot for this problem.
Ok, let's add a separate category "child-porn" with "known-bad" as default.

>
>> Operation system entry:
>> - Linux
>> - Windows
>> - BSD
>> - Mac
>> - MacOSX
>> - Solaris
>> - DOS
>> - Handheld OS
>> - AIX
>> - HP-UX
>> - Other
>
> MacOS probably shouldn't get a separate category from OSX unless Win
> '98 is also separated from Win XP.  The specific types in BSD should be
> defined (since OS X is actually a variant of BSD).  The Solaris
> category should also include SunOS.

Ok, so BSD would include: (FreeBSD|OpenBSD|NetBSD|BSD/OS|OS X)
What about IRIX, Tru64 etc.? Did we forget a category with many entries?
The problem is, we should keep the number of OSs low for better
usability of the search interface. Off the top of my head I can think of
about three dozen operating systems, and I am probably forgetting another
dozen.


> SHA-2 maynot be a bad idea.  I recall threads in the past on other
> lists about using SHA-2, so we may want to make a field for it (even
> though the public DB don't use it  yet).  It can take the place of
> CRC32.

Good point here.

> This looks good.  I think more requirements for each app category would
> be useful though.

I'll compile a new draft with some more flesh to each category. This
should help us for a more detailed discussion of the categories.

Regards,

Matthias

Re: [sleuthkit-developers] First Draft - Layout Hash Database

From: David B. <to...@so...> - 2004-01-28 18:48:45

* Brian Carrier (ca...@sl...) wrote:

[snip]

> It would be nice if each entry had a static size, so that we could jump 
> around the text file of the database easily.  Therefore, there would be 
> an index that correlates an application type to an integer.  I would 
> think that doing integer comparisons would be faster than string 
> comparisons though when looking entries up.  That maybe a pain to 
> manage though.

Yes, it is a good idea to map application types to an integer; I even
think that the OS field should be an integer too. It shouldn't be a pain
to manage them if the import tools make the task easier. (The problem
then is to develop proper import tools ;))

> 
> >Application entry:
> >- remote management

Thinking about this category, perhaps it is included in the server
daemons category (for the servers) and network category (for the
clients)

> >- office tools
> 
> Would adobe acrobat reader and calendars fit into this category?

Yes, and even a mail client.

> 
> >- database
> >- desktop
> 
> What are examples of this category?  games?

Proper examples for this category would be games, IM, screensavers,
iconsets, wallpapers... but perhaps this category should be merged with
the multimedia category?

> 
> >- server daemons
> >- web
> 
> A general name like network may scale better.  Would email tools fit in 
> here too?

I prefer network too, but take into account that all the web
scripts (CGI, PHP, Perl, ...) should also fit in this category.

> >- multimedia
> >- drivers
> >- development
> >- sysutils
> >- security
> 
> Would this include tools that are frequently called "hacker" tools too? 
>  This category could be difficult and controversial to maintain, but I 
> don't know of a better way to do it...

I would perhaps split this category into two other categories:
security (whitehat) and malware (exploits, rootkits, ...). I know that
malware is not the right word for them, but it is the name that covers
the most types of such files. Another approach is to include only
the 'whitehat' security tools in this category and the 'blackhat' tools
in the next category (known-bad).

> >- known-bad
> 
> Should there be a known-good too?  I can imagine a situation where 
> someone hashes his /bin/, /sbin/, /usr/local/bin ... directories and 
> doesn't want to have to identify the category of each file.

Both known-bad and known-good could be a 'wrapper' for other categories.

> >- other
> 
> Where would child-porn fit into this?  known-bad?  That seems to be one 
> of the biggest categories of hashes and may warrant its own category.

According to the above, it should fit in both malware (replace this word
with other more suitable) and known-bad.

> >Operation system entry:
> >- Linux
> >- Windows
> >- BSD
> >- Mac
> >- MacOSX
> >- Solaris
> >- DOS
> >- Handheld OS
> >- AIX
> >- HP-UX
> >- Other
> 
> MacOS probably shouldn't get a separate category from OSX unless Win 
> '98 is also separated from Win XP.  The specific types in BSD should be 
> defined (since OS X is actually a variant of BSD).  The Solaris 
> category should also include SunOS.

Then we'd add OpenBSD, FreeBSD and NetBSD, and delete OSX. SunOS is
included in the Solaris category.

[snip]

> >Did we miss important fields ?
> 
> SHA-2 maynot be a bad idea.  I recall threads in the past on other 
> lists about using SHA-2, so we may want to make a field for it (even 
> though the public DB don't use it  yet).  It can take the place of 
> CRC32.

I have never used SHA-2 or CRC32. If SHA-2 is currently being
used, we should definitely add it.

> Is the file size needed?  I'm trying to think of a scenario where that 
> would be needed.

Hmm, not sure about that, but what happens when an application has
several files with the same name in different directories (and
different hashes)? In addition, we should specify the application
language in some field, because for instance the nt.dll file is
different for the Windows 2000 English version and the Windows 2000
Spanish version, both with the same patches applied.