Thread: [sleuthkit-developers] First Draft - Layout Hash Database
From: Matthias H. <mat...@mh...> - 2004-01-27 18:13:14
|
Hi list,

in cooperation with David Barroso I compiled a first proposal for the structure of a hash database:

File entry:
- sha1
- md5
- os
- application
- filename
- filesize

Application entry:
- remote management
- office tools
- database
- desktop
- server daemons
- web
- multimedia
- drivers
- development
- sysutils
- security
- known-bad
- other

Operating system entry:
- Linux
- Windows
- BSD
- Mac
- MacOSX
- Solaris
- DOS
- Handheld OS
- AIX
- HP-UX
- Other

The fields per category should be easily manageable with a web-based analysis GUI (Autopsy). Usually, only one of the categories should be required for a forensic analysis step ("filter all Linux hashsums from my image", "identify application xyz on my image", ...).

Questions so far:

Do we need a separate architecture field for a hashsum entry? This will require an additional search parameter later.

Does anyone need a crc32 entry with the hashsum?

Did we miss important fields?

Did we miss important questions ;-)

Feedback for this proposal is welcome and encouraged.

Regards,
Matthias

--
Matthias Hofherr
mail: mat...@mh...
web: http://www.forinsect.de
gpg: http://www.forinsect.de/pubkey.asc
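A minimal sketch of how the proposed layout could look as SQL tables, using Python's standard sqlite3 module. The table names, column names, integer keys and the placeholder digests are all assumptions made for illustration; the proposal itself only names the fields.

```python
import sqlite3

# Hypothetical schema: names and types are illustrative assumptions,
# the proposal only lists the fields.
SCHEMA = """
CREATE TABLE os_type  (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE app_type (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE file_entry (
    sha1     TEXT    NOT NULL,
    md5      TEXT    NOT NULL,
    os_id    INTEGER NOT NULL REFERENCES os_type(id),
    app_id   INTEGER NOT NULL REFERENCES app_type(id),
    filename TEXT    NOT NULL,
    filesize INTEGER NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database with two made-up file entries."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO os_type (name) VALUES (?)",
                   [("Linux",), ("Windows",)])
    db.executemany("INSERT INTO app_type (name) VALUES (?)",
                   [("sysutils",), ("known-bad",)])
    # Placeholder digests, not real file hashes.
    db.execute("INSERT INTO file_entry VALUES (?, ?, 1, 1, ?, ?)",
               ("a" * 40, "b" * 32, "ls", 12345))
    db.execute("INSERT INTO file_entry VALUES (?, ?, 2, 2, ?, ?)",
               ("c" * 40, "d" * 32, "evil.exe", 4096))
    return db


def hashes_for_os(db, os_name):
    """Return all MD5 sums recorded for one operating system category."""
    rows = db.execute(
        "SELECT f.md5 FROM file_entry f "
        "JOIN os_type o ON o.id = f.os_id "
        "WHERE o.name = ? ORDER BY f.md5",
        (os_name,))
    return [r[0] for r in rows]
```

A call such as `hashes_for_os(build_demo_db(), "Linux")` corresponds to the "filter all Linux hashsums from my image" step mentioned in the proposal.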
From: Brian C. <ca...@sl...> - 2004-01-27 23:15:40
|
> in cooperation with David Barroso I compiled a first proposal
> for the structure of a hash database:

Great. I thought about what software I have on my systems and tried to fit it in, so there are some questions about what goes where. Could you maybe provide requirements for software to fit into each category?

> File entry:
> - sha1
> - md5
> - os
> - application
> - filename
> - filesize

It would be nice if each entry had a static size, so that we could jump around the text file of the database easily. Therefore, there would be an index that correlates an application type to an integer. I would think that doing integer comparisons would be faster than string comparisons when looking entries up. That may be a pain to manage though.

> Application entry:
> - remote management
> - office tools

Would Adobe Acrobat Reader and calendars fit into this category?

> - database
> - desktop

What are examples of this category? Games?

> - server daemons
> - web

A general name like network may scale better. Would email tools fit in here too?

> - multimedia
> - drivers
> - development
> - sysutils
> - security

Would this include tools that are frequently called "hacker" tools too? This category could be difficult and controversial to maintain, but I don't know of a better way to do it...

> - known-bad

Should there be a known-good too? I can imagine a situation where someone hashes his /bin/, /sbin/, /usr/local/bin ... directories and doesn't want to have to identify the category of each file.

> - other

Where would child-porn fit into this? known-bad? That seems to be one of the biggest categories of hashes and may warrant its own category.

> Operating system entry:
> - Linux
> - Windows
> - BSD
> - Mac
> - MacOSX
> - Solaris
> - DOS
> - Handheld OS
> - AIX
> - HP-UX
> - Other

MacOS probably shouldn't get a separate category from OSX unless Win '98 is also separated from Win XP. The specific types in BSD should be defined (since OS X is actually a variant of BSD). The Solaris category should also include SunOS.

> Questions so far:
> Do we need a separate architecture field for a hashsum entry ? This
> will require an additional search parameter later.

Probably not.

> Does anyone need a crc32 entry with the hashsum ?

I don't think it is needed. It is not best practice to use CRC, so there isn't much point in including it.

> Did we miss important fields ?

SHA-2 may not be a bad idea. I recall threads in the past on other lists about using SHA-2, so we may want to make a field for it (even though the public DBs don't use it yet). It can take the place of CRC32.

Is the file size needed? I'm trying to think of a scenario where that would be needed.

> Did we miss important questions ;-)

This looks good. I think more requirements for each app category would be useful though.

thanks,
brian
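The static-size idea can be sketched as follows: if every record has the same length and the file is sorted by hash, a lookup can seek straight to record N and binary-search without reading the whole file. The record layout used here (32-character MD5, a space, a zero-padded 4-digit category code, a newline) is purely an assumption for illustration, not a format anyone in the thread specified.

```python
import io

MD5_LEN = 32
CAT_LEN = 4                              # zero-padded integer category code
RECORD_LEN = MD5_LEN + 1 + CAT_LEN + 1   # "<md5> <code>\n"


def write_records(entries):
    """Serialize (md5, category_code) pairs as sorted fixed-width records."""
    buf = io.BytesIO()
    for md5, code in sorted(entries):
        buf.write(f"{md5} {code:0{CAT_LEN}d}\n".encode("ascii"))
    buf.seek(0)
    return buf


def lookup(f, md5):
    """Binary-search a sorted fixed-width file; return the code or None."""
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell() // RECORD_LEN
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD_LEN)          # jump directly to record number mid
        rec = f.read(RECORD_LEN).decode("ascii")
        key = rec[:MD5_LEN]
        if key == md5:
            return int(rec[MD5_LEN + 1:MD5_LEN + 1 + CAT_LEN])
        if key < md5:
            lo = mid + 1
        else:
            hi = mid
    return None
```

The same functions work on a real file opened in binary mode. The sketch assumes lowercase hex digests so that string order matches the sort order of the file.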
From: Michael C. <mic...@ne...> - 2004-01-28 08:37:37
|
Hi All,

> > File entry:
> > - sha1
> > - md5
> > - os
> > - application
> > - filename
> > - filesize
>
> It would be nice if each entry had a static size, so that we could jump
> around the text file of the database easily. Therefore, there would be
> an index that correlates an application type to an integer. I would
> think that doing integer comparisons would be faster than string
> comparisons when looking entries up. That may be a pain to
> manage though.

I find this is an important requirement, particularly for SQL databases. The os and applications should be short ints so that an index may be built on them, making it faster to search. Also, I found that building a partial index on the md5 column itself speeds things up by several orders of magnitude, but still keeps the index size reasonable so it fits well in RAM.

> > Application entry:

Are you suggesting not to name the application product at all, but rather to only contain information on the category of the application? So for example in the table "msword.exe" will have office tools as application, but not refer to Microsoft Word as a product? I really think that you still need to classify the hash set with the commercial name of the application, otherwise you would not know which specific application xyz.dll belongs to.

In general I think the approach taken by NSRL is not a bad one. I sympathise with the dilemma of not being able to rely on the hashes to get a quick yes/no answer as to whether a disk contains "bad files". I think the task set out by the NSRL is merely to identify the files. Classifying them into categories is a purely subjective decision, based for the most part on the circumstances of the case. The NSRL is used to see what applications/packages/products are installed; the decision about which of those applications are bad should be made in a separate table altogether. So I suggest making another table where you classify the products into categories etc., e.g.:

product_code/application_code/package whatever code is appropriate
product category

So the hash table should have information relating a specific hash to MS Word for example, and this new table tells us that MS Word is an office app. Similarly, if we see a hash matching Back Orifice, we consult this new table to find that Back Orifice is a hacker app. This is much more effective than having to redo the entire NSRL.

> MacOS probably shouldn't get a separate category from OSX unless Win
> '98 is also separated from Win XP. The specific types in BSD should be
> defined (since OS X is actually a variant of BSD). The Solaris
> category should also include SunOS.

I think that OSs should be granulated down as much as practically possible. So I would give Win98 a different category than WinXP. Maybe not so much as to separate the different service packs, but it's often very evident what kind of OS you are working on, and it would speed things up considerably if the database could be split into different tables, depending on the OS. This effect can be achieved by building an index on the OS column; this severely lightens the load on the query if we restrict our searches to particular OSs.

> > Questions so far:
> > Do we need a separate architecture field for a hashsum entry ? This
> > will require an additional search parameter later.

I think we do, for the reason I mentioned above: no point searching all those SPARC entries when we are clearly working on an Intel box.

> Is the file size needed? I'm trying to think of a scenario where that
> would be needed.

Sometimes it's useful to see the filesize if the file is extremely small, e.g. 1 byte or 2 bytes. It's very easy to get hash collisions on these files and the database is not reliable. In fact I think hashes should not be taken of such small files, but NSRL is full of 0 byte files.

> This looks good. I think more requirements for each app category would
> be useful though.

It would be useful to design the hash database in a way that can leverage off NSRL, since NSRL is the richest source of hashes at the moment.

Regards.
Michael.
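The separate classification table suggested in this message could look roughly like the sketch below. Table names, column names, product codes and digests are hypothetical; the real NSRL layout is not reproduced here. The hash table only identifies the product, and the category comes from a small side table that can be changed without touching the hashes.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
-- Identification only: which product does a hash belong to?
CREATE TABLE hash (
    md5          TEXT    NOT NULL,
    filename     TEXT    NOT NULL,
    product_code INTEGER NOT NULL
);
CREATE TABLE product (
    product_code INTEGER PRIMARY KEY,
    name         TEXT NOT NULL
);
-- Classification lives in its own small table (hypothetical layout).
CREATE TABLE product_category (
    product_code INTEGER PRIMARY KEY,
    category     TEXT NOT NULL
);
CREATE INDEX hash_md5_idx ON hash (md5);
""")

# Made-up product codes and placeholder digests.
db.executemany("INSERT INTO product VALUES (?, ?)",
               [(1, "Microsoft Word"), (2, "Back Orifice")])
db.executemany("INSERT INTO product_category VALUES (?, ?)",
               [(1, "office tools"), (2, "hacker tools")])
db.executemany("INSERT INTO hash VALUES (?, ?, ?)",
               [("1" * 32, "msword.exe", 1), ("2" * 32, "bo.exe", 2)])


def classify(md5):
    """Return (product name, category) for a hash, or None if unknown."""
    return db.execute(
        "SELECT p.name, c.category FROM hash h "
        "JOIN product p ON p.product_code = h.product_code "
        "JOIN product_category c ON c.product_code = h.product_code "
        "WHERE h.md5 = ?",
        (md5,)).fetchone()


def reclassify(product_code, category):
    """Change a product's category without touching the hash table."""
    with db:
        db.execute("UPDATE product_category SET category = ? "
                   "WHERE product_code = ?", (category, product_code))
```

Because the category is a per-case judgment, a different investigation could call `reclassify(2, "known-bad")` and leave the imported hashes untouched.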
From: Brian C. <ca...@sl...> - 2004-01-28 14:41:33
|
> >>> Application entry:
> So I suggest making
> another table where you classify the products into categories etc.
> e.g.:
>
> product_code/application_code/package whatever code is appropriate
> product category
>
> So the hash table should have information relating a specific hash to
> MS Word for example, and this new table tells us that MS Word is an
> office app. Similarly, if we see a hash matching Back Orifice, we
> consult this new table to find that Back Orifice is a hacker app.
> This is much more effective than having to redo the entire NSRL.

That is a really good point. The only problem we are trying to solve is the number of application categories. We could even use all of the fields that the NSRL uses and write a program to read in the NSRL and output the NSRL with the new categories.

With regard to separating by platform and more granular OS, I think that is useful for the operating system binaries. But for applications that could be harder. Many Windows apps run on different versions. If it has to be tied to every new Windows version, then it might be a pain to maintain.

thanks,
brian
From: Matthias H. <mat...@mh...> - 2004-01-28 18:12:03
|
Michael Cohen said:
[...]
> I find this is an important requirement, particularly for SQL databases.
> The os and applications should be short ints so that an index may be
> built on them, making it faster to search. Also, I found that building a
> partial index on the md5 column itself speeds things up by several
> orders of magnitude, but still keeps the index size reasonable so it
> fits well in RAM.

Performance will not be one of our bigger problems. Even with, say, 20 million entries (NSRL alone has nearly 18 million), we should get reasonable search times, provided we use some clever indexing. Sure, one problem will be importing 20 million entries. But by dropping the index and recreating it after the import we will gain much time.

The performance question is not important as long as we do not have a good data model. Adding performance features is simple textbook work.

>> > Application entry:
> Are you suggesting not to name the application product at all, but
> rather to only contain information on the category of the application?
> So for example in the table "msword.exe" will have office tools as
> application, but not refer to Microsoft Word as a product? I really
> think that you still need to classify the hash set with the commercial
> name of the application, otherwise you would not know which specific
> application xyz.dll belongs to.

I think we have to decide whether we want a kind of full management database with all possible kinds of information for a hash set, or whether we need a database with a relatively small number of categories for excluding known-goods and alerting on known-bads. For the latter, we do not need to know if "msword.exe" is from the package "Microsoft Office 2000 SP 3 Hotfix 2a". For the former, we need the detailed information.

Which brings us to another problem: do we allow duplicate entries for hashsums in the database? The former solution will allow this, the latter probably doesn't require it.

> In general I think the approach taken by NSRL is not a bad one.
[...]
> This is much more effective than
> having to redo the entire NSRL.

The thing is, it is absolutely no problem to make a database structure for NSRL. In fact, NSRL already has a full generic database structure which could be easily adapted. But this was, so far, not my intention (see above).

Yet we do not have to redo the NSRL database. We only have to define a mapping (once) for NSRL categories. Automatic import with a parser script is no problem. Since NSRL categories do not change too much, maintenance should be no problem.

>> MacOS probably shouldn't get a separate category from OSX unless Win
>> '98 is also separated from Win XP. The specific types in BSD should be
>> defined (since OS X is actually a variant of BSD). The Solaris
>> category should also include SunOS.
> I think that OSs should be granulated down as much as practically
> possible. So I would give Win98 a different category than WinXP. Maybe
> not so much as to separate the different service packs, but it's often
> very evident what kind of OS you are working on, and it would speed
> things up considerably if the database could be split into different
> tables, depending on the OS. This effect can be achieved by building an
> index on the OS column; this severely lightens the load on the query if
> we restrict our searches to particular OSs.

Same problem as above: either we use small categories with a usable interface or we define huge categories with a VERY large interface. Agreed, the latter will result in faster performance due to more detailed constraints in the query. But with good indexing and persistent database connections, speed should be reasonable with small categories as well.

Regards,
Matthias
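The import strategy mentioned above (drop the index, load the rows, recreate the index afterwards) can be sketched with Python's sqlite3 module. The table and index names are assumptions, and how much time this saves depends on the database engine and data volume; nothing here was measured.

```python
import sqlite3

INDEX_SQL = "CREATE INDEX file_md5_idx ON file_entry (md5)"


def make_db():
    """Create a hypothetical hash table with an index on the md5 column."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE file_entry "
               "(md5 TEXT NOT NULL, filename TEXT NOT NULL)")
    db.execute(INDEX_SQL)
    return db


def bulk_import(db, rows):
    """Load many rows without maintaining the index on every insert."""
    db.execute("DROP INDEX IF EXISTS file_md5_idx")
    with db:  # one transaction for the whole batch
        db.executemany(
            "INSERT INTO file_entry (md5, filename) VALUES (?, ?)", rows)
    db.execute(INDEX_SQL)  # rebuild the index once, after the load


def index_names(db):
    """List the indexes currently defined in the database."""
    return [r[0] for r in db.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index'")]
```

`rows` can be any iterable of `(md5, filename)` tuples, for example a generator that parses an NSRL export line by line.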
From: Brian C. <ca...@sl...> - 2004-01-31 05:14:01
|
[the list server is so slow this week. I forwarded a message this morning and it still hasn't been posted].

So, after thinking about this thread some more, there are two problems that are being addressed at the same time. I think they can be more independent, and I think the merging has caused some confusion.

1. A small set of application categories for any hash database.

2. An implementation of a database that can import hashes from multiple sources.

As I mentioned before, the categories are a problem with all databases and I think it would be useful if we could publish a list with requirements for each category. From Doug's email, it sounds like NIST would be interested in such categories (assuming that they are comprehensive and make sense).

For the implementation, it seems that we need to have a clear goal for the DB. Is it for a comprehensive DB or is it just for quick good vs bad lookups? Both are needed, but can we satisfy both goals with one DB? Or could that be an option at install time? Users could choose the quick / dirty / less data version or the full version. I'm not a DB guy, so I have no clue what the answers for this are.

It has occurred to me that there should be a 'source' column in the database, so that the entry can be attributed to the NSRL, hashkeeper, custom etc. A version may also be useful. This is also useful so that you can remove the hashes from the DB at a later point.

thanks,
brian
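A small sketch of the 'source' column idea, assuming a hypothetical `file_entry` table with `source` and `source_version` columns. The version labels and digests are made up. With these columns, all hashes imported from one source, or from one version of it, can be removed later with a single statement.

```python
import sqlite3


def make_db():
    """Build a demo table whose rows record where each hash came from."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE file_entry ("
               "md5 TEXT NOT NULL, filename TEXT NOT NULL, "
               "source TEXT NOT NULL, source_version TEXT)")
    # Placeholder digests and made-up version labels.
    db.executemany("INSERT INTO file_entry VALUES (?, ?, ?, ?)", [
        ("a" * 32, "notepad.exe", "NSRL", "v1"),
        ("b" * 32, "calc.exe", "NSRL", "v2"),
        ("c" * 32, "ls", "custom", None),
    ])
    db.commit()
    return db


def remove_source(db, source, version=None):
    """Delete all hashes from one source (optionally one version of it)."""
    sql = "DELETE FROM file_entry WHERE source = ?"
    params = [source]
    if version is not None:
        sql += " AND source_version = ?"
        params.append(version)
    with db:
        cur = db.execute(sql, params)
    return cur.rowcount  # number of rows removed
```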
From: Matthias H. <mat...@mh...> - 2004-01-30 14:25:55
|
Brian Carrier said:
[...]
> So, after thinking about this thread some more, there are two problems
> that are being addressed at the same time and I think they can be more
> independent and I think the merging has caused some confusion.
>
> 1. A small set of application categories for any hash database.
>
> 2. An implementation of a database that can import hashes from
> multiple sources.
>
> As I mentioned before, the categories are a problem with all databases
> and I think it would be useful if we could publish a list with
> requirements for each category. From Doug's email, it sounds like NIST
> would be interested in such categories (assuming that they are
> comprehensive and make sense).

Ok, then let's treat the list of applications separately. We can later decide if/how we want to implement this in our database. I'll compile a list with examples out of our recent discussion and post it this weekend for further discussion.

> For the implementation, it seems that we need to have a clear goal for
> the DB. Is it for a comprehensive DB or is it just for quick good vs
> bad lookups. Both are needed, but can we satisfy both goals with one
> DB? Or, could that be an option at install time. They can chose the
> quick / dirty / less data version or the full version. I'm not a DB
> guy, so I have no clue what the answers for this are.

After thinking about the recent discussion and your comments, I would prefer not to separate the database but instead the interface:
- we use a comprehensive database with a large set of information for each hash set
- upon importing, everybody can decide for himself how much data to include in the database
- we provide a mapping table in order to map the very detailed categories to a small set of super-categories
- we provide 2 interfaces: "quick&dirty" (-> super-categories) and "long&detailed"

The biggest part of the database is the hash sets themselves. The organization of comprehensive add-on information doesn't use many resources; it only requires a good data model. So we do not gain much by using two different database models.

> It has occurred to me that there should be a 'source' column in the
> database, so that the entry can be attributed to the NSRL, hashkeeper,
> custom etc. A version may also be useful. This is also useful so that
> you can remove the hashes from the DB at a later point.

Good idea, I already use this (without a version) in my forensic hash database.

Regards,
Matthias
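The "one database, two interfaces" idea can be illustrated with plain dictionaries. The detailed category names, the mapping and the sample entries below are invented for the example; only the notion of mapping detailed categories to a few super-categories comes from the message above.

```python
# Hypothetical mapping from detailed categories to super-categories.
SUPER_CATEGORY = {
    "word processor": "office tools",
    "spreadsheet": "office tools",
    "rootkit": "known-bad",
}

# Hypothetical comprehensive entries, keyed by a placeholder MD5.
ENTRIES = {
    "1" * 32: {"product": "Example Office 1.0", "category": "word processor"},
    "2" * 32: {"product": "Example Kit", "category": "rootkit"},
    "3" * 32: {"product": "Example Game", "category": "arcade game"},
}


def lookup_detailed(md5):
    """'long&detailed' interface: return everything stored for the hash."""
    return ENTRIES.get(md5)


def lookup_quick(md5):
    """'quick&dirty' interface: return only the super-category."""
    entry = ENTRIES.get(md5)
    if entry is None:
        return None
    # Detailed categories without a mapping fall back to "other".
    return SUPER_CATEGORY.get(entry["category"], "other")
```

Both functions read the same data, so nothing has to be imported twice; only the amount of detail returned differs.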
From: Matthias H. <mat...@mh...> - 2004-01-28 17:38:43
|
Brian Carrier said:
[...]
> I thought about what software I have on my systems and tried to fit it
> in, so there are some questions about what goes where. Could you maybe
> provide requirements for software to fit into each category?

Ok, I'll fill the categories with descriptions.

> It would be nice if each entry had a static size, so that we could jump
> around the text file of the database easily.

How about this: we use fields with dynamic length in the database and use an export tool for exporting with static sizes? We could set the maximal length with datatypes like "varchar(40)".

> Therefore, there would be
> an index that correlates an application type to an integer. I would
> think that doing integer comparisons would be faster than string
> comparisons when looking entries up. That may be a pain to
> manage though.

Sure, we need integer identifiers for performance. I deliberately didn't mention them because I think we first have to agree on the data model. Things like primary keys, foreign keys, indices etc. should follow when we find a good data model.

>> Application entry:
>> - remote management
>> - office tools
>
> Would Adobe Acrobat Reader and calendars fit into this category?

I would place Adobe Acrobat and calendars in the desktop category.

>> - database
>> - desktop
>
> What are examples of this category? Games?
>
>> - server daemons
>> - web
>
> A general name like network may scale better. Would email tools fit in
> here too?
>
>> - multimedia
>> - drivers
>> - development
>> - sysutils
>> - security
>
> Would this include tools that are frequently called "hacker" tools too?
> This category could be difficult and controversial to maintain, but I
> don't know of a better way to do it...

Sure, the problem we have is with tools like nmap, nemesis, hping etc. (tools used for both good and bad things). I like Matt McMillon's idea to search categories both as known-good and known-bad. So everybody can decide for himself at search time how to handle this. I think operating system categories should be known-good by default. Each application category should get an individual default setting for known-good/known-bad.

>> - known-bad
>
> Should there be a known-good too? I can imagine a situation where
> someone hashes his /bin/, /sbin/, /usr/local/bin ... directories and
> doesn't want to have to identify the category of each file.

Known-bad was kind of a catch-all for all possible known-bad files. The problem is, if we segment known-bad, we'll get dozens of subcategories. While this is no problem in the database, it will be difficult to handle for Autopsy.

>> - other
>
> Where would child-porn fit into this? known-bad? That seems to be one
> of the biggest categories of hashes and may warrant its own category.

Yes, I thought it should be known-bad. During my forensic analyses, my main objectives so far were hacking-related, not child-porn. So it may be that I have kind of a blind spot for this problem. Ok, let's add a separate category "child-porn" with "known-bad" as default.

>> Operating system entry:
>> - Linux
>> - Windows
>> - BSD
>> - Mac
>> - MacOSX
>> - Solaris
>> - DOS
>> - Handheld OS
>> - AIX
>> - HP-UX
>> - Other
>
> MacOS probably shouldn't get a separate category from OSX unless Win
> '98 is also separated from Win XP. The specific types in BSD should be
> defined (since OS X is actually a variant of BSD). The Solaris
> category should also include SunOS.

Ok, so BSD would include: (Free|Open|Net-BSD|BSD/OS|OS X)

What about IRIX, Tru64 etc.? Did we forget a category with many entries? The problem is, we should keep the number of OSs low for better usability of the search interface. Off the top of my head I can think of about three dozen operating systems, and I am probably forgetting another dozen.

> SHA-2 may not be a bad idea. I recall threads in the past on other
> lists about using SHA-2, so we may want to make a field for it (even
> though the public DBs don't use it yet). It can take the place of
> CRC32.

Good point here.

> This looks good. I think more requirements for each app category would
> be useful though.

I'll compile a new draft with some more flesh on each category. This should help us have a more detailed discussion of the categories.

Regards,
Matthias
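A sketch of per-category defaults that can be overridden at search time, as discussed above for dual-use tools such as nmap. The concrete default values are assumptions chosen for the example (only "child-porn" defaulting to known-bad and OS categories defaulting to known-good are stated in the message), and the function name is hypothetical.

```python
# Hypothetical defaults; the thread only fixes a few of these values.
DEFAULT_STATUS = {
    "Linux": "known-good",        # OS categories default to known-good
    "Windows": "known-good",
    "sysutils": "known-good",
    "security": "known-good",     # assumption: dual-use tools start as good
    "known-bad": "known-bad",
    "child-porn": "known-bad",
}


def status(category, overrides=None):
    """Resolve a category to known-good/known-bad for one search.

    `overrides` lets the investigator flip individual categories
    for the current search without changing the stored defaults.
    """
    merged = {**DEFAULT_STATUS, **(overrides or {})}
    return merged.get(category, "unknown")
```

For an intrusion case the investigator might call `status("security", {"security": "known-bad"})`, so that the same hash set is treated as suspicious for that search only.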
From: David B. <to...@so...> - 2004-01-28 18:48:45
|
* Brian Carrier (ca...@sl...) wrote:
[snip]
> It would be nice if each entry had a static size, so that we could jump
> around the text file of the database easily. Therefore, there would be
> an index that correlates an application type to an integer. I would
> think that doing integer comparisons would be faster than string
> comparisons when looking entries up. That may be a pain to
> manage though.

Yes, it is a good idea to map application types to an integer; I even think that the OS field should be an integer too. It shouldn't be a pain to manage them if the import tools make the task easier. (The problem then is to develop proper import tools ;))

> >Application entry:
> >- remote management

Thinking about this category, perhaps it is already covered by the server daemons category (for the servers) and the network category (for the clients).

> >- office tools
>
> Would Adobe Acrobat Reader and calendars fit into this category?

Yes, and even a mail client.

> >- database
> >- desktop
>
> What are examples of this category? Games?

Proper examples for this category would be games, IM, screensavers, iconsets, wallpapers... but perhaps this category should be merged with the multimedia category?

> >- server daemons
> >- web
>
> A general name like network may scale better. Would email tools fit in
> here too?

I prefer network too, but take into account that all the web scripts (CGI, PHP, Perl, ...) should also fit in this category.

> >- multimedia
> >- drivers
> >- development
> >- sysutils
> >- security
>
> Would this include tools that are frequently called "hacker" tools too?
> This category could be difficult and controversial to maintain, but I
> don't know of a better way to do it...

I would perhaps split this category into two other categories: security (whitehat) and malware (exploits, rootkits, ...). I know that malware is not the right word for them, but it is the name that covers the most types of such files. Another approach is to include only the 'whitehat' security tools in this category and the 'blackhat' tools in the next category (known-bad).

> >- known-bad
>
> Should there be a known-good too? I can imagine a situation where
> someone hashes his /bin/, /sbin/, /usr/local/bin ... directories and
> doesn't want to have to identify the category of each file.

Both known-bad and known-good could be a 'wrapper' for other categories.

> >- other
>
> Where would child-porn fit into this? known-bad? That seems to be one
> of the biggest categories of hashes and may warrant its own category.

According to the above, it should fit in both malware (replace this word with another more suitable one) and known-bad.

> >Operating system entry:
> >- Linux
> >- Windows
> >- BSD
> >- Mac
> >- MacOSX
> >- Solaris
> >- DOS
> >- Handheld OS
> >- AIX
> >- HP-UX
> >- Other
>
> MacOS probably shouldn't get a separate category from OSX unless Win
> '98 is also separated from Win XP. The specific types in BSD should be
> defined (since OS X is actually a variant of BSD). The Solaris
> category should also include SunOS.

Then we'd add OpenBSD, FreeBSD and NetBSD, and delete OSX. SunOS is included in the Solaris category.

[snip]
> >Did we miss important fields ?
>
> SHA-2 may not be a bad idea. I recall threads in the past on other
> lists about using SHA-2, so we may want to make a field for it (even
> though the public DBs don't use it yet). It can take the place of
> CRC32.

I have never used SHA-2 or CRC32. If SHA-2 is currently being used, we should definitely add it.

> Is the file size needed? I'm trying to think of a scenario where that
> would be needed.

Hmm, not sure about that, but what happens when an application has several files with the same name in different directories (and different hashes)?

In addition, we should specify the application language in some field, because for instance the nt.dll file is different for the Windows 2000 English version and the Windows 2000 Spanish version, both with the same patches applied.