Re: [sleuthkit-developers] First Draft - Layout Hash Database
From: Michael C. <mic...@ne...> - 2004-01-28 08:37:37
Hi All,

> > File entry:
> > - sha1
> > - md5
> > - os
> > - application
> > - filename
> > - filesize
> It would be nice if each entry had a static size, so that we could jump
> around the text file of the database easily. Therefore, there would be
> an index that correlates an application type to an integer. I would
> think that doing integer comparisons would be faster than string
> comparisons though when looking entries up. That maybe a pain to
> manage though.

I find this is an important requirement, particularly for SQL
databases. The OS and application should be short ints so that an index
may be built on them, making it faster to search. I also found that
building a partial index on the md5 column itself speeds things up by
several orders of magnitude, while keeping the index small enough to
fit well in RAM.

> > Application entry:

Are you suggesting not naming the application product at all, and only
recording the category of the application? So, for example, "msword.exe"
would have office tools as its application in the table, but would not
refer to Microsoft Word as a product? I really think that you still need
to classify the hash set with the commercial name of the application,
otherwise you would not know which specific application xyz.dll belongs
to.

In general I think the approach taken by NSRL is not a bad one. I
sympathise with the dilemma of not being able to rely on the hashes to
get a quick yes/no answer as to whether a disk contains "bad files". I
think the task the NSRL sets out to do is merely to identify the files.
Classifying them into categories is a purely subjective decision, based
for the most part on the circumstances of the case. The NSRL is used to
see what applications/packages/products are installed; the decision
about which of those applications are bad should be made in a separate
table altogether. So I suggest making another table where you classify
the products into categories etc.
e.g.:

  product_code/application_code/package (whatever code is appropriate)
  product category

So the hash table should have information relating a specific hash to
MS Word, for example, and this new table tells us that MS Word is an
office app. Similarly, if we see a hash matching Back Orifice, we
consult this new table to find that Back Orifice is a hacker app. This
is much more effective than having to redo the entire NSRL.

> MacOS probably shouldn't get a separate category from OSX unless Win
> '98 is also separated from Win XP. The specific types in BSD should be
> defined (since OS X is actually a variant of BSD). The Solaris
> category should also include SunOS.

I think that OSes should be broken down as finely as practically
possible. So I would give Win98 a different category than WinXP. Maybe
not so far as to separate the different service packs, but it's often
very evident what kind of OS you are working on, and it would speed
things up considerably if the database could be split into different
tables, depending on the OS. The same effect can be achieved by building
an index on the OS column; this greatly lightens the load on the query
if we restrict our searches to particular OSes.

> > Questions so far:
> > Do we need a separate architecture field for a hashsum entry ? This
> > will require an additional search parameter later.

I think we do, for the reason I mentioned above: there is no point
searching all those SPARC entries when we are clearly working on an
Intel box.

> Is the file size needed? I'm trying to think of a scenario where that
> would be needed.

Sometimes it's useful to see the file size if the file is extremely
small, e.g. 1 or 2 bytes. It is very easy to get hash collisions on
these files, so the database is not reliable for them. In fact I think
hashes should not be taken of such small files, but NSRL is full of
0-byte files.

> This looks good. I think more requirements for each app category would
> be useful though.
It would be useful to design the hash database in a way that can
leverage off NSRL, since NSRL is the richest source of hashes at the
moment.

Regards,
Michael.
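
To make the layout discussed in this message concrete, here is a minimal sketch in Python using the standard sqlite3 module. It is only an illustration under assumptions, not a proposed final schema: the table and column names, the MIN_SIZE cutoff, the composite (os_code, md5) index and the sample rows (placeholder hashes, not real ones) are all invented for the example. SQLite is used only because it ships with Python, and the partial md5 index mentioned above is not reproduced here.

```python
import sqlite3

# Assumed cutoff in bytes: matches on files smaller than this are ignored.
MIN_SIZE = 16

SCHEMA = """
CREATE TABLE os      (os_code INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE product (product_code INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Subjective, case-specific classification is kept apart from the hashes.
CREATE TABLE product_category (
    product_code INTEGER PRIMARY KEY REFERENCES product(product_code),
    category     TEXT NOT NULL
);
CREATE TABLE file_hash (
    sha1         TEXT NOT NULL,
    md5          TEXT NOT NULL,
    os_code      INTEGER NOT NULL REFERENCES os(os_code),
    product_code INTEGER NOT NULL REFERENCES product(product_code),
    filename     TEXT NOT NULL,
    filesize     INTEGER NOT NULL
);
-- Integer OS code first, so a search restricted to one OS stays small.
CREATE INDEX idx_hash_os_md5 ON file_hash (os_code, md5);
"""


def lookup(conn, md5, os_code, filesize):
    """Return (product name, category) rows for an md5 within one OS."""
    if filesize < MIN_SIZE:
        return []  # tiny files: a hash match tells us very little
    return conn.execute(
        """SELECT p.name, c.category
             FROM file_hash h
             JOIN product p ON p.product_code = h.product_code
             LEFT JOIN product_category c
                    ON c.product_code = h.product_code
            WHERE h.os_code = ? AND h.md5 = ?
            ORDER BY p.name""",
        (os_code, md5),
    ).fetchall()


def build_demo():
    """Create an in-memory database with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO os VALUES (?, ?)",
                     [(1, "Win98"), (2, "WinXP")])
    conn.executemany("INSERT INTO product VALUES (?, ?)",
                     [(10, "MS Word"), (20, "Back Orifice")])
    conn.executemany("INSERT INTO product_category VALUES (?, ?)",
                     [(10, "office"), (20, "hacker tool")])
    conn.executemany(
        "INSERT INTO file_hash VALUES (?, ?, ?, ?, ?, ?)",
        [("a" * 40, "1" * 32, 2, 10, "winword.exe", 8000000),
         ("b" * 40, "2" * 32, 1, 20, "bo.exe", 120000),
         ("c" * 40, "3" * 32, 2, 10, "empty.dat", 0)],
    )
    return conn
```

Calling `lookup(build_demo(), "1" * 32, 2, 8000000)` looks the hash up among the WinXP entries only and reports the product together with its category from the separate table. Re-classifying a product for a particular case then means changing one row in product_category, without touching the hash data.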