Re: [sleuthkit-developers] Search
From: Michael C. <mic...@ne...> - 2004-05-05 11:46:25
Hi List,

While we are on the topic of searching, I was thinking about comments made
previously on this list with regard to indexing (specifically the thread
"[sleuthkit-developers] blindly indexing garbage...").

Rather than indexing every possible string in the image (which would take a
huge amount of space for the index file, almost the same size as the image
itself in some cases), would it make more sense to have a dictionary of words
to search for, and then index only the offsets of those words in the image?
You would get stuck if you wanted to search for a word not in the dictionary,
but it might be useful for a standard initial search.

For example, the English language dictionary on my Linux box is about 85k
words. If all of those were indexed (because I'm never likely to search for
an arbitrary random string, only English words), would this be more effective
than indexing the whole thing? I imagine the dictionary might also grow to
contain non-words like hax0r or pr0n, or even binary sequences such as magic
signatures.

This technique is also a good alternative to the standard "strings" method
(i.e. running strings over an image and grepping the result for hits),
because it automatically excludes the random printable strings that are
garbage in most cases. There is also no ambiguity about what constitutes a
valid string. A rule such as "4 consecutive printable chars" is ambiguous,
because which chars are printable depends on the target language, the Unicode
encoding, etc. If the dictionary contains the required words in a number of
common Unicode encodings (as binary sequences), they will be indexed
automatically.

Michael.

On Wed, 5 May 2004 03:02 am, Márcio Carneiro wrote:
> On Mon, 3 May 2004 13:19:06 -0700 (PDT), Linux Tard <lin...@ya...> wrote:
> > While not part of Sleuthkit you can use GLIMPSE.
>
> No, Glimpse only indexes text files.
>
> Lucene (http://jakarta.apache.org/lucene) seems to be a nice indexer, but
> it still doesn't index other types of files (though it has a framework for
> one to implement classes that understand other formats).
>
> :-/
>
> Márcio.
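
To make the dictionary idea in the message above concrete, here is a rough Python sketch, not anything that exists in Sleuthkit. The function names (`encode_variants`, `build_index`) and the choice of encodings are assumptions made for illustration. Matching is exact and case-sensitive, and the naive per-word scan would be slow for an 85k-word dictionary, where a multi-pattern matcher such as Aho-Corasick would probably be needed.

```python
from collections import defaultdict

# Assumed set of encodings; the message only says "a number of common
# Unicode encodings", so this particular list is hypothetical.
ENCODINGS = ("utf-8", "utf-16-le", "utf-16-be")


def encode_variants(word, encodings=ENCODINGS):
    """Return the distinct byte sequences for a word across the encodings."""
    variants = set()
    for enc in encodings:
        try:
            variants.add(word.encode(enc))
        except UnicodeEncodeError:
            # Skip encodings that cannot represent this word.
            continue
    return variants


def build_index(image, words, encodings=ENCODINGS):
    """Map each dictionary word to the sorted byte offsets where it occurs.

    `image` is the raw image content as a bytes object. Words with no
    hits are left out of the result.
    """
    index = defaultdict(list)
    for word in words:
        for pattern in encode_variants(word, encodings):
            if not pattern:
                continue
            start = image.find(pattern)
            while start != -1:
                index[word].append(start)
                # Advance by one byte so overlapping matches are recorded too.
                start = image.find(pattern, start + 1)
        if word in index:
            index[word].sort()
    return dict(index)
```

Calling `build_index(image_bytes, ["hax0r", "pr0n"])` would return a dict of word to offset list, which could then be saved as the index file. Binary magic signatures could be handled the same way by searching for their raw byte patterns directly instead of encoding a word.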