Re: [sleuthkit-developers] Search

Hi List,

While we are on the topic of searching, I was thinking about comments made
previously on this list with regard to indexing (specifically the thread
"[sleuthkit-developers] blindly indexing garbage..."). Rather than indexing
every possible string in the image (which would take a huge amount of space
for the index file, in some cases almost the size of the image itself),
would it make more sense to have a dictionary of words to search for and
then index only the offsets of those words in the image? You would get stuck
if you wanted to search for a word not in the dictionary, but it might be
useful for a standard initial search. For example, the English language
dictionary on my Linux box is about 85k words. If all of those were indexed
(because I am unlikely ever to search for an arbitrary random string, only
English words), would this be more effective than indexing the whole thing?
I imagine the dictionary might also grow to contain non-words like hax0r or
pr0n, or even binary sequences such as magic signatures.
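
To make the idea concrete, here is a minimal sketch in Python. It is only an
illustration, not anything that exists in Sleuthkit: the function name
build_index and the in-memory "image" are assumptions of mine, and a real
implementation over 85k words would want a multi-pattern matcher rather than
one scan per word.

```python
from collections import defaultdict


def build_index(image: bytes, words):
    """Map each dictionary word to the byte offsets where it occurs.

    Hypothetical helper for illustration: 'image' is the raw image data
    held in memory, 'words' is the search dictionary (ASCII words only).
    """
    index = defaultdict(list)
    for word in words:
        needle = word.encode("ascii")
        pos = image.find(needle)
        while pos != -1:
            index[word].append(pos)
            # Continue one byte further so overlapping hits are also found.
            pos = image.find(needle, pos + 1)
    # Words with no hits never get an entry, so the index stays small.
    return dict(index)


if __name__ == "__main__":
    sample = b"xxpasswordxxhax0rxxpassword"
    print(build_index(sample, ["password", "hax0r", "missing"]))
```

Only the offsets of dictionary hits are stored, so the index size depends on
the number of hits rather than on the size of the image.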

This technique is also a great alternative to the standard "strings" method
(i.e. running strings over an image and grepping the result for hits),
because it automatically excludes the random printable strings that are
garbage in most cases. There is also no ambiguity about what constitutes a
valid string (e.g. "4 consecutive printable chars" is ambiguous: which chars
are printable depends on the target language, Unicode encoding, etc.). If the
dictionary contains the required words in a number of common Unicode
encodings (as binary sequences), they will be indexed automatically.
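
As a rough sketch of what "a number of common Unicode encodings" could mean
in practice, the Python below expands one dictionary word into the byte
sequences to look for. The function name encoded_variants and the default
list of encodings are my own assumptions, chosen only for illustration.

```python
def encoded_variants(word, encodings=("latin-1", "utf-8", "utf-16-le", "utf-16-be")):
    """Return {encoding name: byte sequence} for one dictionary word.

    Hypothetical helper: encodings that cannot represent the word are
    skipped, so each entry is a byte pattern that could appear on disk.
    """
    variants = {}
    for enc in encodings:
        try:
            variants[enc] = word.encode(enc)
        except UnicodeEncodeError:
            # e.g. latin-1 cannot encode characters outside its range
            continue
    return variants


if __name__ == "__main__":
    for enc, pattern in encoded_variants("pr0n").items():
        print(enc, pattern)
```

Each byte sequence can then be treated as just another dictionary entry, which
is why no separate notion of "printable string" is needed.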

Michael.

On Wed, 5 May 2004 03:02 am, Márcio Carneiro wrote:
> On Mon, 3 May 2004 13:19:06 -0700 (PDT), Linux Tard <lin...@ya...> wrote:
> > While not part of Sleuthkit you can use GLIMPSE.
>
> No, Glimpse only indexes text files.
>
> Lucene (http://jakarta.apache.org/lucene) seems to be a nice indexer, but
> it still doesn't index other types of files (though it has a framework for
> implementing classes that understand other formats).
>
> :-/
>
> Márcio.