[sleuthkit-users] Re: Future of indexing in Autopsy and Sleuthkit


Paul,

Here are some issues you may not have considered:
>
> Issue 1:
> I think it is advisable to limit the indexed character range to only
> alphanumeric characters instead of the current limitation of all
> printable ASCII characters.

If you limit to printable ASCII characters, there will be problems for
people outside the US (or people working with data outside the US). You
need to be able to handle Roman characters with accents. These are
normally represented with the high bit set. If the user searches for an e,
they probably want to match on è and é and possibly other e's as well.
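
To illustrate the accent matching described above, here is a minimal
sketch in Python. It assumes the text has already been decoded to Unicode;
the names fold_accents and matches are made up for this example and are not
part of Sleuthkit or Autopsy.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(query: str, candidate: str) -> bool:
    """Case-insensitive substring match that ignores accents on both sides."""
    return fold_accents(query).lower() in fold_accents(candidate).lower()
```

With this, a search for "creme" would also hit "crème", because both sides
are folded to the same unaccented form before comparison.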

Then you have the issue of Arabic, Hebrew, and 16-bit characters.

At a minimum, I think that you should transparently handle codepages
and coerce them into 7-bit ASCII. But ideally you should handle
Unicode, UTF-8, UTF-16, etc. Or do something for Arabic.
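
As a rough sketch of the minimum case (decode a known codepage, then coerce
to 7-bit ASCII), something like the following Python would do. The function
name to_ascii_key and the default of latin-1 are assumptions for
illustration only; a real tool would need to be told or to detect the
encoding of the data.

```python
import unicodedata


def to_ascii_key(raw: bytes, encoding: str = "latin-1") -> str:
    """Decode bytes from the given codepage and reduce them to 7-bit ASCII."""
    # Undecodable bytes become U+FFFD rather than raising an error.
    text = raw.decode(encoding, errors="replace")
    # Split accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Anything that is still not ASCII (combining marks, U+FFFD) is dropped.
    return decomposed.encode("ascii", errors="ignore").decode("ascii")
```

Note that this drops Arabic and Hebrew text entirely, since those letters
have no ASCII base form. That is why coercion to ASCII is only a minimum and
an index over real Unicode text is the better goal.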
>
> Issue 2:
> Human readability of the files. A speedup in the indexed searching
> process and a reduction of the size of the used files can be
> accomplished by changing the format of the index files. The
> consequence is that these cannot be read by a human anymore (No more
> text-format file). The consequences are the following:
>  - POSITIVE: Speed of searches is increased
>  - POSITIVE: Size of used files is reduced
>  - NEGATIVE: Files cannot be checked anymore with the human eye.

I do not think that this is important. The index files should be in
binary; create a tool to browse or view them.
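
To show how little work such a viewer needs, here is a sketch in Python.
The record layout (2-byte keyword length, UTF-8 keyword, 8-byte offset, all
little-endian) is purely hypothetical and is not the actual or proposed
Sleuthkit index format; write_index and dump_index are illustrative names.

```python
import io
import struct

# Hypothetical record layout, assumed for this example only:
#   uint16 keyword length | keyword bytes (UTF-8) | uint64 byte offset
LENGTH = struct.Struct("<H")
OFFSET = struct.Struct("<Q")


def write_index(stream, entries):
    """Write (keyword, offset) pairs as binary records."""
    for keyword, offset in entries:
        data = keyword.encode("utf-8")
        stream.write(LENGTH.pack(len(data)))
        stream.write(data)
        stream.write(OFFSET.pack(offset))


def dump_index(stream):
    """Read binary records back and return them as human-readable lines."""
    lines = []
    while True:
        head = stream.read(LENGTH.size)
        if len(head) < LENGTH.size:
            break  # end of file
        (length,) = LENGTH.unpack(head)
        keyword = stream.read(length).decode("utf-8")
        (offset,) = OFFSET.unpack(stream.read(OFFSET.size))
        lines.append(f"{offset:>12}  {keyword}")
    return lines


if __name__ == "__main__":
    buf = io.BytesIO()
    write_index(buf, [("password", 4096), ("secret", 123)])
    buf.seek(0)
    print("\n".join(dump_index(buf)))
```

The searcher gets the compact binary file, and anyone who wants to check
the index by eye runs the dump tool to get the text view back.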