[sleuthkit-users] RE: Future of indexing in Autopsy and Sleuthkit


Hi Simson,

Thanks for the response.

> If you limit to printable ASCII characters, there will be problems for
> people outside the US (or people working with data outside the US). You
> need to be able to handle roman characters with accents. These are
> normally represented with high-bits. If the user searches for an e,
> they probably want to match on è and é and possibly other e's as well.
>
> Then you have the issue of Arabic, Hebrew, and 16-bit characters.
>
> At a minimum, I think that you should transparently handle codepages
> and coerce them into 7-bit ASCII. But ideally you should handle
> UNICODE, UTF-8, UTF-16, etc. Or do something for Arabic.

OK. The problem with indexed searching is that you have to have a limited
set of characters to search for. Otherwise it's not possible to generate
an index file. The size of the index file grows exponentially with the
size of the character set.

That said, I will possibly add the accented (high-bit) characters, but
Unicode contains far too many characters. Unicode therefore remains a
problem.
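
One possible way to keep the index alphabet small while still matching
accented letters is to fold them to their base ASCII letter before indexing
and before searching, along the lines Simson suggests. The following is only
a rough sketch in Python, not something that exists in Autopsy or Sleuthkit;
the function name fold_to_ascii is made up for illustration, and it assumes
the bytes have already been decoded from their codepage (Latin-1 in the
example).

```python
import unicodedata

def fold_to_ascii(text: str) -> str:
    """Fold accented Latin letters to their base ASCII letter.

    Hypothetical helper for illustration only. Characters with no
    ASCII base letter (Arabic, Hebrew, and also letters such as
    'ø' or 'ß' that do not decompose) are simply dropped here.
    """
    # NFKD splits 'é' into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove the combining marks, keeping the base letters.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Drop whatever is still outside 7-bit ASCII.
    return base.encode("ascii", "ignore").decode("ascii")

# Bytes 0xE8 and 0xE9 are 'è' and 'é' in Latin-1; both fold to 'e'.
folded = fold_to_ascii(b"\xe8\xe9".decode("latin-1"))
```

With this kind of folding a search for "e" would also hit è and é, and the
index only ever sees 7-bit ASCII. It does nothing for Arabic, Hebrew or
other non-Latin scripts, which are dropped, so that part of the problem
stays open.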

If anyone can suggest a fix/solution I would greatly appreciate that!

I'm still thinking about a better solution.

--
Paul Bakker

Fox-IT Experts in IT Security!
Haagweg 137
2281 AG RIJSWIJK
T 070 336 9999
F 070 336 9990
I www.fox-it.com
E ba...@fo...
57A6 C5EA 55E4 CC1C A967 B13C F8C0 C0FB 8135 E225
