[sleuthkit-users] Re: Future of indexing in Autopsy and Sleuthkit
From: Simson L. G. <si...@lc...> - 2003-05-22 15:30:28
Paul,

Here are some issues you may not have considered:

> Issue 1:
> I think it is advisable to limit the indexed character range to only
> alphanumeric characters instead of the current limitation of all
> printable ASCII characters.

If you limit to printable ASCII characters, there will be problems for people outside the US (or people working with data outside the US). You need to be able to handle Roman characters with accents. These are normally represented with the high bit set. If the user searches for an e, they probably want to match on è and é, and possibly other e's as well. Then you have the issue of Arabic, Hebrew, and 16-bit characters.

At a minimum, I think that you should transparently handle codepages and coerce them into 7-bit ASCII. But ideally you should handle Unicode, UTF-8, UTF-16, etc. Or do something for Arabic.

> Issue 2:
> Human readability of the files. A speedup in the indexed searching
> process and a reduction of the size of the used files can be
> accomplished by changing the format of the index files. The
> consequence is that these cannot be read by a human anymore (No more
> text-format file). The consequences are the following:
> - POSITIVE: Speed of searches is increased
> - POSITIVE: Size of used files is reduced
> - NEGATIVE: Files cannot be checked anymore with the human eye.

I do not think that human readability is important. The index files should be in binary; create a tool to browse or view them.
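
To make the Issue 1 suggestion concrete, here is a rough sketch of coercing codepage text into 7-bit ASCII, so that a search for "e" also matches è and é. It is written in Python purely for illustration and is not Sleuthkit or Autopsy code. The function names (decode_bytes, fold_to_ascii) and the Latin-1 default codepage are assumptions made for the example.

```python
import unicodedata


def decode_bytes(raw: bytes, codepage: str = "latin-1") -> str:
    """Decode raw bytes using an assumed codepage (hypothetical default: Latin-1)."""
    return raw.decode(codepage, errors="replace")


def fold_to_ascii(text: str) -> str:
    """Fold accented Latin letters to their base letters and lowercase the result.

    NFKD normalization splits a character such as 'é' into 'e' plus a
    combining accent; dropping the combining marks leaves the base letter.
    Letters with no decomposition (for example Arabic or Hebrew letters)
    pass through unchanged, so they still need separate handling.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


if __name__ == "__main__":
    raw = b"caf\xe9 cr\xe8me"  # Latin-1 bytes for "café crème"
    print(fold_to_ascii(decode_bytes(raw)))
```

Applying the same folding to both the indexed data and the search term would keep the two comparable. Whether to index the folded form alongside the original bytes, or instead of them, is a design choice this sketch leaves open.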