Thread: [sleuthkit-users] Re: Future of indexing in Autopsy and Sleuthkit
From: Matt B. <MB...@st...> - 2003-05-22 16:29:29
If you are feeling ambitious, why not give the option to the user? Take the suggestions you receive from this list to determine the default behavior of the application, and then give the user the option of changing that behavior if desired. In my opinion, one of the largest benefits of using open source software is its flexibility.

Matt Bergen
Lead Information Security Officer
Wyoming Department of Employment

>>> "Simson L. Garfinkel" <si...@lc...> 05/22/03 09:27AM >>>
Paul,

Here are some issues you may not have considered:

> Issue 1:
> I think it is advisable to limit the indexed character range to only
> alphanumeric characters instead of the current limitation of all
> printable ASCII characters.

If you limit to printable ASCII characters, there will be problems for people outside the US (or people working with data outside the US). You need to be able to handle roman characters with accents. These are normally represented with high bits. If the user searches for an e, they probably want to match on è and é and possibly other e's as well.

Then you have the issue of Arabic, Hebrew, and 16-bit characters.

At a minimum, I think that you should transparently handle codepages and coerce them into 7-bit ASCII. But ideally you should handle UNICODE, UTF-8, UTF-16, etc. Or do something for Arabic.

> Issue 2:
> Human readability of the files. A speedup in the indexed searching
> process and a reduction of the size of the used files can be
> accomplished by changing the format of the index files. The
> consequence is that these cannot be read by a human anymore (no more
> text-format file). The consequences are the following:
> - POSITIVE: Speed of searches is increased
> - POSITIVE: Size of used files is reduced
> - NEGATIVE: Files cannot be checked anymore with the human eye.

I do not think that this is important. The index files should be in binary; create a tool to browse or view them.
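Simson's minimum suggestion, coercing accented characters into 7-bit ASCII so that a search for "e" also finds è and é, could look roughly like the sketch below. It is only an illustration, not anything taken from Autopsy or The Sleuth Kit: the function name fold_to_ascii is made up here, and the input is assumed to be Latin-1 (ISO-8859-1) bytes.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Strip diacritics so that 'è' and 'é' are both indexed as 'e'.

    Hypothetical helper for illustration only.
    """
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside 7-bit ASCII (Arabic, Hebrew, ...) is discarded.
    return stripped.encode("ascii", "ignore").decode("ascii")


# Bytes as they might appear on disk in a Latin-1 code page (an assumption).
raw = b"caf\xe9 cr\xe8me"
print(fold_to_ascii(raw.decode("latin-1")))
```

Note that this only covers the accented roman characters. Arabic, Hebrew and other non-Latin scripts are simply dropped by the last step, so they would still need the Unicode handling Simson asks for.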
From: Paul B. <ba...@fo...> - 2003-05-23 08:03:40
Hi Simson,

Thanks for the response.

> If you limit to printable ASCII characters, there will be problems for
> people outside the US (or people working with data outside the US). You
> need to be able to handle roman characters with accents. These are
> normally represented with high-bits. If the user searches for an e,
> they probably want to match on è and é and possibly other e's as well.
>
> Then you have the issue of Arabic, Hebrew, and 16-bit characters.
>
> At a minimum, I think that you should transparently handle codepages
> and coerce them into 7-bit ASCII. But ideally you should handle
> UNICODE, UTF-8, UTF-16, etc. Or do something for Arabic.

OK. The problem with indexed searching is that you have to have a limited set of characters to search for. Otherwise it's not possible to generate an index file. The size of the index file grows exponentially with the size of the character set.

That said, I will possibly add the diacritic ASCII characters, but Unicode contains far too many characters. Therefore Unicode poses a problem.

If anyone can suggest a fix/solution I would greatly appreciate that! I'm still thinking about a better solution.

--
Paul Bakker
Fox-IT Experts in IT Security!
Haagweg 137
2281 AG RIJSWIJK
T 070 336 9999
F 070 336 9990
I www.fox-it.com
E ba...@fo...
57A6 C5EA 55E4 CC1C A967 B13C F8C0 C0FB 8135 E225

Disclaimer: This email may contain confidential information. If this message is not addressed to you, you may not retain or use the information in it for any purpose. If you have received it in error, please notify the sender and delete this message. We try to screen out viruses but take no responsibility if this email contains a virus.
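To put rough numbers on Paul's concern about the character set: the thread does not describe the index layout, so the sketch below simply assumes an index keyed on fixed-length character sequences. Under that assumption an alphabet of n characters and a key length of k gives at most n^k distinct keys. The key length of 4 and the function name possible_keys are made up for illustration.

```python
import string


def possible_keys(alphabet_size: int, key_length: int) -> int:
    """Upper bound on distinct keys for a fixed key length (assumed layout)."""
    return alphabet_size ** key_length


# Character set sizes discussed in the thread.
ALPHABETS = {
    "alphanumeric": len(string.ascii_letters + string.digits),  # a-z, A-Z, 0-9
    "printable ASCII": 0x7E - 0x20 + 1,                          # 0x20..0x7E
    "printable Latin-1": (0x7E - 0x20 + 1) + (0xFF - 0xA0 + 1),  # adds 0xA0..0xFF
    "16-bit code units": 2 ** 16,
}

KEY_LENGTH = 4  # hypothetical value, chosen only for illustration

for name, size in ALPHABETS.items():
    print(f"{name:>18}: {size:>6} chars, up to {possible_keys(size, KEY_LENGTH):,} keys")
```

This is a worst-case bound. An index that stores only the keys that actually occur on the disk image would be far smaller, so the real cost depends on design details that are not stated in the thread.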
From: Paul B. <ba...@fo...> - 2003-05-23 08:14:09
Hi Matthew,

Thanks for your response.

> Paul, is it just me, or do I read that as alphanumeric only? I often
> need to search for instances of email addresses, and while it is not
> always mandatory, having access to the @ symbol sure does speed the
> process up.

I can understand your problem. Please try to understand mine (as I see it). The problem with indexed searching is that you have to have a limited set of characters to search for. Otherwise it's not possible to generate an index file. The size of the index file grows exponentially with the size of the character set.

Therefore it might be possible to add some other characters, like the diacritic ASCII characters and maybe an @ (but then other people will want other characters too). Based on this, it will probably be configurable in the final version.

Unicode is a no-no for me because of the sheer size of the set of characters contained therein.

If anyone can suggest a fix/solution I would greatly appreciate that! I'm still thinking about a better solution.

You should remember, though, that there will always be standard searching with regexp and all. Indexed searching is just to generate a speedup for the most commonly used search strings (which in my opinion are the alphanumeric and diacritic ASCII characters. PLEASE DEBATE WITH ME ON THIS!)

> >Issue 2:
> >Human readability of the files. A speedup in the indexed
> >searching process and a reduction of the size of the used
>
> Not an issue in my opinion, in fact I agree with another post that
> mentioned making the file layout open, someone here will
> write a tool to
> read it.

I will do both. I will document the file format and provide a tool to convert it to human readable format.

--
Paul Bakker
Fox-IT
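Paul mentions that the indexed character set will probably be configurable. One way that could work is sketched below, purely as an assumption about a possible design and not as the Autopsy implementation: the indexer splits text into tokens over an alphanumeric base set plus whatever extra characters the user enables. The function name tokenize and the parameter extra_chars are hypothetical.

```python
import string
from typing import List


def tokenize(text: str, extra_chars: str = "") -> List[str]:
    """Split text into index tokens over a configurable character set.

    Hypothetical helper: the base set is alphanumeric, and extra_chars
    holds the additional characters the user chose to index.
    """
    allowed = set(string.ascii_letters + string.digits + extra_chars)
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch in allowed:
            current.append(ch)
        elif current:
            # A character outside the set ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


sample = "mail bob@example.com now"
print(tokenize(sample))                    # default: alphanumeric only
print(tokenize(sample, extra_chars="@."))  # user enabled '@' and '.'
```

With the default set the email address is broken into three separate tokens. Enabling "@" and "." keeps it as one token, which is the case Matthew raised.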
From: Brian C. <ca...@sl...> - 2003-05-23 14:19:51
Paul Bakker <ba...@fo...> said:

> > >Issue 2:
> > >Human readability of the files. A speedup in the indexed
> > >searching process and a reduction of the size of the used
> >
> > Not an issue in my opinion, in fact I agree with another post that
> > mentioned making the file layout open, someone here will
> > write a tool to
> > read it.
>
> I will do both. I will document the file format and provide a tool to
> convert it to human readable format.

Perfect. One of the goals of Autopsy is that all of its data and configuration files are open so that any tool can utilize them and one is not restricted to Autopsy if (s)he starts with it. Maybe we can eventually do some Sleuth Kit Informer articles on the format and design ...

I would actually say to keep it in text for the initial versions so that people can verify it, feel comfortable with it, and debug any issues. It can be optimized later.

thanks,
brian
From: Matthew M. S. <mm...@ta...> - 2003-06-05 09:58:57
To all-

I'm looking for the best way to regex search with Autopsy for two disjoint words. In other words, I am looking for the appearance of two names in a given sector, i.e. Bob and Mike.

What would be the best way to do this? I've tried

[Bob|bob](.*)[Mike|mike]

granted, if the names come in reverse I'm not going to get anything, so perhaps..

[[Mike|mike](.*)[Bob|bob]|[Bob|bob](.*)[Mike|mike]]

Or am I completely going the wrong direction here?

Thanks!
M
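On the question above: in most regex dialects square brackets define a character class, so [Bob|bob] matches a single character out of B, o, b or |, not the whole name. Alternatives are grouped with parentheses, or simply separated with | at the top level. The sketch below shows the idea with Python's re module. Autopsy's own search may use a different regex dialect (an assumption worth checking), and the function name mentions_both is made up for illustration.

```python
import re

# Square brackets define a character class, so "[Bob|bob]" matches ONE
# character out of B, o, b or |. The pattern from the post therefore
# matches text that contains neither name:
posted = re.compile(r"[Bob|bob](.*)[Mike|mike]")
print(bool(posted.search("ok")))  # True: "o" is in the first class, "k" in the second

# Listing both orders covers "Bob ... Mike" and "Mike ... Bob".
# IGNORECASE handles capitalisation and DOTALL lets ".*" cross line
# breaks inside the sector.
BOTH_NAMES = re.compile(r"bob.*mike|mike.*bob", re.IGNORECASE | re.DOTALL)


def mentions_both(sector_text: str) -> bool:
    """Return True if both names occur in the text, in either order."""
    return BOTH_NAMES.search(sector_text) is not None


print(mentions_both("Mike forwarded the file to Bob"))
print(mentions_both("Bob was alone"))
```

Two caveats: IGNORECASE also accepts mixed case such as "bOB", which is broader than the two spellings in the post, and the pattern matches the names inside longer words (for example "bobcat"). Word boundaries (\b in Python) would tighten that if needed.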