Re: [sleuthkit-users] Regular Expressions
From: Luís F. N. <lfc...@gm...> - 2016-11-16 09:26:26
Hi Brian,

Thanks for the explanation. I do not know about Solr, but a String field in
Lucene is indexed and not tokenized. I have never tried regex on Lucene
String fields, but I think it may match only at the beginning of the text if
you do not start the regex with something like .* (maybe I am wrong), so it
will not be fast even though the field is indexed.

Regards,
Luis

On 15 Nov 2016 at 00:28, "Brian Carrier" <ca...@sl...> wrote:

> Hi Luis,
>
> We currently (and will in the future) maintain two "copies" of the text to
> support text and regexp searches. What will change if we adopt the 32KB
> approach is that we start storing the text in a non-indexed "string" field
> (which has a size limitation of 32KB). It will not be tokenized, and Solr
> will apply the regular expression to each text field.
>
> So, this is in essence what Jon was also proposing: just doing a regexp on
> the extracted text. Because this new field is not indexed, it will be
> slower. The exact performance hit is TBD.
>
> brian
>
> On Nov 14, 2016, at 8:53 PM, Luís Filipe Nassif <lfc...@gm...> wrote:
>
> > Hi Brian,
> >
> > I didn't understand exactly how the text chunk size will help to index
> > spaces and other chars that break words into tokens. Will you index the
> > text twice? First with default tokenization, breaking words at spaces
> > and similar chars, and a second time indexing the whole text chunk as
> > one single token? Is 32KB the maximum Lucene token size? I think you can
> > do the second indexing (with performance consequences if you index
> > twice; it should be configurable, so users could disable it if they do
> > not need regex or if performance is critical). But I think you should
> > not disable the default indexing (with tokenization); otherwise users
> > will always have to use * as prefix and suffix of their searches, and if
> > they do not, they will miss a lot of hits. I do not know if they will be
> > able to do phrase searches, because Lucene does not allow * inside a
> > phrase search (* between two " "). I do not know about Solr and whether
> > it has extended that.
> >
> > Regards,
> > Luis Nassif
> >
> > 2016-11-14 20:14 GMT-02:00 Brian Carrier <ca...@sl...>:
> >
> > Making this a little more specific, we seem to have two options to
> > solve this problem (which is inherent to Lucene/Solr/Elastic):
> >
> > 1) We store text in 32KB chunks (instead of our current 1MB chunks) and
> > get the full power of regular expressions. The downside of the smaller
> > chunks is that there are more boundaries that a term could span, and we
> > could miss a hit if it does. If we needed to, we could do some fancy
> > overlapping. 32KB of text is about 12 pages of English text (less for
> > non-English).
> >
> > 2) We limit the types of regular expressions that people can use and
> > keep our 1MB chunks. We'll add some logic into Autopsy to span tokens,
> > but we won't be able to support all expressions. For example, if you
> > gave us "\d\d\d\s\d\d\d\d" we'd turn that into a search for
> > "\d\d\d \d\d\d\d", but we wouldn't be able to support a search like
> > "\d\d\d[\s-]\d\d\d\d". Well, we could in theory, but we don't want to
> > add crazy complexity here.
> >
> > So, the question is whether you'd rather have smaller chunks and the
> > full breadth of regular expressions, or a more limited set of
> > expressions and bigger chunks. We are looking at the performance
> > differences now, but wanted to get some initial opinions.
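About option 2: to make the rewriting concrete, here is a minimal sketch of
the kind of limited transformation Brian describes. It turns a bare \s into
a literal space and rejects patterns where \s sits inside a character class;
the class and method names are made up for illustration, not Autopsy's
actual code.

import java.util.regex.Pattern;

// Hypothetical sketch of the limited rewriting described in option 2:
// a bare \s becomes a literal space so the pattern can be split on
// spaces and matched across adjacent tokens. Patterns where \s sits
// inside a character class (e.g. \d\d\d[\s-]\d\d\d\d) are rejected,
// since a literal-space substitution cannot express them.
public class RegexSpanRewriter {

    // Matches a character class that contains \s, e.g. [\s-]
    private static final Pattern CLASS_WITH_WS =
            Pattern.compile("\\[[^\\]]*\\\\s[^\\]]*\\]");

    /** Returns the rewritten pattern, or null if unsupported. */
    public static String rewrite(String pattern) {
        if (CLASS_WITH_WS.matcher(pattern).find()) {
            return null; // e.g. "\d\d\d[\s-]\d\d\d\d": not supported
        }
        return pattern.replace("\\s", " ");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("\\d\\d\\d\\s\\d\\d\\d\\d"));    // \d\d\d \d\d\d\d
        System.out.println(rewrite("\\d\\d\\d[\\s-]\\d\\d\\d\\d")); // null
    }
}

A real implementation would also have to decide what to do with forms like
\s+ or \s?, which this sketch passes through unchanged.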
> >
> > > On Nov 14, 2016, at 1:09 PM, Brian Carrier <ca...@sl...> wrote:
> > >
> > > Autopsy currently has a limitation when searching for regular
> > > expressions: spaces are not supported. It's not a problem for email
> > > addresses and URLs, but it becomes an issue for phone numbers, account
> > > numbers, etc. This limitation comes from using an indexed search
> > > engine (since spaces are used to break text into tokens).
> > >
> > > We're looking at ways of solving that and need some guidance.
> > >
> > > If you write your own regular expressions, can you please let me know
> > > and share what they look like? We want to know how complex the
> > > expressions are that people use in real life.
> > >
> > > Thanks!
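And for anyone who wants to see the root cause Brian mentions, a minimal
sketch using Lucene's StandardAnalyzer (assuming lucene-core and
lucene-analyzers-common on the classpath; the field name and phone number
are just examples) shows that after tokenization no single indexed term
contains a space, so a regexp with \s has nothing to match:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Shows how whitespace tokenization splits a phone number at index
// time: after analysis there is no single term containing a space,
// so a regexp like \d\d\d\s\d\d\d\d cannot match any indexed term.
public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts =
                     analyzer.tokenStream("content", "call 555 867 5309 now")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // call, 555, 867, 5309, now
            }
            ts.end();
        }
    }
}

That is why both options above either shrink the text the regexp runs over
(option 1) or rewrite the regexp into per-token pieces (option 2).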