Re: [sleuthkit-users] Regular Expressions
From: Luís F. N. <lfc...@gm...> - 2016-11-16 09:26:26
Hi Brian,

Thanks for the explanation. I do not know about Solr, but a String field in
Lucene is indexed and not tokenized. I have never tried regex on Lucene
String fields, but I think it may match only at the beginning of the text if
you do not start the regex with something like .* (maybe I am wrong), so it
will not be fast even though the field is indexed.

Regards,
Luis

On 15 Nov 2016 at 00:28, "Brian Carrier" <ca...@sl...> wrote:

> Hi Luis,
>
> We currently (and will in the future) maintain two "copies" of the text to
> support text and regexp searches. What will change if we adopt the 32KB
> approach is that we start storing the text in a non-indexed "string" field
> (which has a size limitation of 32KB). It will not be tokenized, and Solr
> will apply the regular expression to each text field.
>
> So, this is in essence what Jon was also proposing: just doing a regexp on
> the extracted text. Because this new field is not indexed, it will be
> slower. The exact performance hit is TBD.
>
> brian
>
> On Nov 14, 2016, at 8:53 PM, Luís Filipe Nassif <lfc...@gm...> wrote:
>
> > Hi Brian,
> >
> > I didn't understand exactly how the text chunk size will help to index
> > spaces and other chars that break words into tokens. Will you index the
> > text twice? First with default tokenization, breaking words at spaces
> > and similar chars, and a second time indexing the whole text chunk as
> > one single token? Is 32KB the maximum Lucene token size? I think you can
> > do the second indexing (with performance consequences if you index
> > twice; it should be configurable, so users could disable it if they do
> > not need regex or if performance is critical). But I think you should
> > not disable the default indexing (with tokenization); otherwise users
> > will always have to use * as prefix and suffix of their searches, and if
> > they do not, they will miss a lot of hits. I do not know if they will be
> > able to do phrase searches, because Lucene does not allow * inside a
> > phrase search (* between two " "). I do not know about Solr and whether
> > it has extended that.
> >
> > Regards,
> > Luis Nassif
> >
> > 2016-11-14 20:14 GMT-02:00 Brian Carrier <ca...@sl...>:
> >
> > Making this a little more specific, we seem to have two options to
> > solve this problem (which is inherent to Lucene/Solr/Elastic):
> >
> > 1) We store text in 32KB chunks (instead of our current 1MB chunks) and
> > get the full power of regular expressions. The downside of the smaller
> > chunks is that there are more boundaries that a term could span, and we
> > could miss a hit if it does. If we needed to, we could do some fancy
> > overlapping. 32KB of text is about 12 pages of English text (less for
> > non-English).
> >
> > 2) We limit the types of regular expressions that people can use and
> > keep our 1MB chunks. We'll add some logic into Autopsy to span tokens,
> > but we won't be able to support all expressions. For example, if you
> > gave us "\d\d\d\s\d\d\d\d" we'd turn that into a search for
> > "\d\d\d \d\d\d\d", but we wouldn't be able to support a search like
> > "\d\d\d[\s-]\d\d\d\d". Well, we could in theory, but we don't want to
> > add crazy complexity here.
> >
> > So, the question is whether you'd rather have smaller chunks and the
> > full breadth of regular expressions, or a more limited set of
> > expressions and bigger chunks. We are looking at the performance
> > differences now, but wanted to get some initial opinions.
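About option 2: to make the rewriting concrete, here is a minimal sketch of
the kind of limited transformation Brian describes. It turns a bare \s into
a literal space and rejects patterns where \s sits inside a character class;
the class and method names are made up for illustration, not Autopsy's
actual code.

import java.util.regex.Pattern;

// Hypothetical sketch of the limited rewriting described in option 2:
// a bare \s becomes a literal space so the pattern can be split on
// spaces and matched across adjacent tokens. Patterns where \s sits
// inside a character class (e.g. \d\d\d[\s-]\d\d\d\d) are rejected,
// since a literal-space substitution cannot express them.
public class RegexSpanRewriter {

    // Matches a character class that contains \s, e.g. [\s-]
    private static final Pattern CLASS_WITH_WS =
            Pattern.compile("\\[[^\\]]*\\\\s[^\\]]*\\]");

    /** Returns the rewritten pattern, or null if unsupported. */
    public static String rewrite(String pattern) {
        if (CLASS_WITH_WS.matcher(pattern).find()) {
            return null; // e.g. "\d\d\d[\s-]\d\d\d\d": not supported
        }
        return pattern.replace("\\s", " ");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("\\d\\d\\d\\s\\d\\d\\d\\d"));    // \d\d\d \d\d\d\d
        System.out.println(rewrite("\\d\\d\\d[\\s-]\\d\\d\\d\\d")); // null
    }
}

A real implementation would also have to decide what to do with forms like
\s+ or \s?, which this sketch passes through unchanged.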
> >
> > > On Nov 14, 2016, at 1:09 PM, Brian Carrier <ca...@sl...> wrote:
> > >
> > > Autopsy currently has a limitation when searching for regular
> > > expressions: spaces are not supported. It's not a problem for email
> > > addresses and URLs, but it becomes an issue for phone numbers, account
> > > numbers, etc. This limitation comes from using an indexed search
> > > engine (since spaces are used to break text into tokens).
> > >
> > > We're looking at ways of solving that and need some guidance.
> > >
> > > If you write your own regular expressions, can you please let me know
> > > and share what they look like? We want to know how complex the
> > > expressions are that people use in real life.
> > >
> > > Thanks!
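And for anyone who wants to see the root cause Brian mentions, a minimal
sketch using Lucene's StandardAnalyzer (assuming lucene-core and
lucene-analyzers-common on the classpath; the field name and phone number
are just examples) shows that after tokenization no single indexed term
contains a space, so a regexp with \s has nothing to match:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Shows how whitespace tokenization splits a phone number at index
// time: after analysis there is no single term containing a space,
// so a regexp like \d\d\d\s\d\d\d\d cannot match any indexed term.
public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        try (Analyzer analyzer = new StandardAnalyzer();
             TokenStream ts =
                     analyzer.tokenStream("content", "call 555 867 5309 now")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // call, 555, 867, 5309, now
            }
            ts.end();
        }
    }
}

That is why both options above either shrink the text the regexp runs over
(option 1) or rewrite the regexp into per-token pieces (option 2).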