Re: [sleuthkit-users] Regular Expressions
From: Jon S. <JSt...@St...> - 2016-11-15 02:12:45
Presumably what you've proposed so far works off of Lucene's capabilities. The other way to go would be simply to have a background processing job grep the extracted text documents and save the search hits. You lose the speed of an indexed search, but since the text has already been extracted it may still run in a reasonable timeframe, and you could search documents concurrently.

Handling of overlaps is something that liblightgrep supports well. If you provide it the shingled overlap text, it will look into it only far enough to evaluate any potential search hits beginning before the overlap, and it exits early once all the potential hits resolve. This way you don't have to dedupe/filter hits.

A lot of matters, especially ones involving discovery in some fashion, will revolve around a set of search terms that have been negotiated by different parties, and it is common to use regexps to reduce false positives as well as to account for variances. For those types of matters it can be tedious to perform a series of interactive searches, depending on how easy it is to record the results.

Jon

> On Nov 14, 2016, at 5:18 PM, Brian Carrier <ca...@sl...> wrote:
>
> Making this a little more specific, we seem to have two options to solve this problem (which is inherent to Lucene/Solr/Elastic):
>
> 1) We store text in 32KB chunks (instead of our current 1MB chunks) and can have the full power of regular expressions. The downside of the smaller chunks is that there are more boundaries, and we could miss a hit if a term spans one of them. If we needed to, we could do some fancy overlapping. 32KB of text is about 12 pages of English text (less for non-English).
>
> 2) We limit the types of regular expressions that people can use and keep our 1MB chunks. We'll add some logic into Autopsy to span tokens, but we won't be able to support all expressions. For example, if you gave us "\d\d\d\s\d\d\d\d" we'd turn that into a search for "\d\d\d \d\d\d\d", but we wouldn't be able to support a search like "\d\d\d[\s-]\d\d\d\d". Well, we could in theory, but we don't want to add crazy complexity here.
>
> So, the question is whether you'd rather have smaller chunks and the full breadth of regular expressions, or a more limited set of expressions and bigger chunks. We are looking at the performance differences now, but wanted to get some initial opinions.
>
>> On Nov 14, 2016, at 1:09 PM, Brian Carrier <ca...@sl...> wrote:
>>
>> Autopsy currently has a limitation when searching for regular expressions: spaces are not supported. It's not a problem for email addresses and URLs, but it becomes an issue for phone numbers, account numbers, etc. This limitation comes from using an indexed search engine (since spaces are used to break text into tokens).
>>
>> We're looking at ways of solving that and need some guidance.
>>
>> If you write your own regular expressions, can you please let me know and share what they look like? We want to know how complex the expressions are that people use in real life.
>>
>> Thanks!
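For readers who want to see the overlap idea concretely, here is a minimal sketch in plain Java of searching overlapping ("shingled") chunks without a dedupe pass. It uses java.util.regex rather than liblightgrep's actual streaming API, and CHUNK_SIZE and OVERLAP are illustrative values, not Autopsy's: a hit is kept only if it begins before the overlap region, since anything starting inside the overlap will be re-found at the start of the next chunk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: plain java.util.regex over overlapping chunks, not
// liblightgrep. CHUNK_SIZE and OVERLAP are made-up values; the overlap
// must be at least as long as the longest match you expect to find.
public class OverlapSearchSketch {

    static final int CHUNK_SIZE = 32 * 1024; // hypothetical chunk size
    static final int OVERLAP = 1024;         // hypothetical shingle size

    /** Returns the start offsets of all hits in fullText, scanning one chunk at a time. */
    static List<Integer> search(String fullText, Pattern pattern) {
        List<Integer> hits = new ArrayList<>();
        for (int start = 0; start < fullText.length(); start += CHUNK_SIZE) {
            int end = Math.min(start + CHUNK_SIZE + OVERLAP, fullText.length());
            Matcher m = pattern.matcher(fullText.substring(start, end));
            while (m.find()) {
                // Keep a hit only if it begins before the overlap region.
                // A hit that starts inside the overlap will be seen again
                // at the start of the next chunk, so no dedupe pass is needed.
                if (m.start() < CHUNK_SIZE) {
                    hits.add(start + m.start());
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "call 555 1234 or 555 9876";
        System.out.println(search(text, Pattern.compile("\\d\\d\\d \\d\\d\\d\\d")));
    }
}
```

The trade-off is that a hit starting before the overlap but running past its end is still truncated, so the overlap has to be sized to the longest match the expressions can produce.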
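And for option 2, a hypothetical sketch of the limited token-spanning rewrite Brian describes: a bare \s is turned into a literal space so the expression lines up with whitespace-tokenized text, while anything richer, such as a character class like [\s-], is rejected. This is not Autopsy's actual implementation, just an illustration of where the simple rewrite stops.

```java
// Hypothetical illustration of the option-2 rewrite, NOT Autopsy's code:
// a bare \s becomes a literal space; \s inside a character class (e.g.
// "[\s-]") falls outside the simple rewrite and is rejected.
public class TokenSpanRewriteSketch {

    static String rewrite(String userRegex) {
        // Reject \s that appears inside [...] -- the rewrite can't express it.
        if (userRegex.matches(".*\\[[^\\]]*\\\\s[^\\]]*\\].*")) {
            throw new IllegalArgumentException(
                    "\\s inside a character class is not supported");
        }
        // String.replace works on literal text, so this swaps the two
        // characters '\' 's' for a single space.
        return userRegex.replace("\\s", " ");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("\\d\\d\\d\\s\\d\\d\\d\\d")); // prints \d\d\d \d\d\d\d
        try {
            rewrite("\\d\\d\\d[\\s-]\\d\\d\\d\\d");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage()); // class form is rejected
        }
    }
}
```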