Thread: [sleuthkit-users] Regular Expressions
From: Brian C. <ca...@sl...> - 2016-11-14 18:09:58
Autopsy currently has a limitation when searching with regular expressions: spaces are not supported. It's not a problem for email addresses and URLs, but it becomes an issue for phone numbers, account numbers, etc. This limitation comes from using an indexed search engine (since spaces are used to break text into tokens).

We're looking at ways of solving that and need some guidance.

If you write your own regular expressions, can you please let me know and share what they look like? We want to know how complex the expressions are that people use in real life.

Thanks!
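As a quick illustration of the tokenization behavior Brian describes, the sketch below feeds a short string through Lucene's standard analyzer: the index only ever sees individual tokens, never a term containing a space, so a regex like "\d\d\d \d\d\d\d" has nothing it can match against. This is plain Lucene for illustration, not Autopsy or Solr code, and the sample string is made up; Solr's default text analysis behaves the same way.

import java.io.IOException;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WhyRegexSpacesFail {
    public static void main(String[] args) throws IOException {
        // The analyzer breaks text into tokens at whitespace and punctuation,
        // so the index never holds a single term containing a space.
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             TokenStream ts = analyzer.tokenStream("content", "call 555 1212 tomorrow")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints: call, 555, 1212, tomorrow
            }
            ts.end();
        }
    }
}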
From: Derrick K. <dk...@gm...> - 2016-11-14 18:59:30
I tend to go with Zawinski/Lundh's mantra on this one: "Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." xD

Seriously though, I used to write a lot more regexes, especially for things like email addresses, credit cards, and credit card track 2 data, but that's all built in to the latest Autopsy! Yay!

My only comment is that I tend to gravitate towards Perl-style regex vs. POSIX (i.e. "\s" vs. "[:space:]"), and I'm often searching through fixed-column formats, e.g. looking at web server or system logs where the date would be "Nov\s\s09" or "Nov\s10". If it's anything else, like looking for a phone number, then I'll tend to do whole-word searches from an index (i.e. "555-1212") or a "\s?\d{3}-\d{4}" regex to find it.

Derrick
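For reference, the Perl-style classes Derrick mentions are what java.util.regex supports natively. A small sketch of his log-date and phone-number patterns (the sample log line is invented, not taken from the thread):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogAndPhonePatterns {
    public static void main(String[] args) {
        // Fixed-column log dates, as in Derrick's examples.
        Pattern paddedDay = Pattern.compile("Nov\\s\\s09");
        Pattern dayTen = Pattern.compile("Nov\\s10");
        // Seven-digit phone number, optionally preceded by whitespace.
        Pattern phone = Pattern.compile("\\s?\\d{3}-\\d{4}");

        String line = "Nov 10 12:34:56 host app[99]: call 555-1212 for details";
        for (Pattern p : new Pattern[] {paddedDay, dayTen, phone}) {
            Matcher m = p.matcher(line);
            while (m.find()) {
                System.out.printf("%s -> '%s' at offset %d%n", p.pattern(), m.group(), m.start());
            }
        }
    }
}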
From: Brian C. <ca...@sl...> - 2016-11-14 22:15:07
Making this a little more specific, we seem to have two options to solve this problem (which is inherent to Lucene/Solr/Elastic):

1) We store text in 32KB chunks (instead of our current 1MB chunks) and get the full power of regular expressions. The downside of the smaller chunks is that there are more boundaries, and therefore more places where a term could span a boundary and we could miss a hit. If we needed to, we could do some fancy overlapping. 32KB of text is about 12 pages of English text (less for non-English).

2) We limit the types of regular expressions that people can use and keep our 1MB chunks. We'll add some logic into Autopsy to span tokens, but we won't be able to support all expressions. For example, if you gave us "\d\d\d\s\d\d\d\d" we'd turn that into a search for "\d\d\d \d\d\d\d", but we wouldn't be able to support a search like "\d\d\d[\s-]\d\d\d\d". Well, we could in theory, but we don't want to add crazy complexity here.

So the question is whether you'd rather have smaller chunks and the full breadth of regular expressions, or a more limited set of expressions and bigger chunks. We are looking at the performance differences now, but wanted to get some initial opinions.
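To make option 2 concrete, the sketch below shows one naive way the token-spanning rewrite could work: split the user's regex on a literal "\s" into per-token sub-patterns, and reject anything more complicated (such as whitespace inside a character class). This is a guess at the kind of logic involved, not Autopsy's actual implementation.

import java.util.Arrays;
import java.util.List;

public class TokenSpanRewrite {

    // Split a user regex on a literal "\s" so each piece can be matched against
    // adjacent index tokens. Whitespace inside a character class (e.g. "[\s-]")
    // is rejected, mirroring the restriction described above.
    static List<String> splitOnWhitespaceClass(String userRegex) {
        if (userRegex.matches(".*\\[[^\\]]*\\\\s[^\\]]*\\].*")) {
            throw new IllegalArgumentException("whitespace inside a character class is not supported");
        }
        return Arrays.asList(userRegex.split("\\\\s"));
    }

    public static void main(String[] args) {
        // "\d\d\d\s\d\d\d\d" becomes the per-token patterns "\d\d\d" and "\d\d\d\d".
        System.out.println(splitOnWhitespaceClass("\\d\\d\\d\\s\\d\\d\\d\\d"));
    }
}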
From: Jon S. <JSt...@St...> - 2016-11-15 02:12:45
Presumably what you've proposed so far works off of Lucene's capabilities. The other way to go would be simply to have a background processing job to grep the text documents and save the search hits. You lose out on the speed of an indexed search, but since the text has already been extracted it may still run in a reasonable timeframe, and you could search documents concurrently.

Handling of overlaps is something that liblightgrep supports well. If you provide it the shingled overlap text, it will look into it only enough to evaluate any potential search hits beginning before the overlap and exit early if all the potential hits resolve. This way you don't have to dedupe/filter hits.

A lot of matters, especially ones involving discovery in some fashion, will revolve around a set of search terms that have been negotiated by different parties, and it is common to use regexps to reduce false positives as well as account for variances. For those types of matters, it can be tedious to perform a series of interactive searches, depending on how easy it is to record the results.

Jon
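A minimal sketch of the background-grep idea, using plain java.util.regex and a thread pool rather than liblightgrep; the ChunkStore interface is hypothetical and just stands in for wherever the already-extracted text lives:

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BackgroundGrep {

    // Hypothetical accessor for already-extracted chunk text; not an Autopsy/TSK API.
    interface ChunkStore {
        List<String> chunkIds();
        String text(String chunkId);
    }

    static void grepAll(ChunkStore store, Pattern pattern, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String id : store.chunkIds()) {
            pool.submit(() -> {
                Matcher m = pattern.matcher(store.text(id));
                while (m.find()) {
                    // A real job would save the hit (e.g. as a keyword-hit artifact)
                    // instead of printing it.
                    System.out.printf("chunk %s: hit at %d-%d%n", id, m.start(), m.end());
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}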
From: Simson G. <si...@ac...> - 2016-11-14 22:23:00
Brian,

With respect to #1 - I solved this problem with bulk_extractor by using an overlapping margin. Extend each block 1K or so into the next block; the extra 1K is called the margin. Only report hits on a string search if the text string begins in the main block, not if it begins in the margin (because then it is included entirely in the next block). You can tune the margin size to the largest text object that you wish to find with search.

Simson
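A short sketch of the margin scheme with made-up block and margin sizes; bulk_extractor itself is C++ and is described in the paper mentioned later in the thread, so this only illustrates the reporting rule:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarginSearch {
    static final int BLOCK = 32 * 1024;  // main block size (illustrative)
    static final int MARGIN = 1024;      // extra bytes carried over from the next block

    // Report only hits that begin inside the main block. A hit that begins in the
    // margin is reported by the next block instead, so no dedup pass is needed.
    static void search(String text, Pattern pattern) {
        for (int start = 0; start < text.length(); start += BLOCK) {
            int mainEnd = Math.min(start + BLOCK, text.length());
            int windowEnd = Math.min(mainEnd + MARGIN, text.length());
            Matcher m = pattern.matcher(text.substring(start, windowEnd));
            while (m.find()) {
                if (start + m.start() < mainEnd) {
                    System.out.printf("hit at %d: %s%n", start + m.start(), m.group());
                }
            }
        }
    }
}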
From: <slo...@gm...> - 2016-11-15 01:39:09
I favor complex regex, option 1, enhanced with Simson's boundary solution, if possible.
From: Tim <tim...@se...> - 2016-11-14 23:31:41
Right. Why not go with 32KB blocks and then index based on overlapped windows? To index block[N], you include this string in the index: (block[N-1] || block[N] || block[N+1]).

Then when a match occurs, you just add some logic to figure out where it actually showed up (only in the margin blocks, or partially in block[N]).

This is perhaps more naive than what Simson suggests, but with small blocks you don't need to worry about having the margins be much smaller than the block you're indexing.

tim

PS - I'm probably missing something here. I've been out of the game a while.
From: Simson G. <si...@ac...> - 2016-11-14 23:40:07
Hi Tim,

Take a look at the bulk_extractor paper, which explains this in detail. There is no need to index block[N-1] as well; just index block[N] || X bytes from block[N+1], where X is the margin.

You always need to worry about the margins, because if you don't, you double-report findings. It turns out that there are a lot of optimizations you can implement if you do things the way I recommend: for example, you never need to do duplicate suppression if you only index strings that begin in block[N], even if they extend into block[N+1].
From: Tim <tim...@se...> - 2016-11-14 23:53:15
Makes sense. But yeah, if you deal with this situation properly, then you get a lot more flexibility in what block sizes and tokenization approach you use. If you don't deal with it, large blocks don't eliminate the problem; they just make missing things less likely.

tim
From: Brian C. <ca...@sl...> - 2016-11-15 02:35:26
The Autopsy design is a bit interesting for this because we only report one hit per keyword per file in the DB. The details of each hit are then figured out when the user wants to review the file, so de-duping at that level is not important (we make an artifact at the first hit). But we need to do something fancy with Solr to allow the overlap for matching while still letting a human page through the file without reading the overlapping text and thinking, "I feel like I just read that on the previous page"…
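One possible way to keep the reading view free of the overlap (the sizes and storage layout here are assumptions, not Autopsy's actual Solr schema): store each chunk together with its trailing margin for matching, but trim the margin back off when rendering the chunk for review.

public class ChunkDisplay {
    static final int BLOCK = 32 * 1024;  // illustrative main-chunk size
    // The stored chunk is BLOCK characters of main text plus a trailing overlap margin.

    // Search against the full stored chunk so hits can span the boundary, but show
    // the reader only the main block, so paging through a file never repeats text.
    static String displayText(String storedChunk) {
        return storedChunk.length() > BLOCK ? storedChunk.substring(0, BLOCK) : storedChunk;
    }
}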
From: Luís F. N. <lfc...@gm...> - 2016-11-15 01:53:56
Hi Brian,

I didn't understand exactly how the text chunk size will help to index spaces and other chars that break words into tokens. Will you index the text twice? First with default tokenization, breaking words at spaces and similar chars, and a second time indexing the whole text chunk as one single token? Is 32KB the maximum Lucene token size?

I think you can do the second indexing (with performance consequences if you index twice; it should be configurable, so users could disable it if they do not need regex or if performance is critical). But I think you should not disable the default indexing (with tokenization), otherwise users will always have to use * as a prefix and suffix of their searches, and if they don't they will miss a lot of hits. I also do not know if they will be able to do phrase searches, because Lucene does not allow * inside a phrase search (* between two " "). I do not know about Solr and whether it has extended that.

Regards,
Luis Nassif
From: Brian C. <ca...@sl...> - 2016-11-15 02:28:59
Hi Luis,

We currently (and will in the future) maintain two "copies" of the text to support text and regexp searches. What will change if we adopt the 32KB approach is to start storing the text in a non-indexed "string" field (which has a size limit of 32KB). It will not be tokenized, and Solr will apply the regular expression to each text field.

So this is in essence what Jon was also proposing: just doing a regexp on the extracted text. Because this new field is not indexed, it will be slower. The exact performance hit is TBD.

brian
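To illustrate the two-copies idea with Lucene's lower-level API (Autopsy actually talks to Solr, and the field names here are invented): the same chunk text goes into a tokenized field for ordinary term and phrase searches, and into a stored, untokenized copy that a regular expression can be run over.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;

public class ChunkDocument {

    // Two copies of the same chunk: one indexed and tokenized for term/phrase
    // searches, and one stored as-is (not indexed) for regex scanning.
    static Document forChunk(String chunkText) {
        Document doc = new Document();
        doc.add(new TextField("content", chunkText, Field.Store.NO)); // indexed, tokenized
        doc.add(new StoredField("content_str", chunkText));           // stored only, not indexed
        return doc;
    }
}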
From: Luís F. N. <lfc...@gm...> - 2016-11-16 09:26:26
Hi Brian,

Thanks for the explanation. I do not know about Solr, but a String field in Lucene is indexed and not tokenized. I have never tried a regex on Lucene String fields, but I think it may only match at the beginning of the text unless you start the regex with something like .* (maybe I am wrong), so it will not be fast even though the field is indexed.

Regards,
Luis
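For what it's worth, Lucene's RegexpQuery is evaluated against whole terms in the index, so if a chunk were indexed as a single untokenized term (a StringField, as Luis describes), the pattern would need .* on both sides to behave like a substring search. Lucene also caps an indexed term at 32766 bytes, which is presumably where the 32KB figure comes from. A small sketch (the field name is invented; Lucene uses its own regexp syntax, so explicit classes like [0-9] are safer than Perl shorthands like \d):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.RegexpQuery;

public class UntokenizedRegexQuery {

    // Wrap the pattern in .* ... .* so it can match anywhere inside the single
    // untokenized term that holds the chunk text.
    static Query phonePattern() {
        return new RegexpQuery(new Term("content_term", ".*[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9].*"));
    }
}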