sleuthkit-users Mailing List for The Sleuth Kit (Page 12)


Hi Brian,

I don't quite understand how the text chunk size will help to index spaces
and the other characters that break words into tokens. Will you index the
text twice? First with the default tokenization, breaking words at spaces
and similar characters, and a second time with the whole text chunk as one
single token? Is 32KB the maximum Lucene token size? I think you can do the
second indexing, but it has a performance cost, so it should be
configurable: users could disable it if they do not need regex or if
performance is critical. But I think you should not disable the default
indexing (with tokenization). Otherwise users will always have to use * as
a prefix and suffix in their searches, or they will miss a lot of hits. I
do not know if they would be able to do phrase searches, because Lucene
does not allow * inside a phrase search (* between two " "). I do not know
about Solr and whether it has extended that.
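
To make the concern concrete, here is a minimal sketch in Python. It is only a toy model of the idea, not Lucene or Solr code: the function names are made up for illustration, the "analyzer" just splits on whitespace, and I am assuming that a regex run against an index has to match a whole indexed term.

```python
import re

def tokenize(text):
    # Toy stand-in for a default analyzer: split on whitespace only.
    return text.split()

def match_tokens(pattern, text):
    # Tokenized index: the pattern has to match one whole token.
    rx = re.compile(pattern)
    return [tok for tok in tokenize(text) if rx.fullmatch(tok)]

def match_chunk(pattern, text):
    # Raw chunk: the pattern may match anywhere, including across spaces.
    return [m.group(0) for m in re.finditer(pattern, text)]

def match_single_token(pattern, text):
    # Whole chunk indexed as ONE token: the pattern has to match all of it.
    return re.fullmatch(pattern, text, flags=re.DOTALL) is not None

text = "call 555 1234 or mail bob@example.com today"

print(match_tokens(r"\d\d\d\s\d\d\d\d", text))   # [] (no token contains a space)
print(match_chunk(r"\d\d\d\s\d\d\d\d", text))    # ['555 1234']
print(match_tokens(r"[a-z]+@[a-z.]+", text))     # ['bob@example.com']
print(match_single_token(r"mail", text))         # False
print(match_single_token(r".*mail.*", text))     # True
```

The last two lines are the reason I would keep the tokenized index: with only the single-token index, even a plain keyword needs a wildcard on both sides to hit.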

Regards,
Luis Nassif

2016-11-14 20:14 GMT-02:00 Brian Carrier <ca...@sl...>:

> Making this a little more specific, we seem to have two options to solve
> this problem (which is inherent to Lucene/Solr/Elastic):
>
> 1) We store text in 32KB chunks (instead of our current 1MB chunks) and
> can have the full power of regular expressions. The downside of the
> smaller chunks is that there are more boundaries, so there are more places
> where a term could span a boundary and we could miss the hit. If we
> needed to, we could do some fancy overlapping. 32KB of text is
> about 12 pages of English text (less for non-English).
>
> 2) We limit the types of regular expressions that people can use and keep
> our 1MB chunks. We’ll add some logic into Autopsy to span tokens, but we
> won’t be able to support all expressions. For example, if you gave us
> “\d\d\d\s\d\d\d\d” we’d turn that into a search for “\d\d\d \d\d\d\d”, but
> we wouldn’t be able to support a search like “\d\d\d[\s-]\d\d\d\d”. Well, we
> could in theory, but we don’t want to add crazy complexity here.
>
> So, the question is if you’d rather have smaller chunks and the full
> breadth of regular expressions or a more limited set of expressions and
> bigger chunks.  We are looking at the performance differences now, but
> wanted to get some initial opinions.
>
>
>
>
> > On Nov 14, 2016, at 1:09 PM, Brian Carrier <ca...@sl...> wrote:
> >
> > Autopsy currently has a limitation when searching with regular
> > expressions: spaces are not supported. It’s not a problem for email
> > addresses and URLs, but it becomes an issue for phone numbers, account
> > numbers, etc. This limitation comes from using an indexed search engine
> > (since spaces are used to break text into tokens).
> >
> > We’re looking at ways of solving that and need some guidance.
> >
> > If you write your own regular expressions, can you please let me know
> > and share what they look like. We want to know how complex the
> > expressions are that people use in real life.
> >
> > Thanks!
> > ------------------------------------------------------------------------------
> > _______________________________________________
> > sleuthkit-users mailing list
> > https://lists.sourceforge.net/lists/listinfo/sleuthkit-users
> > http://www.sleuthkit.org
>
>
>
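
On the "fancy overlapping" mentioned under option 1 in the quoted message, a rough sketch of what that could look like, in Python. This is only my assumption of the idea, not how Autopsy does or would do it: the helper names are hypothetical, sizes are counted in characters rather than bytes, and the tiny chunk size is just to keep the demo readable.

```python
import re

def overlapping_chunks(text, size, overlap):
    """Yield (offset, chunk); each chunk repeats the last `overlap`
    characters of the previous chunk."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    # Stop once the remaining text is already covered by the previous chunk.
    for start in range(0, max(len(text) - overlap, 1), step):
        yield start, text[start:start + size]

def search(text, pattern, size, overlap):
    """Run the regex on every chunk and report hits as (absolute offset, match)."""
    rx = re.compile(pattern)
    hits = set()  # a hit inside the overlap is seen twice, so de-duplicate
    for offset, chunk in overlapping_chunks(text, size, overlap):
        for m in rx.finditer(chunk):
            hits.add((offset + m.start(), m.group(0)))
    return sorted(hits)

# A phone-number-like string placed so that it straddles the 32-character boundary.
text = "x" * 28 + "555 1234" + "y" * 28
pattern = r"\d\d\d\s\d\d\d\d"

print(search(text, pattern, size=32, overlap=0))  # [] (hit is split across two chunks)
print(search(text, pattern, size=32, overlap=8))  # [(28, '555 1234')]
```

The overlap only helps for hits that are no longer than the overlap itself, so a longer match could still be missed, and every chunk stores some text twice.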



sleuthkit-users — List to discuss Autopsy and The Sleuth Kit.