text-analysis Wiki

Status: Alpha

Brought to you by: kostia76

StringMatching

Labels:

String Matching

Performing a dictionary-based search of terms on long texts may be very time expensive, if you use a naive approach such as String#indexOf(.).
The Aho-Corasick algorithm finds all matches against a dictionary in linear time, that means independently regards to the number of matches or the size of the dictionary. It particularly helpful for dictionary-based Named Entity recognition tasks.

Below some results from the comparison between the Aho-Corasick algorithm and indexOf:

Terms in dictionary: 10100

Text lenght
Aho-Corasick (ms)
indexof (ms)

1000
164
128

2000
296
260

3000
292
401

4000
341
531

5000
137
639

6000
151
770

7000
284
895

8000
254
1074

9000
337
1161

10000
348
1321

Terms in dictionary: 25100

Text lenght
Aho-Corasick (ms)
indexof (ms)

1000
771
401

2000
633
646

3000
853
961

4000
714
1308

5000
978
1616

6000
660
1899

7000
680
2319

8000
647
2546

9000
871
2916

10000
580
3186

Terms in dictionary: 40100

Text lenght
Aho-Corasick (ms)
indexof (ms)

1000
1652
512

2000
1047
1017

3000
986
1531

4000
1767
2054

5000
890
2524

6000
1496
3124

7000
1107
3589

8000
883
4163

9000
3546
4618

10000
983
5135

Usually the Aho-Corasick exceed indexOf for texts longer than 2000 chars.

As you can see in the following diagram, while the indexOf-performance increases linearly by longer texts, the Aho-Corarick- performance remains quite invariable.

Wiki: Home

text-analysis Wiki

StringMatching

String Matching

Related