Performing a dictionary-based search of terms on long texts may be very time expensive, if you use a naive approach such as String#indexOf(.).
The Aho-Corasick algorithm finds all matches against a dictionary in linear time, that means independently regards to the number of matches or the size of the dictionary. It particularly helpful for dictionary-based Named Entity recognition tasks.
Below some results from the comparison between the Aho-Corasick algorithm and indexOf:
Terms in dictionary: 10100
Text lenght
Aho-Corasick (ms)
indexof (ms)
1000
164
128
2000
296
260
3000
292
401
4000
341
531
5000
137
639
6000
151
770
7000
284
895
8000
254
1074
9000
337
1161
10000
348
1321
Terms in dictionary: 25100
Text lenght
Aho-Corasick (ms)
indexof (ms)
1000
771
401
2000
633
646
3000
853
961
4000
714
1308
5000
978
1616
6000
660
1899
7000
680
2319
8000
647
2546
9000
871
2916
10000
580
3186
Terms in dictionary: 40100
Text lenght
Aho-Corasick (ms)
indexof (ms)
1000
1652
512
2000
1047
1017
3000
986
1531
4000
1767
2054
5000
890
2524
6000
1496
3124
7000
1107
3589
8000
883
4163
9000
3546
4618
10000
983
5135
Usually the Aho-Corasick exceed indexOf for texts longer than 2000 chars.
As you can see in the following diagram, while the indexOf-performance increases linearly by longer texts, the Aho-Corarick- performance remains quite invariable.