<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to StringMatching</title><link>https://sourceforge.net/p/text-analysis/wiki/StringMatching/</link><description>Recent changes to StringMatching</description><atom:link href="https://sourceforge.net/p/text-analysis/wiki/StringMatching/feed" rel="self"/><language>en</language><lastBuildDate>Tue, 20 May 2014 09:36:46 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/text-analysis/wiki/StringMatching/feed" rel="self" type="application/rss+xml"/><item><title>StringMatching modified by Kostia</title><link>https://sourceforge.net/p/text-analysis/wiki/StringMatching/</link><description>&lt;div class="markdown_content"&gt;&lt;h2 id="string-matching"&gt;String Matching&lt;/h2&gt;
&lt;p&gt;A dictionary-based search for terms in long texts can be very time-consuming with a naive approach such as calling String#indexOf(.) once per term.&lt;br /&gt;
The Aho-Corasick algorithm finds all dictionary matches in a single pass over the text: its search time is linear in the text length plus the number of matches and, once the automaton is built, does not depend on the size of the dictionary. It is particularly helpful for dictionary-based Named Entity Recognition tasks. &lt;/p&gt;
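&lt;p&gt;The difference between the two approaches can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation that was benchmarked: the class name DictionarySearchSketch and its methods are hypothetical, every term is assumed to be non-empty, terms and text are assumed to contain only the letters 'a' to 'z', and the sketch counts matches rather than reporting their positions. The naive method scans the text once per term, while the automaton reads each character of the text exactly once.&lt;/p&gt;

```java
// Hypothetical sketch: naive indexOf search vs. an array-based Aho-Corasick automaton.
// Assumptions: every term is non-empty; terms and text contain only 'a'..'z'.
class DictionarySearchSketch {
    static final int ALPHABET = 26;

    final int[][] next; // transitions; missing trie edges are later filled via failure links
    final int[] fail;   // failure link of each state (longest proper suffix that is a trie prefix)
    final int[] hits;   // number of dictionary terms that end when this state is reached
    int states = 1;     // state 0 is the root

    DictionarySearchSketch(String[] terms) {
        int capacity = 1;
        for (String t : terms) capacity += t.length();
        next = new int[capacity][ALPHABET];
        fail = new int[capacity];
        hits = new int[capacity];

        // Phase 1: build the trie. A 0 entry means "no edge", since the root is never a child.
        for (String t : terms) {
            int s = 0;
            for (char ch : t.toCharArray()) {
                int c = ch - 'a';
                if (next[s][c] == 0) next[s][c] = states++;
                s = next[s][c];
            }
            hits[s]++;
        }

        // Phase 2: breadth-first pass that sets failure links and completes the transitions.
        int[] queue = new int[capacity];
        int head = 0, tail = 0;
        for (int c = 0; c != ALPHABET; c++) {
            if (next[0][c] != 0) queue[tail++] = next[0][c]; // depth-1 states fail to the root
        }
        while (head != tail) {
            int s = queue[head++];
            hits[s] += hits[fail[s]]; // inherit terms that end at the failure state
            for (int c = 0; c != ALPHABET; c++) {
                int child = next[s][c];
                if (child == 0) {
                    next[s][c] = next[fail[s]][c];
                } else {
                    fail[child] = next[fail[s]][c];
                    queue[tail++] = child;
                }
            }
        }
    }

    // Single pass over the text: one table lookup per character.
    int countMatches(String text) {
        int s = 0, count = 0;
        for (char ch : text.toCharArray()) {
            s = next[s][ch - 'a'];
            count += hits[s];
        }
        return count;
    }

    // Naive baseline: one or more indexOf scans of the text per dictionary term.
    static int naiveCount(String text, String[] terms) {
        int count = 0;
        for (String term : terms) {
            int from = text.indexOf(term);
            while (from != -1) {
                count++;
                from = text.indexOf(term, from + 1);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String[] terms = {"he", "she", "his", "hers"};
        String text = "ushers";
        System.out.println(naiveCount(text, terms));                              // prints 3
        System.out.println(new DictionarySearchSketch(terms).countMatches(text)); // prints 3
    }
}
```

&lt;p&gt;Both methods find the same three matches ("she", "he" and "hers"). The sketch uses a fixed 26-letter transition table to stay short; a real dictionary would need a wider alphabet or map-based transitions.&lt;/p&gt;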
&lt;p&gt;Below are some results from a comparison between the Aho-Corasick algorithm and indexOf: &lt;/p&gt;
&lt;p&gt;Terms in dictionary: 10100 &lt;/p&gt;
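&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Text length (chars)&lt;/th&gt;&lt;th&gt;Aho-Corasick (ms)&lt;/th&gt;&lt;th&gt;indexOf (ms)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1000&lt;/td&gt;&lt;td&gt;164&lt;/td&gt;&lt;td&gt;128&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2000&lt;/td&gt;&lt;td&gt;296&lt;/td&gt;&lt;td&gt;260&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;3000&lt;/td&gt;&lt;td&gt;292&lt;/td&gt;&lt;td&gt;401&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4000&lt;/td&gt;&lt;td&gt;341&lt;/td&gt;&lt;td&gt;531&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;5000&lt;/td&gt;&lt;td&gt;137&lt;/td&gt;&lt;td&gt;639&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;6000&lt;/td&gt;&lt;td&gt;151&lt;/td&gt;&lt;td&gt;770&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;7000&lt;/td&gt;&lt;td&gt;284&lt;/td&gt;&lt;td&gt;895&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;8000&lt;/td&gt;&lt;td&gt;254&lt;/td&gt;&lt;td&gt;1074&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;9000&lt;/td&gt;&lt;td&gt;337&lt;/td&gt;&lt;td&gt;1161&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;10000&lt;/td&gt;&lt;td&gt;348&lt;/td&gt;&lt;td&gt;1321&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;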
&lt;p&gt;Terms in dictionary: 25100 &lt;/p&gt;
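&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Text length (chars)&lt;/th&gt;&lt;th&gt;Aho-Corasick (ms)&lt;/th&gt;&lt;th&gt;indexOf (ms)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1000&lt;/td&gt;&lt;td&gt;771&lt;/td&gt;&lt;td&gt;401&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2000&lt;/td&gt;&lt;td&gt;633&lt;/td&gt;&lt;td&gt;646&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;3000&lt;/td&gt;&lt;td&gt;853&lt;/td&gt;&lt;td&gt;961&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4000&lt;/td&gt;&lt;td&gt;714&lt;/td&gt;&lt;td&gt;1308&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;5000&lt;/td&gt;&lt;td&gt;978&lt;/td&gt;&lt;td&gt;1616&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;6000&lt;/td&gt;&lt;td&gt;660&lt;/td&gt;&lt;td&gt;1899&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;7000&lt;/td&gt;&lt;td&gt;680&lt;/td&gt;&lt;td&gt;2319&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;8000&lt;/td&gt;&lt;td&gt;647&lt;/td&gt;&lt;td&gt;2546&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;9000&lt;/td&gt;&lt;td&gt;871&lt;/td&gt;&lt;td&gt;2916&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;10000&lt;/td&gt;&lt;td&gt;580&lt;/td&gt;&lt;td&gt;3186&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;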
&lt;p&gt;Terms in dictionary: 40100 &lt;/p&gt;
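&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Text length (chars)&lt;/th&gt;&lt;th&gt;Aho-Corasick (ms)&lt;/th&gt;&lt;th&gt;indexOf (ms)&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1000&lt;/td&gt;&lt;td&gt;1652&lt;/td&gt;&lt;td&gt;512&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2000&lt;/td&gt;&lt;td&gt;1047&lt;/td&gt;&lt;td&gt;1017&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;3000&lt;/td&gt;&lt;td&gt;986&lt;/td&gt;&lt;td&gt;1531&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4000&lt;/td&gt;&lt;td&gt;1767&lt;/td&gt;&lt;td&gt;2054&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;5000&lt;/td&gt;&lt;td&gt;890&lt;/td&gt;&lt;td&gt;2524&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;6000&lt;/td&gt;&lt;td&gt;1496&lt;/td&gt;&lt;td&gt;3124&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;7000&lt;/td&gt;&lt;td&gt;1107&lt;/td&gt;&lt;td&gt;3589&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;8000&lt;/td&gt;&lt;td&gt;883&lt;/td&gt;&lt;td&gt;4163&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;9000&lt;/td&gt;&lt;td&gt;3546&lt;/td&gt;&lt;td&gt;4618&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;10000&lt;/td&gt;&lt;td&gt;983&lt;/td&gt;&lt;td&gt;5135&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;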
&lt;p&gt;In these measurements, Aho-Corasick outperforms indexOf for texts longer than about 2000 characters. &lt;/p&gt;
&lt;p&gt;As the following diagram shows, the indexOf running time grows linearly with text length, while the Aho-Corasick running time stays roughly constant. &lt;img alt="" src="/userapps/trac/kostia76/raw-attachment/wiki/StringMatching/Performance%20Comparison%20Diagram.jpg" /&gt;&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Kostia</dc:creator><pubDate>Tue, 20 May 2014 09:36:46 -0000</pubDate><guid>https://sourceforge.net7edd1d48561dbd9ae80f70e700ce73688941e74e</guid></item></channel></rss>