<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent changes to TextSimilarity</title><link>https://sourceforge.net/p/webtextanalysis/wiki/TextSimilarity/</link><description>Recent changes to TextSimilarity</description><atom:link href="https://sourceforge.net/p/webtextanalysis/wiki/TextSimilarity/feed" rel="self"/><language>en</language><lastBuildDate>Tue, 20 May 2014 08:56:24 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/webtextanalysis/wiki/TextSimilarity/feed" rel="self" type="application/rss+xml"/><item><title>TextSimilarity modified by Kostia</title><link>https://sourceforge.net/p/webtextanalysis/wiki/TextSimilarity/</link><description>&lt;div class="markdown_content"&gt;&lt;h2 id="text-similarity"&gt;Text Similarity&lt;/h2&gt;
&lt;p&gt;We implement several metrics to measure the similarity between words, sentences or documents (expressed as bags of words). Word similarity is based on the individual characters, while sentence and document similarity is based on the overlapping words. &lt;/p&gt;
&lt;p&gt;Regarding the &lt;strong&gt;word similarity&lt;/strong&gt;, the following metrics are implemented: &lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;em&gt;Jaccard&lt;/em&gt; coefficient is obtained by dividing the size of the intersection by the size of the union. Like the cosine similarity, it is a normalized measure of overlap. &lt;/li&gt;
&lt;li&gt;The &lt;em&gt;Jaro&lt;/em&gt; distance is mainly used in the area of record linkage (duplicate detection) because it takes into account typical spelling deviations. &lt;/li&gt;
&lt;li&gt;The &lt;em&gt;Jaro-Winkler&lt;/em&gt; distance is an extension of the Jaro distance that gives a higher score to strings sharing a common prefix. &lt;/li&gt;
&lt;li&gt;The &lt;em&gt;Levenshtein&lt;/em&gt; distance function (also known as edit distance) is defined as the minimum number of single-character edits (insertions, deletions or substitutions) needed to transform one string into the other. &lt;/li&gt;
&lt;li&gt;The &lt;em&gt;Luhn&lt;/em&gt; metric is based on Luhn's paper "The Automatic Creation of Literature Abstracts". &lt;/li&gt;
&lt;li&gt;The &lt;em&gt;Soundex&lt;/em&gt; distance computes the Soundex phonetic representation of each word and then compares the two codes using the Jaro-Winkler distance. &lt;/li&gt;
&lt;/ul&gt;
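&lt;p&gt;As an illustration of a character-based word metric, the following is a minimal sketch of a Jaccard coefficient over the sets of distinct characters of two words. It is not the library's code: the class name &lt;code&gt;JaccardSketch&lt;/code&gt;, the method &lt;code&gt;jaccard&lt;/code&gt;, the use of character sets (rather than, say, character n-grams) and the value returned for two empty strings are all assumptions made for this example. It requires Java 10 or later because of &lt;code&gt;var&lt;/code&gt;. &lt;/p&gt;

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of eu.kostia.textanalysis.similarity.
class JaccardSketch {

    // Jaccard coefficient of the sets of distinct characters of two words.
    static double jaccard(String first, String second) {
        // Each word becomes the set of its distinct UTF-16 code units.
        var a = first.chars().boxed().collect(Collectors.toSet());
        var b = second.chars().boxed().collect(Collectors.toSet());

        // Size of the intersection: characters of a that also occur in b.
        long intersection = a.stream().filter(b::contains).count();
        // Size of the union: distinct characters occurring in either word.
        long union = Stream.concat(a.stream(), b.stream()).distinct().count();

        // Assumed convention: two empty words are treated as identical.
        if (union == 0) {
            return 1.0;
        }
        return (double) intersection / union;
    }

    public static void main(String[] args) {
        System.out.println(jaccard("night", "nacht"));
    }
}
```

&lt;p&gt;For "night" and "nacht" the two words share three distinct characters out of seven in total, so the sketch divides 3 by 7. &lt;/p&gt;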
&lt;p&gt;All of these metrics are implemented in the package &lt;code&gt;eu.kostia.textanalysis.similarity&lt;/code&gt;. To add your own measure, extend the class &lt;code&gt;AbstractStringSimilarityMetric&lt;/code&gt; or implement the interface &lt;code&gt;StringSimilarityMetric&lt;/code&gt;. &lt;/p&gt;
&lt;p&gt;On the other hand, &lt;strong&gt;sentence or bag-of-words similarity&lt;/strong&gt; is computed by counting the number of matching words (case sensitive) between two sentences, regardless of their order. The raw count is then normalized to a coefficient in the range &lt;span&gt;[0,1]&lt;/span&gt; according to the sentence lengths. F-Measure, Dice Coefficient and cosine similarity are available and computed as follows: &lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;overlaps&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;nd&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;
&lt;span class="n"&gt;recall&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;overlaps&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;
&lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;measure&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;recall&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;recall&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Dice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;overlaps&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sum&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;both&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;Cosine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;overlaps&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;sqrt&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;number&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="n"&gt;in&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;nd&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;
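&lt;p&gt;The formulas above can be sketched in Java as follows. This is an illustration only, not the code of &lt;code&gt;SentenceSimilarity&lt;/code&gt; or &lt;code&gt;BagOfWordsSimilarity&lt;/code&gt;: the class &lt;code&gt;OverlapMetricsSketch&lt;/code&gt; and its method names are hypothetical, sentences are assumed to be non-empty and already split into tokens, and each distinct shared word is assumed to count once as an overlap. It requires Java 10 or later because of &lt;code&gt;var&lt;/code&gt;. &lt;/p&gt;

```java
import java.util.Arrays;

// Hypothetical helper, not part of the library described here.
class OverlapMetricsSketch {

    // Number of distinct words of the first sentence that also occur
    // in the second one (case sensitive, order ignored).
    static long overlaps(String[] first, String[] second) {
        var secondTokens = Arrays.asList(second);
        return Arrays.stream(first).distinct().filter(secondTokens::contains).count();
    }

    // precision = overlaps / number of tokens in the 2nd sentence
    static double precision(String[] first, String[] second) {
        return (double) overlaps(first, second) / second.length;
    }

    // recall = overlaps / number of tokens in the 1st sentence
    static double recall(String[] first, String[] second) {
        return (double) overlaps(first, second) / first.length;
    }

    // F-measure = 2 * precision * recall / (precision + recall)
    static double fMeasure(String[] first, String[] second) {
        double p = precision(first, second);
        double r = recall(first, second);
        // Assumed convention: no overlap at all yields 0 instead of 0/0.
        return p + r == 0 ? 0.0 : 2 * p * r / (p + r);
    }

    // Dice = 2 * overlaps / (sum of number of tokens in both sentences)
    static double dice(String[] first, String[] second) {
        return 2.0 * overlaps(first, second) / (first.length + second.length);
    }

    // Cosine = overlaps / sqrt(tokens in 1st sentence * tokens in 2nd sentence)
    static double cosine(String[] first, String[] second) {
        return overlaps(first, second) / Math.sqrt((double) first.length * second.length);
    }

    public static void main(String[] args) {
        String[] first = "the cat sat on the mat".split("\\s+");
        String[] second = "the cat ate".split("\\s+");
        System.out.println("F-measure = " + fMeasure(first, second));
        System.out.println("Dice      = " + dice(first, second));
        System.out.println("Cosine    = " + cosine(first, second));
    }
}
```

&lt;p&gt;In the example in &lt;code&gt;main&lt;/code&gt;, the first sentence has 6 tokens, the second has 3, and the shared words are "the" and "cat", so precision is 2/3 and recall is 2/6. &lt;/p&gt;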
&lt;p&gt;With these definitions, the F-Measure and the Dice Coefficient always yield the same value. &lt;/p&gt;
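&lt;p&gt;The equivalence can be checked by substituting precision and recall into the F-Measure. In the short derivation below, the symbols are notation introduced only for this note: o is the number of overlaps, n_1 and n_2 are the numbers of tokens in the 1st and 2nd sentence, P is precision and R is recall. It assumes o is not zero; with no overlap the F-Measure as written is 0/0, while Dice is simply 0. &lt;/p&gt;

```latex
\[
F = \frac{2PR}{P + R}
  = \frac{2 \cdot \frac{o}{n_2} \cdot \frac{o}{n_1}}{\frac{o}{n_2} + \frac{o}{n_1}}
  = \frac{\frac{2o^2}{n_1 n_2}}{\frac{o\,(n_1 + n_2)}{n_1 n_2}}
  = \frac{2o}{n_1 + n_2}
  = \mathrm{Dice}
\]
```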
&lt;p&gt;All these metrics are implemented in the classes &lt;code&gt;SentenceSimilarity&lt;/code&gt; and &lt;code&gt;BagOfWordsSimilarity&lt;/code&gt;. See the corresponding unit tests (&lt;code&gt;SentenceSimilarityTest&lt;/code&gt; and &lt;code&gt;BagOfWordsSimilarityTest&lt;/code&gt;) for examples. &lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Kostia</dc:creator><pubDate>Tue, 20 May 2014 08:56:24 -0000</pubDate><guid>https://sourceforge.netdb5cbbf7e0f2a696f8efd7780404c01b0961a4b7</guid></item></channel></rss>