TextSimilarity modified by Kostia

Kostia — Tue, 20 May 2014 08:56:24 -0000

Text Similarity

We implement several metrics to measure the similarity between words, sentences, or documents (expressed as bags of words). Word similarity is based on the individual characters, while sentence and document similarity is based on the overlapping words.

For word similarity, the following metrics are implemented:

The Jaccard metric is obtained by dividing the size of the intersection by the size of the union. Like the cosine similarity, it normalizes the overlap by the sizes of the two inputs.
The Jaro distance is mainly used in the area of record linkage (duplicate detection) because it takes into account typical spelling deviations.
The Jaro-Winkler distance is an extension of the Jaro distance metric that gives additional weight to a common prefix, so words that agree at the beginning score higher.
The Levenshtein distance function (also known as edit distance) is defined as the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into the other.
The Luhn metric is based on Luhn's paper "The Automatic Creation of Literature Abstracts".
The Soundex distance computes the Soundex phonetic representation of the words and then compares the two representations using the Jaro-Winkler distance.
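
As an illustration of two of the character-based metrics listed above, here is a minimal sketch of the Jaccard ratio and the Levenshtein distance. It is not the code from eu.kostia.textanalysis.similarity: the class name CharMetricsSketch and its method names are hypothetical, and the Jaccard part assumes that each word is treated as a set of distinct characters, which may differ from what the package actually does.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration only; not part of the package.
final class CharMetricsSketch {

    /** Jaccard ratio: |intersection| / |union| of the two character sets. */
    static double jaccard(String a, String b) {
        Set<Character> sa = charSet(a);
        Set<Character> sb = charSet(b);
        if (sa.isEmpty() && sb.isEmpty()) {
            return 1.0; // convention chosen for this sketch: two empty words match
        }
        Set<Character> intersection = new HashSet<>(sa);
        intersection.retainAll(sb);
        Set<Character> union = new HashSet<>(sa);
        union.addAll(sb);
        return (double) intersection.size() / union.size();
    }

    /** Collects the distinct characters of a word. */
    private static Set<Character> charSet(String s) {
        Set<Character> set = new HashSet<>();
        for (char c : s.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    /** Levenshtein distance computed with two rolling rows of the edit table. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // prints 3
        System.out.println(jaccard("night", "nacht"));        // 3 shared chars out of 7, about 0.43
    }
}
```

Note that the Levenshtein function returns a raw edit count rather than a value in [0,1]; how the package normalizes it, if at all, is not shown here.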

They are all implemented in the package eu.kostia.textanalysis.similarity. To add your own measure, extend the class AbstractStringSimilarityMetric or implement the interface StringSimilarityMetric.

In contrast, sentence or bag-of-words similarity is computed by counting the number of matching words (case sensitive) between two sentences, regardless of their order. The raw count is then normalized to a coefficient in the range [0,1] according to the sentence lengths. F-Measure, Dice Coefficient and cosine similarity are available and are computed as follows:

precision = overlaps / number of tokens in the 2nd sentence
recall = overlaps / number of tokens in the 1st sentence
F-measure = 2 * precision * recall / (precision + recall)

Dice = 2 * overlaps / (sum of number of tokens in both sentences)

Cosine = overlaps / sqrt (number of tokens in the 1st sentence * number of tokens in the 2nd sentence)

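In fact, the F-Measure and the Dice Coefficient are always equivalent: substituting precision and recall into the F-measure formula and simplifying gives 2 * overlaps / (number of tokens in the 1st sentence + number of tokens in the 2nd sentence), which is exactly the Dice formula.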
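
The formulas above can be sketched in a few lines of Java. This is only an illustration, not the code of SentenceSimilarity or BagOfWordsSimilarity: the class name OverlapScoresSketch and its methods are hypothetical. It also assumes that a repeated word is matched at most as many times as it occurs in both sentences, and that all scores are 0 when a sentence is empty; the actual classes may handle these cases differently.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration only; not part of the package.
final class OverlapScoresSketch {

    /** Counts matching tokens (case sensitive, order ignored); each occurrence is matched at most once. */
    static int overlaps(List<String> first, List<String> second) {
        Map<String, Integer> remaining = new HashMap<>();
        for (String token : first) {
            remaining.merge(token, 1, Integer::sum);
        }
        int count = 0;
        for (String token : second) {
            Integer left = remaining.get(token);
            if (left != null && left > 0) {
                remaining.put(token, left - 1); // consume one occurrence from the 1st sentence
                count++;
            }
        }
        return count;
    }

    /** precision = overlaps / number of tokens in the 2nd sentence */
    static double precision(int overlaps, int n2) {
        return n2 == 0 ? 0.0 : (double) overlaps / n2;
    }

    /** recall = overlaps / number of tokens in the 1st sentence */
    static double recall(int overlaps, int n1) {
        return n1 == 0 ? 0.0 : (double) overlaps / n1;
    }

    static double fMeasure(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2 * precision * recall / sum;
    }

    static double dice(int overlaps, int n1, int n2) {
        return n1 + n2 == 0 ? 0.0 : 2.0 * overlaps / (n1 + n2);
    }

    static double cosine(int overlaps, int n1, int n2) {
        return (n1 == 0 || n2 == 0) ? 0.0 : overlaps / Math.sqrt((double) n1 * n2);
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("the", "cat", "sat", "on", "the", "mat");
        List<String> second = Arrays.asList("the", "cat", "sat");
        int o = overlaps(first, second); // 3 matching tokens
        double p = precision(o, second.size()); // 3 / 3
        double r = recall(o, first.size());     // 3 / 6
        System.out.println("F-measure = " + fMeasure(p, r));
        System.out.println("Dice      = " + dice(o, first.size(), second.size()));
        System.out.println("Cosine    = " + cosine(o, first.size(), second.size()));
    }
}
```

For the two example sentences, the F-measure and Dice values both come out at about 0.667, which matches the equivalence stated above, while the cosine value is about 0.707.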

All these metrics are implemented in the classes SentenceSimilarity and BagOfWordsSimilarity. For examples, see the corresponding unit tests (SentenceSimilarityTest and BagOfWordsSimilarityTest).
