We implement several metrics to measure the similarity between words, sentences or documents (expressed as bag-of-words). The similarity between words is based on their individual characters, while the similarity between sentences or documents is based on their overlapping words.
Regarding the word similarity, the following metrics are implemented:
They are all implemented in the package eu.kostia.textanalysis.similarity. To add your own measure, extend the class AbstractStringSimilarityMetric or implement the interface StringSimilarityMetric.
Sentence or bag-of-words similarity, on the other hand, is computed by counting the number of matching words (case sensitive) between two sentences, regardless of their order. The raw count is then normalized to a coefficient in the range [0,1] according to the sentence lengths. F-Measure, Dice Coefficient and cosine similarity are available and computed as follows:
precision = overlaps / number of tokens in the 2nd sentence
recall = overlaps / number of tokens in the 1st sentence
F-measure = 2 * precision * recall / (precision + recall)
Dice = 2 * overlaps / (sum of number of tokens in both sentences)
Cosine = overlaps / sqrt (number of tokens in the 1st sentence * number of tokens in the 2nd sentence)
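In fact, the F-Measure and the Dice Coefficient are always equivalent: substituting the definitions of precision and recall into the F-measure and simplifying gives 2 * overlaps / (sum of number of tokens in both sentences), which is exactly the Dice formula.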
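The formulas above can be illustrated with a minimal, self-contained sketch. It is not the code of SentenceSimilarity or BagOfWordsSimilarity: the class name OverlapMetricsSketch and its methods are hypothetical, and it assumes that a repeated word is matched at most as many times as it occurs in the other sentence and that a zero overlap yields a score of 0.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class, not part of eu.kostia.textanalysis.similarity.
final class OverlapMetricsSketch {

    // Counts matching tokens, case sensitive and ignoring order.
    // Assumption: each occurrence in the first sentence can be matched only once.
    static int overlaps(String[] first, String[] second) {
        Map<String, Integer> remaining = new HashMap<>();
        for (String token : first) {
            remaining.merge(token, 1, Integer::sum);
        }
        int count = 0;
        for (String token : second) {
            Integer left = remaining.get(token);
            if (left != null && left > 0) {
                remaining.put(token, left - 1);
                count++;
            }
        }
        return count;
    }

    // n1 and n2 are the token counts of the 1st and 2nd sentence.
    static double fMeasure(int overlaps, int n1, int n2) {
        if (overlaps == 0) {
            return 0.0; // assumption: avoids the 0/0 case
        }
        double precision = (double) overlaps / n2;
        double recall = (double) overlaps / n1;
        return 2 * precision * recall / (precision + recall);
    }

    static double dice(int overlaps, int n1, int n2) {
        return (n1 + n2 == 0) ? 0.0 : 2.0 * overlaps / (n1 + n2);
    }

    static double cosine(int overlaps, int n1, int n2) {
        return (n1 == 0 || n2 == 0) ? 0.0 : overlaps / Math.sqrt((double) n1 * n2);
    }

    public static void main(String[] args) {
        String[] first = "the cat sat on the mat".split(" ");
        String[] second = "the cat sat".split(" ");
        int o = overlaps(first, second);
        System.out.println("overlaps  = " + o);
        System.out.println("F-measure = " + fMeasure(o, first.length, second.length));
        System.out.println("Dice      = " + dice(o, first.length, second.length));
        System.out.println("Cosine    = " + cosine(o, first.length, second.length));
    }
}
```

For the two sentences in main, three of the tokens match, so F-measure and Dice both come out at about 0.667, while the cosine is about 0.707.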
All these metrics are implemented in the classes SentenceSimilarity and BagOfWordsSimilarity. For examples, see the corresponding unit tests (SentenceSimilarityTest and BagOfWordsSimilarityTest).