We implement several metrics to measure the similarity between words, sentences or documents (expressed as bags of words). Word similarity is based on single characters, while sentence and document similarity is based on overlapping words.
Regarding the word similarity, the following metrics are implemented:
They are all implemented in the package eu.kostia.textanalysis.similarity. To add your own measure, extend the class AbstractStringSimilarityMetric or implement the interface StringSimilarityMetric.
Sentence or bag-of-words similarity, on the other hand, is computed by counting the number of matching words (case-sensitive) between two sentences, regardless of their order. The raw count is then normalized to a coefficient in [0,1] according to the sentence lengths. F-Measure, Dice Coefficient and cosine similarity are available and are computed as follows:
precision = overlaps / number of tokens in the 2nd sentence
recall = overlaps / number of tokens in the 1st sentence
F-measure = 2 * precision * recall / (precision + recall)
Dice = 2 * overlaps / (sum of number of tokens in both sentences)
Cosine = overlaps / sqrt (number of tokens in the 1st sentence * number of tokens in the 2nd sentence)
In fact, with these definitions the F-Measure and the Dice Coefficient always yield the same value.
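The equivalence can be checked by substituting precision and recall into the F-measure formula. In the short derivation below, the symbols o (overlaps), n_1 and n_2 (number of tokens in the 1st and 2nd sentence) are shorthand introduced here for illustration, and o > 0 is assumed so that the F-measure is defined:

```latex
\begin{aligned}
P &= \frac{o}{n_2}, \qquad R = \frac{o}{n_1} \\
F &= \frac{2PR}{P + R}
   = \frac{2\,\dfrac{o^2}{n_1 n_2}}{\dfrac{o\,(n_1 + n_2)}{n_1 n_2}}
   = \frac{2o}{n_1 + n_2}
   = \mathrm{Dice}
\end{aligned}
```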
All these metrics are implemented in the classes SentenceSimilarity and BagOfWordsSimilarity. See the corresponding unit tests (SentenceSimilarityTest and BagOfWordsSimilarityTest) for examples.
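For readers who want to see the formulas above in isolation, here is a minimal stand-alone sketch. It does not use or reproduce the API of SentenceSimilarity or BagOfWordsSimilarity: the class OverlapMetrics and all its method names are hypothetical. It also makes two assumptions that the text does not specify: sentences are tokenized on whitespace, and a repeated word is matched at most as many times as it occurs in both sentences.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of eu.kostia.textanalysis.similarity. */
final class OverlapMetrics {

    private OverlapMetrics() {}

    /** Splits on whitespace (an assumption; the library's tokenizer may differ). */
    static String[] tokenize(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /**
     * Counts case-sensitive matching words, ignoring their order.
     * Assumption: each occurrence of a word can be matched only once.
     */
    static int overlaps(String[] first, String[] second) {
        Map<String, Integer> remaining = new HashMap<>();
        for (String token : first) {
            remaining.merge(token, 1, Integer::sum);
        }
        int count = 0;
        for (String token : second) {
            Integer left = remaining.get(token);
            if (left != null && left > 0) {
                remaining.put(token, left - 1);
                count++;
            }
        }
        return count;
    }

    /** F-measure = 2 * precision * recall / (precision + recall); 0 when nothing overlaps. */
    static double fMeasure(String[] first, String[] second) {
        int o = overlaps(first, second);
        if (o == 0) {
            return 0.0;
        }
        double precision = (double) o / second.length;
        double recall = (double) o / first.length;
        return 2 * precision * recall / (precision + recall);
    }

    /** Dice = 2 * overlaps / (sum of number of tokens in both sentences). */
    static double dice(String[] first, String[] second) {
        int total = first.length + second.length;
        return total == 0 ? 0.0 : 2.0 * overlaps(first, second) / total;
    }

    /** Cosine = overlaps / sqrt(tokens in 1st sentence * tokens in 2nd sentence). */
    static double cosine(String[] first, String[] second) {
        if (first.length == 0 || second.length == 0) {
            return 0.0;
        }
        return overlaps(first, second) / Math.sqrt((double) first.length * second.length);
    }
}
```

With the sentences "the cat sat on the mat" and "The cat sat", this sketch finds 2 overlapping words ("The" does not match "the" because the comparison is case-sensitive). The F-measure and Dice are both 4/9, and the cosine is 2 / sqrt(18), roughly 0.471.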