From: Ted P. <dul...@gm...> - 2008-04-06 14:50:10
We are pleased to announce the release of version 0.06 of Text-Similarity. This is a module that WordNet-Similarity uses in the computation of the lesk measure, and one of the new features in this release is a "lesk" score that performs our "lesk overlap" calculation for any pair of files or strings you provide to it.

As you may recall, the lesk measure takes glosses and compares them for overlaps (matches), then scores them by taking the length of each phrasal match, squaring it, and summing those squares.

Consider the following example (line breaks introduced for clarity), which measures the similarity of the two given strings:

    text_similarity.pl --type Text::Similarity::Overlaps --verbose
        --stoplist stoplist.txt
        --string 'winston churchill was the prime minister of england'
                 'prime minister of england winston churchill came for a visit that day'

    keys: 2
    -->'prime minister england' len(3) cnt(1)
    -->'winston churchill' len(2) cnt(1)
    wc 1: 5
    wc 2: 7
    Raw score: 5
    Precision: 0.714285714285714
    Recall : 1
    F-measure: 0.833333333333333
    Dice : 0.833333333333333
    E-measure: 0.166666666666667
    Cosine : 0.845154254728517
    Raw lesk : 13
    Lesk : 0.371428571428571
    0.833333333333333

We find two phrasal matches, of length 2 and 3, so those are scored (by raw lesk) as 2^2 + 3^2 = 13. That is then scaled by the product of the two string lengths after stop words are removed (wc 1 and wc 2 above, so 5 * 7 = 35) to arrive at a normalized lesk score of 13 / 35 = 0.371. By default WordNet Similarity uses raw lesk.

Note that the raw score is simply the number of matching words (prime minister england winston churchill) without regard to their order, and that this value is the basis of all the other measures except for raw lesk and lesk. So, of the measures above, only raw lesk and lesk actually take phrasal matches into account, rewarding longer matches more heavily.

This package provides both a command line program (text_similarity.pl) and Perl API calls (examples in the SYNOPSIS sections of the CPAN documentation).
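For anyone who would like to see the arithmetic above spelled out, here is a minimal sketch in Python. It is not the code Text-Similarity uses (the module itself is Perl), and several things in it are assumptions made for illustration: the STOPWORDS set stands in for stoplist.txt, whose contents are not shown here; the function names are hypothetical; the overlap finder is a simple greedy "longest shared run first" search; and the precision/recall assignment is inferred from the numbers in the sample output.

```python
import math

# Assumption: a stand-in for stoplist.txt, chosen so the word counts
# match the "wc 1: 5" and "wc 2: 7" lines in the sample output.
STOPWORDS = {"was", "the", "of", "came", "for", "a", "that"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def longest_common_run(a, b):
    """Return (length, start_a, start_b) of the longest contiguous shared run."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # None marks tokens already used by an earlier match.
            if a[i - 1] is not None and a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best


def phrasal_overlaps(a, b):
    """Greedily collect the lengths of shared phrases, longest first."""
    a, b = list(a), list(b)
    lengths = []
    while True:
        n, i, j = longest_common_run(a, b)
        if n == 0:
            return lengths
        lengths.append(n)
        a[i:i + n] = [None] * n  # blank out matched tokens in both strings
        b[j:j + n] = [None] * n


def score(text1, text2):
    """Compute a few of the measures shown in the sample output."""
    t1, t2 = tokenize(text1), tokenize(text2)
    lengths = phrasal_overlaps(t1, t2)
    raw = sum(lengths)                       # matching words, order ignored
    raw_lesk = sum(n * n for n in lengths)   # squared phrase lengths, summed
    precision = raw / len(t2) if t2 else 0.0
    recall = raw / len(t1) if t1 else 0.0
    denom = precision + recall
    return {
        "raw": raw,
        "raw_lesk": raw_lesk,
        "lesk": raw_lesk / (len(t1) * len(t2)) if t1 and t2 else 0.0,
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / denom if denom else 0.0,
        "cosine": raw / math.sqrt(len(t1) * len(t2)) if t1 and t2 else 0.0,
    }


result = score(
    "winston churchill was the prime minister of england",
    "prime minister of england winston churchill came for a visit that day",
)
for name, value in result.items():
    print(f"{name}: {value}")
```

The point of the sketch is the contrast between the two kinds of score: "raw" only counts shared words, while "raw_lesk" squares each phrase length before summing, so one match of length 3 counts for more than three separate single-word matches.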
You can find more info and download links at
http://text-similarity.sourceforge.net

I'm sure we'll continue to tinker with and extend Text Similarity, so please do let us know of any suggestions you have.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse