text-similarity-news Mailing List for Text::Similarity

Status: Beta

Brought to you by: sidz1979, tpederse

text-similarity-news — News about Text-Similarity

2008	Jan	Feb	Mar	Apr (4)	May	Jun	Jul	Aug	Sep	Oct	Nov (1)	Dec
2010	Jan	Feb	Mar	Apr	May	Jun (1)	Jul	Aug	Sep	Oct	Nov	Dec
2013	Jan (1)	Feb	Mar	Apr	May	Jun (1)	Jul	Aug	Sep	Oct	Nov	Dec
2015	Jan	Feb	Mar	Apr	May	Jun	Jul	Aug	Sep	Oct (1)	Nov	Dec

[text-similarity-news] Text::Similarity version 0.11 released (bug fix release)

From: Ted P. <dul...@gm...> - 2015-10-08 00:54:29

We are pleased to announce the release of version 0.11 of
Text::Similarity.  This includes a few fixes and corrections supplied by
users (which we are always most grateful for!).

You can download the new version from CPAN or sourceforge via links found
at http://text-similarity.sourceforge.net. Below is the change log for this
release. Finally, we are very open to other patches or ideas that users
have, so please feel free to let us know!

0.11
        Released October 6, 2015 (all changes by TDP)

        Contributed enhancement by Tani Hosokawa

        Not a bug fix, but an optimization. The original version
        did an inefficient repeated linear search over text
        that can't possibly match. The new version instead
        precaches the locations of keywords. Comparing 100
        semi-randomly generated, fairly similar documents of
        about 500 words each results in an approx 90% speed
        increase, and the gain grows as the documents get larger.
        https://rt.cpan.org/Public/Ticket/Attachment/999948/520850

        Made various documentation/typo fixes as suggested by
        Alex Becker. Found in the CPAN bug list.
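
The keyword precaching described in the 0.11 entry above can be illustrated with a small sketch. This is not the contributed patch and not the module's Perl code. It is a hypothetical Python illustration (the names `find_linear` and `build_index` are made up here) of replacing repeated linear scans with a one-time map from each word to the positions where it occurs.

```python
from collections import defaultdict

def find_linear(words, target):
    """Baseline: scan the whole word list every time a word is looked up."""
    return [i for i, w in enumerate(words) if w == target]

def build_index(words):
    """Precache: a single pass that records every position of every word."""
    index = defaultdict(list)
    for i, w in enumerate(words):
        index[w].append(i)
    return index

doc2 = "prime minister of england winston churchill came for a visit".split()
index = build_index(doc2)

# Each lookup is now one dictionary access instead of a full scan, and a
# word that never occurs in doc2 is rejected without touching the text.
for word in ["winston", "minister", "zebra"]:
    cached = index.get(word, [])
    assert cached == find_linear(doc2, word)
    print(word, cached)
```

The index costs one pass over the text up front, so the saving only shows when the same text is probed many times, which would fit the changelog's remark that the gain grows with document size.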

Enjoy,
Ted

[text-similarity-news] Text::Similarity v0.10 released

From: Ted P. <tpederse@d.umn.edu> - 2013-06-27 12:17:09

We are pleased to announce the release of version 0.10 of
Text-Similarity. This release only includes a single fix, and that is
a change to a test case that fails on Windows. Unless this sort of
thing really bothers you, you probably don't need to update. :)

You can find the most current version on CPAN or at sourceforge:
http://text-similarity.sourceforge.net

However, there is a more important announcement, and that is that as
of 0.10 Text-Similarity is again current in our sourceforge cvs
archive. There were some transitions happening at sourceforge when
0.09 came out, so we did not use cvs. But, we are back to using cvs
now, and that is always available for viewing or modifying if you are
interested. Note that the cvs module name is now TS. As of now the web
view hasn't been updated to include this new directory, but that
should occur in the next day or two. Additional instructions on using
cvs are available on sourceforge:

http://sourceforge.net/p/text-similarity/code/?source=navbar

Enjoy, and please let us know if any questions arise.
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

[text-similarity-news] Text::Similarity version 0.09 released

From: Ted P. <tpederse@d.umn.edu> - 2013-01-22 21:01:05

Version 0.09 of Text::Similarity has been released on CPAN and
sourceforge. This release includes two user contributions (that are
very much appreciated). See details below, and feel free to download
from http://text-similarity.sourceforge.net

 0.09
        Released January 22, 2013

        *   This release includes changes contributed by Myroslava Dzikovska
            that provide the full set of similarity scores programmatically.
            She modified the interface so that the getSimilarity function
            returns a pair ($score, %allScores) where %allScores is a hash
            of all possible scores that it computes. She made it so that in
            scalar context it will only return $score, so it is fully
            backwards compatible with the older versions. She also changed
            the printing to STDERR, to make it easier to use the code in
            filter scripts that depend on STDIN/STDOUT.

        *   This release also includes changes contributed by Nathan Glen to
            allow test cases to pass on Windows. The single quotes used
            previously caused arguments to the script not to be passed
            correctly, leading to test failures. The single quotes have been
            changed to double quotes.

Enjoy,
Ted

[text-similarity-news] Text-Similarity version 0.08 released

From: Ted P. <tpederse@d.umn.edu> - 2010-06-13 15:55:13

We are pleased to announce the release of version 0.08 of Text-Similarity.

This version has one important change: when you are using a stoplist,
you can now specify stop words using regular expressions.

In previous versions, a stoplist was specified as follows (in a
single file, one word per line):

a
of
in

This will cause a, of and in to be treated as stop words (so they are
not used in computing similarity).

As of 0.08 you may continue to use the above format, or you can use
regular expressions...

For example...

/\b\w\b/
/\b\d+\b/

...would cause all single character words and numeric values to be removed...
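
As a rough illustration of how such a mixed stoplist could be applied, here is a minimal Python sketch. It is not the module's implementation. It assumes that a line wrapped in slashes is a regular expression, that any other line is a literal word, and that filtering is done token by token. The helper names `load_stoplist` and `remove_stop_words` are hypothetical.

```python
import re

def load_stoplist(lines):
    """Turn stoplist lines into compiled patterns.

    Assumption: a line wrapped in slashes is a regular expression;
    anything else is a literal word matched on word boundaries.
    """
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if len(line) >= 2 and line.startswith("/") and line.endswith("/"):
            patterns.append(re.compile(line[1:-1]))
        else:
            patterns.append(re.compile(r"\b" + re.escape(line) + r"\b"))
    return patterns

def remove_stop_words(text, patterns):
    """Drop every whitespace-separated token that any pattern matches."""
    return [tok for tok in text.split()
            if not any(p.search(tok) for p in patterns)]

# The old-style words and the two regular expressions from this message.
stoplist = ["a", "of", "in", r"/\b\w\b/", r"/\b\d+\b/"]
patterns = load_stoplist(stoplist)
print(remove_stop_words("a visit in 1940 by x of england", patterns))
```

With this input the literal entries remove "a", "in" and "of", while the two regular expressions remove the single-character word "x" and the number "1940".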

You can get this new version via CPAN or sourceforge - find links to both at:

http://text-similarity.sourceforge.net

Enjoy,
Ted and Ying

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

[text-similarity-news] Text-Similarity version 0.07 released

From: Ted P. <dul...@gm...> - 2008-11-16 00:04:23

We are pleased to announce the release of version 0.07 of
Text-Similarity. This release has a single fix to a test case that has
caused trouble for Windows installation, so you should only worry
about upgrading if you are using Windows, or if you are using a
version earlier than 0.06 (which had a number of significant changes).

You can find download links from CPAN and sourceforge at
http://text-similarity.sourceforge.net

Please let us know if you have any questions or concerns!

Cordially,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

[text-similarity-news] Text Similarity version 0.06 released

From: Ted P. <dul...@gm...> - 2008-04-06 14:50:09

We are pleased to announce the release of version 0.06 of Text-Similarity.
This is a module that WordNet-Similarity uses in the computation of
the lesk measure, and one of the new features in this release is
a "lesk" score that performs our "lesk overlap" calculation
for any pair of files or strings you provide to it.

As you may recall, the lesk measure takes glosses and compares them for
overlaps (matches) and then scores them by taking the length of each phrasal
match, squaring it, and then summing those scores.
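
The scoring rule just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the module's code: it assumes phrasal matches are found greedily, longest shared run of words first, with matched words masked out so they cannot be reused. The function names and the sample sentences are made up for the sketch.

```python
_USED_A, _USED_B = object(), object()  # sentinels that never equal a word

def longest_common_phrase(a, b):
    """Return (length, start_a, start_b) of the longest shared run of words."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)  # run lengths ending at the previous word of a
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best

def phrasal_matches(text1, text2):
    """Greedily collect shared phrases, longest first, without reusing words."""
    a, b = text1.split(), text2.split()
    phrases = []
    while True:
        length, start_a, start_b = longest_common_phrase(a, b)
        if length == 0:
            return phrases
        phrases.append(" ".join(a[start_a:start_a + length]))
        # Mask the matched words so they cannot take part in later matches.
        a[start_a:start_a + length] = [_USED_A] * length
        b[start_b:start_b + length] = [_USED_B] * length

def raw_lesk(phrases):
    """Square the word length of each phrasal match and sum the squares."""
    return sum(len(p.split()) ** 2 for p in phrases)

phrases = phrasal_matches("the quick brown fox jumps",
                          "a quick brown fox sleeps while the dog jumps")
print(phrases, raw_lesk(phrases))
```

Here one three-word match and two one-word matches give 3^2 + 1^2 + 1^2 = 11, which shows how the squaring rewards a long phrase over the same number of scattered words.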

Consider the following example (line breaks introduced for clarity),
which measures the two given strings for similarity:

 text_similarity.pl --type Text::Similarity::Overlaps --verbose
 --stoplist stoplist.txt --string
 'winston churchill was the prime minister of england'
 'prime minister of england winston churchill came for a visit that day'

 keys: 2
 -->'prime minister england' len(3) cnt(1)
 -->'winston churchill' len(2) cnt(1)
 wc 1: 5
 wc 2: 7
  Raw score: 5
  Precision: 0.714285714285714
  Recall   : 1
  F-measure: 0.833333333333333
  Dice     : 0.833333333333333
  E-measure: 0.166666666666667
  Cosine   : 0.845154254728517
  Raw lesk : 13
  Lesk     : 0.371428571428571
 0.833333333333333

We find two phrasal matches of length 2 and 3, so those are scored (by
raw lesk) as 2^2 + 3^2 = 13. That is then scaled by the product of
the two string lengths (13 / (5 * 7) here) to arrive at a normalized
lesk score. By default WordNet-Similarity uses raw lesk. Note that the
raw score is simply the number of matching words (prime minister england
winston churchill) without regard to their order, and that this value is
the basis of all the other measures except for raw lesk and lesk. So, of
the measures above, only lesk really considers phrasal matches and
treats them differently.
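
The numbers in the verbose output above can be reproduced from just four inputs: the raw score, the two word counts, and the phrasal match lengths. The Python sketch below does that. The formulas are inferred from the printed values rather than taken from the module source, so treat them as an assumption, and `overlap_measures` is a hypothetical name.

```python
import math

def overlap_measures(raw_score, wc1, wc2, phrase_lengths):
    """Recompute the reported measures; assumes both texts are non-empty."""
    precision = raw_score / wc2          # matching words / words in text 2
    recall = raw_score / wc1             # matching words / words in text 1
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    raw_lesk = sum(n * n for n in phrase_lengths)
    return {
        "raw_score": raw_score,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "dice": 2 * raw_score / (wc1 + wc2),
        "e_measure": 1 - f_measure,
        "cosine": raw_score / math.sqrt(wc1 * wc2),
        "raw_lesk": raw_lesk,
        "lesk": raw_lesk / (wc1 * wc2),  # scaled by the product of lengths
    }

# Values from the example: 5 matching words, word counts 5 and 7,
# phrasal matches of length 3 and 2.
scores = overlap_measures(raw_score=5, wc1=5, wc2=7, phrase_lengths=[3, 2])
for name, value in scores.items():
    print(f"{name:10s} {value:.6f}")
```

Only `raw_lesk` and `lesk` depend on `phrase_lengths`. Every other entry is a function of the raw score and the word counts alone, which matches the point made in the paragraph above.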

This package provides both a command line program (text_similarity.pl)
and Perl API calls (examples in the  SYNOPSIS sections of the CPAN
documentation).

You can find more info and find download links at
http://text-similarity.sourceforge.net

I'm sure we'll continue to tinker with and extend Text Similarity, so
please do let us know of any suggestions you have.

Enjoy,
Ted

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

[text-similarity-news] Text-Similarity version 0.05 released

From: Ted P. <dul...@gm...> - 2008-04-04 19:29:39

We are pleased to announce the release of Text-Similarity version 0.05.
This version allows users to measure two strings for similarity, in addition
to being able to measure two files (which was the existing functionality).

http://text-similarity.sourceforge.net

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
