From: Ted P. <dul...@gm...> - 2015-10-08 00:54:29
|
We are pleased to announce the release of version 0.11 of Text::Similarity. This includes a few fixes and corrections supplied by users (for which we are always most grateful!). You can download the new version from CPAN or sourceforge via links found at http://text-similarity.sourceforge.net. Below is the change log for this release. Finally, we are very open to other patches or ideas that users have, so please feel free to let us know!

0.11 Released October 6, 2015 (all changes by TDP)

* Contributed enhancement by Tani Hosokawa. Not a bug, but an optimization: the original version does an inefficient repeated linear search over text that can't possibly match, while the patch instead precaches the locations of keywords. Comparing 100 semi-randomly generated, fairly similar documents of about 500 words each results in an approximately 90% speed increase, and the efficiency gain grows as the documents get larger. https://rt.cpan.org/Public/Ticket/Attachment/999948/520850

* Various documentation/typo fixes as suggested by Alex Becker. Found in the CPAN bug list.

Enjoy, Ted |
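[Editor's note] The precaching optimization described in the announcement above can be sketched roughly as follows. This is an illustrative sketch of the general idea, not the actual contributed patch; the `index_positions` helper is hypothetical and not part of the Text::Similarity API.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the precaching idea: instead of re-scanning the full text
# for every keyword, build a hash that maps each word to the list of
# positions where it occurs, so only positions that can possibly match
# are ever examined.
sub index_positions {
    my @words = @_;
    my %pos;
    push @{ $pos{ $words[$_] } }, $_ for 0 .. $#words;
    return \%pos;
}

my @text  = qw(the dog bit jim and the dog ran);
my $index = index_positions(@text);

# Only the positions where 'dog' actually occurs need to be checked.
print "dog at: @{ $index->{dog} }\n";
```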
From: Ted P. <dul...@gm...> - 2014-09-22 11:49:28
|
Very nice news for users of NSP and Text::Similarity! Please support these resources by giving them a try and letting others know about them too. Cordially, Ted

---------- Forwarded message ----------
From: Marta Villegas <mar...@up...>
Date: Mon, Sep 22, 2014 at 4:13 AM
Subject: Ngrams and Text Similarity deployed as SOAP web services
To: tpe...@um...

Dear Ted,

Because of our participation in the CLARIN <http://clarin.eu/> and PANACEA <http://www.panacea-lr.eu/> EU projects, we have deployed some NLP tools as web services over the last few years. Among these you will find your Ngrams and Text Similarity services. They are deployed as SOAP web services, and they are open and accessible. You can find a description in our LOD-browser catalogue <http://lod.iula.upf.edu/index-en.html> (please let us know if you want us to change something):

1) Ted Pedersen's Ngrams Counter Web Service <http://lod.iula.upf.edu/resources/184>
2) Ted Pedersen's Ngram Statistics Package <http://lod.iula.upf.edu/resources/108>
3) Ted Pedersen's Text Similarity Web Service <http://lod.iula.upf.edu/resources/429>

The corresponding demo invocations are also available here:
http://ws04.iula.upf.edu/soaplab2-axis/#statistics_analysis.countngrams_row
http://ws04.iula.upf.edu/soaplab2-axis/#statistics_analysis.ngrams_row
http://ws04.iula.upf.edu/soaplab2-axis/#statistics_analysis.text_similarity_row

Best regards -- Marta Villegas mar...@gm... |
From: Ted P. <tpederse@d.umn.edu> - 2013-09-19 13:51:14
|
Hi Thomas,

Thanks for asking; we certainly appreciate the desire to cite Text::Similarity! Unfortunately, there is really no publication that could be cited. I think the best course of action might be to mention both the name and version you used (Text::Similarity version 0.10) and provide the URL of the software's home page (http://text-similarity.sourceforge.net) in a reference or footnote.

We are of course always interested to know how people have used this software, so if you do get a paper written up, please do consider passing that along to us.

Cordially, Ted

On Wed, Sep 18, 2013 at 9:58 AM, Thomas Meyer <Tho...@id...> wrote:
> Hi,
>
> First of all, thanks for the great tool.
>
> Is there a publication (conference paper etc.) that I can cite when
> using the Text::Similarity module in my own research?
>
> Thanks,
> Thomas
>
> _______________________________________________
> text-similarity-users mailing list
> tex...@li...
> https://lists.sourceforge.net/lists/listinfo/text-similarity-users

-- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Thomas M. <Tho...@id...> - 2013-09-18 15:25:11
|
Hi,

First of all, thanks for the great tool.

Is there a publication (conference paper etc.) that I can cite when using the Text::Similarity module in my own research?

Thanks, Thomas |
From: Ted P. <tpederse@d.umn.edu> - 2013-06-27 12:17:08
|
We are pleased to announce the release of version 0.10 of Text-Similarity. This release includes only a single fix: a change to a test case that fails on Windows. Unless this sort of thing really bothers you, you probably don't need to update. :) You can find the most current version on CPAN or at sourceforge: http://text-similarity.sourceforge.net

However, there is a more important announcement: as of 0.10, Text-Similarity is again current in our sourceforge cvs archive. There were some transitions happening at sourceforge when 0.09 came out, so we did not use cvs then. But we are back to using cvs now, and it is always available for viewing or modifying if you are interested. Note that the cvs module name is now TS. As of now the web view hasn't been updated to include this new directory, but that should occur in the next day or two. Additional instructions on using cvs are available at sourceforge: http://sourceforge.net/p/text-similarity/code/?source=navbar

Enjoy, and please let us know if any questions arise. Ted -- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Ted P. <tpederse@d.umn.edu> - 2013-01-22 21:01:04
|
Version 0.09 of Text::Similarity has been released on CPAN and sourceforge. This release includes two user contributions (which are very much appreciated). See details below, and feel free to download from http://text-similarity.sourceforge.net

0.09 Released January 22, 2013

* This release includes changes contributed by Myroslava Dzikovska that provide the full set of similarity scores programmatically. She modified the interface so that the getSimilarity function returns a pair ($score, %allScores), where %allScores is a hash of all possible scores that it computes. She made it so that in scalar context it will only return $score, so it is fully backwards compatible with older versions. She also changed the printing to STDERR, to make it easier to use the code in filter scripts that depend on STDIN/STDOUT.

* This release also includes changes contributed by Nathan Glen to allow test cases to pass on Windows. The single quote used previously caused arguments to the script not to be passed correctly, leading to test failures. The single quotes have been changed to double quotes.

Enjoy, Ted |
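[Editor's note] Based on the 0.09 change log above, the context-sensitive return value might be used like this. This is a hedged sketch: the file names are hypothetical, and the exact set of keys in %allScores is as the announcement describes it, not verified here.

```perl
use strict;
use warnings;
use Text::Similarity::Overlaps;

# Construct the measure; 'normalize' scales the score into [0,1].
my $mod = Text::Similarity::Overlaps->new( { normalize => 1 } )
    or die "failed to construct Text::Similarity::Overlaps";

# Scalar context: just the single score, as in versions before 0.09.
my $score = $mod->getSimilarity('old.txt', 'new.txt');

# List context (new in 0.09): the score plus a hash of all computed
# measures, as contributed by Myroslava Dzikovska.
my ($s, %allScores) = $mod->getSimilarity('old.txt', 'new.txt');
print "$_ => $allScores{$_}\n" for sort keys %allScores;
```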
From: Throop, D. R. (JSC-ER)[J. Technology] <dav...@na...> - 2011-07-12 17:02:20
|
Thanks! I would have responded faster, but your message wandered into my spam filter, where I just noticed it. I tore your code apart and was able to cache the computation that only affected one of the strings (the tokenization etc.) so that the only computation inside the 600 x 300 loop was the direct comparison. Eventually I'll ask NASA for the forms to allow me to post it on CPAN.

David Throop

-----Original Message-----
From: Ted Pedersen [mailto:tpederse@d.umn.edu]
Sent: Wednesday, June 15, 2011 4:38 PM
To: tex...@li...
Subject: Re: [text-similarity-users] Taking many similarity measurements between two corpora

Hi David,

Nice question, and unfortunately I don't think there is a particularly better way to do what you propose, other than a long series of pairwise comparisons. That said, I ran something of the same dimensionality that you want to do (600 x 300), and the following script took 2.5 hours on a 5 year old desktop... so if this isn't something you need to do on a regular basis, maybe it works out ok. Below is my timing output...

ted@linux-zxku:~> time bash runit.sh
real 156m55.322s
user 124m11.270s
sys 24m30.416s

And then there is the script I ran - I just took a file and made 600 individual 1 line files, and then did a bunch of pairwise similarities with our command line tool. Using the API would in effect result in the same thing...

ted@linux-zxku:~> more runit.sh
for line in {1..600..1}
do
  head -$line text | tail -1 > text.$line
done

for linea in {1..600..1}
do
  for lineb in {1..300..1}
  do
    text_similarity.pl --type Text::Similarity::Overlaps text.$linea text.$lineb >> text.output
  done
done

for line in {1..600..1}
do
  rm text.$line
done

I hope this helps... please feel free to let us know of any additional questions that might arise.

Cordially, Ted

On Tue, Jun 14, 2011 at 3:52 PM, Throop, David R. (JSC-ER)[Jacobs Technology] <dav...@na...> wrote:
> I need the pairwise similarity measurements between two corpora, MxN. I
> wondered about the efficiency of this in Text::Similarity.
>
> Here's the task. We're transitioning a piece of hardware from one program
> to another. The hardware was built to the old program's requirements
> (roughly 300 old requirements). The new program has its own requirements
> (roughly 600 requirements). Each requirement is ~100 words.
>
> I'm supporting a gap analysis. One task in the gap analysis can be stated as:
>
> For each old requirement, find up to 3 new requirements which are
> most similar to the old requirement.
>
> Example: Suppose I have an old requirement that reads "The Delivery-unit
> shall fold to a stowage volume that will fit within the Transport Bag
> dimensions of 48 by 20 by 14 inches and allow space for foam cushioning
> material." Then I want to find any new requirements that are talking about
> delivery-units, stowage volume, dimensions, transport bags or foam cushioning.
>
> To do this, I want the pairwise similarity scores between all the old and
> new requirements, roughly 300x600 = 180,000 comparisons. I suspect that invoking
> $score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);
> isn't the best way to do this. E.g., it would call sanitizeString on each
> old requirement 600 times.
>
> Am I missing something? Is there already a way to iterate efficiently over
> such a pair of corpora?
>
> David Throop
>
> _______________________________________________
> text-similarity-users mailing list
> tex...@li...
> https://lists.sourceforge.net/lists/listinfo/text-similarity-users

-- Ted Pedersen http://www.d.umn.edu/~tpederse |
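[Editor's note] The caching strategy David describes above (preprocess each string once, then run only the cheap comparison inside the MxN loop) can be sketched as follows. This is a hedged illustration, not his actual code: the `tokenize` and `overlap` helpers are hypothetical stand-ins, not part of the Text::Similarity API.

```perl
use strict;
use warnings;

# Tokenize a string once into a bag of lowercased words.
sub tokenize {
    my ($s) = @_;
    my %set;
    $set{$_}++ for grep { length } split /\W+/, lc $s;
    return \%set;
}

# Cheap pairwise comparison: count shared word types.
sub overlap {
    my ($x, $y) = @_;
    return scalar grep { exists $y->{$_} } keys %$x;
}

my @old_reqts = ('The unit shall fold for stowage', 'The bag holds foam');
my @new_reqts = ('Stowage volume shall fit the bag');

# Preprocess each requirement exactly once, NOT once per comparison.
my @old_tok = map { tokenize($_) } @old_reqts;
my @new_tok = map { tokenize($_) } @new_reqts;

my @score;
for my $i (0 .. $#old_tok) {
    for my $j (0 .. $#new_tok) {
        $score[$i][$j] = overlap($old_tok[$i], $new_tok[$j]);
    }
}
print "score[0][0] = $score[0][0]\n";
```

The design point is the same one David makes: the expensive per-string work moves outside the 600 x 300 loop, leaving only the direct comparison inside it.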
From: Ted P. <tpederse@d.umn.edu> - 2011-06-15 21:38:32
|
Hi David,

Nice question, and unfortunately I don't think there is a particularly better way to do what you propose, other than a long series of pairwise comparisons. That said, I ran something of the same dimensionality that you want to do (600 x 300), and the following script took 2.5 hours on a 5 year old desktop... so if this isn't something you need to do on a regular basis, maybe it works out ok. Below is my timing output...

ted@linux-zxku:~> time bash runit.sh
real 156m55.322s
user 124m11.270s
sys 24m30.416s

And then there is the script I ran - I just took a file and made 600 individual 1 line files, and then did a bunch of pairwise similarities with our command line tool. Using the API would in effect result in the same thing...

ted@linux-zxku:~> more runit.sh
for line in {1..600..1}
do
  head -$line text | tail -1 > text.$line
done

for linea in {1..600..1}
do
  for lineb in {1..300..1}
  do
    text_similarity.pl --type Text::Similarity::Overlaps text.$linea text.$lineb >> text.output
  done
done

for line in {1..600..1}
do
  rm text.$line
done

I hope this helps... please feel free to let us know of any additional questions that might arise.

Cordially, Ted

On Tue, Jun 14, 2011 at 3:52 PM, Throop, David R. (JSC-ER)[Jacobs Technology] <dav...@na...> wrote:
> I need the pairwise similarity measurements between two corpora, MxN. I
> wondered about the efficiency of this in Text::Similarity.
>
> Here's the task. We're transitioning a piece of hardware from one program
> to another. The hardware was built to the old program's requirements
> (roughly 300 old requirements). The new program has its own requirements
> (roughly 600 requirements). Each requirement is ~100 words.
>
> I'm supporting a gap analysis. One task in the gap analysis can be stated as:
>
> For each old requirement, find up to 3 new requirements which are
> most similar to the old requirement.
>
> Example: Suppose I have an old requirement that reads "The Delivery-unit
> shall fold to a stowage volume that will fit within the Transport Bag
> dimensions of 48 by 20 by 14 inches and allow space for foam cushioning
> material." Then I want to find any new requirements that are talking about
> delivery-units, stowage volume, dimensions, transport bags or foam cushioning.
>
> To do this, I want the pairwise similarity scores between all the old and
> new requirements, roughly 300x600 = 180,000 comparisons. I suspect that invoking
> $score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);
> isn't the best way to do this. E.g., it would call sanitizeString on each
> old requirement 600 times.
>
> Am I missing something? Is there already a way to iterate efficiently over
> such a pair of corpora?
>
> David Throop
>
> _______________________________________________
> text-similarity-users mailing list
> tex...@li...
> https://lists.sourceforge.net/lists/listinfo/text-similarity-users

-- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Throop, D. R. (JSC-ER)[J. Technology] <dav...@na...> - 2011-06-14 20:52:44
|
I need the pairwise similarity measurements between two corpora, MxN. I wondered about the efficiency of this in Text::Similarity.

Here's the task. We're transitioning a piece of hardware from one program to another. The hardware was built to the old program's requirements (roughly 300 old requirements). The new program has its own requirements (roughly 600 requirements). Each requirement is ~100 words.

I'm supporting a gap analysis. One task in the gap analysis can be stated as:

* For each old requirement, find up to 3 new requirements which are most similar to the old requirement.

Example: Suppose I have an old requirement that reads "The Delivery-unit shall fold to a stowage volume that will fit within the Transport Bag dimensions of 48 by 20 by 14 inches and allow space for foam cushioning material." Then I want to find any new requirements that are talking about delivery-units, stowage volume, dimensions, transport bags or foam cushioning.

To do this, I want the pairwise similarity scores between all the old and new requirements, roughly 300x600 = 180,000 comparisons. I suspect that invoking

$score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);

isn't the best way to do this. E.g., it would call sanitizeString on each old requirement 600 times.

Am I missing something? Is there already a way to iterate efficiently over such a pair of corpora?

David Throop |
From: Ted P. <tpederse@d.umn.edu> - 2011-01-02 00:11:36
|
Hi Sean,

I'm happy to report I think I figured this out. I use American spellings. :) The option is actually "normalize", whereas you were using "normalise", which I guess was just getting ignored (and we apparently aren't taking action when an invalid option is specified, which is a concern). I think when you make this change things will work out more as you expect.

ted@linux-qdw9:~> cat ts3.pl
my $str1 = "the dog bit Jim";
my $str2 = "jim bit the dog ";
my $laptool = "Text::Similarity::Overlaps";
eval "require $laptool";
if ($@) {die "\nWARNING ! $tool not loaded ..\n\n";}
my %lapopts = ('normalize' => 1, 'verbose' => 1); # 'verbose' = ++lesk-score
my $mod = $laptool->new(\%lapopts);
unless (defined($mod)) {print "FAILED '$laptool'\n"; return 0;}
$score = $mod->getSimilarityStrings ($str1, $str2);
print "score= $score\n\n";

ted@linux-qdw9:~> perl ts3.pl
keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1) -->'the dog' len(2) cnt(1)
wc 1: 4 wc 2: 4
Raw score: 4
Precision: 1
Recall : 1
F-measure: 1
Dice : 1
E-measure: 0
Cosine : 1
Raw lesk : 6
Lesk : 0.375
score= 1

ted@linux-qdw9:~> cat ts4.pl
my $str1 = "the dog bit Jim";
my $str2 = "jim bit the dog ";
my $laptool = "Text::Similarity::Overlaps";
eval "require $laptool";
if ($@) {die "\nWARNING ! $tool not loaded ..\n\n";}
my %lapopts = ('normalize' => 0, 'verbose' => 1); # 'verbose' = ++lesk-score
my $mod = $laptool->new(\%lapopts);
unless (defined($mod)) {print "FAILED '$laptool'\n"; return 0;}
$score = $mod->getSimilarityStrings ($str1, $str2);
print "score= $score\n\n";

ted@linux-qdw9:~> perl ts4.pl
keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1) -->'the dog' len(2) cnt(1)
score= 4

BTW, I very much agree with your suggestions for some methods to return particular values of scores. I'll see if we can't do something about that in the next few months, as others have made a similar point (as you point out).

Cordially, Ted

On Fri, Dec 31, 2010 at 11:41 PM, Sean <so...@or...> wrote:
> Sure, Ted, here it is:
>
> #-------------------------------------------- CODE ----------------------------------------------
> my $str1 = "the dog bit Jim";
> my $str2 = "jim bit the dog ";
> my $laptool = "Text::Similarity::Overlaps";
> eval "require $laptool";
> if ($@) {die "\nWARNING ! $tool not loaded ..\n\n";}
> #my %lapopts = ('normalise' => 0, 'verbose' => 1); # 'verbose' = ++lesk-score
> my %lapopts = ('normalise' => 1, 'verbose' => 1); # 'verbose' = ++lesk-score
> my $mod = $laptool->new(\%lapopts);
> unless (defined($mod)) {print "FAILED '$laptool'\n"; return 0;}
> $score = $mod->getSimilarityStrings ($str1, $str2);
> print "score= $score\n\n";
> #---------------------------------------END CODE ----------------------------------------------
>
> My guess is that self->verbose is not actually getting properly set via the options?
>
> regards
> Sean
>
> Ted Pedersen wrote:
> > Hi Sean,
> >
> > Thanks for your suggestions, let me take a look at those and see what
> > we might be able to do.
> >
> > And I'm sorry you are having some troubles. Can you go ahead and post
> > whatever code you are running to get these results? That will make it
> > easier to recreate the output.
> >
> > Cordially,
> > Ted
> >
> > On Fri, Dec 31, 2010 at 7:33 PM, Sean <so...@or...> wrote:
> > > Hello Ted
> > >
> > > I have installed v-0.08 and do not seem to get the results as documented.
> > >
> > > In order to get the Lesk-score I have tried setting the options 2
> > > different ways, both without luck.
> > > 1. ('normalise' => 0, 'verbose' => 1); # I expected this to work, not
> > > wanting the Lesk normalised ...
> > > 2. ('normalise' => 1, 'verbose' => 1); # also tried this just in case ...
> > >
> > > The COMPLETE screen-printed output from BOTH (using your doc example) is:
> > > "keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1)
> > > -->'the dog' len(2) cnt(1)"
> > >
> > > This is using the getSimilarityStrings ($str1, $str2) function directly
> > > from another script (getting a score of 4 returned there as expected).
> > >
> > > While at it I may as well mention what tops my wish-list for v-0.09. I
> > > would like to see additional simple wrapper functions like getLesk() and
> > > getCosine() which would return just the single measure specified, and
> > > getAll() which would return a hashref of 'named' parameters to include
> > > all provided measures, and which the other functions would be simple
> > > wrappers around to pull out one or other from that comprehensive
> > > getAll() hashref.
> > >
> > > This would avoid having to capture & parse output from stdout/stderr or
> > > some other arbitrary output channel, although it would probably do no
> > > harm to also "print" those measures. Since adding string (rather than
> > > file) acceptance obviously came as an afterthought itself, this might be
> > > the next logical extension to functionality. Looking at previous mailers
> > > I thought I detected similar requests, though expressed somewhat differently.
> > >
> > > Keep up the good work in 2011.
> > >
> > > Sean

-- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Sean <so...@or...> - 2011-01-01 05:41:26
|
Sure, Ted, here it is:

#-------------------------------------------- CODE ----------------------------------------------
my $str1 = "the dog bit Jim";
my $str2 = "jim bit the dog ";
my $laptool = "Text::Similarity::Overlaps";
eval "require $laptool";
if ($@) {die "\nWARNING ! $tool not loaded ..\n\n";}
#my %lapopts = ('normalise' => 0, 'verbose' => 1); # 'verbose' = ++lesk-score
my %lapopts = ('normalise' => 1, 'verbose' => 1); # 'verbose' = ++lesk-score
my $mod = $laptool->new(\%lapopts);
unless (defined($mod)) {print "FAILED '$laptool'\n"; return 0;}
$score = $mod->getSimilarityStrings ($str1, $str2);
print "score= $score\n\n";
#---------------------------------------END CODE ----------------------------------------------

My guess is that self->verbose is not actually getting properly set via the options?

regards
Sean

Ted Pedersen wrote:
> Hi Sean,
>
> Thanks for your suggestions, let me take a look at those and see what
> we might be able to do.
>
> And I'm sorry you are having some troubles. Can you go ahead and post
> whatever code you are running to get these results? That will make it
> easier to recreate the output.
>
> Cordially,
> Ted
>
> On Fri, Dec 31, 2010 at 7:33 PM, Sean <so...@or...> wrote:
> > Hello Ted
> >
> > I have installed v-0.08 and do not seem to get the results as documented.
> >
> > In order to get the Lesk-score I have tried setting the options 2
> > different ways, both without luck.
> > 1. ('normalise' => 0, 'verbose' => 1); # I expected this to work, not
> > wanting the Lesk normalised ...
> > 2. ('normalise' => 1, 'verbose' => 1); # also tried this just in case ...
> >
> > The COMPLETE screen-printed output from BOTH (using your doc example) is:
> > "keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1)
> > -->'the dog' len(2) cnt(1)"
> >
> > This is using the getSimilarityStrings ($str1, $str2) function directly
> > from another script (getting a score of 4 returned there as expected).
> >
> > While at it I may as well mention what tops my wish-list for v-0.09. I
> > would like to see additional simple wrapper functions like getLesk(),
> > getCosine() which would return just the single measure specified, and
> > getAll() which would return a hashref of 'named' parameters to include
> > all provided measures, and which the other functions would be simple
> > wrappers around to pull out one or other from that comprehensive
> > getAll() hashref.
> >
> > This would avoid having to capture & parse output from stdout/stderr or
> > some other arbitrary output channel, although it would probably do no
> > harm to also "print" those measures. Since adding string (rather than
> > file) acceptance obviously came as an afterthought itself, this might be
> > the next logical extension to functionality. Looking at previous mailers
> > I thought I detected similar requests, though expressed somewhat differently.
> >
> > Keep up the good work in 2011.
> >
> > Sean
> >
> > _______________________________________________
> > text-similarity-users mailing list
> > tex...@li...
> > https://lists.sourceforge.net/lists/listinfo/text-similarity-users |
From: Ted P. <tpederse@d.umn.edu> - 2011-01-01 05:13:00
|
Hi Sean,

Thanks for your suggestions, let me take a look at those and see what we might be able to do.

And I'm sorry you are having some troubles. Can you go ahead and post whatever code you are running to get these results? That will make it easier to recreate the output.

Cordially, Ted

On Fri, Dec 31, 2010 at 7:33 PM, Sean <so...@or...> wrote:
> Hello Ted
>
> I have installed v-0.08 and do not seem to get the results as documented.
>
> In order to get the Lesk-score I have tried setting the options 2
> different ways, both without luck.
> 1. ('normalise' => 0, 'verbose' => 1); # I expected this to work, not
> wanting the Lesk normalised ...
> 2. ('normalise' => 1, 'verbose' => 1); # also tried this just in case ...
>
> The COMPLETE screen-printed output from BOTH (using your doc example) is:
> "keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1)
> -->'the dog' len(2) cnt(1)"
>
> This is using the getSimilarityStrings ($str1, $str2) function directly
> from another script (getting a score of 4 returned there as expected).
>
> While at it I may as well mention what tops my wish-list for v-0.09. I
> would like to see additional simple wrapper functions like getLesk(),
> getCosine() which would return just the single measure specified, and
> getAll() which would return a hashref of 'named' parameters to include
> all provided measures, and which the other functions would be simple
> wrappers around to pull out one or other from that comprehensive
> getAll() hashref.
>
> This would avoid having to capture & parse output from stdout/stderr or
> some other arbitrary output channel, although it would probably do no
> harm to also "print" those measures. Since adding string (rather than
> file) acceptance obviously came as an afterthought itself, this might be
> the next logical extension to functionality. Looking at previous mailers
> I thought I detected similar requests, though expressed somewhat differently.
>
> Keep up the good work in 2011.
>
> Sean

-- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Sean <so...@or...> - 2011-01-01 02:07:19
|
Hello Ted

I have installed v-0.08 and do not seem to get the results as documented.

In order to get the Lesk-score I have tried setting the options 2 different ways, both without luck.
1. ('normalise' => 0, 'verbose' => 1); # I expected this to work, not wanting the Lesk normalised ...
2. ('normalise' => 1, 'verbose' => 1); # also tried this just in case ...

The COMPLETE screen-printed output from BOTH (using your doc example) is:
"keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1) -->'the dog' len(2) cnt(1)"

This is using the getSimilarityStrings ($str1, $str2) function directly from another script (getting a score of 4 returned there as expected).

While at it I may as well mention what tops my wish-list for v-0.09. I would like to see additional simple wrapper functions like getLesk() and getCosine(), which would return just the single measure specified, and getAll(), which would return a hashref of 'named' parameters to include all provided measures, and which the other functions would be simple wrappers around to pull out one or other from that comprehensive getAll() hashref.

This would avoid having to capture & parse output from stdout/stderr or some other arbitrary output channel, although it would probably do no harm to also "print" those measures. Since adding string (rather than file) acceptance obviously came as an afterthought itself, this might be the next logical extension to functionality. Looking at previous mailers I thought I detected similar requests, though expressed somewhat differently.

Keep up the good work in 2011.

Sean |
From: Ted P. <tpederse@d.umn.edu> - 2010-06-13 15:55:13
|
We are pleased to announce the release of version 0.08 of Text-Similarity. This version has one important change: when you are using a stoplist, you can now specify stop words using regular expressions.

In previous versions a stoplist could be specified as follows (in a single file, one line per word):

a
of
in

This will cause "a", "of" and "in" to be treated as stop words (and not used in computing similarity). As of 0.08 you may continue to use the above format, or you can use regular expressions. For example...

/\b\w\b/
/\b\d+\b/

...would cause all single-character words and numeric values to be removed.

You can get this new version via CPAN or sourceforge - find links to both at: http://text-similarity.sourceforge.net

Enjoy, Ted and Ying -- Ted Pedersen http://www.d.umn.edu/~tpederse |
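[Editor's note] The regex stoplist format described above might be used from the command line roughly like this. A hedged sketch: the file names are illustrative, and the --stoplist flag of text_similarity.pl is assumed from the announcement rather than verified here.

```shell
# Write a stoplist mixing the two regex patterns from the announcement:
# drop single-character words and purely numeric tokens.
cat > stoplist.txt <<'EOF'
/\b\w\b/
/\b\d+\b/
EOF

# Then (assuming Text-Similarity 0.08+ is installed), pass it to the
# command-line tool along with the two files to compare:
# text_similarity.pl --type Text::Similarity::Overlaps \
#                    --stoplist stoplist.txt file1.txt file2.txt
```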
From: Ted P. <dul...@gm...> - 2009-03-09 04:07:29
|
Thanks for reporting this James...this error seems to only occur with this particular version of Windows...

http://www.cpantesters.org/show/Text-Similarity.html#0.07

As a result, I'm inclined to say this looks more like a problem with that version of Windows than it does Text-Similarity. The specific error that you are seeing is for one test case that checks to make sure that the order of the words in a string doesn't affect the score...

the big cat
the cat big

Should get the same similarity score... Here is the specific test that is failing...

## this test case was causing trouble for Windows - changed in 0.07
##$output = `$^X $inc $text_similarity_pl --type Text::Similarity::Overlaps --string 'sir winston churchill' 'winston churchill SIR!!!' `;

$output = `$^X $inc $text_similarity_pl --type Text::Similarity::Overlaps --string 'sir winston churchill' 'winston churchill sir' `;
chomp $output;
is ($output, 1, "order doesn't affect score");

In fact the punctuation in the commented version seemed to cause problems earlier with Windows, not sure what's happening here.... While this might be somewhat risky, you may simply want to force the install and/or comment out this test case to get things installed ok... I wish I could shed more light on this - I normally don't use Windows so I'm not in a really good position to test things out, etc., so any observations you might have would be greatly appreciated.

Thanks!
Ted

On Sun, Mar 1, 2009 at 8:06 PM, James F. Mahon III via RT <bug...@rt...> wrote:
> Sun Mar 01 20:06:48 2009: Request 43758 was acted upon.
> Transaction: Ticket created by jam...@gm...
>        Queue: Text-Similarity
>      Subject: installation error
>    Broken in: (no value)
>     Severity: (no value)
>        Owner: Nobody
>   Requestors: jam...@gm...
>       Status: new
>  Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=43758 >
>
> Hello,
>
> I attempted to install Text::Similarity-0.07, but "nmake install"
> returned an error that I don't know how to address. I've pasted the
> error below. I'm attempting this installation with Perl v5.8.8 built for
> MSWin32-x86-multi-thread. Here are my locally applied patches, attained
> by running perl -V.
>
> Locally applied patches:
>   ActivePerl Build 822 [280952]
>   Iin_load_module moved for compatibility with build 806
>   PerlEx support in CGI::Carp
>   Less verbose ExtUtils::Install and Pod::Find
>   Patch for CAN-2005-0448 from Debian with modifications
>   Rearrange @INC so that 'site' is searched before 'perl'
>   Partly reverted 24733 to preserve binary compatibility
>   MAINT31223 plus additional changes
>   31490 Problem bootstraping Win32CORE
>   31324 Fix DynaLoader::dl_findfile() to locate .so files again
>   31214 Win32::GetLastError fails when first called
>   31211 Restore Windows NT support
>   31188 Problem killing a pseudo-forked child on Win32
>   29732 ANSIfy the PATH environment variable on Windows
>   27527,29868 win32_async_check() can loop indefinitely
>   26970 Make Passive mode the default for Net::FTP
>   26379 Fix alarm() for Windows 2003
>   24699 ICMP_UNREACHABLE handling in Net::Ping
>
> Can you offer any advice?
>
> Best,
>
> James
>
> Microsoft Windows XP [Version 5.1.2600]
> (C) Copyright 1985-2001 Microsoft Corp.
>
> P:\LWP\Text-Similarity-0.07>perl makefile.pl
> Checking if your kit is complete...
> Looks good
> Writing Makefile for Text::Similarity
>
> P:\LWP\Text-Similarity-0.07>nmake
>
> Microsoft (R) Program Maintenance Utility Version 1.50
> Copyright (c) Microsoft Corp 1988-94. All rights reserved.
>
>         cp lib/Text/Similarity.pm blib\lib\Text\Similarity.pm
>         cp lib/Text/OverlapFinder.pm blib\lib\Text\OverlapFinder.pm
>         cp lib/Text/Similarity/Overlaps.pm blib\lib\Text\Similarity\Overlaps.pm
>         C:\Perl\bin\perl.exe -MExtUtils::Command -e cp bin/text_similarity.pl blib\script\text_similarity.pl
>         pl2bat.bat blib\script\text_similarity.pl
>
> P:\LWP\Text-Similarity-0.07>nmake test
>
> Microsoft (R) Program Maintenance Utility Version 1.50
> Copyright (c) Microsoft Corp 1988-94. All rights reserved.
>
>         C:\Perl\bin\perl.exe -MExtUtils::Command -e cp bin/text_similarity.pl blib\script\text_similarity.pl
>         pl2bat.bat blib\script\text_similarity.pl
>         C:\Perl\bin\perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib\lib', 'blib\arch')" t/*.t
> t/getsimilaritystrings......ok
> t/no-normalize..............ok
> t/normalize.................ok
> t/overlaps..................ok
> t/text_similarity...........ok
> t/text_similarity_string....ok 5/8
> #   Failed test 'order doesn't affect score'
> #   at t/text_similarity_string.t line 69.
> t/text_similarity_string....NOK 8/8
> #          got: '0'
> #     expected: '1'
> # Looks like you failed 1 test of 8.
> t/text_similarity_string....dubious
>         Test returned status 1 (wstat 256, 0x100)
> DIED. FAILED test 8
>         Failed 1/8 tests, 87.50% okay
> Failed Test                Stat Wstat Total Fail  List of Failed
> -------------------------------------------------------------------------------
> t/text_similarity_string.t    1   256     8    1  8
> Failed 1/6 test scripts. 1/130 subtests failed.
> Files=6, Tests=130,  6 wallclock secs ( 0.00 cusr +  0.00 csys =  0.00 CPU)
> Failed 1/6 test programs. 1/130 subtests failed.
> NMAKE : fatal error U1077: 'C:\WINDOWS\system32\cmd.exe' : return code '0x1'
> Stop.
>
> --
> James Mahon
> Research Professional
> Becker Center, Chicago Booth, University of Chicago
> Tel: 773.834.7369
> Fax: 773.834.3040
> Email: jm...@ch...

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
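The property the failing test checks — that reordering the words leaves the Overlaps score unchanged — follows from bag-of-words overlap counting. A minimal Python sketch of that idea (an illustrative reimplementation of the F-measure scoring, not the Perl module itself):

```python
from collections import Counter

def overlap_fmeasure(s1: str, s2: str) -> float:
    """Count shared words without regard to order, then normalize with
    the F-measure (2 * P * R / (P + R)), mirroring the scoring that
    Text::Similarity::Overlaps is documented to use."""
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    raw = sum((c1 & c2).values())  # multiset intersection: words in common
    if raw == 0:
        return 0.0
    precision = raw / sum(c2.values())
    recall = raw / sum(c1.values())
    return 2 * precision * recall / (precision + recall)

# Word order does not matter, so both pairs score a perfect 1.0:
print(overlap_fmeasure("sir winston churchill", "winston churchill sir"))  # 1.0
print(overlap_fmeasure("the big cat", "the cat big"))                      # 1.0
```

Punctuation handling (the "SIR!!!" variant in the commented-out test) is exactly where the module's own normalization comes into play; this sketch deliberately leaves that out.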
From: Antonio T. <Ant...@il...> - 2009-01-25 19:01:20
|
Hi Ted, finally I figured out how to do it. I've changed this part of the code:

String command = "text_similarity.pl --type=Text::Similarity::Overlaps --string \"" + string1 + "\" \"" + string2 + "\"";
Process p = Runtime.getRuntime().exec(command);

with:

Vector<String> command = new Vector<String>();
command.add("text_similarity.pl");
command.add("--type=Text::Similarity::Overlaps");
command.add("--string");
command.add(string1);
command.add(string2);
ProcessBuilder pb = new ProcessBuilder(command);
Process p = pb.start();

and now it works smoothly. Hope it can be useful for someone.

Regards,
Antonio

> Hi Antonio,
>
> I'm afraid I have very little experience with Java, so I don't really
> know how to include Perl in a Java program. I do know that there are
> Perl modules that let you do the opposite, that is include Java in
> Perl programs .... this is most commonly done with Inline::Java, which
> can be found here :
>
> http://search.cpan.org/~patl/Inline-Java/
>
> I don't know if that would give any ideas about how to include Perl in
> Java, but it's about the only thing I could think of to mention.
>
> Please do let us know if you figure this out, seems like potentially a
> very useful technique.
>
> Cordially,
> Ted
|
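The fix works because of a general distinction: Java's Runtime.exec(String) tokenizes the command on whitespace without honoring quotes, while ProcessBuilder passes each list element to the child as exactly one argument. The same contrast can be sketched in Python (hypothetical command string, just to show the tokenization):

```python
import subprocess
import sys

string1 = "sir winston churchill"

# Naive whitespace tokenization, essentially what
# Runtime.getRuntime().exec(String) does: the quoted phrase shatters.
naive = f'text_similarity.pl --string "{string1}"'.split()
print(len(naive))  # 5 tokens -- the quotes are not honored

# Argument-vector form (ProcessBuilder's behavior): the child process
# receives the multi-word string as a single argv entry.
out = subprocess.run(
    [sys.executable, "-c", "import sys; print(len(sys.argv[1:]))", string1],
    capture_output=True, text=True,
)
print(out.stdout.strip())  # 1
```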
From: Ted P. <dul...@gm...> - 2009-01-20 23:49:14
|
Hi Yashar,

Thanks for your questions - see my responses inline...

On Tue, Jan 20, 2009 at 9:31 AM, Yashar Mehdad <yas...@ya...> wrote:
> Dear Ted
>
> I'm using your package in some of my experiments and in order to cite it I
> need a few clarifications :
>
> while "text_similarity.pl --type=Text::Similarity::Overlaps file1 file2" is
> executed a normalized measure is obtained. what I understood from your
> documentation this measure is raw normalized (F-measure = 2 * precision *
> recall / (precision + recall)), is that right?

Correct. Consider the following example...

ted@ted-desktop:~$ text_similarity.pl test1 test2 --type=Text::Similarity::Overlaps --no-normalize
5
ted@ted-desktop:~$ text_similarity.pl test1 test2 --type=Text::Similarity::Overlaps
0.555555555555556
ted@ted-desktop:~$ more test1
this is test1 i am happy he is not
ted@ted-desktop:~$ more test2
this is test2 i am hungry she is sad

The --no-normalize run shows that 5 words have matched (without regard to order or length of phrase).

> while "text_similarity.pl --type=Text::Similarity::Overlaps --no-normalize
> file1 file2" is executed, the output would be a simple raw score of overlap
> not lesk raw score? is it right?

Correct! There is no "bonus" for phrasal matching in the overlap scoring.

> is there any way in which by using text_similarity.pl one can reach the lesk
> measure through defining any option? (I'm aware that by defining the verbose
> option we could get all measures, but is there any way that directly leads us
> to the lesk measure?)

Not from the command line, however, you could edit Overlaps.pm to just output lesk.... Here's the relevant snippet, where I've added comments...

    if ($self->verbose) {
    #    print " Raw score: $score\n";
    #    print " Precision: $prec\n";
    #    print " Recall   : $recall\n";
    #    print " F-measure: $f\n";
    #    my $dice = 2 * $score / ($wc1 + $wc2);
    #    print " Dice     : $dice\n";
    #    my $e = 1 - $f;
    #    print " E-measure: $e\n";
    #    my $cos = $score / sqrt ($wc1 * $wc2);
    #    print " Cosine   : $cos\n";
         my $lesk = $raw_lesk / ($wc1 * $wc2);
    #    print " Raw lesk : $raw_lesk\n";
         print " Lesk     : $lesk\n";
    }

I know that's a bit messy, but should be a fairly easy fix in the short term at least...

I hope this helps!
Ted

> Thanks in advance for your reply and help.
>
> Best regards
> Yashar.

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
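The numbers in the example above can be checked by hand. A short Python sketch of the arithmetic (an illustrative reimplementation of the documented formulas, not the Perl module):

```python
from collections import Counter
from math import isclose

test1 = "this is test1 i am happy he is not".split()
test2 = "this is test2 i am hungry she is sad".split()

# Raw score: matching words counted without regard to order
# ('is' occurs twice in each file, so it matches twice).
raw = sum((Counter(test1) & Counter(test2)).values())

# Default normalized score: F-measure = 2 * P * R / (P + R).
precision = raw / len(test2)
recall = raw / len(test1)
f_measure = 2 * precision * recall / (precision + recall)

print(raw)        # 5, the --no-normalize output
print(f_measure)  # 0.5555..., the default output
```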
From: Yashar M. <yas...@ya...> - 2009-01-20 15:31:57
|
Dear Ted

I'm using your package in some of my experiments and in order to cite it I need a few clarifications:

While "text_similarity.pl --type=Text::Similarity::Overlaps file1 file2" is executed, a normalized measure is obtained. What I understood from your documentation is that this measure is the raw score normalized (F-measure = 2 * precision * recall / (precision + recall)) - is that right?

While "text_similarity.pl --type=Text::Similarity::Overlaps --no-normalize file1 file2" is executed, the output would be a simple raw score of overlap, not a raw lesk score - is that right?

Is there any way in which, by using text_similarity.pl, one can reach the lesk measure through defining any option? (I'm aware that by defining the verbose option we could get all measures, but is there any way that directly leads us to the lesk measure?)

Thanks in advance for your reply and help.

Best regards
Yashar.
|
From: Ted P. <dul...@gm...> - 2009-01-20 14:54:26
|
Hi Antonio,

I'm afraid I have very little experience with Java, so I don't really know how to include Perl in a Java program. I do know that there are Perl modules that let you do the opposite, that is include Java in Perl programs .... this is most commonly done with Inline::Java, which can be found here :

http://search.cpan.org/~patl/Inline-Java/

I don't know if that would give any ideas about how to include Perl in Java, but it's about the only thing I could think of to mention.

Please do let us know if you figure this out, seems like potentially a very useful technique.

Cordially,
Ted

On Fri, Jan 16, 2009 at 11:32 AM, Antonio Toral <ant...@il...> wrote:
> hi,
>
> i'd like to use text-similarity from a java program. So from this program I
> call text-similarity and then I capture its output. This is the java code:
>
> String command = "text_similarity.pl --type=Text::Similarity::Overlaps --string \"" + string1 + "\" \"" + string2 + "\"";
> String result = "";
> try {
>     Process p = Runtime.getRuntime().exec(command);
>     int command_exit = p.waitFor();
>     System.err.println("Command ended with value: " + command_exit);
>
>     BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()));
>     BufferedReader stdError = new BufferedReader(new InputStreamReader(p.getErrorStream()));
>     String s = null;
>     while ((s = stdInput.readLine()) != null) {
>         System.out.println("\treading result: " + s);
>         result += s;
>     }
>     while ((s = stdError.readLine()) != null) {
>         System.out.println("\tERROR: " + s);
>     }
> }
> catch (IOException e) {
>     e.printStackTrace();
>     System.exit(-1);
> }
> catch (InterruptedException i) {
>     i.printStackTrace();
>     System.exit(-2);
> }
>
> However it does not work! I just get the string "0".
>
> If I run text-similarity from the command line it works fine. If I call from
> my java program a shell command (like "ls") or a "hello world" perl script
> then I can capture its output, so I guess the problem is somehow related to
> the way text-similarity buffers its output. I've read of people having this
> kind of issue when calling perl scripts from java, and someone proposes as a
> solution to put this at the beginning of perl scripts:
>
> use IO::Handle;
> STDOUT->autoflush(1);
> STDERR->autoflush(1);
>
> [http://forums.sun.com/thread.jspa?threadID=189595&forumID=31]
>
> however, I've also tried to put this at the beginning of text_similarity.pl
> but without luck!
>
> Does anyone know how I should do this?
>
> Thanks in advance,
> Antonio Toral

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
From: Antonio T. <ant...@il...> - 2009-01-16 18:15:20
|
hi,

i'd like to use text-similarity from a java program. So from this program I call text-similarity and then I capture its output. This is the java code:

String command = "text_similarity.pl --type=Text::Similarity::Overlaps --string \"" + string1 + "\" \"" + string2 + "\"";
String result = "";
try {
    Process p = Runtime.getRuntime().exec(command);
    int command_exit = p.waitFor();
    System.err.println("Command ended with value: " + command_exit);

    BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()));
    BufferedReader stdError = new BufferedReader(new InputStreamReader(p.getErrorStream()));
    String s = null;
    while ((s = stdInput.readLine()) != null) {
        System.out.println("\treading result: " + s);
        result += s;
    }
    while ((s = stdError.readLine()) != null) {
        System.out.println("\tERROR: " + s);
    }
}
catch (IOException e) {
    e.printStackTrace();
    System.exit(-1);
}
catch (InterruptedException i) {
    i.printStackTrace();
    System.exit(-2);
}

However it does not work! I just get the string "0".

If I run text-similarity from the command line it works fine. If I call from my java program a shell command (like "ls") or a "hello world" perl script then I can capture its output, so I guess the problem is somehow related to the way text-similarity buffers its output. I've read of people having this kind of issue when calling perl scripts from java, and someone proposes as a solution to put this at the beginning of perl scripts:

use IO::Handle;
STDOUT->autoflush(1);
STDERR->autoflush(1);

[http://forums.sun.com/thread.jspa?threadID=189595&forumID=31]

however, I've also tried to put this at the beginning of text_similarity.pl but without luck!

Does anyone know how I should do this?

Thanks in advance,
Antonio Toral
|
From: Ted P. <tpederse@d.umn.edu> - 2008-11-21 13:37:50
|
Hi Hamed,

Thanks for your interest in Text-Similarity. See my comments inline...

On Fri, Nov 21, 2008 at 1:13 AM, <kha...@pe...> wrote:
> Dear Dr. Pedersen
>
> I need to use the Text::Similarity module in my project so I wanted to know:
>
> 1- How to extract only the Lesk measurement from Text::Similarity, in case
> I need only that one, since when I put ('normalize' => 1, 'verbose' => 1)
> it gives a range of measures and I need only Lesk to put in my program.
> Is there any extra function for that? In other words, I need to put the
> Lesk measure into my $score to be used for calculating sentence semantic
> relatedness in my project.

The following code (from http://search.cpan.org/dist/Text-Similarity/lib/Text/Similarity.pm) will give you just the Lesk (overlap) measure. This is not actually the same thing as "semantic relatedness", so if you are interested in that you might want to look at the lesk measure as found in WordNet::Similarity, which is based on the use of Text::Similarity but does some other things too. Text::Similarity (for lesk) simply finds the overlaps between two strings or files.

use Text::Similarity::Overlaps;
my $mod = Text::Similarity::Overlaps->new;
defined $mod or die "Construction of Text::Similarity::Overlaps failed";

# adjust file names to reflect true relative position
# these paths are valid from lib/Text/Similarity
my $text_file1 = 'sent11.txt';
my $text_file2 = 'sent21.txt';

my $score = $mod->getSimilarity ($text_file1, $text_file2);

print "The similarity of $text_file1 and $text_file2 is : $score\n";

> 2- How to cite the work in my project?

You could use the following type of reference :

Pedersen, Ted (2008) Text-Similarity (version 0.07) : A Perl Module to
Measure the Pair-Wise Similarity of Files or Strings
http://search.cpan.org/dist/Text-Similarity/

Good Luck!
Ted

> Thank you very much.
>
> Your attention would be most appreciated.
>
> Hamed Khanpour
> MCS student.
> Malaysia

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
From: Ted P. <dul...@gm...> - 2008-11-16 00:04:24
|
We are pleased to announce the release of version 0.07 of Text-Similarity. This release has a single fix to a test case that has caused trouble for Windows installation, so you should only worry about upgrading if you are using Windows, or if you are using a version less than 0.06 (which had a number of significant changes). You can find download links from CPAN and sourceforge at http://text-similarity.sourceforge.net Please let us know if you have any questions or concerns! Cordially, Ted -- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Ted P. <tpederse@d.umn.edu> - 2008-10-25 16:02:05
|
Hi Karthick,

I'm glad to know you are finding Text::Similarity useful... I think the main documentation we have about these measures is found here :

http://search.cpan.org/dist/Text-Similarity/lib/Text/Similarity/Overlaps.pm

This gives the formulas that we use in the program - I think in general these are pretty commonly accepted definitions (except perhaps for lesk) so we didn't elaborate a great deal on them. However, I'm happy to add some details as needed. The lesk measure in terms of the overlap counting, etc. that we do is probably best described here (in section 7.3):

An Adapted Lesk Algorithm for Word Sense Disambiguation using WordNet (Banerjee and Pedersen) - Appears in the Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, pp. 136-145, February 17-23, 2002, Mexico City.
http://www.d.umn.edu/~tpederse/Pubs/cicling2002-b.pdf

The other measures I *think* are fairly standard, although if you have doubts about what we have done with them let me know and I can hopefully clarify.

Thanks!
Ted

On Sat, Oct 25, 2008 at 10:45 AM, Karthick Jayaraman <kar...@gm...> wrote:
> Dear Professor,
>
> I am using your Text::Similarity package in one of my current projects. Is
> there any documentation on the details of the metrics such as
> F-Measure, Precision, Recall, Cosine, and Lesk? Kindly let me know.
>
> We are currently using your package to establish the similarity of
> JavaScript programs that undergo certain forms of minor dynamic
> updates.
>
> We would like to cite your package and the reference on the metrics.
>
> --
> Cheers!,
> Karthick Jayaraman
>
> You must do the things you think you cannot do.
> Eleanor Roosevelt
>
> http://web.syr.edu/~kjayaram

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
From: Ted P. <dul...@gm...> - 2008-04-06 14:50:10
|
We are pleased to announce the release of version 0.06 of Text-Similarity. This is a module that WordNet-Similarity uses in the computation of the lesk measure, and one of the new features in this release is providing a "lesk" score that does our calculation for "lesk overlap" for any pair of files or strings you provide to it.

As you may recall, the lesk measure takes glosses and compares them for overlaps (matches) and then scores them by taking the length of each phrasal match, squaring it, and then summing those scores. Consider the following example (line breaks introduced for clarity) which measures the two given strings for similarity:

text_similarity.pl --type Text::Similarity::Overlaps --verbose
    --stoplist stoplist.txt
    --string 'winston churchill was the prime minister of england'
             'prime minister of england winston churchill came for a visit that day'

keys: 2
-->'prime minister england' len(3) cnt(1)
-->'winston churchill' len(2) cnt(1)
wc 1: 5
wc 2: 7
 Raw score: 5
 Precision: 0.714285714285714
 Recall   : 1
 F-measure: 0.833333333333333
 Dice     : 0.833333333333333
 E-measure: 0.166666666666667
 Cosine   : 0.845154254728517
 Raw lesk : 13
 Lesk     : 0.371428571428571
0.833333333333333

We find two phrasal matches of length 2 and 3, so those are scored (by raw lesk) as 2^2 + 3^2 = 13. That is then scaled by the product of the two string lengths to arrive at a normalized lesk score. By default WordNet-Similarity uses raw lesk.

Note that the raw score is simply the number of matching words (prime minister england winston churchill) without regard to their order, and that this value is the basis of all the other measures except for raw lesk and lesk. So, of the measures above, only lesk is really considering phrasal matches and treats them differently.

This package provides both a command line program (text_similarity.pl) and Perl API calls (examples in the SYNOPSIS sections of the CPAN documentation).
You can find more information and download links at http://text-similarity.sourceforge.net

I'm sure we'll continue to tinker with and extend Text-Similarity, so please do let us know of any suggestions you have.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
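The lesk arithmetic in the announcement can be reproduced from the two overlap lengths alone. A Python sketch of the scoring described above (illustrative only — the real computation lives in the Perl module):

```python
from math import isclose

# From the example: phrasal overlaps 'prime minister england' (length 3)
# and 'winston churchill' (length 2); word counts after stoplist removal.
overlap_lengths = [3, 2]
wc1, wc2 = 5, 7

raw_score = sum(overlap_lengths)                # 5 matching words
raw_lesk = sum(n * n for n in overlap_lengths)  # 3^2 + 2^2 = 13
lesk = raw_lesk / (wc1 * wc2)                   # normalized: 13 / 35

# The default reported score is the F-measure over the raw (unsquared) count.
precision, recall = raw_score / wc2, raw_score / wc1
f_measure = 2 * precision * recall / (precision + recall)

print(raw_lesk)             # 13
print(round(lesk, 6))       # 0.371429
print(round(f_measure, 6))  # 0.833333
```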
You can find more info and find download links at http://text-similarity.sourceforge.net I'm sure we'll continue to tinker with and extend Text Similarity, so please do let us know of any suggestions you have. Enjoy, Ted -- Ted Pedersen http://www.d.umn.edu/~tpederse |