From: Ted P. <dul...@gm...> - 2015-10-08 00:54:29
|
We are pleased to announce the release of version 0.11 of Text::Similarity. This includes a few fixes and corrections supplied by users (for which we are always most grateful!). You can download the new version from CPAN or sourceforge via links found at http://text-similarity.sourceforge.net. Below is the change log for this release. Finally, we are very open to other patches or ideas that users have, so please feel free to let us know!

0.11 Released October 6, 2015 (all changes by TDP)

* Contributed enhancement by Tani Hosokawa. Not a bug, but an optimization: the original version does an inefficient repeated linear search over text that can't possibly match, while the patch instead precaches the locations of keywords. Comparing 100 semi-randomly generated, fairly similar documents of about 500 words each results in an approximately 90% speed increase, and the efficiency gain grows as the documents get larger. https://rt.cpan.org/Public/Ticket/Attachment/999948/520850

* Various documentation/typo fixes as suggested by Alex Becker. Found in the CPAN bug list.

Enjoy, Ted |
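[Editor's note] The precaching optimization described in the announcement above can be sketched roughly as follows. This is an illustrative sketch of the general idea, not the actual contributed patch; the `index_positions` helper is hypothetical and not part of the Text::Similarity API.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the precaching idea: instead of re-scanning the full text
# for every keyword, build a hash that maps each word to the list of
# positions where it occurs, so only positions that can possibly match
# are ever examined.
sub index_positions {
    my @words = @_;
    my %pos;
    push @{ $pos{ $words[$_] } }, $_ for 0 .. $#words;
    return \%pos;
}

my @text  = qw(the dog bit jim and the dog ran);
my $index = index_positions(@text);

# Only the positions where 'dog' actually occurs need to be checked.
print "dog at: @{ $index->{dog} }\n";
```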
From: Ted P. <dul...@gm...> - 2014-09-22 11:49:28
|
Very nice news for users of NSP and Text::Similarity! Please support these resources by giving them a try and letting others know about them too. Cordially, Ted

---------- Forwarded message ----------
From: Marta Villegas <mar...@up...>
Date: Mon, Sep 22, 2014 at 4:13 AM
Subject: Ngrams and Text Similarity deployed as SOAP web services
To: tpe...@um...

Dear Ted,

Because of our participation in the CLARIN <http://clarin.eu/> and PANACEA <http://www.panacea-lr.eu/> EU projects, we have deployed some NLP tools as web services over the last few years. Among these you will find your Ngrams and Text Similarity services. They are deployed as SOAP web services, and they are open and accessible. You can find a description in our LOD-browser catalogue <http://lod.iula.upf.edu/index-en.html> (please let us know if you want us to change something):

1) Ted Pedersen's Ngrams Counter Web Service <http://lod.iula.upf.edu/resources/184>
2) Ted Pedersen's Ngram Statistics Package <http://lod.iula.upf.edu/resources/108>
3) Ted Pedersen's Text Similarity Web Service <http://lod.iula.upf.edu/resources/429>

The corresponding demo invocations are also available here:
http://ws04.iula.upf.edu/soaplab2-axis/#statistics_analysis.countngrams_row
http://ws04.iula.upf.edu/soaplab2-axis/#statistics_analysis.ngrams_row
http://ws04.iula.upf.edu/soaplab2-axis/#statistics_analysis.text_similarity_row

Best regards -- Marta Villegas mar...@gm... |
From: Ted P. <tpederse@d.umn.edu> - 2013-09-19 13:51:14
|
Hi Thomas,

Thanks for asking; we certainly appreciate the desire to cite Text::Similarity! Unfortunately, there is really no publication that could be cited. I think the best course of action might be to mention both the name and version you used (Text::Similarity version 0.10) and provide the URL of the software's home page (http://text-similarity.sourceforge.net) in a reference or footnote.

We are of course always interested to know how people have used this software, so if you do get a paper written up, please do consider passing that along to us.

Cordially, Ted

On Wed, Sep 18, 2013 at 9:58 AM, Thomas Meyer <Tho...@id...> wrote:
> Hi,
>
> First of all, thanks for the great tool.
>
> Is there a publication (conference paper etc.) that I can cite when
> using the Text::Similarity module in my own research?
>
> Thanks,
> Thomas
>
> _______________________________________________
> text-similarity-users mailing list
> tex...@li...
> https://lists.sourceforge.net/lists/listinfo/text-similarity-users

-- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Thomas M. <Tho...@id...> - 2013-09-18 15:25:11
|
Hi,

First of all, thanks for the great tool.

Is there a publication (conference paper etc.) that I can cite when using the Text::Similarity module in my own research?

Thanks, Thomas |
From: Ted P. <tpederse@d.umn.edu> - 2013-06-27 12:17:08
|
We are pleased to announce the release of version 0.10 of Text-Similarity. This release includes only a single fix: a change to a test case that fails on Windows. Unless this sort of thing really bothers you, you probably don't need to update. :) You can find the most current version on CPAN or at sourceforge: http://text-similarity.sourceforge.net

However, there is a more important announcement: as of 0.10, Text-Similarity is again current in our sourceforge cvs archive. There were some transitions happening at sourceforge when 0.09 came out, so we did not use cvs then. But we are back to using cvs now, and it is always available for viewing or modifying if you are interested. Note that the cvs module name is now TS. As of now the web view hasn't been updated to include this new directory, but that should occur in the next day or two. Additional instructions on using cvs are available at sourceforge: http://sourceforge.net/p/text-similarity/code/?source=navbar

Enjoy, and please let us know if any questions arise. Ted -- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Ted P. <tpederse@d.umn.edu> - 2013-01-22 21:01:04
|
Version 0.09 of Text::Similarity has been released on CPAN and sourceforge. This release includes two user contributions (which are very much appreciated). See details below, and feel free to download from http://text-similarity.sourceforge.net

0.09 Released January 22, 2013

* This release includes changes contributed by Myroslava Dzikovska that provide the full set of similarity scores programmatically. She modified the interface so that the getSimilarity function returns a pair ($score, %allScores), where %allScores is a hash of all possible scores that it computes. She made it so that in scalar context it will only return $score, so it is fully backwards compatible with older versions. She also changed the printing to STDERR, to make it easier to use the code in filter scripts that depend on STDIN/STDOUT.

* This release also includes changes contributed by Nathan Glen to allow test cases to pass on Windows. The single quote used previously caused arguments to the script not to be passed correctly, leading to test failures. The single quotes have been changed to double quotes.

Enjoy, Ted |
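[Editor's note] Based on the 0.09 change log above, the context-sensitive return value might be used like this. This is a hedged sketch: the file names are hypothetical, and the exact set of keys in %allScores is as the announcement describes it, not verified here.

```perl
use strict;
use warnings;
use Text::Similarity::Overlaps;

# Construct the measure; 'normalize' scales the score into [0,1].
my $mod = Text::Similarity::Overlaps->new( { normalize => 1 } )
    or die "failed to construct Text::Similarity::Overlaps";

# Scalar context: just the single score, as in versions before 0.09.
my $score = $mod->getSimilarity('old.txt', 'new.txt');

# List context (new in 0.09): the score plus a hash of all computed
# measures, as contributed by Myroslava Dzikovska.
my ($s, %allScores) = $mod->getSimilarity('old.txt', 'new.txt');
print "$_ => $allScores{$_}\n" for sort keys %allScores;
```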
From: Throop, D. R. (JSC-ER)[J. Technology] <dav...@na...> - 2011-07-12 17:02:20
|
Thanks! I would have responded faster, but your message wandered into my spam filter, where I just noticed it. I tore your code apart and was able to cache the computation that only affected one of the strings (the tokenization etc.) so that the only computation inside the 600 x 300 loop was the direct comparison. Eventually I'll ask NASA for the forms to allow me to post it on CPAN.

David Throop

-----Original Message-----
From: Ted Pedersen [mailto:tpederse@d.umn.edu]
Sent: Wednesday, June 15, 2011 4:38 PM
To: tex...@li...
Subject: Re: [text-similarity-users] Taking many similarity measurements between two corpora

Hi David,

Nice question, and unfortunately I don't think there is a particularly better way to do what you propose, other than a long series of pairwise comparisons. That said, I ran something of the same dimensionality that you want to do (600 x 300), and the following script took 2.5 hours on a 5 year old desktop... so if this isn't something you need to do on a regular basis, maybe it works out ok. Below is my timing output...

ted@linux-zxku:~> time bash runit.sh
real 156m55.322s
user 124m11.270s
sys 24m30.416s

And then there is the script I ran - I just took a file and made 600 individual 1 line files, and then did a bunch of pairwise similarities with our command line tool. Using the API would in effect result in the same thing...

ted@linux-zxku:~> more runit.sh
for line in {1..600..1}
do
  head -$line text | tail -1 > text.$line
done

for linea in {1..600..1}
do
  for lineb in {1..300..1}
  do
    text_similarity.pl --type Text::Similarity::Overlaps text.$linea text.$lineb >> text.output
  done
done

for line in {1..600..1}
do
  rm text.$line
done

I hope this helps... please feel free to let us know of any additional questions that might arise.

Cordially, Ted

On Tue, Jun 14, 2011 at 3:52 PM, Throop, David R. (JSC-ER)[Jacobs Technology] <dav...@na...> wrote:
> I need the pairwise similarity measurements between two corpora, MxN. I
> wondered about the efficiency of this in Text::Similarity.
>
> Here's the task. We're transitioning a piece of hardware from one program
> to another. The hardware was built to the old program's requirements
> (roughly 300 old requirements). The new program has its own requirements
> (roughly 600 requirements). Each requirement is ~100 words.
>
> I'm supporting a gap analysis. One task in the gap analysis can be stated as:
>
> For each old requirement, find up to 3 new requirements which are
> most similar to the old requirement.
>
> Example: Suppose I have an old requirement that reads "The Delivery-unit
> shall fold to a stowage volume that will fit within the Transport Bag
> dimensions of 48 by 20 by 14 inches and allow space for foam cushioning
> material." Then I want to find any new requirements that are talking about
> delivery-units, stowage volume, dimensions, transport bags or foam cushioning.
>
> To do this, I want the pairwise similarity scores between all the old and
> new requirements, roughly 300x600 = 180,000 comparisons. I suspect that invoking
> $score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);
> isn't the best way to do this. E.g., it would call sanitizeString on each
> old requirement 600 times.
>
> Am I missing something? Is there already a way to iterate efficiently over
> such a pair of corpora?
>
> David Throop
>
> _______________________________________________
> text-similarity-users mailing list
> tex...@li...
> https://lists.sourceforge.net/lists/listinfo/text-similarity-users

-- Ted Pedersen http://www.d.umn.edu/~tpederse |
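[Editor's note] The caching strategy David describes above (preprocess each string once, then run only the cheap comparison inside the MxN loop) can be sketched as follows. This is a hedged illustration, not his actual code: the `tokenize` and `overlap` helpers are hypothetical stand-ins, not part of the Text::Similarity API.

```perl
use strict;
use warnings;

# Tokenize a string once into a bag of lowercased words.
sub tokenize {
    my ($s) = @_;
    my %set;
    $set{$_}++ for grep { length } split /\W+/, lc $s;
    return \%set;
}

# Cheap pairwise comparison: count shared word types.
sub overlap {
    my ($x, $y) = @_;
    return scalar grep { exists $y->{$_} } keys %$x;
}

my @old_reqts = ('The unit shall fold for stowage', 'The bag holds foam');
my @new_reqts = ('Stowage volume shall fit the bag');

# Preprocess each requirement exactly once, NOT once per comparison.
my @old_tok = map { tokenize($_) } @old_reqts;
my @new_tok = map { tokenize($_) } @new_reqts;

my @score;
for my $i (0 .. $#old_tok) {
    for my $j (0 .. $#new_tok) {
        $score[$i][$j] = overlap($old_tok[$i], $new_tok[$j]);
    }
}
print "score[0][0] = $score[0][0]\n";
```

The design point is the same one David makes: the expensive per-string work moves outside the 600 x 300 loop, leaving only the direct comparison inside it.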
From: Ted P. <tpederse@d.umn.edu> - 2011-06-15 21:38:32
|
Hi David,

Nice question, and unfortunately I don't think there is a particularly better way to do what you propose, other than a long series of pairwise comparisons. That said, I ran something of the same dimensionality that you want to do (600 x 300), and the following script took 2.5 hours on a 5 year old desktop... so if this isn't something you need to do on a regular basis, maybe it works out ok. Below is my timing output...

ted@linux-zxku:~> time bash runit.sh
real 156m55.322s
user 124m11.270s
sys 24m30.416s

And then there is the script I ran - I just took a file and made 600 individual 1 line files, and then did a bunch of pairwise similarities with our command line tool. Using the API would in effect result in the same thing...

ted@linux-zxku:~> more runit.sh
for line in {1..600..1}
do
  head -$line text | tail -1 > text.$line
done

for linea in {1..600..1}
do
  for lineb in {1..300..1}
  do
    text_similarity.pl --type Text::Similarity::Overlaps text.$linea text.$lineb >> text.output
  done
done

for line in {1..600..1}
do
  rm text.$line
done

I hope this helps... please feel free to let us know of any additional questions that might arise.

Cordially, Ted

On Tue, Jun 14, 2011 at 3:52 PM, Throop, David R. (JSC-ER)[Jacobs Technology] <dav...@na...> wrote:
> I need the pairwise similarity measurements between two corpora, MxN. I
> wondered about the efficiency of this in Text::Similarity.
>
> Here's the task. We're transitioning a piece of hardware from one program
> to another. The hardware was built to the old program's requirements
> (roughly 300 old requirements). The new program has its own requirements
> (roughly 600 requirements). Each requirement is ~100 words.
>
> I'm supporting a gap analysis. One task in the gap analysis can be stated as:
>
> For each old requirement, find up to 3 new requirements which are
> most similar to the old requirement.
>
> Example: Suppose I have an old requirement that reads "The Delivery-unit
> shall fold to a stowage volume that will fit within the Transport Bag
> dimensions of 48 by 20 by 14 inches and allow space for foam cushioning
> material." Then I want to find any new requirements that are talking about
> delivery-units, stowage volume, dimensions, transport bags or foam cushioning.
>
> To do this, I want the pairwise similarity scores between all the old and
> new requirements, roughly 300x600 = 180,000 comparisons. I suspect that invoking
> $score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);
> isn't the best way to do this. E.g., it would call sanitizeString on each
> old requirement 600 times.
>
> Am I missing something? Is there already a way to iterate efficiently over
> such a pair of corpora?
>
> David Throop
>
> _______________________________________________
> text-similarity-users mailing list
> tex...@li...
> https://lists.sourceforge.net/lists/listinfo/text-similarity-users

-- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Throop, D. R. (JSC-ER)[J. Technology] <dav...@na...> - 2011-06-14 20:52:44
|
I need the pairwise similarity measurements between two corpora, MxN. I wondered about the efficiency of this in Text::Similarity.

Here's the task. We're transitioning a piece of hardware from one program to another. The hardware was built to the old program's requirements (roughly 300 old requirements). The new program has its own requirements (roughly 600 requirements). Each requirement is ~100 words.

I'm supporting a gap analysis. One task in the gap analysis can be stated as:

* For each old requirement, find up to 3 new requirements which are most similar to the old requirement.

Example: Suppose I have an old requirement that reads "The Delivery-unit shall fold to a stowage volume that will fit within the Transport Bag dimensions of 48 by 20 by 14 inches and allow space for foam cushioning material." Then I want to find any new requirements that are talking about delivery-units, stowage volume, dimensions, transport bags or foam cushioning.

To do this, I want the pairwise similarity scores between all the old and new requirements, roughly 300x600 = 180,000 comparisons. I suspect that invoking

$score->[278]->[459] = $mod->getSimilarity ($reqtOld_278, $reqtNew_459);

isn't the best way to do this. E.g., it would call sanitizeString on each old requirement 600 times.

Am I missing something? Is there already a way to iterate efficiently over such a pair of corpora?

David Throop |
From: Ted P. <tpederse@d.umn.edu> - 2011-01-02 00:11:36
|
Hi Sean,

I'm happy to report I think I figured this out. I use American spellings. :) The option is actually "normalize", whereas you were using "normalise", which I guess was just getting ignored (and we apparently aren't taking action when an invalid option is specified, which is a concern). I think when you make this change things will work out more as you expect.

ted@linux-qdw9:~> cat ts3.pl
my $str1 = "the dog bit Jim";
my $str2 = "jim bit the dog ";
my $laptool = "Text::Similarity::Overlaps";
eval "require $laptool";
if ($@) {die "\nWARNING ! $tool not loaded ..\n\n";}
my %lapopts = ('normalize' => 1, 'verbose' => 1); # 'verbose' = ++lesk-score
my $mod = $laptool->new(\%lapopts);
unless (defined($mod)) {print "FAILED '$laptool'\n"; return 0;}
$score = $mod->getSimilarityStrings ($str1, $str2);
print "score= $score\n\n";

ted@linux-qdw9:~> perl ts3.pl
keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1) -->'the dog' len(2) cnt(1)
wc 1: 4 wc 2: 4
Raw score: 4
Precision: 1
Recall : 1
F-measure: 1
Dice : 1
E-measure: 0
Cosine : 1
Raw lesk : 6
Lesk : 0.375
score= 1

ted@linux-qdw9:~> cat ts4.pl
my $str1 = "the dog bit Jim";
my $str2 = "jim bit the dog ";
my $laptool = "Text::Similarity::Overlaps";
eval "require $laptool";
if ($@) {die "\nWARNING ! $tool not loaded ..\n\n";}
my %lapopts = ('normalize' => 0, 'verbose' => 1); # 'verbose' = ++lesk-score
my $mod = $laptool->new(\%lapopts);
unless (defined($mod)) {print "FAILED '$laptool'\n"; return 0;}
$score = $mod->getSimilarityStrings ($str1, $str2);
print "score= $score\n\n";

ted@linux-qdw9:~> perl ts4.pl
keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1) -->'the dog' len(2) cnt(1)
score= 4

BTW, I very much agree with your suggestions for some methods to return particular values of scores. I'll see if we can't do something about that in the next few months, as others have made a similar point (as you point out).

Cordially, Ted

On Fri, Dec 31, 2010 at 11:41 PM, Sean <so...@or...> wrote:
> Sure, Ted, here it is:
>
> #-------------------------------------------- CODE ----------------------------------------------
> my $str1 = "the dog bit Jim";
> my $str2 = "jim bit the dog ";
> my $laptool = "Text::Similarity::Overlaps";
> eval "require $laptool";
> if ($@) {die "\nWARNING ! $tool not loaded ..\n\n";}
> #my %lapopts = ('normalise' => 0, 'verbose' => 1); # 'verbose' = ++lesk-score
> my %lapopts = ('normalise' => 1, 'verbose' => 1); # 'verbose' = ++lesk-score
> my $mod = $laptool->new(\%lapopts);
> unless (defined($mod)) {print "FAILED '$laptool'\n"; return 0;}
> $score = $mod->getSimilarityStrings ($str1, $str2);
> print "score= $score\n\n";
> #---------------------------------------END CODE ----------------------------------------------
>
> My guess is that self->verbose is not actually getting properly set via the options?
>
> regards
> Sean
>
> Ted Pedersen wrote:
> > Hi Sean,
> >
> > Thanks for your suggestions, let me take a look at those and see what
> > we might be able to do.
> >
> > And I'm sorry you are having some troubles. Can you go ahead and post
> > whatever code you are running to get these results? That will make it
> > easier to recreate the output.
> >
> > Cordially,
> > Ted
> >
> > On Fri, Dec 31, 2010 at 7:33 PM, Sean <so...@or...> wrote:
> > > Hello Ted
> > >
> > > I have installed v-0.08 and do not seem to get the results as documented.
> > >
> > > In order to get the Lesk-score I have tried setting the options 2
> > > different ways, both without luck.
> > > 1. ('normalise' => 0, 'verbose' => 1); # I expected this to work, not
> > > wanting the Lesk normalised ...
> > > 2. ('normalise' => 1, 'verbose' => 1); # also tried this just in case ...
> > >
> > > The COMPLETE screen-printed output from BOTH (using your doc example) is:
> > > "keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1)
> > > -->'the dog' len(2) cnt(1)"
> > >
> > > This is using the getSimilarityStrings ($str1, $str2) function directly
> > > from another script (getting a score of 4 returned there as expected).
> > >
> > > While at it I may as well mention what tops my wish-list for v-0.09. I
> > > would like to see additional simple wrapper functions like getLesk() and
> > > getCosine() which would return just the single measure specified, and
> > > getAll() which would return a hashref of 'named' parameters to include
> > > all provided measures, and which the other functions would be simple
> > > wrappers around to pull out one or other from that comprehensive
> > > getAll() hashref.
> > >
> > > This would avoid having to capture & parse output from stdout/stderr or
> > > some other arbitrary output channel, although it would probably do no
> > > harm to also "print" those measures. Since adding string (rather than
> > > file) acceptance obviously came as an afterthought itself, this might be
> > > the next logical extension to functionality. Looking at previous mailers
> > > I thought I detected similar requests, though expressed somewhat differently.
> > >
> > > Keep up the good work in 2011.
> > >
> > > Sean

-- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Sean <so...@or...> - 2011-01-01 05:41:26
|
Sure, Ted, here it is:

#-------------------------------------------- CODE ----------------------------------------------
my $str1 = "the dog bit Jim";
my $str2 = "jim bit the dog ";
my $laptool = "Text::Similarity::Overlaps";
eval "require $laptool";
if ($@) {die "\nWARNING ! $tool not loaded ..\n\n";}
#my %lapopts = ('normalise' => 0, 'verbose' => 1); # 'verbose' = ++lesk-score
my %lapopts = ('normalise' => 1, 'verbose' => 1); # 'verbose' = ++lesk-score
my $mod = $laptool->new(\%lapopts);
unless (defined($mod)) {print "FAILED '$laptool'\n"; return 0;}
$score = $mod->getSimilarityStrings ($str1, $str2);
print "score= $score\n\n";
#---------------------------------------END CODE ----------------------------------------------

My guess is that self->verbose is not actually getting properly set via the options?

regards
Sean

Ted Pedersen wrote:
> Hi Sean,
>
> Thanks for your suggestions, let me take a look at those and see what
> we might be able to do.
>
> And I'm sorry you are having some troubles. Can you go ahead and post
> whatever code you are running to get these results? That will make it
> easier to recreate the output.
>
> Cordially,
> Ted
>
> On Fri, Dec 31, 2010 at 7:33 PM, Sean <so...@or...> wrote:
> > Hello Ted
> >
> > I have installed v-0.08 and do not seem to get the results as documented.
> >
> > In order to get the Lesk-score I have tried setting the options 2
> > different ways, both without luck.
> > 1. ('normalise' => 0, 'verbose' => 1); # I expected this to work, not
> > wanting the Lesk normalised ...
> > 2. ('normalise' => 1, 'verbose' => 1); # also tried this just in case ...
> >
> > The COMPLETE screen-printed output from BOTH (using your doc example) is:
> > "keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1)
> > -->'the dog' len(2) cnt(1)"
> >
> > This is using the getSimilarityStrings ($str1, $str2) function directly
> > from another script (getting a score of 4 returned there as expected).
> >
> > While at it I may as well mention what tops my wish-list for v-0.09. I
> > would like to see additional simple wrapper functions like getLesk(),
> > getCosine() which would return just the single measure specified, and
> > getAll() which would return a hashref of 'named' parameters to include
> > all provided measures, and which the other functions would be simple
> > wrappers around to pull out one or other from that comprehensive
> > getAll() hashref.
> >
> > This would avoid having to capture & parse output from stdout/stderr or
> > some other arbitrary output channel, although it would probably do no
> > harm to also "print" those measures. Since adding string (rather than
> > file) acceptance obviously came as an afterthought itself, this might be
> > the next logical extension to functionality. Looking at previous mailers
> > I thought I detected similar requests, though expressed somewhat differently.
> >
> > Keep up the good work in 2011.
> >
> > Sean
> >
> > _______________________________________________
> > text-similarity-users mailing list
> > tex...@li...
> > https://lists.sourceforge.net/lists/listinfo/text-similarity-users |
From: Ted P. <tpederse@d.umn.edu> - 2011-01-01 05:13:00
|
Hi Sean,

Thanks for your suggestions, let me take a look at those and see what we might be able to do.

And I'm sorry you are having some troubles. Can you go ahead and post whatever code you are running to get these results? That will make it easier to recreate the output.

Cordially, Ted

On Fri, Dec 31, 2010 at 7:33 PM, Sean <so...@or...> wrote:
> Hello Ted
>
> I have installed v-0.08 and do not seem to get the results as documented.
>
> In order to get the Lesk-score I have tried setting the options 2
> different ways, both without luck.
> 1. ('normalise' => 0, 'verbose' => 1); # I expected this to work, not
> wanting the Lesk normalised ...
> 2. ('normalise' => 1, 'verbose' => 1); # also tried this just in case ...
>
> The COMPLETE screen-printed output from BOTH (using your doc example) is:
> "keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1)
> -->'the dog' len(2) cnt(1)"
>
> This is using the getSimilarityStrings ($str1, $str2) function directly
> from another script (getting a score of 4 returned there as expected).
>
> While at it I may as well mention what tops my wish-list for v-0.09. I
> would like to see additional simple wrapper functions like getLesk(),
> getCosine() which would return just the single measure specified, and
> getAll() which would return a hashref of 'named' parameters to include
> all provided measures, and which the other functions would be simple
> wrappers around to pull out one or other from that comprehensive
> getAll() hashref.
>
> This would avoid having to capture & parse output from stdout/stderr or
> some other arbitrary output channel, although it would probably do no
> harm to also "print" those measures. Since adding string (rather than
> file) acceptance obviously came as an afterthought itself, this might be
> the next logical extension to functionality. Looking at previous mailers
> I thought I detected similar requests, though expressed somewhat differently.
>
> Keep up the good work in 2011.
>
> Sean

-- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Sean <so...@or...> - 2011-01-01 02:07:19
|
Hello Ted

I have installed v-0.08 and do not seem to get the results as documented.

In order to get the Lesk-score I have tried setting the options 2 different ways, both without luck.
1. ('normalise' => 0, 'verbose' => 1); # I expected this to work, not wanting the Lesk normalised ...
2. ('normalise' => 1, 'verbose' => 1); # also tried this just in case ...

The COMPLETE screen-printed output from BOTH (using your doc example) is:
"keys: 3 -->'bit' len(1) cnt(1) -->'jim' len(1) cnt(1) -->'the dog' len(2) cnt(1)"

This is using the getSimilarityStrings ($str1, $str2) function directly from another script (getting a score of 4 returned there as expected).

While at it I may as well mention what tops my wish-list for v-0.09. I would like to see additional simple wrapper functions like getLesk() and getCosine(), which would return just the single measure specified, and getAll(), which would return a hashref of 'named' parameters to include all provided measures, and which the other functions would be simple wrappers around to pull out one or other from that comprehensive getAll() hashref.

This would avoid having to capture & parse output from stdout/stderr or some other arbitrary output channel, although it would probably do no harm to also "print" those measures. Since adding string (rather than file) acceptance obviously came as an afterthought itself, this might be the next logical extension to functionality. Looking at previous mailers I thought I detected similar requests, though expressed somewhat differently.

Keep up the good work in 2011.

Sean |
From: Ted P. <tpederse@d.umn.edu> - 2010-06-13 15:55:13
|
We are pleased to announce the release of version 0.08 of Text-Similarity. This version has one important change: when you are using a stoplist, you can now specify stop words using regular expressions.

In previous versions a stoplist could be specified as follows (in a single file, one line per word):

a
of
in

This will cause "a", "of" and "in" to be treated as stop words (and not used in computing similarity). As of 0.08 you may continue to use the above format, or you can use regular expressions. For example...

/\b\w\b/
/\b\d+\b/

...would cause all single-character words and numeric values to be removed.

You can get this new version via CPAN or sourceforge - find links to both at: http://text-similarity.sourceforge.net

Enjoy, Ted and Ying -- Ted Pedersen http://www.d.umn.edu/~tpederse |
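[Editor's note] The regex stoplist format described above might be used from the command line roughly like this. A hedged sketch: the file names are illustrative, and the --stoplist flag of text_similarity.pl is assumed from the announcement rather than verified here.

```shell
# Write a stoplist mixing the two regex patterns from the announcement:
# drop single-character words and purely numeric tokens.
cat > stoplist.txt <<'EOF'
/\b\w\b/
/\b\d+\b/
EOF

# Then (assuming Text-Similarity 0.08+ is installed), pass it to the
# command-line tool along with the two files to compare:
# text_similarity.pl --type Text::Similarity::Overlaps \
#                    --stoplist stoplist.txt file1.txt file2.txt
```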
From: Ted P. <dul...@gm...> - 2009-03-09 04:07:29
|
Thanks for reporting this James...this error seems to only occur with this particular version of Windows...

http://www.cpantesters.org/show/Text-Similarity.html#0.07

As a result, I'm inclined to say this looks more like a problem with that version of Windows than it does Text-Similarity. The specific error that you are seeing is for one test case that checks to make sure that the order of the words in a string doesn't affect the score...

the big cat
the cat big

Should get the same similarity score... Here is the specific test that is failing...

## this test case was causing trouble for Windows - changed in 0.07
##$output = `$^X $inc $text_similarity_pl --type Text::Similarity::Overlaps --string 'sir winston churchill' 'winston churchill SIR!!!' `;

$output = `$^X $inc $text_similarity_pl --type Text::Similarity::Overlaps --string 'sir winston churchill' 'winston churchill sir' `;
chomp $output;
is ($output, 1, "order doesn't affect score");

In fact the punctuation in the commented version seemed to cause problems earlier with Windows, not sure what's happening here.... While this might be somewhat risky, you may simply want to force the install and/or comment out this test case to get things installed ok... I wish I could shed more light on this - I normally don't use Windows so I'm not in a really good position to test things out, etc., so any observations you might have would be greatly appreciated.

Thanks!
Ted

On Sun, Mar 1, 2009 at 8:06 PM, James F. Mahon III via RT <bug...@rt...> wrote:
> Sun Mar 01 20:06:48 2009: Request 43758 was acted upon.
> Transaction: Ticket created by jam...@gm...
>        Queue: Text-Similarity
>      Subject: installation error
>    Broken in: (no value)
>     Severity: (no value)
>        Owner: Nobody
>   Requestors: jam...@gm...
>       Status: new
>  Ticket <URL: https://rt.cpan.org/Ticket/Display.html?id=43758 >
>
> Hello,
>
> I attempted to install Text::Similarity-0.07, but "nmake install"
> returned an error that I don't know how to address. I've pasted the
> error below. I'm attempting this installation with Perl v5.8.8 built for
> MSWin32-x86-multi-thread. Here are my locally applied patches, attained
> by running perl -V.
>
> Locally applied patches:
>   ActivePerl Build 822 [280952]
>   Iin_load_module moved for compatibility with build 806
>   PerlEx support in CGI::Carp
>   Less verbose ExtUtils::Install and Pod::Find
>   Patch for CAN-2005-0448 from Debian with modifications
>   Rearrange @INC so that 'site' is searched before 'perl'
>   Partly reverted 24733 to preserve binary compatibility
>   MAINT31223 plus additional changes
>   31490 Problem bootstraping Win32CORE
>   31324 Fix DynaLoader::dl_findfile() to locate .so files again
>   31214 Win32::GetLastError fails when first called
>   31211 Restore Windows NT support
>   31188 Problem killing a pseudo-forked child on Win32
>   29732 ANSIfy the PATH environment variable on Windows
>   27527,29868 win32_async_check() can loop indefinitely
>   26970 Make Passive mode the default for Net::FTP
>   26379 Fix alarm() for Windows 2003
>   24699 ICMP_UNREACHABLE handling in Net::Ping
>
> Can you offer any advice?
>
> Best,
>
> James
>
> Microsoft Windows XP [Version 5.1.2600]
> (C) Copyright 1985-2001 Microsoft Corp.
>
> P:\LWP\Text-Similarity-0.07>perl makefile.pl
> Checking if your kit is complete...
> Looks good
> Writing Makefile for Text::Similarity
>
> P:\LWP\Text-Similarity-0.07>nmake
>
> Microsoft (R) Program Maintenance Utility Version 1.50
> Copyright (c) Microsoft Corp 1988-94. All rights reserved.
>
>         cp lib/Text/Similarity.pm blib\lib\Text\Similarity.pm
>         cp lib/Text/OverlapFinder.pm blib\lib\Text\OverlapFinder.pm
>         cp lib/Text/Similarity/Overlaps.pm blib\lib\Text\Similarity\Overlaps.pm
>         C:\Perl\bin\perl.exe -MExtUtils::Command -e cp bin/text_similarity.pl blib\script\text_similarity.pl
>         pl2bat.bat blib\script\text_similarity.pl
>
> P:\LWP\Text-Similarity-0.07>nmake test
>
> Microsoft (R) Program Maintenance Utility Version 1.50
> Copyright (c) Microsoft Corp 1988-94. All rights reserved.
>
>         C:\Perl\bin\perl.exe -MExtUtils::Command -e cp bin/text_similarity.pl blib\script\text_similarity.pl
>         pl2bat.bat blib\script\text_similarity.pl
>         C:\Perl\bin\perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib\lib', 'blib\arch')" t/*.t
> t/getsimilaritystrings......ok
> t/no-normalize..............ok
> t/normalize.................ok
> t/overlaps..................ok
> t/text_similarity...........ok
> t/text_similarity_string....ok 5/8
> #   Failed test 'order doesn't affect score'
> #   at t/text_similarity_string.t line 69.
> t/text_similarity_string....NOK 8/8
> #          got: '0'
> #     expected: '1'
> # Looks like you failed 1 test of 8.
> t/text_similarity_string....dubious
>         Test returned status 1 (wstat 256, 0x100)
> DIED. FAILED test 8
>         Failed 1/8 tests, 87.50% okay
> Failed Test                Stat Wstat Total Fail  List of Failed
> -------------------------------------------------------------------------------
> t/text_similarity_string.t    1   256     8    1  8
> Failed 1/6 test scripts. 1/130 subtests failed.
> Files=6, Tests=130,  6 wallclock secs ( 0.00 cusr +  0.00 csys =  0.00 CPU)
> Failed 1/6 test programs. 1/130 subtests failed.
> NMAKE : fatal error U1077: 'C:\WINDOWS\system32\cmd.exe' : return code '0x1'
> Stop.
>
> --
> James Mahon
> Research Professional
> Becker Center, Chicago Booth, University of Chicago
> Tel: 773.834.7369
> Fax: 773.834.3040
> Email: jm...@ch...

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
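The property the failing test checks — that reordering the words leaves the Overlaps score unchanged — follows from bag-of-words overlap counting. A minimal Python sketch of that idea (an illustrative reimplementation of the F-measure scoring, not the Perl module itself):

```python
from collections import Counter

def overlap_fmeasure(s1: str, s2: str) -> float:
    """Count shared words without regard to order, then normalize with
    the F-measure (2 * P * R / (P + R)), mirroring the scoring that
    Text::Similarity::Overlaps is documented to use."""
    c1, c2 = Counter(s1.split()), Counter(s2.split())
    raw = sum((c1 & c2).values())  # multiset intersection: words in common
    if raw == 0:
        return 0.0
    precision = raw / sum(c2.values())
    recall = raw / sum(c1.values())
    return 2 * precision * recall / (precision + recall)

# Word order does not matter, so both pairs score a perfect 1.0:
print(overlap_fmeasure("sir winston churchill", "winston churchill sir"))  # 1.0
print(overlap_fmeasure("the big cat", "the cat big"))                      # 1.0
```

Punctuation handling (the "SIR!!!" variant in the commented-out test) is exactly where the module's own normalization comes into play; this sketch deliberately leaves that out.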
From: Antonio T. <Ant...@il...> - 2009-01-25 19:01:20
|
Hi Ted, finally I figured out how to do it. I've changed this part of the code:

String command = "text_similarity.pl --type=Text::Similarity::Overlaps --string \"" + string1 + "\" \"" + string2 + "\"";
Process p = Runtime.getRuntime().exec(command);

with:

Vector<String> command = new Vector<String>();
command.add("text_similarity.pl");
command.add("--type=Text::Similarity::Overlaps");
command.add("--string");
command.add(string1);
command.add(string2);
ProcessBuilder pb = new ProcessBuilder(command);
Process p = pb.start();

and now it works smoothly. Hope it can be useful for someone.

Regards,
Antonio

> Hi Antonio,
>
> I'm afraid I have very little experience with Java, so I don't really
> know how to include Perl in a Java program. I do know that there are
> Perl modules that let you do the opposite, that is include Java in
> Perl programs .... this is most commonly done with Inline::Java, which
> can be found here :
>
> http://search.cpan.org/~patl/Inline-Java/
>
> I don't know if that would give any ideas about how to include Perl in
> Java, but it's about the only thing I could think of to mention.
>
> Please do let us know if you figure this out, seems like potentially a
> very useful technique.
>
> Cordially,
> Ted
|
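The fix works because of a general distinction: Java's Runtime.exec(String) tokenizes the command on whitespace without honoring quotes, while ProcessBuilder passes each list element to the child as exactly one argument. The same contrast can be sketched in Python (hypothetical command string, just to show the tokenization):

```python
import subprocess
import sys

string1 = "sir winston churchill"

# Naive whitespace tokenization, essentially what
# Runtime.getRuntime().exec(String) does: the quoted phrase shatters.
naive = f'text_similarity.pl --string "{string1}"'.split()
print(len(naive))  # 5 tokens -- the quotes are not honored

# Argument-vector form (ProcessBuilder's behavior): the child process
# receives the multi-word string as a single argv entry.
out = subprocess.run(
    [sys.executable, "-c", "import sys; print(len(sys.argv[1:]))", string1],
    capture_output=True, text=True,
)
print(out.stdout.strip())  # 1
```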
From: Ted P. <dul...@gm...> - 2009-01-20 23:49:14
|
Hi Yashar,

Thanks for your questions - see my responses inline...

On Tue, Jan 20, 2009 at 9:31 AM, Yashar Mehdad <yas...@ya...> wrote:
> Dear Ted
>
> I'm using your package in some of my experiments and in order to cite it I
> need a few clarifications :
>
> while "text_similarity.pl --type=Text::Similarity::Overlaps file1 file2" is
> executed a normalized measure is obtained. what I understood from your
> documentation this measure is raw normalized (F-measure = 2 * precision *
> recall / (precision + recall)), is that right?

Correct. Consider the following example...

ted@ted-desktop:~$ text_similarity.pl test1 test2 --type=Text::Similarity::Overlaps --no-normalize
5
ted@ted-desktop:~$ text_similarity.pl test1 test2 --type=Text::Similarity::Overlaps
0.555555555555556
ted@ted-desktop:~$ more test1
this is test1 i am happy he is not
ted@ted-desktop:~$ more test2
this is test2 i am hungry she is sad

The --no-normalize run shows that 5 words have matched (without regard to order or length of phrase).

> while "text_similarity.pl --type=Text::Similarity::Overlaps --no-normalize
> file1 file2" is executed, the output would be a simple raw score of overlap
> not lesk raw score? is it right?

Correct! There is no "bonus" for phrasal matching in the overlap scoring.

> is there any way in which by using text_similarity.pl one can reach the lesk
> measure through defining any option? (I'm aware that by defining the verbose
> option we could get all measures, but is there any way that directly leads us
> to the lesk measure?)

Not from the command line, however, you could edit Overlaps.pm to just output lesk.... Here's the relevant snippet, where I've added comments...

    if ($self->verbose) {
    #    print " Raw score: $score\n";
    #    print " Precision: $prec\n";
    #    print " Recall   : $recall\n";
    #    print " F-measure: $f\n";
    #    my $dice = 2 * $score / ($wc1 + $wc2);
    #    print " Dice     : $dice\n";
    #    my $e = 1 - $f;
    #    print " E-measure: $e\n";
    #    my $cos = $score / sqrt ($wc1 * $wc2);
    #    print " Cosine   : $cos\n";
         my $lesk = $raw_lesk / ($wc1 * $wc2);
    #    print " Raw lesk : $raw_lesk\n";
         print " Lesk     : $lesk\n";
    }

I know that's a bit messy, but should be a fairly easy fix in the short term at least...

I hope this helps!
Ted

> Thanks in advance for your reply and help.
>
> Best regards
> Yashar.

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
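The numbers in the example above can be checked by hand. A short Python sketch of the arithmetic (an illustrative reimplementation of the documented formulas, not the Perl module):

```python
from collections import Counter
from math import isclose

test1 = "this is test1 i am happy he is not".split()
test2 = "this is test2 i am hungry she is sad".split()

# Raw score: matching words counted without regard to order
# ('is' occurs twice in each file, so it matches twice).
raw = sum((Counter(test1) & Counter(test2)).values())

# Default normalized score: F-measure = 2 * P * R / (P + R).
precision = raw / len(test2)
recall = raw / len(test1)
f_measure = 2 * precision * recall / (precision + recall)

print(raw)        # 5, the --no-normalize output
print(f_measure)  # 0.5555..., the default output
```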
From: Yashar M. <yas...@ya...> - 2009-01-20 15:31:57
|
Dear Ted

I'm using your package in some of my experiments and in order to cite it I need a few clarifications:

While "text_similarity.pl --type=Text::Similarity::Overlaps file1 file2" is executed, a normalized measure is obtained. What I understood from your documentation is that this measure is the raw score normalized (F-measure = 2 * precision * recall / (precision + recall)) - is that right?

While "text_similarity.pl --type=Text::Similarity::Overlaps --no-normalize file1 file2" is executed, the output would be a simple raw score of overlap, not a raw lesk score - is that right?

Is there any way in which, by using text_similarity.pl, one can reach the lesk measure through defining any option? (I'm aware that by defining the verbose option we could get all measures, but is there any way that directly leads us to the lesk measure?)

Thanks in advance for your reply and help.

Best regards
Yashar.
|
From: Ted P. <dul...@gm...> - 2009-01-20 14:54:26
|
Hi Antonio,

I'm afraid I have very little experience with Java, so I don't really know how to include Perl in a Java program. I do know that there are Perl modules that let you do the opposite, that is include Java in Perl programs .... this is most commonly done with Inline::Java, which can be found here :

http://search.cpan.org/~patl/Inline-Java/

I don't know if that would give any ideas about how to include Perl in Java, but it's about the only thing I could think of to mention.

Please do let us know if you figure this out, seems like potentially a very useful technique.

Cordially,
Ted

On Fri, Jan 16, 2009 at 11:32 AM, Antonio Toral <ant...@il...> wrote:
> hi,
>
> i'd like to use text-similarity from a java program. So from this program I
> call text-similarity and then I capture its output. This is the java code:
>
> String command = "text_similarity.pl --type=Text::Similarity::Overlaps --string \"" + string1 + "\" \"" + string2 + "\"";
> String result = "";
> try {
>     Process p = Runtime.getRuntime().exec(command);
>     int command_exit = p.waitFor();
>     System.err.println("Command ended with value: " + command_exit);
>
>     BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()));
>     BufferedReader stdError = new BufferedReader(new InputStreamReader(p.getErrorStream()));
>     String s = null;
>     while ((s = stdInput.readLine()) != null) {
>         System.out.println("\treading result: " + s);
>         result += s;
>     }
>     while ((s = stdError.readLine()) != null) {
>         System.out.println("\tERROR: " + s);
>     }
> }
> catch (IOException e) {
>     e.printStackTrace();
>     System.exit(-1);
> }
> catch (InterruptedException i) {
>     i.printStackTrace();
>     System.exit(-2);
> }
>
> However it does not work! I just get the string "0".
>
> If I run text-similarity from the command line it works fine. If I call from
> my java program a shell command (like "ls") or a "hello world" perl script
> then I can capture its output, so I guess the problem is somehow related to
> the way text-similarity buffers its output. I've read of people having this
> kind of issue when calling perl scripts from java, and someone proposes as a
> solution to put this at the beginning of perl scripts:
>
> use IO::Handle;
> STDOUT->autoflush(1);
> STDERR->autoflush(1);
>
> [http://forums.sun.com/thread.jspa?threadID=189595&forumID=31]
>
> however, I've also tried to put this at the beginning of text_similarity.pl
> but without luck!
>
> Does anyone know how I should do this?
>
> Thanks in advance,
> Antonio Toral

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
From: Antonio T. <ant...@il...> - 2009-01-16 18:15:20
|
hi,

i'd like to use text-similarity from a java program. So from this program I call text-similarity and then I capture its output. This is the java code:

String command = "text_similarity.pl --type=Text::Similarity::Overlaps --string \"" + string1 + "\" \"" + string2 + "\"";
String result = "";
try {
    Process p = Runtime.getRuntime().exec(command);
    int command_exit = p.waitFor();
    System.err.println("Command ended with value: " + command_exit);

    BufferedReader stdInput = new BufferedReader(new InputStreamReader(p.getInputStream()));
    BufferedReader stdError = new BufferedReader(new InputStreamReader(p.getErrorStream()));
    String s = null;
    while ((s = stdInput.readLine()) != null) {
        System.out.println("\treading result: " + s);
        result += s;
    }
    while ((s = stdError.readLine()) != null) {
        System.out.println("\tERROR: " + s);
    }
}
catch (IOException e) {
    e.printStackTrace();
    System.exit(-1);
}
catch (InterruptedException i) {
    i.printStackTrace();
    System.exit(-2);
}

However it does not work! I just get the string "0".

If I run text-similarity from the command line it works fine. If I call from my java program a shell command (like "ls") or a "hello world" perl script then I can capture its output, so I guess the problem is somehow related to the way text-similarity buffers its output. I've read of people having this kind of issue when calling perl scripts from java, and someone proposes as a solution to put this at the beginning of perl scripts:

use IO::Handle;
STDOUT->autoflush(1);
STDERR->autoflush(1);

[http://forums.sun.com/thread.jspa?threadID=189595&forumID=31]

however, I've also tried to put this at the beginning of text_similarity.pl but without luck!

Does anyone know how I should do this?

Thanks in advance,
Antonio Toral
|
From: Ted P. <tpederse@d.umn.edu> - 2008-11-21 13:37:50
|
Hi Hamed,

Thanks for your interest in Text-Similarity. See my comments inline...

On Fri, Nov 21, 2008 at 1:13 AM, <kha...@pe...> wrote:
> Dear Dr. Pedersen
>
> I need to use the Text::Similarity module in my project so I wanted to know:
>
> 1- How to extract only the Lesk measurement from Text::Similarity, in case
> I need only that one, since when I put ('normalize' => 1, 'verbose' => 1)
> it gives a range of measures and I need only Lesk to put in my program.
> Is there any extra function for that? In other words, I need to put the
> Lesk measure into my $score to be used for calculating sentence semantic
> relatedness in my project.

The following code (from http://search.cpan.org/dist/Text-Similarity/lib/Text/Similarity.pm) will give you just the Lesk (overlap) measure. This is not actually the same thing as "semantic relatedness", so if you are interested in that you might want to look at the lesk measure as found in WordNet::Similarity, which is based on the use of Text::Similarity but does some other things too. Text::Similarity (for lesk) simply finds the overlaps between two strings or files.

use Text::Similarity::Overlaps;
my $mod = Text::Similarity::Overlaps->new;
defined $mod or die "Construction of Text::Similarity::Overlaps failed";

# adjust file names to reflect true relative position
# these paths are valid from lib/Text/Similarity
my $text_file1 = 'sent11.txt';
my $text_file2 = 'sent21.txt';

my $score = $mod->getSimilarity ($text_file1, $text_file2);

print "The similarity of $text_file1 and $text_file2 is : $score\n";

> 2- How to cite the work in my project?

You could use the following type of reference :

Pedersen, Ted (2008) Text-Similarity (version 0.07) : A Perl Module to
Measure the Pair-Wise Similarity of Files or Strings
http://search.cpan.org/dist/Text-Similarity/

Good Luck!
Ted

> Thank you very much.
>
> Your attention would be most appreciated.
>
> Hamed Khanpour
> MCS student.
> Malaysia

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
From: Ted P. <dul...@gm...> - 2008-11-16 00:04:24
|
We are pleased to announce the release of version 0.07 of Text-Similarity. This release has a single fix to a test case that has caused trouble for Windows installation, so you should only worry about upgrading if you are using Windows, or if you are using a version less than 0.06 (which had a number of significant changes). You can find download links from CPAN and sourceforge at http://text-similarity.sourceforge.net Please let us know if you have any questions or concerns! Cordially, Ted -- Ted Pedersen http://www.d.umn.edu/~tpederse |
From: Ted P. <tpederse@d.umn.edu> - 2008-10-25 16:02:05
|
Hi Karthick,

I'm glad to know you are finding Text::Similarity useful... I think the main documentation we have about these measures is found here :

http://search.cpan.org/dist/Text-Similarity/lib/Text/Similarity/Overlaps.pm

This gives the formulas that we use in the program - I think in general these are pretty commonly accepted definitions (except perhaps for lesk) so we didn't elaborate a great deal on them. However, I'm happy to add some details as needed. The lesk measure in terms of the overlap counting, etc. that we do is probably best described here (in section 7.3):

An Adapted Lesk Algorithm for Word Sense Disambiguation using WordNet (Banerjee and Pedersen) - Appears in the Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, pp. 136-145, February 17-23, 2002, Mexico City.
http://www.d.umn.edu/~tpederse/Pubs/cicling2002-b.pdf

The other measures I *think* are fairly standard, although if you have doubts about what we have done with them let me know and I can hopefully clarify.

Thanks!
Ted

On Sat, Oct 25, 2008 at 10:45 AM, Karthick Jayaraman <kar...@gm...> wrote:
> Dear Professor,
>
> I am using your Text::Similarity package in one of my current projects. Is
> there any documentation on the details of the metrics such as
> F-Measure, Precision, Recall, Cosine, and Lesk? Kindly let me know.
>
> We are currently using your package to establish the similarity of
> JavaScript programs that undergo certain forms of minor dynamic
> updates.
>
> We would like to cite your package and the reference on the metrics.
>
> --
> Cheers!,
> Karthick Jayaraman
>
> You must do the things you think you cannot do.
> Eleanor Roosevelt
>
> http://web.syr.edu/~kjayaram

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
From: Ted P. <dul...@gm...> - 2008-04-06 14:50:10
|
We are pleased to announce the release of version 0.06 of Text-Similarity. This is a module that WordNet-Similarity uses in the computation of the lesk measure, and one of the new features in this release is providing a "lesk" score that does our calculation for "lesk overlap" for any pair of files or strings you provide to it.

As you may recall, the lesk measure takes glosses and compares them for overlaps (matches) and then scores them by taking the length of each phrasal match, squaring it, and then summing those scores. Consider the following example (line breaks introduced for clarity) which measures the two given strings for similarity:

text_similarity.pl --type Text::Similarity::Overlaps --verbose
    --stoplist stoplist.txt
    --string 'winston churchill was the prime minister of england'
             'prime minister of england winston churchill came for a visit that day'

keys: 2
-->'prime minister england' len(3) cnt(1)
-->'winston churchill' len(2) cnt(1)
wc 1: 5
wc 2: 7
 Raw score: 5
 Precision: 0.714285714285714
 Recall   : 1
 F-measure: 0.833333333333333
 Dice     : 0.833333333333333
 E-measure: 0.166666666666667
 Cosine   : 0.845154254728517
 Raw lesk : 13
 Lesk     : 0.371428571428571
0.833333333333333

We find two phrasal matches of length 2 and 3, so those are scored (by raw lesk) as 2^2 + 3^2 = 13. That is then scaled by the product of the two string lengths to arrive at a normalized lesk score. By default WordNet-Similarity uses raw lesk.

Note that the raw score is simply the number of matching words (prime minister england winston churchill) without regard to their order, and that this value is the basis of all the other measures except for raw lesk and lesk. So, of the measures above, only lesk is really considering phrasal matches and treats them differently.

This package provides both a command line program (text_similarity.pl) and Perl API calls (examples in the SYNOPSIS sections of the CPAN documentation).
You can find more information and download links at http://text-similarity.sourceforge.net

I'm sure we'll continue to tinker with and extend Text-Similarity, so please do let us know of any suggestions you have.

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
|
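The lesk arithmetic in the announcement can be reproduced from the two overlap lengths alone. A Python sketch of the scoring described above (illustrative only — the real computation lives in the Perl module):

```python
from math import isclose

# From the example: phrasal overlaps 'prime minister england' (length 3)
# and 'winston churchill' (length 2); word counts after stoplist removal.
overlap_lengths = [3, 2]
wc1, wc2 = 5, 7

raw_score = sum(overlap_lengths)                # 5 matching words
raw_lesk = sum(n * n for n in overlap_lengths)  # 3^2 + 2^2 = 13
lesk = raw_lesk / (wc1 * wc2)                   # normalized: 13 / 35

# The default reported score is the F-measure over the raw (unsquared) count.
precision, recall = raw_score / wc2, raw_score / wc1
f_measure = 2 * precision * recall / (precision + recall)

print(raw_lesk)             # 13
print(round(lesk, 6))       # 0.371429
print(round(f_measure, 6))  # 0.833333
```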
You can find more info and find download links at http://text-similarity.sourceforge.net I'm sure we'll continue to tinker with and extend Text Similarity, so please do let us know of any suggestions you have. Enjoy, Ted -- Ted Pedersen http://www.d.umn.edu/~tpederse |