Archive: 2004: Nov (2) | 2008: Apr (3), Sep (1) | 2010: Jun (1) | 2013: Jan (1), Jun (1) | 2015: Oct (1)
From: Ted P. <dul...@gm...> - 2015-10-08 00:54:29
We are pleased to announce the release of version 0.11 of Text::Similarity. This includes a few fixes and corrections supplied by users (for which we are always most grateful!). You can download the new version from CPAN or sourceforge via links found at http://text-similarity.sourceforge.net. Below is the change log for this release. Finally, we are very open to other patches or ideas that users have, so please feel free to let us know!

0.11 Released October 6, 2015 (all changes by TDP)

* Contributed enhancement by Tani Hosokawa. Not a bug, but an optimization: the original version does an inefficient repeated linear search over text that can't possibly match. Instead, this precaches the locations of keywords. Comparing 100 semi-randomly generated, fairly similar documents of about 500 words each results in an approximately 90% speed increase, and the efficiency gain grows as the documents get larger.
  https://rt.cpan.org/Public/Ticket/Attachment/999948/520850

* Made various documentation/typo fixes as suggested by Alex Becker. Found in the CPAN bug list.

Enjoy,
Ted
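The precaching idea named in the change log can be sketched as follows. This is a standalone illustration of the technique, with hypothetical names, not the actual patch applied to Text::Similarity: instead of rescanning one text for every word of the other, we index word positions once and replace each linear scan with a hash lookup.

```perl
#!/usr/bin/perl
# Hypothetical sketch of the precaching optimization described above --
# not the actual Text::Similarity internals.
use strict;
use warnings;

# Build a hash mapping each word to the list of positions where it occurs.
sub index_positions {
    my @words = @_;
    my %where;
    push @{ $where{ $words[$_] } }, $_ for 0 .. $#words;
    return \%where;
}

# Count word tokens of the first text that occur anywhere in the second,
# using the precomputed index: each check is a hash access rather than a
# linear search over the whole second text.
sub overlap_count {
    my ($words1, $index2) = @_;
    my $count = 0;
    $count++ for grep { exists $index2->{$_} } @$words1;
    return $count;
}

my @doc1   = qw(the cat sat on the mat);
my @doc2   = qw(a cat lay on a rug);
my $index2 = index_positions(@doc2);
print overlap_count(\@doc1, $index2), "\n";   # 2 ("cat" and "on")
```

The win is exactly the one the ticket describes: lookups that can't possibly match cost a single hash probe instead of a pass over the other document, so the saving grows with document size.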
From: Ted P. <tpederse@d.umn.edu> - 2013-06-27 12:17:08
We are pleased to announce the release of version 0.10 of Text-Similarity. This release only includes a single fix, and that is a change to a test case that fails on Windows. Unless this sort of thing really bothers you, you probably don't need to update. :) You can find the most current version on CPAN or at sourceforge: http://text-similarity.sourceforge.net

However, there is a more important announcement: as of 0.10, Text-Similarity is again current in our sourceforge cvs archive. There were some transitions happening at sourceforge when 0.09 came out, so we did not use cvs then. But we are back to using cvs now, and it is always available for viewing or modifying if you are interested. Note that the cvs module name is now TS. As of now the web view hasn't been updated to include this new directory, but that should occur in the next day or two. Additional instructions on using cvs are available at sourceforge: http://sourceforge.net/p/text-similarity/code/?source=navbar

Enjoy, and please let us know if any questions arise.

Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: Ted P. <tpederse@d.umn.edu> - 2013-01-22 21:01:04
Version 0.09 of Text::Similarity has been released on CPAN and sourceforge. This release includes two user contributions (which are very much appreciated). See details below, and feel free to download from http://text-similarity.sourceforge.net

0.09 Released January 22, 2013

* This release includes changes contributed by Myroslava Dzikovska that provide the full set of similarity scores programmatically. She modified the interface so that the getSimilarity function returns a pair ($score, %allScores), where %allScores is a hash of all possible scores that it computes. In scalar context it returns only $score, so it is fully backwards compatible with the older versions. She also changed the printing to STDERR, to make it easier to use the code in filter scripts that depend on STDIN/STDOUT.

* This release also includes changes contributed by Nathan Glen to allow test cases to pass on Windows. The single quote used previously caused arguments to the script not to be passed correctly, leading to test failures. The single quotes have been changed to double quotes.

Enjoy,
Ted
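The backwards-compatible, context-sensitive return described above is a standard Perl idiom built on wantarray. Here is a minimal standalone sketch; the score names and values are made up for illustration, and this is not the module's own code:

```perl
#!/usr/bin/perl
# Sketch of a context-sensitive return like the 0.09 interface change:
# list context yields ($score, %allScores), scalar context just $score.
use strict;
use warnings;

sub getSimilarity {
    # Hypothetical fixed scores, for illustration only.
    my %allScores = (raw => 3, lesk => 0.75, cosine => 0.81);
    my $score = $allScores{raw};
    # wantarray is true in list context, false in scalar context.
    return wantarray ? ($score, %allScores) : $score;
}

my $just_score = getSimilarity();          # scalar context: 3
my ($score, %scores) = getSimilarity();    # list context
print "$just_score $score $scores{cosine}\n";   # 3 3 0.81
```

Because older callers assign the result to a scalar, they keep getting a single number, while new callers can unpack the full hash of scores.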
From: Ted P. <tpederse@d.umn.edu> - 2010-06-13 15:55:13
We are pleased to announce the release of version 0.08 of Text-Similarity. This version includes one important change: when you are using a stoplist, you can now specify stop words using regular expressions.

In previous versions a stoplist could be specified as follows (in a single file, one line per word):

a
of
in

This causes a, of, and in to be treated as stop words (and not used in computing similarity). As of 0.08 you may continue to use the above format, or you can use regular expressions. For example:

/\b\w\b/
/\b\d+\b/

would cause all single-character words and numeric values to be removed.

You can get this new version via CPAN or sourceforge - find links to both at: http://text-similarity.sourceforge.net

Enjoy,
Ted and Ying

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
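One way entries like those above could be applied is sketched below. This is an illustration of the idea only, not the module's actual implementation (which reads the entries from a stoplist file): entries wrapped in slashes are compiled as regexes, and bare entries match as whole words.

```perl
#!/usr/bin/perl
# Sketch of mixed literal/regex stoplist handling, as described above.
use strict;
use warnings;

# Stoplist entries, one per "line": literal words or /regex/ patterns.
my @stoplist = ('a', 'of', 'in', '/\b\w\b/', '/\b\d+\b/');

# Compile each entry: /.../ becomes a regex, literals match exactly.
my @patterns = map {
    m{^/(.*)/$} ? qr/$1/ : do { my $w = quotemeta $_; qr/^$w$/ }
} @stoplist;

# Keep only the words that match none of the stop patterns.
sub remove_stopwords {
    my @words = @_;
    return grep { my $w = $_; !grep { $w =~ $_ } @patterns } @words;
}

my @kept = remove_stopwords(qw(a box of 42 nails in x crates));
print "@kept\n";   # box nails crates
```

Here "a", "of", and "in" are removed as literals, while "x" and "42" are caught by the two regex entries, matching the single-character and numeric examples in the announcement.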
From: Ted P. <dul...@gm...> - 2008-09-24 03:08:02
---------- Forwarded message ----------
From: CPAN Tester Report Server <do_...@cp...>
Date: Tue, Sep 23, 2008 at 10:47 PM
Subject: CPAN Testers Daily Report
To: Ted Pedersen <TPE...@cp...>

Dear Ted Pedersen,

CPAN Testers Notifications have changed. This mail now comes from a centralised server, and authors should no longer be receiving reports directly from testers. If you do receive reports, please ask the tester in question to update their version of Test-Reporter, which now disables the CCing to authors. Thanks.

Please find below the latest reports for your distributions, generated by CPAN Testers, from the last 24 hours. Currently only FAIL reports are listed, with only the first instance of a report for a distribution on a particular platform, using a specific version of Perl. As such you may find further similar reports at http://www.cpantesters.org.

Text-Similarity-0.06:
- MSWin32-x86-multi-thread / 5.10.0:
  - FAIL http://nntp.x.perl.org/group/perl.cpan.testers/2282033

This mail is generated by an automated system. If you do not wish to receive these mails, please contact Barbie <ba...@cp...> and request to be removed from the automatic mailings. If you have an issue with a particular report, or wish to gain further information from the tester, please use the 'Find A Tester' tool at http://stats.cpantesters.org/cpanmail.html, using the NNTP ID of the report to locate the correct email address.

Thanks,
The CPAN Testers
Reports: http://www.cpantesters.org

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: Ted P. <dul...@gm...> - 2008-04-11 15:55:46
Hi Sid,

This looks great, and should actually be very helpful for both Text-Similarity and SenseRelate, since both have compoundify operations. I think having a new release of WordNet-Similarity with this and the other changes you have in the cooker is a great idea.

I was thinking of making some small changes to the documentation in our /util programs and the web interface programs, mostly so that they look a little better on CPAN (that is, cleaning up the NAME entries, things like that). So I will tinker around with that this morning - I'm sure it won't be very substantial, nor will it take very much time - then perhaps we can release thereafter.

Thanks!
Ted

On Fri, Apr 11, 2008 at 4:43 AM, Siddharth Patwardhan <si...@cs...> wrote:
> Hi Ted,
>
> > Ah, very interesting. I didn't realize this was how things were
> > structured now, but it makes good sense. I think that compounds.pl
> > program is very neat, and having a getCompounds method would actually
> > be potentially very useful for users. I think it's a natural enough
> > question to ask - that is, what are the compounds in WordNet - so
> > having that as a part of a Tools package makes good sense to me.
> >
> > I think what Text::Similarity needs is probably independent of WordNet
> > - that is, it really just needs that string matching logic used in
> > compoundify - given a list of compounds, find them in a given text -
> > so in that case a getCompounds method would be very handy (if we
> > wanted to find WordNet compounds), or the user could provide their own
> > list from some other source and then match in about the same way. The
> > matching logic is already in Text-Similarity, and in fact it might
> > work as it is; I haven't looked at that too deeply as yet...
> >
> > So, anyway, I do think a getCompounds method in WordNet::Tools could
> > be very useful for those modules like Text-Similarity that might like
> > to go looking for WordNet compounds. Probably we wouldn't want to
> > build in a dependence on WordNet-Similarity though, so we'd just run
> > that once and then provide the compounds to Text-Similarity. Having
> > that list in a "Perl form" would be nice, as that would make it easy
> > to send into Text-Similarity...
>
> I just added a method getCompoundsList() to WordNet::Tools and committed
> it to CVS. A simple program that mimics compounds.pl, using this new
> method, will look like this:
>
> #!/usr/bin/perl
>
> use WordNet::QueryData;
> use WordNet::Tools;
>
> my $wn = WordNet::QueryData->new();
> die "Error: Unable to create WordNet::QueryData object.\n"
>     if(!defined($wn));
>
> my $wntools = WordNet::Tools->new($wn);
> die "Error: Unable to create WordNet::Tools object.\n"
>     if(!defined($wntools));
>
> my $arref = $wntools->getCompoundsList();
> die "Error: No list returned.\n" if(!defined($arref));
>
> foreach my $key (@{$arref})
> {
>     print "$key\n";
> }
>
> I guess this new method will become available with the next release of
> WordNet-Similarity, which can be pretty soon.
>
> Thanks.
>
> -- Sid.

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
From: Ted P. <dul...@gm...> - 2008-04-10 17:49:03
Hi Sid,

See comments below...

On Wed, Apr 9, 2008 at 11:19 PM, Siddharth Patwardhan <si...@cs...> wrote:
> Hi Ted,
>
> So... a getCompounds method could very easily be added to
> WordNet::Tools. Currently, the WordNet::Tools constructor (new())
> builds a list of compounds internally for use with compoundify.
> This is different from how we did things before -- i.e., first
> generate a compounds.txt file using compounds.pl, and then
> use the compounds.txt in compoundify. The compoundify in
> WordNet::Tools simply generates a list of compounds from WordNet
> at startup, and then uses this list as its list of compounds.
> A new getCompounds method in WordNet::Tools could simply return this
> internal list, if required.

Ah, very interesting. I didn't realize this was how things were structured now, but it makes good sense. I think that compounds.pl program is very neat, and having a getCompounds method would actually be potentially very useful for users. I think it's a natural enough question to ask - that is, what are the compounds in WordNet - so having that as a part of a Tools package makes good sense to me.

I think what Text::Similarity needs is probably independent of WordNet - that is, it really just needs that string matching logic used in compoundify - given a list of compounds, find them in a given text - so in that case a getCompounds method would be very handy (if we wanted to find WordNet compounds), or the user could provide their own list from some other source and then match in about the same way. The matching logic is already in Text-Similarity, and in fact it might work as it is; I haven't looked at that too deeply as yet...

So, anyway, I do think a getCompounds method in WordNet::Tools could be very useful for those modules like Text-Similarity that might like to go looking for WordNet compounds. Probably we wouldn't want to build in a dependence on WordNet-Similarity though, so we'd just run that once and then provide the compounds to Text-Similarity. Having that list in a "Perl form" would be nice, as that would make it easy to send into Text-Similarity...

> I just wanted to point out that the "hash-code" for the different
> versions of WordNet isn't really a standard. It was just something we
> (rather, Ben Haskel) came up with, to generate an identifier for
> WordNet, from the WordNet data files. We just run an SHA1 hash function
> over the WordNet data file names and their sizes, to get this unique
> identifier. But someone could easily come up with a different way to
> generate a WordNet version identifier. Also, if it so happens that two
> different WordNet versions have data files with the exact same sizes,
> then they would get the same identifier. So, this method is not perfect.
> But I think it works, in general, since different versions of WordNet
> are unlikely to have the exact same file sizes.

Thanks for clarifying this - I do think the SHA1 idea *should* provide unique identifiers, and in fact I think it might even be overly unique, in that a Windows 2.0 and a Unix 2.0 should have different values (I assume there must be some formatting differences that cause them to be rather different). But I actually think that is good, in that it would make it possible to identify the exact WordNet version being used. But I do agree, we'll want to be on the alert for a WordNet that somehow has the same SHA1 values as another version. It doesn't seem likely, unless WordNet were to release a version that differed only in respect to documentation and not the data files, but that doesn't seem to be their style.

> Anyway, if more and more people start using it, it could become a
> standard. But I guess for now maybe it would be better to refer to it
> as our internal WordNet version identifier, or something.

Agreed - best to make it clear we are the ones producing that hash, and not potentially confuse WordNet users who then expect that elsewhere.

Thanks!
Ted

> -- Sid.
>
> On Wed, 2008-04-09 at 21:39 -0500, Ted Pedersen wrote:
> > Hi Sid,
> >
> > Yes, I think WordNet::Tools is terrific... there is in fact a kind of
> > interesting issue there - compoundify could even be viewed as WordNet
> > independent - it really just needs a list of compounds from
> > somewhere... and I think there are possibly some issues like that with
> > the Freq.pl programs - not really compoundify issues, but
> > functionality that is primarily text based and doesn't need WordNet -
> > and indeed there is some redundancy between those programs.
> > Eliminating that redundancy has been on my list of things to do for
> > some time, and I think it would really be a nice enhancement to
> > things...
> >
> > Anyway, the reason I was thinking about compoundify in a WordNet
> > independent sense is that Text-Similarity wants to have a compounding
> > operation included, but it doesn't currently have one (or the one it
> > has doesn't seem to actually work...). So... I don't know if it would
> > make sense at all to think about a WordNet Tool that just provided a
> > list of compounds and then a separate Text::Compoundify module... That
> > actually almost feels like a QueryData method... getCompounds or
> > something... hmmm...
> >
> > As to other WordNet functionality, I just added some constants for my
> > hash values to refer to wordnet versions more conveniently - I was
> > kind of wishing that WordNet-QueryData would go ahead and do that
> > conversion so that we could get reliable values from version() again,
> > and in fact that's what confused me earlier today. I had thought that
> > was done but I don't think it was... so anyway... that does seem like
> > an operation that users (maybe developers) might end up doing -
> > figuring out a table of hash to wordnet version values...
> >
> > I wonder too, did we ever figure out if the hash values differ on
> > Windows? I suppose they must... so that's another possible point of
> > failure, but... well, one thing at a time. :)
> >
> > Otherwise, I think we've done a pretty good job of "exposing" the
> > functionality of WordNet-Similarity so that people can get at some of
> > the interesting functions (like finding hypernym trees, depths, etc.),
> > and I don't notice much duplication any more except, as you say, in
> > some of the /utils... but certainly worth thinking about, especially
> > as both SenseRelate and maybe even Text-Similarity start to grow up a
> > bit and make use of different sorts of functionality...
> >
> > Thanks!
> > Ted
> >
> > On Wed, Apr 9, 2008 at 9:17 PM, Siddharth Patwardhan <si...@cs...> wrote:
> > > > WordNet::Tools (a module included in WordNet::Similarity) is
> > > > something we will need to exploit in WordNet::SenseRelate - it
> > > > does two things that are important for us there, providing
> > > > reliable version information, and then doing compoundify. We do
> > > > compoundify in many different modules, but I think it makes sense
> > > > to centralize it in one place, and I think that place is
> > > > WordNet::Tools...
> > >
> > > Right. That was the motivation behind creating WordNet::Tools...
> > > centralizing some common functions. Compoundify was present in many
> > > different modules and programs *within* WordNet::Similarity itself.
> > > And we updated the code to make it faster (twice, I think). And each
> > > time we had to change all the different instances of the same
> > > function. So, we centralized it into WordNet::Tools.
> > >
> > > On that note, if you come across any other WordNet-specific function
> > > that can be centralized, you may want to consider putting it into
> > > WordNet::Tools. (Hmmm... now that I think about it, there is quite
> > > a bit of redundancy in the *Freq.pl programs... I wonder how much
> > > of that is WN-specific.)
> > >
> > > -- Sid.
> >
> > --
> > Ted Pedersen
> > http://www.d.umn.edu/~tpederse
> >
> > _______________________________________________
> > senserelate-developers mailing list
> > sen...@li...
> > https://lists.sourceforge.net/lists/listinfo/senserelate-developers

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
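The WordNet-independent compoundify discussed in this thread - given a list of compounds, find and join them in a text - can be sketched as a greedy longest-match pass. The compound list below is a hypothetical two-word sample (in practice it would come from something like getCompoundsList()); this is an illustration of the matching logic, not the code in WordNet::Tools:

```perl
#!/usr/bin/perl
# Sketch of a WordNet-independent compoundify: join word sequences that
# appear in a known-compound list, using greedy longest-match.
use strict;
use warnings;

my %compounds = map { $_ => 1 } qw(machine_gun goose_step water_ski);
my $max_len = 2;   # longest compound in this sample list is two words

sub compoundify {
    my @words = @_;
    my @out;
    my $i = 0;
    while ($i < @words) {
        my $matched = 0;
        # Try the longest candidate first, shrinking toward two words.
        for (my $len = $max_len; $len >= 2; $len--) {
            next if $i + $len > @words;
            my $cand = join '_', @words[$i .. $i + $len - 1];
            if ($compounds{$cand}) {
                push @out, $cand;
                $i += $len;
                $matched = 1;
                last;
            }
        }
        unless ($matched) { push @out, $words[$i]; $i++; }
    }
    return @out;
}

print join(' ', compoundify(qw(the machine gun and the goose step))), "\n";
# the machine_gun and the goose_step
```

Nothing here touches WordNet itself, which is the point made in the thread: the caller supplies the compound list, whether it comes from WordNet::Tools or from any other source.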
From: Ted P. <tpederse@d.umn.edu> - 2008-04-04 19:34:43
Hi Sid,

I just released version 0.05 of text-similarity, to address the issue of being able to compare strings in addition to files. The major change was adding a getSimilarityStrings method, which more or less turns getSimilarity into a file-processing front end to it. The string processing functionality was of course already in getSimilarity, it was just not really exposed because of the file input. So I split them apart, and now a user can input strings to getSimilarityStrings, or files to getSimilarity. I think WordNet-Similarity is unaffected by all this, and even if it were using getSimilarity, that functionality is still the same.

I also modified text_compare.pl to have a --string option so that a user can input strings from the command line.

So, I guess the plan to give greater visibility to text-similarity is working, although it does lead to more work. :) Let me know if you see anything amiss with this!

Thanks,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
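The split described above - getSimilarity as a file-reading front end to getSimilarityStrings - follows a common refactoring pattern. Here is a standalone sketch: the two function names come from the announcement, but the scoring function is a trivial stand-in (fraction of shared word tokens), not the module's actual overlap measure.

```perl
#!/usr/bin/perl
# Sketch of the getSimilarity / getSimilarityStrings split: the file
# interface just slurps its inputs and delegates to the string version.
use strict;
use warnings;

# Stand-in score: fraction of words in the first string that also occur
# in the second (illustration only, not the real similarity measure).
sub getSimilarityStrings {
    my ($s1, $s2) = @_;
    my %w2   = map { $_ => 1 } split ' ', $s2;
    my @w1   = split ' ', $s1;
    my $hits = grep { $w2{$_} } @w1;
    return @w1 ? $hits / @w1 : 0;
}

# File front end: slurp both files, then delegate to the string version.
sub getSimilarity {
    my ($f1, $f2) = @_;
    my @text;
    for my $f ($f1, $f2) {
        open my $fh, '<', $f or die "Cannot open $f: $!";
        local $/;   # slurp mode
        push @text, scalar <$fh>;
    }
    return getSimilarityStrings(@text);
}

# 2 of 6 words ("cat", "on") are shared.
print getSimilarityStrings('the cat sat on the mat',
                           'a cat lay on a rug'), "\n";
```

The design benefit is the one the announcement notes: the string logic exists in exactly one place, and the file interface keeps working unchanged for existing callers like WordNet-Similarity.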
From: Jason R M. <mich0212@d.umn.edu> - 2004-11-12 21:53:35
There are 44 compounds that are both nouns and verbs in WordNet. I think you're right that compounds are more likely to be nouns than verbs.

FYI, here are the 44 compounds:

split_up water_ski contra_danse freak_out roller_blade single_crochet ice_skate bar_mitzvah push_back deep_freeze gold_plate ski_jump single_stitch turn_around letter_bomb black_marketeer shell_stitch nolle_prosequi slam_dance get_together roller_skate scotch_tape test_drive write_up speed_skate cave_in black_market tap_dance strip_mine bat_mitzvah call_up machine_gun break_dance folk_dance square_dance double_crochet mop_up goose_step purl_stitch double_cross kick_up roll_in_the_hay belly_dance double_stitch

ted pedersen wrote:
> Hi Jason,
>
> This is really interesting. I hadn't thought of this before, but
> I see exactly what you are referring to.
>
> I'd suggest the following - my experience has been that compounds
> are very often nouns (not always, but more often than they are
> verbs). So if we can't do any better, I'd suggest assuming that
> a compound is a noun.
>
> Actually, now that I think of it - I wonder if there are many compounds
> that are both nouns and verbs (at least those known to WordNet)? I would
> doubt it. Would that be a useful fact?
>
> I'll think about this some more...
>
> Thanks!
> Ted
From: ted p. <tpederse@d.umn.edu> - 2004-11-12 21:37:35
Hi Jason,

This is really interesting. I hadn't thought of this before, but I see exactly what you are referring to.

I'd suggest the following - my experience has been that compounds are very often nouns (not always, but more often than they are verbs). So if we can't do any better, I'd suggest assuming that a compound is a noun.

Actually, now that I think of it - I wonder if there are many compounds that are both nouns and verbs (at least those known to WordNet)? I would doubt it. Would that be a useful fact?

I'll think about this some more...

Thanks!
Ted

On Fri, 12 Nov 2004, Jason Michelizzi wrote:
> I've come across a slight difficulty in working with compoundifying
> and converting POS tags from the Penn Treebank format to WN format.
> If we do compoundification on tagged words, it seems that we have to
> discard the POS tags. The problem is that there are compound words
> that belong to more than one part of speech, such as machine_gun and
> goose_step (both of them can be either nouns or verbs).
>
> So if we came across text such as "goose/NN step/NN" or "machine/NN
> gun/NN", we could only turn that into "goose_step" and "machine_gun",
> but not "goose_step#n" or "machine_gun#v". (The fact that step is
> tagged as a noun isn't much of a help; the Brill tagger always seems
> to tag it as a noun in the few experiments I tried, except when I had
> "stepped" or "stepping" instead.)
>
> Jason

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
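The ambiguity Jason describes can be shown in a few lines: once the Penn Treebank tags are stripped to form the compound, both WordNet parts of speech remain possible. The POS table below is hypothetical and hard-coded from the thread's examples, not a real WordNet lookup:

```perl
#!/usr/bin/perl
# Illustration of the POS ambiguity discussed above: joining tagged
# words forces us to discard the tags, and some compounds are both
# nouns and verbs, so the resulting POS is ambiguous.
use strict;
use warnings;

# Hypothetical parts of speech for two of the noun/verb compounds.
my %pos = (machine_gun => [qw(n v)], goose_step => [qw(n v)]);

# "goose/NN step/NN" -> strip the tags, join with underscore,
# then see which parts of speech remain possible.
my @tagged   = qw(goose/NN step/NN);
my $compound = join '_', map { (split '/')[0] } @tagged;
my @choices  = @{ $pos{$compound} || [] };
print "$compound: ", join('/', @choices), "\n";   # goose_step: n/v
```

As the thread suggests, one heuristic for breaking the tie would be to default to the noun reading, since compounds are more often nouns than verbs.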