From: Ted P. <tpederse@d.umn.edu> - 2011-11-17 03:55:33
Hi Wesley,

You are right - wsd.pl does not handle e.g. terribly well! We ran into
this issue with i.e. and inserted a special case into the wsd.pl code,
which you can see below. wsd.pl does some very basic text cleaning in
which punctuation marks are removed, except in those cases where it is
important to WordNet to preserve them (like i.e.):

    sub cleanLine
    {
        my $line = shift;
        chomp($line);
        my @words=split(/ +/,$line);
        foreach my $word (@words){
            next if($word eq "i.e." || $word eq "ie." || $word eq "et_al." || $word eq "al.");
            $word =~ s/([A-Z])/\L$1/g;
            if ($word =~ m/_/){
                $word =~ s/[.|!|?|,|;]+$/ /;
            }
            else{
                $word =~ s/[^$OK_CHARS]/ /g;
            }
        }
        return join (' ', @words);
    }

This should be expanded to include e.g. and perhaps other
abbreviations, although I don't think I'll be able to do that really
quickly (so you might want to modify it yourself if you are familiar
with Perl - if not, we can try to expedite things a bit...so let us
know).

Thanks for pointing this out!

Cordially,
Ted

On Wed, Nov 16, 2011 at 7:52 PM, Wesley May <wj...@gm...> wrote:
> Ah, upon further review the mistake is mine, never mind :D
>
> Though there is another thing (the opposite problem perhaps!). Words
> like "e.g." get split into two words, "e" and "g". Is there a...
> nouncompoundify? :)
>
> Thanks!
>
> On Wed, Nov 16, 2011 at 8:40 PM, Ted Pedersen <tpederse@d.umn.edu> wrote:
>> Hi Wesley,
>>
>> Actually that is what --nocompoundify is *supposed* to be doing -
>> could you send me the command you are running and the output you are
>> getting? Then I can investigate a bit further.
>>
>> Thanks!
>> Ted
>>
>> On Wed, Nov 16, 2011 at 4:43 PM, Wesley May <wj...@gm...> wrote:
>>> Hi Ted,
>>>
>>> Is there a way to disable making WordNet compounds in
>>> SenseRelate-AllWords? For instance, if I have the (stop-word removed)
>>> sentence:
>>>
>>> "tires rattling wheels now roll off another day was valley"
>>>
>>> ...then SenseRelate tries to form the compound "roll_off". I thought
>>> that the --nocompoundify was for this, but I guess I'm wrong because
>>> it doesn't seem to stop that.
>>>
>>> Thanks!
>>> Wesley May
>>>
>>>
>>> On Tue, Nov 15, 2011 at 5:50 PM, Wesley May <wj...@gm...> wrote:
>>>> Looks good, thanks very much! I'll let you know if I have any questions :)
>>>>
>>>> Wesley May
>>>>
>>>> On Sun, Nov 13, 2011 at 6:05 PM, Ted Pedersen <tpederse@d.umn.edu> wrote:
>>>>> Hi Wesley,
>>>>>
>>>>> I think in 2007 there were two systems that we'd call unsupervised.
>>>>> One was knowledge based (WordNet::SenseRelate::AllWords) and the other
>>>>> was a clustering approach (SenseClusters). You can find both of those
>>>>> here:
>>>>>
>>>>> http://senserelate.sourceforge.net
>>>>> http://senseclusters.sourceforge.net
>>>>>
>>>>> The 2007 systems used these pretty much out of the box, so it's simply
>>>>> a matter of setting the command line parameters appropriately, which I
>>>>> hope is documented in our system description papers (which you can
>>>>> find on my publications page).
>>>>>
>>>>> But, if you have any questions about any of this, please don't
>>>>> hesitate to let me know.
>>>>>
>>>>> Good luck!
>>>>> Ted
>>>>>
>>>>> On Sun, Nov 13, 2011 at 3:40 PM, Wesley May <we...@cs...> wrote:
>>>>>> Hi Dr. Pedersen,
>>>>>>
>>>>>> I'm a grad student at the University of Toronto, working with Suzanne
>>>>>> Stevenson, and I'm looking for a good unsupervised, general-purpose
>>>>>> WSD algorithm.
>>>>>> Do you happen to have code available for your SemEval 2007 submission?
>>>>>>
>>>>>> Thanks very much!
>>>>>> Wesley May
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Ted Pedersen
>>>>> http://www.d.umn.edu/~tpederse
>>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Ted Pedersen
>> http://www.d.umn.edu/~tpederse
>>
>

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
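
For anyone who wants to try the change suggested in the message above, here is a minimal sketch of one way the special case might be extended to cover e.g. It is not the actual wsd.pl code: the name clean_line, the %KEEP lookup table, and the "eg." entry are illustrative assumptions, and the $OK_CHARS value is only a stand-in, since the real one is defined elsewhere in wsd.pl and is not shown in this thread.

```perl
use strict;
use warnings;

# ASSUMPTION: stand-in value; the real $OK_CHARS is defined in wsd.pl
# and is not shown in this thread.
my $OK_CHARS = "-a-z0-9_'";

# Tokens that skip cleaning entirely. The first four come from the
# quoted cleanLine; "e.g." and "eg." are the hypothetical additions.
my %KEEP = map { $_ => 1 } qw(i.e. ie. et_al. al. e.g. eg.);

sub clean_line {
    my ($line) = @_;
    chomp $line;
    my @words = split / +/, $line;
    foreach my $word (@words) {
        # Protected abbreviations pass through untouched.
        next if $KEEP{$word};
        $word = lc $word;
        if ($word =~ /_/) {
            # Compounds: only trailing punctuation is replaced.
            $word =~ s/[.!?,;]+$/ /;
        }
        else {
            # Everything else: any character outside $OK_CHARS becomes a space.
            $word =~ s/[^$OK_CHARS]/ /g;
        }
    }
    return join ' ', @words;
}

print clean_line("Fruit, e.g. apples"), "\n";
```

With the table in place, "e.g." survives as a single token instead of being split into "e" and "g". The lookup is case-sensitive and runs before lowercasing, as in the quoted code, so a sentence-initial "E.g." would still be cleaned unless it is added to the table as well.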