From: Ted P. <tpederse@d.umn.edu> - 2007-10-28 16:21:27
I submitted your entire file and did get an error, and I think I understand the problem.

First, the problem isn't being caused by not having one sentence per line (and one line per sentence), although you really should do that in order to get better results. The problem turns out to be a ' that appears in your text.

WordNet-SenseRelate does not really use punctuation marks or function words in doing disambiguation, since it relies on WordNet words (and punctuation marks aren't in WordNet). So WordNet-SenseRelate essentially ignores or removes anything not found in WordNet. There are other examples of this: you'll notice in your tagged output that "of" and "the" are not assigned senses. That is because they are not content words. WordNet-SenseRelate will only assign senses to nouns, verbs, adjectives and adverbs that are known to WordNet.

For some reason the ' punctuation mark is sneaking past WordNet-SenseRelate and causing a problem, because WordNet doesn't include anything about '. WordNet-SenseRelate should in fact remove these kinds of punctuation marks, but it doesn't seem to handle this case. We'll fix that in future releases, although for the moment there is a pretty simple fix: put your POS-tagged output in one-sentence-per-line format, and then remove all of the punctuation marks before submitting it to WordNet-SenseRelate.

I hope this helps! Below you can see the error that I got when running on your original input file.

marimba(36): wsd.pl --context myfile --format tagged
Current configuration:
    context file  : myfile
    format        : tagged
    scheme        : normal
    tagged text   : yes
    measure       : WordNet::Similarity::lesk
    window        : 4
    contextScore  : 0
    pairScore     : 0
    measure config: (none)
    trace         : no
    forcepos      : no
    compound file : (none)
    stoplist      : (none)
Loading WordNet... done.
(valid_forms) Invalid part-of-speech: ' at /usr/local/lib/perl5/site_perl/5.8.5/WordNet/QueryData.pm line 887.
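The punctuation-removal step of that workaround could be scripted roughly as follows. This is only a sketch and not part of WordNet-SenseRelate. It assumes the input already has one sentence per line and that each token has the form word/TAG (for example sales/NNS); the separator, the function names strip_punctuation and clean_file, and the file names are all hypothetical and would need adjusting to the actual data.

```python
# Sketch only: assumes one sentence per line and tokens shaped like word/TAG.
SEP = "/"  # assumed separator between word and tag


def strip_punctuation(line, sep=SEP):
    """Return the line without tokens whose word part has no letter or digit."""
    kept = []
    for token in line.split():
        # Split on the last separator so words such as 1/2 stay intact.
        word = token.rsplit(sep, 1)[0]
        if any(ch.isalnum() for ch in word):
            kept.append(token)
    return " ".join(kept)


def clean_file(src, dst):
    """Copy src to dst line by line, dropping punctuation-only tokens."""
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            cleaned = strip_punctuation(line)
            if cleaned:  # skip lines that contained only punctuation
                fout.write(cleaned + "\n")


if __name__ == "__main__":
    sample = "Ad/NN sales/NNS boost/VBP profit/NN ./. ''/''"
    print(strip_punctuation(sample))
```

A call such as clean_file("myfile", "myfile.clean") would write the cleaned copy, which could then be passed to wsd.pl in place of the original. A token like 's/POS is kept by this sketch because it contains a letter; if it still triggers the error, it would have to be removed as well.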

On 10/28/07, Ted Pedersen <tpederse@d.umn.edu> wrote:
> I added a few more lines of your file (one line per sentence, one
> sentence per line) and the following is the output that I got...
>
> wsd.pl --context file --format tagged
> Current configuration:
>     context file  : file
>     format        : tagged
>     scheme        : normal
>     tagged text   : yes
>     measure       : WordNet::Similarity::lesk
>     window        : 4
>     contextScore  : 0
>     pairScore     : 0
>     measure config: (none)
>     trace         : no
>     forcepos      : no
>     compound file : (none)
>     stoplist      : (none)
> Loading WordNet... done.
>
> Ad#n#1 sale#n#5 boost#v#3 Time Warner#n#1 profit#n#1 Quarterly#r#2 profits#n#1 at US#n#1 medium#n#1 giant#n#6 TimeWarner#n jump#v#1 76#a#1 %#n to $#v 1#a#1 .
> 13bn#n ( =C2#n =A3#n 600m#n ) for the three#a#1 month#n#2 to December , from $#n 639m#n year#n#3 -#n earlier#r#2 .
> The firm#n#1 , which is now#r#3 one#a#1 of the biggest investor#n#1 in Google#n#1 , benefit#v#1 from sale#n#5 of high#a#2 -#n speed#n#1 internet#n#1 connection#n#9 and high#a#2 advert#n#1 sale#n#4 .
> TimeWarner#n say#v#8 fourth quarter#n#2 sale#n#5 rise#v#1 2#a#1 %#n to $#v 11#a#1 . 1bn#n from $#n 10#a#1 .
> 9bn#n .
> Its profits#n#1 were buoy#v#3 by one#a#1 -#n off#r#2 gain#n#3 which offset#n#1 a profit#n#2 dip#n#6 at Warner#n#2 Bros#n , and less user#n#1 for AOL#n .
> Time Warner#n#2 say#v#2 on Friday that it now#r#3 own#v#1 8#a#1 %#n of search#n#1 -#n engine#n#3 Google#n#1 .
> But its own#a#1 internet#n#1 business#n#2 , AOL#n , had has mix#v#6 fortune#n#4 . It lose#v#3 464#a , 000#a subscriber#n#2 in the fourth quarter#n#2 profits#n#1 were low#a#2 than in the precede#v#5 three#a#1 quarter#n#2 .
>
> On 10/28/07, Ted Pedersen <tpederse@d.umn.edu> wrote:
> > Your input file should have just one sentence per line. I don't know
> > if that explains the problem exactly or not, but when I ran with just
> > one sentence on the first line, I got the output as shown below:
> >
> > marimba(6): wsd.pl --context file --format tagged
> > Current configuration:
> >     context file  : file
> >     format        : tagged
> >     scheme        : normal
> >     tagged text   : yes
> >     measure       : WordNet::Similarity::lesk
> >     window        : 4
> >     contextScore  : 0
> >     pairScore     : 0
> >     measure config: (none)
> >     trace         : no
> >     forcepos      : no
> >     compound file : (none)
> >     stoplist      : (none)
> > Loading WordNet... done.
> > Ad#n#1 sale#n#5 boost#v#3 Time Warner#n#1 profit#n#1 Quarterly#r#2 profits#n#1 at US#n#1 medium#n#1 giant#n#6 TimeWarner#n jump#v#1 76#a#1 %#n to $#v 1#a#1 .
> >
> > On 10/27/07, wael gomaa <drw...@ya...> wrote:
> > >
> > > I used the Brill Tagger to tag my corpora using NLTK.
> > > When I typed this command to perform WSD:
> > >     wsd.pl -context myfile.txt -format tagged
> > > I got this error: [Invalid Part Of Speech in QueryData.pm at line 887]
> > > I know that the wsd modules use the Penn Treebank tagset. Is there a
> > > difference between the Brill tagset and the Penn Treebank tagset? If
> > > yes, how can I convert from Brill to Treebank to support the wsd module?
> > >
> > > 0001.txt Example of my tagged data is attached with my message.

-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse