From: ted p. <tpederse@d.umn.edu> - 2005-03-14 12:50:36
We are pleased to announce the release of WordNet-SenseRelate version 0.03. This is an all-words word sense disambiguation program that will assign WordNet senses to every word in a text that is known to WordNet. It is based on the measures of semantic similarity and relatedness found in WordNet::Similarity, and assigns to each word the sense that is most related to the senses of its neighbors.

Our SenseRelate page has download and general info:

http://senserelate.sourceforge.net

...and you can download directly from CPAN:

http://search.cpan.org/dist/WordNet-SenseRelate/

or Sourceforge:

http://sourceforge.net/projects/senserelate

There are a number of significant changes in this version; perhaps the most interesting is that we now accept three different forms of input text. Examples of each are shown below, along with demonstrations of a few other options. In our humble opinion, there is a surprising richness to what seems like a simple program, so please do experiment with it a bit, and feel free to ask questions or make requests for additional features, etc.

BTW, in this version we provide scripts that convert SemCor and the Senseval-2 and Senseval-3 all-words data into a form that our program can use. These scripts support the creation of answer files that can be scored by the standard Senseval scoring program.

=========================================================================

This version allows for three different types of input. Examples of each type are shown below. There are two required parameters to the wsd.pl program, --context and --format. The rest are all optional. Only a few of the options are shown below; please make sure to check the documentation for information on the rest.

1) raw - plain text, where simple tokenization and sentence boundary detection will be carried out on your behalf prior to sense tagging.

Your input file input.txt:

Red cars are faster than white cars. However, white cars are less expensive.
You run:

wsd.pl --context input.txt --window 1 --format raw --outfile output.txt

Your output to STDOUT:

Red#n#1 car#n#1 be#v#1 fast#a#1 than white#n#1 car#n#1 However#r#3 white#n#1 car#n#1 be#v#1 less#a#1 expensive#a#1

Your output to file output.txt:

Red Red n 1
cars car n 1
are be v 1
faster fast a 1
than than
white white n 1
cars car n 1
However However r 3
white white n 1
cars car n 1
are be v 1
less less a 1
expensive expensive a 1

2) parsed - this is plain text that you have tokenized and done sentence boundary detection on, such that you have one sentence per line, one line per sentence. (Please note that this doesn't mean parsed in the sense of having tree structures, etc.)

Your input file input.txt:

Red cars are faster than white cars
However white cars are less expensive

You run:

wsd.pl --context input.txt --window 1 --format parsed --outfile output.txt

I won't show the output again; it's the same as above. But the idea here is that if you have particular tokenization requirements, or if you can do sentence boundary detection really well, you may want to control that yourself, rather than letting us do it as in the raw format.

3) tagged - part of speech tagged text, formatted as output from the Brill Tagger would be formatted.
That means one sentence per line, one line per sentence, and part of speech tags that look like this:

Your input file input.txt:

Red/JJ cars/NNS are/VBP faster/RBR than/IN white/JJ cars/NNS
However/RB white/JJ cars/NNS are/VBP less/RBR expensive/JJ

You run:

wsd.pl --context input.txt --window 1 --format tagged --outfile output.txt

Your output to STDOUT:

Red#a#3 car#n#1 be#v#1 faster#r#1 than white#a#1 car#n#2 However#r#5 white#a#1 car#n#1 be#v#1 less#r#1 expensive#a#1

Your output to file output.txt:

Red/JJ Red a 3
cars/NNS car n 1
are/VBP be v 1
faster/RBR faster r 1
than/IN than
white/JJ white a 1
cars/NNS car n 2
However/RB However r 5
white/JJ white a 1
cars/NNS car n 1
are/VBP be v 1
less/RBR less r 1
expensive/JJ expensive a 1

There are a few things that we didn't show here that are probably really important, in particular --compounds and --stoplist. You will most likely want to specify a list of compounds that are known to WordNet, and perhaps some that are not. wsd.pl will identify these compounds in a text and disambiguate them as a unit; for example, it will treat Winston Churchill as a single word if you specify the compound winston_churchill in the --compounds file. (WordNet does have a sense for Winston Churchill.) You can get a complete list of all the WordNet compounds here:

http://www.d.umn.edu/~tpederse/wordnet.html

or in WordNet::Similarity.

You may also want to specify compounds that are not known to WordNet, so that you at least ignore these and don't do silly things, like assigning a sense to "Tom" and "Cruise" separately when encountering "Tom Cruise". If you specify the compound tom_cruise then this will be treated as a single word and not assigned a sense (since it is not known to WordNet).

So, let's suppose we are disambiguating:

Tom Cruise is a great man and a fine actor.

Well, if we run without the stop list or compounds, as in...

wsd.pl --context sr1.txt --window 1 --format raw

we get something like this...
Tom#n#1 Cruise#v#3 be#v#1 a#n#5 great#a#4 man#n#5 and a#n#5 fine#a#4 actor#n#1

Now, this has nothing to do with Tom Cruise the actor, and note that we have also sense tagged "a" as a noun, which seems unlikely. So we should have used a --stoplist and a --compounds file.

Your compounds file compounds.txt:

tom_cruise

and your stop list file stop.txt:

/\ba\b/

Note that the stoplists follow the NSP convention, which allows each stop word to be specified by a regex. More details are available here:

http://www.d.umn.edu/~tpederse/Code/Readme.nsp-v0.67.html#5.6._stopping_the_ngrams:

Then you run...

wsd.pl --context sr1.txt --window 1 --format raw --stoplist stop.txt --compounds compounds.txt

and you get the following, which tags neither Tom Cruise nor a:

tom_cruise be#v#8 a great#a#5 man#n#5 and a fine#v#1 actor#n#2

Anyway, there are lots of wrinkles like this. Give 0.03 a try, it's pretty fun. :)

Enjoy,
Ted and Jason

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
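P.S. For the curious, here is a toy sketch of the window-based sense assignment idea described above. This is illustrative Python only (the actual system is Perl and uses real WordNet and WordNet::Similarity scores); the sense inventory and relatedness values below are made up for the example.

```python
# Toy stand-ins for WordNet senses and WordNet::Similarity scores.
senses = {
    "red":  ["red#a#1", "red#n#1"],
    "car":  ["car#n#1", "car#n#2"],
    "fast": ["fast#a#1", "fast#r#1"],
}

# Made-up relatedness scores between sense pairs (symmetric lookup).
related = {
    frozenset(["red#a#1", "car#n#1"]):  0.9,
    frozenset(["car#n#1", "fast#a#1"]): 0.8,
    frozenset(["red#a#1", "fast#a#1"]): 0.5,
}

def relatedness(s1, s2):
    """Look up a symmetric relatedness score; unknown pairs score 0."""
    return related.get(frozenset([s1, s2]), 0.0)

def disambiguate(words, window=1):
    """For each word, pick the sense most related to the senses of
    its neighbors within +/- `window` positions; words with no known
    senses are passed through untagged."""
    result = []
    for i, word in enumerate(words):
        neighbors = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        best, best_score = None, -1.0
        for sense in senses.get(word, []):
            # Score a candidate sense by the best relatedness it
            # achieves against any sense of each neighboring word.
            score = sum(
                max((relatedness(sense, ns) for ns in senses.get(n, [])),
                    default=0.0)
                for n in neighbors
            )
            if score > best_score:
                best, best_score = sense, score
        result.append(best if best is not None else word)
    return result
```

With --window 1, each word's candidate senses are scored against the senses of its immediate neighbors and the best-scoring sense wins; that is the same idea wsd.pl applies at full scale, with the stoplist and compound handling layered on top.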