From: ted p. <tpederse@d.umn.edu> - 2004-07-16 00:44:17
Hi Gil,

Here are a few notes that Satanjeev Banerjee and I pulled together regarding your questions.

> I want to create my own xml file to run on selected target words. Is the
> XML file the only way to feed input into disamb.pl?

Yes, disamb.pl expects input in the Senseval-2 format.

> If so, then I need to know how to create my own SENSEVAL-2 format file.
> For example, I don't understand how to create the "answer instance",
> "senseid", and "docsrc" values. In the sample below, the "instance id" is
> "art.40004". Where did the 40004 come from? And what about the docsrc
> "bnc_a6u_637"?
> <answer instance="art.40004" senseid="art~1:04:00::"/>

This line means that the human annotator decided that the sense of the word within the <head>...</head> tags in the instance with id "art.40004" (in this case, the word "art") was closest in meaning to the concept represented by the WordNet synset "art~1:04:00::". The senseid value is a mnemonic, and I think there is a file somewhere that maps mnemonics to the more usable word#pos#sense format. I think disamb.pl only runs with the word#pos#sense format.

The "docsrc" field in the instance refers to the document source the instance was taken from (for example, bnc stands for the British National Corpus). Disamb.pl ignores the docsrc, so you can safely drop it when you construct your XML file. Similarly, the 40004 is some id that made sense to the makers of the data. In fact, as long as all the instance ids in a given XML file are distinct, I think disamb.pl doesn't care what they are.

> I'm interested in ultimately taking a keyword phrase such as "olympics" or
> "olympic runner" and comparing to N web pages to see which page is most
> relevant to my keyword.
SenseRelate can do this, although I wonder if it might not be easier to use WordNet::Similarity directly, available via CPAN from http://search.cpan.org/dist/WordNet-Similarity

Another approach would be to simply take the content of a web page, filter it so that only "content" words remain (ignoring common stop words, etc.), and then run WordNet::Similarity to measure the relatedness between your query term and the content of the web page. SenseRelate would clearly do something very much like this, but it has lots of other features that are more specific to the disambiguation problem than to simply determining whether a page is related to a particular query term. It might be more work to "retrofit" SenseRelate to this task than it would be to create something with WordNet::Similarity. (Note that WordNet::Similarity is used in the background by SenseRelate.)

I hope this is of some help. Let us know if additional questions arise, or if you want to pursue any of the above!

Cordially,
Ted (with lots of help from Bano!)
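
P.S. The filter-then-measure idea above could be sketched roughly as follows. This is only an illustration under stated assumptions, not how SenseRelate or WordNet::Similarity actually work: it is written in Python rather than Perl, every name in it (STOP_WORDS, content_words, page_score, rank_pages, exact_match) is hypothetical, the stop list is a tiny sample, and the scoring rule (average, over the query words, of the best relatedness to any content word on the page) is just one plausible choice. The exact_match function is a placeholder for a real relatedness measure such as one provided by WordNet::Similarity.

```python
import re
from typing import Callable, Dict, List, Tuple

# Tiny illustrative stop list (an assumption); a real one would be much longer.
STOP_WORDS = frozenset({
    "a", "an", "and", "are", "as", "at", "be", "by", "for", "from", "in",
    "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with",
})


def content_words(text: str) -> List[str]:
    """Lowercase the text, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def page_score(query: str, page_text: str,
               relatedness: Callable[[str, str], float]) -> float:
    """Average, over query words, of the best relatedness to any page word."""
    query_words = content_words(query)
    page_words = set(content_words(page_text))
    if not query_words or not page_words:
        return 0.0
    best = [max(relatedness(q, w) for w in page_words) for q in query_words]
    return sum(best) / len(query_words)


def rank_pages(query: str, pages: Dict[str, str],
               relatedness: Callable[[str, str], float]) -> List[Tuple[str, float]]:
    """Return (page name, score) pairs, most related page first."""
    scored = [(name, page_score(query, text, relatedness))
              for name, text in pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def exact_match(a: str, b: str) -> float:
    """Placeholder measure (hypothetical): 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0
```

To try it, one might call rank_pages("olympic runner", pages, exact_match) with pages being a dictionary that maps page names to their text, and then swap exact_match for a function that wraps a real relatedness measure.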