RE: [Senserelate-users] format of senseval-2 data (fwd)


Hi Gil,

Here are a few notes that Satanjeev Banerjee and I pulled together
regarding your questions.

> I want to create my own xml file to run on selected target words. Is the
> XML file the only way to feed input into disamb.pl?

Yes, disamb.pl expects input in the Senseval-2 format.

> If so, then I need to know how to create my own SENSEVAL-2 format file.
> For example, I don't understand how to create the "answer instance",
> "senseid", and "docsrc" values. In the sample below, the "instance id" is
> "art.40004". Where did the 40004 come from? And what about the docsrc
> "bnc_a6u_637"?

> <answer instance="art.40004" senseid="art~1:04:00::"/>

This line means that the human annotator decided that the sense of the word
within the <head>...</head> tags in the instance with id "art.40004" (in
this case, the word "art") was closest in meaning to the concept represented
by the WordNet synset "art~1:04:00::". The senseid value is a mnemonic, and
I think there is a file somewhere that maps mnemonics to the more usable
word#pos#sense format. Disamb.pl, I think, only runs with the
word#pos#sense format. The "docsrc" field in the instance refers to the
document source the instance was taken from (for example, bnc stands for
the British National Corpus). Disamb.pl ignores the docsrc, so you can
safely drop it when you construct your xml file. Similarly, the 40004 is
just an id that made sense to the makers of the data. As long as all the
instance ids in a given xml file are different, I think disamb.pl doesn't
care what they are.
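
To make the above concrete, here is a rough sketch (in Python, purely for
illustration) of how one might generate such a file. The <corpus>, <lexelt>
and <context> element names are my recollection of the Senseval-2 lexical
sample layout, and build_senseval2 is a made-up helper, so please check the
output against one of the sample files before relying on it. It leaves out
docsrc, numbers the instance ids so they are unique, and omits the <answer>
tag since you have no human annotation; whether disamb.pl insists on an
<answer> tag is something I have not verified.

```python
import xml.etree.ElementTree as ET


def build_senseval2(lexelt_item, instances):
    """Build a Senseval-2 style XML string (hypothetical helper).

    lexelt_item: e.g. "art.n" (assumed "word.pos" naming).
    instances:   list of (left_text, head_word, right_text) tuples.
    """
    corpus = ET.Element("corpus", {"lang": "english"})
    lexelt = ET.SubElement(corpus, "lexelt", {"item": lexelt_item})
    word = lexelt_item.split(".")[0]
    for n, (left, head, right) in enumerate(instances, start=1):
        # The ids only need to be distinct within the file.
        inst = ET.SubElement(lexelt, "instance", {"id": f"{word}.{n}"})
        context = ET.SubElement(inst, "context")
        context.text = left
        # The target word goes inside <head>...</head>.
        head_el = ET.SubElement(context, "head")
        head_el.text = head
        head_el.tail = right
    return ET.tostring(corpus, encoding="unicode")


xml_text = build_senseval2(
    "olympics.n",
    [
        ("The ", "olympics", " begin next week."),
        ("She watched the ", "olympics", " on television."),
    ],
)
print(xml_text)
```

Each tuple becomes one instance, with the middle element wrapped in the
<head> tags.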

> I'm interested in ultimately taking a keyword phrase such as "olympics" or
> "olympic runner" and comparing to N web pages to see which page is most
> relevant to my keyword.

SenseRelate can do this, although I wonder if it might not be easier to
use WordNet::Similarity directly, available from CPAN at

http://search.cpan.org/dist/WordNet-Similarity

Another approach would be to simply take the content of a web page, and
filter it such that you have only "content" words (ignore common stop
words, etc.) and then run WordNet::Similarity to measure the relatedness
between your query term and the content of the web page.
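
As a very rough sketch of that second approach (again in Python, just to
show the shape of it): the stop list here is a tiny made-up one, the
function names are hypothetical, and the relatedness function is passed in
as a parameter because the real scores would come from WordNet::Similarity,
which is a Perl module. The exact_match stand-in exists only so that the
sketch runs on its own, and averaging the scores is just one of several
reasonable ways to combine them.

```python
import re

# A tiny illustrative stop list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"}


def content_words(text):
    """Lowercase the text and keep only non-stop-word tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def page_score(query, page_text, relatedness):
    """Average relatedness between the query term and each content word."""
    words = content_words(page_text)
    if not words:
        return 0.0
    return sum(relatedness(query, w) for w in words) / len(words)


def rank_pages(query, pages, relatedness):
    """Return page names ordered from most to least related to the query."""
    scores = {name: page_score(query, text, relatedness)
              for name, text in pages.items()}
    return sorted(pages, key=lambda name: scores[name], reverse=True)


def exact_match(a, b):
    """Stand-in for a real relatedness measure (assumption, not WordNet)."""
    return 1.0 if a == b else 0.0


pages = {
    "sports": "The olympics and the olympics runner",
    "cooking": "A recipe for the soup",
}
print(rank_pages("olympics", pages, exact_match))
```

A two-word query such as "olympic runner" would need one more step, for
example scoring each query word separately and combining the results.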

SenseRelate would clearly do something very much like this, but it has
lots of other features that are more specific to the disambiguation
problem, rather than simply determining if a page is related to a
particular query term. It might be more work to "retrofit" SenseRelate to
this task than it would be to create something with WordNet::Similarity.
(Note that WordNet::Similarity is used in the background by SenseRelate.)

I hope this is of some help. Let us know if additional questions arise, or
if you want to pursue any of the above!

Cordially,
Ted (with lots of help from Bano!)