(Mike Atkinson sent this message, with the code, and I'm posting it here so that I can respond here).
Jason,
I have attached a zip file of the MWE code plus data. Much of the data
was extracted from WordNet, but the FixedLexical data comes from perusing
various dictionaries for foreign phrases (and is incomplete).
The code & data go in package opennlp.grok.preprocess.
I have been using the pipeline:
private String[] ppLinks = {
    "opennlp.grok.preprocess.sentdetect.EnglishSentenceDetectorME",
    "opennlp.grok.preprocess.tokenize.EnglishTokenizerME",
    "opennlp.grok.preprocess.mwe.EnglishVariableLexicalMWE"
};
I've added simple variable MWE expressions, with noun MWEs having a
variable last word (to handle plurals) and verb MWEs having a variable
first word (to handle tenses).
This will probably not work in every case, but the vast majority will
work correctly. It slightly over-generates: some misspellings will
become part of an MWE.
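As an illustration of the variable-word idea described above, here is a minimal sketch. It is not the code from the zip file: the class name, the tiny lexicons, and the suffix-stripping rule are all assumptions made for the example, and real morphology (irregular plurals, irregular verbs) is not handled.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class VariableMweSketch {
    // Hypothetical lexicon entries, stored in lowercase base form.
    static final Set<String> NOUN_MWES = new HashSet<>(Arrays.asList("hot dog", "ice cream"));
    static final Set<String> VERB_MWES = new HashSet<>(Arrays.asList("look up", "give in"));

    // Crude suffix stripping; a stand-in for real morphological analysis.
    static String stripSuffix(String word, String... suffixes) {
        for (String s : suffixes) {
            if (word.length() > s.length() + 1 && word.endsWith(s)) {
                return word.substring(0, word.length() - s.length());
            }
        }
        return word;
    }

    // Noun MWEs: only the LAST word may vary (plural "s").
    static boolean matchesNoun(List<String> words) {
        if (words.size() < 2) return false;
        int last = words.size() - 1;
        String prefix = String.join(" ", words.subList(0, last));
        String base = stripSuffix(words.get(last), "s");
        return NOUN_MWES.contains(prefix + " " + words.get(last))
            || NOUN_MWES.contains(prefix + " " + base);
    }

    // Verb MWEs: only the FIRST word may vary (tense endings).
    static boolean matchesVerb(List<String> words) {
        if (words.size() < 2) return false;
        String rest = String.join(" ", words.subList(1, words.size()));
        String base = stripSuffix(words.get(0), "ing", "ed", "s");
        return VERB_MWES.contains(words.get(0) + " " + rest)
            || VERB_MWES.contains(base + " " + rest);
    }

    public static void main(String[] args) {
        System.out.println(matchesNoun(Arrays.asList("hot", "dogs")));
        System.out.println(matchesVerb(Arrays.asList("looked", "up")));
        System.out.println(matchesVerb(Arrays.asList("cook", "up")));
    }
}
```

Input words are assumed to be lowercased already. A looser rule for the variable word (for example, prefix matching) would catch more inflections but would also over-generate more.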
I've tried to get the POSTagger to work, but it fails loading the data.
Having now tried this I think that the NLPDocument should have
<s>
<t type="mwe">
<w cat="N" pos="adv,adj">ad hoc</w>
</t>
</s>
at least for the FixedLexical MWE (foreign words), as the cat-tagger and
pos-tagger probably would not work correctly.
--
Mike Atkinson
I've tried out the package briefly --- everything compiles and it works fine in a pipeline, so I've committed it to the head of Grok cvs. Thanks! Though I haven't had a chance to grok your code in full, I'm impressed by what I saw, and will go ahead and add you as a Grok developer so that you can work on the package directly.
The reason the POSTagger failed to load the data is that the maxent model is stored in the grok/src/java/opennlp/grok/preprocess/postag/data directory, and you need to be sure that is in your classpath. I've added a script called "runpipe" in grok/samples/pipe that will make sure the classpath stuff works out fine (as long as you have the environment variable GROK_HOME set to where you have installed Grok). Do a cvs update and try it.
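A quick way to check a classpath problem like this is to ask the class loader whether it can see a given resource. The sketch below is a generic diagnostic, not part of Grok; the class name is made up for the example, and the resource name to pass in (the model file, relative to a classpath root) is left to the reader because it is not given in this thread.

```java
import java.net.URL;

public class ClasspathCheck {
    /** Returns the URL of a resource visible to the system class loader, or null if it is not found. */
    static URL locate(String resourceName) {
        return ClassLoader.getSystemResource(resourceName);
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java ClasspathCheck <resource-name>");
            return;
        }
        URL url = locate(args[0]);
        System.out.println(url == null
            ? args[0] + " is NOT visible on the classpath"
            : args[0] + " found at " + url);
    }
}
```

Running it with the same classpath as the pipeline shows whether the model data can be found before the tagger tries to load it.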
Like I said, I haven't had time to look at everything in great detail, but here is part of the output of running the opennlp.grok.preprocess.mwe.EnglishFixedLexicalMWE as part of a pipeline (the code is in SimplePipe.java of the pipe sample):
<t>
<w pos="RB">not</w>
</t>
<t>
<w pos="VB">be</w>
</t>
<t type="mwe">
<w pos="JJ">ad</w>
<w>hoc</w>
</t>
<t>
<w pos="CC">and</w>
</t>
Notice that the POS tagger is only expecting single word tokens... oops! Would you be interested in updating the process(NLPDocument doc) method of POSTaggerME to handle MWEs?
Great stuff!
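To make the single-word assumption concrete, here is a small sketch of treating each <t> token as one tagging unit, joining the <w> children of an MWE token. It uses the standard Java DOM API on an XML string rather than the NLPDocument API, whose methods are not shown in this thread; the class and method names are invented for the example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class MweTokenSketch {
    /** Returns one string per <t> token; the <w> children of a token are joined with spaces. */
    static List<String> taggingUnits(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> units = new ArrayList<>();
            NodeList tokens = doc.getElementsByTagName("t");
            for (int i = 0; i < tokens.getLength(); i++) {
                Element token = (Element) tokens.item(i);
                NodeList words = token.getElementsByTagName("w");
                List<String> parts = new ArrayList<>();
                for (int j = 0; j < words.getLength(); j++) {
                    parts.add(words.item(j).getTextContent().trim());
                }
                // An MWE token with several <w> children becomes a single unit.
                units.add(String.join(" ", parts));
            }
            return units;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse token XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<s><t><w>be</w></t>"
            + "<t type=\"mwe\"><w>ad</w><w>hoc</w></t>"
            + "<t><w>and</w></t></s>";
        System.out.println(taggingUnits(xml)); // prints [be, ad hoc, and]
    }
}
```

A tagger working on these units would assign one tag to "ad hoc" as a whole instead of tagging "ad" and leaving "hoc" untagged.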
Am I right in thinking you are using this POS tag set?
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential `there'
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural
14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO `to'
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non-3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh-pronoun
36. WRB Wh-adverb
37. " Simple double quote
38. $ Dollar sign
39. # Pound sign
40. ` Left single quote
41. ' Right single quote
42. `` Left double quote
43. '' Right double quote
44. ( Left parenthesis (round, square, curly or angle bracket)
45. ) Right parenthesis (round, square, curly or angle bracket)
46. , Comma
47. . Sentence-final punctuation
48. : Mid-sentence punctuation
The description of the tagset the tagger uses can be found here:
ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz
I notice that the POS tagger quite often gets the POS wrong, as many words have multiple valid POS in the local context.
WordNet gives three meanings for "ad hoc".
adj "an ad hoc committee meeting"
adj "a coordinated policy instead of ad hoc decisions"
adv "they were appointed ad hoc"
My version of the POS tagger gives them NN, JJ, NN, respectively. Ideally, we would want the POS tagger to always get it right, but I doubt it's possible due to the many quirks of English and the limited training data to provide context.
Using the WordNet data I could POS tag the MWE as
<t type="mwe">
<w pos="JJ,RB">ad hoc</w>
</t>
and then change the logic in the POS tagger so that it takes into account previous pos attributes. Alternatives would be
<t type="mwe">
<w pos="JJ,RB">ad</w>
<w pos="JJ,RB">hoc</w>
</t>
or
<t type="mwe">
<a>
<w pos="JJ">ad hoc</w>
<w pos="RB">ad hoc</w>
</a>
</t>
where "a" is a new tag introduced to show alternative readings. "a" may be useful in other circumstances, for instance if the sentence structure is not clear.
<p>
<a>
<s> ... </s>
<s> ... </s>
</a>
</p>
Alternatives using "a" could lead to a combinatorial explosion if not used sparingly, but they would capture constraints between different readings, which span tagging (as used by Gate) does not.
It would also be possible, with a bit of work, to add relative frequencies, extracted from the WordNet data.
<t type="mwe">
<w pos="JJ,RB" freq="0.6,0.4">ad hoc</w>
</t>
How about:
<t type="mwe" pos="JJ RB">
<w>ad</w>
<w>hoc</w>
</t>
The semantics of multiple tags in the pos attribute of a token is then that they form a ranked list of likely tags for the MWE contained therein. Since the frequencies are relative, the ranking should be sufficient and we don't need a freq attribute.
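One way a tagger could take such a ranked pos attribute into account is sketched below. The policy is an assumption made for illustration (keep the tagger's own guess when it is among the listed tags, otherwise fall back to the top-ranked tag); nothing in this thread fixes that policy, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

public class RankedPosSketch {
    /** Splits a pos attribute such as "JJ RB" or "JJ,RB" into a ranked list of tags. */
    static List<String> rankedTags(String posAttribute) {
        return Arrays.asList(posAttribute.trim().split("[,\\s]+"));
    }

    /**
     * Assumed policy: accept the tagger's guess if the pos attribute allows it,
     * otherwise use the highest-ranked tag from the attribute.
     */
    static String resolve(String posAttribute, String taggerGuess) {
        List<String> ranked = rankedTags(posAttribute);
        return ranked.contains(taggerGuess) ? taggerGuess : ranked.get(0);
    }

    public static void main(String[] args) {
        System.out.println(resolve("JJ RB", "NN")); // prints JJ
        System.out.println(resolve("JJ RB", "RB")); // prints RB
    }
}
```

With this policy, an NN guess for "ad hoc" would be overridden by JJ, while an RB guess would be kept.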
I am a bit concerned about using an "alternative" element, though I appreciate its utility. I'm not sure why span tagging wouldn't be able to do this however --- in fact I think it would be better suited for it. Can you explain?
I am probably wrong as I can't think of a good example of why <a> would be better than span tagging.