The OpenNLP Grok Library / Discussion / Open Discussion: Multiword expression package (Mike)

Jason Baldridge - 2002-03-12

(Mike Atkinson sent this message, with the code, and I'm posting it here so that I can respond here).

Jason,

I have attached a zip file of the MWE code plus data. Much of the data was extracted from WordNet, but the FixedLexical data comes from perusing various dictionaries for foreign phrases (and is incomplete).

The code & data go in package opennlp.grok.preprocess.

I have been using the pipeline:

    private String[] ppLinks = {
        "opennlp.grok.preprocess.sentdetect.EnglishSentenceDetectorME",
        "opennlp.grok.preprocess.tokenize.EnglishTokenizerME",
        "opennlp.grok.preprocess.mwe.EnglishVariableLexicalMWE"
    };

I've added simple variable MWE expressions: noun MWEs have a variable last word (to handle plurals) and verb MWEs have a variable first word (to handle tenses). This will probably not work in every case, but the vast majority will work correctly. It slightly over-generates: some misspellings will become part of an MWE.
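
As a rough illustration of the variable-word idea, here is a minimal Java sketch. It is not the code from the zip file: the class name, the prefix rule used for the variable slot, and the example phrases are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of variable-slot MWE matching; not the actual Grok code. */
class VariableMweSketch {

    /** Drops a trailing 'e' or 'y' so inflected forms share a prefix with the lexicon word. */
    static String stem(String word) {
        if (word.length() > 3 && (word.endsWith("e") || word.endsWith("y"))) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    /** Assumed rule: a variable slot accepts any token starting with the stem of the lexicon word. */
    static boolean variableMatch(String lexiconWord, String token) {
        return token.toLowerCase().startsWith(stem(lexiconWord.toLowerCase()));
    }

    /**
     * Returns true if the tokens beginning at index start match the MWE.
     * Noun MWEs treat the last word as variable; verb MWEs treat the first word as variable.
     */
    static boolean matches(List<String> mwe, boolean isVerb, List<String> tokens, int start) {
        if (start < 0 || start + mwe.size() > tokens.size()) {
            return false;
        }
        int variableIndex = isVerb ? 0 : mwe.size() - 1;
        for (int i = 0; i < mwe.size(); i++) {
            String expected = mwe.get(i);
            String actual = tokens.get(start + i);
            boolean ok = (i == variableIndex)
                    ? variableMatch(expected, actual)
                    : expected.equalsIgnoreCase(actual);
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> noun = Arrays.asList("attorney", "general");
        List<String> verb = Arrays.asList("take", "off");
        // Plural of the last word of a noun MWE.
        System.out.println(matches(noun, false, Arrays.asList("two", "attorney", "generals"), 1)); // true
        // Inflected first word of a verb MWE.
        System.out.println(matches(verb, true, Arrays.asList("taking", "off"), 0)); // true
        // Irregular form: the prefix rule misses it.
        System.out.println(matches(verb, true, Arrays.asList("took", "off"), 0)); // false
        // Misspelling: the prefix rule over-generates.
        System.out.println(matches(verb, true, Arrays.asList("takz", "off"), 0)); // true
    }
}
```

The last two calls show both weaknesses mentioned above: an irregular form such as "took" is missed, while a misspelling such as "takz" is accepted as part of the MWE.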

I've tried to get the POSTagger to work, but it fails loading the data.

Having now tried this, I think that the NLPDocument should have
<s>
   <t type="mwe">
      <w cat="N" pos="adv,adj">ad hoc</w>
   </t>
</s>
at least for the FixedLexical MWEs (foreign words), as the cat-tagger and pos-tagger probably would not work correctly.

--
Mike Atkinson

- Jason Baldridge - 2002-03-12
  
  I've tried out the package briefly --- everything compiles and it works fine in a pipeline, so I've committed it to the head of Grok cvs. Thanks! Though I haven't had a chance to grok your code in full, I'm impressed by what I saw, and will go ahead and add you as a Grok developer so that you can work on the package directly.
  
  The reason the POSTagger failed to load the data is that the maxent model is stored in the grok/src/java/opennlp/grok/preprocess/postag/data directory, and you need to be sure that it is in your classpath. I've added a script called "runpipe" in grok/samples/pipe that will make sure the classpath is set up correctly (as long as you have the environment variable GROK_HOME set to where you have installed Grok). Do a cvs update and try it.
  
  Like I said, I haven't had time to look at everything in great detail, but here is part of the output of running the opennlp.grok.preprocess.mwe.EnglishFixedLexicalMWE as part of a pipeline (the code is in SimplePipe.java of the pipe sample):
  
      <t>
        <w pos="RB">not</w>
      </t>
      <t>
        <w pos="VB">be</w>
      </t>
      <t type="mwe">
        <w pos="JJ">ad</w>
        <w>hoc</w>
      </t>
      <t>
        <w pos="CC">and</w>
      </t>
  
  Notice that the POS tagger is only expecting single word tokens... oops! Would you be interested in updating the process(NLPDocument doc) method of POSTaggerME to handle MWEs?
  Great stuff!
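
  One way to picture such a change, as a minimal sketch only: collapse each token to a single string before tagging, so the tagger still sees one item per token, then write the tag back onto the token. Nothing here is the real NLPDocument or POSTaggerME API; the Token class, the "_" join character, and the toy tagger are assumptions made for illustration.

  ```java
  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;
  import java.util.function.Function;

  /** Hypothetical sketch of MWE-aware tagging; it does not use the real Grok classes. */
  class MweAwareTaggingSketch {

      /** Stand-in for a <t> element: one or more words plus a single tag for the whole token. */
      static class Token {
          final List<String> words;
          String pos;

          Token(String... words) {
              this.words = Arrays.asList(words);
          }
      }

      /** Collapses each token to one string so the tagger receives one item per token. */
      static List<String> toTaggerInput(List<Token> sentence) {
          List<String> input = new ArrayList<>();
          for (Token t : sentence) {
              input.add(String.join("_", t.words)); // "_" is an assumed join character
          }
          return input;
      }

      /** Runs a tagger over the collapsed tokens and stores one tag per token. */
      static void tag(List<Token> sentence, Function<List<String>, List<String>> tagger) {
          List<String> tags = tagger.apply(toTaggerInput(sentence));
          if (tags.size() != sentence.size()) {
              throw new IllegalStateException("tagger must return one tag per token");
          }
          for (int i = 0; i < sentence.size(); i++) {
              sentence.get(i).pos = tags.get(i);
          }
      }

      /** Toy tagger for the demo only: joined items get JJ, everything else gets VB. */
      static List<String> toyTagger(List<String> items) {
          List<String> tags = new ArrayList<>();
          for (String item : items) {
              tags.add(item.contains("_") ? "JJ" : "VB");
          }
          return tags;
      }

      public static void main(String[] args) {
          List<Token> sentence = Arrays.asList(new Token("be"), new Token("ad", "hoc"));
          tag(sentence, MweAwareTaggingSketch::toyTagger);
          for (Token t : sentence) {
              System.out.println(t.words + " -> " + t.pos);
          }
      }
  }
  ```

  A real model would only tag the joined form sensibly if it had seen such items in training, so this shows the bookkeeping rather than a solution to the tagging accuracy problem.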
  
- Mike Atkinson - 2002-03-13
  
  Am I right in thinking you are using this POS tag set?
  
   1. CC    Coordinating conjunction
   2. CD    Cardinal number
   3. DT    Determiner
   4. EX    Existential `there'
   5. FW    Foreign word
   6. IN    Preposition or subordinating conjunction
   7. JJ    Adjective
   8. JJR   Adjective, comparative
   9. JJS   Adjective, superlative
  10. LS    List item marker
  11. MD    Modal
  12. NN    Noun, singular or mass
  13. NNS   Noun, plural
  14. NNP   Proper noun, singular
  15. NNPS  Proper noun, plural
  16. PDT   Predeterminer
  17. POS   Possessive ending
  18. PRP   Personal pronoun
  19. PRP$  Possessive pronoun
  20. RB    Adverb
  21. RBR   Adverb, comparative
  22. RBS   Adverb, superlative
  23. RP    Particle
  24. SYM   Symbol
  25. TO    `to'
  26. UH    Interjection
  27. VB    Verb, base form
  28. VBD   Verb, past tense
  29. VBG   Verb, gerund or present participle
  30. VBN   Verb, past participle
  31. VBP   Verb, non-3rd person singular present
  32. VBZ   Verb, 3rd person singular present
  33. WDT   Wh-determiner
  34. WP    Wh-pronoun
  35. WP$   Possessive wh-pronoun
  36. WRB   Wh-adverb
  37. "     Simple double quote
  38. $     Dollar sign
  39. #     Pound sign
  40. `     Left single quote
  41. '     Right single quote
  42. ``    Left double quote
  43. ''    Right double quote
  44. (     Left parenthesis (round, square, curly or angle bracket)
  45. )     Right parenthesis (round, square, curly or angle bracket)
  46. ,     Comma
  47. .     Sentence-final punctuation
  48. :     Mid-sentence punctuation
  
  - Jason Baldridge - 2002-03-13
    
    The description of the tagset the tagger uses can be found here:
    
    ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz
    
- Mike Atkinson - 2002-03-13
  
  I notice that the POS tagger quite often gets the POS wrong, since many words have multiple valid POS tags in the local context.
  
  WordNet gives three meanings for "ad hoc".
  
  adj "an ad hoc committee meeting"
  adj "a coordinated policy instead of ad hoc decisions"
  adv "they were appointed ad hoc"
  
  My version of the POS tagger gives them NN, JJ, NN, respectively. Ideally, we would want the POS tagger to always get it right, but I doubt it's possible, given the many quirks of English and the limited training data available to provide context.
  
  Using the WordNet data I could POS tag the MWE as
  <t type="mwe">
  <w pos="JJ,RB">ad hoc</w>
  </t>
  
  and then change the logic in the POS tagger so that it takes into account previous pos attributes. Alternatives would be
  
  <t type="mwe">
  <w pos="JJ,RB">ad</w>
  <w pos="JJ,RB">hoc</w>
  </t>
  
  or
  <t type="mwe">
  <a>
      <w pos="JJ">ad hoc</w>
      <w pos="RB">ad hoc</w>
  </a>
  </t>
  
  where "a" is a new tag introduced to show alternative readings. "a" may be useful in other circumstances, for instance if the sentence structure is not clear.
  
  <p>
  <a>
      <s> ... </s>
      <s> ... </s>
  </a>
  </p>
  
  Alternatives using "a" could lead to a combinatorial explosion if not used sparingly, but they would capture constraints between different readings, which span tagging (as used by GATE) does not.
  
  It would also be possible, with a bit of work, to add relative frequencies, extracted from the WordNet data.
  
  <t type="mwe">
  <w pos="JJ,RB" freq="0.6,0.4">ad hoc</w>
  </t>
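
  A minimal Java sketch of how such attributes might be derived from per-tag counts, ordered from most to least frequent. The counts (3 for JJ, 2 for RB) are invented so that the output reproduces the 0.6 and 0.4 above; they are not taken from WordNet, and the class and method names are hypothetical.

  ```java
  import java.util.ArrayList;
  import java.util.LinkedHashMap;
  import java.util.List;
  import java.util.Locale;
  import java.util.Map;

  /** Hypothetical sketch: turns per-tag counts into pos and freq attribute text. */
  class PosFrequencySketch {

      /** Builds pos="..." freq="..." with tags ordered by descending count. */
      static String attributes(Map<String, Integer> counts) {
          int total = 0;
          for (int c : counts.values()) {
              total += c;
          }
          if (total == 0) {
              throw new IllegalArgumentException("counts must not be empty or all zero");
          }
          List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
          // List.sort is stable, so tied tags keep their original order.
          entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

          StringBuilder pos = new StringBuilder();
          StringBuilder freq = new StringBuilder();
          for (Map.Entry<String, Integer> e : entries) {
              if (pos.length() > 0) {
                  pos.append(',');
                  freq.append(',');
              }
              pos.append(e.getKey());
              // Relative frequency, rounded to one decimal place.
              freq.append(String.format(Locale.ROOT, "%.1f", e.getValue() / (double) total));
          }
          return "pos=\"" + pos + "\" freq=\"" + freq + "\"";
      }

      public static void main(String[] args) {
          Map<String, Integer> counts = new LinkedHashMap<>();
          counts.put("RB", 2); // invented count
          counts.put("JJ", 3); // invented count
          System.out.println(attributes(counts)); // pos="JJ,RB" freq="0.6,0.4"
      }
  }
  ```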
  
  - Jason Baldridge - 2002-03-13
    
    How about:
    
    <t type="mwe" pos="JJ RB">
    <w>ad</w>
    <w>hoc</w>
    </t>
    
    The semantics of multiple tags in the pos attribute for token is then that it is a ranked list of likely tags for the MWE contained therein. Since the frequencies are relative, the ranking should be sufficient and we don't need a freq attribute.
    I am a bit concerned about using an "alternative" element, though I appreciate its utility. I'm not sure why span tagging wouldn't be able to do this, however --- in fact I think it would be better suited for it. Can you explain?
    
- Mike Atkinson - 2002-03-15
  
  I am probably wrong as I can't think of a good example of why <a> would be better than span tagging.
  