Multiword expression package (Mike)

2002-03-12
2002-03-15
  • Jason Baldridge

    Jason Baldridge - 2002-03-12

    (Mike Atkinson sent this message, with the code, and I'm posting it here so that I can respond here).

    Jason,

    I have attached a zip file of the MWE code plus data. Much of the data was extracted from WordNet, but the FixedLexical data comes from perusing various dictionaries for foreign phrases (and is incomplete).

    The code & data go in package opennlp.grok.preprocess.

    I have been using the pipeline:

        private String[] ppLinks = {
            "opennlp.grok.preprocess.sentdetect.EnglishSentenceDetectorME",
            "opennlp.grok.preprocess.tokenize.EnglishTokenizerME",
            "opennlp.grok.preprocess.mwe.EnglishVariableLexicalMWE"
        };

    I've added simple variable MWE expressions, with noun MWEs having a variable last word (to handle plurals) and verb MWEs having a variable first word (to handle tenses). This will probably not work in every case, but the vast majority will work correctly. It slightly over-generates, so some misspellings will become part of an MWE.
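
The variable-word idea above can be sketched roughly as follows. This is only an illustration, not the code from the zip file: the class name `VariableMweSketch`, the toy lexicon entries, and the suffix lists are all assumptions made up for the example. Noun MWEs let the last word vary and verb MWEs let the first word vary, using naive suffix stripping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of variable MWE matching; not the code from the zip file.
class VariableMweSketch {
    // Toy lexicons (assumed entries); the real data was extracted from WordNet.
    static final Set<String> NOUN_MWES = new HashSet<>(Arrays.asList("ice cream", "post office"));
    static final Set<String> VERB_MWES = new HashSet<>(Arrays.asList("give up", "look after"));

    // Deliberately naive suffix lists (assumptions, not real morphology).
    static final String[] NOUN_SUFFIXES = {"s", "es"};
    static final String[] VERB_SUFFIXES = {"s", "es", "d", "ed", "ing"};

    // Returns the word itself plus every form obtained by stripping one listed suffix.
    static List<String> candidates(String word, String[] suffixes) {
        List<String> out = new ArrayList<>();
        out.add(word);
        for (String suffix : suffixes) {
            if (word.length() > suffix.length() && word.endsWith(suffix)) {
                out.add(word.substring(0, word.length() - suffix.length()));
            }
        }
        return out;
    }

    // Noun MWEs: only the last word may vary (plurals).
    static boolean matchesNoun(String phrase) {
        String p = phrase.toLowerCase();
        int i = p.lastIndexOf(' ');
        if (i < 0) {
            return false; // a single word is not a multiword expression
        }
        String head = p.substring(0, i + 1); // fixed part, including the trailing space
        for (String last : candidates(p.substring(i + 1), NOUN_SUFFIXES)) {
            if (NOUN_MWES.contains(head + last)) {
                return true;
            }
        }
        return false;
    }

    // Verb MWEs: only the first word may vary (tenses).
    static boolean matchesVerb(String phrase) {
        String p = phrase.toLowerCase();
        int i = p.indexOf(' ');
        if (i < 0) {
            return false;
        }
        String tail = p.substring(i); // fixed part, including the leading space
        for (String first : candidates(p.substring(0, i), VERB_SUFFIXES)) {
            if (VERB_MWES.contains(first + tail)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesNoun("ice creams"));
        System.out.println(matchesVerb("looked after"));
        System.out.println(matchesVerb("gave up"));
        System.out.println(matchesVerb("gived up"));
    }
}
```

With these toy lists, "ice creams" and "looked after" match, the irregular "gave up" is missed, and the misspelling "gived up" is accepted, which is the kind of over-generation described above.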

    I've tried to get the POSTagger to work, but it fails loading the data.

    Having now tried this, I think that the NLPDocument should have
    <s>
       <t type="mwe">
          <w cat="N" pos="adv,adj">ad hoc</w>
       </t>
    </s>
    at least for the FixedLexical MWE (foreign words), as the cat-tagger and pos-tagger probably would not work correctly.

    --
    Mike Atkinson

     
    • Jason Baldridge

      Jason Baldridge - 2002-03-12

      I've tried out the package briefly --- everything compiles and it works fine in a pipeline, so I've committed it to the head of Grok cvs. Thanks!  Though I haven't had a chance to grok your code in full, I'm impressed by what I saw, and will go ahead and add you as a Grok developer so that you can work on the package directly.

      The reason the POSTagger failed to load the data is that the maxent model is stored in the grok/src/java/opennlp/grok/preprocess/postag/data directory, and you need to be sure that is in your classpath.  I've added a script called "runpipe" in grok/samples/pipe that will make sure the classpath stuff works out fine (as long as you have the environment variable GROK_HOME set to where you have installed Grok).  Do a cvs update and try it.

      Like I said, I haven't had time to look at everything in great detail, but here is part of the output of running the opennlp.grok.preprocess.mwe.EnglishFixedLexicalMWE as part of a pipeline (the code is in SimplePipe.java of the pipe sample).

            <t>
              <w pos="RB">not</w>
            </t>
            <t>
              <w pos="VB">be</w>
            </t>
            <t type="mwe">
              <w pos="JJ">ad</w>
              <w>hoc</w>
            </t>
            <t>
              <w pos="CC">and</w>
            </t>

      Notice that the POS tagger is only expecting single word tokens... oops! Would you be interested in updating the process(NLPDocument doc) method of POSTaggerME to handle MWEs?
      Great stuff!

       
    • Mike Atkinson

      Mike Atkinson - 2002-03-13

      Am I right in thinking you are using this POS tag set?

      1.   CC     Coordinating conjunction
      2.   CD     Cardinal number
      3.   DT     Determiner
      4.   EX     Existential `there'
      5.   FW     Foreign word
      6.   IN     Preposition or subordinating conjunction
      7.   JJ     Adjective
      8.   JJR    Adjective, comparative
      9.   JJS    Adjective, superlative
      10.  LS     List item marker
      11.  MD     Modal
      12.  NN     Noun, singular or mass
      13.  NNS    Noun, plural
      14.  NNP    Proper noun, singular
      15.  NNPS   Proper noun, plural
      16.  PDT    Predeterminer
      17.  POS    Possessive ending
      18.  PRP    Personal pronoun
      19.  PRP$   Possessive pronoun
      20.  RB     Adverb
      21.  RBR    Adverb, comparative
      22.  RBS    Adverb, superlative
      23.  RP     Particle
      24.  SYM    Symbol
      25.  TO     `to'
      26.  UH     Interjection
      27.  VB     Verb, base form
      28.  VBD    Verb, past tense
      29.  VBG    Verb, gerund or present participle
      30.  VBN    Verb, past participle
      31.  VBP    Verb, non-3rd person singular present
      32.  VBZ    Verb, 3rd person singular present
      33.  WDT    Wh-determiner
      34.  WP     Wh-pronoun
      35.  WP$    Possessive wh-pronoun
      36.  WRB    Wh-adverb
      37.  "      Simple double quote
      38.  $      Dollar sign
      39.  #      Pound sign
      40.  `      Left single quote
      41.  '      Right single quote
      42.  ``     Left double quote
      43.  ''     Right double quote
      44.  (      Left parenthesis (round, square, curly or angle bracket)
      45.  )      Right parenthesis (round, square, curly or angle bracket)
      46.  ,      Comma
      47.  .      Sentence-final punctuation
      48.  :      Mid-sentence punctuation

       
      • Jason Baldridge

        Jason Baldridge - 2002-03-13

        The description of the tagset the tagger uses can be found here:

        ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz

         
    • Mike Atkinson

      Mike Atkinson - 2002-03-13

      I notice that the POS tagger quite often gets the POS wrong, as many words have multiple valid POS in the local context.

      WordNet gives three meanings for "ad hoc".

      adj "an ad hoc committee meeting"
      adj "a coordinated policy instead of ad hoc decisions"
      adv "they were appointed ad hoc"

      My version of the POS tagger gives them NN, JJ, NN, respectively. Ideally, we would want the POS tagger to always get it right, but I doubt it's possible due to the many quirks of English and the limited training data to provide context.

      Using the WordNet data I could POS tag the MWE as
      <t type="mwe">
        <w pos="JJ,RB">ad hoc</w>
      </t>

      and then change the logic in the POS tagger so that it takes into account previous pos attributes. Alternatives would be

      <t type="mwe">
        <w pos="JJ,RB">ad</w>
        <w pos="JJ,RB">hoc</w>
      </t>

      or
      <t type="mwe">
        <a>
          <w pos="JJ">ad hoc</w>
          <w pos="RB">ad hoc</w>
        </a>
      </t>

      where "a" is a new tag introduced to show alternative readings. "a" may be useful in other circumstances, for instance if the sentence structure is not clear.

      <p>
        <a>
          <s>  ... </s>
          <s>  ... </s>
        </a>
      </p>

      Alternatives using "a" could lead to a combinatorial explosion if not used sparingly, but they would capture constraints between different readings, which span tagging (as used by Gate) does not.
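
To put a rough number on the combinatorial explosion: if each "a" element is resolved independently, the count of complete readings is the product of the number of alternatives in each element. The sketch below only illustrates that arithmetic; `AlternativeReadingsSketch` and `countReadings` are hypothetical names, not part of Grok.

```java
// Hypothetical illustration; not part of Grok.
class AlternativeReadingsSketch {
    // Number of complete readings when each <a> element is resolved independently:
    // the product of the alternative counts, with overflow reported as an exception.
    static long countReadings(int[] alternativesPerElement) {
        long total = 1;
        for (int n : alternativesPerElement) {
            total = Math.multiplyExact(total, (long) n);
        }
        return total;
    }

    public static void main(String[] args) {
        // Ten <a> elements with two alternatives each.
        int[] counts = {2, 2, 2, 2, 2, 2, 2, 2, 2, 2};
        System.out.println(countReadings(counts));
    }
}
```

Ten elements with two alternatives each already give 1024 readings, so the count doubles with every further two-way "a" element.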

      It would also be possible, with a bit of work, to add relative frequencies, extracted from the WordNet data.

      <t type="mwe">
        <w pos="JJ,RB" freq="0.6,0.4">ad hoc</w>
      </t>

       
      • Jason Baldridge

        Jason Baldridge - 2002-03-13

        How about:

        <t type="mwe" pos="JJ RB">
          <w>ad</w>
          <w>hoc</w>
        </t>

        The semantics of multiple tags in the pos attribute for token is then that it is a ranked list of likely tags for the MWE contained therein.  Since the frequencies are relative, the ranking should be sufficient and we don't need a freq attribute.
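
A minimal sketch of how relative frequencies could be collapsed into such a ranked pos value, and how a consumer might read it back. The class and method names (`RankedPosSketch`, `toRankedPos`, `rankedTags`) are hypothetical and not part of Grok, and the 0.6/0.4 figures are just the illustrative values from the freq example earlier in the thread.

```java
import java.util.Arrays;

// Hypothetical helper; not part of Grok.
class RankedPosSketch {
    // Builds a space-separated pos value with the most frequent tag first.
    // tags[i] is assumed to pair with freqs[i]; ties keep their input order.
    static String toRankedPos(String[] tags, double[] freqs) {
        if (tags.length != freqs.length) {
            throw new IllegalArgumentException("tags and freqs must have the same length");
        }
        Integer[] order = new Integer[tags.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort indices by descending frequency (stable sort for object arrays).
        Arrays.sort(order, (a, b) -> Double.compare(freqs[b], freqs[a]));
        StringBuilder sb = new StringBuilder();
        for (int i : order) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(tags[i]);
        }
        return sb.toString();
    }

    // Splits a ranked pos value back into tags; index 0 is the most likely tag.
    static String[] rankedTags(String posAttr) {
        return posAttr.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String pos = toRankedPos(new String[] {"RB", "JJ"}, new double[] {0.4, 0.6});
        System.out.println(pos);
        System.out.println(rankedTags(pos)[0]);
    }
}
```

Once the value is ranked, the frequencies themselves can be dropped, which is the point of leaving out the freq attribute.
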
        I am a bit concerned about using an "alternative" element, though I appreciate its utility.  I'm not sure why span tagging wouldn't be able to do this however --- in fact I think it would be better suited for it.  Can you explain?

         
    • Mike Atkinson

      Mike Atkinson - 2002-03-15

      I am probably wrong as I can't think of a good example of why <a> would be better than span tagging.

       
