OpenNLP / Discussion / Open Discussion: Lexicon for multiword detection

Anonymous - 2004-12-16

Hello everybody!

Does OpenNLP use any lexicons or some heuristics to detect multiword expressions or compounds (e.g. green card, computer science etc.) in sentences? Or is it possible to integrate my own lexicon at some stage (e.g. before applying the Tagger)? How would the tagger treat such words / phrases?

Many thanks.

Cheers, Aly

- Thomas Morton - 2004-12-16
  
  Hi,
     There are no mechanisms in place to detect these sorts of constructions. The parser just puts NP labels around them, and the tagger assigns tags to the individual words. The multi-word aspect doesn't really affect the parse of the sentence.
  
  If these are pre-nominal modifiers then they are typically hyphenated and treated as a single token (at least in the WSJ):
  green-card application
  green-card here will be tagged as a JJ
  If the hyphen is missing then the NP structure is still the same according to current treebank guidelines as multi-word pre-nominal modifiers are not indicated by the bracketing.
  (NP green card application )
  
  Here is the parser output for sentences with these phrases:
  
  (TOP
  (S (NP (NNP Peter))
      (VP (VBD got)
        (NP (PRP$ his) (JJ green) (NN card)))
  (. .)))
  
  (TOP
  (S (NP (NNP Peter))
      (VP (VBZ likes)
        (NP (NN computer) (NN science)))
  (. .)))
  
  Hope this helps...Tom
  
- Anonymous - 2004-12-16
  
  Hi Tom,
  
  Thank you for the quick answer! It's a bit disappointing for me as a linguist, though: a 'green card' is not a card that is green, and should thus be tagged as NN, right?
  
  Cheers,
  Aly
  
- Thomas Morton - 2004-12-23
  
  Hi,
  Sorry it took me a while; I didn't catch the question at the end. Should green be tagged as an adjective or a noun in "green card"? Hmm. I think the according-to-Hoyle tag should be "JJ", since the guidelines indicate noun tags for words acting adjectivally only in cases where the word typically occurs as a noun, as in "snow blower" (I think; I'm not super up on the PTB tag guidelines, but I have read them before).
  
  I have heard many people make such comments in the past and they greatly irk me so I'm going to rant a bit. Please don't take it personally since you are not the first person to make this comment and probably won't be the last.
  
  <rant>
  The tagger does not do semantic interpretation and is strictly concerned with the syntactic role of this word. As far as syntax goes, I don't see much of any difference between these two cases; they are both noun modifications. This may be the role of some word-sense mechanism, but it's not something a pos-tagger is ever going to concern itself with.
  
  Let's for a moment assume it is the tagger's responsibility. If the tags were perfect for all such cases (according to whoever is asking), how would that help you assign meaningful semantics to "green card", the right-to-work thing, vs. "green card", one's Eagle's emblazoned credit card? I suspect that whatever mechanism allows for that kind of interpretation will not depend on whether green is a JJ or a NN.
  
  So as a person who wants to do something useful with NLP software, I'm not so disappointed.
  </rant>
  
  The short version of this is: it's not the tagger's problem, and it isn't even a tagging problem, but the above was more fun to write. Thanks for reading...Tom
  
- Anonymous - 2004-12-28
  
  Hi Tom,
  thanks a lot for your answer. I agree that a PoS tagger is not responsible for semantics. My problem is not how to tag the word "green". What I'm looking for is a tool that would detect multi-word expressions or compounds, such as "green card". I think it should be done at the tokenizing stage. After that, the PoS tagger should tag a sentence like "Peter got his green card" as
  (TOP
  (S (NP (NNP Peter))
  (VP (VBD got)
  (NP (PRP$ his) (NN 'green card')))
  (. .)))
  I appreciate your idea of creating an open source package with important linguistic tools. There are some similar projects of that kind (e.g. FreeLing - http://www.lsi.upc.es/~nlp/freeling/ and Unitex - http://www-igm.univ-mlv.fr/~unitex/). The first one is comparable to OpenNLP and supports multi-word detection (if a compound dictionary is provided). Having something similar in OpenNLP would be great!
  Cheers, Aly
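
The tokenizing-stage idea described above can be sketched in a few lines. This is a minimal illustration in Python, not OpenNLP code: the `COMPOUNDS` lexicon, the `merge_multiwords` function, and the underscore joiner are all assumptions made up for the example. It greedily merges the longest lexicon entry found at each position, so a tagger run afterwards would see one token per compound.

```python
# Hypothetical compound lexicon: each entry is a tuple of lowercased tokens.
COMPOUNDS = {("green", "card"), ("computer", "science")}


def merge_multiwords(tokens, lexicon=COMPOUNDS, joiner="_"):
    """Greedily merge the longest lexicon match starting at each position."""
    max_len = max((len(entry) for entry in lexicon), default=1)
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer entries win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in lexicon:
                merged.append(joiner.join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word entry starts here; keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_multiwords(["Peter", "got", "his", "green", "card", "."]))
# prints ['Peter', 'got', 'his', 'green_card', '.']
```

Note that a tagger trained on ordinary text has never seen a token like `green_card`, so it would still need some way to pick a tag for the merged unit.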
  
- Thomas Morton - 2005-02-05
  
  Hi Aly,
  Thanks for the links. Sorry it's taken me so long to report back.
  
  It would be pretty straightforward to extend the tokenizer to add such a capability, but I still don't see much utility in doing so. Unless the constituent boundaries split the multi-word up, you could use the same multi-word mechanism to put your words together after parsing, before you do whatever processing that helps with. From there you might have some idea about how the pos-tags should be assigned to the multi-word, or you may not even care.
  
  The tokenization approach has the advantage that it prevents the parser from putting a constituent boundary between the two words in such a way that you couldn't combine them w/o crossing some other tree structure. That shouldn't happen too often, however. The downside is that you now have to integrate this into the pos-tagging so that you get a reasonable tag for things that get combined. At first blush that looks like a head-rule system of sorts, so you know which tag is most important, and yuck, it just seems messy.
  
  I think I would be much more inclined to do this at the parser level, like so:
  (TOP
  (S (NP (NNP Peter))
  (VP (VBD got)
  (NP (PRP$ his) (CMP (JJ green) (NN card))))
  (. .)))
  
  The parser already has head-rules too. Still, I don't really see it buying anyone too much over just having the application which uses this information add the CMP constituent to an existing parse itself. Hope this helps...Tom
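
The application-side alternative mentioned above, adding a CMP constituent to an existing parse, might look roughly like the following. This is a sketch in Python under stated assumptions, not part of OpenNLP: `COMPOUNDS`, `parse_tree`, `add_cmp`, and `to_string` are hypothetical names, and the input is assumed to be a well-formed Penn Treebank style bracketing. Only sibling preterminals are wrapped, so a compound that the parser has split across constituents is left alone.

```python
import re

# Hypothetical compound lexicon: each entry is a tuple of lowercased words.
COMPOUNDS = {("green", "card"), ("computer", "science")}


def parse_tree(text):
    """Read a bracketed parse into nested [label, child, ...] lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        node = [tokens[pos]]          # constituent or POS label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])  # a word under a POS tag
                pos += 1
        pos += 1                      # skip ")"
        return node

    return read()


def is_preterminal(node):
    """True for a [POS-tag, word] pair."""
    return isinstance(node, list) and len(node) == 2 and isinstance(node[1], str)


def add_cmp(node, lexicon=COMPOUNDS):
    """Wrap sibling preterminals that spell a lexicon entry in a CMP node."""
    if not isinstance(node, list) or is_preterminal(node):
        return node
    children = [add_cmp(child, lexicon) for child in node[1:]]
    max_len = max((len(entry) for entry in lexicon), default=1)
    out, i = [], 0
    while i < len(children):
        # Longest match first, restricted to runs of adjacent preterminals.
        for n in range(min(max_len, len(children) - i), 1, -1):
            span = children[i:i + n]
            if all(is_preterminal(c) for c in span) and \
                    tuple(c[1].lower() for c in span) in lexicon:
                out.append(["CMP"] + span)
                i += n
                break
        else:
            out.append(children[i])
            i += 1
    return [node[0]] + out


def to_string(node):
    """Serialize nested lists back into a one-line bracketing."""
    if isinstance(node, str):
        return node
    return "(" + " ".join(to_string(part) for part in node) + ")"


parse = "(TOP (S (NP (NNP Peter)) (VP (VBD got) (NP (PRP$ his) (JJ green) (NN card))) (. .)))"
print(to_string(add_cmp(parse_tree(parse))))
```

The original POS tags stay in place under the CMP node, so no head-rule decision about a single tag for the compound is needed at this stage.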
  