The OpenNLP Grok Library / Discussion / Open Discussion: answering some questions about Grok

Jason Baldridge - 2002-03-05

This is in reply to a post by Mike Atkinson on the OpenNLP forum, which I thought would be better answered here. -j

Mike's post:
>Hi,
>
>I've downloaded and experimented with OpenNLP, and in particular with Grok, for a day or so and have some questions.
>
>1. Grok seems to be missing many features of a full English NLP system. It does not seem to handle verb tenses, irregular verbs, auxiliary verbs, agreement, movement, subcat, WH queries, etc. I assume that you have plans to add these and have, at least in outline, ideas about how they might be implemented. Could you please give some indication about future plans?
>
>2. The feature system does not seem to support sets; these are probably needed for agreement and verb form. Have I missed it, are they still to be implemented, or do you have a different method in mind?
>
>3. How do you plan to add a lexicon? For testing, a small lexicon is all that is required, and this can be done manually. A full lexicon is a *lot* of work and I assume you have no plans to produce one. Is there a suitable free lexicon we can use?
>
>4. At a higher level, are there any plans to use an ontology? I understand that CYCL (www.cyc.com) are about to open-source their ontology and this might be suitable.
>
>5. Finally, are there any tasks which I can help with? A couple of ideas are:
>A high quality sentence break detector (i.e. Mr. M.R.King Jn. d.o.b. 20.5.65).
>Phrase detection and tagging ("John kicked the bucket").
>
>Mike Atkinson

Grok has been going through a major reworking since basically last October. Before that, it did have a wide-coverage, semi-automatically acquired lexicon, and support for most of the features you described above. However, I broke Grok into bits and have been rebuilding it ever since so that the Grok core is more efficient, more modular, and more suitable for me to use as a platform to test out my Ph.D. research into categorial grammars. At the moment, there are still a number of features which I have not brought back into the system since I am only working on functionality central to my dissertation, so Grok is not at the moment a stable platform for parsing. It is getting there, though, and should be a big improvement over the Grok of a year ago when it arrives.

The architecture for the feature system is quite general: attributes are paired with Objects in the opennlp.common.unify.FeatureStructure interface, so it is ready to accept set-valued features. They are not implemented yet, but they are in the plans.
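
To make the idea of set-valued features concrete, here is a minimal sketch in Python (Grok itself is Java, and this is not the opennlp.common.unify.FeatureStructure interface). It assumes one possible interpretation: a set-valued feature lists the atomic values an attribute may take, and unifying two values means intersecting those sets. The names `unify` and `unify_values` are hypothetical.

```python
from typing import Any, Dict, Optional

# Assumption: a feature structure is a flat mapping from attribute names to
# values, where a value is either atomic or a frozenset of permitted atoms.
FeatureStructure = Dict[str, Any]


def unify_values(a: Any, b: Any) -> Optional[Any]:
    """Unify two feature values by set intersection; None signals failure."""
    a_set = a if isinstance(a, frozenset) else frozenset([a])
    b_set = b if isinstance(b, frozenset) else frozenset([b])
    common = a_set & b_set
    if not common:
        return None
    # Collapse a singleton set back to a plain atomic value.
    return next(iter(common)) if len(common) == 1 else common


def unify(fs1: FeatureStructure, fs2: FeatureStructure) -> Optional[FeatureStructure]:
    """Unify two flat feature structures; None if any shared attribute clashes."""
    result = dict(fs1)
    for attr, value in fs2.items():
        if attr in result:
            merged = unify_values(result[attr], value)
            if merged is None:
                return None
            result[attr] = merged
        else:
            result[attr] = value
    return result


# "sheep" is compatible with singular or plural; "are" demands plural.
sheep = {"cat": "n", "num": frozenset({"sg", "pl"})}
are = {"num": "pl"}
print(unify(sheep, are))
```

With this reading, agreement falls out of ordinary unification: an underspecified noun narrows to plural when combined with a plural verb, and two incompatible atomic values fail.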

The plan for adding a lexicon is basically to use the technique we used before for Grok 0.4.0, which is to hook up an automatically acquired lexicon to a "frame" lexicon that doesn't have many entries, but contains the semantic and feature info for a large number of constructions. You can see the details of how this was done before in the following paper:

http://www.cogsci.ed.ac.uk/~jmb/ccgcover.ps.gz

Anyhow, it is not always the desirable thing to build a "full" lexicon -- small, domain-specific lexicons can be of tremendous use and have higher accuracy and performance on many tasks.

That's interesting news about Cyc. We don't have any particular plans to use ontologies, but it would certainly be interesting if someone wanted to work on improving parsing performance or coverage through the use of an ontology.

It would be great if you wanted to help out! We already have a sentence break detector that performs quite well on Wall Street Journal text, and it would be interesting to have it adapted to more common text genres. A multi-word expression recognizer would be a valuable addition, actually. The Lingo group at Stanford have become interested in multi-word expressions, and though I haven't looked at their papers on it yet, they are probably good places to start:

http://lingo.stanford.edu/pubs/WP-2001-01.pdf
http://lingo.stanford.edu/pubs/WP-2001-03.pdf

So, if you want to come on board and get coding, that would be great. We'd be more than happy to help you get started. The only thing is that Grok really is unstable at the moment, so you'll likely find a steep learning curve and lots of surprises.

If that is a bit daunting, you should definitely check out the Gate system (http://gate.ac.uk), which is just nearing a major stable release, and Stanford's LKB system (http://www-csli.stanford.edu/~aac/lkb.html), which is a very well-developed system for working with grammars defined via type-feature structures.

Jason

- Mike Atkinson - 2002-03-06
  
  Thanks for the swift response. I've read the papers suggested and much of the stuff on the Stanford MWE reading group page.
  
  My current thinking is to start small and easy with Fixed Expressions.
  
  "There is a large class of immutable expressions in English that defy conventions of grammar and compositional interpretation. This class includes by and large, in short, kingdom come, and every which way. Many other MWEs, though perhaps analyzable to scholars of the languages whence they were borrowed, belong in this class as well, at least for the majority of speakers: ad hoc (cf. ad nauseum, ad libitum, ad hominem,...), Palo Alto (cf. Los Altos, Alta Vista,...), etc.
  Fixed expressions are fully lexicalized and undergo neither morphosyntactic variation (cf. *in shorter) nor internal modification (cf. *in very short). As such, a simple words-with-spaces representation is sufficient. If we were to adopt a compositional account of fixed expressions, we would have to introduce a lexical entry for words such as hoc, resulting in overgeneration and the idiomaticity problem (see above)." - Multiword Expressions: A Pain in the Neck for NLP? Ivan A. Sag, et al.
  
  "ad hoc" can be pre-processed into a 'word' ad_hoc, which can then be entered into the main lexicon in the usual way. This has minimal interaction with other parts of Grok; all other types of MWE require interaction and coordination with the main lexicon and processing rules.
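
A rough sketch of that pre-processing step, in Python rather than Grok's Java, and with hypothetical names throughout (`join_fixed_expressions` is not part of Grok). It assumes the text is already split into words and that matching is case-insensitive and prefers the longest listed expression.

```python
from typing import Iterable, List, Sequence, Tuple


def join_fixed_expressions(
    words: Sequence[str], expressions: Iterable[Tuple[str, ...]]
) -> List[str]:
    """Replace each listed fixed expression with one underscore-joined word."""
    # Try longer expressions first so "by and large" wins over any shorter prefix.
    by_length = sorted(
        {tuple(w.lower() for w in e) for e in expressions if e},
        key=len,
        reverse=True,
    )
    lowered = [w.lower() for w in words]
    out: List[str] = []
    i = 0
    while i < len(words):
        for expr in by_length:
            if tuple(lowered[i:i + len(expr)]) == expr:
                # Keep the original casing of the words being joined.
                out.append("_".join(words[i:i + len(expr)]))
                i += len(expr)
                break
        else:
            out.append(words[i])
            i += 1
    return out


FIXED = [("ad", "hoc"), ("by", "and", "large")]
print(join_fixed_expressions("This was an ad hoc fix by and large".split(), FIXED))
```

The rest of the pipeline then only ever sees `ad_hoc` as a single word, which is why this approach needs no coordination with the lexicon beyond one ordinary entry.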
  
  It seems a good idea to define an XML schema to hold the MWE 'lexicon'. This can start off quite simple and then be extended to handle more complex MWE types. The MWE lexicon should also be multi-level, so that individual portions can be added for domain-specific processing.
  
  Mike Atkinson
  
  - Jason Baldridge - 2002-03-06
    
    Good idea to start with fixed expressions.
    
    I think the way to go is to mark things like ad hoc as a multiword token in the NLPDocument, e.g.:
    
    <t>
    <w>ad</w>
    <w>hoc</w>
    </t>
    
    Then a lexicon implementation that can work with NLPDocuments can provide syntactic categories for entire tokens and just keep entries like "ad hoc" directly inside the lexicon.
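
As an illustration of a lexicon keyed on whole tokens, here is a small Python sketch using the standard library's `xml.etree.ElementTree`. It is not Grok's lexicon API: the `LEXICON` table, its category labels, and the function names are all made up for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Hypothetical toy lexicon. Keys are the words of one token; the category
# strings are placeholders, not Grok's categorial grammar notation.
LEXICON = {
    ("ad", "hoc"): "adj",
    ("fix",): "n",
}


def token_key(t: ET.Element) -> Tuple[str, ...]:
    """Collect the lower-cased words of a <t> element into a lookup key."""
    return tuple((w.text or "").strip().lower() for w in t.findall("w"))


def categories(sentence_xml: str) -> List[Tuple[str, Optional[str]]]:
    """Look up each token of an <s> element as a unit; None if not in the lexicon."""
    s = ET.fromstring(sentence_xml)
    result = []
    for t in s.iter("t"):
        key = token_key(t)
        result.append((" ".join(key), LEXICON.get(key)))
    return result


print(categories("<s><t><w>ad</w><w>hoc</w></t><t><w>fix</w></t></s>"))
```

Because the lookup key is the whole token, "ad hoc" gets a category without "hoc" ever needing an entry of its own.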
    
    Identifying which expressions are fixed is another matter. Using an MWE lexicon is a good idea, and we can start it off with entries entered by hand to test it out. Ultimately, it would be nice to have a semi-automated MWE finder that combs through lots of texts and suggests likely MWEs. Maybe someone has already been working on this. I recall seeing a paper on this somewhere. In any case, it is useful to be able to declare your own MWEs when you are creating a domain-specific grammar.
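
One simple way such a finder could rank candidates is pointwise mutual information (PMI) over adjacent word pairs. This is a standard collocation heuristic swapped in for illustration, not necessarily the method from the paper recalled above, and the Python sketch below uses hypothetical names and a hypothetical `min_count` threshold.

```python
import math
from collections import Counter
from typing import Iterable, List, Sequence, Tuple


def suggest_bigrams(
    sentences: Iterable[Sequence[str]], min_count: int = 2
) -> List[Tuple[Tuple[str, str], float]]:
    """Rank adjacent word pairs by PMI, highest first, as MWE candidates."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sent in sentences:
        words = [w.lower() for w in sent]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    scored = []
    for (a, b), n in bigrams.items():
        if n < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = n / total_bi
        p_a = unigrams[a] / total_uni
        p_b = unigrams[b] / total_uni
        scored.append(((a, b), math.log2(p_pair / (p_a * p_b))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


corpus = [
    "an ad hoc fix".split(),
    "the ad hoc committee".split(),
    "the fix".split(),
    "the committee met".split(),
]
print(suggest_bigrams(corpus))
```

The output is only a list of suggestions; a person would still review it before anything goes into the MWE lexicon, which keeps the process semi-automated.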
    
    - Mike Atkinson - 2002-03-08
      
      OK, I'll start with lexicalised fixed expressions, as these are easy to do.
      
      After looking at Gate, I've decided to just use lists of MWEs rather than XML (to start with at least). This should make using the same lists trivial with Gate/Jade.
      
      True lexicalised fixed expressions seem to be quite small in number and so fairly easy to add by hand. Discovering other types of MWE seems to be a research topic. When I get lexicalised fixed expressions completed, I will probably attack MWEs where one word is variable ("taxi rank"/"taxi ranks", "{rain/raining/rained} cats and dogs").
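
A possible shape for the variable-word case, sketched in Python with hypothetical names. It assumes each list entry is written with `|` separating the allowed forms at a position (for example `rain|raining|rained cats and dogs`); that notation is an invention for this example, not a Gate or Grok list format.

```python
from typing import List, Sequence, Set, Tuple

Pattern = Tuple[Set[str], ...]  # one set of allowed word forms per position


def compile_pattern(spec: str) -> Pattern:
    """Turn 'rain|raining|rained cats and dogs' into per-position sets of forms."""
    return tuple(set(slot.lower().split("|")) for slot in spec.split())


def find_mwe(words: Sequence[str], pattern: Pattern) -> List[Tuple[int, int]]:
    """Return (start, end) word spans, end exclusive, where the pattern matches."""
    if not pattern:
        return []
    lowered = [w.lower() for w in words]
    n = len(pattern)
    spans = []
    i = 0
    while i + n <= len(lowered):
        if all(lowered[i + k] in pattern[k] for k in range(n)):
            spans.append((i, i + n))
            i += n  # skip past the match so spans never overlap
        else:
            i += 1
    return spans


rain = compile_pattern("rain|raining|rained cats and dogs")
print(find_mwe("It rained cats and dogs all day".split(), rain))
```

A plain fixed expression is just the special case where every position has exactly one form, so the same list can hold both kinds of entry.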
      
- Mike Atkinson - 2002-03-06
  
  Another question: what is the schema for the NLPDocument, apart from the obvious <s><p><t><w>word</w></t></p></s>?
  
  Mike Atkinson
  
  - Jason Baldridge - 2002-03-06
    
    There is no schema at the moment, since it is under development. But what you've put there is the essential backbone of the schema. The attributes currently in use are "type" under "t" (Token) and "pos" under "w" (Word). Type indicates things such as the token being a name, a date, etc. POS is the part-of-speech.
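
For illustration, a sentence using those two attributes could be built and read back as follows. This is a Python sketch with the standard library's `xml.etree.ElementTree`, not Grok's NLPDocument code. The function names are hypothetical, and the attribute values shown ("name", "NNP", "VBZ") are assumed examples, since the thread does not list the values actually in use.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

Word = Tuple[str, Optional[str]]          # (text, pos or None)
Token = Tuple[Optional[str], List[Word]]  # (type or None, words)


def build_sentence(tokens: List[Token]) -> ET.Element:
    """Build an <s> element with "type" on <t> and "pos" on <w> where given."""
    s = ET.Element("s")
    for tok_type, words in tokens:
        t = ET.SubElement(s, "t")
        if tok_type is not None:
            t.set("type", tok_type)
        for text, pos in words:
            w = ET.SubElement(t, "w")
            w.text = text
            if pos is not None:
                w.set("pos", pos)
    return s


def read_sentence(s: ET.Element) -> List[Token]:
    """Recover the token list from an <s> element."""
    return [
        (t.get("type"), [(w.text, w.get("pos")) for w in t.findall("w")])
        for t in s.findall("t")
    ]


tokens = [
    ("name", [("Mike", "NNP"), ("Atkinson", "NNP")]),
    (None, [("codes", "VBZ")]),
]
xml_text = ET.tostring(build_sentence(tokens), encoding="unicode")
print(xml_text)
```

Keeping "type" on the token and "pos" on the word means a multi-word token can carry one type while each of its words keeps its own part-of-speech.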
    
    My intent is to merge some of the ideas that have come out of the use of the NLPDocuments with the stand-off mark-up architecture employed by Gate, but this won't happen for a while yet. Anyway, check out http://gate.ac.uk/sale/tao/index.html for details. The Javadoc for Gate is a good thing to look at in this respect: http://gate.ac.uk/gate/doc/javadoc/index.html, and in particular: http://gate.ac.uk/gate/doc/javadoc/gate/package-summary.html
    
    The Grok preprocessing components would then become something like the gate.creole package.
    
    Jason
    
    But it should not
    
- Mike Atkinson - 2002-03-08
  
  OK, I've coded something up which produced this as output.
  
  <nlpDocument>
  <text>
      <p>
        <s>
          <t type="mwe">
            <w>ad</w>
            <w>hoc</w>
          </t>
        </s>
      </p>
  </text>
  </nlpDocument>
  
  If that is OK, I'll be ready to commit Sunday/Monday night, or if you prefer I could send a zip file with the changes in.
  
  Mike Atkinson (mike@ladyshot.demon.co.uk)
  
  - Jason Baldridge - 2002-03-11
    
    Cool. Just send it along to me as an attachment and I'll test it out and commit it to the repository.
    