OpenNLP / Discussion / Open Discussion: pos tagging interface

Anonymous - 2000-06-06

I'm really impressed that this effort is up and running.

A query about the POS tagger interface: is it possible to permit the tagger
to accept POS tags in the tokenized input, i.e. to have input in the form
of a set of tokens with an optional set of POS tags per token?

- Gann Bierner - 2000-06-06
  
  Thanks!
  
  We're definitely open to suggestions for the interfaces. Could you give me an example of where this type of input would be useful for a pos tagger? Is it that you have some sort of prior knowledge about what the possible tags are?
  
  Gann
  
  - Anonymous - 2000-06-07
    
    (I'm still in the process of looking through the system, so please excuse any oversights.)
    
    Tagging is usually thought of as an initial process. However, if you're dealing with
    messy data, you need to do a lot of preprocessing before you can even
    get to tagging. This preprocessing may identify, for example, known chunks of text
    which can be offered as single units, which may or may not have associated features.
    
    An example of this type of input would be:
    
    the
    vw beetle NN
    is
    groovy JJ
    .
    
    As you can see, the text is already tokenized and there are POS tags associated with
    some of these tokens. I've noticed when looking at other POS taggers that this type of
    feature is present in a few but absent from the majority. Of course,
    if you have it, it simply removes a certain amount of work that the tagger itself has
    to do (or replaces it).
    
    One overhead of this approach is that the string stops being the universal
    data representation and a more complex object is introduced. Still, I believe that a simple
    representation would be a string plus an attribute feature list.
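
A minimal sketch of what "a string plus an attribute feature list" could look like, written in Python purely for illustration. The `Token` class, its `features` dictionary, and the `"pos"` key are hypothetical names, not part of any existing OpenNLP interface. The sentence built here mirrors the example input above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Token:
    """A token string plus an optional attribute/feature list (hypothetical)."""
    text: str
    features: Dict[str, str] = field(default_factory=dict)

    @property
    def pos(self) -> Optional[str]:
        # Pre-assigned tag, or None when the tagger still has to decide.
        return self.features.get("pos")


# The example input: two tokens arrive with tags, three without.
sentence: List[Token] = [
    Token("the"),
    Token("vw beetle", {"pos": "NN"}),  # multi-word chunk offered as one unit
    Token("is"),
    Token("groovy", {"pos": "JJ"}),
    Token("."),
]

# Tokens the tagger would still need to label.
untagged = [t.text for t in sentence if t.pos is None]
```

Keeping the features in an open-ended mapping means earlier modules could attach attributes other than a POS tag without changing the token type.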
    
    If you think this is a reasonable extension of the tagger interface, I'd be happy to draft
    a more formal specification, etc.
    
    matt
    
    - Gann Bierner - 2000-06-07
      
      Okay, I understand now.
      
      So, you are suggesting that there might be an earlier module that identifies certain tokens and knows their part of speech. That seems reasonable, especially in restricted domains.
      
      Fortunately, we already have a nice structured data representation that will make this pretty easy to do. All of our preprocessing components use XML as their data representation. Some of them, like POS tagging, also have a lower-level data representation that can be used if desired.
      
      Basically, one would have to change our tagger so that when a POS tag is specified for a token in the XML, it considers only that tag for the token when searching for the highest-probability tag sequence. This should have the added bonus of speeding up the search. The pre-preprocessing you are suggesting would simply go earlier in the pipeline.
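
The idea of restricting the search to a pre-specified tag can be sketched with a generic Viterbi search in Python. This is not the OpenNLP tagger's actual code: the `viterbi` function, the tiny tag set, and the emission and transition probabilities are all made-up assumptions chosen only to show the effect of the constraint.

```python
import math
from typing import Dict, List, Optional, Tuple

TAGS = ["DT", "NN", "VBZ", "JJ"]  # toy tag set (assumption)
FLOOR = 1e-6                      # probability used for unseen events


def viterbi(tokens: List[str],
            emit: Dict[Tuple[str, str], float],
            trans: Dict[Tuple[str, str], float],
            fixed: List[Optional[str]]) -> List[str]:
    """Best tag sequence; fixed[i] is a pre-assigned tag or None."""
    columns = []  # columns[i][tag] = (log score, previous tag)
    for i, tok in enumerate(tokens):
        # A pre-assigned tag shrinks the candidate list to a single entry.
        candidates = [fixed[i]] if fixed[i] is not None else TAGS
        column = {}
        for tag in candidates:
            e = math.log(emit.get((tok, tag), FLOOR))
            if i == 0:
                column[tag] = (e, None)
                continue
            # Pick the best predecessor among the surviving previous tags.
            prev, score = max(
                ((p, s + math.log(trans.get((p, tag), FLOOR)))
                 for p, (s, _) in columns[-1].items()),
                key=lambda pair: pair[1])
            column[tag] = (score + e, prev)
        columns.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(columns[-1], key=lambda t: columns[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = columns[i][tag][1]
        path.append(tag)
    return path[::-1]


tokens = ["the", "vw beetle", "is", "groovy"]
emit = {("the", "DT"): 0.9, ("vw beetle", "NN"): 0.5, ("is", "VBZ"): 0.9,
        ("groovy", "JJ"): 0.5, ("groovy", "NN"): 0.6}
trans = {("DT", "NN"): 0.8, ("NN", "VBZ"): 0.7,
         ("VBZ", "JJ"): 0.5, ("VBZ", "NN"): 0.5}

free = viterbi(tokens, emit, trans, [None, None, None, None])
constrained = viterbi(tokens, emit, trans, [None, "NN", None, "JJ"])
```

With these toy numbers the unconstrained search labels "groovy" as NN, while the constrained search is forced to JJ. Each constrained position also contributes one candidate instead of `len(TAGS)`, which is where the expected speed-up would come from.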
      
      Are you at all interested in implementing this? It would be a good chance to get into the system... although I'm afraid that that area isn't particularly well documented.
      
      Gann
      