pos tagging interface

Anonymous
2000-06-06
2000-06-07
  • Anonymous

    Anonymous - 2000-06-06

    I'm really impressed that this effort is up and running.

    A question about the POS tagger interface: could the tagger accept POS tags
    in the tokenized input, i.e. input in the form of a sequence of tokens,
    each with an optional set of POS tags?

     
    • Gann Bierner

      Gann Bierner - 2000-06-06

      Thanks!

      We're definitely open to suggestions for the interfaces.  Could you give me an example of where this type of input would be useful for a pos tagger?  Is it that you have some sort of prior knowledge about what the possible tags are?

      Gann

       
      • Anonymous

        Anonymous - 2000-06-07

        (I'm still in the process of looking through the system, so please excuse any oversights.)

        tagging is usually thought of as an initial process. however, if you're dealing with
        messy data, you need to do a lot of preprocessing before you can even
        get to tagging. this preprocessing may identify, for example, known chunks of text
        which can be offered as single units, with or without associated features.

        an example of the type of input would be

        the
        vw beetle    NN
        is
        groovy       JJ
        .

        as you can see, the text is already tokenized and some of the tokens have pos tags
        associated with them. I've noticed when looking at other pos taggers that this type of
        feature is present in a few but seems to be absent from most. of course,
        if you have it, it simply removes a certain amount of work that the tagger itself has
        to do (or replaces it).
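
A minimal sketch of how input like the example above could be read, assuming one token per line and treating the last whitespace-separated field as a tag only when it belongs to a known tag inventory. The function name `parse_pretagged` and the small `KNOWN_TAGS` set are hypothetical and are not part of the tagger discussed here.

```python
from typing import List, Optional, Set, Tuple

# Assumed tag inventory (a small illustrative subset); a real system would
# use the tagger's full tag set.
KNOWN_TAGS = {"DT", "NN", "NNS", "VBZ", "JJ", "."}


def parse_pretagged(
    text: str, tags: Set[str] = KNOWN_TAGS
) -> List[Tuple[str, Optional[str]]]:
    """Read one token per line; split off a trailing known tag if present."""
    result = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # Only treat the last field as a tag when something precedes it,
        # so a lone "." stays a token instead of becoming a tag.
        if len(fields) > 1 and fields[-1] in tags:
            result.append((" ".join(fields[:-1]), fields[-1]))
        else:
            result.append((" ".join(fields), None))
    return result


example = """the
vw beetle NN
is
groovy     JJ
.
"""
tokens = parse_pretagged(example)
```

Tokens without a tag come back paired with `None`, so a downstream tagger can tell which positions are still open.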

        one cost of this approach is that the string stops being the universal
        data representation and a more complex object is introduced. however, I believe that a simple
        representation would be a string plus an attribute feature list.
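
One way to picture "a string plus an attribute feature list" is sketched below. The `Token` class and the `"pos"` feature name are assumptions made for illustration only; they do not describe the project's actual data representation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Token:
    """A token string with an optional set of named features (hypothetical)."""

    text: str
    features: Dict[str, str] = field(default_factory=dict)

    def get(self, name: str, default: Optional[str] = None) -> Optional[str]:
        """Look up a feature, returning `default` when it is absent."""
        return self.features.get(name, default)


sentence = [
    Token("the"),
    Token("vw beetle", {"pos": "NN"}),
    Token("is"),
    Token("groovy", {"pos": "JJ"}),
    Token("."),
]

# Components that only understand plain strings can still use the text view.
plain = [t.text for t in sentence]
```

Keeping the features optional means components that ignore them keep working on the plain strings.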

        if you think this is a reasonable extension of the tagger interface, I'd be happy to draft
        a more formal specification etc.

        matt

         
        • Gann Bierner

          Gann Bierner - 2000-06-07

          Okay, I understand now. 

          So, you are suggesting that there might be an earlier module that identifies certain tokens and knows their part of speech.  That seems reasonable, especially in restricted domains.

          Fortunately, we already have a nice structured data representation that will make this pretty easy to do.  All of our preprocessing components use XML as their data representation.  Some of them, like pos tagging, also have a lower level data representation that can be used if desired.

          Basically, one would have to change our tagger so that when a pos tag is specified for a token in the XML, it considers only that tag when searching for the highest-probability tag sequence.  This should have the added bonus of speeding up the search.  The pre-preprocessing you are suggesting would simply go earlier in the pipeline.
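
The idea of restricting the search to a pre-specified tag can be sketched with a toy HMM-style Viterbi search. This is a stand-in for illustration, not the project's actual model or search; the function `constrained_viterbi`, the probability floor, and all probabilities below are made up.

```python
import math
from typing import Dict, List, Optional, Sequence


def constrained_viterbi(
    words: Sequence[str],
    tags: Sequence[str],
    start: Dict[str, float],
    trans: Dict[str, Dict[str, float]],
    emit: Dict[str, Dict[str, float]],
    fixed: Optional[Sequence[Optional[str]]] = None,
) -> List[str]:
    """Most probable tag sequence; fixed[i] pins position i to one tag."""
    if not words:
        return []
    fixed = list(fixed) if fixed is not None else [None] * len(words)
    floor = 1e-6  # assumed probability for unseen events

    def lp(p: float) -> float:
        return math.log(p if p > 0 else floor)

    # A pinned position has a single candidate, which shrinks the search.
    cands = [[f] if f is not None else list(tags) for f in fixed]

    # best[i][t] = (log probability of best path ending in t, previous tag)
    best = [{
        t: (lp(start.get(t, 0)) + lp(emit.get(t, {}).get(words[0], 0)), None)
        for t in cands[0]
    }]
    for i in range(1, len(words)):
        column = {}
        for t in cands[i]:
            e = lp(emit.get(t, {}).get(words[i], 0))
            column[t] = max(
                (best[i - 1][p][0] + lp(trans.get(p, {}).get(t, 0)) + e, p)
                for p in cands[i - 1]
            )
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(best[-1], key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]


# Toy parameters, invented for this example.
TAGS = ["DT", "NN", "VBZ", "JJ"]
START = {"DT": 0.8, "NN": 0.2}
TRANS = {
    "DT": {"NN": 0.9, "JJ": 0.1},
    "NN": {"VBZ": 0.8, "NN": 0.2},
    "VBZ": {"JJ": 0.6, "DT": 0.4},
    "JJ": {"NN": 1.0},
}
EMIT = {"DT": {"the": 1.0}, "VBZ": {"is": 1.0}, "JJ": {"groovy": 1.0}}

words = ["the", "vw beetle", "is", "groovy"]
pinned = [None, "NN", None, "JJ"]  # tags supplied by an earlier module
result = constrained_viterbi(words, TAGS, START, TRANS, EMIT, pinned)
```

Each pinned position contributes one candidate instead of the whole tag set, which is where the speed-up would come from in a search of this shape.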

          Are you at all interested in implementing this?  It would be a good chance to get into the system... although I'm afraid that that area isn't particularly well documented.

          Gann

           
