
Training data for OpenNLP

2008-09-25
  • Vaijanath N. Rao

    Hi Morton and group members,

    I am using the OpenNLP name finder to extract named entities from a given web page. The process I am currently following is (only for person names, but it can be extended to other entities as well):

    a. Get the news feeds and extract the named entities from each feed using the existing model. If the model fails to identify a person's name, it is tagged using a simple regex.

    b. Using this data, I re-train the model (I have reduced the cutoff to 3).

    c. Then I repeat step (a) on the previous feeds to see whether the originally missed names get identified, and on some test feeds which are used to measure the improvement (a sketch of this loop follows below).
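
    A minimal sketch of steps (a)-(b), assuming the later OpenNLP 1.5-style API (NameFinderME / TokenNameFinderModel) rather than the API current at the time of this thread; the model file name, sample tokens, and PERSON_REGEX fallback are placeholders:

      import java.io.FileInputStream;
      import java.io.InputStream;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      import opennlp.tools.namefind.NameFinderME;
      import opennlp.tools.namefind.TokenNameFinderModel;
      import opennlp.tools.util.Span;

      public class PersonBootstrap {

          // Placeholder fallback tagger: two capitalized tokens in a row.
          private static final Pattern PERSON_REGEX =
              Pattern.compile("[A-Z][a-z]+ [A-Z][a-z]+");

          public static void main(String[] args) throws Exception {
              InputStream in = new FileInputStream("en-ner-person.bin");
              NameFinderME finder = new NameFinderME(new TokenNameFinderModel(in));
              in.close();

              // Step (a): tag one (pre-tokenized) feed sentence with the model.
              String[] tokens = {"Barack", "Obama", "visited", "Berlin", "."};
              Span[] names = finder.find(tokens);

              if (names.length == 0) {
                  // Model missed: fall back to the simple regex tagger.
                  Matcher m = PERSON_REGEX.matcher(String.join(" ", tokens));
                  while (m.find()) {
                      System.out.println("regex hit: " + m.group());
                  }
              } else {
                  for (Span s : names) {
                      System.out.println("model hit: " + s);
                  }
              }
              finder.clearAdaptiveData(); // reset document-level features between feeds

              // Step (b), re-training on the auto-tagged output, calls
              // NameFinderME.train(...); its exact signature varies by release.
          }
      }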

    But after 2-3 runs the model starts showing degraded quality; in other words, the original model looks cleaner than the one I am re-training.

    Can any of you guide me in getting my training data right?

    --Thanks and Regards
    Vaijanath

    • Thomas Morton

      Thomas Morton - 2008-09-26

      Hi,
         Are you starting with the model that is distributed with OpenNLP or only with your own model?  If you're starting with the OpenNLP one, then the issue is that you don't have the data that the original model was trained on.  When you re-train it, you lose the information contained in the original model.  The degradation probably starts after the first re-training.

         If you are just training on your own data, then this suggests that the new data you are adding might be noisy and is degrading your model's performance.  Also, if you are only looking at a couple of cases, the model may now miss some cases it didn't miss before while still performing OK on the whole.  You just need to make sure you have a reasonably sized testing corpus to evaluate performance improvements.
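
         A minimal sketch of such a held-out evaluation, assuming the later OpenNLP 1.5-style TokenNameFinderEvaluator and a hypothetical file test.sentences in OpenNLP's <START:person> ... <END> one-sentence-per-line format:

           import java.io.FileInputStream;
           import java.io.FileReader;

           import opennlp.tools.namefind.NameFinderME;
           import opennlp.tools.namefind.NameSample;
           import opennlp.tools.namefind.NameSampleDataStream;
           import opennlp.tools.namefind.TokenNameFinderEvaluator;
           import opennlp.tools.namefind.TokenNameFinderModel;
           import opennlp.tools.util.ObjectStream;
           import opennlp.tools.util.PlainTextByLineStream;

           public class HeldOutEval {
               public static void main(String[] args) throws Exception {
                   TokenNameFinderModel model = new TokenNameFinderModel(
                       new FileInputStream("en-ner-person.bin"));

                   // Held-out sentences, one per line, in name-sample format.
                   ObjectStream<NameSample> samples = new NameSampleDataStream(
                       new PlainTextByLineStream(new FileReader("test.sentences")));

                   TokenNameFinderEvaluator eval =
                       new TokenNameFinderEvaluator(new NameFinderME(model));
                   eval.evaluate(samples);

                   // Track precision/recall/F before and after each re-training run.
                   System.out.println(eval.getFMeasure());
               }
           }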

      Hope this helps...Tom

      • ashu

        ashu - 2008-09-30

        Hi,

        But what is a reasonable (minimum) size for the testing corpus?
        Where could I get the data on which the original model was trained?
        How could I train the model in an incremental manner? Is there any other approach, since we don't have the original data?

        Thanks
        Ashu

        • Thomas Morton

          Thomas Morton - 2008-10-03

          Hi,
             I would say a reasonable minimum is about 10k-15k sentences.  You can get about 13k of data via:

          http://www.cnts.ua.ac.be/conll2003/ner/ but you'll also need to order the text from NIST, which isn't too big a deal.

          There isn't currently a good workaround for this.  I've been looking at setting up a service to let people annotate their own data and train models based on that and other data that I can't distribute, and then let them download their model, but I am not there yet.
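
          Once the CoNLL data is converted to OpenNLP's <START:person> ... <END> format, training a fresh model is only a few lines.  A minimal sketch against the later 1.5-style train(...) overload; the file names and the iterations/cutoff values (100/5) are placeholders:

            import java.io.FileOutputStream;
            import java.io.FileReader;
            import java.util.Collections;

            import opennlp.tools.namefind.NameFinderME;
            import opennlp.tools.namefind.NameSample;
            import opennlp.tools.namefind.NameSampleDataStream;
            import opennlp.tools.namefind.TokenNameFinderModel;
            import opennlp.tools.util.ObjectStream;
            import opennlp.tools.util.PlainTextByLineStream;

            public class TrainPersonModel {
                public static void main(String[] args) throws Exception {
                    // CoNLL-2003 data converted to OpenNLP's
                    // one-sentence-per-line name-sample format.
                    ObjectStream<NameSample> samples = new NameSampleDataStream(
                        new PlainTextByLineStream(new FileReader("conll03-person.train")));

                    // 100 iterations, cutoff 5; no extra feature-generator resources.
                    TokenNameFinderModel model = NameFinderME.train(
                        "en", "person", samples,
                        Collections.<String, Object>emptyMap(), 100, 5);

                    FileOutputStream out = new FileOutputStream("en-ner-person.bin");
                    model.serialize(out);
                    out.close();
                }
            }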

          Hope this helps...Tom

    • ashu

      ashu - 2008-10-10

      Thanks Tom,

      But I am unable to open the specified link. Could you provide another link where I can get the CoNLL-2003 corpus?

      Thanks
      Ashu

      • Thomas Morton

        Thomas Morton - 2008-10-10

        Hi,
           This is the only source I know of for this data...Tom

