
Details on how models were trained

  • Jeff

    Jeff - 2008-11-08

    Hi,

    I am using the OpenNLP named entity recognizer for a project, and I'd like to know how the downloadable models were trained so I can include this information in the report.

    In particular, I am using the English person, location, and organization recognizers from the current 1.4 distribution.
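
    For reference, here is a rough sketch of how the name finders can be invoked from Java. It uses the later (1.5-style) OpenNLP API rather than the 1.4 one, and the model file name en-ner-person.bin is only illustrative:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Arrays;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.Span;

    public class FindPersons {
        public static void main(String[] args) throws Exception {
            // Load one of the downloadable pre-trained models (file name illustrative).
            try (InputStream in = new FileInputStream("en-ner-person.bin")) {
                TokenNameFinderModel model = new TokenNameFinderModel(in);
                NameFinderME finder = new NameFinderME(model);

                // The finder expects pre-tokenized sentences.
                String[] tokens = {"Thomas", "Morton", "works", "on", "OpenNLP", "."};
                for (Span name : finder.find(tokens)) {
                    System.out.println(name.getType() + ": " + String.join(" ",
                            Arrays.copyOfRange(tokens, name.getStart(), name.getEnd())));
                }

                // Reset document-level adaptive features between documents.
                finder.clearAdaptiveData();
            }
        }
    }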

    Are there any documents that describe these details? Training corpus, size, any specific parameters that may have been set?

    Thanks!
    Jeff

    • Thomas Morton

      Thomas Morton - 2008-11-11

      Hi,
         Sorry it's taken a bit to get back to you.  This isn't documented at present. I'm working on a white paper for researchers who use OpenNLP, with this sort of info and something to cite, but it's not ready yet.

      So each of the models is trained on about 2.4 million words of data from the Associated Press, Foreign Broadcast Information Service, Financial Times, LA Times, New York Times, San Jose Mercury News, and the Wall Street Journal.

      A small amount of data from all these sources (probably about 30k words) has been entirely hand-annotated. A larger amount of data from the AP, NYT, and WSJ has been automatically annotated, had some systematic errors involving quotes removed with a script, and had some portion of it hand-corrected as well.
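
      For anyone who wants to reproduce the general setup, here is a rough sketch of training a person-name finder with the later OpenNLP Java API (1.5+). The training file person.train, the format example, and the default parameters are illustrative, not the exact data or settings used for the distributed models:

      import java.io.File;
      import java.io.FileOutputStream;
      import java.io.OutputStream;
      import java.nio.charset.StandardCharsets;

      import opennlp.tools.namefind.NameFinderME;
      import opennlp.tools.namefind.NameSample;
      import opennlp.tools.namefind.NameSampleDataStream;
      import opennlp.tools.namefind.TokenNameFinderFactory;
      import opennlp.tools.namefind.TokenNameFinderModel;
      import opennlp.tools.util.MarkableFileInputStreamFactory;
      import opennlp.tools.util.ObjectStream;
      import opennlp.tools.util.PlainTextByLineStream;
      import opennlp.tools.util.TrainingParameters;

      public class TrainPersonFinder {
          public static void main(String[] args) throws Exception {
              // Annotated data, one sentence per line, e.g.:
              //   <START:person> Thomas Morton <END> maintains OpenNLP .
              ObjectStream<String> lines = new PlainTextByLineStream(
                      new MarkableFileInputStreamFactory(new File("person.train")),
                      StandardCharsets.UTF_8);
              ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

              // Default maxent training settings (100 iterations, feature cutoff 5).
              TrainingParameters params = TrainingParameters.defaultParams();

              TokenNameFinderModel model = NameFinderME.train(
                      "en", "person", samples, params, new TokenNameFinderFactory());

              try (OutputStream out = new FileOutputStream("en-ner-person.bin")) {
                  model.serialize(out);
              }
          }
      }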

      Hope this helps...Tom

    • Jeff

      Jeff - 2008-11-12

      Thanks Tom, those details are great.

      --
      Jeff

