Model Training Data

Forum: Open Discussion

Creator: Matt

Created: 2006-11-17

Updated: 2013-04-16

Matt - 2006-11-17

What parts of the Penn Treebank corpus were the models trained on? Usually, section 23 is held out of training and used for testing. Was that the case here?

Much Thanks,
Matt

- Thomas Morton - 2006-11-17
  
  Hi,
  The goal of the models provided is to work as well as they can out of the box on most text. As such, they were trained on more material, and from a greater variety of sources, than just the Penn Treebank, so they deal well with non-WSJ text and punctuation. Specifically, the 1.3 version of the parser is trained on sections 2-22 and a portion of Brown. The POS tagger has been trained on that and a sample of various other files. The POS tag dictionary has also been hand-edited at various times. If you want to do comparisons for research purposes, you'll need to retrain the models on just WSJ 02-22 and construct a new tag dictionary from that data. Hope this helps...Tom
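
For anyone attempting the research setup described above, here is a rough Python sketch of the two preparation steps: restricting the data to WSJ sections 02-22 and building a fresh tag dictionary from it. It is not the tooling used for the released models. It assumes Treebank files follow the usual `wsj_SSNN` naming (where `SS` is the section number) and that the tagged data has already been converted to one sentence per line with `word_tag` tokens. The helper names `in_training_sections` and `build_tag_dictionary` are made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: files are named like wsj_0201.mrg, where the first two
# digits are the section number.
SECTION_RE = re.compile(r"wsj_(\d{2})\d{2}")


def in_training_sections(filename, first=2, last=22):
    """Return True if the file belongs to sections first..last (inclusive)."""
    match = SECTION_RE.search(filename)
    return bool(match) and first <= int(match.group(1)) <= last


def build_tag_dictionary(sentences):
    """Map each word to the sorted list of tags it was seen with.

    Assumption: each sentence is a string of whitespace-separated
    word_tag tokens, e.g. "The_DT dog_NN barks_VBZ".
    """
    tag_dict = defaultdict(set)
    for line in sentences:
        for token in line.split():
            # Split on the last underscore so words containing "_" survive.
            word, sep, tag = token.rpartition("_")
            if sep and word and tag:
                tag_dict[word].add(tag)
    return {word: sorted(tags) for word, tags in tag_dict.items()}
```

A dictionary built this way only contains tags observed in sections 02-22, so it will differ from the hand-edited dictionary shipped with the models.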
  