
Model Training Data

  • Matt

    Matt - 2006-11-17

    Which parts of the Penn Treebank corpus were the models trained on?  Usually section 23 is held out of training and used for testing. Was that the case here?

    Many thanks,

    • Thomas Morton

      Thomas Morton - 2006-11-17

            The goal of the models provided is to work as well as they can out of the box on most text.  As such, they were trained on more material, and from a greater variety of sources, than just the Penn Treebank, so they deal well with non-WSJ text and punctuation.  Specifically, the 1.3 version of the parser was trained on sections 2-22 and a portion of Brown.  The POS tagger was trained on that plus a sample of various other files.  The POS tag dictionary has also been hand-edited at various times.  If you want to do comparisons as a basis for research, you'll need to re-train the models on just WSJ 02-22 and construct a new tag dictionary from that data.  Hope this helps...Tom
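
For anyone following the re-training advice above, the tag-dictionary step might be sketched roughly as follows. This is only an illustration, not the project's own tooling: it assumes the training data has one sentence per line with `word_tag` tokens separated by spaces, and the function name `build_tag_dictionary` and the `min_count` cutoff are hypothetical.

```python
from collections import defaultdict


def build_tag_dictionary(sentences, min_count=1):
    """Map each word to the sorted list of tags it was seen with.

    Assumes each sentence is a string of space-separated word_tag tokens.
    min_count is a hypothetical cutoff: a (word, tag) pair is kept only
    if it occurs at least that many times.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        for token in sentence.split():
            # Split on the last underscore so words containing "_" survive.
            word, sep, tag = token.rpartition("_")
            if not sep or not word or not tag:
                continue  # skip tokens that are not in word_tag form
            counts[word][tag] += 1

    tag_dict = {}
    for word, tags in counts.items():
        kept = sorted(tag for tag, n in tags.items() if n >= min_count)
        if kept:
            tag_dict[word] = kept
    return tag_dict


if __name__ == "__main__":
    sample = [
        "The_DT can_NN is_VBZ empty_JJ ._.",
        "They_PRP can_MD go_VB ._.",
    ]
    tag_dict = build_tag_dictionary(sample)
    print(tag_dict["can"])  # prints ['MD', 'NN']
```

Feeding it only the sentences from WSJ sections 02-22 would give a dictionary that reflects just that data, with none of the hand edits mentioned above.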

