Jason Baldridge - 2001-06-11

James Devlan Nicholson writes:
>
> Hi, we are looking at your part of speech tagger and trying to figure out
> how to train it.  Any help would be appreciated.

Well, you don't have to train it if you want an English POS tagger.
If you want to use it straight away, you just need to get an instance
of it in a Java program:

quipu.grok.preprocess.postag.EnglishPOSTaggerME tagger =
  new quipu.grok.preprocess.postag.EnglishPOSTaggerME();

and then call the methods that you'll find in the javadoc of the
POSTagger interface:

http://grok.sourceforge.net/api/quipu/opennlp/preprocess/POSTagger.html

For example,

String taggedSent = tagger.tag("John walks");

The value of taggedSent should then be something like
"John/NNP walks/VBZ".
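
If you need the words and tags separately, the tagged string can be split
back apart. Here is a minimal sketch, assuming the output is a sequence of
whitespace-separated WORD/TAG tokens as in the example above. The
TaggedOutput class and its parse method are hypothetical helpers written
for illustration; they are not part of Grok.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Grok): splits a tagged sentence such as
// "the/DT board/NN" into {word, tag} pairs.
class TaggedOutput {

    // Returns one {word, tag} array per whitespace-separated token.
    // Splits on the LAST slash so that a word containing '/' keeps its slash.
    static List<String[]> parse(String taggedSent) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = taggedSent.trim();
        if (trimmed.isEmpty()) {
            return pairs;
        }
        for (String token : trimmed.split("\\s+")) {
            int slash = token.lastIndexOf('/');
            // Reject tokens with no slash, an empty word, or an empty tag.
            if (slash <= 0 || slash == token.length() - 1) {
                throw new IllegalArgumentException("Not in WORD/TAG form: " + token);
            }
            pairs.add(new String[] { token.substring(0, slash), token.substring(slash + 1) });
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints each word and its tag, separated by a tab, one pair per line.
        for (String[] pair : parse("the/DT board/NN ,/,")) {
            System.out.println(pair[0] + "\t" + pair[1]);
        }
    }
}
```

Splitting on the last slash rather than the first is a design choice: it
assumes tags never contain a slash, while words occasionally might.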

Training is a bit more involved, and certainly underdocumented...
(sorry!)  To train the POSTagger, you need to have data in WORD/TAG
format, such as the following:

Pierre/NNP Vinken/NNP ,/, 61/CD years/NNS old/JJ ,/, will/MD join/VB
the/DT board/NN as/IN a/DT nonexecutive/JJ director/NN Nov./NNP 29/CD
./.
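
If your corpus stores words and tags separately, a few lines of Java can
produce this format. The following is only a sketch: TrainingLine and
format are hypothetical names, not part of Grok, and writing one sentence
per output line is an assumption of mine that this message does not
specify.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Grok): renders parallel word and tag
// lists as one line of WORD/TAG data.
class TrainingLine {

    private static final Pattern WHITESPACE = Pattern.compile("\\s");

    static String format(List<String> words, List<String> tags) {
        if (words.size() != tags.size()) {
            throw new IllegalArgumentException("Need exactly one tag per word");
        }
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < words.size(); i++) {
            String word = words.get(i);
            String tag = tags.get(i);
            // Whitespace would split a token in two; a '/' inside the tag
            // would make the WORD/TAG boundary ambiguous (assumed invalid).
            if (word.isEmpty() || tag.isEmpty()
                    || WHITESPACE.matcher(word).find()
                    || WHITESPACE.matcher(tag).find()
                    || tag.indexOf('/') >= 0) {
                throw new IllegalArgumentException("Bad word/tag pair at index " + i);
            }
            if (i > 0) {
                line.append(' ');
            }
            line.append(word).append('/').append(tag);
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(format(
                Arrays.asList("Pierre", "Vinken", ",", "61", "years", "old"),
                Arrays.asList("NNP", "NNP", ",", "CD", "NNS", "JJ")));
    }
}
```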

Then feed that into the training procedure as follows:

java quipu.grok.preprocess.postag.POSTaggerME -l -d PathToWhereYouWantToHaveTheModelSaved -s DesiredNameOfYourModel PathToYourDataFile

For example, I often use the following arguments:

java -mx1024m quipu.grok.preprocess.postag.POSTaggerME -l -d ./ -s EnglishPOS /projects/grok/data/tagger.train

I should mention that the English tagger in CVS is much improved
over the one in the last release, and I would recommend using the
CVS version instead.

Hope that helps!

Cheers,
jason