File Date Author Commit
 bin 2011-02-15 rentier [r1] initial import: rudify-0.1.14
 lib 2011-02-17 rentier [r2] copyright statements corrected
 share 2011-02-15 rentier [r1] initial import: rudify-0.1.14
 COPYING 2011-02-15 rentier [r1] initial import: rudify-0.1.14
 Changelog 2011-02-15 rentier [r1] initial import: rudify-0.1.14
 README 2011-02-15 rentier [r1] initial import: rudify-0.1.14
 README.taggers 2011-02-15 rentier [r1] initial import: rudify-0.1.14

Read Me

Taggers provided with this release
----------------------------------

All taggers in share/taggers/ were created using bin/mktagger.py.
Taggers in this directory follow this naming scheme:

<ISO language code>-<training corpus>-<tagger type>-tagger.pickled
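
The naming scheme can be checked mechanically. The following is a minimal
sketch, assuming three-letter lowercase language codes (as used by the
taggers shipped here) and no hyphens inside the corpus or tagger-type
fields. The names TaggerName and parse_tagger_name are hypothetical and
not part of rudify.

```python
import re
from typing import NamedTuple


class TaggerName(NamedTuple):
    """Fields encoded in a tagger file name (hypothetical helper type)."""
    language: str
    corpus: str
    tagger_type: str


# Assumption: three-letter lowercase language code, hyphen-free fields.
_PATTERN = re.compile(
    r"^(?P<language>[a-z]{3})-(?P<corpus>[^-]+)-(?P<tagger_type>[^-]+)"
    r"-tagger\.pickled$"
)


def parse_tagger_name(filename):
    """Split a tagger file name into language, corpus and tagger type."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError("not a tagger file name: %r" % filename)
    return TaggerName(**match.groupdict())
```

For example, parse_tagger_name("deu-conll2006-3gram-tagger.pickled")
yields the fields "deu", "conll2006" and "3gram".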

Every tagger is accompanied by a logfile that documents the training.

Available taggers:

 * deu-conll2006-3gram-tagger.pickled
   (TIGER data sets for the CoNLL-X shared task of 2006, 39573 sentences)
 * eng-brown-3gram-tagger.pickled
   (the Brown corpus, 57340 sentences)
 * esp-conll2002-3gram-tagger.pickled
   (data set for the CoNLL 2002 shared task, 11755 sentences)
 * eus-conll2007-3gram-tagger.pickled
   (data set for the CoNLL 2007 shared task, 3175 sentences)
 * ita-evalita2009-3gram-tagger.pickled
   (TANL dependency data set for the EVALITA 2009 pilot task, 3247 sentences)
 * nld-conll2002-3gram-tagger.pickled
   (data set for the CoNLL 2002 shared task, 23896 sentences)
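
Loading one of these files might look like the sketch below. It assumes
the files were written with Python's standard pickle module and that an
NLTK version compatible with the one used for training is installed;
neither is confirmed here. load_tagger is a hypothetical helper name.

```python
import pickle


def load_tagger(path):
    """Unpickle a tagger file. Only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Possible usage (assumes the unpickled object is an NLTK tagger,
# which offers a tag() method taking a list of tokens):
#
#   tagger = load_tagger("share/taggers/eng-brown-3gram-tagger.pickled")
#   tagger.tag(["This", "is", "a", "test"])
```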


Building taggers using non-NLTK resources
-----------------------------------------

As of NLTK-0.9.9, the NLTK corpus collection provides no tagged
corpus resources for Italian or German. To train taggers for
these languages yourself for use with rudify, you need to obtain
additional resources and place them in your $NLTK_DATA/corpora
directory.
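
Placing a corpus into that directory can be sketched as follows. This
assumes NLTK_DATA is set and, if it lists several paths, that the first
one is the intended target. The function names and the corpus directory
name passed in are hypothetical; the layout rudify actually expects is
described in lib/Rudify/non_nltk/.

```python
import os
import shutil
from pathlib import Path


def corpora_dir(environ=os.environ):
    """Return the corpora directory below the first NLTK_DATA entry."""
    first = environ.get("NLTK_DATA", "").split(os.pathsep)[0]
    if not first:
        raise RuntimeError("NLTK_DATA is not set")
    return Path(first) / "corpora"


def install_corpus(source_dir, name, environ=os.environ):
    """Copy source_dir to <NLTK_DATA>/corpora/<name> and return the target.

    shutil.copytree raises an error if the target already exists,
    so an installed corpus is never overwritten silently.
    """
    target = corpora_dir(environ) / name
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source_dir, target)
    return target
```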

Corpora that are easy to integrate:

 * the Italian training/test corpora for the EVALITA 2009 Italian parsing task
   (http://poesix1.ilc.cnr.it/evalita2009/)

 * the German training/test corpora for the CoNLL-X shared task
   (http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/)

See lib/Rudify/non_nltk/ for further information on how to
access the data sets.
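
As a rough illustration of what reading such data involves, the sketch
below extracts (word, tag) pairs from text in the CoNLL-X column layout
(tab-separated fields ID, FORM, LEMMA, CPOSTAG, POSTAG, ..., with a blank
line between sentences). It assumes the files you obtained use that
layout and is not the reader found in lib/Rudify/non_nltk/;
read_conllx_tagged is a hypothetical name.

```python
def read_conllx_tagged(lines, tag_column=4):
    """Collect sentences as lists of (word, tag) pairs.

    tag_column is the zero-based field index of the tag to use:
    4 selects POSTAG, 3 selects the coarse-grained CPOSTAG.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        current.append((fields[1], fields[tag_column]))
    if current:
        sentences.append(current)
    return sentences
```

Lists of (word, tag) sentences in this shape are what NLTK's n-gram
taggers accept as training data.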