Rudify Code
Status: Alpha
Brought to you by: rentier
| File | Date | Author | Commit |
|---|---|---|---|
| bin | 2011-02-15 | rentier | [r1] initial import: rudify-0.1.14 |
| lib | 2011-02-17 | rentier | [r2] copyright statements corrected |
| share | 2011-02-15 | rentier | [r1] initial import: rudify-0.1.14 |
| COPYING | 2011-02-15 | rentier | [r1] initial import: rudify-0.1.14 |
| Changelog | 2011-02-15 | rentier | [r1] initial import: rudify-0.1.14 |
| README | 2011-02-15 | rentier | [r1] initial import: rudify-0.1.14 |
| README.taggers | 2011-02-15 | rentier | [r1] initial import: rudify-0.1.14 |
Taggers provided with this release
----------------------------------

All taggers in share/taggers/ were created using bin/mktagger.py.

Naming schema for all taggers within this directory:

    <ISO language code>-<training corpus>-<tagger type>-tagger.pickled

Every tagger is accompanied by a logfile that documents the training.

Available taggers:

* deu-conll2006-3gram-tagger.pickled (TIGER data sets for the CoNLL-X shared task of 2006, 39573 sentences)
* eng-brown-3gram-tagger.pickled (the Brown corpus, 57340 sentences)
* esp-conll2002-3gram-tagger.pickled (data set for the CoNLL 2002 shared task, 11755 sentences)
* eus-conll2007-3gram-tagger.pickled (data set for the CoNLL 2007 shared task, 3175 sentences)
* ita-evalita2009-3gram-tagger.pickled (TANL dependency data set for the EVALITA 2009 pilot task, 3247 sentences)
* nld-conll2002-3gram-tagger.pickled (data set for the CoNLL 2002 shared task, 23896 sentences)

Building taggers using non-NLTK resources
-----------------------------------------

For Italian and German, no tagged corpus resources are provided by the NLTK corpus collection as of NLTK-0.9.9. To train taggers for use with rudify yourself, you need to obtain additional resources and incorporate them into your $NLTK_DATA/corpora directory. Corpora that are easy to integrate are:

* the Italian training/test corpora for the EVALITA 2009 Italian parsing task (http://poesix1.ilc.cnr.it/evalita2009/)
* the German training/test corpora for the CoNLL-X shared task (http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/)

See lib/Rudify/non_nltk/ for further information on how to access the data sets.
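The naming schema above is mechanical, so a small helper can compose (or sanity-check) a tagger filename from its three components. This is an illustrative sketch, not part of rudify; the function name `tagger_filename` is hypothetical.

```python
def tagger_filename(lang_code, corpus, tagger_type):
    """Compose a filename following the share/taggers/ naming schema:
    <ISO language code>-<training corpus>-<tagger type>-tagger.pickled"""
    return f"{lang_code}-{corpus}-{tagger_type}-tagger.pickled"

print(tagger_filename("eng", "brown", "3gram"))
# → eng-brown-3gram-tagger.pickled
```

Applied to the other entries in the list, the same call reproduces e.g. `deu-conll2006-3gram-tagger.pickled` and `nld-conll2002-3gram-tagger.pickled`.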
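Since the taggers are distributed as pickle files, the expected consumption pattern is a plain pickle round-trip: the training script serializes the trained tagger object, and users deserialize it and call its tagging method. The sketch below demonstrates that round-trip with a minimal stand-in class (`ToyTagger` is hypothetical, used here only so the example is self-contained; the real files hold NLTK tagger objects and additionally require nltk and its corpus data to unpickle).

```python
import os
import pickle
import tempfile

class ToyTagger:
    """Hypothetical stand-in for an NLTK tagger: it tags words found in a
    tiny lexicon and falls back to 'NN' for everything else."""
    LEXICON = {"the": "DT", "quick": "JJ", "fox": "NN"}

    def tag(self, tokens):
        # NLTK taggers return a list of (token, tag) pairs; we mimic that.
        return [(t, self.LEXICON.get(t.lower(), "NN")) for t in tokens]

# A training script would pickle the trained tagger to share/taggers/
# in a step similar to this (path here is a temporary stand-in):
path = os.path.join(tempfile.mkdtemp(), "toy-3gram-tagger.pickled")
with open(path, "wb") as f:
    pickle.dump(ToyTagger(), f)

# A consumer loads the pickled tagger and tags a token list:
with open(path, "rb") as f:
    tagger = pickle.load(f)

print(tagger.tag(["The", "quick", "brown", "fox"]))
# → [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN')]
```

The same `pickle.load(...)` followed by `.tag(tokens)` is the usage pattern for the real files in share/taggers/, given an environment with nltk installed.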