NLTK-Lite 0.8 is substantially revised and expanded from version 0.7. The code now includes improved interfaces to chunkers, grammars, and frequency distributions; full integration with WordNet 3.0 and implementations of WordNet similarity measures; the Lancaster Stemmer; simpler conventions for importing modules; and simpler installation. A new corpus package provides a more convenient API and supports caching, slicing, and a corpus search path that permits corpora to be stored in multiple locations.

In the book, Part I (tokenization, tagging, chunking) and Part II (grammars and parsing) have been substantially revised to make them accessible to a broader audience.

NLTK-Lite 0.8 has several new corpora and interfaces, including the Switchboard Telephone Speech Corpus transcript sample (Talkbank Project), CMU Problem Reports Corpus sample, CONLL2002 POS+NER data, Patient Information Leaflet corpus sample, Indian POS-Tagged data (Bangla, Hindi, Marathi, Telugu), Shakespeare XML corpus sample, and the UDHR corpus with text samples in 300+ languages.

The nltk.contrib package has moved to a new top-level nltk_contrib package, which includes DRT and Glue Semantics (Dan Garrette), Punkt sentence segmenter (Willy), LPath interpreter (Haejoong Lee), classifiers (Sumukh Ghodke), Kimmo finite-state morphology system (Rob Speer), and Lambek calculus system (Edward Loper).