NLTK version 1.3 is now available on SourceForge:
NLTK, the Natural Language Toolkit, is a suite of Python libraries and programs for symbolic and statistical natural language processing. NLTK includes graphical demonstrations and sample data. It is accompanied by extensive documentation, including tutorials that explain the underlying concepts behind the language processing tasks supported by the toolkit.
NLTK is ideally suited to students who are learning NLP (natural language processing) or conducting research in NLP or closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems.
For version 1.3, we made some significant changes to NLTK's basic architecture. These changes make the basic processing tasks easier to use and make it easier to combine different processing tasks into a single system. Under the new architecture:
- Tokens are encoded as mutable mappings from properties to values.
- Tokens are used to encode all units of language (words, sentences, syntax trees, documents, etc.).
- Tokens can contain other tokens (e.g., a document token's SUBTOKENS property might contain a list of the document's words) or pointers to other tokens (e.g., a parse constituent's PARENT property might contain a pointer to the constituent's parent).
- Processing tasks (parsing, tokenizing, tagging, etc.) work by adding new information to existing tokens (e.g., a new TREE property or a new SUBTOKENS property).
- "Property indirection" can be used to control which properties a given processing task uses for input and output (e.g., whether a parser uses the words' TEXT or TAG as the LEAF property).
- Locations (such as character spans) can be added to each token, to provide unique identifiers.
- Specialized token subclasses provide extra methods. For example, TreeToken defines methods like height() and leaves().
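To make the token model above more concrete, here is a minimal sketch in plain Python. It does not use the NLTK 1.3 API: ordinary dicts stand in for tokens, and the helper names (tokenize, tag, leaves) and their parameters are hypothetical, chosen only for illustration. The property names TEXT, TAG, SUBTOKENS and LOC follow the description above, with LOC assumed to hold a character span.

```python
# Illustrative sketch only: plain dicts stand in for NLTK 1.3 tokens.
# The helpers tokenize(), tag() and leaves() are hypothetical, not NLTK's API.

def tokenize(token, in_prop="TEXT", out_prop="SUBTOKENS"):
    """Add a list of word tokens to `token` under `out_prop`."""
    text = token[in_prop]
    words = []
    offset = 0
    for word in text.split():
        start = text.index(word, offset)
        end = start + len(word)
        # Each word is itself a token; LOC is assumed to be a character span.
        words.append({"TEXT": word, "LOC": (start, end)})
        offset = end
    token[out_prop] = words


def tag(token, tagger, in_prop="TEXT", out_prop="TAG", sub_prop="SUBTOKENS"):
    """Add a tag to each subtoken; the property names are configurable."""
    for sub in token[sub_prop]:
        sub[out_prop] = tagger(sub[in_prop])


def leaves(token, leaf_prop="TEXT", sub_prop="SUBTOKENS"):
    """Read the leaves from whichever property the caller selects."""
    return [sub[leaf_prop] for sub in token[sub_prop]]


doc = {"TEXT": "the dog barks"}
tokenize(doc)                                      # adds SUBTOKENS to doc
tag(doc, lambda w: "DT" if w == "the" else "NN")   # adds TAG to each word
print(leaves(doc))                   # leaves taken from TEXT
print(leaves(doc, leaf_prop="TAG"))  # same task, different input property
```

Each step adds a new property to an existing token rather than returning a new object, and the keyword arguments play the role of property indirection: the same function can be pointed at TEXT or TAG without changing its code.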
NLTK 1.3 also includes a number of additions and improvements:
- Witten-Bell and Good-Turing smoothing for probability distributions
- A regular expression based tagger
- A regular expression based stemmer
- An implementation of the Earley parser
- Feature structures, including unification with variables and reentrance.
- Support for parsing CFGs and CFG productions
- Support for trees that automatically maintain parent pointers.
- Redesign of the chart parser system to improve flexibility and efficiency. (Chart parsers now run 10-20x faster!)
- Improved the chart parser demo: it runs 5-10x faster, has an improved matrix window, and adds an optional "results" window.
- Added a corpus viewer, which can be used to browse corpora and select items within each corpus.
- Added support for drawing feature structures
- Improved font support in the graphical demos.
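As a rough idea of what a regular expression based tagger (listed above) does, the following sketch uses only Python's standard re module. It is not the NLTK 1.3 implementation; the function name regexp_tag, the rule list, and the tag names are assumptions made for this example.

```python
import re

# Hypothetical rules: (pattern, tag) pairs, tried in order; first match wins.
RULES = [
    (re.compile(r"^\d+(\.\d+)?$"), "CD"),   # numbers
    (re.compile(r".*ing$"), "VBG"),         # gerunds
    (re.compile(r".*ed$"), "VBD"),          # past tense verbs
    (re.compile(r".*s$"), "NNS"),           # plural nouns
]
DEFAULT_TAG = "NN"  # fallback when no rule matches


def regexp_tag(words, rules=RULES, default=DEFAULT_TAG):
    """Return (word, tag) pairs, tagging each word by its first matching rule."""
    tagged = []
    for word in words:
        for pattern, tag in rules:
            if pattern.match(word):
                tagged.append((word, tag))
                break
        else:
            # No pattern matched this word.
            tagged.append((word, default))
    return tagged


print(regexp_tag(["3", "dogs", "barked", "running", "cat"]))
```

Rule order matters in a tagger of this kind: more specific patterns should come before general ones, since the first match determines the tag.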
Because of the significant architectural changes, NLTK 1.3 is considered a "testing" release. Note also that the text classification package has been temporarily removed, pending a redesign that will significantly increase its ease of use. If stability is a concern, or if you need the text classification package, you may wish to wait until NLTK 1.4 is released before upgrading. The expected release date for NLTK 1.4 is late April.