A handful of major new features have been implemented:
* Fast, robust stochastic gradient descent using Periodic Stepsize Adjustment (PSA)
* Disk-caching. Instantiated features for sequences can be cached on disk, allowing training over datasets that can't fit in main memory.
* Models produced by training now store all the options used for that training run. The call to the decoder no longer needs to keep track of which options were passed to the trainer (this information is stored in the 'model file').
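
To illustrate the last point, here is a minimal sketch (in Python, not Carafe's OCaml code) of bundling training options with the learned weights in one model file. The JSON layout, the function names and the option names are assumptions made for illustration only; Carafe's actual model file format is not described here.

```python
import json
import os
import tempfile

def save_model(path, weights, options):
    """Write the weights and the training options into a single model file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"options": options, "weights": weights}, f)

def load_model(path):
    """Return (weights, options) exactly as the trainer stored them."""
    with open(path, encoding="utf-8") as f:
        bundle = json.load(f)
    return bundle["weights"], bundle["options"]

# Hypothetical option names, for illustration only.
train_options = {"rank": True, "max_iters": 50}
model_path = os.path.join(tempfile.mkdtemp(), "example.model")
save_model(model_path, {"f1": 0.25, "f2": -1.5}, train_options)

# The decoder recovers the options from the model file alone,
# so the caller does not have to repeat them.
weights, options = load_model(model_path)
```

The point of the design is that the model file is the single source of truth: a decoder run can never disagree with the trainer about how features were generated.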
A general framework for using non-factored features has been added to Carafe. Unlike typical CRF features, these features don't explicitly predicate over the output variable assignments. This can be used for discriminative word-alignment or "sequence re-ranking" tasks.
An initial version of the standard "ranking formulation" of Maximum Entropy has been added to Carafe. This is useful for using MaxEnt as a re-ranker for parsing, semantic role labeling, question answering (ranking candidate answers), learning similarity metrics, etc.
It can be used by passing the "-rank" option to the training and testing programs, "mxtrain(.opt)" and "mxtest(.opt)" (those are described in the file 'maxeml/README' in the distribution).
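
For readers unfamiliar with the ranking formulation, here is a minimal sketch in Python (not Carafe's OCaml implementation). Each training instance is a set of candidates, exactly one of which is correct; the model assigns each candidate a probability by a softmax over the candidates' weighted feature sums. The data layout and function names are assumptions made for illustration.

```python
import math

def candidate_probs(weights, candidates):
    """Softmax over the candidates of one instance.

    Each candidate is a dict mapping feature name -> feature value.
    """
    scores = [sum(weights.get(f, 0.0) * v for f, v in c.items())
              for c in candidates]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def log_likelihood_and_gradient(weights, instances):
    """instances: list of (candidates, index_of_correct_candidate)."""
    ll = 0.0
    grad = {}
    for candidates, gold in instances:
        probs = candidate_probs(weights, candidates)
        ll += math.log(probs[gold])
        # Gradient = observed feature values - expected feature values.
        for f, v in candidates[gold].items():
            grad[f] = grad.get(f, 0.0) + v
        for p, c in zip(probs, candidates):
            for f, v in c.items():
                grad[f] = grad.get(f, 0.0) - p * v
    return ll, grad
```

At test time the re-ranker simply picks the candidate with the highest probability. With all weights at zero, every candidate of an instance is equally likely, which makes a handy sanity check.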
If you are interested in using Carafe and are having problems compiling or using the software, please let me know by email (wellner _at_ cs _dot_ brandeis _dot_ edu ) or through one of the forums on this page. I'll be happy to help. Releases are not tested on all recent versions of the OCaml compiler, and the build process may be sensitive to compiler version and platform variations (I fix these as I see them; remember, this is grad-student-ware of the single-programmer variety).
At the First Workshop on NLP Challenges in Clinical Data, a Carafe-built system achieved the best overall performance (out of 7 teams) as part of a challenge task in "De-identification". The task required identifying DATES, LOCATIONS, PATIENTs, DOCTORs and other information from medical records. The plan is to make pre-built binary versions and source code available for this specific task soon. More to follow.
Carafe now includes a long-awaited Pre-Processor which takes care of tokenization and sentence detection. This is an early release of the pre-processor and it currently targets Latin-1 character sets. A general Unicode tokenizer is planned for the future.
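
As a rough idea of what such a pre-processing step does, here is a toy sketch in Python. It is not Carafe's Pre-Processor: the regular expression, the function names and the naive sentence-boundary rule are all assumptions made for illustration.

```python
import re

# A token is a run of letters (ASCII plus the Latin-1 accented letters),
# a run of digits, or any other single non-space character.
TOKEN_RE = re.compile(
    r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF]+|\d+|\S"
)

def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_RE.findall(text)

def split_sentences(tokens):
    """Naively end a sentence after each '.', '!' or '?' token."""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences
```

A real sentence detector needs more care than this: for example, the rule above would wrongly end a sentence after an abbreviation such as "Dr.".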
Due to files missing from the previous distribution, that release (0.6.6) was completely broken. The new release (0.6.7) includes all the library files and works properly.