I have read this interesting blog post by Tom Morton on faster model training using averaged perceptron vs. the default MaxEnt implementation of OpenNLP:
If I am not mistaken, "MaxEnt" is the NLP community's name for what the statistical learning community calls "Logistic Regression". Furthermore, there exists a very scalable implementation of regularized Logistic Regression and Linear Support Vector Classifiers that has been ported to Java under a BSD license here:
Maybe liblinear would be a good alternative to the MaxEnt and averaged perceptron learners in OpenNLP? Is this part pluggable enough in the current source base? It does not seem to be the case in the latest stable release.
Yeah, averaged perceptrons rock - amazingly few lines of code, incredibly robust, and usually just a little shy of the performance you get with maxent models and SVMs. (And, yep, maxent is basically just multinomial logistic regression. See the nice discussion of this in the second edition of Jurafsky and Martin.)
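To make the "amazingly few lines of code" point concrete, here is a minimal sketch of an averaged perceptron for binary classification. The class and method names are invented for illustration and are not OpenNLP's actual API; it uses dense feature vectors and +1/-1 labels for simplicity, where a real NLP implementation would use sparse features.

```java
// Minimal averaged perceptron sketch -- illustrative, not OpenNLP's implementation.
public class AveragedPerceptron {
    private final double[] weights; // current weight vector
    private final double[] totals;  // running sum of weights, for averaging
    private int steps;              // number of examples seen

    public AveragedPerceptron(int numFeatures) {
        weights = new double[numFeatures];
        totals = new double[numFeatures];
    }

    // One or more passes over the data; labels are +1/-1.
    public void train(double[][] x, int[] y, int epochs) {
        for (int e = 0; e < epochs; e++) {
            for (int i = 0; i < x.length; i++) {
                if (y[i] * dot(weights, x[i]) <= 0) { // mistake-driven update
                    for (int j = 0; j < weights.length; j++) {
                        weights[j] += y[i] * x[i][j];
                    }
                }
                // accumulate the weights after every example for the average
                for (int j = 0; j < weights.length; j++) {
                    totals[j] += weights[j];
                }
                steps++;
            }
        }
    }

    // Predict with the averaged weights, which generalize better
    // than the final weight vector alone.
    public int predict(double[] x) {
        double score = 0;
        for (int j = 0; j < x.length; j++) {
            score += (totals[j] / steps) * x[j];
        }
        return score >= 0 ? 1 : -1;
    }

    private static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```

The averaging is the whole trick: instead of returning the last weight vector, you return the mean of every intermediate weight vector, which damps the oscillations the plain perceptron suffers on non-separable data.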
I don't know the liblinear code at all, but I don't think we'd gain much by adding a new dependency, especially as we are hoping to become an Apache project, and fewer dependencies means fewer licensing questions (and build issues). That said, I'd love to have an implementation of L-BFGS instead of GIS for training maxent models in OpenNLP. Perhaps one day I'll find time to do that, but I'd welcome volunteers who might be able to get to it first.
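For anyone tempted to volunteer: the main thing an L-BFGS (or any quasi-Newton) optimizer needs is the objective and its gradient. For the two-class case, the maxent/logistic-regression objective boils down to the sketch below; the class and method names are mine, not anything in the codebase, and a real trainer would add regularization and sparse features.

```java
// Objective and gradient for binary logistic regression (two-class maxent),
// i.e. the callbacks an L-BFGS routine would evaluate. Illustrative sketch only.
public class LogisticObjective {

    // Negative log-likelihood of weights w on data (x, y), with y in {0, 1}:
    // sum_i [ log(1 + e^{w.x_i}) - y_i * (w.x_i) ]
    public static double nll(double[] w, double[][] x, int[] y) {
        double loss = 0;
        for (int i = 0; i < x.length; i++) {
            double z = dot(w, x[i]);
            loss += Math.log(1 + Math.exp(z)) - y[i] * z;
        }
        return loss;
    }

    // Gradient: sum_i (sigmoid(w.x_i) - y_i) * x_i, the familiar
    // "model expectation minus empirical count" form of maxent training.
    public static double[] grad(double[] w, double[][] x, int[] y) {
        double[] g = new double[w.length];
        for (int i = 0; i < x.length; i++) {
            double p = 1.0 / (1.0 + Math.exp(-dot(w, x[i])));
            for (int j = 0; j < w.length; j++) {
                g[j] += (p - y[i]) * x[i][j];
            }
        }
        return g;
    }

    private static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }
}
```

With these two functions in place, plugging in any off-the-shelf L-BFGS implementation is mostly glue code, which is why swapping out GIS is an appealingly contained project.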
Our current plan is to have a new package, opennlp.ml (machine learning), as part of the same Apache project as the OpenNLP toolkit, since the current "maxent" name is too narrow for what the code actually does.
Right now we are not really open to supporting machine learning libraries/tools other than our own maxent/perceptron implementations. We may never support anything else out of the box, but we should consider opening up our APIs so that it is possible to plug in other ML tooling as well.
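Opening up the APIs could be as simple as a small trainer interface that the built-in learners and any third-party back end (a liblinear wrapper, say) would both implement. The sketch below is purely hypothetical; none of these names exist in any OpenNLP release, and the toy back end just predicts the majority class to show the shape of the contract.

```java
// Hypothetical plug-in point for alternative ML back ends.
// None of these names are part of OpenNLP; this only illustrates
// what "opening up the APIs" could look like.
public class TrainerPluginSketch {

    // A trained classifier, whatever back end produced it.
    public interface ClassifierModel {
        int classify(double[] features);
    }

    // The plug-in contract: the built-in maxent/perceptron trainers and
    // an external wrapper (e.g. around liblinear) would both implement this.
    public interface EventTrainer {
        ClassifierModel train(double[][] features, int[] outcomes);
    }

    // Toy back end: always predicts the most frequent training outcome.
    // A real back end would do actual learning here.
    public static class MajorityTrainer implements EventTrainer {
        @Override
        public ClassifierModel train(double[][] features, int[] outcomes) {
            java.util.Map<Integer, Integer> counts = new java.util.HashMap<>();
            for (int o : outcomes) {
                counts.merge(o, 1, Integer::sum);
            }
            final int majority = counts.entrySet().stream()
                    .max(java.util.Map.Entry.comparingByValue())
                    .get().getKey();
            return fs -> majority; // ignores features by design
        }
    }
}
```

Tools built on top (the tagger, chunker, name finder) would then depend only on EventTrainer/ClassifierModel, and users could swap back ends via configuration without OpenNLP shipping, or even knowing about, the external library.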