a while back we removed trove from the maxent package. The trove object to int
hash map was replaced by a java.util.HashMap<String, Integer>. This results on
my macbook in a performance regression, the parser is roughly by factor 5 slower
than the parser in 1.4 with trove.
The reason trove was removed, was that Apache projects should be able to use
OpenNLP. The LGPL license trove uses is incompatible to the ASL.
Any ideas on what we should do ?
Jörn
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I don't understand why LGPL is incompatible with the ASL. If a company can use an LGPL library in their proprietary software,
why can't an ASL licensed one?
The idea to make Gnu Trove actually came in large part out of the needs of OpenNLP maxent, so it is somewhat ironic if it can't
be used there… I could check with Eric Friedman to see what his interpretation is.
Jason
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I corresponded with Eric, and he has the same interpretation as me: trove can certainly be used with an ASL licensed project,
so there is no problem. In his words:
Someone who uses an LGPL library isn't required to LGPL their own code. Now they're not allowed to change the LGPL
library without respecting the LGPL, but changing and using are different things.
So, example:
1. you use ASL + trove. Someone wants to change your code under ASL. No problem: doesn't affect trove.
2. you use ASL + trove. Someone (incl you) wants to change trove. Those changes have to meet the LGPL req'ts.
3. someone wants to use openNLP + trove in a commercial app, closed source. They may do so, but if they change
trove itself, LGPL applies to those changes only. changes to openNLP are subject to ASL.
So no issue for all scenarios except the one in whichsomeone wants to change trove and bypass the LGPL.
So, I say bring back trove - it's a great package, especially when you need collections of primitive types.
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
The ASF has a pretty high standards about which licenses can be used by ASF projects and which not.
All ASF projects cannot use libraries which are licensed under LGPL, LGPL among other licenses
are categorized as X.
So in practice this means, that a project like Lucene or UIMA cannot depend on OpenNLP when it brings
in dependencies which are licensed under a category x license.
Jörn
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
I honestly don't understand what would be the problem with including LGPL libraries. Is there an actual explanation for this
somewhere? It doesn't infect larger systems with the LGPL license, it just says any changes to the library itself must be made
available.
I also don't understand the statement in the above link that says "The drawback to this approach is that every new system
requirement narrows the potential user base for the product." I think having a great library like trove can make OpenNLP more
useful, and thus broaden the potential user base. Its just a library the project can take advantage of and it seems silly not to
use it.
Am I interpreting the ASF policy incorrectly?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
The mapping trove offers is also in my opinion not the optimal way to do the job.
If you look at how our features are generated, the following pattern can be observed over and over again:
String feature = "word=" + contextWord;
This line is quite wasteful, because the chars must be copied into a StringBuilder and are then
turned into a String again. An approach which can do that with zero string copies might be much faster.
If we write our own hash map it could just take an array of Char Sequence objects, and treat them
as they where one "string" of chars.
For that we need a special hash and compare method which can handle such an array.
Such a map just works perfectly like the trove map, or the java HashMap, in order to map
perfectly it has to keep all the strings it contains in memory.
A second interchangeable map, could loosen the requirement on mapping perfectly,.
It could use a 64 bit hash function to map our strings
to longs, which takes into account that it cannot guarantee correctness, but if it makes a mistake every few thousand
documents it might be insignificant. The advantage of this approach is that we do not need to keep all the strings
in memory, which will save lots of ram and depending on the model size it might even fit into the CPU L2/L3 caches.
Jörn
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Have to search a little for what have been the reason to exclude LGPL stuff over at Apache UIMA (I am actually a committer)
and will then answer your post.
Jörn
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
In software I've written after my OpenNLP work, I've dealt with the problem you describe by having byte or int arrays that describe the features and values -- and I've used trove datastructures to store the mapping from such features to their values.
There are also lots of other tricks that can be done to deal with the fact that features in a maxent model are actually a combination of contextual predicate plus a class. For example, if you do it naively, you have a map from features to parameter values that has keys like "word=book,pos=noun" and "word=book,pos=verb". This leads to lots of hash lookups, which can be avoided by having just the key "word=book" return the index into an array that stores the parameter values, and then you just add an offset to that value to get the parameter for each class (which in my example is part-of-speech). This makes a big difference when there are 40 or more classes. I was pretty much forced to do this when I built a supertagger, which involved up to 1200 different classes. (You can see some of this in my TexNLP package, some of which I may move into OpenNLP if it seems appropriate: http://comp.ling.utexas.edu/software/texnlp)
However, the other thing you mention is randomized hashing, which is very cool and very useful. Bloom filters have been used to do exactly what you are talking about, and they are great. Here is software to do it:
In software I've written after my OpenNLP work, I've dealt with the problem you describe by having byte or int arrays that describe the features and values -- and I've used trove datastructures to store the mapping from such features to their values.
Thats much more efficient than using Strings to model features. A feature could also be encoded into a long. Doing anything
into this direction will lead to lots of changes in the feature generation code and should be evaluated before. It only makes
sense if it significantly speeds up OpenNLP or dramatically lowers the memory footprint.
There are also lots of other tricks that can be done to deal with the fact that features in a maxent model are actually a combination of contextual predicate plus a class.
For now I would like to get the release out quickly, and suggest that we just write our own hash table to map Objects to ints. What do you think ?
After 1.5 we can start to work on optimizing the feature generation and mapping code. If it is just left as it is
there might be a few simple options to increase the speed. The feature generation is cached for some taggers
e.g. name finde, pos tagger, chunker. The cache could be change to remember the already mapped features.
Jörn
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Okay. I still don't see what the problem is with using an LGPL library, since it doesn't put any restrictions on the code using it. But, okay, so be it.
I did a quick search to look for ASL licensed collections, and here is something that looks like it could be useful:
Then the company I work for cannot use it. So we want that fixed before we release.
In general, I like the release often, release early approach and see no harm in making several smaller releases followed by a major one every now and then.
Anyway, if it is important to have the primitive maps in there soon, cool - can you do it? Or find an ASL licensed substitute for Trove?
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
In general, I like the release often, release early approach and see no harm in making several smaller releases followed by a major one every now and then.
+1, but up to now it was just not possible because the time it takes to re-train and test for a release is just too long and stands in no relation. I hope we now made a big step to improve that with the integrated evaluation and new training API.
Anyway, if it is important to have the primitive maps in there soon, cool - can you do it?
Yes it is important because we can not release with a serious regression. Its better to not release at all,
instead to release something which we know is broken in my opinion. Sure I will take the
responsibility to fix that. I also feel responsible to fix anything else which keeps us away from the release.
Jörn
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Checked in a new hash table which can map entries of an array to their index. I still have to do a performance test with a really large dataset to be able to compare the performance to 1.4 with trove. But up to now it looks like they are similar fast.
Anyway a review of the code would be really welcome.
The new class can be found here: opennlp.model.IndexHashTable
Jörn
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Hello,
a while back we removed trove from the maxent package. The trove object to int
hash map was replaced by a java.util.HashMap<String, Integer>. This results on
my macbook in a performance regression, the parser is roughly by factor 5 slower
than the parser in 1.4 with trove.
The reason trove was removed, was that Apache projects should be able to use
OpenNLP. The LGPL license trove uses is incompatible to the ASL.
Any ideas on what we should do ?
Jörn
I don't understand why LGPL is incompatible with the ASL. If a company can use an LGPL library in their proprietary software,
why can't an ASL licensed one?
The idea to make Gnu Trove actually came in large part out of the needs of OpenNLP maxent, so it is somewhat ironic if it can't
be used there… I could check with Eric Friedman to see what his interpretation is.
Jason
I corresponded with Eric, and he has the same interpretation as me: trove can certainly be used with an ASL licensed project,
so there is no problem. In his words:
So, example:
1. you use ASL + trove. Someone wants to change your code under ASL. No problem: doesn't affect trove.
2. you use ASL + trove. Someone (incl you) wants to change trove. Those changes have to meet the LGPL req'ts.
3. someone wants to use openNLP + trove in a commercial app, closed source. They may do so, but if they change
trove itself, LGPL applies to those changes only. changes to openNLP are subject to ASL.
So no issue for all scenarios except the one in whichsomeone wants to change trove and bypass the LGPL.
So, I say bring back trove - it's a great package, especially when you need collections of primitive types.
The ASF has a pretty high standards about which licenses can be used by ASF projects and which not.
All ASF projects cannot use libraries which are licensed under LGPL, LGPL among other licenses
are categorized as X.
See this link for more details:
http://www.apache.org/legal/3party.html
So in practice this means, that a project like Lucene or UIMA cannot depend on OpenNLP when it brings
in dependencies which are licensed under a category x license.
Jörn
But Trove is used as a system requirement, so it seems to me that it is okay:
http://www.apache.org/legal/3party.html#options-systemrequirements
I honestly don't understand what would be the problem with including LGPL libraries. Is there an actual explanation for this
somewhere? It doesn't infect larger systems with the LGPL license, it just says any changes to the library itself must be made
available.
I also don't understand the statement in the above link that says "The drawback to this approach is that every new system
requirement narrows the potential user base for the product." I think having a great library like trove can make OpenNLP more
useful, and thus broaden the potential user base. Its just a library the project can take advantage of and it seems silly not to
use it.
Am I interpreting the ASF policy incorrectly?
The mapping trove offers is also in my opinion not the optimal way to do the job.
If you look at how our features are generated, the following pattern can be observed over and over again:
String feature = "word=" + contextWord;
This line is quite wasteful, because the chars must be copied into a StringBuilder and are then
turned into a String again. An approach which can do that with zero string copies might be much faster.
If we write our own hash map it could just take an array of Char Sequence objects, and treat them
as they where one "string" of chars.
For that we need a special hash and compare method which can handle such an array.
Such a map just works perfectly like the trove map, or the java HashMap, in order to map
perfectly it has to keep all the strings it contains in memory.
A second interchangeable map, could loosen the requirement on mapping perfectly,.
It could use a 64 bit hash function to map our strings
to longs, which takes into account that it cannot guarantee correctness, but if it makes a mistake every few thousand
documents it might be insignificant. The advantage of this approach is that we do not need to keep all the strings
in memory, which will save lots of ram and depending on the model size it might even fit into the CPU L2/L3 caches.
Jörn
Have to search a little for what have been the reason to exclude LGPL stuff over at Apache UIMA (I am actually a committer)
and will then answer your post.
Jörn
In software I've written after my OpenNLP work, I've dealt with the problem you describe by having byte or int arrays that describe the features and values -- and I've used trove datastructures to store the mapping from such features to their values.
There are also lots of other tricks that can be done to deal with the fact that features in a maxent model are actually a combination of contextual predicate plus a class. For example, if you do it naively, you have a map from features to parameter values that has keys like "word=book,pos=noun" and "word=book,pos=verb". This leads to lots of hash lookups, which can be avoided by having just the key "word=book" return the index into an array that stores the parameter values, and then you just add an offset to that value to get the parameter for each class (which in my example is part-of-speech). This makes a big difference when there are 40 or more classes. I was pretty much forced to do this when I built a supertagger, which involved up to 1200 different classes. (You can see some of this in my TexNLP package, some of which I may move into OpenNLP if it seems appropriate: http://comp.ling.utexas.edu/software/texnlp)
However, the other thing you mention is randomized hashing, which is very cool and very useful. Bloom filters have been used to do exactly what you are talking about, and they are great. Here is software to do it:
https://sourceforge.net/projects/randlm/
Which is based on these papers:
http://www.iccs.inf.ed.ac.uk/~osborne/papers/emnlp07.pdf
http://www.iccs.inf.ed.ac.uk/~osborne/papers/acl07.pdf
I'd love to have such an implementation available open source in Java.
Thanks for checking into the Apache LGPL thing!
Here is an interesting link to the Apache LGPL policy:
http://mail-archives.apache.org/mod_mbox/www-legal-discuss/200508.mbox/%3C96D75844-B15A-45E3-BF8D-5CF4517A09E2@apache.org%3E
It seems reason is that people can expect that what every they download from Apache is licensed under ASL.
In the first link (http://www.apache.org/legal/3party.html) they say thats a principle:
If OpenNLP has a LGPL dependency, than Apache projects cannot use it because they are forbidden to redistribute
this LGPL dependency.
Jörn
Thats much more efficient than using Strings to model features. A feature could also be encoded into a long. Doing anything
into this direction will lead to lots of changes in the feature generation code and should be evaluated before. It only makes
sense if it significantly speeds up OpenNLP or dramatically lowers the memory footprint.
For now I would like to get the release out quickly, and suggest that we just write our own hash table to map Objects to ints. What do you think ?
After 1.5 we can start to work on optimizing the feature generation and mapping code. If it is just left as it is
there might be a few simple options to increase the speed. The feature generation is cached for some taggers
e.g. name finde, pos tagger, chunker. The cache could be change to remember the already mapped features.
Jörn
Okay. I still don't see what the problem is with using an LGPL library, since it doesn't put any restrictions on the code using it. But, okay, so be it.
I did a quick search to look for ASL licensed collections, and here is something that looks like it could be useful:
http://commons.apache.org/primitives/
Except it doesn't look like it supports hashmaps.
(This is frustrating because I really like trove…)
Absolutely. I'd say to go ahead and release without even doing that
Then the company I work for cannot use it. So we want that fixed before we release.
Jörn
In general, I like the release often, release early approach and see no harm in making several smaller releases followed by a major one every now and then.
Anyway, if it is important to have the primitive maps in there soon, cool - can you do it? Or find an ASL licensed substitute for Trove?
+1, but up to now it was just not possible because the time it takes to re-train and test for a release is just too long and stands in no relation. I hope we now made a big step to improve that with the integrated evaluation and new training API.
Yes it is important because we can not release with a serious regression. Its better to not release at all,
instead to release something which we know is broken in my opinion. Sure I will take the
responsibility to fix that. I also feel responsible to fix anything else which keeps us away from the release.
Jörn
Checked in a new hash table which can map entries of an array to their index. I still have to do a performance test with a really large dataset to be able to compare the performance to 1.4 with trove. But up to now it looks like they are similar fast.
Anyway a review of the code would be really welcome.
The new class can be found here: opennlp.model.IndexHashTable
Jörn