Troveless maxent too slow

Developers
2010-06-03
2013-04-16
  • Joern Kottmann

    Joern Kottmann - 2010-06-03

    Hello,

a while back we removed trove from the maxent package. The trove object-to-int
hash map was replaced by a java.util.HashMap<String, Integer>. On my MacBook this
causes a performance regression: the parser is roughly a factor of 5 slower than
the parser in 1.4 with trove.
The reason trove was removed was that Apache projects should be able to use
OpenNLP. The LGPL license trove uses is incompatible with the ASL.

Any ideas on what we should do?

    Jörn

     
  • Jason Baldridge

    Jason Baldridge - 2010-06-03

I don't understand why the LGPL is incompatible with the ASL. If a company can use an LGPL library in their proprietary software,
why can't an ASL-licensed project use one?

The idea to make GNU Trove actually came in large part out of the needs of OpenNLP maxent, so it is somewhat ironic if it can't
be used there… I could check with Eric Friedman to see what his interpretation is.

    Jason

     
  • Jason Baldridge

    Jason Baldridge - 2010-06-04

    I corresponded with Eric, and he has the same interpretation as me: trove can certainly be used with an ASL licensed project,
    so there is no problem. In his words:

    Someone who uses an LGPL library isn't required to LGPL their own code.  Now they're not allowed to change the LGPL
    library without respecting the LGPL, but changing and using are different things.

    So, example:

    1. you use ASL + trove.  Someone wants to change your code under ASL.  No problem: doesn't affect trove.
    2. you use ASL + trove.  Someone (incl you) wants to change trove.  Those changes have to meet the LGPL req'ts.
    3. someone wants to use openNLP + trove in a commercial app, closed source.  They may do so, but if they change
    trove itself, LGPL applies to those changes only.  changes to openNLP are subject to ASL.

So no issue for all scenarios except the one in which someone wants to change trove and bypass the LGPL.

    So, I say bring back trove - it's a great package, especially when you need collections of primitive types.

     
  • Joern Kottmann

    Joern Kottmann - 2010-06-04

The ASF has pretty high standards about which licenses can be used by ASF projects and which cannot.
ASF projects cannot use libraries which are licensed under the LGPL; the LGPL, among other
licenses, is categorized as Category X.

    See this link for more details:
    http://www.apache.org/legal/3party.html

So in practice this means that a project like Lucene or UIMA cannot depend on OpenNLP if it brings
in dependencies which are licensed under a Category X license.

    Jörn

     
  • Jason Baldridge

    Jason Baldridge - 2010-06-04

    But Trove is used as a system requirement, so it seems to me that it is okay:

      http://www.apache.org/legal/3party.html#options-systemrequirements

    I honestly don't understand what would be the problem with including LGPL libraries. Is there an actual explanation for this
    somewhere? It doesn't infect larger systems with the LGPL license, it just says any changes to the library itself must be made
    available.

    I also don't understand the statement in the above link that says "The drawback to this approach is that every new system
    requirement narrows the potential user base for the product." I think having a great library like trove can make OpenNLP more
useful, and thus broaden the potential user base. It's just a library the project can take advantage of, and it seems silly not to
    use it.

    Am I interpreting the ASF policy incorrectly?

     
  • Joern Kottmann

    Joern Kottmann - 2010-06-04

The mapping trove offers is also, in my opinion, not the optimal way to do the job.

    If you look at how our features are generated, the following pattern can be observed over and over again:
    String feature = "word=" + contextWord;

    This line is quite wasteful, because the chars must be copied into a StringBuilder and are then
    turned into a String again. An approach which can do that with zero string copies might be much faster.

If we write our own hash map, it could just take an array of CharSequence objects and treat them
as if they were one "string" of chars.
For that we need a special hash and compare method which can handle such an array.
Such a map would still map perfectly, just like the trove map or the java HashMap; in order to map
perfectly it has to keep all the strings it contains in memory.
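
A rough sketch of what such a hash and compare could look like (hypothetical
code, just to illustrate the idea; it mirrors String.hashCode() so an array of
parts hashes exactly like their concatenation would):

    final class CharSequenceArrayKey {

        // Same polynomial hash as String.hashCode(), computed across all
        // parts, so hash({"word=", w}) == ("word=" + w).hashCode().
        static int hash(CharSequence[] parts) {
            int h = 0;
            for (CharSequence part : parts) {
                for (int i = 0; i < part.length(); i++) {
                    h = 31 * h + part.charAt(i);
                }
            }
            return h;
        }

        // True if the concatenation of the parts equals the stored key,
        // character for character, without ever building the concatenation.
        static boolean matches(CharSequence[] parts, String key) {
            int pos = 0;
            for (CharSequence part : parts) {
                for (int i = 0; i < part.length(); i++) {
                    if (pos == key.length() || part.charAt(i) != key.charAt(pos++)) {
                        return false;
                    }
                }
            }
            return pos == key.length();
        }
    }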

A second, interchangeable map could loosen the requirement of mapping perfectly.
It could use a 64-bit hash function to map our strings
to longs, accepting that it cannot guarantee correctness; but if it makes a mistake every few thousand
documents, that might be insignificant. The advantage of this approach is that we do not need to keep all the strings
in memory, which will save lots of RAM, and depending on the model size it might even fit into the CPU L2/L3 caches.
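
As an illustration, the lossy variant could be as simple as a char-wise 64-bit
FNV-1a hash (hypothetical sketch; the constants are the standard FNV ones,
everything else is made up for this example):

    // A feature string is reduced to a 64-bit hash, so the map only ever
    // stores longs, never the strings themselves. Two distinct features
    // can collide, which is the accepted trade-off described above.
    static long fnv1a64(CharSequence s) {
        long h = 0xcbf29ce484222325L;      // FNV-1a 64-bit offset basis
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;           // FNV-1a 64-bit prime
        }
        return h;
    }

Keyed on such a hash, even a plain HashMap<Long, Integer> already drops the
string storage; a primitive long-to-int table would drop the boxing as well.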

    Jörn

     
  • Joern Kottmann

    Joern Kottmann - 2010-06-04

I have to search a little for what the reason was to exclude LGPL stuff over at Apache UIMA (where I am actually a committer)
and will then answer your post.

    Jörn

     
  • Jason Baldridge

    Jason Baldridge - 2010-06-05

In software I've written after my OpenNLP work, I've dealt with the problem you describe by having byte or int arrays that describe the features and values -- and I've used trove data structures to store the mapping from such features to their values.

There are also lots of other tricks that can be done to deal with the fact that features in a maxent model are actually a combination of contextual predicate plus a class.

For example, if you do it naively, you have a map from features to parameter values that has keys like "word=book,pos=noun" and "word=book,pos=verb". This leads to lots of hash lookups, which can be avoided by having just the key "word=book" return the index into an array that stores the parameter values, and then you just add an offset to that value to get the parameter for each class (which in my example is part-of-speech). This makes a big difference when there are 40 or more classes.

I was pretty much forced to do this when I built a supertagger, which involved up to 1200 different classes. (You can see some of this in my TexNLP package, some of which I may move into OpenNLP if it seems appropriate: http://comp.ling.utexas.edu/software/texnlp)
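
A hypothetical sketch of the predicate-plus-offset trick above (not the TexNLP
code, just the shape of it): the parameters live in one predicate-major array,
so there is one hash lookup per predicate and plain arithmetic per class:

    import java.util.HashMap;
    import java.util.Map;

    final class PredicateParams {

        private final Map<String, Integer> predicateIndex =
            new HashMap<String, Integer>();
        private final double[] params;   // laid out predicate-major
        private final int numClasses;

        PredicateParams(String[] predicates, int numClasses) {
            this.numClasses = numClasses;
            this.params = new double[predicates.length * numClasses];
            for (int i = 0; i < predicates.length; i++) {
                predicateIndex.put(predicates[i], i);
            }
        }

        // One lookup for "word=book", then an offset for each class.
        double param(String predicate, int classId) {
            Integer base = predicateIndex.get(predicate);
            return base == null ? 0.0 : params[base * numClasses + classId];
        }
    }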

    However, the other thing you mention is randomized hashing, which is very cool and very useful. Bloom filters have been used to do exactly what you are talking about, and they are great.  Here is software to do it:

    https://sourceforge.net/projects/randlm/

    Which is based on these papers:

    http://www.iccs.inf.ed.ac.uk/~osborne/papers/emnlp07.pdf
    http://www.iccs.inf.ed.ac.uk/~osborne/papers/acl07.pdf

    I'd love to have such an implementation available open source in Java.
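
For flavor, the core of a membership-only Bloom filter fits in a few lines
(toy sketch only; randlm's counting structures are far more involved):

    import java.util.BitSet;

    final class ToyBloomFilter {

        private final BitSet bits;
        private final int size;
        private final int numHashes;

        ToyBloomFilter(int size, int numHashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.numHashes = numHashes;
        }

        // Simple seeded polynomial hash, good enough for a toy example.
        private int hash(String s, int seed) {
            int h = seed;
            for (int i = 0; i < s.length(); i++) {
                h = 31 * h + s.charAt(i);
            }
            return Math.abs(h % size);
        }

        void add(String s) {
            for (int k = 0; k < numHashes; k++) {
                bits.set(hash(s, k + 1));
            }
        }

        // May return a false positive, never a false negative.
        boolean mightContain(String s) {
            for (int k = 0; k < numHashes; k++) {
                if (!bits.get(hash(s, k + 1))) {
                    return false;
                }
            }
            return true;
        }
    }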

    Thanks for checking into the Apache LGPL thing!

     
  • Joern Kottmann

    Joern Kottmann - 2010-06-08

In software I've written after my OpenNLP work, I've dealt with the problem you describe by having byte or int arrays that describe the features and values -- and I've used trove data structures to store the mapping from such features to their values.

That's much more efficient than using Strings to model features. A feature could also be encoded into a long. Doing anything
in this direction will lead to lots of changes in the feature generation code and should be evaluated first. It only makes
sense if it significantly speeds up OpenNLP or dramatically lowers the memory footprint.
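
For example (hypothetical sketch; both ids are assumed to come from some
interning step that is not shown here):

    // Pack a feature into one long: the feature template ("word=") in the
    // high 32 bits, the value ("book") in the low 32 bits, so no String
    // is ever built during feature generation.
    static long encodeFeature(int templateId, int valueId) {
        return ((long) templateId << 32) | (valueId & 0xFFFFFFFFL);
    }

    static int templateIdOf(long feature) {
        return (int) (feature >>> 32);
    }

    static int valueIdOf(long feature) {
        return (int) feature;   // low 32 bits
    }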

    There are also lots of other tricks that can be done to deal with the fact that features in a maxent model are actually a combination of contextual predicate plus a class.

For now I would like to get the release out quickly, and suggest that we just write our own hash table to map Objects to ints. What do you think?

After 1.5 we can start to work on optimizing the feature generation and mapping code. Even if it is just left as it is,
there might be a few simple options to increase the speed. The feature generation is cached for some taggers,
e.g. the name finder, POS tagger, and chunker. The cache could be changed to remember the already mapped features.

    Jörn

     
  • Jason Baldridge

    Jason Baldridge - 2010-06-08

    Okay. I still don't see what the problem is with using an LGPL library, since it doesn't put any restrictions on the code using it. But, okay, so be it.

    I did a quick search to look for ASL licensed collections, and here is something that looks like it could be useful:

    http://commons.apache.org/primitives/

Except it doesn't look like it supports hash maps.

    (This is frustrating because I really like trove…)

     
  • Jason Baldridge

    Jason Baldridge - 2010-06-08

For now I would like to get the release out quickly, and suggest that we just write our own hash table to map Objects to ints. What do you think?

Absolutely. I'd say to go ahead and release without even doing that.

     
  • Joern Kottmann

    Joern Kottmann - 2010-06-11

Absolutely. I'd say to go ahead and release without even doing that.

    Then the company I work for cannot use it. So we want that fixed before we release.

    Jörn

     
  • Jason Baldridge

    Jason Baldridge - 2010-06-11

    Then the company I work for cannot use it. So we want that fixed before we release.

    In general, I like the release often, release early approach and see no harm in making several smaller releases followed by a major one every now and then.

Anyway, if it is important to have the primitive maps in there soon, cool - can you do it? Or find an ASL-licensed substitute for Trove?

     
  • Joern Kottmann

    Joern Kottmann - 2010-06-11

    In general, I like the release often, release early approach and see no harm in making several smaller releases followed by a major one every now and then.

+1, but up to now it was just not possible because the time it takes to re-train and test for a release is too long and out of proportion to the benefit. I hope we have now made a big step toward improving that with the integrated evaluation and the new training API.

    Anyway, if it is important to have the primitive maps in there soon, cool - can you do it?

Yes, it is important, because we cannot release with a serious regression. In my opinion it's better not to release at all
than to release something which we know is broken. Sure, I will take the
responsibility to fix that. I also feel responsible for fixing anything else which keeps us from the release.

    Jörn

     
  • Joern Kottmann

    Joern Kottmann - 2010-06-23

Checked in a new hash table which can map entries of an array to their index. I still have to do a performance test with a really large dataset to be able to compare the performance to 1.4 with trove. But so far it looks like they are similarly fast.

Anyway, a review of the code would be really welcome.
The new class can be found here: opennlp.model.IndexHashTable
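
A small usage sketch for reviewers; the exact signatures (array plus load
factor in the constructor, get returning the index) are assumptions based on
the description above, not quoted from the commit:

    String[] predicates = {"word=book", "word=the", "word=run"};

    // Assumed constructor: the array to index, plus a load factor.
    IndexHashTable<String> table =
        new IndexHashTable<String>(predicates, 0.7d);

    int idx = table.get("word=the");       // 1, the entry's index in the array
    int missing = table.get("word=none");  // assumed to signal "not found", e.g. -1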

    Jörn

     
