
performance tuning, with suggested fixes

2001-10-22 to 2001-10-28
  • Eric Friedman

    Eric Friedman - 2001-10-22

    I was experimenting over the weekend with different training data (tried using a bunch of Gutenberg etexts), and I discovered that sentence detector model training takes *forever* when confronted with large data sets.  It also hogs a great deal of memory: 15 MB of training data caused a 2-CPU Ultra60 with 1 GB of memory to thrash for hours.  I never did get the training process to complete -- it just hung at "indexing events."

    I happen to have a copy of OptimizeIt, so I profiled the code while using a more reasonably sized data set.  I then refactored the code based on what the profiler showed and got dramatically better numbers.
    Some of this is specific to sentence detection; other parts belong in the maxent.DataIndexer bucket, which is more generally applicable.

    Here's what I found:

    50% of the time spent training a sentence detection model is spent cloning gnu.regexp objects. 

    LinkedList is used a lot.  This is expensive because of the overhead of creating LinkedList nodes for each element in the list.

    String concatenation is used in some tight loops.

    Integer objects are used in lots of places.

    The constructor of DataIndexer needs help in a few places:

    For starters, it holds onto several large data structures that could be nulled out (and hence GC'ed), freeing up memory for the structures that get created subsequently.

    The `useablePreds' HashSet grows to include every predicate that passes the `cutoff'.  But this information is already available by using an Iterator on the keySet of the `count' HashMap and calling Iterator.remove() whenever a particular pred shouldn't remain in the Map (see the sketch after this list).  Same loop, but one copy of the data instead of two.

    The loop to compute predicates over/under the cutoff is always done, even when cutoff == 0.
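
    A minimal sketch of that Iterator.remove() idea -- the names here are illustrative, not the actual DataIndexer fields:

        // Prune low-frequency predicates in place instead of
        // building a second collection of the survivors.
        Map counts = new HashMap();    // predicate -> Integer count
        int cutoff = 5;
        for (Iterator it = counts.keySet().iterator(); it.hasNext(); ) {
            Object pred = it.next();
            int n = ((Integer) counts.get(pred)).intValue();
            if (n < cutoff) {
                it.remove();   // removes the entry from the Map itself
            }
        }
        // `counts' now holds only the useable predicates --
        // no second HashSet required.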

    Here are the changes I made:

    (controversial) - stop using gnu.regexp altogether.  For sentence detection, a simple switch statement does the same job for much, much less, at the cost of configurability.
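
    Something along these lines -- a hypothetical sketch, not the code as committed:

        // Collect end-of-sentence candidate positions without any
        // regexp machinery.  EOS characters are hard-wired here.
        List getEOSCandidates(String s) {
            List positions = new ArrayList();
            for (int i = 0; i < s.length(); i++) {
                switch (s.charAt(i)) {
                case '.':
                case '?':
                case '!':
                    positions.add(new Integer(i));
                    break;
                default:
                    break;   // not an EOS character; keep scanning
                }
            }
            return positions;
        }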

    stop using LinkedList - use ArrayList instead, and always try to presize the list when possible.

    In the case of SDEventStream, I was able to get rid of a `List' queue altogether.  I simply subclassed opennlp.maxent.Event and put a `next' node on the subclass so that I could traverse the list by pointers.  Since Event creation and list traversal are encapsulated in that class, this seemed like a reasonable concession.
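
    In code, the idea looks roughly like this (the Event constructor signature is assumed here, not checked against maxent):

        // Event subclass that carries its own `next' pointer, so the
        // queue needs no per-element wrapper nodes at all.
        class SDEvent extends opennlp.maxent.Event {
            SDEvent next;
            SDEvent(String outcome, String[] context) {
                super(outcome, context);
            }
        }

        // Traversal is then just pointer-chasing:
        //     for (SDEvent e = head; e != null; e = e.next) { ... }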

    use StringBuffers for string concatenation.  Recycle buffers (setLength(0)) to cut down on object creation.  In SDContextGenerator, I identified two places where String.trim() could be avoided by choosing the arguments to String.substring() more carefully.  In that same class, the regexp for "starts with capital letter" was replaced with a Character.isUpperCase() test on the first character.
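
    The recycling pattern, with illustrative names rather than the actual SDContextGenerator fields:

        private StringBuffer buf = new StringBuffer();

        String join(String a, String b) {
            buf.setLength(0);           // reuse one buffer across calls
            buf.append(a).append(b);    // instead of allocating per call
            return buf.toString();
        }

        // ...and the regexp-free capital-letter test:
        boolean startsWithCapital(String word) {
            return word.length() > 0
                && Character.isUpperCase(word.charAt(0));
        }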

    ArrayLists could also be recycled in some places.

    Wrote a simple IntegerPool class that provides read-only Integer objects in a fixed, non sparse range.  Used this in DataIndexer and in SDEventStream.
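
    Roughly, the shape of it (a sketch, not necessarily the committed code):

        // Read-only Integers for a fixed, non-sparse range, created
        // once up front and shared thereafter.
        class IntegerPool {
            private final Integer[] pool;

            IntegerPool(int size) {
                pool = new Integer[size];
                for (int i = 0; i < size; i++) {
                    pool[i] = new Integer(i);
                }
            }

            Integer get(int i) {
                if (i >= 0 && i < pool.length) {
                    return pool[i];
                }
                return new Integer(i);   // outside the pooled range
            }
        }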

    In DataIndexer, removed the HashSet and used the Iterator.remove() approach described above.  Eliminated the loop over the HashMap for cases where cutoff == 0.

    There are probably a few things I've forgotten, but it was late by the time I was done. :-)

    Most of these recommendations/changes seem pretty straightforward to me, with the exception of the gnu.regexp stuff.  Using a switch statement for EOS characters means that you cannot add new EOS characters with a simple method call. 

    That said, SDEventStream doesn't really allow this today anyway, because the two-arg constructor (which accepts a user-specified regexp) doesn't set the regexp until *after* the one-arg constructor has been called.  The one-arg constructor thus uses the default eosRE to process the first data in the stream.

    To be fair, that's obviously a bug, not a design flaw.  The fact remains, however, that SD event generation is crippled by the performance of gnu.regexp: everything else I've mentioned pales in comparison.  I'm not sure how others feel about the loss of flexibility (can't choose EOS characters at runtime) vs. the performance gain.  I suppose one could compromise and define an interface that would allow people to swap out a regexp-based approach (the default) with a lower-level, performance-oriented implementation....

    Jason - where are you on your big code re-org project?  Please let me know if you want any of these changes (at a minimum, the SDEventStream bug fix should probably go in).

     
    • Jason Baldridge

      Jason Baldridge - 2001-10-23

      It's really cool that you did that Eric.  I've always wanted to do profiling, but never had the time to get around to it.  I'll look more closely at your suggested fixes tomorrow, but taking a quick look, they look fine.  I don't mind trading some configurability for the speed gain.

       
    • Jason Baldridge

      Jason Baldridge - 2001-10-25

      >(controversial) - stop using gnu.regexp altogether.

      That's fine.  Perhaps we can look for another regular expression package that would be more efficient for the cases where we do want regexps for configurability. I know that Apache has a regexp package, but I don't know anything about it.

      >stop using LinkedList - use ArrayList instead, and always try to presize the list when possible.

      That's news to me that LinkedList isn't good in the context I used it.  Since we don't know how big many of the lists will be, I thought it would be better to use LinkedLists, since adding new elements would always be a constant time operation, and the lists are only ever iterated through, not indexed into -- again constant time.  Is this a problem with the Java implementation of LinkedList, that it has all this overhead?  It seems that if we could find a pared-down implementation of linked lists, it would be better than ArrayList.

      >In the case of SDEventStream, I was able to get rid of a `List' queue altogether.  <cut>

      Well done!

      >use StringBuffers for string concatenation.

      Indeed!  This is something I learned about after writing the code and never got around to changing in various places in Grok, Maxent, and OpenNLP... thanks!

      >ArrayLists could also be recyled in some places.

      Okay.

      >Wrote a simple IntegerPool class...

      Great -- this could be useful in other contexts as well.

      >In DataIndexer, removed the HashSet and used the Iterator.remove() approach described above.  Eliminated the loop over the HashMap for cases where cutoff == 0.

      Once again -- excellent!

      This is what open source is all about!

      If you don't have the time just now to sync yourself up with the new CVS of maxent and grok, you can send me your changes and I'll put everything together, make sure it works, check it in, and then make new releases, since this would be significant.

       
    • Eric Friedman

      Eric Friedman - 2001-10-27

      Hi Jason,

      Apache has a couple of regexp engines.  There's also one built into the 1.4 release of the JDK.  Given the plethora of options, my opinion is that we should define "Scanner" interfaces that can be implemented using regular expressions or other means (switch statements, hand-coded DFAs and NFAs, etc.).

      About LinkedList - the issue isn't the way you're using it.  It's that the out-of-the-box LinkedList implementation always creates a "holder" object to represent the ListNode for every object in the list.  With large datasets, this quickly becomes intolerable b/c of the creation/destruction time for all of those objects.  The odd thing is that they could have provided an implementation which managed a collection of "Linkable" objects (a trivial interface to implement) so that the wrapping (and hence object creation/collection) would have been unnecessary.  A less charitable person than I might observe that Sun is in the business of selling hardware and that this class of problem responds well to additional hardware purchases.... <smile>
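
      The kind of thing I mean -- a hypothetical Linkable interface, not anything the JDK actually provides:

          interface Linkable {
              Linkable getNext();
              void setNext(Linkable next);
          }

          // A list of Linkables needs no holder nodes at all:
          class IntrusiveList {
              private Linkable head, tail;

              void add(Linkable item) {    // O(1), zero allocation
                  item.setNext(null);
                  if (tail == null) {
                      head = item;
                  } else {
                      tail.setNext(item);
                  }
                  tail = item;
              }

              Linkable getHead() { return head; }
          }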

      So... I think the thing to do is to use ArrayList whenever possible.  In some cases, it makes sense to write a custom linked list implementation, but for those cases I prefer to use direct pointer access.  In either case, I think the interface types "List," "Map," and "Set" should *always* be used as the declared type of these collections, since that makes it much easier to swap out one implementation for another when profiling reveals a big difference in performance characteristics.  In other words, I try to always write:

      List list = new ArrayList();

      and not

      ArrayList list = new ArrayList();

      Eric

       
    • Eric Friedman

      Eric Friedman - 2001-10-28

      I checked these changes in last night.  Most of them are in grok's sentence detector, with the exception of the refactoring of maxent's DataIndexer.

      Unfortunately I screwed up the CVS comment when doing a commit, so the commit log shows an empty comment.  I had to make a minor change in a second commit, so I included the comments there.  I'll put them here too, just in case:

      refactored SDEventStream to use a reference-based linked list instead of
      a java.util.LinkedList for the event queue (reduces object creation)

      added package-private SDEvent class, which holds a reference to another
      SDEvent and so can be used in a linked list by SDEventStream.

      added EndOfSentenceScanner interface, which defines an API for classes which
      are able to find end of sentence tokens in Strings, StringBuffers, and char[]
      arrays.

      implemented DefaultEndOfSentenceScanner using a simple switch statement.

      implemented RegexEndOfSentenceScanner for users who need/want the ability to
      use GNURegex regular expressions to match end of sentence tokens.

      removed gnu.regexp code from the rest of the classes, making it possible
      to load/run without having GNURegex present.

      added "induced abbreviations" features to SDContextGenerator.
      Note: not currently enabled, pending Jason/Gann's review.

      added IntegerPool, which manages a pool of shareable, read-only Integer
      objects in a fixed range.  This is used in several places where Integer
      objects were previously created and then GCed.

      fixed bug in SDEventStream, which would start reading events in the
      default constructor, even though the constructor being used hadn't had
      an opportunity to set up the end-of-sentence criteria yet.

      removed obsolete build files.

      added .cvsignore to keep CVS quiet about the ant-generated "build" directory.

      removed string concatenation in SDContextGenerator.  Added recyclable
      ArrayLists and StringBuffers (note: makes this class not thread-safe).
      Changed implementation so that tests that drive feature selection are
      not performed repeatedly for related features.  Got rid of the regex that
      looked to see if the first letter of a word was in uppercase.  Note
      that the new implementation is Unicode compliant, which was not
      true of the regex approach.

      in various places, used Collections API features to speed things up.
      For example, java.util.List.toArray() will do a System.arraycopy, if
      given a big enough array to copy into.  This is, therefore, much faster
      than a for loop.
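
      For instance (illustrative, in the raw-List style of the day):

          List words = new ArrayList();
          // ... fill the list ...

          // Pre-sized destination array: toArray() can fill it with a
          // single System.arraycopy instead of an element-by-element loop.
          String[] arr = (String[]) words.toArray(new String[words.size()]);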

       

