
Memory Issues of GISTrainer

Wilson Na
2002-03-07
  • Wilson Na

    Wilson Na - 2002-03-07

    Hi, first off, I must commend Jason and Gann for such a useful package. :D

    I've been working with Grok the past few weeks, and recently, while playing around with the POS tagger, I realised that GISTrainer is a real memory hog. It wasn't evident during Sentence Detection, since there are only 2 possible outcomes and far fewer events. But POS tagging 1 million words of WSJ with some decent contextual templates required around 2 gigs of memory. Yup, 2 gigs! This wouldn't have been a problem, because the server I'm working on has 4 gigs, but I found out that there's a limitation in Java where we can't allocate more than 1.85 gigs to the program.

    I converted the objects used in GISTrainer to primitive types and managed to reduce the requirement to around 600 megs... and it was more than twice as fast as well. :)
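    The kind of change described above can be sketched in miniature. This is only a hypothetical illustration (the names are not the actual maxent classes): storing context ids in a primitive int[] instead of a boxed object collection removes the per-element object header and the boxing on every access.

```java
// Illustrative sketch of swapping boxed collections for primitive arrays.
// Class and method names are made up for the example, not the maxent API.
import java.util.ArrayList;
import java.util.List;

public class PrimitiveContexts {

    // Object-based: each Integer is a separate heap object with its own
    // header, reached through a reference stored in the list.
    static List<Integer> boxedContexts(int n) {
        List<Integer> contexts = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            contexts.add(i);
        }
        return contexts;
    }

    // Primitive-based: one contiguous int[] block, 4 bytes per entry,
    // no per-element object header and no boxing on access.
    static int[] primitiveContexts(int n) {
        int[] contexts = new int[n];
        for (int i = 0; i < n; i++) {
            contexts[i] = i;
        }
        return contexts;
    }
}
```

    Because each boxed element carries its own header and indirection, a large boxed collection can take several times the memory of the equivalent int[], which is consistent with the reduction reported above.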

    I also realised the initial phase of collecting Events for the POS tagger gobbles up an incredible amount of memory (around 600 megs for me). The whole collection is then passed to GISTrainer, which runs it through the DataIndexer. I was wondering if the DataIndexer and the collection of events could be merged so as to save some memory. In other words, is it possible to collect the events and index them at the same time, dumping old events after they are indexed? There could be some implications here that I might not be aware of, so apologies in advance :)

    Alternatively, I think a clearEvents() method could be implemented on EventStream to free the events for garbage collection after indexing. I did just that and it really helped :)
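    A minimal sketch of the clearEvents() idea, assuming a simple event-holding class; the names here are hypothetical, not the real EventStream API:

```java
// Hypothetical event holder demonstrating the clearEvents() suggestion.
import java.util.ArrayList;
import java.util.List;

public class SimpleEventStream {
    private List<String> events = new ArrayList<>();

    public void add(String event) {
        events.add(event);
    }

    public int size() {
        return events.size();
    }

    // Drop the event references once indexing is done so the garbage
    // collector can reclaim them.
    public void clearEvents() {
        events.clear();
        events = new ArrayList<>(); // release the old backing array too
        System.gc();                // a hint only; the JVM may ignore it
    }
}
```

    Calling clearEvents() once the DataIndexer has consumed the stream lets the collector reclaim the raw events while training proceeds on the indexed form.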

    Also, it seems that setting unused object references to null doesn't work very well for freeing memory in certain cases, unless we explicitly invoke garbage collection and finalization.

    Btw, since I'm not exactly proficient with Java, I don't think I should commit my "modifications" to the CVS, but I think the suggestions might help... hopefully :)

     
    • Wilson Na

      Wilson Na - 2002-03-07

      Sorry... I guess this is the wrong forum. It should belong to Maxent. :(

       
    • Jason Baldridge

      Jason Baldridge - 2002-03-07

      I'm glad to hear you were able to reduce the memory consumption (and speed things up!).  I've wanted to spend some time doing some cleanup, but I've been too busy working on more pressing matters.  If you send me your modified Java files, I'll test them out and commit them if all looks well.

      At the moment, we cannot dump events until they've all been read in --- the DataIndexer applies a frequency cutoff that requires having all of the events around.  Of course, it should be quite possible to implement a version of the DataIndexer that reads events in and starts discarding those that have already passed the cutoff; that could go a long way toward cutting down the up-front memory strain of the training procedure.  Any takers?!
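      One way such a streaming indexer could work is sketched below, under the assumption that the corpus can be read twice; all class and method names are hypothetical, not the actual maxent API. Pass 1 only counts predicate frequencies without retaining the events; pass 2 re-reads the events and stores compact int ids for predicates at or above the cutoff:

```java
// Hypothetical two-pass sketch of a cutoff-aware streaming indexer.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamingCutoffIndexer {
    private final int cutoff;
    private final Map<String, Integer> counts = new HashMap<>();
    private final Map<String, Integer> predIds = new HashMap<>();

    public StreamingCutoffIndexer(int cutoff) {
        this.cutoff = cutoff;
    }

    // Pass 1: count each predicate; the event itself is not retained.
    public void count(String[] eventContext) {
        for (String pred : eventContext) {
            counts.merge(pred, 1, Integer::sum);
        }
    }

    // Pass 2: translate an event into int ids, dropping rare predicates.
    public int[] index(String[] eventContext) {
        List<Integer> kept = new ArrayList<>();
        for (String pred : eventContext) {
            if (counts.getOrDefault(pred, 0) >= cutoff) {
                kept.add(predIds.computeIfAbsent(pred, p -> predIds.size()));
            }
        }
        int[] out = new int[kept.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = kept.get(i);
        }
        return out;
    }
}
```

      The trade-off is a second pass over the training data in exchange for never holding the full set of raw string events in memory at once.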

      I'm glad to hear that a few simple changes have made such a difference!  Thanks for getting your hands dirty when you came across a problem like that --- that's what open source is all about!

      BTW, let's indeed carry out any further discussions of this thread in the maxent forums.

       

