From: Julien N. <J.N...@dc...> - 2007-04-04 16:48:00
Hi,

The memory problem happens in the attemptAdvance method of jape.SinglePhaseTransducer, more precisely in the bindings section, where an awful lot of AnnotationSetImpl objects are created from existing ones and their content is cloned (it is not only memory greedy, it also takes a lot of CPU :-) ). I am not sure there is an obvious solution to this problem, but since I am not an expert in the way JAPE works I might be wrong.

A new version of JAPE will be available at some point and should hopefully tackle this issue. In the meantime, an option could be to embed JAPEC in a modified version of the SentenceSplitter. That should make the sentence splitting more efficient both in memory and speed.

Otherwise there is always the obvious trick of creating a PR which could rewrite the content of the document before the tokenisation and merge the space characters. Not very pretty, but certainly efficient.

Hope that helps

Julien

> Hi all,
>
> We've been having some problems with the SentenceSplitter component of
> ANNIE - when it's used to annotate certain documents it produces a
> java.lang.OutOfMemoryError. This is quite a serious problem for us as
> we're running some lengthy batch jobs that crash at intervals when the
> sentenceSplitter gets into trouble. We've narrowed down the problem to
> the find.jape grammar of the sentenceSplitter (using the SVN head
> revision of code).
>
> The documents that fail all seem to contain large sections of
> whitespace. Specifically the problem is consecutive lines containing
> only whitespace characters and nothing else. To recreate the problem
> simply:
>
> 1. Create a document that contains 200 lines, where each line consists
>    of 80 space characters (and nothing else);
> 2. Create a controller containing:
>    * Document Reset PR
>    * Default Tokenizer
>    * SentenceSplitter
> 3. Push the document through the pipeline;
>
> This will crash the IDE when it gets to the find.jape phase of the
> sentenceSplitter.
> I'll submit a test harness to the patches area on SourceForge that
> produces a nice clean java.lang.OutOfMemoryError to show that this is
> a memory consumption problem. (Obviously upping the memory allocated
> to the JVM will help for shorter runs of whitespace, but it's not
> really a solution to the problem.)
>
> Any help / suggestions for fixes are very welcome!
>
> Cheers,
>
> Jon
>
> Background info:
> -----------------------
>
> Platform: Windows XP
> Java Version: 1.5.0_09
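
The test document from the original report and the whitespace-merging workaround suggested in the reply can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the names WhitespaceRepro, buildWhitespaceDocument and mergeSpaces are hypothetical, no GATE API is used, and "merge the space characters" is assumed to mean collapsing runs of spaces and tabs while leaving line breaks alone. Whether this is enough to avoid the OutOfMemoryError has not been verified here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of GATE: plain string handling only.
final class WhitespaceRepro {

    // Runs of two or more spaces or tabs. Newlines are deliberately
    // excluded so that the line structure of the text is preserved.
    private static final Pattern SPACE_RUN = Pattern.compile("[ \\t]{2,}");

    private WhitespaceRepro() {
    }

    // Builds the text described in the report: `lines` lines, each made of
    // `width` space characters followed by a newline.
    public static String buildWhitespaceDocument(int lines, int width) {
        StringBuilder sb = new StringBuilder(lines * (width + 1));
        for (int i = 0; i < lines; i++) {
            for (int j = 0; j < width; j++) {
                sb.append(' ');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    // Assumed reading of "merge the space characters": replace every run of
    // two or more spaces/tabs with a single space.
    public static String mergeSpaces(String text) {
        return SPACE_RUN.matcher(text).replaceAll(" ");
    }

    public static void main(String[] args) {
        String original = buildWhitespaceDocument(200, 80);
        String merged = mergeSpaces(original);
        System.out.println("original length: " + original.length());
        System.out.println("merged length:   " + merged.length());
    }
}
```

In a real pipeline the rewrite would have to run before the tokeniser, and any annotation offsets produced afterwards would refer to the rewritten text rather than to the original document.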