From: Jon R. <Jon...@de...> - 2007-04-04 16:48:03
Hi,

For those who are interested, I've uploaded a patch to SourceForge to
illustrate the problem (entitled "Sentence Splitter Bug Demo:
OutOfMemoryError").

Thanks,

Jon
________________________________
From: gat...@li...
[mailto:gat...@li...] On Behalf Of Jon
Roberts
Sent: 04 April 2007 16:46
To: gat...@li...
Subject: [gate-users] SentenceSplitter Bug: java.lang.OutOfMemoryError
Hi all,
We've been having some problems with the SentenceSplitter component of
ANNIE: when it's used to annotate certain documents it produces a
java.lang.OutOfMemoryError. This is quite a serious problem for us as
we're running some lengthy batch jobs that crash at intervals when the
SentenceSplitter gets into trouble. We've narrowed down the problem to
the find.jape grammar of the SentenceSplitter (using the SVN head
revision of the code).
The documents that fail all seem to contain large sections of
whitespace. Specifically, the problem is consecutive lines containing
only whitespace characters and nothing else. To recreate the problem
simply:
1. Create a document that contains 200 lines, where each line consists
of 80 space characters (and nothing else);
2. Create a controller containing:
* Document Reset PR
* Default Tokenizer
* SentenceSplitter
3. Push the document through the pipeline.
This will crash the IDE when it gets to the find.jape phase of the
SentenceSplitter. I'll submit a test harness to the patches area on
SourceForge that produces a nice clean java.lang.OutOfMemoryError to
show that this is a memory consumption problem. (Obviously, upping the
memory allocated to the JVM will help for shorter runs of whitespace, but
it's not really a solution to the problem.)
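For anyone who wants to try step 1 without building the file by hand, here is a minimal sketch that generates the whitespace-only test document. It uses only the Java standard library (java.nio.file, so Java 7 or later) and does not touch the GATE API. The class name WhitespaceDocGenerator, the method buildWhitespaceDocument and the default file name whitespace-test.txt are made up for illustration and are not part of the uploaded patch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: builds the whitespace-only document described in step 1.
public class WhitespaceDocGenerator {

    /** Returns `lines` lines, each consisting of `width` spaces followed by '\n'. */
    public static String buildWhitespaceDocument(int lines, int width) {
        StringBuilder sb = new StringBuilder(lines * (width + 1));
        for (int i = 0; i < lines; i++) {
            for (int j = 0; j < width; j++) {
                sb.append(' ');
            }
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // 200 lines of 80 spaces, as in the reproduction steps.
        String text = buildWhitespaceDocument(200, 80);
        // Output path is an assumption; pass a different one as the first argument.
        Path out = Paths.get(args.length > 0 ? args[0] : "whitespace-test.txt");
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
        System.out.println("Wrote " + text.length() + " characters to " + out);
    }
}
```

The resulting file can then be loaded into the controller from step 2 in the usual way.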
Any help / suggestions for fixes are very welcome!
Cheers,
Jon
Background info:
-----------------------
Platform: Windows XP
Java Version: 1.5.0_09