From: Hamish C. <H.C...@dc...> - 2005-01-19 17:12:10
The multiple instances route is the one that other people have chosen in the past - e.g. Luca Toldo at Merck did this a couple of years back for crunching medline, and OntoText/DERI are doing this in the SWAN project. It works fine for document-level processing.

The tricky bit is when you want to do corpus-level processing; OntoText KIM now supports a distributed architecture that minimises the centralised processing element, so that the basic per-document annotation throughput scales linearly with each additional machine. Early days though...

If you just want to crunch a lot of documents individually, I would recommend using some existing load-balancing software, of which there seem to be a good number. Good luck, and keep us posted on how you get on.

H

sr...@ug... wrote:
> On Wed, Jan 19, 2005 at 03:32:59AM +0000, Valentin Tablan wrote:
>
>> If you want to optimise the way threads are used in a multi-CPU
>> environment, we found that one thing that helps is to set the -server
>> option on the JVM call.
>
> That does seem to help, though I haven't actually benchmarked it.
>
>> About processing large corpora, yes, we did at some point process the
>> entire BNC successfully. The best way to speed up large-corpus
>> processing is to use several instances of GATE (if you have the RAM).
>
> Now there's an interesting idea. I wonder if it would be possible
> to start two instances of a program that uses the GATE libraries,
> and then use RMI to divvy up the job (I'm designing a user-space
> tool, so I don't want the user to have to worry about the
> divvying). A bit of voodoo, to be sure, but it just might work . . .
>
>> Health warning: the amount of RAM required for each instance, as well as
>> the CPU usage, depends largely on the JVM implementation and underlying
>> platform. You'll need to experiment on your actual set-up to find the
>> optimal values.
>
> True. For the record, I ran the pre-packaged ANNIE pipeline with
> defaults on the complete works of Shakespeare. The machine -- a
> 2-way hyperthreaded Xeon server with 8 GB of RAM running Gentoo
> Linux (with a heavily optimised kernel) -- accomplished the task in
> 1 hour, 36 minutes.
>
> I'll be sure to post if I have any speed breakthroughs on this
> platform.
>
> Thanks again!
>
> Steve

-- 
Hamish   http://www.dcs.shef.ac.uk/~hamish/
[I get too much email, and I use junk filters.
If I don't reply, please resend, or phone!]
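For anyone following along: the "several instances" approach boils down to partitioning the corpus and handing each chunk to its own JVM. Below is a minimal sketch of the partitioning step in plain Java. The `CorpusSplitter` class and its names are purely illustrative (not GATE API); in practice each chunk's file list would be passed to a separate `java -server` process running your GATE pipeline:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of GATE: splits a list of document
// identifiers into nWorkers roughly equal chunks by round-robin,
// one chunk per GATE instance.
public class CorpusSplitter {

    public static <T> List<List<T>> split(List<T> docs, int nWorkers) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < nWorkers; i++) {
            chunks.add(new ArrayList<>());
        }
        // Round-robin assignment keeps chunk sizes within one of each other.
        for (int i = 0; i < docs.size(); i++) {
            chunks.get(i % nWorkers).add(docs.get(i));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("doc" + i + ".xml");
        }
        // Three workers -> chunks of size 4, 3, 3.
        List<List<String>> chunks = split(docs, 3);
        for (List<String> chunk : chunks) {
            System.out.println(chunk.size() + " docs: " + chunk);
            // Here you would launch one worker per chunk, e.g. via
            // ProcessBuilder("java", "-server", "-jar", "pipeline.jar", ...),
            // writing each chunk's file list to a temp file first.
        }
    }
}
```

Static partitioning like this avoids any shared state between instances, which is why it scales so cleanly for document-level processing; corpus-level resources (gazetteers built from the whole corpus, cross-document coreference, etc.) are exactly where it breaks down, as Hamish notes above.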