From: Hamish C. <H.C...@dc...> - 2005-01-19 17:12:10
The multiple instances route is the one that other people have chosen in the past - e.g. Luca Toldo at Merck did this a couple of years back for crunching medline, and OntoText/DERI are doing this in the SWAN project. It works fine for document-level processing.

The tricky bit is when you want to do corpus-level processing; OntoText KIM now supports a distributed architecture that minimises the centralised processing element, so that the basic per-document annotation throughput scales linearly with each additional machine. Early days though...

If you just want to crunch a lot of documents individually, I would recommend using some existing load-balancing software, of which there seem to be a good number. Good luck, and keep us posted on how you get on.

H

sr...@ug... wrote:
> On Wed, Jan 19, 2005 at 03:32:59AM +0000, Valentin Tablan wrote:
>
>> If you want to optimise the way threads are used in a multi-CPU
>> environment, we found that one thing that helps is to set the -server
>> option on the JVM call.
>
> That does seem to help, though I haven't actually benchmarked it.
>
>> About processing large corpora, yes, we did at some point process the
>> entire BNC successfully. The best way to speed up large-corpus
>> processing is to use several instances of GATE (if you have the RAM).
>
> Now there's an interesting idea. I wonder if it would be possible
> to start two instances of a program that uses the GATE libraries,
> and then use RMI to divvy up the job (I'm designing a user-space
> tool, so I don't want the user to have to worry about the
> divvying). A bit of voodoo, to be sure, but it just might work . . .
>
>> Health warning: the amount of RAM required for each instance, as well as
>> the CPU usage, depends largely on the JVM implementation and underlying
>> platform. You'll need to experiment on your actual set-up to find the
>> optimal values.
>
> True. For the record, I ran the pre-packaged ANNIE pipeline with
> defaults on the complete works of Shakespeare. The machine -- a
> 2-way hyperthreaded Xeon server with 8 GB of RAM running Gentoo
> Linux (with a heavily optimised kernel) -- accomplished the task in
> 1 hour, 36 minutes.
>
> I'll be sure to post if I have any speed breakthroughs on this
> platform.
>
> Thanks again!
>
> Steve

-- 
Hamish   http://www.dcs.shef.ac.uk/~hamish/
[I get too much email, and I use junk filters.
If I don't reply, please resend, or phone!]
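For anyone following along: the "several instances" approach boils down to partitioning the corpus and handing each chunk to its own JVM. Below is a minimal sketch of the partitioning step in plain Java. The `CorpusSplitter` class and its names are purely illustrative (not GATE API); in practice each chunk's file list would be passed to a separate `java -server` process running your GATE pipeline:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of GATE: splits a list of document
// identifiers into nWorkers roughly equal chunks by round-robin,
// one chunk per GATE instance.
public class CorpusSplitter {

    public static <T> List<List<T>> split(List<T> docs, int nWorkers) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < nWorkers; i++) {
            chunks.add(new ArrayList<>());
        }
        // Round-robin assignment keeps chunk sizes within one of each other.
        for (int i = 0; i < docs.size(); i++) {
            chunks.get(i % nWorkers).add(docs.get(i));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            docs.add("doc" + i + ".xml");
        }
        // Three workers -> chunks of size 4, 3, 3.
        List<List<String>> chunks = split(docs, 3);
        for (List<String> chunk : chunks) {
            System.out.println(chunk.size() + " docs: " + chunk);
            // Here you would launch one worker per chunk, e.g. via
            // ProcessBuilder("java", "-server", "-jar", "pipeline.jar", ...),
            // writing each chunk's file list to a temp file first.
        }
    }
}
```

Static partitioning like this avoids any shared state between instances, which is why it scales so cleanly for document-level processing; corpus-level resources (gazetteers built from the whole corpus, cross-document coreference, etc.) are exactly where it breaks down, as Hamish notes above.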