From: John H. L. <jl...@ar...> - 2007-09-27 22:38:42

Hi Ignacio. It would be helpful if you posted the following information:

- Are you using standalone or mapreduce?
- If mapreduce, what are your mapred.map.tasks and mapred.reduce.tasks properties set to?
- If mapreduce, how many slaves do you have and how much memory do they have?
- How many ARCs are you trying to index?
- Did the map reach 100% completion before the failure occurred?

Some things you may want to try (a config sketch follows the quoted message below):

- Set both -Xms and -Xmx to the maximum available on your systems.
- Increase one or both of mapred.map.tasks and mapred.reduce.tasks, depending on where the failure occurred.
- Break your job up into smaller chunks of, say, 1000 or 5000 ARCs.

-J

On Sep 27, 2007, at 10:47 AM, Ignacio Garcia wrote:

> Hello,
>
> I've been doing some testing with NutchWAX and I have never had any
> major problems. However, right now I am trying to index a collection
> that is over 100 GB, and for some reason the indexing is crashing
> while it tries to populate 'crawldb'.
>
> The job will run fine at the beginning, importing the information
> from the ARCs and creating the "segments" section.
>
> The error I get is an OutOfMemory error when the system is
> processing each of the part.xx in the segments previously created.
>
> I tried increasing the following setting in the hadoop-default.xml
> config file: mapred.child.java.opts to 1 GB, but it still failed in
> the same part.
>
> Is there any way to reduce the amount of memory used by
> NutchWAX/Hadoop to make the process more efficient and be able to
> index such a collection?
>
> Thank you.
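For reference, here is a minimal hadoop-site.xml sketch of the overrides John suggests above. The property names are the stock Hadoop ones already mentioned in the thread; the numeric values are only placeholders to be tuned to the memory actually available on the slaves, and hadoop-site.xml (in Hadoop's conf/ directory) is the usual place for local overrides rather than editing hadoop-default.xml directly.

  <!-- hadoop-site.xml: local overrides; the values below are placeholders. -->
  <configuration>

    <!-- Heap for each child task JVM (the maps/reduces hitting
         OutOfMemory). -Xms/-Xmx set the initial and maximum heap size. -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xms1024m -Xmx1024m</value>
    </property>

    <!-- More tasks means each task processes a smaller slice of the
         input, so each one needs less memory. -->
    <property>
      <name>mapred.map.tasks</name>
      <value>40</value>
    </property>

    <property>
      <name>mapred.reduce.tasks</name>
      <value>8</value>
    </property>

  </configuration>

Which of the two task-count properties to raise depends on whether the OutOfMemory error came from a map task or a reduce task, per John's note above.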