From: John H. L. <jl...@ar...> - 2007-09-27 22:38:42

Hi Ignacio. It would be helpful if you posted the following information:

- Are you using standalone or mapreduce?
- If mapreduce, what are your mapred.map.tasks and mapred.reduce.tasks properties set to?
- If mapreduce, how many slaves do you have and how much memory do they have?
- How many ARCs are you trying to index?
- Did the map reach 100% completion before the failure occurred?

Some things you may want to try (a config sketch follows the quoted message below):

- Set both -Xms and -Xmx to the maximum available on your systems.
- Increase one or both of mapred.map.tasks and mapred.reduce.tasks, depending on where the failure occurred.
- Break your job up into smaller chunks of, say, 1000 or 5000 ARCs.

-J

On Sep 27, 2007, at 10:47 AM, Ignacio Garcia wrote:

> Hello,
>
> I've been doing some testing with NutchWAX and I have never had any
> major problems. However, right now I am trying to index a collection
> that is over 100 GB, and for some reason the indexing is crashing
> while it tries to populate 'crawldb'.
>
> The job will run fine at the beginning, importing the information
> from the ARCs and creating the "segments" section.
>
> The error I get is an OutOfMemory error when the system is
> processing each of the part.xx in the segments previously created.
>
> I tried increasing the following setting in the hadoop-default.xml
> config file: mapred.child.java.opts to 1 GB, but it still failed in
> the same part.
>
> Is there any way to reduce the amount of memory used by
> NutchWAX/Hadoop to make the process more efficient and be able to
> index such a collection?
>
> Thank you.
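For reference, here is a minimal hadoop-site.xml sketch of the overrides John suggests above. The property names are the stock Hadoop ones already mentioned in the thread; the numeric values are only placeholders to be tuned to the memory actually available on the slaves, and hadoop-site.xml (in Hadoop's conf/ directory) is the usual place for local overrides rather than editing hadoop-default.xml directly.

  <!-- hadoop-site.xml: local overrides; the values below are placeholders. -->
  <configuration>

    <!-- Heap for each child task JVM (the maps/reduces hitting
         OutOfMemory). -Xms/-Xmx set the initial and maximum heap size. -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xms1024m -Xmx1024m</value>
    </property>

    <!-- More tasks means each task processes a smaller slice of the
         input, so each one needs less memory. -->
    <property>
      <name>mapred.map.tasks</name>
      <value>40</value>
    </property>

    <property>
      <name>mapred.reduce.tasks</name>
      <value>8</value>
    </property>

  </configuration>

Which of the two task-count properties to raise depends on whether the OutOfMemory error came from a map task or a reduce task, per John's note above.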