From: Ignacio G. <igc...@gm...> - 2007-09-28 12:29:06
I had already increased -Xmx to 2 GB, and it still failed.
For everything else I am using the default settings and following the
getting-started guide on the NutchWAX site, so I am running

sudo {HADOOP_HOME}/bin/hadoop jar {NUTCHWAX_HOME}/nutchwax.jar all input output collection

And I believe I am using mapreduce.
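For reference, the 2 GB increase above goes through the child-task JVM
options in the Hadoop config, roughly like this (a sketch, not my exact
file; mapred.child.java.opts is the stock Hadoop property for the
per-task heap, and the value is just the 2 GB I mentioned):

<property>
  <name>mapred.child.java.opts</name>
  <!-- heap given to each map/reduce child task -->
  <value>-Xmx2048m</value>
</property>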
The number of ARCs is 3,521, with an average size of 30 MB per ARC.
Right now I am trying to break the job into several smaller chunks to see
if that helps, along the lines sketched below.
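Something like this (assuming the job input directory just holds a flat
text file listing the ARC locations one per line, as in the
getting-started guide; file and directory names are placeholders):

# Untested sketch: split the ARC list into chunks of 1000 and run one
# NutchWAX "all" job per chunk, each with its own input/output directory.
split -l 1000 arcs.txt arcs.chunk.
n=0
for f in arcs.chunk.*; do
  n=`expr $n + 1`
  mkdir input_$n
  mv "$f" input_$n/arcs.txt
  # if the job reads its input from DFS, copy the chunk in first, e.g.:
  # ${HADOOP_HOME}/bin/hadoop dfs -put input_$n input_$n
  sudo ${HADOOP_HOME}/bin/hadoop jar ${NUTCHWAX_HOME}/nutchwax.jar \
    all input_$n output_$n collection
done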
If it fails again, I will capture as much information as I can about
exactly where and when it failed.
Thank you.
On 9/27/07, John H. Lee <jl...@ar...> wrote:
>
> Hi Ignacio.
>
> It would be helpful if you posted the following information:
> - Are you using standalone or mapreduce?
> - If mapreduce, what are your mapred.map.tasks and
> mapred.reduce.tasks properties set to?
> - If mapreduce, how many slaves do you have and how much memory do
> they have?
> - How many ARCs are you trying to index?
> - Did the map reach 100% completion before the failure occurred?
>
> Some things you may want to try:
> - Set both -Xms and -Xmx to the maximum available on your systems
> - Increase one or both of mapred.map.tasks and mapred.reduce.tasks,
> depending on where the failure occurred (a rough config sketch follows)
> - Break your job up into smaller chunks of, say, 1,000 or 5,000 ARCs
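> 
> For example, something along these lines in conf/hadoop-site.xml (the
> values are only illustrative and depend on your slaves; the idea is that
> more tasks means each task works on a smaller, less memory-hungry slice):
> 
> <property>
>   <name>mapred.map.tasks</name>
>   <!-- illustrative: a small multiple of the number of slaves -->
>   <value>20</value>
> </property>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <!-- illustrative -->
>   <value>8</value>
> </property>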
>
> -J
>
> On Sep 27, 2007, at 10:47 AM, Ignacio Garcia wrote:
>
> > Hello,
> >
> > I've been doing some testing with NutchWAX and have never had any
> > major problems.
> > However, right now I am trying to index a collection that is over
> > 100 GB, and for some reason the indexing crashes while it is trying
> > to populate the crawldb.
> >
> > The job runs fine at the beginning, importing the information from
> > the ARCs and creating the "segments" section.
> >
> > The error I get is an OutOfMemoryError when the system is processing
> > each of the part.xx files in the previously created segments.
> >
> > I tried increasing the mapred.child.java.opts setting in the
> > hadoop-default.xml config file to 1 GB, but it still failed at the
> > same point.
> >
> > Is there any way to reduce the amount of memory used by
> > NutchWAX/Hadoop to make the process more efficient and able to index
> > such a collection?
> >
> > Thank you.
> > _______________________________________________
> > Archive-access-discuss mailing list
> > Arc...@li...
> > https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
>
>