From: Ignacio G. <igc...@gm...> - 2007-09-28 12:32:52
Michael, I do not know if it failed on the same record... the first time it failed I assumed that increasing the -Xmx parameter would solve it, since the OOME has happened before when indexing with Wayback. I will try to narrow it down as much as I can if it fails again.

On 9/27/07, Michael Stack <st...@du...> wrote:
>
> What John says, and then:
>
> + The OOME exception stack trace might tell us something.
> + Is the OOME always in the same place, processing the same record? If so,
>   take a look at it in the ARC.
>
> St.Ack
>
> John H. Lee wrote:
> > Hi Ignacio.
> >
> > It would be helpful if you posted the following information:
> > - Are you using standalone or mapreduce?
> > - If mapreduce, what are your mapred.map.tasks and mapred.reduce.tasks
> >   properties set to?
> > - If mapreduce, how many slaves do you have and how much memory do
> >   they have?
> > - How many ARCs are you trying to index?
> > - Did the map reach 100% completion before the failure occurred?
> >
> > Some things you may want to try:
> > - Set both -Xms and -Xmx to the maximum available on your systems
> > - Increase one or both of mapred.map.tasks and mapred.reduce.tasks,
> >   depending on where the failure occurred
> > - Break your job up into smaller chunks of, say, 1000 or 5000 ARCs
> >
> > -J
> >
> > On Sep 27, 2007, at 10:47 AM, Ignacio Garcia wrote:
> >
> >> Hello,
> >>
> >> I've been doing some testing with NutchWAX and I have never had any
> >> major problems. However, right now I am trying to index a collection
> >> that is over 100 GB, and for some reason the indexing crashes while
> >> it tries to populate the 'crawldb'.
> >>
> >> The job runs fine at the beginning, importing the information from
> >> the ARCs and creating the "segments" section.
> >>
> >> The error I get is an OutOfMemory error while the system is
> >> processing each of the part.xx files in the segments previously
> >> created.
> >>
> >> I tried increasing the mapred.child.java.opts setting in the
> >> hadoop-default.xml config file to 1 GB, but it still failed at the
> >> same point.
> >>
> >> Is there any way to reduce the amount of memory used by
> >> NutchWAX/Hadoop to make the process more efficient and be able to
> >> index such a collection?
> >>
> >> Thank you.
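
For reference, the properties discussed above are usually set as overrides in hadoop-site.xml, which takes precedence over hadoop-default.xml. A minimal sketch, with illustrative values only (tune them to your own hardware and job size):

    <!-- hadoop-site.xml: site-specific overrides of hadoop-default.xml.
         The values below are placeholders, not recommendations. -->
    <configuration>
      <property>
        <name>mapred.child.java.opts</name>
        <!-- Heap given to each map/reduce child JVM; raise this if tasks OOME. -->
        <value>-Xmx1024m</value>
      </property>
      <property>
        <name>mapred.map.tasks</name>
        <!-- More tasks means less input, and so less memory, per task. -->
        <value>40</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>8</value>
      </property>
    </configuration>

Note that mapred.child.java.opts applies to the task child JVMs only; it is separate from any -Xmx given to the JVM that launches the job, so depending on where the OOME occurs both may need raising.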