From: Ignacio G. <igc...@gm...> - 2007-10-17 14:09:48
Hello Michael,

Where can I find the tasktracker log? Is it under Hadoop, under NutchWAX, or in a temp location?

Also, I tried using JConsole to track memory use in the process, but unfortunately the Hadoop process does not have the management agent activated, so it cannot be tracked by JConsole. Is there any way to activate it using Java options?

I will use the environment variables Michael pointed me to and try to increase the PermGen size that way.

Thank you.

On 10/15/07, Michael Stack <st...@du...> wrote:
> Ignacio Garcia wrote:
> > Hello Andrea,
> >
> > I tried increasing the PermGen size, but it still failed with the same error...
> >
> > I modified the following settings in "hadoop-default.xml":
> >
> >   <name>mapred.child.java.opts</name>
> >   <value>-Xmx2048m -Xms1024m -XX:PermSize=256m -XX:MaxPermSize=512m</value>
> >
> > That is the only place I could find where I could include Java opts... Should I increase it even more, or is this property ignored when doing the indexing?
>
> The OOME looks to be in the startup of the update task. The error.txt log you pasted was from the command line. Have you tried looking in the remote tasktracker log? It might have more info on where the OOME is happening.
>
> The above setting is for each child process run by each of the tasktrackers in your cluster. The child process does the heavy lifting, so I'm guessing it's where you are seeing the OOMEs.
>
> Regarding how to set the memory for tasktrackers, etc., the notes here still apply, I believe: http://archive-access.sourceforge.net/projects/nutch/faq.html#env (do a search for the referred-to environment variables).
>
> St.Ack
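On the JConsole question above: one approach worth trying is to add the standard com.sun.management.jmxremote system properties to mapred.child.java.opts, so that each child task JVM starts a management agent. A minimal sketch follows; it assumes the override is placed in hadoop-site.xml (site-specific overrides normally go there rather than in hadoop-default.xml), and the heap, PermGen, and port values are placeholders, not recommendations.

  <!-- Sketch only: heap, PermGen and port values are placeholders.
       It is safest to keep the whole value on one whitespace-separated line.
       Disabling authentication and SSL is only reasonable on a trusted network. -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m -XX:MaxPermSize=256m -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9010 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false</value>
  </property>

Note that a fixed port will collide if a tasktracker runs more than one child JVM on the same node; for JConsole running on the same host, -Dcom.sun.management.jmxremote alone (without a port) may be enough. If it is the tasktracker daemon itself you want to watch, the same properties could instead go into the daemon's own Java options via the environment variables mentioned in the FAQ above.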
> >
> > Any help would be greatly appreciated. Thank you.
> >
> > On 10/5/07, Ignacio Garcia <igc...@gm...> wrote:
> >
> > I will try increasing the PermGen space as shown in the reference you provided. However, in my case the process is not acting as a webapp, so it does not relate entirely to the information in the article.
> >
> > Do you think that shutting down every other Java application and just running the NutchWAX job would have any benefit in this case? Since I cannot control the number of class loaders created (I'm just running the code, I did not modify it in any way), I do not have any control over this problem.
> >
> > Thank you for the pointers.
> >
> > On 10/5/07, Andrea Goethals <an...@hu...> wrote:
> >
> > On Fri, 5 Oct 2007 13:11:28 -0400, Ignacio Garcia wrote:
> > > That might work, but it is not the way that I would like to use NutchWAX.
> > >
> > > If I am forced to divide up one of my small collections (~100 GB), I don't even want to think how many partitions the big collections are going to require. That means time wasted partitioning, starting several jobs, merging the created indexes, and more...
> > >
> > > I even tried increasing the heap size to 4 GB, the maximum amount of RAM in my system, and that did not work.
> > >
> > > I have attached the last lines of the output provided by NutchWAX, to see if you can point me to a possible solution to this problem.
> >
> > Your output shows that the error is
> >
> >   java.lang.OutOfMemoryError: PermGen space
> >
> > Is that always the case? If so, I don't think that increasing the heap size is going to help. This page explains the PermGen space well:
> > http://blogs.sun.com/fkieviet/entry/classloader_leaks_the_dreaded_java
> >
> > Andrea
> >
> > > Also... is there any way to know if it crashed on a particular record / ARC file or action, so I can try to avoid it? And is there a way to resume the job from the moment it crashed?
> > >
> > > Thank you.
> > >
> > > On 10/2/07, John H. Lee <jl...@ar...> wrote:
> > > >
> > > > The idea is that for each of the N sets of ~500 ARCs, you'll have one index and one segment. That way, you can distribute the index-segment pairs across multiple disks or hosts:
> > > >
> > > >   /search/indexes/indexA/
> > > >   /search/indexes/indexB/
> > > >   ...
> > > >   /search/segments/segmentA/
> > > >   /search/segments/segmentB/
> > > >   ...
> > > >
> > > > and point searcher.dir at /search. The webapp will then search all indexes under /search/indexes. Alternatively, you can merge all of the indexes as Stack pointed out.
> > > >
> > > > Hope this helps.
> > > >
> > > > -J
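The layout John describes maps onto a single setting on the search side. A minimal sketch, assuming the stock Nutch searcher.dir property is the one the NutchWAX search webapp reads, and with /search as a placeholder for the directory that holds indexes/ and segments/:

  <!-- nutch-site.xml of the search webapp (sketch; /search is a placeholder).
       The searcher typically looks for index/, indexes/ and segments/
       under this directory. -->
  <property>
    <name>searcher.dir</name>
    <value>/search</value>
  </property>

With a single merged index, the same property would point at the directory containing index/ and segments/ instead.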
> > > > On Oct 2, 2007, at 5:09 AM, Ignacio Garcia wrote:
> > > >
> > > > Hello,
> > > >
> > > > I tried separating the list of ARCs into smaller sets of ~500 ARCs.
> > > >
> > > > The first batch ran to completion without problems; however, the second batch failed because I was using the same output directory as I used for the first one.
> > > >
> > > > Why can't I use the same output directory? Wouldn't it make sense to have all the info in the same place, so I can access everything at once?
> > > >
> > > > How do I divide the collection into smaller portions and then combine everything into a single index? If I just keep everything separated, I would lose a lot of time looking in different indexes and configuring the webapp to be able to look everywhere.
> > > >
> > > > On 9/28/07, Ignacio Garcia <igc...@gm...> wrote:
> > > > >
> > > > > Michael, I do not know if it failed on the same record...
> > > > >
> > > > > The first time it failed I assumed that increasing the -Xmx parameters would solve it, since the OOME has happened before when indexing with Wayback.
> > > > >
> > > > > I will try to narrow it down as much as I can if it fails again.
> > > > >
> > > > > On 9/27/07, Michael Stack <st...@du...> wrote:
> > > > > >
> > > > > > What John says, and then:
> > > > > >
> > > > > > + The OOME exception stack trace might tell us something.
> > > > > > + Is the OOME always in the same place, processing the same record? If so, take a look at it in the ARC.
> > > > > >
> > > > > > St.Ack
> > > > > >
> > > > > > John H. Lee wrote:
> > > > > > > Hi Ignacio.
> > > > > > >
> > > > > > > It would be helpful if you posted the following information:
> > > > > > > - Are you using standalone or mapreduce?
> > > > > > > - If mapreduce, what are your mapred.map.tasks and mapred.reduce.tasks properties set to?
> > > > > > > - If mapreduce, how many slaves do you have and how much memory do they have?
> > > > > > > - How many ARCs are you trying to index?
> > > > > > > - Did the map reach 100% completion before the failure occurred?
> > > > > > >
> > > > > > > Some things you may want to try:
> > > > > > > - Set both -Xms and -Xmx to the maximum available on your systems
> > > > > > > - Increase one or both of mapred.map.tasks and mapred.reduce.tasks, depending on where the failure occurred
> > > > > > > - Break your job up into smaller chunks of, say, 1000 or 5000 ARCs
> > > > > > >
> > > > > > > -J
> > > > > > >
> > > > > > > On Sep 27, 2007, at 10:47 AM, Ignacio Garcia wrote:
> > > > > > >
> > > > > > >> Hello,
> > > > > > >>
> > > > > > >> I've been doing some testing with NutchWAX and I have never had any major problems. However, right now I am trying to index a collection that is over 100 GB, and for some reason the indexing is crashing while it tries to populate 'crawldb'.
> > > > > > >>
> > > > > > >> The job runs fine at the beginning, importing the information from the ARCs and creating the "segments" section.
> > > > > > >>
> > > > > > >> The error I get is an OutOfMemoryError when the system is processing each of the part.xx in the segments previously created.
> > > > > > >>
> > > > > > >> I tried increasing the following setting in the hadoop-default.xml config file, mapred.child.java.opts, to 1 GB, but it still failed in the same part.
> > > > > > >>
> > > > > > >> Is there any way to reduce the amount of memory used by nutchwax/hadoop to make the process more efficient and be able to index such a collection?
> > > > > > >>
> > > > > > >> Thank you.
> >
> > --
> > Harvard University Library
> > Powered by Open WebMail
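On John's suggestion above to increase mapred.map.tasks and mapred.reduce.tasks, a minimal hadoop-site.xml sketch is below. The numbers are placeholders rather than tuned values and would normally be sized to the cluster's task slots; mapred.map.tasks is generally treated as a hint (the real map count follows the input splits), while the reduce count is taken as given.

  <!-- Sketch only: the values are placeholders. More tasks generally means
       smaller per-task inputs, so each child JVM holds less in memory at
       any one time. -->
  <property>
    <name>mapred.map.tasks</name>
    <value>40</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>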