From: Ignacio G. <igc...@gm...> - 2007-09-27 17:47:30
Hello,

I've been doing some testing with NutchWAX and have never had any major problems. Right now, however, I am trying to index a collection that is over 100 GB, and the indexing crashes while it is populating the crawldb. The job runs fine at the beginning, importing the information from the ARCs and creating the segments. The error I get is an OutOfMemoryError while the system is processing each of the part.xx files in the segments created earlier.

I tried increasing mapred.child.java.opts in the hadoop-default.xml config file to 1 GB, but the job still failed at the same point.

Is there any way to reduce the amount of memory used by NutchWAX/Hadoop so the process is more efficient and I can index a collection of this size?

Thank you.
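P.S. In case the exact setting matters, the change to hadoop-default.xml was roughly this property entry (shown here with the 1 GB heap expressed as -Xmx1024m):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>
    </property>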