From: Aaron B. <aa...@ar...> - 2011-05-02 23:28:50
Gerard Suades i Méndez <gs...@ce...> writes:

>>> We tried both approaches for the entire ARC collection:
>>>
>>> a) IndexMerger Lucene API (inside TNH).
>>>    index size: 813GB
>>>
>>> b) Re-built the entire index giving as input both old and new NutchWAX
>>>    segments of the ARC files.
>>>    index size: 563GB
>>>
>>> is it normal that there is this difference of sizes in the indexes?
>
> If I don't get the wrong idea, NutchWAX 0.13 (official release), which
> is the version we've used in method b), doesn't de-duplicate. So if
> neither of the two methods de-duplicates, could it be any other reason
> for such a difference in indexes sizes?

De-duplication while index building:

  NO   NutchWAX 0.13
  YES  NutchWAX 0.13-JIRA-WAX-75
  YES  JBs

I double-checked the source code.  Sorry if I said something different
before.

So, it sounds like you re-built the entire index in one job using
NutchWAX 0.13 (no de-duplication) and it yielded a much smaller index.
That is strange.

Can you confirm the number of documents in the indexes?  You can use a
utility in TNH to dump out the counts of documents by mime-type and
compare the totals:

  $ mkdir tnh
  $ cd tnh
  $ jar xvf ../tnh-*.war
  $ java -cp WEB-INF/classes:WEB-INF/lib/lucene-core-3.0.1.jar TermDumper -c type

This will print out the number of documents for each mime-type.  They
should be the same (or at least the same total) for both indexes.

>>> 3.- We have only one collection for all the ARC files. We have our
>>> collection on open access and the service is load balanced through
>>> several nodes. That's the scenario in where several tomcats are
>>> accessing the same indexes.
>>>
>
> Yes, the index is on an NFS shared storage system accessed by several
> nodes.

Hmmm, you might consider running some performance tests with the index
on local disk.  Maybe I'm just an old Unix guy, but I would expect a big
performance hit when searching a Lucene index over NFS.

> We will try both JBs and NutchWAX-with-deduplication. By the way, the
> SVN branch you pointed out (
>
> http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive
>
> ) is it possible that it is only suitable for nutch 1.1 and not for
> nutch 1.0 as it is said in INSTALL.txt?

Yes, you are correct.  Since this hasn't been made into an official
release, I have not updated all the documentation yet.

> * TNH
>   - Is it possible to define collections?

Yes, and there are two ways to do this:

1. Index each collection separately.

   If you have your (W)ARC files grouped by collection, process each one
   separately.  For example, if you had a "2011 election" collection and
   a "2010 world cup" collection, with separate groups of (W)ARCs for
   each, you would import and index each collection totally separately.
   Then, in TNH, you would put them as sibling directories in the root
   'index' directory, such as:

     /var/lib/tnh/index
       /2011-election
       /2010-world-cup

   and set your index directory in TNH's web.xml to be

     /var/lib/tnh/index

   TNH will automatically traverse the directory tree, finding the
   indexes and mapping them internally according to their directory
   names.  Then, on the URL, you can use the 'i' parameter to specify
   which index to search, such as:

     search?q=winner&i=2011-election
     search?q=winner&i=2010-world-cup

   Multiple 'i' parameters can be used to search multiple collections,
   and if no 'i' parameter is given, then all collections are searched
   by default.

2. When importing, put the name of the collection next to the (W)ARC
   URL, e.g.

     /mnt/data/warcs/foo.warc.gz  2011-election
     /mnt/data/warcs/bar.warc.gz  2010-world-cup

   During the import process, the Importer will decorate each record
   with a metadata field "collection" containing the appropriate value.
   Then, during indexing, that value is added to the Lucene index in a
   field named "collection".  This field can be searched by adding the
   'c' parameter to the OpenSearch URL, such as:

     search?q=winner&c=2011-election
     search?q=winner&c=2010-world-cup

   In this case, however, we only build one Lucene index, rather than
   one for each collection; the collection name is just another field,
   like mime-type.

I prefer method #1.  For us, it's much easier to manage.  In fact, in
our Archive-It.org hosted service, we have close to 2000 collections by
over 150 partners.  Keeping each collection in a separate index allows
us to manage them much more easily than if they were in one giant index
with 1.2 billion documents and 4.3 TB in size.  With separate indexes
for each collection, you have manageably sized indexes on disk, with the
ability to arbitrarily combine them at search time via the 'i'
parameter.

>   - How are the results sorted?

The results are ordered by rank, from best to worst.  The ranking is
determined by Lucene.  Also, 'site collapsing' is performed, so that if
you get multiple hits from one website, we only show the top 1, or 2, or
however many you want, specified by the 'h' parameter on the OpenSearch
URL:

  search?q=winner&h=3

would show up to 3 of the top hits from any one website.  This is pretty
much what web search companies have been doing for a long time.  That
way, if you search for "Facebook" on Yahoo!, you don't get the first
500000 hits from facebook.com, but a mix of things from their site, news
coverage about Facebook, etc.

> * JB/NutchWAX
>   - de-duplication is not possible in either tool (JB or NutchWAX) if
>     we want to add new crawls to an existing index. ¿?

The most straightforward way to do this is to re-build the entire index,
with all the data -- new and old -- in a single indexing job.  Both the
JBs and NutchWAX 0.13-JIRA-WAX-75 support this.

There is another way to do it, but it is more complicated.  You have to
analyze the CDX files, extract all the lines for duplicate captures, and
then feed that list to the 'import' command as an exclusion filter,
telling it which captures to ignore when importing.  In this case you
are de-duplicating at the front of the processing chain: during the
import stage.
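If you want to go that route, the CDX-analysis half could look something
like the sketch below.  This is untested and only illustrates the idea:
it prints the CDX lines of captures whose URL key and content digest
match an earlier capture.  The field positions (URL key = field 1,
digest = field 6) are an assumption based on the common
"CDX N b a m s k r M S V g" header, so check your own CDX header, and
you would still have to feed the output to 'import' in whatever
exclusion format it expects:

  #!/usr/bin/env python
  # Sketch: scan CDX files, keep the first capture of each
  # (URL key, digest) pair, and print the CDX lines of later duplicates.
  # Field positions assume the usual "CDX N b a m s k ..." layout.
  import sys

  def duplicate_lines(cdx_paths):
      seen = set()
      for path in cdx_paths:
          with open(path) as f:
              for line in f:
                  # skip the CDX format header line
                  if line.startswith(' CDX') or line.startswith('CDX '):
                      continue
                  fields = line.split()
                  if len(fields) < 6:
                      continue
                  key = (fields[0], fields[5])  # (url key, content digest)
                  if key in seen:
                      yield line.rstrip('\n')   # later capture of same content
                  else:
                      seen.add(key)

  if __name__ == '__main__':
      for dup in duplicate_lines(sys.argv[1:]):
          print(dup)

For a collection your size you would probably want to sort the CDX lines
by URL key and digest first (e.g. with 'sort') and stream through them,
rather than hold everything in a set in memory.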
>   - We've tried JB with hadoop 20.2, but it didn't end up well.
>     org.apache.hadoop.util.DiskChecker$DiskErrorException was thrown
>     so I guess there wasn't enough space in /tmp. If segments have
>     somewhere around 650 GB (having removed crawl_*/ and content/),
>     how much free space should be left on disk in order to carry out
>     the index process? any estimate size? based on our first try 2TB
>     doesn't seem to be enough.

Yes, lots of disk space is needed.  There is close to a 1:1 ratio
between the size of the segments and the size of the index.  Plus, with
Hadoop you keep a copy of the map output and reduce input in /tmp during
the Map/Reduce job.

For an index 500GB in size, here at IA we'd use our Hadoop cluster of at
least 10 machines.  How many machines are you using?

> * #crawls (WAYBACK 1.6.0/CDX)

I'll have to let Brad tackle this one.

Aaron