From: Aaron B. <aa...@ar...> - 2011-03-15 21:58:15
Gerard Suades i Méndez <gs...@ce...> writes:

> If segments needs to be kept in order to update the indexes with new
> crawls then we need to bear in mind that indexes+segments size
> represents somewhere around 50% of the all ARCs, specially in terms of
> scalability.

Are these numbers usual? That seems large to me. In your segments, which
of the following sub-directories appear?

  <segment>/crawl_data
            crawl_?
            parse_text
            parse_data
            content

I'll have to check the SVN revision in NutchWAX where I removed the
dependencies on the "crawl_*" and "content" sub-dirs. Those sub-dirs are
left-overs from Nutch and we don't use them at all for NutchWAX. If you
see those sub-dirs, re-calculate the sizes w/o them for a more realistic
measurement. Once I can confirm the SVN revision where I removed their
use in NutchWAX (they were never used by the JBs), you can delete them,
as they aren't used at all. In earlier versions of NutchWAX, there was
code to check that the sub-dirs were there, even though they weren't
used. For the JBs and more recent versions of NutchWAX, only

  parse_text
  parse_data

are used. In my indexing projects, after the (w)arcs are all imported, I
just do a

  $ rm -rf segments/*/c*

to remove the sub-dirs starting with 'c'.

> We tried both approaches for the entire ARC collection:
>
> a) IndexMerger Lucene API (inside TNH).
>    index size: 813GB
>
> b) Re-built the entire index giving as input both old and new NutchWAX
>    segments of the ARC files.
>    index size: 563GB
>
> is it normal that there is this difference of sizes in the indexes?

It's quite possible. If there were a lot of duplicate captures, you
could see such a large reduction in size. Method A preserves the
duplicates, whereas method B de-duplicates.

In later versions of NutchWAX and the JBs, de-duplication happens
automatically, but only *within a single indexing job*. If you index all
the segments in one job, they will be de-duplicated. If you index
subsets of segments, creating multiple indexes, there could be
duplicates across the indexes.

NutchWAX segments and their sub-directories are actually rather simple
data structures. They are in a compressed binary (Hadoop) format, so you
can't simply 'cat' them (see the reader sketch after the de-duplication
discussion below), but in essence they are:

  [<unique-id>, <set of key/value properties>], ...

Each record has a unique key, for which we use "<url> <hash>". The rest
of the record is simply key/value pairs of properties. In the parse_data
sub-dir, we have records of the form

  ['http://example.com/ 123456...',
   ["title" => "My webpage",
    "date"  => "20101202092343",
    "type"  => "text/html",
    ....]
  ]
  ['http://example.com/contact.html 3452...',
   ["title" => "Contact us",
    "date"  => "20101202092355",
    "type"  => "text/html",
    ....]
  ]

and in parse_text, we have

  ['http://example.com/ 123456...',
   ["body" => "Here is the body of the webpage."]
  ]
  ...

When indexing, with either NutchWAX or the JBs, the sub-dirs are opened
up and Hadoop combines the records from the sub-dirs, matching according
to unique-id. In NutchWAX and the JBs, we also detect multiple merged
records with the same unique-id and then perform our own merging: we
retain only one key/value pair for properties such as "title", and we
combine the values of the "date" property so that we have *all* the
capture dates for the unique version of a URL. For example, imagine that
we had two records

  ['http://example.com/ 123456...',
   ["title" => "My webpage",
    "date"  => "20090304101509",
    "type"  => "text/html",
    ....]
  ]
  ['http://example.com/ 123456...',
   ["title" => "My webpage",
    "date"  => "20101202092343",
    "type"  => "text/html",
    ....]
  ]

During the indexing process, these would be combined into a single
record with two capture dates

  ['http://example.com/ 123456...',
   ["title" => "My webpage",
    "date"  => ["20090304101509", "20101202092343"],
    "type"  => "text/html",
    ....]
  ]

but for all the other properties, we keep only one value. It doesn't
make sense to have the title or mime-type twice. This is the core of the
de-duplication process during indexing. But this de-duplication is done
by the Java code in the NutchWAX Indexer and the JBs Indexer; Lucene
doesn't know anything about it.
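Purely as an illustration (this is *not* the actual NutchWAX/JBs reducer
code; the RecordMerger class and the map-based record representation are
made up for the sketch), the merging boils down to something like:

    import java.util.*;

    public class RecordMerger {
      // Merge all records sharing one unique-id ("<url> <hash>").
      // "date" is multi-valued; everything else keeps its first value.
      public static Map<String, List<String>> merge(List<Map<String, String>> records) {
        Map<String, List<String>> merged = new HashMap<String, List<String>>();
        for (Map<String, String> record : records) {
          for (Map.Entry<String, String> e : record.entrySet()) {
            if ("date".equals(e.getKey())) {
              // Accumulate every distinct capture date.
              List<String> dates = merged.get("date");
              if (dates == null) {
                dates = new ArrayList<String>();
                merged.put("date", dates);
              }
              if (!dates.contains(e.getValue())) {
                dates.add(e.getValue());
              }
            } else if (!merged.containsKey(e.getKey())) {
              // Single-valued properties ("title", "type", ...): keep one.
              merged.put(e.getKey(), Collections.singletonList(e.getValue()));
            }
          }
        }
        return merged;
      }
    }

In the real jobs this happens in the reduce step, where Hadoop has
already grouped together all the records sharing a unique-id.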
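And since the sub-dirs are ordinary Hadoop SequenceFiles/MapFiles
underneath, you can dump their records if you're curious. A minimal
sketch, assuming Hadoop 0.20 and the Nutch/NutchWAX jars on the
classpath (the DumpSegment class itself is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    // Dumps the records of one segment data file, e.g.
    //   segments/<segment>/parse_data/part-00000/data
    public class DumpSegment {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
          new SequenceFile.Reader(fs, new Path(args[0]), conf);
        try {
          // Key is "<url> <hash>"; the value class (e.g. ParseData) is
          // instantiated reflectively, so it must be on the classpath.
          Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
          Writable val = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
          while (reader.next(key, val)) {
            System.out.println(key + "\t" + val);
          }
        } finally {
          reader.close();
        }
      }
    }

If I remember correctly, 'hadoop fs -text' also understands
SequenceFiles, so that's a quicker way to peek at a single data file.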
> 3.- We have only one collection for all the ARC files. We have our
> collection on open access and the service is load balanced through
> several nodes. That's the scenario in where several tomcats are
> accessing the same indexes.

Does that mean that each node has a local copy of the index? Or perhaps
the index is on an NFS share or SAN mounted on each node?

Lastly, the indexing process for the JBs is pretty much the same as for
NutchWAX. The command-lines are similar, but for the JBs, you have to
use the Hadoop command-line driver, whereas NutchWAX comes with its own.
E.g.

  $ nutchwax index indexes segments/*

vs.

  $ hadoop jar jbs-*.jar Indexer indexes segments/*

The version of Hadoop that we use is the Cloudera distribution, which is
based on Hadoop 0.20.2 with some Cloudera patches to fix bugs. I believe
you can use stock Hadoop 0.20.1 or 0.20.2 w/o any problems.

The JBs also does a better job of filtering out obvious crap, such as
"words" which do not contain any letters; e.g. "34983545$%23432" is
filtered out when indexing with the JBs. It also canonicalizes the
mime-types so that all the dozens of different known varieties of MS
Office mime-types are mapped to one standard set. It also omits
'robots.txt' files and ignores mime-types that probably don't have text
in them, such as "application/octet-stream".

I'd recommend giving the JBs a try, at least to test and compare against
the index built with NutchWAX; especially since the JBs does the
accented character collapsing.

Aaron

-- 
Aaron Binns
Senior Software Engineer, Web Group, Internet Archive
Program Officer, IIPC
aa...@ar...