Re: [Archive-access-discuss] Nutchwax OutOfMemoryError

Gerard Suades i Méndez <gs...@ce...> writes:

> 1.- We have a new set of ARC that we would like to include in full
> text search. We were wondering if there is any special procedure to
> update the already existing NutchWAX indexes with the new crawls. Any
> idea for the merge process? Do we need to keep segments of old crawls
> in order to generate the indexes of the new crawls before merging all
> together?

Yes, for *building* the indexes you need to keep the segments.  Only
the TNH search service can do without them, since the index has all the
information in it needed for search.

There are basically two ways to merge indexes; which one you choose
depends on your de-duplication strategy.

If you have two Lucene indexes A and B, you can just use the IndexMerger
command in TNH to merge them together.  TNH provides a simple
command-line wrapper around the Lucene index merging API call.  Since
TNH is a webapp, you have to un-jar it to be able to use the Java
command-line wrappers, for example

  $ mkdir tnh
  $ cd tnh
  $ jar xf ~/tnh.war
  $ export CLASSPATH=WEB-INF/classes:WEB-INF/lib/lucene-core-*.jar
  $ java IndexMerger <merged> <index-A> <index-B>

This simply calls the Lucene library index-merge function, so it does
*not* know anything about de-duplication.  If you have the same record
in both index A and index B, then you will have them both in the merged
index.

So, if you already have an index for your existing collection and then
get some new (W)ARC files, you can index those separately and then merge
the two indexes together.

Another approach is to re-build the entire index, giving as inputs the
initial NutchWAX segments and the new NutchWAX segment for the new
(W)ARCs.  Then, you will have one single index with everything in it.

In this case, any duplicate records can be detected and merged when the
combined index is being built.  The merging of duplicate records during
index-building was a feature put into a minor revision of NutchWAX 0.13.
I'll have to look up the specific SVN revision.
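
To make the difference between the two approaches concrete, here is a
minimal sketch in plain Java.  It is a toy model, not the Lucene,
NutchWAX or TNH code.  The Rec type, the method names, and the choice of
URL plus content digest as the duplicate key are all assumptions made
for illustration, as is the idea that "merging" a duplicate means
keeping one entry and collecting its capture dates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of the two merge strategies (hypothetical, for illustration).
class MergeSketch {

    // Hypothetical record: one captured document.
    static final class Rec {
        final String url;
        final String digest; // content digest, assumed to identify duplicates
        final String date;   // capture date

        Rec(String url, String digest, String date) {
            this.url = url;
            this.digest = digest;
            this.date = date;
        }
    }

    // Approach 1: plain merge of two already-built indexes.
    // Every record from both inputs is kept, duplicates included.
    static List<Rec> plainMerge(List<Rec> indexA, List<Rec> indexB) {
        List<Rec> merged = new ArrayList<>(indexA);
        merged.addAll(indexB);
        return merged;
    }

    // Approach 2: rebuild from all segments at once.
    // Records with the same (url, digest) key collapse into one entry
    // that lists every capture date seen for that key.
    static Map<String, List<String>> rebuildWithDedup(List<List<Rec>> segments) {
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (List<Rec> segment : segments) {
            for (Rec r : segment) {
                String key = r.url + " " + r.digest;
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(r.date);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<Rec> oldCrawl = Arrays.asList(
            new Rec("http://example.org/", "sha1:AAA", "20100115"));
        List<Rec> newCrawl = Arrays.asList(
            new Rec("http://example.org/", "sha1:AAA", "20110115"),
            new Rec("http://example.org/new", "sha1:BBB", "20110115"));

        System.out.println("plain merge entries:  "
            + plainMerge(oldCrawl, newCrawl).size());
        System.out.println("dedup rebuild entries: "
            + rebuildWithDedup(Arrays.asList(oldCrawl, newCrawl)).size());
    }
}
```

In this toy data the unchanged page appears in both crawls, so the plain
merge keeps it twice while the rebuild keeps it once with two capture
dates attached.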

With regards to indexing, there is a side-project of mine similar to TNH
which does a better job of index-building than NutchWAX.  This project
is called "The JBs", which was the name of the band for the famous
musician James Brown.

One of the many improvements in The JBs is "accented letter
collapsing": words with accented characters are indexed so that
they can be found with or without the accent mark.  For example,

  Méndez

With NutchWAX it is put into the index exactly as "Méndez".  If someone
searches for "Mendez", it will not be found.  But if the index is built
with The JBs, then both "Méndez" and "Mendez" can be found.
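
As a rough illustration of accent collapsing in general, here is a
minimal sketch using only the Java standard library.  It is not the
code from The JBs; the class and method names are made up, and indexing
both the original and the folded form is just one possible way to get
the behaviour described above.

```java
import java.text.Normalizer;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper showing one way to collapse accented letters.
class AccentFolding {

    // Decompose each accented letter into base letter + combining mark
    // (Unicode NFD), then drop the combining marks.
    static String fold(String term) {
        String decomposed = Normalizer.normalize(term, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    // Terms to put into the index for one word: the word as written,
    // plus its folded form if that differs.
    static Set<String> indexTerms(String word) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(word);
        terms.add(fold(word));
        return terms;
    }

    public static void main(String[] args) {
        // "M\u00e9ndez" is "Méndez"; the escape avoids source-encoding issues.
        String name = "M\u00e9ndez";
        System.out.println(fold(name));
        System.out.println(indexTerms(name).size() + " terms indexed");
    }
}
```

With both terms in the index, a query for either spelling matches the
same document.  Another option is to fold at both index time and query
time and store only the folded term.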

The JBs also performs merging of duplicates when building a single index
from multiple NutchWAX segments.

But, this email is getting rather long already, with more below, so I
will conclude this section on The JBs.  We can discuss further if you
are interested.

> 2.- The size of the index which self-contained the segments
> information is a linear growth size related to the ARC? at this moment
> index represents pretty much 7.5% of the whole collection ARCs size.

It depends on the mix of file types in the original ARC files.  Only
text types are put into the full-text search, so things like JPG, MP3,
AVI, ZIP, etc. are omitted.  Your 7.5% number does not seem unusual to
me.  In our full-text search for Archive-It.org, there are just over 1
billion documents in the index and the on-disk index size is ~3.5TB, and
the size of all the (W)ARC files is somewhere around 100TB.  But I know
there are lots of large binary files, including lots of YouTube video in
the Archive-It collection.
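
For comparison with your 7.5%, the Archive-It figures work out as
follows.  This is only back-of-the-envelope arithmetic on the rounded
numbers above (3.5TB, 100TB, 1 billion documents), assuming decimal
units (1TB = 10^12 bytes); the class and method names are made up.

```java
import java.util.Locale;

// Back-of-the-envelope arithmetic on the rounded Archive-It figures.
class IndexSizeRatio {

    // Index size as a percentage of the (W)ARC collection size.
    static double percentOf(double indexTb, double collectionTb) {
        return 100.0 * indexTb / collectionTb;
    }

    // Average index bytes per document, assuming 1 TB = 1e12 bytes.
    static double bytesPerDocument(double indexTb, double documents) {
        return indexTb * 1e12 / documents;
    }

    public static void main(String[] args) {
        double percent = percentOf(3.5, 100.0);
        double perDoc = bytesPerDocument(3.5, 1e9);
        System.out.println(String.format(Locale.ROOT,
            "index is about %.1f%% of the (W)ARC size", percent));
        System.out.println(String.format(Locale.ROOT,
            "about %.0f index bytes per document", perDoc));
    }
}
```

So the Archive-It index is roughly 3.5% of the (W)ARC size, which is
consistent with a collection that carries more binary content than
yours.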

> 3.- Is it possible to install TNH in several tomcats sharing the same
> index? in other words, does TNH block index while searching as Wayback
> used to?

I don't remember if that specific use-case was tested.  It should work.

TNH is built on Lucene and when TNH opens the index, it uses the Lucene
API call to open the index in read-only mode; so there should be no
exclusive locking and multiple TNH web application instances should be
able to open the same index.

However, TNH and the Lucene library do cache parts of the index in
memory, so if you have multiple instances of the TNH web application, you
will have multiple instances of the caches as well.

An alternative approach might be to use a multi-index setup in a single
TNH instance and use the "i=<indexname>" URL parameter to select which
index to search.

Maybe you can describe what you are trying to do with multiple TNH
webapp instances reading the same index and I can provide some
suggestions on how to implement it.

> 4.- Based on the results of our tests we are thinking of using TNH for
> full text search instead of WERA. Is there any roadmap or a major
> release planned for the future?

No, there isn't any roadmap.  Well, the roadmap is to migrate everything
to Apache SOLR, which merged projects with Lucene last year and is now
considered *the* open-source full-text search platform.

Unfortunately, there are some features missing from SOLR which are
required for full-text search on web archives.  Also, we don't know yet
how SOLR will scale, especially in a multi-server configuration.

I produced a report for the IIPC covering the issues with migrating from
NutchWAX to SOLR.

  http://archive.org/~aaron/iipc/

So, that leaves us in an intermediate state where NutchWAX's search
service performance is not sufficient, but SOLR is not quite ready for
full-scale migration.  The Internet Archive needs to decide if we commit
to supporting TNH (with an official release) as an intermediate step in
the migration path to SOLR.

And if people are finding TNH useful and an adequate replacement for the
NutchWAX search service, then we would have a stronger case to commit
the resources to support an official TNH release.

-- 
Aaron Binns
Senior Software Engineer, Web Group, Internet Archive
Program Officer, IIPC
aa...@ar...