From: Aaron B. <aa...@ar...> - 2011-08-28 21:50:44
Jon Walton <jon...@gm...> writes:

> I use NutchWAX to index WARC file content for analysis. I need to fix
> it to get around the JDK u23 gzip problem, but I noticed that
> development seems to have died. Is everyone using other solutions now
> such as Solr? If so, care to share any details?

It's not quite entirely dead, but pretty close to it. I can't speak for
everyone, but many (former) users of NutchWAX are in some state of
migration to a Solr-based implementation. IMO, the main challenge in
moving to Solr is replacing the NutchWAX 'import' step -- reading the
documents from (W)ARC files.

There is a branch on the public NutchWAX Subversion tree that has a fix
to handle the JDK u23 gzip change. This branch also contains a few
customizations specific to the way NutchWAX is still used in a few
deployments with particular needs. YMMV. The branch is:

http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_13-JIRA-WAX-75/archive

As for the future of NutchWAX, well, consider that NutchWAX has three
essential pieces:

1. Import: read (W)ARC files, get the documents, parse them, and
   extract the text, metadata, links, etc.
2. Index: read the output of step 1, perform text analysis and
   manipulation, and index with Solr/Lucene/etc.
3. Search/query: the live search service.

For us at the Archive, we still use NutchWAX for step 1, albeit with
modifications that are particular to our deployments.

As for step 2, we are now using some custom MapReduce code I wrote
which can index documents either directly with Lucene or push them
over the wire into a Solr server. That project can be found at:

https://github.com/aaronbinns/jbs

And as for step 3, some folks have moved on to Solr; at the Archive we
use a custom Lucene-based Java web application, purpose-built for
searching archived web pages:

https://github.com/aaronbinns/tnh

My plan is to replace NutchWAX for step 1. At the Archive, we have an
in-development "access" library which can read (W)ARCs and has lots of
goodies for doing things with (W)ARCs at scale in Hadoop. The idea is
that Wayback, full-text search, and other web access projects will all
use that core library. It's still a ways off, though.

Hope that helps,

Aaron

--
Aaron Binns
Senior Software Engineer, Web Group, Internet Archive
Program Officer, IIPC
aa...@ar...
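
For the "index" piece described in step 2 above, a minimal sketch of pushing
one parsed document over the wire into Solr with SolrJ (circa Solr 3.x, which
uses CommonsHttpSolrServer) might look like the following. The Solr URL and
the field names are assumptions for illustration only; they are not taken
from the jbs project or any particular schema.

    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrPushSketch
    {
      public static void main( String[] args ) throws IOException, SolrServerException
      {
        // Hypothetical Solr endpoint; substitute your own server URL.
        SolrServer solr = new CommonsHttpSolrServer( "http://localhost:8983/solr" );

        // One document per archived capture.  Field names here (id, url,
        // date, content) are placeholders -- use whatever your schema defines.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField( "id",      "http://example.org/page.html 20110828000000" );
        doc.addField( "url",     "http://example.org/page.html" );
        doc.addField( "date",    "20110828000000" );
        doc.addField( "content", "text extracted from the (W)ARC record" );

        solr.add( doc );    // queue the document on the server
        solr.commit();      // make it visible to searches
      }
    }

In a MapReduce setting the add() calls would typically be batched in the
reducer and committed once at the end rather than per document, but the
wire-level interaction with Solr is the same.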