[Archive-access-discuss] [ANN] Nutchwax 0.4.2 release: Minor fixes. Last release before move to mapreduce-based nutch.

Following up on a note posted earlier (reproduced below), we've just 
uploaded release 0.4.2 of nutchwax -- the last release based on 
Nutch-0.7.  This new release has a few minor fixes, mostly build cleanup, 
including making the binary work with Java 1.4.x, and fixes to make all 
links relative rather than absolute so the webapp can sit behind an 
apache proxy pass-through; e.g. see http://websearch.archive.org/ (the 
Katrina collection search is also reachable off the archive's home page 
as our first publicly accessible full-text search; look for Katrina in 
the top-left announcements).  The default search.jsp now also has 
Google-like paging through search results, thanks to Dan Avery.  Finally, 
we also added some small functionality needed by WERA.
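
To illustrate why relative links matter behind a proxy pass-through, here is a minimal Python sketch (not nutchwax code) that resolves a link the way a browser would. The host names, the /katrina/ proxy prefix, the /nutchwax/ webapp path and the query parameters are all made up for the example.

```python
from urllib.parse import urljoin

# Hypothetical URL: the public address under which the proxy exposes the
# webapp's search page. The backend itself is assumed to serve /nutchwax/.
PROXIED_PAGE = "http://www.example.org/katrina/search.jsp"


def resolve(page_url, href):
    """Resolve a link against the page URL, as a browser does."""
    return urljoin(page_url, href)


# A relative link carries no knowledge of where the webapp is mounted.
relative_href = "search.jsp?query=levee&start=10"
# An absolute-path link bakes in the backend's own path.
absolute_href = "/nutchwax/search.jsp?query=levee&start=10"

relative_via_proxy = resolve(PROXIED_PAGE, relative_href)
absolute_via_proxy = resolve(PROXIED_PAGE, absolute_href)

print(relative_via_proxy)  # stays under the proxied /katrina/ prefix
print(absolute_via_proxy)  # escapes to /nutchwax/, which the proxy may not map
```

The relative link keeps working whatever prefix the proxy uses, while the absolute one only works if the public path happens to match the backend path.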

From here on out, nutchwax will depend on the mapreduce-based 
Nutch.  As noted below, the move to mapreduce is a pretty 
radical break with how things have been done in the past, but the intent is 
to provide tools and documentation to smooth the transition.  We will keep 
the list updated with progress.

Yours,
St.Ack

Subject: [Archive-access-discuss] Nutchwax future: 0.6.0 to be mapreduce-based.
From: Michael Stack <st...@du...>
Date: Thu, 17 Nov 2005 18:00:16 -0800
To: arc...@li...

This note advocates that the next major nutchwax release be based on a 
mapreduce version of nutch and use the nutch distributed filesystem, 
NDFS (See references at end of the following paper if you want to learn 
more about NDFS, mapreduce and mapreduce in nutch: 
http://archive-access.sourceforge.net/projects/nutch/iwaw/iwaw-wacsearch.pdf). 

Here at the Archive, we've been bumping up against the limitations of 
the pre-mapreduce model of the nutch-0.7.x release while trying to index 
collections of more than 100 million documents, so we've started moving 
nutchwax to work atop NDFS and the coming mapreduce version of nutch (see 
the 'mapred' branch in nutchwax and the 'mapred' branch in nutch).  Doug 
Cutting has been doing the bulk of this development work and running the 
initial indexings.  So far, things look good.  An indexing of a collection 
of 200 million URLs ran to completion atop a rack of 30-odd machines.  While 
the indexing speed is currently not that impressive -- the rack is 
made up of lowly processors and there remains much room for 
optimization -- the amount of oversight required to complete the task 
was close to zero (this is impressive).
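
For readers new to the model, the following is a toy, single-process Python sketch of how an inverted index can be expressed as a map step and a reduce step. It is only an illustration of the general mapreduce idea, not how nutch or nutchwax implement indexing; the function names and sample documents are invented.

```python
from collections import defaultdict


def map_document(doc_id, text):
    """Map step: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_postings(term, doc_ids):
    """Reduce step: collapse the values for one term into a sorted posting list."""
    return term, sorted(doc_ids)


def build_index(documents):
    """Run map, shuffle and reduce over a dict of {doc_id: text}."""
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in map_document(doc_id, text)
    )
    return dict(reduce_postings(term, ids) for term, ids in shuffle(pairs).items())


# Made-up sample documents.
docs = {"doc1": "levee breach report", "doc2": "storm report"}
index = build_index(docs)
print(index["report"])
```

Because each map call sees one document and each reduce call sees one term, the work can in principle be spread across many machines, with the framework handling the grouping in between.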

We also want to move onto mapreduce and NDFS because we see it getting 
us over incremental indexing woes we've run into using current nutchwax: 
segments and the webdb have outgrown large disks and various steps in 
the process take interminably long to complete.

A mapreduce-based nutchwax internally operates in a manner very 
different from how current nutchwax works; documentation, plugins, 
wrapper, tools, and packaging scripts will all have to be 
adapted/developed to go against the new mapreduce model (more on how the 
two models differ in correspondence to follow).  While we could try 
carrying forward the two branches of nutchwax, keeping the two branches 
in sync -- especially when the underlying model differs so -- would take 
a lot of work.  Rather, let's just switch completely to mapreduce (and 
NDFS) going forward.

Advantages are outlined above.  The main disadvantage is that 
mapreduce-based nutch is still in the early stages of development.

I was thinking I'd make one more point release of nutchwax in the next 
week or so based on Nutch-0.7 to fix a couple of critical issues and to 
add functionality WERA will need for its next release -- see list below 
-- but that thereafter, the next major release would be a 
mapreduce/NDFS-based nutchwax, a version 0.6.0 (Perhaps coordinated with 
a mapreduce-based nutch release, likely 0.8.0).  I'd switch HEAD to be 
mapreduce immediately after the point release but I'd imagine it'd be 
the new year before sufficient work was done to release nutchwax 0.6.0.

Comments/thoughts on the above are appreciated, as is word of any 
critical nutchwax bug or feature you think is missing from the list below.

Yours,
St.Ack

Bugs:

1354276  [nutchwax] Still have URL encoding issues
    2005-11-11 10:27 | 8 | Open | stack-sf (Project Admin) | stack-sf (Project Admin)
    <http://sourceforge.net/tracker/index.php?func=detail&aid=1354276&group_id=118427&atid=681137>

1312200  [nutchwax+wera] Pages at end of redirects not found.
    2005-10-03 12:06 | 8 | Open | stack-sf (Project Admin) | stack-sf (Project Admin)
    <http://sourceforge.net/tracker/index.php?func=detail&aid=1312200&group_id=118427&atid=681137>

Features:

1247519  [nutchwax] next/previous in search.jsp needs improvement
    2005-07-29 08:33 | 8 | Open | nobody | stack-sf (Project Admin)
    <http://sourceforge.net/tracker/index.php?func=detail&aid=1247519&group_id=118427&atid=681140>
