Following up on a note posted earlier (reproduced below), we've just
uploaded release 0.4.2 of nutchwax -- the last release based on
Nutch-0.7. This new release has a few minor fixes, mostly build cleanup,
including making the binary work with Java 1.4.x, and fixes to make all
links relative rather than absolute so the webapp can sit behind an
apache proxy pass-through; e.g. see http://websearch.archive.org/ (The
Katrina collection search is also reachable off the archive's home page
as our first publicly accessible full-text search. Look for Katrina in
the top-left announcements). The default search.jsp now also has
Google-like paging through search results, thanks to Dan Avery. Finally,
we also added some small functionality needed by WERA.
From here on out, nutchwax will be dependent on the mapreduce-based
Nutch. As noted below, the move to mapreduce is a pretty radical break
with how things have been done in the past, but the intent is to provide
tools and documentation to smooth the transition. We will keep the list
updated with progress.
Yours,
St.Ack
Subject: [Archive-access-discuss] Nutchwax future: 0.6.0 to be mapreduce-based.
From: Michael Stack <st...@du...>
Date: Thu, 17 Nov 2005 18:00:16 -0800
To: arc...@li...
This note advocates that the next major nutchwax release be based on a
mapreduce version of nutch and use the nutch distributed filesystem,
NDFS (See references at end of the following paper if you want to learn
more about NDFS, mapreduce and mapreduce in nutch:
http://archive-access.sourceforge.net/projects/nutch/iwaw/iwaw-wacsearch.pdf).
Here at the Archive, we've been bumping up against the limitations of
the pre-mapreduce model of the nutch-0.7.x release while trying to index
collections of more than 100 million documents, so we've started moving
nutchwax to work atop NDFS and the coming mapreduce version of nutch
(See the 'mapred' branch in nutchwax and the 'mapred' branch in nutch).
Doug Cutting has been doing the bulk of this development work and the
running of initial indexings. So far, things look good. An index build
over a collection of 200 million URLs ran to completion atop a rack of
30-odd machines. While the indexing speed is currently not that
impressive -- the rack is composed of lowly processors and there remains
much room for optimization -- the amount of oversight required to
complete the task was close to zero (This is impressive).
We also want to move onto mapreduce and NDFS because we see it getting
us over incremental indexing woes we've run into using current nutchwax:
segments and the webdb have outgrown large disks and various steps in
the process take interminably long to complete.
A mapreduce-based nutchwax internally operates in a manner very
different from how current nutchwax works; documentation, plugins,
wrapper, tools, and packaging scripts will all have to be
adapted/developed to go against the new mapreduce model (More on how the
two models differ in correspondence to follow). While we could try
carrying forward the two branches of nutchwax, keeping the two branches
in sync -- especially when the underlying model differs so -- would take
a lot of work. Rather, let's just switch completely to mapreduce (and
NDFS) going forward.
Advantages are outlined above. The main disadvantage is that
mapreduce-based nutch is still in the early stages of development.
I was thinking I'd make one more point release of nutchwax in the next
week or so based on Nutch-0.7 to fix a couple of critical issues and to
add functionality WERA will need for its next release -- see list below
-- but that thereafter, the next major release would be a
mapreduce/NDFS-based nutchwax, a version 0.6.0 (Perhaps coordinated with
a mapreduce-based nutch release, likely 0.8.0). I'd switch HEAD to be
mapreduce immediately after the point release but I'd imagine it'd be
the new year before sufficient work was done to release nutchwax 0.6.0.
Comments/thoughts on the above are appreciated, as is word of any
critical nutchwax bug or feature you think is missing from the list
below.
Yours,
St.Ack
Bugs:
1354276 [nutchwax] Still have URL encoding issues
<http://sourceforge.net/tracker/index.php?func=detail&aid=1354276&group_id=118427&atid=681137>
2005-11-11 10:27 8 Open stack-sf stack-sf

1312200 [nutchwax+wera] Pages at end of redirects not found.
<http://sourceforge.net/tracker/index.php?func=detail&aid=1312200&group_id=118427&atid=681137>
2005-10-03 12:06 8 Open stack-sf stack-sf

Features:
1247519 [nutchwax] next/previous in search.jsp needs improvement
<http://sourceforge.net/tracker/index.php?func=detail&aid=1247519&group_id=118427&atid=681140>
2005-07-29 08:33 8 Open nobody stack-sf