From: stack <st...@ar...> - 2005-11-29 01:37:09
Following up on a note posted earlier (reproduced below), we've just uploaded release 0.4.2 of nutchwax, the last release based on Nutch-0.7. This new release has a few minor fixes, mostly build cleanup, including making the binary work with Java 1.4.x and making all links relative rather than absolute so the webapp can sit behind an Apache proxy pass-through. For an example, see http://websearch.archive.org/ (the Katrina collection search is also reachable off the archive's home page as our first publicly accessible full-text search; look for Katrina in the top-left announcements). The default search.jsp now also has Google-like paging through search results, thanks to Dan Avery. Finally, we also added some small functionality needed by WERA.

From here on out, nutchwax will be dependent on the mapreduce-based Nutch. As noted below, the move to mapreduce is a pretty radical break with how things have been done in the past, but the intent is to provide tools and documentation to smooth the transition. Will keep the list updated with progress.

Yours,
St.Ack

Subject: [Archive-access-discuss] Nutchwax future: 0.6.0 to be mapreduce-based.
From: Michael Stack <st...@du...>
Date: Thu, 17 Nov 2005 18:00:16 -0800
To: arc...@li...

This note advocates that the next major nutchwax release be based on a mapreduce version of nutch and use the nutch distributed filesystem, NDFS (see the references at the end of the following paper if you want to learn more about NDFS, mapreduce, and mapreduce in nutch: http://archive-access.sourceforge.net/projects/nutch/iwaw/iwaw-wacsearch.pdf).

Here at the Archive, we've been bumping up against the limitations of the pre-mapreduce model of the nutch-0.7.x release while trying to index collections of more than 100 million documents, so we've started moving nutchwax to work atop NDFS and the coming mapreduce version of nutch (see the 'mapred' branch in nutchwax and the 'mapred' branch in nutch).
Doug Cutting has been doing the bulk of this development work and running the initial indexings. So far, things look good. An index of a collection of 200 million URLs ran to completion atop a rack of 30-odd machines. While the indexing speed is currently not that impressive (the rack is made up of lowly processors and there remains much room for optimization), the amount of oversight required to complete the task was close to zero, which is impressive. We also want to move onto mapreduce and NDFS because we see it getting us over the incremental indexing woes we've run into using current nutchwax: segments and the webdb have outgrown large disks, and various steps in the process take interminably long to complete.

A mapreduce-based nutchwax internally operates in a manner very different from how current nutchwax works; documentation, plugins, wrapper, tools, and packaging scripts will all have to be adapted/developed to go against the new mapreduce model (more on how the two models differ in correspondence to follow). While we could try carrying forward the two branches of nutchwax, keeping them in sync, especially when the underlying model differs so, would take a lot of work. Rather, let's just switch completely to mapreduce (and NDFS) going forward. The advantages are outlined above. The main disadvantage is that mapreduce-based nutch is still in the early stages of development.

I was thinking I'd make one more point release of nutchwax in the next week or so based on Nutch-0.7 to fix a couple of critical issues and to add functionality WERA will need for its next release (see the list below), but that thereafter the next major release would be a mapreduce/NDFS-based nutchwax, a version 0.6.0 (perhaps coordinated with a mapreduce-based nutch release, likely 0.8.0). I'd switch HEAD to be mapreduce immediately after the point release, but I'd imagine it'd be the new year before sufficient work was done to release nutchwax 0.6.0.
Comments/thoughts on the above are appreciated, as is a note if you think a critical nutchwax bug or feature is missing from the list below.

Yours,
St.Ack

Bugs:

1354276 [nutchwax] Still have URL encoding issues
<http://sourceforge.net/tracker/index.php?func=detail&aid=1354276&group_id=118427&atid=681137>
2005-11-11 10:27, 8, Open, stack-sf (Project Admin), stack-sf (Project Admin)

1312200 [nutchwax+wera] Pages at end of redirects not found.
<http://sourceforge.net/tracker/index.php?func=detail&aid=1312200&group_id=118427&atid=681137>
2005-10-03 12:06, 8, Open, stack-sf (Project Admin), stack-sf (Project Admin)

Features:

1247519 [nutchwax] next/previous in search.jsp needs improvement
<http://sourceforge.net/tracker/index.php?func=detail&aid=1247519&group_id=118427&atid=681140>
2005-07-29 08:33, 8, Open, nobody, stack-sf (Project Admin)