From: <kau...@cs...> - 2005-11-21 08:43:37
Hi,

We are planning a new hardware configuration here for February 2006. It will have big disks, but the directory tree will be laid out so that the maximum size of a leaf directory is 2 TB. I suppose this will be no problem for a future Nutchwax, because it can read ARCs from a directory structure instead of a single large directory and keep the index in pieces no larger than 2 TB?

I've been reading about 'Full indexing' in the Nutchwax operation instructions, but I haven't tried indexing *big* archives in practice.

On 11/18/2005, "Michael Stack" <st...@du...> wrote:
> This note advocates that the next major nutchwax release be based on a
> mapreduce version of nutch and use the nutch distributed filesystem,
> NDFS (See references at end of the following paper if you want to learn
> more about NDFS, mapreduce and mapreduce in nutch:
> http://archive-access.sourceforge.net/projects/nutch/iwaw/iwaw-wacsearch.pdf).
>
> Here at the Archive, we've been bumping up against the limitations of
> the nutch-0.7.x release pre-mapreduce model trying to index collections
> of more than 100 million documents, so we've started moving nutchwax to
> work atop NDFS and the coming mapreduce version of nutch (See the
> 'mapred' branch in nutchwax and the 'mapred' branch in nutch). Doug
> Cutting has been doing the bulk of this development work and the running
> of initial indexings. So far, things look good. An index made of a
> collection of 200 million URLs ran to completion atop a rack of 30-odd
> machines. While the indexing speed is currently not that impressive --
> the rack is comprised of lowly processors and there remains much room
> for optimization -- the amount of oversight required to complete the
> task was close to zero (This is impressive).
>
> We also want to move onto mapreduce and NDFS because we see it getting
> us over incremental indexing woes we've run into using current nutchwax:
> segments and the webdb have outgrown large disks and various steps in
> the process take interminably long to complete.
>
> A mapreduce-based nutchwax internally operates in a manner very
> different from how current nutchwax works; documentation, plugins,
> wrapper, tools, and packaging scripts will all have to be
> adapted/developed to go against the new mapreduce model (More on how the
> two models differ in correspondence to follow). While we could try
> carrying forward the two branches of nutchwax, keeping the two branches
> in sync -- especially when the underlying model differs so -- would take
> a lot of work. Rather, let's just switch completely to mapreduce (and
> NDFS) going forward.
>
> Advantages are outlined above. The main disadvantage is that
> mapreduce-based nutch is still in the early stages of development.
>
> I was thinking I'd make one more point release of nutchwax in the next
> week or so based on Nutch-0.7 to fix a couple of critical issues and to
> add functionality WERA will need for its next release -- see list below
> -- but that thereafter, the next major release would be a
> mapreduce/NDFS-based nutchwax, a version 0.6.0 (Perhaps coordinated with
> a mapreduce-based nutch release, likely 0.8.0). I'd switch HEAD to be
> mapreduce immediately after the point release but I'd imagine it'd be
> the new year before sufficient work was done to release nutchwax 0.6.0.
>
> Comments/thoughts on the above are appreciated, as is a note if you
> think there is a critical nutchwax bug or feature missing from the list
> below.
>
> Yours,
> St.Ack
>
>
> Bugs:
>
> 1354276  [nutchwax] Still have URL encoding issues
>   <http://sourceforge.net/tracker/index.php?func=detail&aid=1354276&group_id=118427&atid=681137>
>   2005-11-11 10:27  8  Open  stack-sf  stack-sf
>
> 1312200  [nutchwax+wera] Pages at end of redirects not found.
>   <http://sourceforge.net/tracker/index.php?func=detail&aid=1312200&group_id=118427&atid=681137>
>   2005-10-03 12:06  8  Open  stack-sf  stack-sf
>
>
> Features:
>
> 1247519  [nutchwax] next/previous in search.jsp needs improvement
>   <http://sourceforge.net/tracker/index.php?func=detail&aid=1247519&group_id=118427&atid=681140>
>   2005-07-29 08:33  8  Open  nobody  stack-sf
>
> _______________________________________________
> Archive-access-discuss mailing list
> Arc...@li...
> https://lists.sourceforge.net/lists/listinfo/archive-access-discuss