From: stack <st...@ar...> - 2005-11-21 19:31:51
kau...@cs... wrote:

>Hi,
>we are planning here a new hardware configuration for February 2006.
>It will have big disks, but such a directory tree that maximum size for
>a leaf directory is 2 TB.
>
>I suppose that this will be no problem for future Nutchwax, because it
>can read arcs from a directory structure instead of a single large
>directory and keep the index in pieces not larger than 2 TB ?
>

As you say, the 2TB limit should not be a problem.

With current nutchwax/nutch, the only part that could exceed the limit
is the webdb, because everything else on disk can be partitioned. I'm
guessing you'd have to have a collection of more than 300 million
documents to bump up against the 2TB limit.

When we move nutchwax to the mapreduce/NDFS-based nutchwax, all indexing
data and the link db will be kept in the distributed file system. If the
NDFS footprint on a particular disk threatens the 2TB limit, add more
machines (or new directories) to NDFS.

For performance, it will be better to keep the merged index outside of
NDFS on the native filesystem. Long before the merged index approaches
the 2TB maximum, it will make sense to distribute querying across
multiple instances of the query webapp.

St.Ack

>I've been reading about 'Full indexing' from Nutchwax operation
>instructions. But I haven't tried indexing *big* archives in practice.
>
>On 11/18/2005, "Michael Stack" <st...@du...> wrote:
>
>>This note advocates that the next major nutchwax release be based on a
>>mapreduce version of nutch and use the nutch distributed filesystem,
>>NDFS (See references at end of the following paper if you want to learn
>>more about NDFS, mapreduce and mapreduce in nutch:
>>http://archive-access.sourceforge.net/projects/nutch/iwaw/iwaw-wacsearch.pdf).
>>
>>Here at the Archive, we've been bumping up against the limitations of
>>the nutch-0.7.x release pre-mapreduce model trying to index collections
>>of more than 100 million documents, so we've started moving nutchwax to
>>work atop NDFS and the coming mapreduce version of nutch (See the
>>'mapred' branch in nutchwax and the 'mapred' branch in nutch). Doug
>>Cutting has been doing the bulk of this development work and the
>>running of initial indexings. So far, things look good. An index made
>>of a collection of 200 million URLs ran to completion atop a rack of 30
>>odd machines. While the indexing speed is currently not that impressive
>>-- the rack is comprised of lowly processors and there remains much
>>room for optimization -- the amount of oversight required to complete
>>the task was close to zero (This is impressive).
>>
>>We also want to move onto mapreduce and NDFS because we see it getting
>>us over incremental indexing woes we've run into using current nutchwax:
>>segments and the webdb have outgrown large disks and various steps in
>>the process take interminably long to complete.
>>
>>A mapreduce-based nutchwax internally operates in a manner very
>>different from how current nutchwax works; documentation, plugins,
>>wrapper, tools, and packaging scripts will all have to be
>>adapted/developed to go against the new mapreduce model (More on how the
>>two models differ in correspondence to follow). While we could try
>>carrying forward the two branches of nutchwax, keeping the two branches
>>in sync -- especially when the underlying model differs so -- would take
>>a lot of work. Rather, let's just switch completely to mapreduce (and
>>NDFS) going forward.
>>
>>Advantages are outlined above. The main disadvantage is that
>>mapreduce-based nutch is still in the early stages of development.
>>
>>I was thinking I'd make one more point release of nutchwax in the next
>>week or so based on Nutch-0.7 to fix a couple of critical issues and to
>>add functionality WERA will need for its next release -- see list below
>>-- but that thereafter, the next major release would be a
>>mapreduce/NDFS-based nutchwax, a version 0.6.0 (Perhaps coordinated with
>>a mapreduce-based nutch release, likely 0.8.0). I'd switch HEAD to be
>>mapreduce immediately after the point release but I'd imagine it'd be
>>the new year before sufficient work was done to release nutchwax 0.6.0.
>>
>>Comments and thoughts on the above are appreciated, as is word of any
>>critical nutchwax bug or feature missing from the list below.
>>
>>Yours,
>>St.Ack
>>
>>
>>Bugs:
>>
>>1354276 [nutchwax] Still have URL encoding issues
>><http://sourceforge.net/tracker/index.php?func=detail&aid=1354276&group_id=118427&atid=681137>
>>  2005-11-11 10:27  8  Open  stack-sf  stack-sf
>>
>>1312200 [nutchwax+wera] Pages at end of redirects not found.
>><http://sourceforge.net/tracker/index.php?func=detail&aid=1312200&group_id=118427&atid=681137>
>>  ** 2005-10-03 12:06 *  8  Open  stack-sf  stack-sf
>>
>>
>>Features:
>>
>>1247519 [nutchwax] next/previous in search.jsp needs improvement
>><http://sourceforge.net/tracker/index.php?func=detail&aid=1247519&group_id=118427&atid=681140>
>>  ** 2005-07-29 08:33 *  8  Open  nobody  stack-sf
>>
>>
>>-------------------------------------------------------
>>This SF.Net email is sponsored by the JBoss Inc. Get Certified Today
>>Register for a JBoss Training Course. Free Certification Exam
>>for All Training Attendees Through End of 2005. For more info visit:
>>http://ads.osdn.com/?ad_id=7628&alloc_id=16845&op=click
>>_______________________________________________
>>Archive-access-discuss mailing list
>>Arc...@li...
>>https://lists.sourceforge.net/lists/listinfo/archive-access-discuss
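
A note on the webdb estimate in the reply at the top of this message: the guess that it takes more than 300 million documents to reach the 2TB limit implies a rough per-document webdb footprint. The sketch below works through that arithmetic. It assumes webdb size grows linearly with document count and uses decimal terabytes. Both are assumptions made here for illustration and are not stated in the thread. The function names are hypothetical.

```python
# Back-of-envelope sketch of the webdb estimate in the reply.
# Assumption (not from the thread): webdb size scales linearly with
# the number of documents in the collection.

TB = 10**12  # decimal terabyte; a binary TiB (2**40) gives slightly larger figures

LIMIT_BYTES = 2 * TB          # per-directory limit mentioned in the thread
DOCS_AT_LIMIT = 300_000_000   # the reply's guess for where the webdb nears the limit


def implied_bytes_per_doc(limit_bytes: int = LIMIT_BYTES,
                          docs: int = DOCS_AT_LIMIT) -> float:
    """Average webdb bytes per document implied by the guess."""
    return limit_bytes / docs


def estimated_webdb_bytes(docs: int) -> float:
    """Linear extrapolation of webdb size for a collection of `docs` documents."""
    return docs * implied_bytes_per_doc()


if __name__ == "__main__":
    per_doc = implied_bytes_per_doc()
    print(f"implied webdb footprint: about {per_doc / 1000:.1f} kB per document")
    for docs in (100_000_000, 200_000_000, 300_000_000):
        size_tb = estimated_webdb_bytes(docs) / TB
        print(f"{docs:>11,} documents -> roughly {size_tb:.2f} TB")
```

Under these assumptions the guess works out to somewhat under 7 kB of webdb per document, so a 100 million document collection would sit at about a third of the limit. Treat the numbers as an order-of-magnitude check only, since the 300 million figure is itself a guess.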