From: Michael S. <st...@ar...> - 2006-05-01 22:36:26
With this release, NutchWAX is built on the MapReduce version of Nutch, i.e. Nutch 0.8-dev+ (previous NutchWAX releases were based on Nutch 0.7.x). This allows NutchWAX to scale to index even larger collections while requiring less user intervention than was previously necessary. A recent indexing run on a rack of ~33 dual-core 2 GHz Athlons, each with 4 GB of RAM, took 3.8 days to index, end to end, 50k ARCs containing 141 million documents. (We'll post more stats to the list as we complete further collection indexings.)

Be aware that 0.6.0 bears little resemblance to previous releases, both in how it goes about its work and in how it's run by the user. Be prepared to leave aside all old NutchWAX assumptions.

For an introduction, see
http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html#getting_started

Release notes are available here:
http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html

Note that indices made with earlier versions of NutchWAX will not be compatible with 0.6.0.

Yours,
St.Ack