With this release, NutchWAX is based on MapReduce Nutch, i.e. Nutch
0.8-dev+ (previous NutchWAX releases were based on Nutch 0.7.x). This
allows NutchWAX to scale to index even larger collections while
requiring less user intervention than was previously necessary. A
recent indexing run on a rack of ~33 dual-core 2GHz Athlons, each with
4GB of RAM, took 3.8 days to index, end-to-end, 50k ARCs containing
141 million documents (we'll post more stats to the list as we knock
off collection indexings).
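For a rough sense of scale, the figures above can be turned into
throughput rates. This is only a back-of-the-envelope sketch: the
function and variable names are made up for illustration, the node
count is taken as exactly 33 although the rack size is approximate,
and the load is assumed to be spread evenly over the whole run.

```python
# Back-of-the-envelope throughput from the figures quoted above.
# Assumptions: exactly 33 nodes (the text says ~33) and an even
# spread of work over the full 3.8 days.

SECONDS_PER_DAY = 24 * 60 * 60


def throughput(documents, arcs, days, nodes):
    """Return average indexing rates for a completed run."""
    seconds = days * SECONDS_PER_DAY
    hours = days * 24
    docs_per_second = documents / seconds
    return {
        "docs_per_second": docs_per_second,
        "docs_per_second_per_node": docs_per_second / nodes,
        "arcs_per_hour": arcs / hours,
        "docs_per_arc": documents / arcs,
    }


rates = throughput(documents=141_000_000, arcs=50_000, days=3.8, nodes=33)

for name, value in rates.items():
    print(f"{name}: {value:.1f}")
```

These are averages over the end-to-end run, so they say nothing about
how the time was divided among the individual indexing steps.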
Be aware that 0.6.0 bears little resemblance to previous releases, both
in how it goes about its work and in how it's run by the user. Be prepared
to leave aside all old NutchWAX assumptions. For an introduction, see
http://archive-access.sourceforge.net/projects/nutch/apidocs/overview-summary.html#getting_started
Release notes are available here:
http://archive-access.sourceforge.net/projects/nutch/articles/releasenotes.html
Note that indices made with earlier versions of NutchWAX will not be
compatible with 0.6.0.
Yours,
St.Ack