From: stack <st...@ar...> - 2006-02-17 16:49:52
Lukas Matejka wrote:
> First I made links with the 'setup' command, then I used 'segment' to create
> segments from arcs, and then I wanted to use 'links' to process pages and
> links into the webdb, but the 'links' command runs
>
> ${NUTCH}/bin/nutch admin "${nutchdb}" -create
> before updating the db from the segments and updating the segments back from the db.
>
> Shall I create a new WebDB or continue with the old one (for example, by disabling this
> create command)?
>
>
For incremental indexing, you will want to keep updating the one webdb
rather than creating it anew on each incremental run, so yes, modify
the indexarcs.sh script so it doesn't invoke the create of the webdb. You
will likely also need to change the steps that follow so that they pass
only the segments that are part of the incremental update set rather
than all segments (currently it's written as 'segments/*').
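
A minimal sketch of that change, not the actual indexarcs.sh. It assumes the webdb is a directory at ${nutchdb}, that each segment is a subdirectory of a segments directory, and that a plain-text file (hypothetical, one segment path per line) records which segments earlier runs already processed. The function names are made up for illustration; only the 'admin ... -create' invocation comes from the script as quoted above.

```sh
#!/bin/sh
# Hypothetical helpers for an incremental run; names and file layout are
# assumptions, not part of the shipped indexarcs.sh.

# Create the webdb only when it is missing, so that later runs keep
# updating the same one instead of starting over.
create_webdb_if_missing() {
    nutchdb="$1"
    if [ -d "${nutchdb}" ]; then
        echo "reusing existing webdb: ${nutchdb}"
    else
        echo "creating webdb: ${nutchdb}"
        # Same create command the 'links' step currently runs every time.
        "${NUTCH}/bin/nutch" admin "${nutchdb}" -create
    fi
}

# Print the segment directories under $1 that are not yet listed in the
# "already processed" file $2 (one path per line; the file may not exist).
new_segments() {
    segments_dir="$1"
    processed_list="$2"
    for seg in "${segments_dir}"/*; do
        # Skip plain files and the unexpanded glob of an empty directory.
        [ -d "${seg}" ] || continue
        if [ -f "${processed_list}" ] &&
           grep -Fxq "${seg}" "${processed_list}"; then
            continue
        fi
        echo "${seg}"
    done
}
```

The later steps would then loop over the output of new_segments instead of 'segments/*', and append each segment path to the processed file once it has been handled. Tracking processed segments in a file rather than by timestamp is just one option; it keeps a rerun after a failure from skipping anything.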
Tell us more about the size of your incremental updates. How frequently
are you planning to do them, and how much data are you adding? Our
experience trying to do frequent updates has not been good: index
merging and webdb updating can both take a long time to complete. Tell
me more about the update rates you are considering, and in the meantime I'll
try to get some figures on our experience posted.
The story should be better in the new nutch, though I guess index merging works
much as it did before, being a pure lucene operation.
On the status of the NutchWAX 0.6.0 release (NutchWAX running on top of MapReduce
nutch): development is going well. We've been using a rack of 35 or so
(very) slow processors to test indexing collections of 100M and more.
We're having some robustness and performance issues, but they are being
addressed. We're still looking at an end-of-March/start-of-April
release. Will keep the list posted.
St.Ack