Re: [Archive-access-discuss] Incremental indexing with nutchwax 0.6


Thanks for the fast and informative answer!

> It's not so much that it's broken.

Good to hear. By the way, I'm now using a 0.7 version from the
integration server. I don't know if that changes anything.

> + If there is no /index/ sub-directory in /${searcher.dir}/ then the nutch
> searcher NutchBean in the webapp opens all indices in the /indexes/
> subdir.  Usually, under /indexes/, there are subdirectories holding an
> /index/ per segment.  I've tested mixing in ${searcher.dir}/indexes the
> indexes of merged segments and individual segment indices.  This works
> as long as the indices under search.dir/indexes have an (empty)
> /index.done/ file added (merged indexes don't have this file present --
> you may have to add it manually). So, you can ingest new ARCs, then add
> the new segment to the crawldb, do a new link inversion (the new
> segment's links will be added to the old linkdb, as for the crawldb),
> and then index your new segment.  When done, add the new index to
> ${segment.dir}/indexes and perhaps move the old, big merged index here
> (adding in the index.done file) and restart your webapp. You should be
> able to search the old and new.
> + But you may find that you have to merge your new incremental
> segment indices into the large index and then 'sort' the merged index to
> get good, 'balanced' results. Sorting is a recent feature added to nutch
> that allows sorting the index by rank so the highest ranked pages are
> returned first.  Generally you sort to get the best results returned
> faster than you would from the unsorted index.  But, I've observed that
> when querying across multiple indices, one index may be favored.  To fix,
> I found I had to merge and sort all indices (To sort, do
> '$HADOOP_HOME/bin/hadoop jar nutchwax.jar class
> org.apache.nutch.indexer.IndexSorter').
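
For reference, the index.done step from the quoted text could be scripted
roughly like this. It is only a minimal Python sketch: it assumes every
subdirectory of the indexes directory is one index and that an empty
marker file is enough, as the quote says. The function name
add_index_done_markers is made up for illustration and is not part of
nutch or nutchwax.

```python
from pathlib import Path


def add_index_done_markers(indexes_dir):
    """Create an empty index.done file in each index subdirectory lacking one.

    indexes_dir is assumed to be the 'indexes' directory under the
    searcher dir. Returns the marker paths that were newly created.
    """
    created = []
    for index_dir in sorted(Path(indexes_dir).iterdir()):
        if not index_dir.is_dir():
            continue  # ignore stray files next to the index directories
        marker = index_dir / "index.done"
        if not marker.exists():
            marker.touch()  # empty marker file, per the quoted description
            created.append(marker)
    return created
```

Calling add_index_done_markers() with the path of the indexes directory
before restarting the webapp would cover merged indexes that lack the
file. Indexes that already have a marker are left untouched.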

Seems complicated, but I will give it a try as soon as I have crawled a
few smaller arcs which don't take too long to index.
But as the URLs of the site I'm crawling don't change too often and
searching across multiple versions doesn't work right now, having a merged
index won't mean much to me.

Would it make sense to just name the collections after the crawl date to
be able to distinguish between different versions?

Regards,
Max