From: Michael S. <st...@ar...> - 2006-07-31 20:56:53
Maximilian Schoefmann wrote:
> Hi *,
>
> I want to do regular crawls of a bigger website. I've already crawled it
> successfully with heritrix, indexed the resulting arcs with nutchwax and
> also searched/browsed them with wera. Works pretty well!
>
I'm glad to hear it.
> Now I wanted to do a second crawl but I've read that incremental indexing
> is broken in nutchwax 0.6 (which I'm using).
>
It's not so much that it's broken. It's more that I don't yet have a good
story to tell on how to do incremental indexing in 0.6+ of nutchwax.
Here is what I currently know (I've been kinda waiting on getting more
practice under my belt before starting to write a recipe for others):
+ If there is no index sub-directory in ${searcher.dir}, then the nutch
searcher NutchBean in the webapp opens all indices in the indexes
subdir. Usually, under indexes, there are subdirectories holding an
index per segment. I've tested mixing in ${searcher.dir}/indexes the
indexes of merged segments and individual segment indices. This works
as long as the indices under ${searcher.dir}/indexes have an (empty)
index.done file added (merged indexes don't have this file present;
you may have to add it manually). So, you can ingest new ARCs, then add
the new segment to the crawldb, do a new link inversion (the new
segment's links will be added to the old linkdb, as for the crawldb),
and then index your new segment. When done, add the new index to
${searcher.dir}/indexes and perhaps move the old, big merged index here
(adding in the index.done file) and restart your webapp. You should be
able to search the old and new.
+ But you may find that you have to merge your new incremental
segment indices into the large index and then 'sort' the merged index to
get good, 'balanced' results. Sorting is a recent feature added to nutch
that orders the index by rank so the highest ranked pages are
returned first. Generally you sort to get the best results returned
faster than you would from the unsorted index. But I've observed that
when querying across multiple indices, one index may be favored. To fix
this, I found I had to merge and sort all indices (to sort, do
'$HADOOP_HOME/bin/hadoop jar nutchwax.jar class
org.apache.nutch.indexer.IndexSorter').
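
The index.done step above is easy to forget when moving a merged index
into place, so here is a minimal sketch of a helper that adds the empty
marker file wherever it is missing. It is not part of nutch or nutchwax;
the function name mark_indexes_done is made up for illustration, and it
assumes the marker belongs directly inside each index directory under
${searcher.dir}/indexes.

```python
from pathlib import Path


def mark_indexes_done(searcher_dir):
    """Create an empty index.done in every index directory that lacks one.

    Assumption: each immediate subdirectory of <searcher_dir>/indexes is
    an index directory and the marker file goes directly inside it.
    Returns the list of marker files that were newly created.
    """
    indexes = Path(searcher_dir) / "indexes"
    created = []
    for index_dir in sorted(p for p in indexes.iterdir() if p.is_dir()):
        marker = index_dir / "index.done"
        if not marker.exists():
            marker.touch()  # an empty file is all that is needed
            created.append(marker)
    return created
```

Calling mark_indexes_done("/path/to/searcher.dir") before restarting the
webapp would leave existing markers alone and only add the missing ones.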
> I guess I need incremental indexing if I want to be able to search across
> all versions of the site?
>
Yes. Sort of. The not so good news is that in new nutch(wax), the key
used in all of the mapreduce indexing steps is the URL (not
URL+date but URL only). What this means is that only the latest version
of a page is searchable; unlike old nutch, you can't search a single URL
across all page versions. This feature was lost when we moved on to new
nutch. Recently I made nutchwax use URL + collection as the key
end-to-end, both during indexing and at query time. This makes it so I
can have the same URL in the index multiple times, distinguished by
collection. Next will be to key by URL + date (see '[ 1518431 ]
[nutchwax] Search multiple versions of one URL broken',
http://sourceforge.net/tracker/index.php?func=detail&aid=1518431&group_id=118427&atid=681137).
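
To make the effect of the key choice concrete, here is a toy model in
Python. It is not nutchwax code: the record fields, the example URL and
collection names, and the function latest_by_key are all invented for
illustration. It only mimics the behavior described above, where records
sharing a key collapse to the latest one.

```python
def latest_by_key(records, key_fields):
    """Keep one record per key; on a key clash the later date wins."""
    kept = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key not in kept or rec["date"] > kept[key]["date"]:
            kept[key] = rec
    return list(kept.values())


# Two captures of the same (hypothetical) URL from two crawls.
records = [
    {"url": "http://example.org/", "collection": "crawl1", "date": "20060601"},
    {"url": "http://example.org/", "collection": "crawl2", "date": "20060701"},
]

print(len(latest_by_key(records, ["url"])))                # 1: latest only
print(len(latest_by_key(records, ["url", "collection"])))  # 2: one per collection
print(len(latest_by_key(records, ["url", "date"])))        # 2: one per capture
```

With URL as the only key the older capture disappears; adding the
collection or the date to the key keeps both versions searchable.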
> Now I think I have three options:
> 1. wait until incremental indexing is fixed
> 2. use the 4.3 branch
>
4.3 branch is dead and no longer supported.
> 3. index only the newly crawled arcs and let the user select on which date
> she wants to search
>
> So my questions are:
> - Is it foreseeable when incremental indexing will be fixed, and if so,
> what performance can I expect compared to completely reindexing all arc
> files?
>
Soon (smile). It's about time we had a new NutchWAX release. Lots of
changes of late (not the least of which is that there is an official 0.8
nutch release). I'm currently working on making incremental updates
work for us internally. Once the internal client is satisfied, I'll
document and release. I'd SWAG a month.
> - Will the 4.3 branch be maintained beside the 0.6 branch and will it be
> possible to convert the webdb/indices later (doesn't seem to be the case
> right now)?
>
No to both questions.
St.Ack