|
From: Maximilian S. <sch...@ci...> - 2006-07-31 12:22:39
|
Hi *,

I want to do regular crawls of a bigger website. I've already crawled it successfully with heritrix, indexed the resulting arcs with nutchwax and also searched/browsed them with wera. Works pretty well!

Now I wanted to do a second crawl, but I've read that incremental indexing is broken in nutchwax 0.6 (which I'm using). I guess I need incremental indexing if I want to be able to search across all versions of the site?

Now I think I have three options:
1. wait until incremental indexing is fixed
2. use the 4.3 branch
3. index only the newly crawled arcs and let the user select on which date she wants to search

So my questions are:
- Is it foreseeable when incremental indexing will be fixed - and if so - what performance can I expect compared to completely reindexing all arc files?
- Will the 4.3 branch be maintained beside the 0.6 branch, and will it be possible to convert the webdb/indices later (doesn't seem to be the case right now)?

What solution would you suggest?

Thanks & best regards,
Max
|
From: Michael S. <st...@ar...> - 2006-07-31 20:56:53
|
Maximilian Schoefmann wrote:
> Hi *,
>
> I want to do regular crawls of a bigger website. I've already crawled it
> successfully with heritrix, indexed the resulting arcs with nutchwax and
> also searched/browsed them with wera. Works pretty well!
>
I'm glad to hear.
> Now I wanted to do a second crawl but I've read that incremental indexing
> is broken in nutchwax 0.6 (which I'm using).
>
It's not so much that it's broken. It's more that I don't yet have a good
story to tell on how to do incremental indexing in 0.6+ of nutchwax.
Here is what I currently know (I've been kind of waiting on getting more
practice under my belt before starting to write a recipe for others):
+ If there is no /index/ sub-directory in ${searcher.dir}, then the nutch
searcher NutchBean in the webapp opens all indices in the /indexes/
subdir. Usually, under /indexes/, there are subdirectories holding an
index per segment. I've tested mixing, in ${searcher.dir}/indexes, the
indexes of merged segments and individual segment indices. This works
as long as the indices under ${searcher.dir}/indexes have an (empty)
/index.done/ file added (merged indexes don't have this file present --
you may have to add it manually). So, you can ingest new ARCs, then add the
new segment to the crawldb, do a new link inversion (the new segment's
links will be added to the old linkdb, as for the crawldb), and then
index your new segment. When done, add the new index to
${searcher.dir}/indexes and perhaps move the old, big merged index there
(adding in the index.done file) and restart your webapp. You should be
able to search the old and new.
+ But you may find that you have to merge your new incremental
segment indices into the large index and then 'sort' the merged index to
get good, 'balanced' results. Sorting is a recent feature added to nutch
that allows sorting the index by rank so the highest-ranked pages are
returned first. Generally you sort to get the best results returned
faster than you would from the unsorted index. But I've observed that,
querying across multiple indices, one index may be favored. To fix this, I
found I had to merge and sort all indices (to sort, do
'$HADOOP_HOME/bin/hadoop jar nutchwax.jar class
org.apache.nutch.indexer.IndexSorter').
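The cycle above can be sketched as a shell script. Caveat: the per-step task names (import, update, invert, index) and the indexes-new output directory are assumptions for illustration, not taken from any particular nutchwax.jar; only the IndexSorter invocation is verbatim from above.

```shell
#!/bin/sh
# Sketch of one incremental update cycle. Task names and layout below are
# assumptions, not verified against the nutchwax.jar of the day.

# Promote a freshly built per-segment index into ${searcher.dir}/indexes
# and add the empty index.done marker that the NutchBean looks for
# (merged indexes lack it, so it has to be created by hand).
promote_index() {
  idx=$1
  searcher_dir=$2
  mkdir -p "$searcher_dir/indexes"
  mv "$idx" "$searcher_dir/indexes/"
  touch "$searcher_dir/indexes/$(basename "$idx")/index.done"
}

# One full cycle for a new batch of ARCs.
incremental_cycle() {
  arcs=$1
  crawl=$2
  searcher_dir=$3
  $HADOOP_HOME/bin/hadoop jar nutchwax.jar import "$arcs" "$crawl"  # ingest new ARCs
  $HADOOP_HOME/bin/hadoop jar nutchwax.jar update "$crawl"          # add new segment to crawldb
  $HADOOP_HOME/bin/hadoop jar nutchwax.jar invert "$crawl"          # new link inversion
  $HADOOP_HOME/bin/hadoop jar nutchwax.jar index "$crawl"           # index the new segment
  for seg_index in "$crawl"/indexes-new/*; do
    promote_index "$seg_index" "$searcher_dir"
  done
  # If one index is favored at query time, merge and sort instead:
  # $HADOOP_HOME/bin/hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexSorter
}
```

Restart the webapp after promoting indices so the NutchBean reopens them.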
> I guess I need incremental indexing if I want to be able to search across
> all versions of the site?
>
Yes. Sort of. The not-so-good news is that in new nutch(wax), the key
used during all of the mapreduce indexing steps is the URL (not
URL+date, but URL only). What this means is that only the latest version
of a page is searchable; unlike old nutch, you can't search a single URL
across all page versions. This feature was lost when we moved to new
nutch. Recently I made nutchwax use URL + collection as the key,
end-to-end, at indexing and at query time. This makes it so I can have the
same URL in the index multiple times, distinguished by collection. Next
will be to key by URL + date (see '[ 1518431 ] [nutchwax] Search
multiple versions of one URL broken'
http://sourceforge.net/tracker/index.php?func=detail&aid=1518431&group_id=118427&atid=681137).
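The version-loss from keying on URL alone can be mimicked with plain shell: collapse records on the URL field and every capture but one disappears (the URL and dates below are made up for illustration).

```shell
# Two captures of the same page; reducing on the URL field alone
# (as an URL-only key does) keeps a single record.
printf 'http://example.org/page 20060101\nhttp://example.org/page 20060701\n' |
  sort -u -k1,1 |
  wc -l
```

Only one of the two dated captures survives the reduce, which is why with a URL-only key just one version of a page remains searchable.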
> Now I think I have three options:
> 1. wait until incremental indexing is fixed
> 2. use the 4.3 branch
>
4.3 branch is dead and no longer supported.
> 3. index only the newly crawled arcs and let the user select on which date
> she want's to search
>
> So my questions are:
> - Is it foreseeable when incremental indexing will be fixed - and if -
> what performance can I expect compared to completely reindexing all arc
> files?
>
Soon (smile). It's about time we had a new NutchWAX release. Lots of
changes of late (not the least of which is that there is an official 0.8
nutch release). I'm currently working on making incremental updates
work for us internally. Once the internal client is satisfied, I'll
document and release. I'd SWAG a month.
> - Will the 4.3 branch be maintained beside the 0.6 branch and will it be
> possible to convert the webdb/indices later (doesn't seem to be the case
> right now)?
>
No to both questions.
St.Ack
|
|
From: Maximilian S. <sch...@ci...> - 2006-08-01 08:28:38
|
Thanks for the fast and informative answer!
> It's not so much that it's broken.
Good to hear. BTW, I'm now using a 0.7 version from the integration
server. Don't know if that changes anything.
> + If there is no /index/ sub-directory in ${searcher.dir}, then the nutch
> searcher NutchBean in the webapp opens all indices in the /indexes/
> subdir. Usually, under /indexes/, there are subdirectories holding an
> index per segment. I've tested mixing, in ${searcher.dir}/indexes, the
> indexes of merged segments and individual segment indices. This works
> as long as the indices under ${searcher.dir}/indexes have an (empty)
> /index.done/ file added (merged indexes don't have this file present --
> you may have to add it manually). So, you can ingest new ARCs, then add the
> new segment to the crawldb, do a new link inversion (the new segment's
> links will be added to the old linkdb, as for the crawldb), and then
> index your new segment. When done, add the new index to
> ${searcher.dir}/indexes and perhaps move the old, big merged index there
> (adding in the index.done file) and restart your webapp. You should be
> able to search the old and new.
> + But you may find that you have to merge your new incremental
> segment indices into the large index and then 'sort' the merged index to
> get good, 'balanced' results. Sorting is a recent feature added to nutch
> that allows sorting the index by rank so the highest-ranked pages are
> returned first. Generally you sort to get the best results returned
> faster than you would from the unsorted index. But I've observed that,
> querying across multiple indices, one index may be favored. To fix this, I
> found I had to merge and sort all indices (to sort, do
> '$HADOOP_HOME/bin/hadoop jar nutchwax.jar class
> org.apache.nutch.indexer.IndexSorter').
Seems complicated, but I will give it a try as soon as I have crawled a
few smaller arcs which don't take too long to index.
But as the URLs of the site I'm crawling don't change too often and
searching across multiple versions doesn't work right now, having a merged
index won't mean much to me.
Would it make sense to just name the collections after the crawl date to
be able to distinguish between different versions?
Regards,
Max
|
|
From: Maximilian S. <sch...@ci...> - 2006-10-27 17:32:20
|
I've distilled Michael Stack's instructions into a shell script which I'd like to share. It seems to work quite well for me, but I've only used it on smaller archives (several hundred MBs) with the latest NutchWAX (CVS HEAD) and under Cygwin. Please let me know if it works for you and whether you still find everything with the new indices:

http://www.cip.ifi.lmu.de/~schoefma/howto/incremental_indexing_with_nutchwax/incr_index.sh

Usage:
./incr_index.sh input_dir target_dir [collection_name]
or
./incr_index.sh --arcs dir_with_arc_files target_dir [collection_name]

Example:
./incr_index.sh --arcs heritrix/jobs/MyJob-12345/arcs myarch/output mycoll

Preconditions:
- HADOOP_HOME and NUTCHWAX_HOME must be set
- You need an existing index in "target_dir" to operate on, e.g. one generated by running NutchWAX's "all" task on a set of arc files.

Hints:
- Save your production index directory before running this script on it!
- When using Cygwin, use relative paths, especially for the input dir.
- Either shut down NutchWAX when running this script or operate on a copy of your live index (to avoid permission denied errors).

Return codes: This script returns exit codes which can be used by other scripts:
0 - Everything went fine
1 - Script failed to start (directory not found etc.)
2 - The importing/indexing process was already started and the index in the target directory might have been damaged. You should restore it from your backup in this case.

- Max
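A minimal sketch of how another script might honor these exit codes, wrapping the call with a backup/restore of the target index. The wrapper below is illustrative only; it assumes nothing about the script beyond the return codes listed above.

```shell
#!/bin/sh
# Run a command against a target index directory, keeping a backup and
# restoring it when the command reports exit code 2 (index possibly
# damaged). Purely illustrative; relies only on the documented exit codes.
run_with_backup() {
  target=$1
  shift
  cp -r "$target" "$target.bak"
  "$@"
  rc=$?
  if [ "$rc" -eq 2 ]; then
    # The index may be damaged -- throw it away and restore the backup.
    rm -rf "$target"
    mv "$target.bak" "$target"
  else
    rm -rf "$target.bak"
  fi
  return "$rc"
}

# e.g.: run_with_backup myarch/output \
#         ./incr_index.sh --arcs heritrix/jobs/MyJob-12345/arcs myarch/output mycoll
```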