|
From: Maximilian S. <sch...@ci...> - 2006-07-31 12:22:39
|
Hi *,

I want to do regular crawls of a bigger website. I've already crawled it successfully with heritrix, indexed the resulting arcs with nutchwax and also searched/browsed them with wera. Works pretty well!

Now I wanted to do a second crawl, but I've read that incremental indexing is broken in nutchwax 0.6 (which I'm using). I guess I need incremental indexing if I want to be able to search across all versions of the site?

Now I think I have three options:
1. wait until incremental indexing is fixed
2. use the 4.3 branch
3. index only the newly crawled arcs and let the user select on which date she wants to search

So my questions are:
- Is it foreseeable when incremental indexing will be fixed - and if so - what performance can I expect compared to completely reindexing all arc files?
- Will the 4.3 branch be maintained beside the 0.6 branch, and will it be possible to convert the webdb/indices later (doesn't seem to be the case right now)?

What solution would you suggest?

Thanks & best regards,
Max
|
From: Michael S. <st...@ar...> - 2006-07-31 20:56:53
|
Maximilian Schoefmann wrote:
> Hi *,
>
> I want to do regular crawls of a bigger website. I've already crawled it
> successfully with heritrix, indexed the resulting arcs with nutchwax and
> also searched/browsed them with wera. Works pretty well!
>
I'm glad to hear.
> Now I wanted to do a second crawl but I've read that incremental indexing
> is broken in nutchwax 0.6 (which I'm using).
>
It's not so much that it's broken. It's more that I don't yet have a good
story to tell on how to do incremental indexing in 0.6+ of nutchwax.
Here is what I currently know (I've been kind of waiting on getting more
practice under my belt before starting to write a recipe for others):
+ If there is no /index/ sub-directory in ${searcher.dir}, then the nutch
searcher NutchBean in the webapp opens all indices in the /indexes/
subdir. Usually, under /indexes/, there are subdirectories holding an
index per segment. I've tested mixing, in ${searcher.dir}/indexes, the
indexes of merged segments and individual segment indices. This works
as long as the indices under ${searcher.dir}/indexes have an (empty)
/index.done/ file added (merged indexes don't have this file present --
you may have to add it manually). So, you can ingest new ARCs, then add the
new segment to the crawldb, do a new link inversion (the new segment's
links will be added to the old linkdb, as for the crawldb), and then
index your new segment. When done, add the new index to
${searcher.dir}/indexes and perhaps move the old, big merged index there
(adding in the index.done file) and restart your webapp. You should be
able to search the old and new.
+ But you may find that you have to merge your new incremental
segment indices into the large index and then 'sort' the merged index to
get good, 'balanced' results. Sorting is a recent feature added to nutch
that allows sorting the index by rank so the highest-ranked pages are
returned first. Generally you sort to get the best results returned
faster than you would from the unsorted index. But I've observed that,
querying across multiple indices, one index may be favored. To fix this, I
found I had to merge and sort all indices (to sort, do
'$HADOOP_HOME/bin/hadoop jar nutchwax.jar class
org.apache.nutch.indexer.IndexSorter').
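The cycle above can be sketched as a shell script. Caveat: the per-step task names (import, update, invert, index) and the indexes-new output directory are assumptions for illustration, not taken from any particular nutchwax.jar; only the IndexSorter invocation is verbatim from above.

```shell
#!/bin/sh
# Sketch of one incremental update cycle. Task names and layout below are
# assumptions, not verified against the nutchwax.jar of the day.

# Promote a freshly built per-segment index into ${searcher.dir}/indexes
# and add the empty index.done marker that the NutchBean looks for
# (merged indexes lack it, so it has to be created by hand).
promote_index() {
  idx=$1
  searcher_dir=$2
  mkdir -p "$searcher_dir/indexes"
  mv "$idx" "$searcher_dir/indexes/"
  touch "$searcher_dir/indexes/$(basename "$idx")/index.done"
}

# One full cycle for a new batch of ARCs.
incremental_cycle() {
  arcs=$1
  crawl=$2
  searcher_dir=$3
  $HADOOP_HOME/bin/hadoop jar nutchwax.jar import "$arcs" "$crawl"  # ingest new ARCs
  $HADOOP_HOME/bin/hadoop jar nutchwax.jar update "$crawl"          # add new segment to crawldb
  $HADOOP_HOME/bin/hadoop jar nutchwax.jar invert "$crawl"          # new link inversion
  $HADOOP_HOME/bin/hadoop jar nutchwax.jar index "$crawl"           # index the new segment
  for seg_index in "$crawl"/indexes-new/*; do
    promote_index "$seg_index" "$searcher_dir"
  done
  # If one index is favored at query time, merge and sort instead:
  # $HADOOP_HOME/bin/hadoop jar nutchwax.jar class org.apache.nutch.indexer.IndexSorter
}
```

Restart the webapp after promoting indices so the NutchBean reopens them.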
> I guess I need incremental indexing if I want to be able to search across
> all versions of the site?
>
Yes. Sort of. The not-so-good news is that in new nutch(wax), the key
used during all of the mapreduce indexing steps is the URL (not
URL+date, but URL only). What this means is that only the latest version
of a page is searchable; unlike old nutch, you can't search a single URL
across all page versions. This feature was lost when we moved to new
nutch. Recently I made nutchwax use URL + collection as the key,
end-to-end, at indexing and at query time. This makes it so I can have the
same URL in the index multiple times, distinguished by collection. Next
will be to key by URL + date (see '[ 1518431 ] [nutchwax] Search
multiple versions of one URL broken'
http://sourceforge.net/tracker/index.php?func=detail&aid=1518431&group_id=118427&atid=681137).
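The version-loss from keying on URL alone can be mimicked with plain shell: collapse records on the URL field and every capture but one disappears (the URL and dates below are made up for illustration).

```shell
# Two captures of the same page; reducing on the URL field alone
# (as an URL-only key does) keeps a single record.
printf 'http://example.org/page 20060101\nhttp://example.org/page 20060701\n' |
  sort -u -k1,1 |
  wc -l
```

Only one of the two dated captures survives the reduce, which is why with a URL-only key just one version of a page remains searchable.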
> Now I think I have three options:
> 1. wait until incremental indexing is fixed
> 2. use the 4.3 branch
>
4.3 branch is dead and no longer supported.
> 3. index only the newly crawled arcs and let the user select on which date
> she want's to search
>
> So my questions are:
> - Is it foreseeable when incremental indexing will be fixed - and if -
> what performance can I expect compared to completely reindexing all arc
> files?
>
Soon (smile). It's about time we had a new NutchWAX release. Lots of
changes of late (not the least of which is that there is an official 0.8
nutch release). I'm currently working on making incremental updates
work for us internally. Once the internal client is satisfied, I'll
document and release. I'd SWAG a month.
> - Will the 4.3 branch be maintained beside the 0.6 branch and will it be
> possible to convert the webdb/indices later (doesn't seem to be the case
> right now)?
>
No to both questions.
St.Ack
|
|
From: Maximilian S. <sch...@ci...> - 2006-08-01 08:28:38
|
Thanks for the fast and informative answer!
> It's not so much that it's broken.
Good to hear. BTW, I'm now using a 0.7 version from the integration
server. Don't know if that changes anything.
> + If there is no /index/ sub-directory in ${searcher.dir}, then the nutch
> searcher NutchBean in the webapp opens all indices in the /indexes/
> subdir. Usually, under /indexes/, there are subdirectories holding an
> index per segment. I've tested mixing, in ${searcher.dir}/indexes, the
> indexes of merged segments and individual segment indices. This works
> as long as the indices under ${searcher.dir}/indexes have an (empty)
> /index.done/ file added (merged indexes don't have this file present --
> you may have to add it manually). So, you can ingest new ARCs, then add the
> new segment to the crawldb, do a new link inversion (the new segment's
> links will be added to the old linkdb, as for the crawldb), and then
> index your new segment. When done, add the new index to
> ${searcher.dir}/indexes and perhaps move the old, big merged index there
> (adding in the index.done file) and restart your webapp. You should be
> able to search the old and new.
> + But you may find that you have to merge your new incremental
> segment indices into the large index and then 'sort' the merged index to
> get good, 'balanced' results. Sorting is a recent feature added to nutch
> that allows sorting the index by rank so the highest-ranked pages are
> returned first. Generally you sort to get the best results returned
> faster than you would from the unsorted index. But I've observed that,
> querying across multiple indices, one index may be favored. To fix this, I
> found I had to merge and sort all indices (to sort, do
> '$HADOOP_HOME/bin/hadoop jar nutchwax.jar class
> org.apache.nutch.indexer.IndexSorter').
Seems complicated, but I will give it a try as soon as I have crawled a
few smaller arcs which don't take too long to index.
But as the URLs of the site I'm crawling don't change too often and
searching across multiple versions doesn't work right now, having a merged
index won't mean much to me.
Would it make sense to just name the collections after the crawl date to
be able to distinguish between different versions?
Regards,
Max
|
|
From: Maximilian S. <sch...@ci...> - 2006-10-27 17:32:20
|
I've distilled Michael Stack's instructions into a shell script which I'd like to share. It seems to work quite well for me, but I've only used it on smaller archives (several hundred MBs) with the latest NutchWAX (CVS HEAD) and under Cygwin. Please let me know if it works for you and whether you still find everything with the new indices:

http://www.cip.ifi.lmu.de/~schoefma/howto/incremental_indexing_with_nutchwax/incr_index.sh

Usage:
./incr_index.sh input_dir target_dir [collection_name]
or
./incr_index.sh --arcs dir_with_arc_files target_dir [collection_name]

Example:
./incr_index.sh --arcs heritrix/jobs/MyJob-12345/arcs myarch/output mycoll

Preconditions:
- HADOOP_HOME and NUTCHWAX_HOME must be set
- You need an existing index in "target_dir" to operate on, e.g. one generated by running NutchWAX's "all" task on a set of arc files.

Hints:
- Save your production index directory before running this script on it!
- When using Cygwin, use relative paths, especially for the input dir.
- Either shut down NutchWAX when running this script or operate on a copy of your live index (to avoid permission denied errors).

Return codes: This script returns exit codes which can be used by other scripts:
0 - Everything went fine
1 - Script failed to start (directory not found etc.)
2 - The importing/indexing process was already started and the index in the target directory might have been damaged. You should restore it from your backup in this case.

- Max
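A minimal sketch of how another script might honor these exit codes, wrapping the call with a backup/restore of the target index. The wrapper below is illustrative only; it assumes nothing about the script beyond the return codes listed above.

```shell
#!/bin/sh
# Run a command against a target index directory, keeping a backup and
# restoring it when the command reports exit code 2 (index possibly
# damaged). Purely illustrative; relies only on the documented exit codes.
run_with_backup() {
  target=$1
  shift
  cp -r "$target" "$target.bak"
  "$@"
  rc=$?
  if [ "$rc" -eq 2 ]; then
    # The index may be damaged -- throw it away and restore the backup.
    rm -rf "$target"
    mv "$target.bak" "$target"
  else
    rm -rf "$target.bak"
  fi
  return "$rc"
}

# e.g.: run_with_backup myarch/output \
#         ./incr_index.sh --arcs heritrix/jobs/MyJob-12345/arcs myarch/output mycoll
```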