Re: [Archive-access-discuss] Incremental indexing with nutchwax 0.6

I've distilled Michael Stack's instructions into a shell script which I'd
like to share. It seems to work quite well for me, but I've only used it
on smaller archives (several hundred MB) with the latest NutchWAX (CVS
HEAD) and under Cygwin.
Please let me know if it works for you and whether you still find
everything with the new indices:

http://www.cip.ifi.lmu.de/~schoefma/howto/incremental_indexing_with_nutchwax/incr_index.sh

Usage:
./incr_index.sh input_dir target_dir [collection_name]
  or
./incr_index.sh --arcs dir_with_arc_files target_dir [collection_name]

Example:
./incr_index.sh --arcs heritrix/jobs/MyJob-12345/arcs myarch/output mycoll

Preconditions:
- HADOOP_HOME and NUTCHWAX_HOME must be set
- You need an existing index in "target_dir" to operate on, e.g. one
  generated by running NutchWAX's "all" task on a set of ARC files.
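
If you want to check these preconditions before calling the script, something
along these lines might do. This is only a sketch: check_preconditions is a
hypothetical helper and not part of incr_index.sh, and it only tests that the
target directory exists, not that it contains a valid index.

```shell
#!/bin/sh
# Hypothetical helper (not part of incr_index.sh): checks the two
# preconditions listed above. Returns 0 if they hold, 1 otherwise.
check_preconditions() {
    target_dir=$1
    if [ -z "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is not set" >&2
        return 1
    fi
    if [ -z "${NUTCHWAX_HOME:-}" ]; then
        echo "NUTCHWAX_HOME is not set" >&2
        return 1
    fi
    # Assumption: we only test for the directory itself; whether it
    # holds a usable index is not verified here.
    if [ ! -d "$target_dir" ]; then
        echo "target_dir '$target_dir' does not exist" >&2
        return 1
    fi
    return 0
}
```

For example: check_preconditions myarch/output && ./incr_index.sh input_dir myarch/output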

Hints:
- Save your production index directory before running this script on it!
- When using Cygwin, use relative paths especially for the input dir.
- Either shut down NutchWAX when running this script or operate on a copy
  of your live index (to avoid permission denied errors).
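
To follow the first hint, you could take a timestamped copy of the index
directory before each run. A minimal sketch, assuming a plain recursive copy
is good enough for your index; backup_index and the ".bak-" naming are my own
invention, not something incr_index.sh provides.

```shell
#!/bin/sh
# Hypothetical helper: copies the index directory to
# <dir>.bak-<timestamp> and prints the path of the copy.
backup_index() {
    src=${1%/}                      # drop a trailing slash, if any
    [ -d "$src" ] || return 1       # nothing to back up
    backup="${src}.bak-$(date +%Y%m%d%H%M%S)"
    [ -e "$backup" ] && return 1    # never overwrite an older backup
    cp -R "$src" "$backup" || return 1
    echo "$backup"
}
```

For example: backup=$(backup_index myarch/output) && echo "saved to $backup"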

Return codes:
This script returns exit codes which can be used by other scripts:
0  -  Everything went fine.
1  -  Script failed to start (directory not found etc.).
2  -  The importing/indexing process was already started and the index
      in the target directory might have been damaged. You should restore
      it from your backup in this case.
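
A calling script could react to these codes roughly as follows. Again just a
sketch: run_incr_index and the INCR_INDEX variable are hypothetical names, and
only the meaning of the codes 0, 1 and 2 is taken from the list above.

```shell
#!/bin/sh
# Hypothetical wrapper: runs incr_index.sh with the given arguments
# and reports what its exit code means. INCR_INDEX (an assumed
# variable) may point to the script; the default is ./incr_index.sh.
run_incr_index() {
    status=0
    "${INCR_INDEX:-./incr_index.sh}" "$@" || status=$?
    case $status in
        0) echo "index updated" ;;
        1) echo "script failed to start" >&2 ;;
        2) echo "indexing failed, restore target_dir from your backup" >&2 ;;
        *) echo "unexpected exit code $status" >&2 ;;
    esac
    return "$status"
}
```

For example: run_incr_index --arcs dir_with_arc_files target_dir || exit 1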

- Max