From: Maximilian S. <sch...@ci...> - 2006-10-27 17:32:20
I've distilled Michael Stack's instructions into a shell script which I'd like to share. It seems to work quite well for me, but I've only used it on smaller archives (several hundred MB) with the latest NutchWAX (CVS head) and under Cygwin. Please let me know if it works for you and whether you still find everything with the new indices:

http://www.cip.ifi.lmu.de/~schoefma/howto/incremental_indexing_with_nutchwax/incr_index.sh

Usage:
  ./incr_index.sh input_dir target_dir [collection_name]
or
  ./incr_index.sh --arcs dir_with_arc_files target_dir [collection_name]

Example:
  ./incr_index.sh --arcs heritrix/jobs/MyJob-12345/arcs myarch/output mycoll

Preconditions:
- HADOOP_HOME and NUTCHWAX_HOME must be set.
- You need an existing index in "target_dir" to operate on, e.g. one generated by running NutchWAX's "all" task on a set of arc files.

Hints:
- Save your production index directory before running this script on it!
- When using Cygwin, use relative paths, especially for the input dir.
- Either shut down NutchWAX while running this script or operate on a copy of your live index (to avoid "permission denied" errors).

Return codes:
This script returns exit codes which can be used by other scripts:
0 - Everything went fine.
1 - The script failed to start (directory not found etc.).
2 - The importing/indexing process had already started, and the index in the target directory might have been damaged. You should restore it from your backup in this case.

- Max
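
The return codes listed above could be used by a calling script roughly as follows. This is only a sketch under assumptions: the function name safe_incr_index, the INCR_INDEX variable (path to incr_index.sh) and the ".bak" backup location are made up for illustration and are not part of the original script. It saves the index first, runs the update with --arcs, and restores the saved copy if exit code 2 is reported.

```shell
#!/bin/sh
# Hypothetical wrapper around incr_index.sh (names are assumptions, see above).
# Usage: safe_incr_index dir_with_arc_files target_dir [collection_name]

safe_incr_index() {
    arcs_dir=$1
    target_dir=${2%/}          # drop a trailing slash so the backup lands beside the index
    collection=${3:-}
    backup_dir="${target_dir}.bak"

    # The script needs an existing index in target_dir to operate on.
    [ -d "$target_dir" ] || { echo "no index in '$target_dir'" >&2; return 1; }

    # Save the index before touching it, as the hints recommend.
    rm -rf "$backup_dir"
    cp -R "$target_dir" "$backup_dir" || return 1

    # Run the incremental update; the collection name is optional.
    if [ -n "$collection" ]; then
        "${INCR_INDEX:-./incr_index.sh}" --arcs "$arcs_dir" "$target_dir" "$collection"
    else
        "${INCR_INDEX:-./incr_index.sh}" --arcs "$arcs_dir" "$target_dir"
    fi
    status=$?

    case $status in
        0) echo "index updated" ;;
        1) echo "script failed to start, index untouched" >&2 ;;
        2) echo "index may be damaged, restoring backup" >&2
           rm -rf "$target_dir"
           cp -R "$backup_dir" "$target_dir" || echo "restore failed" >&2 ;;
        *) echo "unexpected exit code $status" >&2 ;;
    esac
    return "$status"
}
```

With the function sourced into a shell, the example from the mail would become: safe_incr_index heritrix/jobs/MyJob-12345/arcs myarch/output mycoll. The function passes the script's own exit code through, so other scripts can still react to 0, 1 or 2.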