From: Michael S. <sta...@us...> - 2005-09-08 17:48:56
|
Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23127 Added Files: steps_indexing_katrina.txt Log Message: * steps_indexing_katrina.txt Added. Notes on how I did indexing of Katrina. --- NEW FILE: steps_indexing_katrina.txt --- $Id: steps_indexing_katrina.txt,v 1.1 2005/09/08 17:48:48 stack-sf Exp $ Two crawls of Hurricane Katrina. 00 and 01. Will start by indexing part of 00. Here are all of the backup hosts w/ katrina crawl 00 ARCs on them: $ ~webcrawl/crawl-arc-cfg/db-arc-info \ -like HURRICANE-KATRINA-2005-00%arc.gz | \ awk '{print $2$4}' |grep -e -bu|sort|uniq crawldata0034a-bu.archive.org/1 crawldata0035a-bu.archive.org/3 crawldata0036a-bu.archive.org/0 crawldata0037a-bu.archive.org/0 Now to mount these hosts. Here's a little script to do it: #!/bin/sh # Pass name of file that hosts and name of collection to use as dir under # /mnt. if [ $# != 2 ] then echo "Usage: $0 HOSTS_FILE DIR_UNDER_MNT" exit 1 fi for i in `cat $1` do mntpoint="/mnt/$2/$i" mkdir -p $mntpoint dev=`echo $i|sed -n -e 's/\//:\//p'` mount -t nfs -o ro,rsize=8192,wsize=8192,intr,nfsvers=2 $dev $mntpoint done Counting ARCs: $ ~webcrawl/crawl-arc-cfg/db-arc-info \ -like HURRICANE-KATRINA-2005-00%arc.gz | \ awk '{print $2 " " $6}'|grep -e -bu|uniq|wc -l There are 1010 in crawl 00 (uniq'ing, there are 1008). Here is how I got a list of all files sorted: $ ~webcrawl/crawl-arc-cfg/db-arc-info \ -like HURRICANE-KATRINA-2005-00%arc.gz | \ awk '{print $2 " " $6}'|grep -e -bu| \ awk '{print $2}'|sort|uniq> 00arcs.txt I'll do first 100 for now (One segment). $ head -100 00arcs.txt > 00arcs.0-99.txt I then made a directory to hold symlinks to the first 100: $ mkdir 00arcs.0-99 $ for i in `cat ../00arcs.0-99.txt`; do find /mnt/katrina/ -type f \ -name $i -exec ln -s {} \;; done Don't forget to edit the parse-ext plugin.xml so it points to the pdf parser wrapper script. I ran the indexing like this: $ nohup ./bin/indexarcs.sh -c katrina -s ~/katrina/00arcs.0-99/ \ -d /2/katrina/nutch-data &> /2/katrina/indexing`date +%FT%H:%M`.log \ < /dev/null & |