Update of /cvsroot/archive-access/archive-access/projects/nutch/xdocs
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv23127
Added Files:
steps_indexing_katrina.txt
Log Message:
* steps_indexing_katrina.txt
Added. Notes on how I did indexing of Katrina.
--- NEW FILE: steps_indexing_katrina.txt ---
$Id: steps_indexing_katrina.txt,v 1.1 2005/09/08 17:48:48 stack-sf Exp $
There are two crawls of Hurricane Katrina, 00 and 01. I'll start by indexing part of 00.
Here are all of the backup hosts w/ katrina crawl 00 ARCs on them:
$ ~webcrawl/crawl-arc-cfg/db-arc-info \
-like HURRICANE-KATRINA-2005-00%arc.gz | \
awk '{print $2$4}' |grep -e -bu|sort|uniq
crawldata0034a-bu.archive.org/1
crawldata0035a-bu.archive.org/3
crawldata0036a-bu.archive.org/0
crawldata0037a-bu.archive.org/0
Now to mount these hosts. Here's a little script to do it:
#!/bin/sh
# Pass name of file that lists the hosts and name of collection to use as
# dir under /mnt.
if [ $# -ne 2 ]
then
    echo "Usage: $0 HOSTS_FILE DIR_UNDER_MNT"
    exit 1
fi
for i in `cat "$1"`
do
    mntpoint="/mnt/$2/$i"
    mkdir -p "$mntpoint"
    # Turn host/N into host:/N, the device form mount expects for NFS.
    dev=`echo "$i"|sed -n -e 's/\//:\//p'`
    mount -t nfs -o ro,rsize=8192,wsize=8192,intr,nfsvers=2 "$dev" "$mntpoint"
done
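To check what the script will do before mounting anything, here is a minimal dry-run sketch. show_mount is a hypothetical helper, not part of the script above; it only reuses the same sed substitution and prints the NFS device and mount point for one hosts-file entry.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the NFS device and the mount point
# for one hosts-file entry, without mounting anything.
# $1 = entry such as crawldata0034a-bu.archive.org/1
# $2 = collection name used as the dir under /mnt
show_mount() {
    entry=$1
    coll=$2
    # Same substitution as the mount script: the first "/" becomes ":/".
    dev=`echo "$entry" | sed -n -e 's/\//:\//p'`
    echo "$dev /mnt/$coll/$entry"
}

show_mount crawldata0034a-bu.archive.org/1 katrina
# prints: crawldata0034a-bu.archive.org:/1 /mnt/katrina/crawldata0034a-bu.archive.org/1
```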
Counting ARCs:
$ ~webcrawl/crawl-arc-cfg/db-arc-info \
-like HURRICANE-KATRINA-2005-00%arc.gz | \
awk '{print $2 " " $6}'|grep -e -bu|uniq|wc -l
There are 1010 ARCs in crawl 00 (1008 after uniq'ing).
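One thing to keep in mind when counting: uniq only collapses duplicate lines that are adjacent, so on an unsorted listing it can leave repeats in. The sketch below shows the difference on made-up sample lines (not real db-arc-info output); count_adjacent and count_distinct are hypothetical helper names.

```shell
#!/bin/sh
# Count lines after collapsing adjacent duplicates only (plain uniq).
count_adjacent() { uniq | wc -l | tr -d ' '; }
# Count distinct lines regardless of order (sort first, then uniq).
count_distinct() { sort | uniq | wc -l | tr -d ' '; }

# Made-up sample: the repeated name is not adjacent to its first occurrence.
printf 'a.arc.gz\nb.arc.gz\na.arc.gz\n' | count_adjacent   # prints 3
printf 'a.arc.gz\nb.arc.gz\na.arc.gz\n' | count_distinct   # prints 2
```

If the two counts differ for the real listing, the sorted one is the number of distinct ARCs.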
Here is how I got a list of all files sorted:
$ ~webcrawl/crawl-arc-cfg/db-arc-info \
-like HURRICANE-KATRINA-2005-00%arc.gz | \
awk '{print $2 " " $6}'|grep -e -bu| \
awk '{print $2}'|sort|uniq> 00arcs.txt
I'll do the first 100 for now (one segment).
$ head -100 00arcs.txt > 00arcs.0-99.txt
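For later segments, a sketch along the same lines, assuming each segment stays at 100 ARCs as above. segment is a hypothetical helper that prints the Nth block of 100 names from the list, with N starting at 0.

```shell
#!/bin/sh
# Hypothetical helper: print block N (0-based) of 100 lines from a list file.
# $1 = list file (for example 00arcs.txt), $2 = segment number
segment() {
    first=`expr "$2" \* 100 + 1`
    last=`expr "$2" \* 100 + 100`
    # sed -n with a line range prints only lines first..last.
    sed -n "${first},${last}p" "$1"
}
```

For example, "segment 00arcs.txt 1 > 00arcs.100-199.txt" would write names 101 through 200, and "segment 00arcs.txt 0" gives the same lines as the head -100 above.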
I then made a directory to hold symlinks to the first 100, and ran the find
from inside it so the links land there:
$ mkdir 00arcs.0-99
$ cd 00arcs.0-99
$ for i in `cat ../00arcs.0-99.txt`; do find /mnt/katrina/ -type f \
-name $i -exec ln -s {} \;; done
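If one of the mounts is missing, find silently turns up nothing for the ARCs on that host, so it is worth checking the links before indexing. A minimal sketch; missing_arcs is a hypothetical helper that prints every name in the list that has no entry in the symlink directory.

```shell
#!/bin/sh
# Hypothetical check: print each name from a list file that has no
# corresponding entry in a directory of symlinks.
# $1 = list file, $2 = directory holding the symlinks
missing_arcs() {
    for name in `cat "$1"`
    do
        # -L is true for a symlink even when its target is unreachable,
        # so this reports only names that were never linked at all.
        [ -e "$2/$name" ] || [ -L "$2/$name" ] || echo "$name"
    done
}
```

Run from the parent directory, "missing_arcs 00arcs.0-99.txt 00arcs.0-99" should print nothing when all 100 links were made.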
Don't forget to edit the parse-ext plugin.xml so it points to the pdf parser
wrapper script.
I ran the indexing like this:
$ nohup ./bin/indexarcs.sh -c katrina -s ~/katrina/00arcs.0-99/ \
-d /2/katrina/nutch-data &> /2/katrina/indexing`date +%FT%H:%M`.log \
< /dev/null &