Revision: 2325
http://archive-access.svn.sourceforge.net/archive-access/?rev=2325&view=rev
Author: binzino
Date: 2008-06-26 15:26:10 -0700 (Thu, 26 Jun 2008)
Log Message:
-----------
Initial revision of scripts for processing CDX files for duplicate and
revisit records.
Added Paths:
-----------
trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx
trunk/archive-access/projects/nutchwax/archive/bin/dups-from
trunk/archive-access/projects/nutchwax/archive/bin/revisits
Added: trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx 2008-06-26 22:26:10 UTC (rev 2325)
@@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+
+if [ "$#" -eq 0 ];
+then
+ echo "Usage: dedup-cdx <cdx>..."
+ echo "To read from standard input, use \"-\" as a filename."
+ echo
+ echo "Finds duplicate records in a set of CDX files and outputs them "
+ echo "in a format suitable for use with NutchWAX tools."
+ echo
+ echo "Duplicate records are found by sorting all the CDX records, then"
+ echo "comparing subsequent records by URL+digest."
+ echo
+ echo "Output is in abbreviated form of \"URL digest date\", ex:"
+ echo
+ echo " example.org sha1:H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ 20080626121505"
+ echo " example.org sha1:H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ 20070208173443"
+ echo
+ echo "The output of this script can be used as an exclusions file for"
+ echo "importing (W)ARC files with NutchWAX, and also for adding dates"
+ echo "to a parallel index."
+ echo
+ exit 1;
+fi
+
+cat "$@" | awk '{ print $1 " sha1:" $6 " " $2 }' | sort | awk '{ if ( url == $1 && digest == $2 ) print $1 " " $2 " " $3 ; url = $1 ; digest = $2 }'
Property changes on: trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx
___________________________________________________________________
Name: svn:executable
+ *
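
The usage text above describes the approach as sorting the records and comparing neighbours by URL+digest. The following is a minimal sketch of that pipeline run on inline data, not part of the commit. The sample records and the function names `sample_cdx` and `dedup` are hypothetical; the only assumption about the CDX layout is the one the script itself makes (URL in field 1, date in field 2, digest in field 6).

```shell
#!/usr/bin/env bash

# Hypothetical sample CDX records. Only fields 1 (URL), 2 (date) and
# 6 (digest) are read by the pipeline; the other fields are placeholders.
sample_cdx() {
  cat <<'EOF'
example.org 20070208173443 http://example.org/ text/html 200 H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ - 100 a.arc.gz
example.org 20080626121505 http://example.org/ text/html 200 H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ - 200 b.arc.gz
example.org/about 20080626121600 http://example.org/about text/html 200 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - 300 b.arc.gz
EOF
}

# Same three steps as dedup-cdx: reduce to "URL digest date", sort, then
# print a record only when its URL and digest match the previous record.
dedup() {
  awk '{ print $1 " sha1:" $6 " " $2 }' \
    | sort \
    | awk '{ if ( url == $1 && digest == $2 ) print $1 " " $2 " " $3 ; url = $1 ; digest = $2 }'
}

sample_cdx | dedup
```

Because a record is printed only when it matches the one before it, the first record of each URL+digest group (the earliest date after sorting) is not listed; only the later captures are reported as duplicates.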
Added: trunk/archive-access/projects/nutchwax/archive/bin/dups-from
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/bin/dups-from (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/bin/dups-from 2008-06-26 22:26:10 UTC (rev 2325)
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+
+if [ "$#" -lt 2 ];
+then
+ echo "Usage: dups-from <dups> <cdx>..."
+ echo "To read <cdx> from standard input, use \"-\" as a filename."
+ echo
+ echo "Extract the lines from <dups> that come from the <cdx>... files."
+ echo
+ exit 1;
+fi
+
+dups=$1
+shift
+
+cat "$@" | awk '{ print $1 " sha1:" $6 " " $2 }' | cat - "${dups}" | sort | uniq -d
Property changes on: trunk/archive-access/projects/nutchwax/archive/bin/dups-from
___________________________________________________________________
Name: svn:executable
+ *
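
As a rough illustration of how dups-from selects lines, the sketch below (not part of the commit) runs the same steps on hypothetical data: the CDX records are reduced to "URL digest date" form, concatenated with the duplicates list, sorted, and passed through `uniq -d`, which prints one copy of each line that occurs more than once. The function names `dups_from` and `cdx_b` and all sample records are assumptions made for the example.

```shell
#!/usr/bin/env bash

# dups_from <dups-file>: read CDX records on standard input and print the
# lines of <dups-file> that also occur among them.
dups_from() {
  awk '{ print $1 " sha1:" $6 " " $2 }' | cat - "$1" | sort | uniq -d
}

# Hypothetical duplicates list, already in "URL digest date" form.
dups_file="$(mktemp)"
trap 'rm -f "$dups_file"' EXIT
cat > "$dups_file" <<'EOF'
example.org sha1:H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ 20080626121505
example.net sha1:BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 20080101000000
EOF

# Hypothetical CDX records from one archive file.
cdx_b() {
  cat <<'EOF'
example.org 20080626121505 http://example.org/ text/html 200 H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ - 200 b.arc.gz
example.org/about 20080626121600 http://example.org/about text/html 200 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - 300 b.arc.gz
EOF
}

cdx_b | dups_from "$dups_file"
```

Note that `uniq -d` cannot tell where a repeated line came from: a line that appears twice within the duplicates list, or twice within the CDX input, is printed as well.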
Added: trunk/archive-access/projects/nutchwax/archive/bin/revisits
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/bin/revisits (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/bin/revisits 2008-06-26 22:26:10 UTC (rev 2325)
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+
+if [ "$#" -eq 0 ];
+then
+ echo "Usage: revisits <cdx>..."
+ echo
+ echo "Extract revisit records from a CDX file."
+ echo "Normally only CDX files generated from WARCs will have revisit records."
+ exit 1;
+fi
+
+cat "$@" | awk '{ if ( $9 == "-" ) print $1 " sha1:" $6 " " $2 }' | sort
Property changes on: trunk/archive-access/projects/nutchwax/archive/bin/revisits
___________________________________________________________________
Name: svn:executable
+ *
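
The revisits script treats a record as a revisit when its ninth field is "-". A minimal sketch of that filter on hypothetical records follows (not part of the commit); the function names and sample data are assumptions, and the meaning of the ninth field is taken from the script alone.

```shell
#!/usr/bin/env bash

# Hypothetical CDX records: the first has "-" in field 9, the second does not.
sample_warc_cdx() {
  cat <<'EOF'
example.org 20080626121505 http://example.org/ text/html 200 H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ - 200 -
example.org/about 20080626121600 http://example.org/about text/html 200 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - 300 b.warc.gz
EOF
}

# Same filter as the revisits script: keep records whose field 9 is "-",
# print them as "URL digest date", and sort the result.
revisit_records() {
  awk '{ if ( $9 == "-" ) print $1 " sha1:" $6 " " $2 }' | sort
}

sample_warc_cdx | revisit_records
```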