From: <bi...@us...> - 2008-06-26 22:26:04
Revision: 2325
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2325&view=rev
Author:   binzino
Date:     2008-06-26 15:26:10 -0700 (Thu, 26 Jun 2008)

Log Message:
-----------
Initial revision of scripts for processing CDX files for duplicate and revisit records.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx
    trunk/archive-access/projects/nutchwax/archive/bin/dups-from
    trunk/archive-access/projects/nutchwax/archive/bin/revisits

Added: trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx	(rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx	2008-06-26 22:26:10 UTC (rev 2325)
@@ -0,0 +1,26 @@
+#!/usr/bin/env bash
+
+if [ "$#" -eq 0 ];
+then
+  echo "Usage: dedup-cdx <cdx>..."
+  echo "To read from standard input, use \"-\" as a filename."
+  echo
+  echo "Finds duplicate records in a set of CDX files and outputs them "
+  echo "in a format suitable for use with NutchWAX tools."
+  echo
+  echo "Duplicate records are found by sorting all the CDX records, then"
+  echo "comparing subsequent records by URL+digest."
+  echo
+  echo "Output is in abbreviated form of \"URL digest date\", ex:"
+  echo
+  echo " example.org sha1:H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ 20080626121505"
+  echo " example.org sha1:H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ 20070208173443"
+  echo
+  echo "The output of this script can be used as an exclusions file for"
+  echo "importing (W)ARC files with NutchWAX, and also for adding dates"
+  echo "to a parallel index."
+  echo
+  exit 1;
+fi
+
+cat $@ | awk '{ print $1 " sha1:" $6 " " $2 }' | sort | awk '{ if ( url == $1 && digest == $2 ) print $1 " " $2 " " $3 ; url = $1 ; digest = $2 }'

Property changes on: trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx
___________________________________________________________________
Name: svn:executable
   + *

Added: trunk/archive-access/projects/nutchwax/archive/bin/dups-from
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/bin/dups-from	(rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/bin/dups-from	2008-06-26 22:26:10 UTC (rev 2325)
@@ -0,0 +1,16 @@
+#!/usr/bin/env bash
+
+if [ "$#" -lt 2 ];
+then
+  echo "Usage: dups-from <dups> <cdx>..."
+  echo "To read <cdx> from standard input, use \"-\" as a filename."
+  echo
+  echo "Extract the lines from <dups> that come from the <cdx>... files"
+  echo
+  exit 1;
+fi
+
+dups=$1
+shift
+
+cat $@ | awk '{ print $1 " sha1:" $6 " " $2 }' | cat - ${dups} | sort | uniq -d

Property changes on: trunk/archive-access/projects/nutchwax/archive/bin/dups-from
___________________________________________________________________
Name: svn:executable
   + *

Added: trunk/archive-access/projects/nutchwax/archive/bin/revisits
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/bin/revisits	(rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/bin/revisits	2008-06-26 22:26:10 UTC (rev 2325)
@@ -0,0 +1,12 @@
+#!/usr/bin/env bash
+
+if [ "$#" -eq 0 ];
+then
+  echo "Usage: revisits <cdx>..."
+  echo
+  echo "Extract revisit records from a CDX file."
+  echo "Normally only CDX's generated from WARCs will have revisit records."
+  exit 1;
+fi
+
+cat $@ | awk '{ if ( $9 == "-" ) print $1 " sha1:" $6 " " $2 }' | sort

Property changes on: trunk/archive-access/projects/nutchwax/archive/bin/revisits
___________________________________________________________________
Name: svn:executable
   + *
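
To illustrate how the dedup-cdx pipeline behaves, here is a minimal sketch that runs the same awk/sort/awk chain over three made-up CDX records. The sample records and the helper names sample_cdx and find_dups are hypothetical and not part of the commit; the only field positions assumed are the ones the script itself reads ($1 URL key, $2 capture date, $6 digest). After sorting, records sharing a URL and digest end up adjacent, and every record after the first in such a run is printed as a duplicate.

```shell
#!/usr/bin/env bash

# Hypothetical sample data: three space-separated CDX records.
# Only fields 1 (URL key), 2 (date) and 6 (digest) matter here;
# the remaining columns are placeholders.
sample_cdx() {
  cat <<'EOF'
example.org 20080626121505 http://example.org/ text/html 200 H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ - 200 b.arc.gz
example.org 20070208173443 http://example.org/ text/html 200 H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ - 100 a.arc.gz
example.org/about 20080626121510 http://example.org/about text/html 200 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ - 300 b.arc.gz
EOF
}

# Same pipeline as dedup-cdx, reading CDX records from standard input:
#   1. reduce each record to "URL sha1:digest date"
#   2. sort, so identical URL+digest pairs become adjacent (oldest date first)
#   3. print a line only when it repeats the previous line's URL and digest
find_dups() {
  awk '{ print $1 " sha1:" $6 " " $2 }' |
    sort |
    awk '{ if ( url == $1 && digest == $2 ) print $1 " " $2 " " $3 ; url = $1 ; digest = $2 }'
}

sample_cdx | find_dups
```

With this input, only the later capture of example.org is reported; the earliest capture of each URL+digest pair is treated as the original and is not listed.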