From: <bi...@us...> - 2008-07-03 18:53:14
Revision: 2402
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2402&view=rev
Author:   binzino
Date:     2008-07-03 11:53:12 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Added comments regarding WARCs.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt 2008-07-03 18:29:09 UTC (rev 2401)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt 2008-07-03 18:53:12 UTC (rev 2402)
@@ -75,7 +75,7 @@
 ======================================================================
-Generate DUP
+Generate DUP/Revisits
 ======================================================================
 Now that we have 'all.cdx' containing a sorted list of all the records
@@ -98,6 +98,25 @@
 This file is then used as an exlusion filter for importing.
+
+WARC
+----
+If we are using WARC files with revisit records instead of ARC files,
+then we don't generate a list of duplicate records because there
+shouldn't be any.
+
+However, the revisit records in the WARC files do have the dates when
+a URL was revisited and seen to have not changed -- which is more or
+less the same thing as our "dup" lines above.
+
+For extracting these revisits from WARC CDX files, we use the
+'revisits' utility provided by NutchWAX
+
+ $ revisits all-warc.cdx > all-warc.dup
+
+The output of 'revisits' is in the same format as 'dedup-cdx'.
+
+
 ======================================================================
 Import
 ======================================================================
@@ -121,7 +140,13 @@
 If you examine the Nutch "hadoop.log" file, you will see INFO-level
 lines from the NutchWAX Importer showing which URLs were excluded.
+WARC
+----
+If you are importing WARC files with revisit records, then you
+typically won't need to provide an exclusion file as the WARC files
+were de-duplicated during the crawl.
+
 ======================================================================
 Update and Invert
 ======================================================================
@@ -224,6 +249,15 @@
 the previous "dates" index with the new one.
+WARC
+----
+This step is the same for ARCs and WARCs.
+
+The only difference is that our "all.dup" file containing the list of
+revisit dates was created by different utilities: 'dedup-cdx' for ARCs
+and 'revisits' for WARCs.
+
+
 ======================================================================
 Search
 ======================================================================

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.