From: <bi...@us...> - 2008-07-03 18:29:05
Revision: 2401
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2401&view=rev
Author:   binzino
Date:     2008-07-03 11:29:09 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Initial revision.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt

Added: trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt          (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt          2008-07-03 18:29:09 UTC (rev 2401)
@@ -0,0 +1,289 @@

HOWTO-dedup.txt
2008-07-03
Aaron Binns

Table of Contents
  o Prerequisites
    - NutchWAX HOWTO.txt
    - Wayback 1.2.1
  o Overview
  o Generate CDX
  o Generate DUP
  o Import
  o Update and Invert
  o Index
  o Add Revisit Dates
  o Search
  o Web Deployment


======================================================================
Prerequisites
======================================================================

This de-duplication HOWTO assumes you've already read the main HOWTO
and are familiar with importing and indexing archive files with
NutchWAX.

For de-duplication, the Wayback Machine tools are required. This guide
assumes you have Wayback 1.2.1 installed in

  /opt/wayback-1.2.1


======================================================================
Overview
======================================================================

The README-dedup.txt explains the de-duplication process in greater
detail, including implementation details.

NutchWAX does not automagically detect and eliminate duplicate records
when importing and indexing. However, tools are provided to help the
user implement a system to perform de-duplication.

This guide describes one such system using the tools provided by
NutchWAX and Wayback.


======================================================================
Generate CDX
======================================================================

The first step is to generate a list of duplicate records for a set of
ARC files.

This step is not necessary if your archive files are in WARC format
and de-duplication was performed during the crawl.

To generate the list of duplicates, we use the Wayback 'arc-indexer'
with the NutchWAX 'dedup-cdx' utility. The CDX files *must* be
sorted.

  $ arc-indexer foo.arc.gz | sort > foo.cdx
  $ arc-indexer bar.arc.gz | sort > bar.cdx
  $ arc-indexer baz.arc.gz | sort > baz.cdx

Then we combine the CDX files into one sorted CDX containing all the
records:

  $ sort -m foo.cdx bar.cdx baz.cdx > all.cdx

The "-m" option speeds up the sort by merging the already-sorted
files.
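
With a larger set of ARC files, the same two steps can be scripted.
The following is only a sketch using the commands shown above; the
working directory and file names are illustrative:

  $ for arc in *.arc.gz ; do arc-indexer "$arc" | sort > "$arc.cdx" ; done
  $ sort -m *.arc.gz.cdx > all.cdx

Naming each per-file CDX after its ARC file (foo.arc.gz.cdx, etc.)
keeps the merged all.cdx out of the glob if the commands are re-run.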


======================================================================
Generate DUP
======================================================================

Now that we have 'all.cdx' containing a sorted list of all the records
in the ARC files, we can generate a list of duplicates therein:

  $ dedup-cdx all.cdx > all.dup

This "all.dup" file contains lines of the form

  example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911

where each line is

  URL digest date

This file is then used as an exclusion filter for importing.


======================================================================
Import
======================================================================

The import process is essentially the same as in NutchWAX, but now
we use "all.dup" as our exclusion list.

First, we create a manifest:

  $ cat > manifest
  foo.arc.gz test-collection
  bar.arc.gz test-collection
  baz.arc.gz test-collection
  ^D

  $ nutchwax import -e all.dup manifest

The result will be a newly-created Nutch segment, the same as when
importing without de-duplication.

If you examine the Nutch "hadoop.log" file, you will see INFO-level
lines from the NutchWAX Importer showing which URLs were excluded.


======================================================================
Update and Invert
======================================================================

Perform the Nutch "updatedb" and "invertlinks" steps as normal.

There is nothing special or different to do here with respect to
de-duplication.


======================================================================
Index
======================================================================

The only change we make to the indexing step is the destination of the
index directory.

By default, Nutch expects the per-segment index directory to live in a
sub-directory called 'indexes', and the index command is accordingly

  $ nutch index indexes crawldb linkdb segments/*

resulting in an index directory structure of the form

  indexes/part-00000

For de-duplication, we use a slightly different directory structure,
which will be used by a de-duplication-aware NutchWaxBean at
search-time. The directory structure we use is:

  pindexes/<segment>/part-00000

Using the segment name is not strictly required, but it is a good
practice and is strongly recommended. This way the segment and its
corresponding index directory are easily matched.

Let's assume that the segment directory created during the import is
named

  segments/20080703050349

In that case, our index command becomes:

  $ nutch index pindexes/20080703050349 crawldb linkdb segments/20080703050349

Upon completion, the Lucene index is created in

  pindexes/20080703050349/part-00000

This index is exactly the same as one normally created by Nutch; the
only difference is the location.
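
If more than one segment has been imported, the same command is simply
repeated once per segment. As a sketch (this just re-runs the command
above for each segment directory, keeping one index per segment as
recommended):

  $ for seg in segments/* ; do
  >   nutch index "pindexes/$(basename "$seg")" crawldb linkdb "$seg"
  > done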


======================================================================
Add Revisit Dates
======================================================================

Now that we have the Nutch index, we add the revisit dates to it.

Examine the "all.dup" file again; it has lines of the form

  example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911

These are the revisit dates that need to be added to the records in
the Lucene index. When we generated the index, only the date of the
first visit was put in the index. Now we have to add the rest.

As explained in README-dedup.txt, modifying the Lucene index to
actually add these dates is infeasible. What we do instead is create
a parallel index next to the main index (the part-00000 created above)
that contains all the dates for each record.

The NutchWAX 'add-dates' command creates this parallel index for us.

  $ nutchwax add-dates pindexes/20080703050349/part-00000 \
                       pindexes/20080703050349/part-00000 \
                       pindexes/20080703050349/dates \
                       all.dup

Yes, the part-00000 argument does appear twice. This is because it is
both the "key" index and the "source" index.

Suppose we did another crawl and had even more dates to add to the
existing index. In that case we would run

  $ nutchwax add-dates pindexes/20080703050349/part-00000 \
                       pindexes/20080703050349/dates \
                       pindexes/20080703050349/new-dates \
                       new-crawl.dup
  $ rm -r pindexes/20080703050349/dates
  $ mv pindexes/20080703050349/new-dates pindexes/20080703050349/dates

This copies the existing dates from "dates" to "new-dates" and adds
additional ones from "new-crawl.dup" along the way. Then we replace
the previous "dates" index with the new one.
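
At this point the per-segment directory should contain both the main
index and the parallel dates index. A quick sanity check (this listing
is simply what the steps above should have produced):

  $ ls pindexes/20080703050349/
  dates  part-00000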


======================================================================
Search
======================================================================

Test/debug searches can be run from the command-line, but instead of
using the 'NutchBean' we use 'NutchWaxBean'.

The "NutchWaxBean" extends NutchBean by adding support for parallel
indexes.

  $ nutch org.archive.nutchwax.NutchWaxBean <query>

The "NutchWaxBean" also gives slightly more verbose and useful output:

  $ nutch org.archive.nutchwax.NutchWaxBean carolina
  Total hits: 247338
  0 [20080702053119] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [20080618133218, 20080618133218]
    ... Studios Blue Ridge Motion Pictures Carolina Pinnacle Creative Network EUE/Screen ... Trailblazer Studios Federal Tax Incentive Carolina Pinnacle Studios ...
  1 [20080703023605] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [20080613200046, 20080618133218]

The output consists of

  hit number
  segment
  url
  key (which is url + digest)
  digest
  dates

The most useful bit here for testing de-duplication is the list of
dates.


======================================================================
Web Deployment
======================================================================

As noted in the HOWTO.txt document, when the nutch(wax) webapp is
deployed, changes made to the configuration must also be applied to
the deployed webapp.

In addition to those configuration changes, the "web.xml" file must
also be modified.

In Nutch, the "web.xml" file contains a directive to call a static
method on 'NutchBean' to initialize it. In order to search the
parallel indexes we have to use 'NutchWaxBean'. This is done by
modifying the "web.xml" to call a NutchWaxBean initializer after the
NutchBean initializer.

Change "web.xml" from

  <listener>
    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
  </listener>

to:

  <listener>
    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
    <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
  </listener>
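
As with the other configuration changes, this edit must also be made
to the deployed copy of "web.xml". As a rough sketch, assuming a
Tomcat-style container with the webapp deployed under
/opt/tomcat/webapps/nutchwax (the container path and webapp name are
assumptions, not something this HOWTO prescribes):

  $ vi /opt/tomcat/webapps/nutchwax/WEB-INF/web.xml   # add the NutchWaxBean listener shown above
  $ /opt/tomcat/bin/shutdown.sh                       # restart the container so the
  $ /opt/tomcat/bin/startup.sh                        #   new listener takes effect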