From: <bi...@us...> - 2008-07-03 18:29:05
Revision: 2401
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2401&view=rev
Author:   binzino
Date:     2008-07-03 11:29:09 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Initial revision.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt

Added: trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt          (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt          2008-07-03 18:29:09 UTC (rev 2401)
@@ -0,0 +1,289 @@

HOWTO-dedup.txt
2008-07-03
Aaron Binns

Table of Contents
  o Prerequisites
    - NutchWAX HOWTO.txt
    - Wayback 1.2.1
  o Overview
  o Generate CDX
  o Generate DUP
  o Import
  o Update and Invert
  o Index
  o Add Revisit Dates
  o Search
  o Web Deployment


======================================================================
Prerequisites
======================================================================

This de-duplication HOWTO assumes you've already read the main HOWTO
and are familiar with importing and indexing archive files with
NutchWAX.

For de-duplication, the Wayback Machine tools are required. This guide
assumes you have Wayback 1.2.1 installed in

  /opt/wayback-1.2.1


======================================================================
Overview
======================================================================

The README-dedup.txt explains the de-duplication process in greater
detail, including implementation details.

NutchWAX does not automagically detect and eliminate duplicate records
when importing and indexing. However, tools are provided to help the
user implement a system to perform de-duplication.

This guide describes one such system using the tools provided by
NutchWAX and Wayback.


======================================================================
Generate CDX
======================================================================

The first step is to generate a list of duplicate records for a set of
ARC files.

This step is not necessary if your archive files are in WARC format
and de-duplication was performed during the crawl.

To generate the list of duplicates, we use the Wayback 'arc-indexer'
with the NutchWAX 'dedup-cdx' utility. The CDX files *must* be
sorted.

  $ arc-indexer foo.arc.gz | sort > foo.cdx
  $ arc-indexer bar.arc.gz | sort > bar.cdx
  $ arc-indexer baz.arc.gz | sort > baz.cdx

Then we combine the CDX files into one sorted CDX containing all the
records:

  $ sort -m foo.cdx bar.cdx baz.cdx > all.cdx

The "-m" option speeds up the sort by merging the already-sorted
files.
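
With a larger set of ARC files, the same two steps can be scripted.
The following is only a sketch using the commands shown above; the
working directory and file names are illustrative:

  $ for arc in *.arc.gz ; do arc-indexer "$arc" | sort > "$arc.cdx" ; done
  $ sort -m *.arc.gz.cdx > all.cdx

Naming each per-file CDX after its ARC file (foo.arc.gz.cdx, etc.)
keeps the merged all.cdx out of the glob if the commands are re-run.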


======================================================================
Generate DUP
======================================================================

Now that we have 'all.cdx' containing a sorted list of all the records
in the ARC files, we can generate a list of duplicates therein:

  $ dedup-cdx all.cdx > all.dup

This "all.dup" file contains lines of the form

  example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911

where each line is

  URL digest date

This file is then used as an exclusion filter for importing.


======================================================================
Import
======================================================================

The import process is essentially the same as in NutchWAX, but now
we use "all.dup" as our exclusion list.

First, we create a manifest:

  $ cat > manifest
  foo.arc.gz test-collection
  bar.arc.gz test-collection
  baz.arc.gz test-collection
  ^D

  $ nutchwax import -e all.dup manifest

The result will be a newly-created Nutch segment, the same as when
importing without de-duplication.

If you examine the Nutch "hadoop.log" file, you will see INFO-level
lines from the NutchWAX Importer showing which URLs were excluded.


======================================================================
Update and Invert
======================================================================

Perform the Nutch "updatedb" and "invertlinks" steps as normal.

There is nothing special or different to do here with respect to
de-duplication.


======================================================================
Index
======================================================================

The only change we make to the indexing step is the destination of the
index directory.

By default, Nutch expects the per-segment index directory to live in a
sub-directory called 'indexes', and the index command is accordingly

  $ nutch index indexes crawldb linkdb segments/*

resulting in an index directory structure of the form

  indexes/part-00000

For de-duplication, we use a slightly different directory structure,
which will be used by a de-duplication-aware NutchWaxBean at
search-time. The directory structure we use is:

  pindexes/<segment>/part-00000

Using the segment name is not strictly required, but it is a good
practice and is strongly recommended. This way the segment and its
corresponding index directory are easily matched.

Let's assume that the segment directory created during the import is
named

  segments/20080703050349

In that case, our index command becomes:

  $ nutch index pindexes/20080703050349 crawldb linkdb segments/20080703050349

Upon completion, the Lucene index is created in

  pindexes/20080703050349/part-00000

This index is exactly the same as one normally created by Nutch; the
only difference is the location.
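
If more than one segment has been imported, the same command is simply
repeated once per segment. As a sketch (this just re-runs the command
above for each segment directory, keeping one index per segment as
recommended):

  $ for seg in segments/* ; do
  >   nutch index "pindexes/$(basename "$seg")" crawldb linkdb "$seg"
  > done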


======================================================================
Add Revisit Dates
======================================================================

Now that we have the Nutch index, we add the revisit dates to it.

Examine the "all.dup" file again; it has lines of the form

  example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
  example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911

These are the revisit dates that need to be added to the records in
the Lucene index. When we generated the index, only the date of the
first visit was put in the index. Now we have to add the rest.

As explained in README-dedup.txt, modifying the Lucene index to
actually add these dates is infeasible. What we do instead is create
a parallel index next to the main index (the part-00000 created above)
that contains all the dates for each record.

The NutchWAX 'add-dates' command creates this parallel index for us.

  $ nutchwax add-dates pindexes/20080703050349/part-00000 \
                       pindexes/20080703050349/part-00000 \
                       pindexes/20080703050349/dates \
                       all.dup

Yes, the part-00000 argument does appear twice. This is because it is
both the "key" index and the "source" index.

Suppose we did another crawl and had even more dates to add to the
existing index. In that case we would run

  $ nutchwax add-dates pindexes/20080703050349/part-00000 \
                       pindexes/20080703050349/dates \
                       pindexes/20080703050349/new-dates \
                       new-crawl.dup
  $ rm -r pindexes/20080703050349/dates
  $ mv pindexes/20080703050349/new-dates pindexes/20080703050349/dates

This copies the existing dates from "dates" to "new-dates" and adds
additional ones from "new-crawl.dup" along the way. Then we replace
the previous "dates" index with the new one.
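
At this point the per-segment directory should contain both the main
index and the parallel dates index. A quick sanity check (this listing
is simply what the steps above should have produced):

  $ ls pindexes/20080703050349/
  dates  part-00000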


======================================================================
Search
======================================================================

Test/debug searches can be run from the command-line, but instead of
using the 'NutchBean' we use 'NutchWaxBean'.

The "NutchWaxBean" extends NutchBean by adding support for parallel
indexes.

  $ nutch org.archive.nutchwax.NutchWaxBean <query>

The "NutchWaxBean" also gives slightly more verbose and useful output:

  $ nutch org.archive.nutchwax.NutchWaxBean carolina
  Total hits: 247338
  0 [20080702053119] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [20080618133218, 20080618133218]
    ... Studios Blue Ridge Motion Pictures Carolina Pinnacle Creative Network EUE/Screen ... Trailblazer Studios Federal Tax Incentive Carolina Pinnacle Studios ...
  1 [20080703023605] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [20080613200046, 20080618133218]

The output consists of

  hit number
  segment
  url
  key (which is url + digest)
  digest
  dates

The most useful bit here for testing de-duplication is the list of
dates.


======================================================================
Web Deployment
======================================================================

As noted in the HOWTO.txt document, when the nutch(wax) webapp is
deployed, changes made to the configuration must also be applied to
the deployed webapp.

In addition to those configuration changes, the "web.xml" file must
also be modified.

In Nutch, the "web.xml" file contains a directive to call a static
method on 'NutchBean' to initialize it. In order to search the
parallel indexes we have to use 'NutchWaxBean'. This is done by
modifying the "web.xml" to call a NutchWaxBean initializer after the
NutchBean initializer.

Change "web.xml" from

  <listener>
    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
  </listener>

to:

  <listener>
    <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class>
    <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class>
  </listener>
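
As with the other configuration changes, this edit must also be made
to the deployed copy of "web.xml". As a rough sketch, assuming a
Tomcat-style container with the webapp deployed under
/opt/tomcat/webapps/nutchwax (the container path and webapp name are
assumptions, not something this HOWTO prescribes):

  $ vi /opt/tomcat/webapps/nutchwax/WEB-INF/web.xml   # add the NutchWaxBean listener shown above
  $ /opt/tomcat/bin/shutdown.sh                       # restart the container so the
  $ /opt/tomcat/bin/startup.sh                        #   new listener takes effect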