[Archive-access-cvs] SF.net SVN: archive-access: [2400] trunk/archive-access/projects/nutchwax/ arc

Revision: 2400
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2400&view=rev
Author:   binzino
Date:     2008-07-03 11:28:55 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Added info on new configuration properties.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================

--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-07-03 02:03:41 UTC (rev 2399)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-07-03 18:28:55 UTC (rev 2400)
@@ -120,12 +120,13 @@
 
 to
 
-  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic
+  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
 
 In short, we add:
 
   index-nutchwax
   query-nutchwax
+  urlfilter-nutchwax
   parse-pdf
 
 and remove:
@@ -136,19 +137,37 @@
 The only *required* changes are the additions of the NutchWAX index
 and query plugins.  The rest are optional, but recommended.
 
-The addition of the "parse-pdf" plugin is simply because we have lots
-of PDFs in our archives and we want to index them.  We sometimes
-remove the "parse-js" plugin if we don't care to index JavaScript
-files.
+The "parse-pdf" plugin is added simply because we have lots of PDFs in
+our archives and we want to index them.  We sometimes remove the
+"parse-js" plugin if we don't care to index JavaScript files.
 
-We also remove the URL filtering and normalizing plugins because we do
-not need the URLs normalized nor filtered.  We trust that the tool
-that produced the ARC/WARC file will have normalized the URLs
-contained therein according to its own rules so there's no need to
-normalize here.  Also, we don't filter by URL since we want to index
-as much of the ARC/WARC file as we have parsers for.
+We also remove the default Nutch URL filtering and normalizing plugins
+because we do not need the URLs normalized nor filtered.  We trust
+that the tool that produced the ARC/WARC file will have normalized the
+URLs contained therein according to its own rules so there's no need
+to normalize here.  Also, we don't filter by URL since we want to
+index as much of the ARC/WARC file as we have parsers for.
 
+We do, however, add the NutchWAX URL filter.  If de-duplication is
+being performed upon import, this plugin is required.  It performs URL
+filtering of the list of ARC records to exclude based on
+URL+digest+date.
+
 --------------------------------------------------
+indexingfilter.order
+--------------------------------------------------
+
+Add this property with a value of
+
+    org.apache.nutch.indexer.basic.BasicIndexingFilter
+    org.archive.nutchwax.index.ConfigurableIndexingFilter
+
+So that the NutchWAX indexing filter is run after the Nutch basic
+indexing filter.
+
+A full explanation is given in "README-dedup.txt".
+
+--------------------------------------------------
 mime.type.magic
 --------------------------------------------------
 We disable mimetype detection in Nutch for two reasons:
@@ -172,12 +191,12 @@
 nutchwax.filter.index
 --------------------------------------------------
 Configure the 'index-nutchwax' plugin.  Specify how the metadata
-fields added by the ArcsToSegment are mapped to the Lucene documents
-during indexing.
+fields added by the Importer are mapped to the Lucene documents during
+indexing.
 
 The specifications here are of the form:
 
-  src-key:lowercase:store:tokenize:dest-key
+  src-key:lowercase:store:tokenize:exclusive:dest-key
 
 where the only required part is the "src-key", the rest will assume
 the following defaults:
@@ -185,6 +204,7 @@
   lowercase = true
   store     = true
   tokenize  = false
+  exclusive = true
   dest-key  = src-key
 
 We recommend:
@@ -192,6 +212,9 @@
 <property>
   <name>nutchwax.filter.index</name>
   <value>
+    url:false:true:true
+    orig:false
+    digest:false
     arcname:false
     collection
     date
@@ -199,39 +222,50 @@
   </value>
 </property>
 
+The "url", "orig" and "digest" values are required, the rest are
+optional, but strongly recommended.
+
 --------------------------------------------------
 nutchwax.filter.query
 --------------------------------------------------
 Configure the 'query-nutchwax' plugin.  Specify which fields to make
-searchable via "[field]:[term|phrase]" query syntax, and whether they
+searchable via "field:[term|phrase]" query syntax, and whether they
 are "raw" fields or not.
 
-The specification format is 
+The specification format is one of:
 
-  raw:name:lowercase:boost 
-or
-  field:name:boost
+  field:<name>:<boost>
+  raw:<name>:<lowercase>:<boost>
+  group:<name>:<lowercase>:<delimiter>:<boost>
 
 Default values are
 
   lowercase = true
+  delimiter = ","
   boost     = 1.0f
 
 There is no "lowercase" property for "field" specification because the
 Nutch FieldQueryFilter doesn't expose the option, unlike the
 RawFieldQueryFilter.
 
-NTOE: We do *not* use this filter for handling "date" queries, there is a
-specific filter for that: DateQueryFilter
+The "group" fields are raw fields that can accept multiple values,
+separated by a delimiter.  Multiple values appearing in a query are
+automagically translated into required OR-groups, such as
 
+  collection:"193,221,36" => +(collection:193 collection:221 collection:36)
+
+NOTE: We do *not* use this filter for handling "date" queries, there
+is a specific filter for that: DateQueryFilter
+
 We recommend:
 
 <property>
   <name>nutchwax.filter.query</name>
   <value>
+    raw:digest:false
     raw:arcname:false
-    raw:collection
-    raw:type
+    group:collection
+    group:type
     field:anchor
     field:content
     field:host
@@ -240,6 +274,52 @@
 </property>
 
 
+--------------------------------------------------
+nutchwax.urlfilter.wayback.exclusions
+--------------------------------------------------
+File containing the exclusion list for importing.
+
+Normally, this is specified on the command line when the NutchWAX
+Importer is invoked.  It can be specified here if preferred.
+
+--------------------------------------------------
+nutchwax.urlfilter.wayback.canonicalizer
+--------------------------------------------------
+
+For CDX-based de-duplication, the same URL canonicalization algorithm
+must be used here as was used to generate the CDX files.
+
+The default canonicalizer in Wayback's '(w)arc-indexer' utility
+is 
+
+  org.archive.wayback.util.url.AggressiveUrlCanonicalizer
+
+which is the value provided in "nutch-site.xml".
+
+If the '(w)arc-indexer' is executed with the "-i" (identity)
+command-line option, then the matching canonicalizer
+
+  org.archive.wayback.util.url.IdentityUrlCanonicalizer
+
+must be specified here.
+
+--------------------------------------------------
+nutchwax.import.content.limit
+--------------------------------------------------
+Similar to Nutch's
+
+  file.content.limit
+  http.content.limit
+  ftp.content.limit
+
+properties, this specifies a limit on the size of a document imported
+via NutchWAX.
+
+We recommend setting this to a size compatible with the memory
+capacity of the computers performing the import.  Something in the
+1-4MB range is typical.
+
+
 ======================================================================
 Create a manifest
 ======================================================================
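
Illustration (not part of the commit): the "nutchwax.filter.index"
specification format and defaults described in the diff above can be
sketched in Python as follows.  This is a minimal sketch, not the
NutchWAX implementation.  The names IndexFieldSpec, parse_index_spec
and parse_index_specs are hypothetical, and two details are
assumptions rather than facts from the HOWTO: that the specifications
in the property value are separated by whitespace, and that the flags
are spelled "true" or "false".

```python
from typing import List, NamedTuple


class IndexFieldSpec(NamedTuple):
    """One entry of the form src-key:lowercase:store:tokenize:exclusive:dest-key."""
    src_key: str
    lowercase: bool
    store: bool
    tokenize: bool
    exclusive: bool
    dest_key: str


def _flag(text: str, default: bool) -> bool:
    # An omitted (empty) part falls back to the documented default.
    # Assumption: flags are written as "true"/"false", case-insensitively.
    return default if text == "" else text.lower() == "true"


def parse_index_spec(spec: str) -> IndexFieldSpec:
    """Parse a single specification, filling in the defaults from the HOWTO."""
    parts = spec.strip().split(":")
    if not parts[0]:
        raise ValueError("src-key is the only required part and must not be empty")
    if len(parts) > 6:
        raise ValueError("too many parts in specification: %r" % spec)
    # Pad the missing trailing parts so all six positions can be unpacked.
    parts += [""] * (6 - len(parts))
    src, lowercase, store, tokenize, exclusive, dest = parts
    return IndexFieldSpec(
        src_key=src,
        lowercase=_flag(lowercase, True),   # default: lowercase = true
        store=_flag(store, True),           # default: store     = true
        tokenize=_flag(tokenize, False),    # default: tokenize  = false
        exclusive=_flag(exclusive, True),   # default: exclusive = true
        dest_key=dest or src,               # default: dest-key  = src-key
    )


def parse_index_specs(value: str) -> List[IndexFieldSpec]:
    """Parse a whole property value (assumed to be whitespace-separated)."""
    return [parse_index_spec(token) for token in value.split()]
```

Under these assumptions, "url:false:true:true" yields a field that is
not lowercased, is stored and tokenized, is exclusive, and keeps "url"
as its destination key, while a bare "collection" takes every default.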

