From: <bi...@us...> - 2008-07-03 18:28:48
Revision: 2400
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2400&view=rev
Author:   binzino
Date:     2008-07-03 11:28:55 -0700 (Thu, 03 Jul 2008)

Log Message:
-----------
Added info on new configuration properties.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-07-03 02:03:41 UTC (rev 2399)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-07-03 18:28:55 UTC (rev 2400)
@@ -120,12 +120,13 @@
 
 to
 
-  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic
+  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
 
 In short, we add:
 
   index-nutchwax
   query-nutchwax
+  urlfilter-nutchwax
   parse-pdf
 
 and remove:
@@ -136,19 +137,37 @@
 The only *required* changes are the additions of the NutchWAX index
 and query plugins. The rest are optional, but recommended.
 
-The addition of the "parse-pdf" plugin is simply because we have lots
-of PDFs in our archives and we want to index them. We sometimes
-remove the "parse-js" plugin if we don't care to index JavaScript
-files.
+The "parse-pdf" plugin is added simply because we have lots of PDFs in
+our archives and we want to index them. We sometimes remove the
+"parse-js" plugin if we don't care to index JavaScript files.
 
-We also remove the URL filtering and normalizing plugins because we do
-not need the URLs normalized nor filtered. We trust that the tool
-that produced the ARC/WARC file will have normalized the URLs
-contained therein according to its own rules so there's no need to
-normalize here. Also, we don't filter by URL since we want to index
-as much of the ARC/WARC file as we have parsers for.
+We also remove the default Nutch URL filtering and normalizing plugins
+because we do not need the URLs normalized nor filtered. We trust
+that the tool that produced the ARC/WARC file will have normalized the
+URLs contained therein according to its own rules so there's no need
+to normalize here. Also, we don't filter by URL since we want to
+index as much of the ARC/WARC file as we have parsers for.
 
+We do, however, add the NutchWAX URL filter. If de-duplication is
+being performed upon import, this plugin is required. It performs URL
+filtering of the list of ARC records to exclude based on
+URL+digest+date.
+
 --------------------------------------------------
+indexingfilter.order
+--------------------------------------------------
+
+Add this property with a value of
+
+  org.apache.nutch.indexer.basic.BasicIndexingFilter
+  org.archive.nutchwax.index.ConfigurableIndexingFilter
+
+So that the NutchWAX indexing filter is run after the Nutch basic
+indexing filter.
+
+A full explanation is given in "README-dedup.txt".
+
+--------------------------------------------------
 mime.type.magic
 --------------------------------------------------
 We disable mimetype detection in Nutch for two reasons:
@@ -172,12 +191,12 @@
 nutchwax.filter.index
 --------------------------------------------------
 Configure the 'index-nutchwax' plugin. Specify how the metadata
-fields added by the ArcsToSegment are mapped to the Lucene documents
-during indexing.
+fields added by the Importer are mapped to the Lucene documents during
+indexing.
 
 The specifications here are of the form:
 
-  src-key:lowercase:store:tokenize:dest-key
+  src-key:lowercase:store:tokenize:exclusive:dest-key
 
 where the only required part is the "src-key", the rest will assume
 the following defaults:
@@ -185,6 +204,7 @@
   lowercase = true
   store = true
   tokenize = false
+  exclusive = true
   dest-key = src-key
 
 We recommend:
@@ -192,6 +212,9 @@
 <property>
   <name>nutchwax.filter.index</name>
   <value>
+    url:false:true:true
+    orig:false
+    digest:false
     arcname:false
     collection
     date
@@ -199,39 +222,50 @@
   </value>
 </property>
 
+The "url", "orig" and "digest" values are required, the rest are
+optional, but strongly recommended.
+
 --------------------------------------------------
 nutchwax.filter.query
 --------------------------------------------------
 Configure the 'query-nutchwax' plugin. Specify which fields to make
-searchable via "[field]:[term|phrase]" query syntax, and whether they
+searchable via "field:[term|phrase]" query syntax, and whether they
 are "raw" fields or not.
 
-The specification format is
+The specification format is one of:
 
-  raw:name:lowercase:boost
-or
-  field:name:boost
+  field:<name>:<boost>
+  raw:<name>:<lowercase>:<boost>
+  group:<name>:<lowercase>:<delimiter>:<boost>
 
 Default values are
 
   lowercase = true
+  delimiter = ","
   boost = 1.0f
 
 There is no "lowercase" property for "field" specification because the
 Nutch FieldQueryFilter doesn't expose the option, unlike the
 RawFieldQueryFilter.
 
-NTOE: We do *not* use this filter for handling "date" queries, there is a
-specific filter for that: DateQueryFilter
+The "group" fields are raw fields that can accept multiple values,
+separated by a delimiter. Multiple values appearing in a query are
+automagically translated into required OR-groups, such as
 
+  collection:"193,221,36" => +(collection:193 collection:221 collection:36)
+
+NOTE: We do *not* use this filter for handling "date" queries, there
+is a specific filter for that: DateQueryFilter
+
 We recommend:
 
 <property>
   <name>nutchwax.filter.query</name>
   <value>
+    raw:digest:false
     raw:arcname:false
-    raw:collection
-    raw:type
+    group:collection
+    group:type
     field:anchor
     field:content
     field:host
@@ -240,6 +274,52 @@
 </property>
 
 
+--------------------------------------------------
+nutchwax.urlfilter.wayback.exclusions
+--------------------------------------------------
+File containing the exclusion list for importing.
+
+Normally, this is specified on the command line with the NutchWAX
+Importer is invoked. It can be specified here if preferred.
+
+--------------------------------------------------
+nutchwax.urlfilter.wayback.canonicalizer
+--------------------------------------------------
+
+For CDX-based de-duplication, the same URL canonicalization algorithm
+must be used here as was used to generate the CDX files.
+
+The default canonicalizer in Wayback's '(w)arc-indexer' utility
+is
+
+  org.archive.wayback.util.url.AggressiveUrlCanonicalizer
+
+which is the value provided in "nutch-site.xml".
+
+If the '(w)arc-indexer' is executed with the "-i" (identity)
+command-line option, then the matching canonicalizer
+
+  org.archive.wayback.util.url.IdentityUrlCanonicalizer
+
+must be specified here.
+
+--------------------------------------------------
+nutchwax.import.content.limit
+--------------------------------------------------
+Similar to Nutch's
+
+  file.content.limit
+  http.content.limit
+  ftp.content.limit
+
+properties, this specifies a limit on the size of a document imported
+via NutchWAX.
+
+We recommend setting this to a size compatible with the memory
+capacity of the computers performing the import. Something in the
+1-4MB range is typical.
+
+
 ======================================================================
 Create a manifest
 ======================================================================
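
Editorial note on the diff above: the "nutchwax.filter.index" specification format and the "group" query translation it documents can be illustrated with a short Python sketch. This is not NutchWAX code. The names IndexFieldSpec, parse_index_spec and expand_group_query are hypothetical, and two behaviours are assumptions: empty parts fall back to the documented defaults, and only the literal text "true" (in any case) counts as true.

```python
from typing import NamedTuple


class IndexFieldSpec(NamedTuple):
    """One parsed 'nutchwax.filter.index' entry (hypothetical helper type)."""
    src_key: str
    lowercase: bool
    store: bool
    tokenize: bool
    exclusive: bool
    dest_key: str


def parse_index_spec(spec: str) -> IndexFieldSpec:
    """Parse 'src-key:lowercase:store:tokenize:exclusive:dest-key'.

    Only src-key is required; the other parts take the defaults listed
    in the HOWTO (true, true, false, true, dest-key = src-key).
    """
    parts = spec.strip().split(":")
    if not parts[0]:
        raise ValueError("src-key is required")

    def flag(position: int, default: bool) -> bool:
        # Assumption: a missing or empty part falls back to the default.
        if position < len(parts) and parts[position]:
            return parts[position].lower() == "true"
        return default

    dest_key = parts[5] if len(parts) > 5 and parts[5] else parts[0]
    return IndexFieldSpec(
        parts[0], flag(1, True), flag(2, True), flag(3, False), flag(4, True), dest_key
    )


def expand_group_query(field: str, value: str, delimiter: str = ",") -> str:
    """Rewrite a delimited 'group' value as a required OR-group."""
    terms = [term for term in value.split(delimiter) if term]
    return "+(" + " ".join(f"{field}:{term}" for term in terms) + ")"


if __name__ == "__main__":
    print(parse_index_spec("url:false:true:true"))
    print(expand_group_query("collection", "193,221,36"))
```

With the recommended entry "url:false:true:true", the sketch leaves "exclusive" and "dest-key" at their defaults, and the group expansion reproduces the collection example given in the HOWTO text.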