From: <bi...@us...> - 2008-06-29 00:20:07
Revision: 2343
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2343&view=rev
Author:   binzino
Date:     2008-06-28 17:20:16 -0700 (Sat, 28 Jun 2008)

Log Message:
-----------
Changed "archive-digest" to "digest" to match changes in NutchWax code.
Added "exclusive" property to ConfigurableIndexingFilter config.
Added explicit ordering of index filters so that ours is called last so it can over-write metadata values: url, orig, digest.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml

Modified: trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml	2008-06-29 00:17:48 UTC (rev 2342)
+++ trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml	2008-06-29 00:20:16 UTC (rev 2343)
@@ -10,21 +10,32 @@
   <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. -->
   <!-- Also, add 'parse-pdf' -->
   <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' -->
-  <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax</value>
+  <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax</value>
 </property>
 
 <property>
+  <name>indexingfilter.order</name>
+  <value>
+    org.apache.nutch.indexer.basic.BasicIndexingFilter
+    org.archive.nutchwax.index.ConfigurableIndexingFilter
+  </value>
+</property>
+
+<property>
   <!-- Configure the 'index-nutchwax' plugin.  Specify how the metadata fields added by the ArcsToSegment are mapped to the Lucene documents during indexing.
        The specifications here are of the form "src-key:lowercase:store:tokenize:dest-key"
        Where the only required part is the "src-key", the rest will assume the following defaults:
           lowercase = true
           store = true
           tokenize = false
+          exclusive = true
           dest-key = src-key
     -->
   <name>nutchwax.filter.index</name>
   <value>
-    archive-digest:false
+    url:false:true:true
+    orig:false
+    digest:false
     arcname:false
     collection
     date
@@ -46,7 +57,7 @@
   <!-- We do *not* use this filter for handling "date" queries, there is a specific filter for that: DateQueryFilter -->
   <name>nutchwax.filter.query</name>
   <value>
-    raw:archive-digest:false
+    raw:digest:false
     raw:arcname:false
     group:collection
     group:type
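
For readers unfamiliar with the "nutchwax.filter.index" value format, the following is a minimal sketch of how one colon-separated specification could be turned into settings using the defaults listed in the config comment. It is not the actual ConfigurableIndexingFilter code: the class name FieldSpec and its methods are hypothetical, and the position of the new "exclusive" flag (assumed here to sit between "tokenize" and "dest-key", following the order of the defaults list) is an assumption, since the format string in the comment does not show it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one parsed entry of nutchwax.filter.index. */
final class FieldSpec {
    final String srcKey;
    final boolean lowercase;
    final boolean store;
    final boolean tokenize;
    final boolean exclusive;
    final String destKey;

    private FieldSpec(String srcKey, boolean lowercase, boolean store,
                      boolean tokenize, boolean exclusive, String destKey) {
        this.srcKey = srcKey;
        this.lowercase = lowercase;
        this.store = store;
        this.tokenize = tokenize;
        this.exclusive = exclusive;
        this.destKey = destKey;
    }

    /** Parses one entry such as "url:false:true:true"; omitted parts take the documented defaults. */
    static FieldSpec parse(String spec) {
        String[] parts = spec.trim().split(":");
        if (parts.length == 0 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("src-key is required: '" + spec + "'");
        }
        String srcKey = parts[0];
        boolean lowercase = flag(parts, 1, true);
        boolean store = flag(parts, 2, true);
        boolean tokenize = flag(parts, 3, false);
        // ASSUMPTION: "exclusive" is the fifth part; the commit only documents its default.
        boolean exclusive = flag(parts, 4, true);
        String destKey = (parts.length > 5 && !parts[5].isEmpty()) ? parts[5] : srcKey;
        return new FieldSpec(srcKey, lowercase, store, tokenize, exclusive, destKey);
    }

    /** Returns the boolean at index i, or the default when that part is absent or empty. */
    private static boolean flag(String[] parts, int i, boolean dflt) {
        return (i < parts.length && !parts[i].isEmpty()) ? Boolean.parseBoolean(parts[i]) : dflt;
    }

    /** Parses a whole whitespace-separated property value, one entry per token. */
    static List<FieldSpec> parseAll(String value) {
        List<FieldSpec> specs = new ArrayList<>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                specs.add(parse(token));
            }
        }
        return specs;
    }
}
```

Under these assumptions, FieldSpec.parse("url:false:true:true") yields a field that is not lowercased, is stored and is tokenized, while a bare entry such as "date" takes every default and keeps "date" as its destination key.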