From: <bi...@us...> - 2009-10-27 23:00:58
|
Revision: 2843 http://archive-access.svn.sourceforge.net/archive-access/?rev=2843&view=rev Author: binzino Date: 2009-10-27 23:00:46 +0000 (Tue, 27 Oct 2009) Log Message: ----------- Ported changes/fixes from NW 0.12.9. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-10-27 22:52:46 UTC (rev 2842) +++ trunk/archive-access/projects/nutchwax/archive/src/nutch/conf/nutch-site.xml 2009-10-27 23:00:46 UTC (rev 2843) @@ -10,19 +10,18 @@ <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. --> <!-- Also, add 'parse-pdf' --> <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' --> - <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value> + <value>protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-nutchwax|urlfilter-nutchwax</value> </property> -<!-- The indexing filter order *must* be specified in order for - NutchWAX's ConfigurableIndexingFilter to be called *after* the - BasicIndexingFilter. This is necessary so that the - ConfigurableIndexingFilter can over-write some of the values put - into the Lucene document by the BasicIndexingFilter. - - The over-written values are the 'url' and 'digest' fields, which - NutchWAX needs to handle specially in order for de-duplication to - work properly. - --> +<!-- + When using *only* the 'index-nutchwax' in 'plugin.includes' above, + we don't need to specify an order since there is only one plugin. + + However, if you choose to use the Nutch 'index-basic', then you have + to specify the order such that the NutchWAX ConfigurableIndexingFilter + is after it. Whichever plugin comes last over-writes the values + of those that come before it. + <property> <name>indexingfilter.order</name> <value> @@ -30,29 +29,31 @@ org.archive.nutchwax.index.ConfigurableIndexingFilter </value> </property> + --> <property> <!-- Configure the 'index-nutchwax' plugin. Specify how the metadata fields added by the Importer are mapped to the Lucene documents during indexing. - The specifications here are of the form "src-key:lowercase:store:tokenize:dest-key" + The specifications here are of the form "src-key:lowercase:store:index:dest-key" Where the only required part is the "src-key", the rest will assume the following defaults: lowercase = true store = true - tokenize = false + index = tokenized exclusive = true dest-key = src-key --> <name>nutchwax.filter.index</name> <value> - url:false:true:true - url:false:true:false:true:exacturl - orig:false - digest:false - filename:false - fileoffset:false - collection - date - type - length + title:false:true:tokenized + content:false:false:tokenized + site:false:false:untokenized + + url:false:true:tokenized + digest:false:true:no + + collection:true:true:no_norms + date:true:true:no_norms + type:true:true:no_norms + length:false:true:no </value> </property> @@ -70,15 +71,10 @@ <!-- We do *not* use this filter for handling "date" queries, there is a specific filter for that: DateQueryFilter --> <name>nutchwax.filter.query</name> <value> - raw:digest:false - raw:filename:false - raw:fileoffset:false - raw:exacturl:false group:collection + group:site:false group:type - field:anchor field:content - field:host field:title </value> </property> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |