From: <bi...@us...> - 2008-07-01 22:41:48
Revision: 2346
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2346&view=rev
Author:   binzino
Date:     2008-07-01 15:41:57 -0700 (Tue, 01 Jul 2008)

Log Message:
-----------
Added nutchwax.import.content.limit property.  And more comments.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml

Modified: trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml  2008-06-30 20:38:36 UTC (rev 2345)
+++ trunk/archive-access/projects/nutchwax/archive/conf/nutch-site.xml  2008-07-01 22:41:57 UTC (rev 2346)
@@ -13,6 +13,16 @@
   <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax</value>
 </property>
 
+<!-- The indexing filter order *must* be specified in order for
+     NutchWAX's ConfigurableIndexingFilter to be called *after* the
+     BasicIndexingFilter.  This is necessary so that the
+     ConfigurableIndexingFilter can over-write some of the values put
+     into the Lucene document by the BasicIndexingFilter.
+
+     The over-written values are the 'url' and 'digest' fields, which
+     NutchWAX needs to handle specially in order for de-duplication to
+     work properly.
+  -->
 <property>
   <name>indexingfilter.order</name>
   <value>
@@ -78,16 +88,38 @@
   <description>Defines if the mime content type detector uses magic resolution.</description>
 </property>
 
+<!-- Normally, this is specified on the command line when the NutchWAX
+     Importer is invoked.  It can be specified here if the user
+     prefers.
+  -->
 <property>
   <name>nutchwax.urlfilter.wayback.exclusions</name>
   <value></value>
   <description>Path to file containing list of exclusions.</description>
 </property>
 
+<!-- For CDX-based de-duplication to work properly, you must use the
+     same Wayback URLCanonicalizer that is used by the "(w)arc-indexer"
+     utility.  By default, this is AggressiveUrlCanonicalizer, but
+     could be IdentityCanonicalizer if you use the "-i" (identity) option
+     with "(w)arc-indexer".
+  -->
 <property>
   <name>nutchwax.urlfilter.wayback.canonicalizer</name>
   <value>org.archive.wayback.util.url.AggressiveUrlCanonicalizer</value>
   <description></description>
 </property>
 
+<!-- Similar to Nutch's
+       file.content.limit
+       http.content.limit
+       ftp.content.limit
+     properties, this specifies a limit on the size of a document
+     imported via NutchWAX.
+  -->
+<property>
+  <name>nutchwax.import.content.limit</name>
+  <value>1048576</value>
+</property>
+
 </configuration>
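For illustration only (this is not part of the commit): a site that needs to import documents larger than the 1048576 shipped above could presumably override the property in its own nutch-site.xml. The sketch below assumes the value is a byte count, by analogy with Nutch's *.content.limit properties that the new comment refers to; the 10 MiB figure is an arbitrary example, not a recommended setting.

```xml
<!-- Hypothetical override: raise the NutchWAX import limit from
     1048576 bytes (1 MiB) to 10485760 bytes (10 MiB).
     Assumes the value is interpreted as a size in bytes. -->
<property>
  <name>nutchwax.import.content.limit</name>
  <value>10485760</value>
</property>
```

How documents that exceed the limit are handled is not described in this commit, so it is worth checking the Importer's behavior before relying on a larger value.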