From: <bi...@us...> - 2008-07-28 19:40:56
Revision: 2507
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2507&view=rev
Author:   binzino
Date:     2008-07-28 19:41:05 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
Updated for 0.12.1 release.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-07-28 19:34:33 UTC (rev 2506)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-07-28 19:41:05 UTC (rev 2507)
@@ -213,12 +213,15 @@
   <name>nutchwax.filter.index</name>
   <value>
     url:false:true:true
+    url:flase:true:false:true:exacturl
     orig:false
     digest:false
-    arcname:false
+    filename:false
+    fileoffset:false
     collection
     date
     type
+    length
   </value>
 </property>
 
@@ -263,7 +266,9 @@
   <name>nutchwax.filter.query</name>
   <value>
     raw:digest:false
-    raw:arcname:false
+    raw:filename:false
+    raw:fileoffset:false
+    raw:exacturl:false
     group:collection
     group:type
     field:anchor
@@ -304,6 +309,62 @@
 must be specified here.
 
 --------------------------------------------------
+nutchwax.filter.http.status
+--------------------------------------------------
+This property configures a filter with a list of ranges
+of HTTP status codes to allow.
+
+Typically, most NutchWAX implementors do not wish to import and index
+404, 500, 302 and other non-success pages. This is an inclusion
+filter, meaning that only ARC records with an HTTP status code
+matching any of the values will be imported.
+
+There is a special "unknown" value which can be used to include ARC
+records that don't have an HTTP status code (for whatever reason).
+
+The default setting provided in nutch-site.xml is to allow any 2XX
+success code:
+
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200-299
+    </value>
+  </property>
+
+But some other examples are:
+
+  Allow any 2XX success code *and* redirects, use:
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200-299
+      300-399
+    </value>
+  </property>
+
+  Be really strict about only certain codes, use:
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200
+      301
+      302
+      304
+    </value>
+  </property>
+
+  Mix of ranges and specific codes, including the "unknown"
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      Unknown
+      200
+      300-399
+    </value>
+  </property>
+
+--------------------------------------------------
 nutchwax.import.content.limit
 --------------------------------------------------
 Similar to Nutch's

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
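
The inclusion semantics that the HOWTO text in this commit describes for
nutchwax.filter.http.status (single codes, ranges, and the special "unknown"
value) can be sketched roughly as below. This is a minimal illustration under
stated assumptions, not NutchWAX's actual code: the class name
HttpStatusFilter, its parse and allows methods, and the case-insensitive
handling of "unknown" are all hypothetical and chosen only for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not NutchWAX's actual filter class.
final class HttpStatusFilter {
    // Each entry is an inclusive {low, high} pair; a single code has low == high.
    private final List<int[]> ranges = new ArrayList<>();
    private boolean allowUnknown = false;

    // Parses a whitespace-separated property value such as "Unknown 200 300-399".
    static HttpStatusFilter parse(String value) {
        HttpStatusFilter filter = new HttpStatusFilter();
        for (String token : value.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty property value
            }
            // Assumption: "unknown" is matched without regard to case.
            if (token.equalsIgnoreCase("unknown")) {
                filter.allowUnknown = true;
                continue;
            }
            int dash = token.indexOf('-');
            int low = Integer.parseInt(dash < 0 ? token : token.substring(0, dash));
            int high = dash < 0 ? low : Integer.parseInt(token.substring(dash + 1));
            filter.ranges.add(new int[] { low, high });
        }
        return filter;
    }

    // status is null when the record carries no HTTP status code.
    // Inclusion filter: a record passes only if some configured value matches.
    boolean allows(Integer status) {
        if (status == null) {
            return allowUnknown;
        }
        for (int[] range : ranges) {
            if (status >= range[0] && status <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```

With the default value "200-299", such a filter would accept a 200 record and
reject 302, 404, and records with no status code. Adding "unknown" to the value
is what lets status-less records through.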