From: <bi...@us...> - 2008-07-28 19:40:56
Revision: 2507
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2507&view=rev
Author:   binzino
Date:     2008-07-28 19:41:05 +0000 (Mon, 28 Jul 2008)

Log Message:
-----------
Updated for 0.12.1 release.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-07-28 19:34:33 UTC (rev 2506)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-07-28 19:41:05 UTC (rev 2507)
@@ -213,12 +213,15 @@
   <name>nutchwax.filter.index</name>
   <value>
     url:false:true:true
+    url:flase:true:false:true:exacturl
     orig:false
     digest:false
-    arcname:false
+    filename:false
+    fileoffset:false
     collection
     date
     type
+    length
   </value>
 </property>

@@ -263,7 +266,9 @@
   <name>nutchwax.filter.query</name>
   <value>
     raw:digest:false
-    raw:arcname:false
+    raw:filename:false
+    raw:fileoffset:false
+    raw:exacturl:false
     group:collection
     group:type
     field:anchor
@@ -304,6 +309,62 @@
 must be specified here.

 --------------------------------------------------
+nutchwax.filter.http.status
+--------------------------------------------------
+This property configures a filter with a list of ranges
+of HTTP status codes to allow.
+
+Typically, most NutchWAX implementors do not wish to import and index
+404, 500, 302 and other non-success pages. This is an inclusion
+filter, meaning that only ARC records with an HTTP status code
+matching any of the values will be imported.
+
+There is a special "unknown" value which can be used to include ARC
+records that don't have an HTTP status code (for whatever reason).
+
+The default setting provided in nutch-site.xml is to allow any 2XX
+success code:
+
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200-299
+    </value>
+  </property>
+
+But some other examples are:
+
+  Allow any 2XX success code *and* redirects, use:
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200-299
+      300-399
+    </value>
+  </property>
+
+  Be really strict about only certain codes, use:
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200
+      301
+      302
+      304
+    </value>
+  </property>
+
+  Mix of ranges and specific codes, including the "unknown"
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      Unknown
+      200
+      300-399
+    </value>
+  </property>
+
+--------------------------------------------------
 nutchwax.import.content.limit
 --------------------------------------------------
 Similar to Nutch's

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
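
The inclusion filter described in the HOWTO text added by r2507 can be
sketched as follows. This is only an illustrative sketch in Python, not
NutchWAX's implementation (NutchWAX itself is Java). The function names
parse_status_filter and status_allowed are hypothetical, a record without
a status code is represented here as None, and matching "unknown"
case-insensitively is an assumption made because the HOWTO text writes
both "unknown" and "Unknown".

```python
def parse_status_filter(value):
    """Parse a whitespace-separated filter value into (ranges, allow_unknown).

    Each token is a single code ("200"), an inclusive range ("300-399"),
    or the special word "unknown".
    """
    ranges = []
    allow_unknown = False
    for token in value.split():
        if token.lower() == "unknown":  # assumption: case-insensitive match
            allow_unknown = True
        elif "-" in token:
            low, high = token.split("-", 1)
            ranges.append((int(low), int(high)))
        else:
            code = int(token)
            ranges.append((code, code))  # a single code is a one-element range
    return ranges, allow_unknown


def status_allowed(status, ranges, allow_unknown):
    """Return True if a record with this status passes the inclusion filter.

    status is an int, or None for a record that has no HTTP status code.
    """
    if status is None:
        return allow_unknown
    return any(low <= status <= high for low, high in ranges)


# The "mix of ranges and specific codes" example from the HOWTO text.
ranges, allow_unknown = parse_status_filter("""
    Unknown
    200
    300-399
""")
```

With that value, status_allowed(302, ranges, allow_unknown) is true and
status_allowed(404, ranges, allow_unknown) is false. Under the default
"200-299" value, a record with no status code would be rejected.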
From: <bi...@us...> - 2010-03-18 22:11:07
Revision: 2977
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2977&view=rev
Author:   binzino
Date:     2010-03-18 22:10:35 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2010-03-18 21:55:45 UTC (rev 2976)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2010-03-18 22:10:35 UTC (rev 2977)
@@ -1,17 +1,18 @@

 HOWTO.txt
-2008-07-28
+2010-02-13
 Aaron Binns

 Table of Contents
   o Prerequisites
     - NutchWAX installation
     - ARC/WARC files
-  o Create a manifest
-  o Import, Invert and Index
-  o Search
-  o Web deployment
-    - Don't forget to config & patch again
+  o Build index
+    - Stand-alone
+    - Hadoop
+  o Search index
+    - Single server
+    - Master/slave servers

 ======================================================================
 Prerequisites
@@ -26,7 +27,7 @@

 This HOWTO assumes it is installed in

-  /opt/nutchwax-0.12.4
+  /opt/nutchwax-0.13

 2. ARC/WARC files.

@@ -60,32 +61,28 @@


 ======================================================================
-Import, Invert and Index
+Build Index
 ======================================================================

-The steps to import the files, invert the link and index the documents
-are rather simple:
+Building the index consists of two required steps with one recommended
+optional step.

-  $ mkdir crawl
-  $ cd crawl
-  $ /opt/nutchwax-0.12.4/bin/nutchwax import ../manifest
-  $ /opt/nutchwax-0.12.4/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutchwax-0.12.4/bin/nutch invertlinks linkdb -dir segments
-  $ /opt/nutchwax-0.12.4/bin/nutch index indexes crawldb linkdb segments/*
-  $ ls -F1
-  crawldb/
-  indexes/
-  linkdb/
-  segments/
+  1. Import
+  2. Index
+  3. Pagerank (optional)

-To those already familiar with Nutch, these steps should be quite
-familiar.
+Performing these steps using the 'nutchwax' command-line driver
+are rather straightforward:

-The first step, we call NutchWAX's "import" command which creates the
-Nutch segment containing the documents in the ARC/WARC files listed in
-the manifest. The rest is the same as regular Nutch.
+  $ /opt/nutchwax-0.13/bin/nutchwax import manifest.txt
+  $ /opt/nutchwax-0.13/bin/nutchwax index indexes segments/*
+  $ /opt/nutchwax-0.13/bin/nutchwax merge index indexes
+  $ /opt/nutchwax-0.13/bin/nutchwax pagerankdb pagerankdb segments/*
+  $ /opt/nutchwax-0.13/bin/nutchwax pageranker ranks.txt pagerankdb
+  $ /opt/nutchwax-0.13/bin/nutchwax reboost ranks.txt index


+
 ======================================================================
 Search
 ======================================================================
@@ -96,9 +93,9 @@
   $ cd ../
   $ ls -F1
   crawl/
-  $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer
+  $ /opt/nutchwax-0.13/bin/nutchwax search computer

-This calls the NutchBean to execute a simple keyword search for
+This calls the NutchWaxBean to execute a simple keyword search for
 "computer". Use whatever query term you think appears in the
 documents you imported.

@@ -109,7 +106,7 @@

 The Nutch(WAX) web application is bundled with NutchWAX as

-  /opt/nutchwax-0.12.4/nutch-1.0-dev.war
+  /opt/nutchwax-0.13/nutch-1.0-dev.war

 Simply deploy that web application in the same fashion as with Nutch.
