From: <bi...@us...> - 2010-03-18 22:11:07
Revision: 2977
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2977&view=rev
Author:   binzino
Date:     2010-03-18 22:10:35 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2010-03-18 21:55:45 UTC (rev 2976)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2010-03-18 22:10:35 UTC (rev 2977)
@@ -1,17 +1,18 @@
 
 HOWTO.txt
-2008-07-28
+2010-02-13
 Aaron Binns
 
 Table of Contents
   o Prerequisites
     - NutchWAX installation
     - ARC/WARC files
-  o Create a manifest
-  o Import, Invert and Index
-  o Search
-  o Web deployment
-    - Don't forget to config & patch again
+  o Build index
+    - Stand-alone
+    - Hadoop
+  o Search index
+    - Single server
+    - Master/slave servers
 
 ======================================================================
 Prerequisites
@@ -26,7 +27,7 @@
 
 This HOWTO assumes it is installed in
 
-    /opt/nutchwax-0.12.4
+    /opt/nutchwax-0.13
 
 2. ARC/WARC files.
 
@@ -60,32 +61,28 @@
 
 
 ======================================================================
-Import, Invert and Index
+Build Index
 ======================================================================
 
-The steps to import the files, invert the link and index the documents
-are rather simple:
+Building the index consists of two required steps with one recommended
+optional step.
 
-  $ mkdir crawl
-  $ cd crawl
-  $ /opt/nutchwax-0.12.4/bin/nutchwax import ../manifest
-  $ /opt/nutchwax-0.12.4/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutchwax-0.12.4/bin/nutch invertlinks linkdb -dir segments
-  $ /opt/nutchwax-0.12.4/bin/nutch index indexes crawldb linkdb segments/*
-  $ ls -F1
-  crawldb/
-  indexes/
-  linkdb/
-  segments/
+  1. Import
+  2. Index
+  3. Pagerank (optional)
 
-To those already familiar with Nutch, these steps should be quite
-familiar.
+Performing these steps using the 'nutchwax' command-line driver
+are rather straightforward:
 
-The first step, we call NutchWAX's "import" command which creates the
-Nutch segment containing the documents in the ARC/WARC files listed in
-the manifest. The rest is the same as regular Nutch.
+  $ /opt/nutchwax-0.13/bin/nutchwax import manifest.txt
+  $ /opt/nutchwax-0.13/bin/nutchwax index indexes segments/*
+  $ /opt/nutchwax-0.13/bin/nutchwax merge index indexes
+  $ /opt/nutchwax-0.13/bin/nutchwax pagerankdb pagerankdb segments/*
+  $ /opt/nutchwax-0.13/bin/nutchwax pageranker ranks.txt pagerankdb
+  $ /opt/nutchwax-0.13/bin/nutchwax reboost ranks.txt index
+
 ======================================================================
 Search
 ======================================================================
@@ -96,9 +93,9 @@
   $ cd ../
   $ ls -F1
   crawl/
-  $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer
+  $ /opt/nutchwax-0.13/bin/nutchwax search computer
 
-This calls the NutchBean to execute a simple keyword search for
+This calls the NutchWaxBean to execute a simple keyword search for
 "computer". Use whatever query term you think appears in the
 documents you imported.
 
@@ -109,7 +106,7 @@
 
 The Nutch(WAX) web application is bundled with NutchWAX as
 
-    /opt/nutchwax-0.12.4/nutch-1.0-dev.war
+    /opt/nutchwax-0.13/nutch-1.0-dev.war
 
 Simply deploy that web application in the same fashion as with Nutch.
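
For readers who want to run the revised "Build Index" sequence in one go,
the six commands added in this revision could be wrapped roughly as
follows. This is only a sketch, not part of the commit: the names
NUTCHWAX_HOME, DRY_RUN, run and build_index are invented for the example,
and the split into "import/index/merge" versus "pagerank" is one reading
of the HOWTO's step list. The nutchwax subcommands and arguments are
copied unchanged from the diff. By default the script only prints the
commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: wraps the six 'nutchwax' commands from the revised HOWTO.
# NUTCHWAX_HOME, DRY_RUN, run and build_index are hypothetical names.

NUTCHWAX_HOME="${NUTCHWAX_HOME:-/opt/nutchwax-0.13}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = actually run them

# Print or execute one nutchwax subcommand; stop at the first failure.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$NUTCHWAX_HOME/bin/nutchwax $*"
    else
        "$NUTCHWAX_HOME/bin/nutchwax" "$@" || exit 1
    fi
}

build_index() {
    manifest="$1"

    # Import the ARC/WARC files listed in the manifest, then index and merge.
    run import "$manifest"
    run index indexes segments/*
    run merge index indexes

    # Pagerank commands (the HOWTO marks this step as optional).
    run pagerankdb pagerankdb segments/*
    run pageranker ranks.txt pagerankdb
    run reboost ranks.txt index
}

build_index "${1:-manifest.txt}"
```

Running it with no arguments in dry-run mode lists the commands in the
order the HOWTO gives them, which makes it easy to check the sequence
before pointing it at real ARC/WARC data.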