[Archive-access-cvs] SF.net SVN: archive-access:[2977] trunk/archive-access/projects/nutchwax/ arch

Revision: 2977
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2977&view=rev
Author:   binzino
Date:     2010-03-18 22:10:35 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated for NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================

--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2010-03-18 21:55:45 UTC (rev 2976)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2010-03-18 22:10:35 UTC (rev 2977)
@@ -1,17 +1,18 @@
 
 HOWTO.txt
-2008-07-28
+2010-02-13
 Aaron Binns
 
 Table of Contents
  o Prerequisites
    - NutchWAX installation
    - ARC/WARC files
- o Create a manifest
- o Import, Invert and Index
- o Search
- o Web deployment
-   - Don't forget to config & patch again
+ o Build index
+   - Stand-alone
+   - Hadoop
+ o Search index
+   - Single server
+   - Master/slave servers
 
 ======================================================================
 Prerequisites
@@ -26,7 +27,7 @@
 
     This HOWTO assumes it is installed in
 
-      /opt/nutchwax-0.12.4
+      /opt/nutchwax-0.13
 
  2. ARC/WARC files.
 
@@ -60,32 +61,28 @@
 
 
 ======================================================================
-Import, Invert and Index
+Build Index
 ======================================================================
 
-The steps to import the files, invert the link and index the documents
-are rather simple:
+Building the index consists of two required steps with one recommended
+optional step.
 
-  $ mkdir crawl
-  $ cd crawl
-  $ /opt/nutchwax-0.12.4/bin/nutchwax import ../manifest
-  $ /opt/nutchwax-0.12.4/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutchwax-0.12.4/bin/nutch invertlinks linkdb  -dir segments
-  $ /opt/nutchwax-0.12.4/bin/nutch index indexes crawldb linkdb segments/*
-  $ ls -F1
-  crawldb/
-  indexes/
-  linkdb/
-  segments/
+  1. Import
+  2. Index
+  3. Pagerank  (optional)
 
-To those already familiar with Nutch, these steps should be quite
-familiar.
+Performing these steps using the 'nutchwax' command-line driver
+is rather straightforward:
 
-The first step, we call NutchWAX's "import" command which creates the
-Nutch segment containing the documents in the ARC/WARC files listed in
-the manifest.  The rest is the same as regular Nutch.
+  $ /opt/nutchwax-0.13/bin/nutchwax import     manifest.txt
+  $ /opt/nutchwax-0.13/bin/nutchwax index      indexes segments/*
+  $ /opt/nutchwax-0.13/bin/nutchwax merge      index   indexes
 
+  $ /opt/nutchwax-0.13/bin/nutchwax pagerankdb pagerankdb segments/*
+  $ /opt/nutchwax-0.13/bin/nutchwax pageranker ranks.txt  pagerankdb
+  $ /opt/nutchwax-0.13/bin/nutchwax reboost    ranks.txt  index
 
+
 ======================================================================
 Search
 ======================================================================
@@ -96,9 +93,9 @@
   $ cd ../
   $ ls -F1
   crawl/
-  $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer
+  $ /opt/nutchwax-0.13/bin/nutchwax search computer
 
-This calls the NutchBean to execute a simple keyword search for
+This calls the NutchWaxBean to execute a simple keyword search for
 "computer".  Use whatever query term you think appears in the
 documents you imported.
 
@@ -109,7 +106,7 @@
 
 The Nutch(WAX) web application is bundled with NutchWAX as
 
-  /opt/nutchwax-0.12.4/nutch-1.0-dev.war
+  /opt/nutchwax-0.13/nutch-1.0-dev.war
 
 Simply deploy that web application in the same fashion as with
 Nutch.
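
As a rough illustration of the revised "Build Index" section in the diff above, the two required steps and the optional pagerank step could be wrapped in a small shell helper. This is only a sketch: the nutchwax sub-commands and their arguments are copied from the HOWTO as committed, while NUTCHWAX_HOME, DRY_RUN, run, build_index and build_pagerank are hypothetical names introduced here and are not part of NutchWAX. By default the helper only prints the commands, so it can be tried without a NutchWAX installation.

```shell
#!/bin/sh
# Sketch only: wraps the HOWTO's build-index commands in a dry-run helper.
# NUTCHWAX_HOME, DRY_RUN, run, build_index and build_pagerank are
# hypothetical names for illustration; they are not provided by NutchWAX.

NUTCHWAX_HOME="${NUTCHWAX_HOME:-/opt/nutchwax-0.13}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, anything else = execute them

# Print the command when DRY_RUN is 1, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Required steps: import the ARC/WARC files listed in the manifest,
# index the resulting segments, then merge the indexes.
build_index() {
  manifest="$1"
  run "$NUTCHWAX_HOME/bin/nutchwax" import "$manifest"        || return 1
  run "$NUTCHWAX_HOME/bin/nutchwax" index  indexes segments/* || return 1
  run "$NUTCHWAX_HOME/bin/nutchwax" merge  index   indexes    || return 1
}

# Optional step: compute page ranks and re-boost the merged index.
build_pagerank() {
  run "$NUTCHWAX_HOME/bin/nutchwax" pagerankdb pagerankdb segments/* || return 1
  run "$NUTCHWAX_HOME/bin/nutchwax" pageranker ranks.txt  pagerankdb || return 1
  run "$NUTCHWAX_HOME/bin/nutchwax" reboost    ranks.txt  index      || return 1
}
```

After sourcing the file, "build_index manifest.txt" followed by "build_pagerank" prints the six commands in the order the HOWTO lists them; setting DRY_RUN=0 would execute them instead, assuming NutchWAX is installed under NUTCHWAX_HOME.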

