[Archive-access-cvs] SF.net SVN: archive-access:[2678] trunk/archive-access/projects/nutchwax/ arch


Revision: 2678
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2678&view=rev
Author:   binzino
Date:     2008-12-18 18:37:40 +0000 (Thu, 18 Dec 2008)

Log Message:
-----------
Updated documentation for 0.12.3 release.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
    trunk/archive-access/projects/nutchwax/archive/README.txt
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt

Added: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================

--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt	                        (rev 0)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -0,0 +1,392 @@
+
+BUILD-NOTES.txt
+2008-12-18
+Aaron Binns
+
+======================================================================
+Build notes
+======================================================================
+
+This document contains supplemental notes regarding the NutchWAX
+build, expanding upon the information found in the various READMEs and
+HOWTOs.
+
+======================================================================
+
+This 0.12.x release of NutchWAX is radically different in source-code
+form compared to the previous release, 0.10.
+
+One of the design goals of 0.12.x was to reduce or even eliminate the
+"copy/paste/edit" approach of 0.10.  The 0.10 (and prior) NutchWAX
+releases had to copy/paste/edit large chunks of Nutch source code in
+order to add the NutchWAX features.
+
+Also, the NutchWAX 0.12.x sources and build are designed to one day be
+added into mainline Nutch as a proper "contrib" package; then
+eventually be fully integrated into the core Nutch source code.
+
+======================================================================
+
+Most of the NutchWAX source code is relatively straightforward to those
+already familiar with the inner workings of Nutch.  Still, one class
+deserves special attention:
+
+  src/java/org/archive/nutchwax/Importer.java
+
+This is where ARC/WARC files are read and their documents are imported
+into a Nutch segment.
+
+It is inspired by:
+
+  nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+
+on the Nutch SVN head.
+
+Our implementation differs in a few important ways:
+
+  o Rather than taking a directory with ARC files as input, we take a
+    manifest file with URLs to ARC files.  This way, the manifest is
+    split up among the distributed Hadoop jobs and the ARC files are
+    processed in whole by each worker.
+
+    In the Nutch SVN, the ArcSegmentCreator.java expects the input
+    directory to contain the ARC files and (AFAICT) splits them up and
+    distributes them across the Hadoop workers.
+
+  o We use the standard Internet Archive ARCReader and WARCReader
+    classes.  Thus, NutchWAX can read both ARC and WARC files, whereas
+    the ArcSegmentCreator class can only read ARC files.
+
+  o We add metadata fields to the document, which are then available
+    to the "index-nutchwax" plugin at indexing-time.
+
+    Importer.importRecord()
+      ...
+      contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype()          );
+      contentMetadata.set( NutchWax.ARCNAME_KEY,      meta.getArcFile().getName() );
+      contentMetadata.set( NutchWax.COLLECTION_KEY,   collectionName              );
+      contentMetadata.set( NutchWax.DATE_KEY,         meta.getDate()              );
+      ...
+
+
+======================================================================
+Patching
+======================================================================
+
+When NutchWAX is built, a number of patches are automatically applied
+to the Nutch source and configuration files.
+
+----------------------------------------------------------------------
+The file
+
+  /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+
+contains two errors: one where a mimetype is referenced before it is
+defined; and a second where a definition has an illegal character.
+
+These errors cause Nutch to not recognize certain mimetypes and
+therefore to ignore documents matching those mimetypes.
+
+There are two fixes:
+
+ 1. Move
+
+	<mime-type type="application/xml">
+		<alias type="text/xml" />
+		<glob pattern="*.xml" />
+	</mime-type>
+
+    definition higher up in the file, before the reference to it.
+
+ 2. Remove
+
+	<mime-type type="application/x-ms-dos-executable">
+		<alias type="application/x-dosexec;exe" />
+	</mime-type>
+
+    as the ';' character is illegal according to the comments in the
+    Nutch code.
+
+You can either apply these patches yourself, or copy an already-patched
+copy from:
+
+  /opt/nutchwax-0.12.3/contrib/archive/conf/tika-mimetypes.xml
+
+to 
+
+  /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+
+----------------------------------------------------------------------
+
+In the file 'conf/nutch-site.xml' we define some properties to
+over-ride the values in 'conf/nutch-default.xml'.
+
+--------------------------------------------------
+plugin.includes
+--------------------------------------------------
+Change the list of plugins from:
+
+  protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
+
+to
+
+  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
+
+In short, we add:
+
+  index-nutchwax
+  query-nutchwax
+  urlfilter-nutchwax
+  parse-pdf
+
+and remove:
+
+  urlfilter-regex
+  urlnormalizer-(pass|regex|basic)
+
+The only *required* changes are the additions of the NutchWAX index
+and query plugins.  The rest are optional, but recommended.
+
+The "parse-pdf" plugin is added simply because we have lots of PDFs in
+our archives and we want to index them.  We sometimes remove the
+"parse-js" plugin if we don't care to index JavaScript files.
+
+We also remove the default Nutch URL filtering and normalizing plugins
+because we do not need the URLs normalized nor filtered.  We trust
+that the tool that produced the ARC/WARC file will have normalized the
+URLs contained therein according to its own rules so there's no need
+to normalize here.  Also, we don't filter by URL since we want to
+index as much of the ARC/WARC file as we have parsers for.
+
+We do, however, add the NutchWAX URL filter.  If de-duplication is
+being performed upon import, this plugin is required.  It performs URL
+filtering of the list of ARC records to exclude based on
+URL+digest+date.
+
+--------------------------------------------------
+indexingfilter.order
+--------------------------------------------------
+
+Add this property with a value of
+
+    org.apache.nutch.indexer.basic.BasicIndexingFilter
+    org.archive.nutchwax.index.ConfigurableIndexingFilter
+
+so that the NutchWAX indexing filter is run after the Nutch basic
+indexing filter.
+
+A full explanation is given in "README-dedup.txt".
+
+--------------------------------------------------
+mime.type.magic
+--------------------------------------------------
+We disable mimetype detection in Nutch for two reasons:
+
+1. The ARC/WARC file specifies the Content-Type of the document.  We
+   trust that the tool that created the ARC/WARC file got it right.
+
+2. The implementation in Nutch can use a lot of memory as the *entire*
+   document is read into memory as a byte[], then converted to a
+   String, then checked against the MIME database.  This can lead to
+   out of memory errors for large files, such as music and video.
+
+To disable, simply set the property value to false.
+
+  <property>
+    <name>mime.type.magic</name>
+    <value>false</value>
+  </property>
+
+--------------------------------------------------
+nutchwax.filter.index
+--------------------------------------------------
+Configure the 'index-nutchwax' plugin.  Specify how the metadata
+fields added by the Importer are mapped to the Lucene documents during
+indexing.
+
+The specifications here are of the form:
+
+  src-key:lowercase:store:tokenize:exclusive:dest-key
+
+where the only required part is the "src-key"; the rest will assume
+the following defaults:
+
+  lowercase = true
+  store     = true
+  tokenize  = false
+  exclusive = true
+  dest-key  = src-key
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.index</name>
+  <value>
+    url:false:true:true
+    url:false:true:false:true:exacturl
+    orig:false
+    digest:false
+    filename:false
+    fileoffset:false
+    collection
+    date
+    type
+    length
+  </value>
+</property>
+
+The "url", "orig" and "digest" values are required; the rest are
+optional, but strongly recommended.
+
+--------------------------------------------------
+nutchwax.filter.query
+--------------------------------------------------
+Configure the 'query-nutchwax' plugin.  Specify which fields to make
+searchable via "field:[term|phrase]" query syntax, and whether they
+are "raw" fields or not.
+
+The specification format is one of:
+
+  field:<name>:<boost>
+  raw:<name>:<lowercase>:<boost>
+  group:<name>:<lowercase>:<delimiter>:<boost>
+
+Default values are
+
+  lowercase = true
+  delimiter = ","
+  boost     = 1.0f
+
+There is no "lowercase" property for "field" specification because the
+Nutch FieldQueryFilter doesn't expose the option, unlike the
+RawFieldQueryFilter.
+
+The "group" fields are raw fields that can accept multiple values,
+separated by a delimiter.  Multiple values appearing in a query are
+automagically translated into required OR-groups, such as
+
+  collection:"193,221,36" => +(collection:193 collection:221 collection:36)
+
+NOTE: We do *not* use this filter for handling "date" queries; there
+is a specific filter for that: DateQueryFilter
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.query</name>
+  <value>
+    raw:digest:false
+    raw:filename:false
+    raw:fileoffset:false
+    raw:exacturl:false
+    group:collection
+    group:type
+    field:anchor
+    field:content
+    field:host
+    field:title
+  </value>
+</property>
+
+
+--------------------------------------------------
+nutchwax.urlfilter.wayback.exclusions
+--------------------------------------------------
+File containing the exclusion list for importing.
+
+Normally, this is specified on the command line when the NutchWAX
+Importer is invoked.  It can be specified here if preferred.
+
+--------------------------------------------------
+nutchwax.urlfilter.wayback.canonicalizer
+--------------------------------------------------
+
+For CDX-based de-duplication, the same URL canonicalization algorithm
+must be used here as was used to generate the CDX files.
+
+The default canonicalizer in Wayback's '(w)arc-indexer' utility
+is 
+
+  org.archive.wayback.util.url.AggressiveUrlCanonicalizer
+
+which is the value provided in "nutch-site.xml".
+
+If the '(w)arc-indexer' is executed with the "-i" (identity)
+command-line option, then the matching canonicalizer
+
+  org.archive.wayback.util.url.IdentityUrlCanonicalizer
+
+must be specified here.
+
+--------------------------------------------------
+nutchwax.filter.http.status
+--------------------------------------------------
+This property configures a filter with a list of ranges
+of HTTP status codes to allow.
+
+Typically, most NutchWAX implementors do not wish to import and index
+404, 500, 302 and other non-success pages.  This is an inclusion
+filter, meaning that only ARC records with an HTTP status code
+matching any of the values will be imported.
+
+There is a special "unknown" value which can be used to include ARC
+records that don't have an HTTP status code (for whatever reason).
+
+The default setting provided in nutch-site.xml is to allow any 2XX
+success code:
+
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200-299
+    </value>
+  </property>
+
+But some other examples are:
+
+  Allow any 2XX success code *and* redirects, use:
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200-299
+      300-399
+   </value>
+  </property>
+
+  Be really strict about only certain codes, use:
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      200
+      301
+      302
+      304
+   </value>
+  </property>
+
+  Mix of ranges and specific codes, including the "unknown"
+  <property>
+    <name>nutchwax.filter.http.status</name>
+    <value>
+      Unknown
+      200
+      300-399
+   </value>
+  </property>
+
+--------------------------------------------------
+nutchwax.import.content.limit
+--------------------------------------------------
+Similar to Nutch's
+
+  file.content.limit
+  http.content.limit
+  ftp.content.limit
+
+properties, this specifies a limit on the size of a document imported
+via NutchWAX.
+
+We recommend setting this to a size compatible with the memory
+capacity of the computers performing the import.  Something in the
+1-4MB range is typical.
+
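The "nutchwax.filter.index" spec format described in BUILD-NOTES.txt
above (src-key:lowercase:store:tokenize:exclusive:dest-key, with only
"src-key" required) can be sketched roughly as follows.  This is a
minimal illustration, not the actual ConfigurableIndexingFilter code:
the class name IndexSpec, its methods, and the choice to treat an
empty field as "use the default" are all assumptions made for the
example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; NOT the NutchWAX ConfigurableIndexingFilter source.
final class IndexSpec {
    final String srcKey;
    final boolean lowercase;
    final boolean store;
    final boolean tokenize;
    final boolean exclusive;
    final String destKey;

    private IndexSpec(String srcKey, boolean lowercase, boolean store,
                      boolean tokenize, boolean exclusive, String destKey) {
        this.srcKey = srcKey;
        this.lowercase = lowercase;
        this.store = store;
        this.tokenize = tokenize;
        this.exclusive = exclusive;
        this.destKey = destKey;
    }

    // Parse one spec such as "url:false:true:false:true:exacturl".
    static IndexSpec parse(String spec) {
        String[] parts = spec.trim().split(":");
        if (parts.length == 0 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("src-key is required: '" + spec + "'");
        }
        String src = parts[0];
        // Defaults as documented: lowercase=true, store=true,
        // tokenize=false, exclusive=true, dest-key=src-key.
        boolean lowercase = flag(parts, 1, true);
        boolean store     = flag(parts, 2, true);
        boolean tokenize  = flag(parts, 3, false);
        boolean exclusive = flag(parts, 4, true);
        String dest = (parts.length > 5 && !parts[5].isEmpty()) ? parts[5] : src;
        return new IndexSpec(src, lowercase, store, tokenize, exclusive, dest);
    }

    // Assumption: a missing or empty field falls back to its default.
    // Boolean.parseBoolean yields true only for "true" (ignoring case).
    private static boolean flag(String[] parts, int index, boolean dflt) {
        return (parts.length > index && !parts[index].isEmpty())
            ? Boolean.parseBoolean(parts[index])
            : dflt;
    }

    // Parse a whitespace-separated property value into a list of specs.
    static List<IndexSpec> parseAll(String value) {
        List<IndexSpec> specs = new ArrayList<>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                specs.add(parse(token));
            }
        }
        return specs;
    }
}
```

Under these assumptions, a bare "collection" entry yields a stored,
lowercased, untokenized, exclusive field named "collection", while
"url:false:true:false:true:exacturl" stores the original-case URL
under the separate destination key "exacturl".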

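The inclusion semantics of "nutchwax.filter.http.status" described in
BUILD-NOTES.txt above (single codes, ranges, and the special
"unknown" value) might look something like the sketch below.  It is
not the NutchWAX implementation: the class name
HttpStatusFilterSketch is hypothetical, ranges are assumed to be
inclusive at both ends, "unknown" is assumed to be matched without
regard to case, and a record with no status code is represented here
as null.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only; NOT the NutchWAX HTTP status filter source.
final class HttpStatusFilterSketch {
    private final List<int[]> ranges = new ArrayList<>();
    private boolean allowUnknown = false;

    // 'value' is the whitespace-separated property value,
    // e.g. "Unknown 200 300-399".
    HttpStatusFilterSketch(String value) {
        for (String token : value.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            if (token.equalsIgnoreCase("unknown")) {
                allowUnknown = true;  // assumption: case-insensitive match
                continue;
            }
            int dash = token.indexOf('-');
            int lo = Integer.parseInt(dash < 0 ? token : token.substring(0, dash));
            int hi = (dash < 0) ? lo : Integer.parseInt(token.substring(dash + 1));
            ranges.add(new int[] { lo, hi });
        }
    }

    // Inclusion filter: a record is imported only if its status matches
    // one of the configured values.  A null status means "no status code".
    boolean allows(Integer status) {
        if (status == null) {
            return allowUnknown;
        }
        for (int[] range : ranges) {
            if (status >= range[0] && status <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```

With the value "200-299", for instance, this sketch admits a 204
record but rejects a 302 redirect and any record lacking a status
code.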
Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -31,7 +31,7 @@
 in the full-text search index.
 
 Nutch's 'invertlinks' step inverts links and stores them in the
-'linkdb' directory.  We use the inlinks to boost the Lucene score of
+'linkdb' directory.  We use these inlinks to boost the Lucene score of
 documents in proportion to the number of inlinks.
 
 

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -5,9 +5,8 @@
 
 Table of Contents
  o Prerequisites
-   - Nutch(WAX) installation
+   - NutchWAX installation
    - ARC/WARC files
- o Configuration & Patching
  o Create a manifest
  o Import, Invert and Index
  o Search
@@ -27,7 +26,7 @@
 
     This HOWTO assumes it is installed in
 
-      /opt/nutch-1.0-dev
+      /opt/nutchwax-0.12.3
 
  2. ARC/WARC files.
 
@@ -40,348 +39,6 @@
 
 
 ======================================================================
-Patching
-======================================================================
-
-The vanilla NutchWAX as built according to the INSTALL.txt guide is
-not quite ready to be used out-of-the-box.
-
-Before you can use NutchWAX, you must first patch a bug that exists in
-the current Nutch SVN head.
-
-The file
-
-  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
-
-contains two errors: one where a mimetype is referenced before it is
-defined; and a second where a definition has an illegal character.
-
-These errors cause Nutch to not recognize certain mimetypes and
-therefore will ignore documents matching those mimetypes.
-
-There are two fixes:
-
- 1. Move
-
-	<mime-type type="application/xml">
-		<alias type="text/xml" />
-		<glob pattern="*.xml" />
-	</mime-type>
-
-    definition higher up in the file, before the reference to it.
-
- 2. Remove
-
-	<mime-type type="application/x-ms-dos-executable">
-		<alias type="application/x-dosexec;exe" />
-	</mime-type>
-
-    as the ';' character is illegal according to the comments in the
-    Nutch code.
-
-You can either apply these patches yourself, or copy an already-patched
-copy from:
-
-  /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml
-
-to 
-
-  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
-
-
-======================================================================
-Configuring
-======================================================================
-
-Since we assume that you are already familiar with Nutch, then you
-should already be familiar with configuring it.  The configuration
-is mainly defined in
-
-  /opt/nutch-1.0-dev/conf/nutch-default.xml
-
-NutchWAX requires the modification of two existing properties and the
-addition of two new ones.
-
-All of the modifications described below can be found in:
-
-  /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml
-
-You can either apply the configuration changes yourself, or copy that
-file to
-
-  /opt/nutch-1.0-dev/conf/nutch-site.xml
-
---------------------------------------------------
-plugin.includes
---------------------------------------------------
-Change the list of plugins from:
-
-  protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
-
-to
-
-  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
-
-In short, we add:
-
-  index-nutchwax
-  query-nutchwax
-  urlfilter-nutchwax
-  parse-pdf
-
-and remove:
-
-  urlfilter-regex
-  urlnormalizer-(pass|regex|basic)
-
-The only *required* changes are the additions of the NutchWAX index
-and query plugins.  The rest are optional, but recommended.
-
-The "parse-pdf" plugin is added simply because we have lots of PDFs in
-our archives and we want to index them.  We sometimes remove the
-"parse-js" plugin if we don't care to index JavaScript files.
-
-We also remove the default Nutch URL filtering and normalizing plugins
-because we do not need the URLs normalized nor filtered.  We trust
-that the tool that produced the ARC/WARC file will have normalized the
-URLs contained therein according to its own rules so there's no need
-to normalize here.  Also, we don't filter by URL since we want to
-index as much of the ARC/WARC file as we have parsers for.
-
-We do, however, add the NutchWAX URL filter.  If de-duplication is
-being performed upon import, this plugin is required.  It performs URL
-filtering of the list of ARC records to exclude based on
-URL+digest+date.
-
---------------------------------------------------
-indexingfilter.order
---------------------------------------------------
-
-Add this property with a value of
-
-    org.apache.nutch.indexer.basic.BasicIndexingFilter
-    org.archive.nutchwax.index.ConfigurableIndexingFilter
-
-So that the NutchWAX indexing filter is run after the Nutch basic
-indexing filter.
-
-A full explanation is given in "README-dedup.txt".
-
---------------------------------------------------
-mime.type.magic
---------------------------------------------------
-We disable mimetype detection in Nutch for two reasons:
-
-1. The ARC/WARC file specifies the Content-Type of the document.  We
-   trust that the tool that created the ARC/WARC file got it right.
-
-2. The implementation in Nutch can use a lot of memory as the *entire*
-   document is read into memory as a byte[], then converted to a
-   String, then checked against the MIME database.  This can lead to
-   out of memory errors for large files, such as music and video.
-
-To disable, simply set the property value to false.
-
-  <property>
-    <name>mime.type.magic</name>
-    <value>false</value>
-  </property>
-
---------------------------------------------------
-nutchwax.filter.index
---------------------------------------------------
-Configure the 'index-nutchwax' plugin.  Specify how the metadata
-fields added by the Importer are mapped to the Lucene documents during
-indexing.
-
-The specifications here are of the form:
-
-  src-key:lowercase:store:tokenize:exclusive:dest-key
-
-where the only required part is the "src-key", the rest will assume
-the following defaults:
-
-  lowercase = true
-  store     = true
-  tokenize  = false
-  exclusive = true
-  dest-key  = src-key
-
-We recommend:
-
-<property>
-  <name>nutchwax.filter.index</name>
-  <value>
-    url:false:true:true
-    url:flase:true:false:true:exacturl
-    orig:false
-    digest:false
-    filename:false
-    fileoffset:false
-    collection
-    date
-    type
-    length
-  </value>
-</property>
-
-The "url", "orig" and "digest" values are required, the rest are
-optional, but strongly recommended.
-
---------------------------------------------------
-nutchwax.filter.query
---------------------------------------------------
-Configure the 'query-nutchwax' plugin.  Specify which fields to make
-searchable via "field:[term|phrase]" query syntax, and whether they
-are "raw" fields or not.
-
-The specification format is one of:
-
-  field:<name>:<boost>
-  raw:<name>:<lowercase>:<boost>
-  group:<name>:<lowercase>:<delimiter>:<boost>
-
-Default values are
-
-  lowercase = true
-  delimiter = ","
-  boost     = 1.0f
-
-There is no "lowercase" property for "field" specification because the
-Nutch FieldQueryFilter doesn't expose the option, unlike the
-RawFieldQueryFilter.
-
-The "group" fields are raw fields that can accept multiple values,
-separated by a delimiter.  Multiple values appearing in a query are
-automagically translated into required OR-groups, such as
-
-  collection:"193,221,36" => +(collection:193 collection:221 collection:36)
-
-NOTE: We do *not* use this filter for handling "date" queries, there
-is a specific filter for that: DateQueryFilter
-
-We recommend:
-
-<property>
-  <name>nutchwax.filter.query</name>
-  <value>
-    raw:digest:false
-    raw:filename:false
-    raw:fileoffset:false
-    raw:exacturl:false
-    group:collection
-    group:type
-    field:anchor
-    field:content
-    field:host
-    field:title
-  </value>
-</property>
-
-
---------------------------------------------------
-nutchwax.urlfilter.wayback.exclusions
---------------------------------------------------
-File containing the exclusion list for importing.
-
-Normally, this is specified on the command line with the NutchWAX
-Importer is invoked.  It can be specified here if preferred.
-
---------------------------------------------------
-nutchwax.urlfilter.wayback.canonicalizer
---------------------------------------------------
-
-For CDX-based de-duplication, the same URL canonicalization algorithm
-must be used here as was used to generate the CDX files.
-
-The default canonicalizer in Wayback's '(w)arc-indexer' utility
-is 
-
-  org.archive.wayback.util.url.AggressiveUrlCanonicalizer
-
-which is the value provided in "nutch-site.xml".
-
-If the '(w)arc-indexer' is executed with the "-i" (identity)
-command-line option, then the matching canonicalizer
-
-  org.archive.wayback.util.url.IdentityUrlCanonicalizer
-
-must be specified here.
-
---------------------------------------------------
-nutchwax.filter.http.status
---------------------------------------------------
-This property configures a filter with a list of ranges
-of HTTP status codes to allow.
-
-Typically, most NutchWAX implementors do not wish to import and index
-404, 500, 302 and other non-success pages.  This is an inclusion
-filter, meaning that only ARC records with an HTTP status code
-matching any of the values will be imported.
-
-There is a special "unknown" value which can be used to include ARC
-records that don't have an HTTP status code (for whatever reason).
-
-The default setting provided in nutch-site.xml is to allow any 2XX
-success code:
-
-  <property>
-    <name>nutchwax.filter.http.status</name>
-    <value>
-      200-299
-    </value>
-  </property>
-
-But some other examples are:
-
-  Allow any 2XX success code *and* redirects, use:
-  <property>
-    <name>nutchwax.filter.http.status</name>
-    <value>
-      200-299
-      300-399
-   </value>
-  </property>
-
-  Be really strict about only certain codes, use:
-  <property>
-    <name>nutchwax.filter.http.status</name>
-    <value>
-      200
-      301
-      302
-      304
-   </value>
-  </property>
-
-  Mix of ranges and specific codes, including the "unknown"
-  <property>
-    <name>nutchwax.filter.http.status</name>
-    <value>
-      Unknown
-      200
-      300-399
-   </value>
-  </property>
-
---------------------------------------------------
-nutchwax.import.content.limit
---------------------------------------------------
-Similar to Nutch's
-
-  file.content.limit
-  http.content.limit
-  ftp.content.limit
-
-properties, this specifies a limit on the size of a document imported
-via NutchWAX.
-
-We recommend setting this to a size compatible with the memory
-capacity of the computers performing the import.  Something in the
-1-4MB range is typical.
-
-
-======================================================================
 Create a manifest
 ======================================================================
 
@@ -411,10 +68,10 @@
 
   $ mkdir crawl
   $ cd crawl
-  $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest
-  $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb  -dir segments
-  $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
+  $ /opt/nutchwax-0.12.3/bin/nutchwax import ../manifest
+  $ /opt/nutchwax-0.12.3/bin/nutch updatedb crawldb -dir segments
+  $ /opt/nutchwax-0.12.3/bin/nutch invertlinks linkdb  -dir segments
+  $ /opt/nutchwax-0.12.3/bin/nutch index indexes crawldb linkdb segments/*
   $ ls -F1
   crawldb/
   indexes/
@@ -439,7 +96,7 @@
   $ cd ../
   $ ls -F1
   crawl/
-  $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer
+  $ /opt/nutchwax-0.12.3/bin/nutch org.apache.nutch.searcher.NutchBean computer
 
 This calls the NutchBean to execute a simple keyword search for
 "computer".  Use whatever query term you think appears in the
@@ -450,17 +107,9 @@
 Web Deployment
 ======================================================================
 
-As users of Nutch are aware, the web application (nutch-1.0-dev.war)
-bundled with Nutch contains duplicate copies of the configuration
-files.
+The Nutch(WAX) web application is bundled with NutchWAX as
 
-So, all patches and configuration changes that we made to the
-files in
+  /opt/nutchwax-0.12.3/nutch-1.0-dev.war
 
-  /opt/nutch-1.0-dev/conf
-
-will have to be duplicated in the Nutch webapp when it is deployed.
-
-This is not due to NutchWAX, this is a "feature" of regular Nutch.  I
-just thought it would be good to remind everyone since we did make
-configuration changes for NutchWAX.
+Simply deploy that web application in the same fashion as with
+Nutch.

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -3,10 +3,22 @@
 2008-12-18
 Aaron Binns
 
+Table of Contents
+ o Introduction
+ o Build from source
+    - SVN: Nutch 1.0-dev
+    - SVN: NutchWAX
+    - Build and Install
+ o Install binary package
+
+
+======================================================================
+Introduction
+======================================================================
+
 This installation guide assumes the reader is already familiar with
 building, packaging and deploying Nutch 1.0-dev.
 
-
 The NutchWAX 0.12 source and build system are designed to integrate
 into the existing Nutch 1.0-dev source and build.
 
@@ -20,12 +32,12 @@
 proper, then builds NutchWAX components and integrates them into the
 Nutch build directory.
 
-We recommend that you execute all build commands from the NutchWAX
-directory.  This way, NutchWAX will ensure that any and all
+In order to build NutchWAX, execute all build commands from the
+NutchWAX directory.  This way, NutchWAX will ensure that any and all
 dependencies in Nutch will be properly built and kept up-to-date.
 Towards this goal, we have duplicated the most common build targets
-from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file,
-such as:
+from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file, such
+as:
 
   o compile
   o jar
@@ -39,8 +51,15 @@
 sub-directory as normal.
 
 
-Nutch-1.0-dev
--------------
+======================================================================
+Build from Source
+======================================================================
+
+To build from source, you must check-out the Nutch and NutchWAX sources
+from their respective 'subversion' source control servers.
+
+SVN: nutch-1.0-dev
+------------------
 As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
 Nutch doesn't have a 1.0 release package yet, so we have to use the
 Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12.3 is
@@ -53,9 +72,12 @@
  $ svn checkout -r 701524 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
  $ cd nutch
 
+Please be sure to check-out this specific version of the Nutch source.
+If you just grab the head of the trunk, there may be newer and
+incompatible changes to Nutch.
 
-NutchWAX
---------
+SVN: NutchWAX
+-------------
 Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into
 Nutch's "contrib" directory.
 
@@ -65,7 +87,6 @@
 This will create a sub-directory named "archive" containing the
 NutchWAX sources.
 
-
 Build and install
 -----------------
 Assuming you already have the required tool-set for building Nutch,
@@ -91,3 +112,18 @@
 
   $ cd /opt
   $ tar xvfz nutch-1.0-dev.tar.gz
+  $ mv nutch-1.0-dev nutchwax-0.12.3
+
+
+======================================================================
+Install binary package
+======================================================================
+
+Alternatively, grab a "binary" release package from the Internet
+Archive's NutchWAX home page.
+
+Install it simply by untarring it, for example:
+
+  $ cd /opt
+  $ tar xvfz nutchwax-0.12.3.tar.gz
+

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -3,6 +3,16 @@
 2008-12-18
 Aaron Binns
 
+Table of Contents
+ o Introduction
+ o Build and Install
+ o Tutorial
+
+
+======================================================================
+Introduction
+======================================================================
+
 Welcome to NutchWAX 0.12.3!
 
 NutchWAX is a set of add-ons to Nutch in order to index and search
@@ -17,7 +27,6 @@
 Since NutchWAX is a set of add-ons to Nutch, you should already be
 familiar with Nutch before using NutchWAX.
 
-======================================================================
 
 The goal of NutchWAX is to enable full-text indexing and searching of
 documents stored in web archive file formats (ARC and WARC).
@@ -26,13 +35,13 @@
 to Nutch to read documents directly from ARC/WARC files.  We call this
 process "importing" archive files.
 
-Importing produces a Nutch segment, similar to Nutch crawling the
-documents itself.  In this scenario, document importing replaces the
+Importing produces a Nutch segment, the same as when Nutch is used to
+crawl documents itself.  In essence, document importing replaces the
 conventional "generate/fetch/update" cycle of Nutch.
 
 Once the archival documents have been imported into a segment, the
-regular Nutch commands to update the 'crawldb', invert the links and
-index the document contents can proceed as normal.
+regular Nutch commands to index the document contents can proceed as
+normal.
 
 ======================================================================
 
@@ -71,73 +80,25 @@
 
  conf/nutch-site.xml
 
-   Sample configuration properties file showing suggested settings for
-   Nutch and NutchWAX.
+   Additional configuration properties for NutchWAX, including
+   over-rides for properties defined in 'nutch-default.xml'.
 
 There is no separate 'lib/nutchwax.jar' file for NutchWAX.  NutchWAX
 is distributed in source code form and is intended to be built in
 conjunction with Nutch.
 
-See "INSTALL.txt" for details on building NutchWAX and Nutch.
 
-See "HOWTO.txt" for a quick tutorial on importing, indexing and
-searching a set of documents in a web archive file.
-
 ======================================================================
-
-This 0.12.x release of NutchWAX is radically different in source-code
-form compared to the previous release, 0.10.
-
-One of the design goals of 0.12.x was to reduce or even eliminate the
-"copy/paste/edit" approach of 0.10.  The 0.10 (and prior) NutchWAX
-releases had to copy/paste/edit large chunks of Nutch source code in
-order to add the NutchWAX features.
-
-Also, the NutchWAX 0.12.x sources and build are designed to one day be
-added into mainline Nutch as a proper "contrib" package; then
-eventually be fully integrated into the core Nutch source code.
-
+Build and Install
 ======================================================================
 
-Most of the NutchWAX source code is relatively straightfoward to those
-already familiar with the inner workings of Nutch.  Still, special
-attention on one class is worth while:
+See "INSTALL.txt" for detailed instructions to build NutchWAX from
+source or install a binary package.
 
-  src/java/org/archive/nutchwax/Importer.java
 
-This is where ARC/WARC files are read and their documents are imported
-into a Nutch segment.
-
-It is inspired by:
-
-  nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
-
-on the Nutch SVN head.
-
-Our implementation differs in a few important ways:
-
-  o Rather than taking a directory with ARC files as input, we take a
-    manifest file with URLs to ARC files.  This way, the manifest is
-    split up among the distributed Hadoop jobs and the ARC files are
-    processed in whole by each worker.
-
-    In the Nutch SVN, the ArcSegmentCreator.java expects the input
-    directory to contain the ARC files and (AFAICT) splits them up and
-    distributes them across the Hadoop workers.
-
-  o We use the standard Internet Archive ARCReader and WARCReader
-    classes.  Thus, NutchWAX can read both ARC and WARC files, whereas
-    the ArcSegmentCreator class can only read ARC files.
-
-  o We add metadata fields to the document, which are then available
-    to the "index-nutchwax" plugin at indexing-time.
-
-    Importer.importRecord()
-      ...
-      contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype()          );
-      contentMetadata.set( NutchWax.ARCNAME_KEY,      meta.getArcFile().getName() );
-      contentMetadata.set( NutchWax.COLLECTION_KEY,   collectionName              );
-      contentMetadata.set( NutchWax.DATE_KEY,         meta.getDate()              );
-      ...
-
 ======================================================================
+Tutorial
+======================================================================
+
+See "HOWTO.txt" for a quick tutorial on importing, indexing and
+searching a set of documents in a web archive file.

Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt	2008-12-16 19:53:25 UTC (rev 2677)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt	2008-12-18 18:37:40 UTC (rev 2678)
@@ -21,8 +21,45 @@
   o Enhanced OpenSearchServlet
   o Improved XSLT sample for OpenSearch
   o System init.d script for searcher slaves
-  o Enhanced searcher slave aware of NutchWAX extensions
+  o Enhanced searcher slave which supports NutchWAX extensions
 
+
+One of the major changes in 0.12.3 is not a feature, enhancement or
+bug-fix, but the way the NutchWAX source is "integrated" into the
+Nutch source.
+
+Yes, the NutchWAX source is still kept in the contrib/archive
+sub-directory, but when you invoke a build command from the
+NutchWAX directory, such as
+
+  $ cd nutch/contrib/archive
+  $ ant tar
+
+many files from the NutchWAX source tree are copied directly into the
+Nutch source tree before the build process begins.
+
+The reason for this is to make NutchWAX easier to use.
+
+In previous versions of NutchWAX, once the 'ant' build command had
+finished, the operator had to manually patch configuration files in
+the Nutch directory.  Upon a subsequent build, the files would be
+over-written by Nutch's own copies and would have to be patched again.
+
+It was a major hassle and complication.
+
+Another impetus for copying files into the Nutch source was to patch
+bugs and make enhancements in the Nutch Java code, which couldn't be
+done effectively while keeping the sources separate.  When an 'ant'
+build command is run, a few Java files are copied from the NutchWAX
+source tree into the Nutch source tree.
+
+In release 0.12.3, the NutchWAX build file, 'build.xml', handles all
+of this.  Simply execute your build commands from 'contrib/archive' as
+instructed in the HOWTO and no longer worry about patching
+configuration files.  If you wish to alter the NutchWAX configuration
+file, make those changes in the NutchWAX source tree.
+
+
 ======================================================================
 Issues
 ======================================================================

