[Archive-access-cvs] SF.net SVN: archive-access: [2266] trunk/archive-access/projects/nat/archive

Revision: 2266
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2266&view=rev
Author:   binzino
Date:     2008-05-20 17:02:01 -0700 (Tue, 20 May 2008)

Log Message:
-----------
Total re-write of install, readme and howto documents.

Modified Paths:
--------------
    trunk/archive-access/projects/nat/archive/INSTALL.txt
    trunk/archive-access/projects/nat/archive/README.txt

Added Paths:
-----------
    trunk/archive-access/projects/nat/archive/HOWTO.txt

Added: trunk/archive-access/projects/nat/archive/HOWTO.txt
===================================================================

--- trunk/archive-access/projects/nat/archive/HOWTO.txt	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/HOWTO.txt	2008-05-21 00:02:01 UTC (rev 2266)
@@ -0,0 +1,325 @@
+
+HOWTO.txt
+2008-05-20
+Aaron Binns
+
+Table of Contents
+ o Prerequisites
+   - Nutch(WAX) installation
+   - ARC/WARC files
+ o Configuration & Patching
+ o Create a manifest
+ o Import, Invert and Index
+ o Search
+ o Web deployment
+   - Don't forget to config & patch again
+
+======================================================================
+Prerequisites
+======================================================================
+
+In order to use Nutch(WAX) you need the following prerequisites:
+
+ 1. NutchWAX installed.
+
+    See INSTALL.txt for instructions on building and installing
+    NutchWAX.
+
+    This HOWTO assumes it is installed in
+
+      /opt/nutch-1.0-dev
+
+ 2. ARC/WARC files.
+
+    The whole purpose of NutchWAX is to index ARC/WARC files.  These
+    files are not produced by Nutch or NutchWAX; they are produced by
+    other tools, such as Heritrix.
+
+    If you don't have any ARC/WARC files, you have no need for
+    NutchWAX.
+
+
+======================================================================
+Patching
+======================================================================
+
+The vanilla NutchWAX as built according to the INSTALL.txt guide is
+not quite ready to be used out-of-the-box.
+
+Before you can use NutchWAX, you must first patch a bug that exists in
+the current Nutch SVN head.
+
+The file
+
+  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
+
+contains two errors: one where a mimetype is referenced before it is
+defined; and a second where a definition has an illegal character.
+
+These errors prevent Nutch from recognizing certain mimetypes, so it
+ignores documents matching those mimetypes.
+
+There are two fixes:
+
+ 1. Move
+
+	<mime-type type="application/xml">
+		<alias type="text/xml" />
+		<glob pattern="*.xml" />
+	</mime-type>
+
+    definition higher up in the file, before the reference to it.
+
+ 2. Remove
+
+	<mime-type type="application/x-ms-dos-executable">
+		<alias type="application/x-dosexec;exe" />
+	</mime-type>
+
+    as the ';' character is illegal according to the comments in the
+    Nutch code.
+
+You can either apply these patches yourself, or copy an already-patched
+copy from:
+
+  /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml
+
+to 
+
+  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
+
+
+======================================================================
+Configuring
+======================================================================
+
+Since we assume that you are already familiar with Nutch, you
+should already be familiar with configuring it.  The configuration
+is mainly defined in
+
+  /opt/nutch-1.0-dev/conf/nutch-default.xml
+
+NutchWAX requires the modification of two existing properties and the
+addition of two new ones.
+
+All of the modifications described below can be found in:
+
+  /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml
+
+You can either apply the configuration changes yourself, or copy that
+file to
+
+  /opt/nutch-1.0-dev/conf/nutch-site.xml
+
+--------------------------------------------------
+plugin.includes
+--------------------------------------------------
+Change the list of plugins from:
+
+  protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
+
+to
+
+  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic
+
+In short, we add:
+
+  index-nutchwax
+  query-nutchwax
+  parse-pdf
+
+and remove:
+
+  urlfilter-regex
+  urlnormalizer-(pass|regex|basic)
+
+The only *required* changes are the additions of the NutchWAX index
+and query plugins.  The rest are optional, but recommended.
+
+The addition of the "parse-pdf" plugin is simply because we have lots
+of PDFs in our archives and we want to index them.  We sometimes
+remove the "parse-js" plugin if we don't care to index JavaScript
+files.
+
+We also remove the URL filtering and normalizing plugins because we do
+not need the URLs normalized or filtered.  We trust that the tool
+that produced the ARC/WARC file will have normalized the URLs
+contained therein according to its own rules so there's no need to
+normalize here.  Also, we don't filter by URL since we want to index
+as much of the ARC/WARC file as we have parsers for.
+
+--------------------------------------------------
+mime.type.magic
+--------------------------------------------------
+We disable mimetype detection in Nutch for two reasons:
+
+1. The ARC/WARC file specifies the Content-Type of the document.  We
+   trust that the tool that created the ARC/WARC file got it right.
+
+2. The implementation in Nutch can use a lot of memory as the *entire*
+   document is read into memory as a byte[], then converted to a
+   String, then checked against the MIME database.  This can lead to
+   out of memory errors for large files, such as music and video.
+
+To disable, simply set the property value to false.
+
+  <property>
+    <name>mime.type.magic</name>
+    <value>false</value>
+  </property>
+
+--------------------------------------------------
+nutchwax.filter.index
+--------------------------------------------------
+Configure the 'index-nutchwax' plugin.  Specify how the metadata
+fields added by ArcsToSegment are mapped to the Lucene documents
+during indexing.
+
+The specifications here are of the form:
+
+  src-key:lowercase:store:tokenize:dest-key
+
+where the only required part is the "src-key"; the rest will assume
+the following defaults:
+
+  lowercase = true
+  store     = true
+  tokenize  = false
+  dest-key  = src-key
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.index</name>
+  <value>
+    arcname:false
+    collection
+    date
+    type
+  </value>
+</property>
+
+--------------------------------------------------
+nutchwax.filter.query
+--------------------------------------------------
+Configure the 'query-nutchwax' plugin.  Specify which fields to make
+searchable via "[field]:[term|phrase]" query syntax, and whether they
+are "raw" fields or not.
+
+The specification format is 
+
+  raw:name:lowercase:boost 
+or
+  field:name:boost
+
+Default values are
+
+  lowercase = true
+  boost     = 1.0f
+
+There is no "lowercase" property for "field" specification because the
+Nutch FieldQueryFilter doesn't expose the option, unlike the
+RawFieldQueryFilter.
+
+NOTE: We do *not* use this filter for handling "date" queries; there is a
+specific filter for that: DateQueryFilter
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.query</name>
+  <value>
+    raw:arcname:false
+    raw:collection
+    raw:type
+    field:anchor
+    field:content
+    field:host
+    field:title
+  </value>
+</property>
+
+
+======================================================================
+Create a manifest
+======================================================================
+
+The input to NutchWAX's import tool is a manifest file.  This is a
+simple text file where each line contains a URL to an ARC/WARC file
+and an optional collection name.
+
+For example:
+
+ $ cat > manifest
+ http://someserver/somepath/somearchive.arc.gz mycollection
+ ^D
+
+Creates a simple manifest file with one ARC file and a collection
+name of "mycollection".
+
+You don't have to use collections at all.  If you don't know how you
+would use it, then simply leave it out here.
+
+
+======================================================================
+Import, Invert and Index
+======================================================================
+
+The steps to import the files, invert the links and index the documents
+are rather simple:
+
+  $ mkdir crawl
+  $ cd crawl
+  $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest
+  $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
+  $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb  -dir segments
+  $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
+  $ ls -F1
+  crawldb/
+  indexes/
+  linkdb/
+  segments/
+
+To those already familiar with Nutch, these steps should be quite
+familiar.
+
+In the first step, we call NutchWAX's "import" command, which creates the
+Nutch segment containing the documents in the ARC/WARC files listed in
+the manifest.  The rest is the same as regular Nutch.
+
+
+======================================================================
+Search
+======================================================================
+The resulting indexes can be searched in exactly the same manner as in
+regular Nutch.  For example, assuming you just completed the steps
+above, now:
+
+  $ cd ../
+  $ ls -F1
+  crawl/
+  $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer
+
+This calls the NutchBean to execute a simple keyword search for
+"computer".  Use whatever query term you think appears in the
+documents you imported.
+
+
+======================================================================
+Web Deployment
+======================================================================
+
+As users of Nutch are aware, the web application (nutch-1.0-dev.war)
+bundled with Nutch contains duplicate copies of the configuration
+files.
+
+So, all patches and configuration changes that we made to the
+files in
+
+  /opt/nutch-1.0-dev/conf
+
+will have to be duplicated in the Nutch webapp when it is deployed.
+
+This is not due to NutchWAX; this is a "feature" of regular Nutch.  I
+just thought it would be good to remind everyone since we did make
+configuration changes for NutchWAX.

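The HOWTO above describes two small plain-text formats: manifest lines
(a URL plus an optional collection name) and "nutchwax.filter.index"
specifications of the form src-key:lowercase:store:tokenize:dest-key.
The following is a minimal Python sketch of how those formats could be
read, based only on the descriptions and defaults given in the HOWTO.
It is not NutchWAX code.  The names IndexSpec, parse_index_spec,
parse_index_property and parse_manifest are hypothetical, and the
handling of empty parts, boolean spellings, blank lines and whitespace
separation are assumptions.

```python
from typing import List, NamedTuple, Optional, Tuple


class IndexSpec(NamedTuple):
    """One entry of the nutchwax.filter.index property (hypothetical type)."""
    src_key: str
    lowercase: bool
    store: bool
    tokenize: bool
    dest_key: str


def _flag(parts: List[str], i: int, default: bool) -> bool:
    # Assumption: a missing or empty part falls back to the default, and
    # only the literal "true" (in any case) counts as true.
    if i < len(parts) and parts[i]:
        return parts[i].lower() == "true"
    return default


def parse_index_spec(spec: str) -> IndexSpec:
    """Parse src-key:lowercase:store:tokenize:dest-key using the HOWTO defaults."""
    parts = spec.strip().split(":")
    src_key = parts[0]
    if not src_key:
        raise ValueError("src-key is required: %r" % spec)
    # dest-key defaults to src-key when it is not given.
    dest_key = parts[4] if len(parts) > 4 and parts[4] else src_key
    return IndexSpec(
        src_key,
        _flag(parts, 1, True),   # lowercase = true
        _flag(parts, 2, True),   # store     = true
        _flag(parts, 3, False),  # tokenize  = false
        dest_key,
    )


def parse_index_property(value: str) -> List[IndexSpec]:
    """Parse a whole property value: whitespace-separated specs."""
    return [parse_index_spec(token) for token in value.split()]


def parse_manifest(text: str) -> List[Tuple[str, Optional[str]]]:
    """Parse manifest lines into (url, collection-or-None) pairs."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines are skipped
        url = fields[0]
        collection = fields[1] if len(fields) > 1 else None
        entries.append((url, collection))
    return entries


if __name__ == "__main__":
    # The recommended index configuration and the example manifest line.
    for spec in parse_index_property("arcname:false collection date type"):
        print(spec)
    print(parse_manifest(
        "http://someserver/somepath/somearchive.arc.gz mycollection\n"))
```

Under these assumptions, the recommended "arcname:false" entry differs
from the defaults only in that the value is not lowercased, while
"collection", "date" and "type" use all defaults.
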
Modified: trunk/archive-access/projects/nat/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/INSTALL.txt	2008-05-14 00:20:24 UTC (rev 2265)
+++ trunk/archive-access/projects/nat/archive/INSTALL.txt	2008-05-21 00:02:01 UTC (rev 2266)
@@ -1,236 +1,93 @@
 
 INSTALL.txt
-2008-05-06
+2008-05-20
 Aaron Binns
 
+This installation guide assumes the reader is already familiar with
+building, packaging and deploying Nutch 1.0-dev.
 
-The NutchWAX 0.12 build and installation is as an "add-on" to an
-existing Nutch 1.0-dev installation.
 
-NutchWAX 0.12 uses a simple 'ant' build script.  The script compiles
-the NutchWAX sources, using the libraries in the installed
-Nutch-1.0-dev.
+The NutchWAX 0.12 source and build system are designed to integrate
+into the existing Nutch 1.0-dev source and build.
 
-We strongly recommend having *two* Nutch-1.0-dev installation
-directories: one that you build NutchWAX against, and another into
-which NutchWAX is deployed.
+The long-term goal is for the NutchWAX components to be fully
+integrated into mainline Nutch.  As a stepping-stone toward that goal,
+we have packaged the NutchWAX source to be dropped into the Nutch
+"contrib" directory and built from there.
 
-NutchWAX is deployed by un-tar'ing the nutchwax-0.12.tar.gz file
-*into* an existing Nutch-1.0-dev installation.  Think of NutchWAX as
-an add-on.  We over-write a few Nutch config files, but the rest is
-simply added to the existing Nutch-1.0-dev installation.
+Like Nutch, NutchWAX 0.12 uses a simple 'ant' build script.  The
+NutchWAX build script calls out to the Nutch script to build Nutch
+proper, then builds NutchWAX components and integrates them into the
+Nutch build directory.
 
+We recommend that you execute all build commands from the NutchWAX
+directory.  This way, NutchWAX will ensure that any and all
+dependencies in Nutch will be properly built and kept up-to-date.
+Towards this goal, we have duplicated the most common build targets
+from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file,
+such as:
 
+  o compile
+  o jar
+  o job
+  o tar
+  o clean
+
+Again, the idea is that if you're already used to building Nutch, you
+can easily transition to building Nutch and NutchWAX together.  All of
+the build artifacts will still be placed in Nutch's 'build'
+sub-directory as normal.
+
+
 Nutch-1.0-dev
 -------------
-
-As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.  Now
+As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
 Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12 is 
+Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12 is
 built against is:
 
   650739
 
 To checkout this revision of Nutch, use:
 
- $ mkdir nutch
+ $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
  $ cd nutch
- $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk
 
-To build the nutch-1.0-dev.tar.gz package, use 'ant'
 
- $ cd trunk
- $ ant tar
+NutchWAX
+--------
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into
+Nutch's "contrib" directory.
 
-This produces
+ $ cd contrib
+ $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nat/archive
 
-  build/nutch-1.0-dev.tar.gz
+This will create a sub-directory named "archive" containing the
+NutchWAX sources.
 
-Which we then install *twice*
 
- $ mkdir -p ~/nutchwax-0.12/nutch-1.0-dev
- $ tar xfz -C ~/nutchwax-0.12/nutch-1.0-dev build/nutch-1.0-dev.tar.gz
- $ mkdir -p /opt/nutch-1.0-dev
- $ tar xfz -C /opt/nutch-1.0-dev build/nutch-1.0-dev.tar.gz
-
-The idea is that we keep /opt/nutch-1.0-dev as our pristine copy which
-we compile against, then, when we want to test NutcWAX, we deploy it
-into ~/nutchwax-0.12/nutch-1.0-dev.
-
-Why can't we just use one installation of Nutch?  Mainly to avoid
-weirdness where we are compiling NutchWAX source against the same set
-of libraries where we would be installing NutchWAX.  Consider, when we
-deploy NutchWAX, we copy the nutchwax.jar into the Nutch 'lib'
-directory.  If we use that same 'lib' directory for dependencies when
-compiling the source, 'ant'/'javac' will likely get confused when
-calculating dependencies.
-
-It's possible that you could successfully go through the
-build/test/release cycle using one Nutch-1.0-dev directory, but these
-instructions assume you will have two.
-
-
 Build and install
 -----------------
+Assuming you already have the required tool-set for building Nutch,
+building NutchWAX is a snap.
 
-  1. Install two Nutch-1.0-dev packages per the instructions above.
+Simply execute the same 'ant' build command in
 
-  2. Edit build.xml to point to the "pristine" installation of Nutch-1.0-dev
+  nutch/contrib/archive
 
-       <!-- NOTE: Point this to your Nutch 1.0-dev directory -->
-       <property name="nutch.dir" value="/opt/nutch-1.0-dev" />
+as you normally would and everything will build as normal.
 
-  3. Build NutchWAX-0.12
+For example
 
-      $ ant
+  $ cd nutch/contrib/archive
+  $ ant tar
 
-     The default build rule is "package" which will compile all the source
-     and build an intallation tarball: nutchwax-0.12.tar.gz
+This command will build all of Nutch, then the NutchWAX add-ons and
+finally will package everything up into the "nutch-1.0-dev.tar.gz"
+release package.
 
-     The "build.xml" file is pretty straightforward and just grepping
-     for the targets should be pretty obvious: compile, clean, etc.
-
-  4. Install NutchWAX into the build/test Nutch installation
-
-     $ tar xfz -C ~/nutchwax-0.12/nutch-1.0-dev nutchwax-0.12.tar.gz
-
-That's it!
-
-All we do is add our libraries (nutchwax.jar and dependencies), the
-'nutchwax' helper script, plugins for indexing and querying, and a few
-config files.
-
-Except for the config files, no files in the Nutch-1.0-dev
-installation are over-written, only added.  The "nutch-site.xml" file
-is over-written, but that file is empty in a vanilla Nutch
-installation, so there's small risk of over-writing something.
-
-
-HOWTO run and test
-------------------
-
-The 'nutchwax' helper script is installed in the Nutch-1.0-dev 'bin'
-directory next to the 'nutch' helper script.
-
-The 'nutchwax' script is used to run the NutchWAX-specific tools, use
-the regular 'nutch' script for regular Nutch activities.
-
-The 'nutchwax' script runs two tools
-
-  "import"     Import a set of .arc/.warc files from a manifest, creating
-               a Nutch segment.
-
-  "dumpindex"  Debug tool that dumps a Lucene index, such as the ones
-               created by Nutch's "index" tool.
-
-The idea is that the NutchWAX "import" tool supplants the Nutch
-generate and fetch cycle.  Rather than generating and fetching
-segments, we import the .arc/.warc files directly into a newly created
-segment.  Then, we process that segment just as we normally would with
-Nutch.
-
-For example,
-
-  $ cd nutch-test
-  $ cat > manifest
-    http://someserver/foo-bar-baz.arc mycollection
-    ^D
-  $ nutch-1.0-dev/bin/nutchwax import manifest
-
-This will import the arc file listed in the manifest into a newly
-created segment.  The segment is created by default in a directory
-hierarchy of the form:
-
-  segments/[date-timestamp]
-
-This mirrors the way segments are created in vanilla Nutch by the
-"generate" command.
-
-You can explicitly name the segment if you want, e.g.
-
-  nutchwax import manifest mysegment
-
-Once the segment is created by the importing of ARC files with
-NutchWAX, you can use Nutch to perform the rest of the steps.  For
+Then, install the "nutch-1.0-dev.tar.gz" tarball as normal.  For
 example:
 
-  $ nutch-1.0-dev/bin/nutchwax import manifest
-  $ nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
-  $ nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments
-  $ nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
-  $ nutch-1.0-dev/bin/nutch merge index indexes
-
-This is pretty much the minimal set of steps to import and index a set
-of ARC files.  The crawldb update and link inversion steps are pro
-forma and don't have anything to do with NutchWAX specifically, but
-are a part of regular Nutch processing.
-
-Now you have a Nutch "index" directory and are ready to search!
-
-Searching is done as in vanilla Nutch.  Either launch the Nutch webapp
-or use the command-line interface to NutchBean to run some test
-searches.  Nothing NutchWAX-specific here.
-
-
-Miscellaneous notes
--------------------
-
-1. Plugins
-
-There are two plugins bundled with NutchWAX: 
-
-   index-nutchwax
-   query-nutchwax
-
-See the "plugin.includes" property in nutch-site.xml to see where
-these plugins are added to the filter chain.
-
-The index-nutchwax plugin ensures that WAX-specifici metadata is
-transferred from the Nutch Content object to the Lucene Document
-object, which is placed in the Lucene index.
-
-The query-nutchwax plugin is used to process query requests against
-those same meta-data fields.  It also expands the capabilities of
-searching the basic Nutch fields as well.
-
-2. URL filters
-
-Nutch's URL filter by default filters-out many common URL oddities
-that would normally trip-up Nutch's crawler.  However, when importing
-content from ARC files, filtering out content probably doens't make
-sense.  That is, whatever content made it into the ARC file should be
-imported, no matter what the URL looks like.
-
-To change the URL filter, edit the Nutch file 'conf/regex-urlfilter.txt'.
-To pass all content through the filter, remove all filter rules except
-for the last one:
-
-  # accept anything else
-  +.
-
-3. conf/tika-mimetypes.xml
-
-NutchWAX comes with a fixed copy of tika-mimetypes.xml.  The version
-in Nutch revision 650739 has a few bugs in it which cause parsing to
-fail for many document types.  The bugs are:
-
- o Move
-
-	<mime-type type="application/xml">
-		<alias type="text/xml" />
-		<glob pattern="*.xml" />
-	</mime-type>
-
-   definition higher up in the file, before the reference to it.
-
- o Remove
-
-	<mime-type type="application/x-ms-dos-executable">
-		<alias type="application/x-dosexec;exe" />
-	</mime-type>
-
-   as the ';' character is illegal according to the comments in the
-   Nutch code.
-
-The copy of "conf/tika-mimetypes.xml" bundled with NutchWAX fixes
-these two bugs.
+  $ cd /opt
+  $ tar xvfz nutch-1.0-dev.tar.gz

Modified: trunk/archive-access/projects/nat/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/README.txt	2008-05-14 00:20:24 UTC (rev 2265)
+++ trunk/archive-access/projects/nat/archive/README.txt	2008-05-21 00:02:01 UTC (rev 2266)
@@ -1,105 +1,122 @@
 
 README.txt
-2008-05-06
+2008-05-20
 Aaron Binns
 
+Welcome to NutchWAX 0.12!
 
-This is the NutchWAX-0.12 source that John Lee handed-off to me.  It
-is a work-in-progress.
+NutchWAX is a set of add-ons to Nutch in order to index and search
+archived web data.
 
-Compared to NutchWAX-0.10 (and earlier) it is *much* simpler.  The
-main WAX-specific code is in just a few files really:
+These add-ons are developed and maintained by the Internet Archive Web
+Team in conjunction with a broad community of contributors, partners
+and end-users.
 
-src/java/org/archive/nutchwax/ArcsToSegment.java
+The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
 
-  This is the meat of the WAX logic for processing .arc files and
-  generating Nutch segments.  Once we use this to generate a set of
-  segments for the .arc files, we can use the rest of vanilla
-  Nutch-1.0-dev to invert links and index the content with Lucene.
+Since NutchWAX is a set of add-ons to Nutch, you should already be
+familiar with Nutch before using NutchWAX.
 
-  This conversion code is heavily edited from:
+======================================================================
 
-    nutch-1.0-dev/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+The goal of NutchWAX is to enable full-text indexing and searching of
+documents stored in web archive file formats (ARC and WARC).
 
-  taken from the Nutch SVN head (a.k.a the "1.0-dev" in-development).
+The way we achieve that goal is by providing add-on tools and plugins
+to Nutch to read documents directly from ARC/WARC files.  We call this
+process "importing" archive files.
 
-  Ours differs in a few important ways:
+Importing produces a Nutch segment, the same as if Nutch had actually
+crawled the documents itself.  In this scenario, document importing
+replaces the conventional "generate/fetch/update" cycle of Nutch.
 
-    o Rather than taking a directory with .arc files as input, we take
-      a manifest file with URLs to .arc files.  This way, the manifest
-      is split up among the distributed Hadoop jobs and the .arc files
-      are processed in whole by each worker.
+Once the archival documents have been imported into a segment, the
+regular Nutch commands to update the 'crawldb', invert the links and
+index the document contents can proceed as normal.
 
-      In the Nutch-1.0-dev, the ArcSegmentCreator.java expects the
-      input directory to contain the .arc files and (AFAICT) splits
-      them up and distributes them across the Hadoop workers.  This
-      seems really inefficient to me, I think our approach is much
-      better -- at least for us.
+======================================================================
 
-    o Related to the way input files are split and processed, we use
-      the standard Archive ARCReader class just like Heritrix and
-      Wayback.
+The NutchWAX add-ons consist of:
 
-      The ArcSegmentCreator.java in Nutch-1.0-dev doesn't use our
-      ARCReader because of licensing imcompatibility.  Ours is under
-      GPL and Nutch-1.0-dev forbids the use of GPL code.
-      
-      We are in the process of re-licensing or dual-licensing with
-      Apache License, but until then, our ARCReader code won't be incldued      
-      in mainline Nutch.
+ bin/nutchwax
 
-      This isn's a problem per se, but worth noting in case anyone
-      looks at the Nutch-1.0-dev code and wonders why they built their
-      own (horribly inefficient) .arc reader.
+   A shell script that is used to run the NutchWAX command-line tools,
+   such as document importing.
 
-    o We add metadata fields to the processed document for WAX-specific
-      purposes:
+   This is patterned after the 'bin/nutch' shell script.
 
-        content.getMetadata().set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() );
-        content.getMetadata().set( NutchWax.ARCNAME_KEY,      meta.getArcFile().getName() ) ;
-        content.getMetadata().set( NutchWax.COLLECTION_KEY,   collection);
-        content.getMetadata().set( NutchWax.ARCHIVE_DATE_KEY, meta.getDate() );
+ plugins/index-nutchwax
 
-      The addition of the arcname and collection key is pretty
-      obvious.  I don't know why the content-type isn't added in the
-      vanilla Nutch-1.0-dev.
-      
-      Also, we should review the use of the ARCHIVE_DATE_KEY in that
-      John Lee mentioned to me that there was possibly duplicate date
-      fields put in the index: one that is a plain old Java date, and
-      one that is a 14-digit date string for use with Wayback.
+   Indexing plugin which adds NutchWAX-specific metadata fields to the
+   indexed document.
 
-src/java/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/NutchWaxIndexingFilter.java
-src/java/plugin/index-nutchwax/plugin.xml
+ plugins/query-nutchwax
 
-  This filter is pretty straightforward.  All it does is take the
-  metadata fields that were added to the document (as described above)
-  and placed in the Lucene index so that we can make use of them at
-  search-time.
+   Query plugin which allows for querying against the metadata fields
+   added by 'index-nutchwax'.
 
-src/java/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/MultipleFieldQueryFilter.java
-src/java/plugin/query-nutchwax/plugin.xml
+There is no separate 'lib/nutchwax.jar' file for NutchWAX.  NutchWAX
+is distributed in source code form and is intended to be built in
+conjunction with Nutch.
 
-  This is a single query filter that can be used for querying single
-  fields from a single implementation.  It does *not* allow for
-  querying multiple fields as you can already do that via Nutch.
+See "INSTALL.txt" for details on building NutchWAX and Nutch.
 
-  What this filter does is allows one to more-or-less create query
-  filters in a data-driven manner rather than having to code-up a new
-  class for each field.  That is, before one would have to create a
-  CollectionQueryFilter class to filter on the "collection" field.
-  With the MultipleFieldQueryFilter class, you can specify that the
-  "collection" field is to be filterable via the plugin.xml file and
-  "nutchwax.filter.query" configuration property.
+See "HOWTO.txt" for a quick tutorial on importing, indexing and
+searching a set of documents in a web archive file.
 
-src/java/org/archive/nutchwax/NutchWax.java
+======================================================================
 
-  Just a simple enum used by the above two classes for the metadata
-  keys.
+This 0.12 release of NutchWAX is radically different in source-code
+form compared to the previous release, 0.10.
 
-src/java/org/archive/nutchwax/tools/DumpIndex.java
+One of the design goals of 0.12 was to reduce or even eliminate the
+"copy/paste/edit" approach of 0.10.  The 0.10 (and prior) NutchWAX
+releases had to copy/paste/edit large chunks of Nutch source code in
+order to add the NutchWAX features.
 
-  A simple command-line utility to dump the contents of a Lucene
-  index.  Used for debugging.
+Also, the NutchWAX 0.12 sources and build are designed to one day be
+added into mainline Nutch as a proper "contrib" package; then
+eventually be fully integrated into the core Nutch source code.
 
+======================================================================
 
+Most of the NutchWAX source code is relatively straightforward to those
+already familiar with the inner workings of Nutch.  Still, one class
+deserves special attention:
+
+  src/java/org/archive/nutchwax/ArcsToSegment.java
+
+This is where ARC/WARC files are read and their documents are imported
+into a Nutch segment.
+
+It is inspired by:
+
+  nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+
+on the Nutch SVN head.
+
+Our implementation differs in a few important ways:
+
+  o Rather than taking a directory with ARC files as input, we take a
+    manifest file with URLs to ARC files.  This way, the manifest is
+    split up among the distributed Hadoop jobs and the ARC files are
+    processed in whole by each worker.
+
+    In the Nutch SVN, the ArcSegmentCreator.java expects the input
+    directory to contain the ARC files and (AFAICT) splits them up and
+    distributes them across the Hadoop workers.
+
+  o We use the standard Internet Archive ARCReader and WARCReader
+    classes.  Thus, NutchWAX can read both ARC and WARC files, whereas
+    the ArcSegmentCreator class can only read ARC files.
+
+  o We add metadata fields to the document, which are then available
+    to the "index-nutchwax" plugin at indexing-time.
+
+    ArcsToSegment.importRecord()
+      ...
+      contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype()          );
+      contentMetadata.set( NutchWax.ARCNAME_KEY,      meta.getArcFile().getName() );
+      contentMetadata.set( NutchWax.COLLECTION_KEY,   collectionName              );
+      contentMetadata.set( NutchWax.DATE_KEY,         meta.getDate()              );
+      ...
