[Archive-access-cvs] SF.net SVN: archive-access:[2747] tags/nutchwax-0_12_5/archive

Revision: 2747
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2747&view=rev
Author:   binzino
Date:     2009-06-25 22:00:14 +0000 (Thu, 25 Jun 2009)

Log Message:
-----------
Updated for 0.12.5 release.

Modified Paths:
--------------
    tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt
    tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt
    tags/nutchwax-0_12_5/archive/HOWTO.txt
    tags/nutchwax-0_12_5/archive/INSTALL.txt
    tags/nutchwax-0_12_5/archive/README.txt
    tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt

Modified: tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt
===================================================================

--- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt	2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt	2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,6 +1,6 @@
 
 BUILD-NOTES.txt
-2008-12-18
+2009-06-25
 Aaron Binns
 
 ======================================================================
@@ -130,27 +130,37 @@
 
 to
 
-  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
+  protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
 
 In short, we add:
 
-  index-nutchwax
-  query-nutchwax
-  urlfilter-nutchwax
-  parse-pdf
+ parse-pdf
+ index-nutchwax
+ query-nutchwax
+ urlfilter-nutchwax
 
 and remove:
 
-  urlfilter-regex
-  urlnormalizer-(pass|regex|basic)
+ index-basic
+ index-anchor
+ query-site
+ query-url
+ urlfilter-regex
+ urlnormalizer-(pass|regex|basic)
 
-The only *required* changes are the additions of the NutchWAX index
-and query plugins.  The rest are optional, but recommended.
 
 The "parse-pdf" plugin is added simply because we have lots of PDFs in
 our archives and we want to index them.  We sometimes remove the
 "parse-js" plugin if we don't care to index JavaScript files.
 
+The Nutch index-basic and index-anchor filters are removed and
+replaced with the NutchWAX index-nutchwax filter.  Similarly, we
+remove the Nutch query-site and query-url filters, replacing them with
+the single NutchWAX query-nutchwax filter.  By using the configurable
+NutchWAX filters for indexing and querying, we get more powerful and
+consistent behavior across metadata fields.  Note, however, that we do
+retain the Nutch query-basic filter.
+
 We also remove the default Nutch URL filtering and normalizing plugins
 because we do not need the URLs normalized nor filtered.  We trust
 that the tool that produced the ARC/WARC file will have normalized the
@@ -166,6 +176,14 @@
 --------------------------------------------------
 indexingfilter.order
 --------------------------------------------------
+If we use the indexing filters as specified in the previous section,
+then this property can remain unset.  However, if you choose to use
+the Nutch index-basic filter, then you *must* specify the order in
+which the filters will be used.  If you don't, then the filters will be
+applied in a random order (per Nutch's design) and, since one may
+overwrite the values of another, you won't know what values will
+result.  In that case, you need to specify the order.
+
 Add this property with a value of
 
     org.apache.nutch.indexer.basic.BasicIndexingFilter
@@ -174,8 +192,6 @@
 So that the NutchWAX indexing filter is run after the Nutch basic
 indexing filter.
 
-A full explanation is given in "README-dedup.txt".
-
 --------------------------------------------------
 mime.type.magic
 --------------------------------------------------
@@ -205,37 +221,44 @@
 
 The specifications here are of the form:
 
-  src-key:lowercase:store:tokenize:exclusive:dest-key
+  src-key:lowercase:store:index:exclusive:dest-key
 
 where the only required part is the "src-key", the rest will assume
 the following defaults:
 
   lowercase = true
   store     = true
-  tokenize  = false
+  index     = tokenized
   exclusive = true
   dest-key  = src-key
 
+For the 'index' property, the possible values are:
+  tokenized
+  untokenized
+  no_norms
+  no
+
+corresponding to the Lucene options of the same names.
+
 We recommend:
 
 <property>
   <name>nutchwax.filter.index</name>
   <value>
-    url:false:true:true
-    url:false:true:false:true:exacturl
-    orig:false
-    digest:false
-    filename:false
-    fileoffset:false
-    collection
-    date
-    type
-    length
+    title:false:true:tokenized
+    content:false:false:tokenized
+    site:false:false:untokenized
+
+    url:false:true:no
+    digest:false:true:no
+
+    collection:true:true:no_norms
+    date:true:true:no_norms
+    type:true:true:no_norms
+    length:false:true:no
   </value>
 </property>
 
-The "url", "orig" and "digest" values are required, the rest are
-optional, but strongly recommended.
 
 --------------------------------------------------
 nutchwax.filter.query
@@ -274,15 +297,10 @@
 <property>
   <name>nutchwax.filter.query</name>
   <value>
-    raw:digest:false
-    raw:filename:false
-    raw:fileoffset:false
-    raw:exacturl:false
     group:collection
+    group:site:false
     group:type
-    field:anchor
     field:content
-    field:host
     field:title
   </value>
 </property>
@@ -428,3 +446,31 @@
     <value>false</value>
   </property>
 
+
+--------------------------------------------------
+searcher.fieldcache
+--------------------------------------------------
+
+NutchWAX contains a patch controlling the use of a "fieldcache" in the
+Nutch searcher.  Without this patch Nutch will read the entire set of
+hostnames from the index into an in-memory cache.  This cache is then
+consulted when performing de-duplication of results per the
+"hitsPerSite" feature.
+
+For small-to-medium indexes, this can improve performance as the
+de-duplication information is entirely in memory and no disk access is
+required.
+
+However, for large indexes, in the tens of gigabytes in size, reading
+the entire set of hostnames into an in-memory cache can exhaust the
+Java heap.  In this case, omitting the cache altogether and just
+reading the values off disk as needed is better.
+
+The NutchWAX patch controls the use of this cache based on this property
+value.  If set to false, then the cache is not used at all.
+
+<property>
+  <name>searcher.fieldcache</name>
+  <value>true</value>
+</property>
+
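
Note on the nutchwax.filter.index format in BUILD-NOTES.txt: the
colon-separated specification and its defaults can be sketched as a
small parser.  This is only an illustration in Python, not the NutchWAX
implementation; the names IndexSpec, parse_spec and parse_value are
hypothetical, and treating an omitted part as "use the default" is an
assumption drawn from the defaults table in the diff.

```python
from dataclasses import dataclass

# Values listed in BUILD-NOTES.txt for the 'index' part of a specification.
INDEX_MODES = {"tokenized", "untokenized", "no_norms", "no"}


@dataclass(frozen=True)
class IndexSpec:
    src_key: str
    lowercase: bool
    store: bool
    index: str
    exclusive: bool
    dest_key: str


def _flag(text, default):
    # Assumption: an omitted or empty part falls back to the documented default.
    return default if text == "" else text.lower() == "true"


def parse_spec(spec):
    """Parse one 'src-key:lowercase:store:index:exclusive:dest-key' entry."""
    parts = spec.strip().split(":")
    if not parts[0]:
        raise ValueError("src-key is required")
    # Pad the optional parts so that the defaults below apply.
    parts += [""] * (6 - len(parts))
    src, lowercase, store, index, exclusive, dest = parts[:6]
    index = index or "tokenized"
    if index not in INDEX_MODES:
        raise ValueError("unknown index mode: " + index)
    return IndexSpec(
        src_key=src,
        lowercase=_flag(lowercase, True),
        store=_flag(store, True),
        index=index,
        exclusive=_flag(exclusive, True),
        dest_key=dest or src,
    )


def parse_value(value):
    """Parse the whitespace-separated body of the <value> element."""
    return [parse_spec(entry) for entry in value.split()]
```

For example, parse_spec("digest:false:true:no") describes a field that
is stored but not indexed, which is the case the release notes give for
the digest.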

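Note on the plugin list in BUILD-NOTES.txt: the new value is a regular
expression over plugin ids, so the effect of the change can be checked
by matching ids against it.  The sketch below is an illustration in
Python, not Nutch code; it assumes each plugin id must match the whole
expression, and the helper name is_enabled is hypothetical.

```python
import re

# The new plugin list from the BUILD-NOTES.txt diff, copied verbatim.
PLUGIN_INCLUDES = (
    "protocol-http|parse-(text|html|js|pdf)|index-nutchwax|"
    "query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax"
)


def is_enabled(plugin_id, pattern=PLUGIN_INCLUDES):
    """Return True if plugin_id matches the whole include expression."""
    # Assumption: full-match semantics, so "index-basic" is not enabled
    # merely because some other alternative starts with "index-".
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in ("index-nutchwax", "query-basic", "index-basic",
                      "query-site", "urlfilter-regex"):
        print(plugin_id, is_enabled(plugin_id))
```

Under that assumption, index-nutchwax, query-nutchwax, query-basic and
parse-pdf are enabled, while index-basic, index-anchor, query-site,
query-url and urlfilter-regex are not, which agrees with the "add" and
"remove" lists in the diff.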
Modified: tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt	2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt	2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,6 +1,6 @@
 
 HOWTO-xslt.txt
-2008-12-18
+2009-06-25
 Aaron Binns
 
 Table of Contents
@@ -128,8 +128,5 @@
 
 You can find sample 'web.xml' and 'search.xsl' files in 
 
-  contrib/archive/web
-
-in the compiled Nutch package.  Or in this source tree under
-
-  src/web
+  ./src/nutch/src/web/jsp/search.xsl
+  ./src/nutch/src/web/web.xml

Modified: tags/nutchwax-0_12_5/archive/HOWTO.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/HOWTO.txt	2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/HOWTO.txt	2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,6 +1,6 @@
 
 HOWTO.txt
-2008-07-28
+2009-06-25
 Aaron Binns
 
 Table of Contents
@@ -26,7 +26,7 @@
 
     This HOWTO assumes it is installed in
 
-      /opt/nutchwax-0.12.4
+      /opt/nutchwax-0.12.5
 
  2. ARC/WARC files.
 
@@ -96,9 +96,9 @@
   $ cd ../
   $ ls -F1
   crawl/
-  $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer
+  $ /opt/nutchwax-0.12.5/bin/nutch org.archive.nutchwax.NutchWaxBean computer
 
-This calls the NutchBean to execute a simple keyword search for
+This calls the NutchWaxBean to execute a simple keyword search for
 "computer".  Use whatever query term you think appears in the
 documents you imported.
 

Modified: tags/nutchwax-0_12_5/archive/INSTALL.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/INSTALL.txt	2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/INSTALL.txt	2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,6 +1,6 @@
 
 INSTALL.txt
-2009-03-08
+2009-06-25
 Aaron Binns
 
 Table of Contents
@@ -62,10 +62,12 @@
 SVN: nutch-1.0-dev
 ------------------
 As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
-Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12.4 is
-built against is:
+Although the Nutch project released 1.0 in early 2009, there were so
+many changes that NutchWAX 0.12.5 is still built against the pre-1.0
+codebase.
 
+The specific SVN revision that NutchWAX 0.12.5 is built against is:
+
   701524
 
 To checkout this revision of Nutch, use:
@@ -79,14 +81,14 @@
 
 SVN: NutchWAX
 -------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.4
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.5
 source into Nutch's "contrib" directory.
 
  $ cd contrib
- $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_4/archive
+ $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_5/archive
 
 This will create a sub-directory named "archive" containing the
-NutchWAX 0.12.4 sources.
+NutchWAX 0.12.5 sources.
 
 Build and install
 -----------------
@@ -113,7 +115,7 @@
 
   $ cd /opt
   $ tar xvfz nutch-1.0-dev.tar.gz
-  $ mv nutch-1.0-dev nutchwax-0.12.4
+  $ mv nutch-1.0-dev nutchwax-0.12.5
 
 
 ======================================================================
@@ -126,24 +128,24 @@
 Install it simply by untarring it, for example:
 
   $ cd /opt
-  $ tar xvfz nutchwax-0.12.4.tar.gz
+  $ tar xvfz nutchwax-0.12.5.tar.gz
 
 
 ======================================================================
 Install start-up scripts
 ======================================================================
 
-NutchWAX 0.12.4 comes with a Unix init.d script which can be used to
+NutchWAX 0.12.5 comes with a Unix init.d script which can be used to
 automatically start the searcher slaves for a multi-node search
 configuration.
 
 Assuming you installed NutchWAX as
 
-  /opt/nutchwax-0.12.4
+  /opt/nutchwax-0.12.5
 
 the script is found at
 
-  /opt/nutchwax-0.12.4/contrib/archive/etc/init.d/searcher-slave
+  /opt/nutchwax-0.12.5/contrib/archive/etc/init.d/searcher-slave
 
 This script can be placed in /etc/init.d then added to the list of
 startup scripts to run at bootup by using commands appropriate to your

Modified: tags/nutchwax-0_12_5/archive/README.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/README.txt	2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/README.txt	2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,6 +1,6 @@
 
 README.txt
-2009-05-05
+2009-06-25
 Aaron Binns
 
 Table of Contents
@@ -13,7 +13,7 @@
 Introduction
 ======================================================================
 
-Welcome to NutchWAX 0.12.4!
+Welcome to NutchWAX 0.12.5!
 
 NutchWAX is a set of add-ons to Nutch in order to index and search
 archived web data.

Modified: tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt	2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt	2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,9 +1,9 @@
 
 RELEASE-NOTES.TXT
-2009-05-05
+2009-06-25
 Aaron Binns
 
-Release notes for NutchWAX 0.12.4
+Release notes for NutchWAX 0.12.5
 
 For the most recent updates and information on NutchWAX,
 please visit the project wiki at:
@@ -15,17 +15,75 @@
 Overview
 ======================================================================
 
-NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
+NutchWAX 0.12.5 contains numerous enhancements and fixes to 0.12.4
 
-  o Option to omit storing of content during import.
-  o Support for per-collection segments in master/slave config.
-  o Additional diagnostic/log messages to help troubleshoot common
-    deployment mistakes.
-  o PageRankDb similar to LinkDb but only keeping inlink counts.
-  o Improved paging through results, handling "paging past the end".
+  o Command-line options for NutchWaxBean to configure number of
+    results to emit and how many hits per site to allow.
 
+  o Change default configuration to use NutchWAX indexing and query
+    filters instead of Nutch-provided ones.  This give more consistent
+    control over indexing and query behavior.
 
+  o No longer store the unique document key (URL+digest) in a separate
+    field in the index.  Since the URL and digest are stored, just use
+    them to synthesize the unique document key as needed.
+
+  o Trimmed down the default configuration of indexing and query
+    filters to only store and index the minimum information needed for
+    typical NutchWAX installations.
+
+
 ======================================================================
+Configuration changes
+======================================================================
+
+As mentioned in the overview, NutchWAX 0.12.5 has some important
+changes to the default configuration.
+
+Previously, the indexing and query filter configuration utilized a
+combination of filters from Nutch and NutchWAX.  This was in line with
+our goal of NutchWAX being a set of add-ons to Nutch.
+
+However, in practice, the mixing of these filters often led to
+confusion since the NutchWAX filters could be configured via
+properties in the Nutch configuration files whereas the Nutch filters
+were hard-coded and less powerful.
+
+Now, all the Nutch indexing filters have been removed and are replaced
+with the single NutchWAX indexing filter.  Similarly, all but one
+Nutch query filter are removed, replaced by the configurable NutchWAX
+query filter.  We do retain the Nutch 'query-basic' filter as it
+contains the logic for automatically applying a query to multiple
+fields with proportionate weights; something not subsumed by the
+NutchWAX query filter.
+
+
+In addition to removing the Nutch filters, the NutchWAX index and
+query filters are streamlined to only index and store the minimum set
+of metadata fields for typical deployments.
+
+In previous versions of NutchWAX, the indexing filters were configured
+to index and store nearly every piece of metadata available.  Although
+this seems desirable, it adds a lot of storage overhead to the index,
+and can hamper run-time query speed just by having unnecessary
+information in the index (more junk for the disk to seek around).
+
+The NutchWAX 0.12.5 configuration omits the typically unnecessary
+metadata fields from the index and only indexes those fields we think
+are needed for typical searches.
+
+For example, while we do store the digest, we do not index it as it's
+very unusual for someone to search for a document with a specific
+SHA-1 digest value.  You could decide you want that, in which case you
+can edit the configuration and re-index the data.  You would have to
+correspondingly edit the query filter and its configuration to allow
+for searching on that field as well.
+
+We have found that this streamlined indexing configuration yields
+Lucene indexes about 25% smaller than with NutchWAX 0.12.4.
+
+
+======================================================================
 Issues
 ======================================================================
 
@@ -35,23 +93,16 @@
 
 Issues resolved in this release:
 
-WAX-27 Sensible output for requesting page of results past the end.
+WAX-45 Add ability to store but not index a field via
+       ConfigurableIndexingFilter.
 
-WAX-34 Add option to omit storing of content in segment
+WAX-46 Add option to DumpParallelIndex to output only single field.
 
-WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
-       rather than actual inlinks.
+WAX-47 Stop storing document key in "orig" field in index, synthesize
+       it as needed from the "url" and "digest" fields.
 
-WAX-36 Some additional diagnostics on connecting results to segments
-       and snippets would be very helpful.
+WAX-48 Use NutchWAX configurable query filter for site and url fields.
 
-WAX-37 Per-collection segments not supported in distributed
-       master-slave configuration.
+WAX-49 Add "hitsPerSite" option to NutchWaxBean.
 
-WAX-38 Build omits neessary libraries from .job file.
-
-WAX-39 Write more efficient, specialized segment parse_text merging.
-
-WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher
-
-WAX-42 Add option to continue importing if an arcfile cannot be read.
+WAX-50 Add "num hits to find" option to NutchWaxBean.
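
Note on WAX-47: RELEASE-NOTES.txt says the unique document key
(URL+digest) is no longer stored in its own field and is synthesized
from the stored "url" and "digest" fields as needed.  A minimal sketch
of that idea in Python follows.  The function names, the dict-based
document and the single-space separator are all assumptions made for
illustration; the diff does not say how NutchWAX joins the two values.

```python
def document_key(url, digest, sep=" "):
    """Build a key for one capture from its stored URL and digest.

    The separator is an assumption; any character that cannot occur in
    either part would serve equally well for this sketch.
    """
    if not url or not digest:
        raise ValueError("both url and digest are required")
    return url + sep + digest


def unique_documents(docs):
    """Keep the first document seen for each synthesized key.

    Each document is a dict with "url" and "digest" entries, standing in
    for the stored fields of an index hit.
    """
    seen = set()
    result = []
    for doc in docs:
        key = document_key(doc["url"], doc["digest"])
        if key not in seen:
            seen.add(key)
            result.append(doc)
    return result
```

Because the key is derived from two fields that are stored anyway, the
index does not need to carry a third field holding the same
information.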

