[Archive-access-cvs] SF.net SVN: archive-access:[2973] trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt

Revision: 2973
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2973&view=rev
Author:   binzino
Date:     2010-03-18 19:26:44 +0000 (Thu, 18 Mar 2010)

Log Message:
-----------
Updated to match NW 0.13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt

Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================

--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt	2010-03-16 21:37:14 UTC (rev 2972)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt	2010-03-18 19:26:44 UTC (rev 2973)
@@ -1,6 +1,6 @@
 
 BUILD-NOTES.txt
-2008-12-18
+2010-02-13
 Aaron Binns
 
 ======================================================================
@@ -13,15 +13,15 @@
 
 ======================================================================
 
-This 0.12.x release of NutchWAX is radically different in source-code
+This 0.13 release of NutchWAX is radically different in source-code
 form compared to the previous release, 0.10.
 
-One of the design goals of 0.12.x was to reduce or even eliminate the
+One of the design goals of 0.13 was to reduce or even eliminate the
 "copy/paste/edit" approach of 0.10.  The 0.10 (and prior) NutchWAX
 releases had to copy/paste/edit large chunks of Nutch source code in
 order to add the NutchWAX features.
 
-Also, the NutchWAX 0.12.x sources and build are designed to one day be
+Also, the NutchWAX 0.13 sources and build are designed to one day be
 added into mainline Nutch as a proper "contrib" package; then
 eventually be fully integrated into the core Nutch source code.
 
@@ -77,47 +77,7 @@
 to the Nutch source and configuration files.
 
 ----------------------------------------------------------------------
-The file
 
-  /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml
-
-contains two errors: one where a mimetype is referenced before it is
-defined; and a second where a definition has an illegal character.
-
-These errors cause Nutch to not recognize certain mimetypes and
-therefore will ignore documents matching those mimetypes.
-
-There are two fixes:
-
- 1. Move
-
-	<mime-type type="application/xml">
-		<alias type="text/xml" />
-		<glob pattern="*.xml" />
-	</mime-type>
-
-    definition higher up in the file, before the reference to it.
-
- 2. Remove
-
-	<mime-type type="application/x-ms-dos-executable">
-		<alias type="application/x-dosexec;exe" />
-	</mime-type>
-
-    as the ';' character is illegal according to the comments in the
-    Nutch code.
-
-You can either apply these patches yourself, or copy an already-patched
-copy from:
-
-  /opt/nutchwax-0.12.4/contrib/archive/conf/tika-mimetypes.xml
-
-to 
-
-  /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml
-
-----------------------------------------------------------------------
-
 In the file 'conf/nutch-site.xml' we define some properties to
 over-ride the values in 'conf/nutch-default.xml'.
 
@@ -130,27 +90,37 @@
 
 to
 
-  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
+  protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
 
 In short, we add:
 
-  index-nutchwax
-  query-nutchwax
-  urlfilter-nutchwax
-  parse-pdf
+ parse-pdf
+ index-nutchwax
+ query-nutchwax
+ urlfilter-nutchwax
 
 and remove:
 
-  urlfilter-regex
-  urlnormalizer-(pass|regex|basic)
+ index-basic
+ index-anchor
+ query-site
+ query-url
+ urlfilter-regex
+ urlnormalizer-(pass|regex|basic)
 
-The only *required* changes are the additions of the NutchWAX index
-and query plugins.  The rest are optional, but recommended.
 
 The "parse-pdf" plugin is added simply because we have lots of PDFs in
 our archives and we want to index them.  We sometimes remove the
 "parse-js" plugin if we don't care to index JavaScript files.
 
+The Nutch index-basic and index-anchor filters are removed and
+replaced with the NutchWAX index-nutchwax filter.  Similarly, we
+remove the Nutch query-site and query-url filters, replacing them with
+the single NutchWAX query-nutchwax filter.  By using the configurable
+NutchWAX filters for indexing and querying, we get more powerful and
+consistent behavior across metadata fields.  Note that we do retain
+the Nutch query-basic filter, however.
+
 We also remove the default Nutch URL filtering and normalizing plugins
 because we do not need the URLs normalized nor filtered.  We trust
 that the tool that produced the ARC/WARC file will have normalized the
@@ -166,6 +136,14 @@
 --------------------------------------------------
 indexingfilter.order
 --------------------------------------------------
+If we use the indexing filters as specified in the previous section,
+then this property can remain unset.  However, if you choose to use
+the Nutch index-basic filter, then you *must* specify the order in
+which the filters will be used.  If you don't, the filters will be
+applied in a random order (per Nutch's design) and, since one may
+overwrite the values of another, you won't know what values will
+result.  In that case, you need to specify the order.
+
 Add this property with a value of
 
     org.apache.nutch.indexer.basic.BasicIndexingFilter
@@ -174,8 +152,6 @@
 So that the NutchWAX indexing filter is run after the Nutch basic
 indexing filter.
 
-A full explanation is given in "README-dedup.txt".
-
 --------------------------------------------------
 mime.type.magic
 --------------------------------------------------
@@ -205,37 +181,44 @@
 
 The specifications here are of the form:
 
-  src-key:lowercase:store:tokenize:exclusive:dest-key
+  src-key:lowercase:store:index:exclusive:dest-key
 
 where the only required part is the "src-key", the rest will assume
 the following defaults:
 
   lowercase = true
   store     = true
-  tokenize  = false
+  index     = tokenized
   exclusive = true
   dest-key  = src-key
 
+For the 'index' property, the possible values are:
+  tokenized
+  untokenized
+  no_norms
+  no
+
+corresponding to the Lucene options of the same names.
+
 We recommend:
 
 <property>
   <name>nutchwax.filter.index</name>
   <value>
-    url:false:true:true
-    url:false:true:false:true:exacturl
-    orig:false
-    digest:false
-    filename:false
-    fileoffset:false
-    collection
-    date
-    type
-    length
+    title:false:true:tokenized
+    content:false:false:tokenized
+    site:false:false:untokenized
+
+    url:false:true:tokenized
+    digest:false:true:no
+
+    collection:true:true:no_norms
+    date:true:true:no_norms
+    type:true:true:no_norms
+    length:false:true:no
   </value>
 </property>
 
-The "url", "orig" and "digest" values are required, the rest are
-optional, but strongly recommended.
 
 --------------------------------------------------
 nutchwax.filter.query
@@ -274,15 +257,10 @@
 <property>
   <name>nutchwax.filter.query</name>
   <value>
-    raw:digest:false
-    raw:filename:false
-    raw:fileoffset:false
-    raw:exacturl:false
     group:collection
+    group:site:false
     group:type
-    field:anchor
     field:content
-    field:host
     field:title
   </value>
 </property>
@@ -428,3 +406,31 @@
     <value>false</value>
   </property>
 
+
+--------------------------------------------------
+searcher.fieldcache
+--------------------------------------------------
+
+NutchWAX contains a patch controlling the use of a "fieldcache" in the
+Nutch searcher.  Without this patch Nutch will read the entire set of
+hostnames from the index into an in-memory cache.  This cache is then
+consulted when performing de-duplication of results per the
+"hitsPerSite" feature.
+
+For small-to-medium indexes, this can improve performance as the
+de-duplication information is entirely in memory and no disk access is
+required.
+
+However, for large indexes, in the tens of gigabytes in size, reading
+the entire set of hostnames into an in-memory cache can exhaust the
+Java heap.  In this case, omitting the cache altogether and just
+reading the values off disk as needed is better.
+
+The NutchWAX patch controls the use of this cache based on this property
+value.  If set to false, then the cache is not used at all.
+
+<property>
+  <name>searcher.fieldcache</name>
+  <value>false</value>
+</property>
+
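
As a quick sanity check on the plugin list changed in the diff above,
the following Python sketch tests which plugin ids the new expression
selects.  It assumes that each plugin id must match the whole
expression (a full match, not a substring search); that matching rule
and the helper name is_included are assumptions made for this
illustration, not taken from the build notes.

```python
import re

# New plugin expression, copied from the rev 2973 side of the diff.
PLUGIN_INCLUDES = (
    "protocol-http|parse-(text|html|js|pdf)|index-nutchwax|"
    "query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax"
)


def is_included(plugin_id, expression=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the expression.

    Assumption: ids are matched in full, so "index-basic" is not
    accepted merely because it shares a prefix with an alternative.
    """
    return re.fullmatch(expression, plugin_id) is not None


if __name__ == "__main__":
    # Ids taken from the "add" and "remove" lists in the notes.
    candidates = [
        "parse-pdf", "index-nutchwax", "query-nutchwax",
        "urlfilter-nutchwax", "query-basic",
        "index-basic", "index-anchor", "query-site", "query-url",
        "urlfilter-regex", "urlnormalizer-basic",
    ]
    for plugin_id in candidates:
        state = "included" if is_included(plugin_id) else "excluded"
        print(f"{plugin_id:22s} {state}")
```

Running the script lists the four added plugins and query-basic as
included, and the removed Nutch plugins as excluded, which matches the
"add" and "remove" lists in the notes.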

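The field specification format for nutchwax.filter.index shown in the
diff above (src-key:lowercase:store:index:exclusive:dest-key, with
only src-key required) can be sketched in Python as follows.  This is
not the NutchWAX parser; the names FieldSpec and parse_spec are
hypothetical, and treating an empty middle part as "use the default"
is an assumption of this sketch.

```python
from dataclasses import dataclass

# The four 'index' values listed in the build notes.
INDEX_MODES = ("tokenized", "untokenized", "no_norms", "no")


@dataclass
class FieldSpec:
    src_key: str
    lowercase: bool = True      # default per the notes
    store: bool = True          # default per the notes
    index: str = "tokenized"    # default per the notes
    exclusive: bool = True      # default per the notes
    dest_key: str = ""          # filled in with src_key when omitted


def parse_spec(line):
    """Parse one 'src-key:lowercase:store:index:exclusive:dest-key' line."""
    parts = [p.strip() for p in line.strip().split(":")]
    if not parts[0]:
        raise ValueError("src-key is required")
    if len(parts) > 6:
        raise ValueError(f"too many parts in {line!r}")
    # Pad so that omitted trailing parts fall back to their defaults.
    parts += [""] * (6 - len(parts))
    src, lowercase, store, index, exclusive, dest = parts

    spec = FieldSpec(src_key=src, dest_key=dest or src)
    if lowercase:
        spec.lowercase = lowercase.lower() == "true"
    if store:
        spec.store = store.lower() == "true"
    if index:
        if index not in INDEX_MODES:
            raise ValueError(f"unknown index mode {index!r}")
        spec.index = index
    if exclusive:
        spec.exclusive = exclusive.lower() == "true"
    return spec


def parse_specs(value):
    """Parse a whitespace-separated property value into FieldSpec objects."""
    return [parse_spec(token) for token in value.split()]


if __name__ == "__main__":
    for spec in parse_specs("title:false:true:tokenized digest:false:true:no"):
        print(spec)
```

For example, parse_spec("collection") yields all of the documented
defaults with dest_key equal to "collection", while
parse_spec("digest:false:true:no") keeps case, stores the value, and
does not index it.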
