From: <bi...@us...> - 2009-05-05 21:15:47
Revision: 2700
http://archive-access.svn.sourceforge.net/archive-access/?rev=2700&view=rev
Author: binzino
Date: 2009-05-05 21:14:39 +0000 (Tue, 05 May 2009)
Log Message:
-----------
Fix type-o
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-05-05 20:24:22 UTC (rev 2699)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-05-05 21:14:39 UTC (rev 2700)
@@ -222,7 +222,7 @@
<name>nutchwax.filter.index</name>
<value>
url:false:true:true
- url:flase:true:false:true:exacturl
+ url:false:true:false:true:exacturl
orig:false
digest:false
filename:false
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <bi...@us...> - 2010-03-18 19:26:54
Revision: 2973
http://archive-access.svn.sourceforge.net/archive-access/?rev=2973&view=rev
Author: binzino
Date: 2010-03-18 19:26:44 +0000 (Thu, 18 Mar 2010)
Log Message:
-----------
Updated to match NW 0.13.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2010-03-16 21:37:14 UTC (rev 2972)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2010-03-18 19:26:44 UTC (rev 2973)
@@ -1,6 +1,6 @@
BUILD-NOTES.txt
-2008-12-18
+2010-02-13
Aaron Binns
======================================================================
@@ -13,15 +13,15 @@
======================================================================
-This 0.12.x release of NutchWAX is radically different in source-code
+This 0.13 release of NutchWAX is radically different in source-code
form compared to the previous release, 0.10.
-One of the design goals of 0.12.x was to reduce or even eliminate the
+One of the design goals of 0.13 was to reduce or even eliminate the
"copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX
releases had to copy/paste/edit large chunks of Nutch source code in
order to add the NutchWAX features.
-Also, the NutchWAX 0.12.x sources and build are designed to one day be
+Also, the NutchWAX 0.13 sources and build are designed to one day be
added into mainline Nutch as a proper "contrib" package; then
eventually be fully integrated into the core Nutch source code.
@@ -77,47 +77,7 @@
to the Nutch source and configuration files.
----------------------------------------------------------------------
-The file
- /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml
-
-contains two errors: one where a mimetype is referenced before it is
-defined; and a second where a definition has an illegal character.
-
-These errors cause Nutch to not recognize certain mimetypes and
-therefore will ignore documents matching those mimetypes.
-
-There are two fixes:
-
- 1. Move
-
- <mime-type type="application/xml">
- <alias type="text/xml" />
- <glob pattern="*.xml" />
- </mime-type>
-
- definition higher up in the file, before the reference to it.
-
- 2. Remove
-
- <mime-type type="application/x-ms-dos-executable">
- <alias type="application/x-dosexec;exe" />
- </mime-type>
-
- as the ';' character is illegal according to the comments in the
- Nutch code.
-
-You can either apply these patches yourself, or copy an already-patched
-copy from:
-
- /opt/nutchwax-0.12.4/contrib/archive/conf/tika-mimetypes.xml
-
-to
-
- /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml
-
-----------------------------------------------------------------------
-
In the file 'conf/nutch-site.xml' we define some properties to
over-ride the values in 'conf/nutch-default.xml'.
@@ -130,27 +90,37 @@
to
- protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
+ protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
In short, we add:
- index-nutchwax
- query-nutchwax
- urlfilter-nutchwax
- parse-pdf
+ parse-pdf
+ index-nutchwax
+ query-nutchwax
+ urlfilter-nutchwax
and remove:
- urlfilter-regex
- urlnormalizer-(pass|regex|basic)
+ index-basic
+ index-anchor
+ query-site
+ query-url
+ urlfilter-regex
+ urlnormalizer-(pass|regex|basic)
-The only *required* changes are the additions of the NutchWAX index
-and query plugins. The rest are optional, but recommended.
The "parse-pdf" plugin is added simply because we have lots of PDFs in
our archives and we want to index them. We sometimes remove the
"parse-js" plugin if we don't care to index JavaScript files.
+The Nutch index-basic and index-anchor filters are removed and
+replaced with the NutchWAX index-nutchwax filter. Similarly, we
+remove the Nutch query-site and query-url filters, replacing them with
+the single NutchWAX query-nutchwax filter. By using the configurable
+NutchWAX filters for indexing and querying, we get more powerful and
+consistent behavior across metadata fields. Note that we do retain
+the Nutch query-basic filter however.
+
We also remove the default Nutch URL filtering and normalizing plugins
because we do not need the URLs normalized nor filtered. We trust
that the tool that produced the ARC/WARC file will have normalized the
@@ -166,6 +136,14 @@
--------------------------------------------------
indexingfilter.order
--------------------------------------------------
+If we use the indexing filters as specified in the previous section,
+then this property can remain unset. However, if you choose to use
+the Nutch index-basic filter, then you *must* specify the order in
+which the filters will be used. If you don't then the filters will be
+applied in a random order (per Nutch's design) and since one may
+over-write the values of another you won't know what values will
+result. In that case, you need to specify the order.
+
Add this property with a value of
org.apache.nutch.indexer.basic.BasicIndexingFilter
@@ -174,8 +152,6 @@
So that the NutchWAX indexing filter is run after the Nutch basic
indexing filter.
-A full explanation is given in "README-dedup.txt".
-
--------------------------------------------------
mime.type.magic
--------------------------------------------------
@@ -205,37 +181,44 @@
The specifications here are of the form:
- src-key:lowercase:store:tokenize:exclusive:dest-key
+ src-key:lowercase:store:index:exclusive:dest-key
where the only required part is the "src-key", the rest will assume
the following defaults:
lowercase = true
store = true
- tokenize = false
+ index = tokenized
exclusive = true
dest-key = src-key
+For the 'index' property, the possible values are:
+ tokenized
+ untokenized
+ no_norms
+ no
+
+corresponding to the Lucene options of the same names.
+
We recommend:
<property>
<name>nutchwax.filter.index</name>
<value>
- url:false:true:true
- url:false:true:false:true:exacturl
- orig:false
- digest:false
- filename:false
- fileoffset:false
- collection
- date
- type
- length
+ title:false:true:tokenized
+ content:false:false:tokenized
+ site:false:false:untokenized
+
+ url:false:true:tokenized
+ digest:false:true:no
+
+ collection:true:true:no_norms
+ date:true:true:no_norms
+ type:true:true:no_norms
+ length:false:true:no
</value>
</property>
-The "url", "orig" and "digest" values are required, the rest are
-optional, but strongly recommended.
--------------------------------------------------
nutchwax.filter.query
@@ -274,15 +257,10 @@
<property>
<name>nutchwax.filter.query</name>
<value>
- raw:digest:false
- raw:filename:false
- raw:fileoffset:false
- raw:exacturl:false
group:collection
+ group:site:false
group:type
- field:anchor
field:content
- field:host
field:title
</value>
</property>
@@ -428,3 +406,31 @@
<value>false</value>
</property>
+
+--------------------------------------------------
+searcher.fieldcache
+--------------------------------------------------
+
+NutchWAX contains a patch controlling the use of a "fieldcache" in the
+Nutch searcher. Without this patch Nutch will read the entire set of
+hostnames from the index into an in-memory cache. This cache is then
+consulted when performing de-duplication of results per the
+"hitsPerSite" feature.
+
+For small-to-medium indexes, this can improve performance as the
+de-duplication information is entirely in memory and no disk access is
+required.
+
+However, for large indexes, in the tens of gigabytes in size, reading
+the entire set of hostnames into an in-memory cache can exhaust the
+Java heap. In this case, omitting the cache altogether and just
+reading the values off disk as needed is better.
+
+The NutchWAX patch controls the use of this cache based on this property
+value. If set to false, then the cache is not used at all.
+
+<property>
+ <name>searcher.fieldcache</name>
+ <value>false</value>
+</property>
+
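The revised plugin list in rev 2973 is a regular expression, so a quick way to see which plugins it selects is to test plugin ids against it. The sketch below assumes the expression has to match a whole plugin id, which is how Nutch is understood to apply its plugin include pattern. The `selected` helper is hypothetical and is not part of Nutch or NutchWAX.

```python
import re

# The plugin list added in rev 2973, copied from the diff above.
NEW_PLUGINS = (
    "protocol-http|parse-(text|html|js|pdf)|index-nutchwax|"
    "query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax"
)


def selected(plugin_id, pattern=NEW_PLUGINS):
    """Return True if the whole plugin id matches the pattern.

    Assumption: a full match is required, so 'index-basic' is not
    selected merely because it shares a prefix with an alternative.
    """
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    for plugin_id in (
        "parse-pdf",
        "index-nutchwax",
        "query-basic",
        "index-basic",
        "index-anchor",
        "query-site",
        "query-url",
        "urlfilter-regex",
    ):
        state = "included" if selected(plugin_id) else "excluded"
        print(f"{plugin_id:16} {state}")
```

Under that assumption the ids listed under "and remove:" in the diff come out as excluded, while query-basic is still included, consistent with the note that query-basic is retained.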
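The `nutchwax.filter.index` specification format introduced in rev 2973 (`src-key:lowercase:store:index:exclusive:dest-key`, with only `src-key` required) can be illustrated with a small parser that fills in the documented defaults. This is a sketch only, not NutchWAX code: the `FieldSpec` class and both function names are hypothetical, and the strict rejection of anything other than `true` or `false` is an assumption made here so that a misspelling like the one fixed in rev 2700 is reported rather than silently accepted. How NutchWAX itself treats such a value is not stated in these notes.

```python
from dataclasses import dataclass

# The four values documented for the 'index' part of a specification.
INDEX_MODES = {"tokenized", "untokenized", "no_norms", "no"}


@dataclass(frozen=True)
class FieldSpec:
    """One parsed line of a nutchwax.filter.index value (hypothetical type)."""
    src_key: str
    lowercase: bool
    store: bool
    index: str
    exclusive: bool
    dest_key: str


def _parse_bool(text, default):
    """Parse 'true'/'false'; an omitted part takes the documented default.

    Assumption: any other spelling is an error. The real NutchWAX
    parser may be more lenient.
    """
    if text == "":
        return default
    if text not in ("true", "false"):
        raise ValueError(f"expected 'true' or 'false', got {text!r}")
    return text == "true"


def parse_field_spec(spec):
    """Parse 'src-key:lowercase:store:index:exclusive:dest-key'."""
    parts = spec.strip().split(":")
    if len(parts) > 6:
        raise ValueError(f"too many parts in {spec!r}")
    if not parts[0]:
        raise ValueError("src-key is required")
    # Pad omitted trailing parts so the defaults below apply to them.
    parts += [""] * (6 - len(parts))
    src_key, lowercase, store, index, exclusive, dest_key = parts
    index = index or "tokenized"
    if index not in INDEX_MODES:
        raise ValueError(f"unknown index mode {index!r}")
    return FieldSpec(
        src_key=src_key,
        lowercase=_parse_bool(lowercase, True),
        store=_parse_bool(store, True),
        index=index,
        exclusive=_parse_bool(exclusive, True),
        dest_key=dest_key or src_key,
    )


def parse_filter_value(value):
    """Parse a multi-line property value, skipping blank lines."""
    return [parse_field_spec(line) for line in value.splitlines() if line.strip()]


if __name__ == "__main__":
    recommended = """
        title:false:true:tokenized
        digest:false:true:no
        collection:true:true:no_norms
    """
    for field in parse_filter_value(recommended):
        print(field)
```

Making the defaults explicit this way shows that a bare `src-key` expands to a lowercased, stored, tokenized, exclusive field whose destination key is the same as its source key.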