From: <bi...@us...> - 2009-06-25 22:00:15
Revision: 2747
http://archive-access.svn.sourceforge.net/archive-access/?rev=2747&view=rev
Author: binzino
Date: 2009-06-25 22:00:14 +0000 (Thu, 25 Jun 2009)
Log Message:
-----------
Updated for 0.12.5 release.
Modified Paths:
--------------
tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt
tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt
tags/nutchwax-0_12_5/archive/HOWTO.txt
tags/nutchwax-0_12_5/archive/INSTALL.txt
tags/nutchwax-0_12_5/archive/README.txt
tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt
Modified: tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt 2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/BUILD-NOTES.txt 2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,6 +1,6 @@
BUILD-NOTES.txt
-2008-12-18
+2009-06-25
Aaron Binns
======================================================================
@@ -130,27 +130,37 @@
to
- protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
+ protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax
In short, we add:
- index-nutchwax
- query-nutchwax
- urlfilter-nutchwax
- parse-pdf
+ parse-pdf
+ index-nutchwax
+ query-nutchwax
+ urlfilter-nutchwax
and remove:
- urlfilter-regex
- urlnormalizer-(pass|regex|basic)
+ index-basic
+ index-anchor
+ query-site
+ query-url
+ urlfilter-regex
+ urlnormalizer-(pass|regex|basic)
-The only *required* changes are the additions of the NutchWAX index
-and query plugins. The rest are optional, but recommended.
The "parse-pdf" plugin is added simply because we have lots of PDFs in
our archives and we want to index them. We sometimes remove the
"parse-js" plugin if we don't care to index JavaScript files.
+The Nutch index-basic and index-anchor filters are removed and
+replaced with the NutchWAX index-nutchwax filter. Similarly, we
+remove the Nutch query-site and query-url filters, replacing them with
+the single NutchWAX query-nutchwax filter. By using the configurable
+NutchWAX filters for indexing and querying, we get more powerful and
+consistent behavior across metadata fields. Note, however, that we
+do retain the Nutch query-basic filter.
+
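To check which plugins a given include pattern selects, the value can be tested against candidate plugin ids. The following is a minimal sketch in Python, assuming the pattern must match the whole plugin id; the helper name `is_included` is hypothetical and not part of Nutch or NutchWAX.

```python
import re

# Pattern copied from the updated value shown above.
PLUGIN_INCLUDES = (
    r"protocol-http|parse-(text|html|js|pdf)|index-nutchwax|"
    r"query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax"
)


def is_included(plugin_id: str, pattern: str = PLUGIN_INCLUDES) -> bool:
    """Return True if the entire plugin id matches the include pattern.

    Assumption: matching is against the whole id, not a substring.
    """
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    candidates = [
        "index-nutchwax", "index-basic", "index-anchor",
        "query-basic", "query-site", "query-url",
        "urlfilter-regex", "parse-pdf",
    ]
    for plugin_id in candidates:
        state = "included" if is_included(plugin_id) else "excluded"
        print(f"{plugin_id}: {state}")
```

Running it lists each candidate as included or excluded, which is a quick way to confirm an edited pattern before restarting a job.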
We also remove the default Nutch URL filtering and normalizing plugins
because we do not need the URLs normalized nor filtered. We trust
that the tool that produced the ARC/WARC file will have normalized the
@@ -166,6 +176,14 @@
--------------------------------------------------
indexingfilter.order
--------------------------------------------------
+If we use the indexing filters as specified in the previous section,
+then this property can remain unset. However, if you choose to use
+the Nutch index-basic filter, then you *must* specify the order in
+which the filters will be used. If you don't, the filters will be
+applied in a random order (per Nutch's design), and since one may
+overwrite the values of another, you won't know what values will
+result.
+
Add this property with a value of
org.apache.nutch.indexer.basic.BasicIndexingFilter
@@ -174,8 +192,6 @@
So that the NutchWAX indexing filter is run after the Nutch basic
indexing filter.
-A full explanation is given in "README-dedup.txt".
-
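Why the order matters can be illustrated with a toy model. This is a sketch only: the two functions below are hypothetical stand-ins for indexing filters that write the same field, not the real Nutch or NutchWAX classes.

```python
def basic_filter(doc: dict) -> dict:
    """Hypothetical stand-in for a filter that sets the 'title' field."""
    doc["title"] = "title from basic filter"
    return doc


def nutchwax_filter(doc: dict) -> dict:
    """Hypothetical stand-in for a second filter that sets the same field."""
    doc["title"] = "title from nutchwax filter"
    return doc


def run_filters(filters) -> dict:
    """Apply the filters in the given order to a fresh document."""
    doc = {}
    for apply_filter in filters:
        doc = apply_filter(doc)
    return doc


if __name__ == "__main__":
    # The filter that runs last decides the final value.
    print(run_filters([basic_filter, nutchwax_filter])["title"])
    print(run_filters([nutchwax_filter, basic_filter])["title"])
```

Whichever filter runs last determines the stored value, so an unspecified order makes the result unpredictable.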
--------------------------------------------------
mime.type.magic
--------------------------------------------------
@@ -205,37 +221,44 @@
The specifications here are of the form:
- src-key:lowercase:store:tokenize:exclusive:dest-key
+ src-key:lowercase:store:index:exclusive:dest-key
where the only required part is the "src-key", the rest will assume
the following defaults:
lowercase = true
store = true
- tokenize = false
+ index = tokenized
exclusive = true
dest-key = src-key
+For the 'index' property, the possible values are:
+ tokenized
+ untokenized
+ no_norms
+ no
+
+corresponding to the Lucene options of the same names.
+
We recommend:
<property>
<name>nutchwax.filter.index</name>
<value>
- url:false:true:true
- url:false:true:false:true:exacturl
- orig:false
- digest:false
- filename:false
- fileoffset:false
- collection
- date
- type
- length
+ title:false:true:tokenized
+ content:false:false:tokenized
+ site:false:false:untokenized
+
+ url:false:true:no
+ digest:false:true:no
+
+ collection:true:true:no_norms
+ date:true:true:no_norms
+ type:true:true:no_norms
+ length:false:true:no
</value>
</property>
-The "url", "orig" and "digest" values are required, the rest are
-optional, but strongly recommended.
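The "src-key:lowercase:store:index:exclusive:dest-key" form and its defaults can be sketched as a small parser. This is an illustration in Python, not the NutchWAX implementation; the names `FieldSpec`, `parse_spec` and `parse_value` are hypothetical, and the handling of empty or malformed parts is an assumption.

```python
from typing import List, NamedTuple

# The four values listed for the 'index' part.
INDEX_MODES = {"tokenized", "untokenized", "no_norms", "no"}


class FieldSpec(NamedTuple):
    src_key: str
    lowercase: bool
    store: bool
    index: str
    exclusive: bool
    dest_key: str


def _flag(text: str, default: bool) -> bool:
    """Assumption: an omitted part takes the default, otherwise 'true' means True."""
    return default if text == "" else text.lower() == "true"


def parse_spec(spec: str) -> FieldSpec:
    """Parse one specification, filling in the documented defaults."""
    parts = spec.strip().split(":")
    if not parts[0]:
        raise ValueError("src-key is required")
    if len(parts) > 6:
        raise ValueError(f"too many parts in {spec!r}")
    parts += [""] * (6 - len(parts))
    src, lowercase, store, index, exclusive, dest = parts
    index = index or "tokenized"
    if index not in INDEX_MODES:
        raise ValueError(f"unknown index mode {index!r}")
    return FieldSpec(
        src_key=src,
        lowercase=_flag(lowercase, True),
        store=_flag(store, True),
        index=index,
        exclusive=_flag(exclusive, True),
        dest_key=dest or src,
    )


def parse_value(value: str) -> List[FieldSpec]:
    """Parse a whitespace-separated property value into specifications."""
    return [parse_spec(token) for token in value.split()]


if __name__ == "__main__":
    for field in parse_value("title:false:true:tokenized url:false:true:no collection"):
        print(field)
```

Printing the parsed entries makes the implied defaults explicit, for example that a bare "collection" entry is lowercased, stored and tokenized under these rules.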
--------------------------------------------------
nutchwax.filter.query
@@ -274,15 +297,10 @@
<property>
<name>nutchwax.filter.query</name>
<value>
- raw:digest:false
- raw:filename:false
- raw:fileoffset:false
- raw:exacturl:false
group:collection
+ group:site:false
group:type
- field:anchor
field:content
- field:host
field:title
</value>
</property>
@@ -428,3 +446,31 @@
<value>false</value>
</property>
+
+--------------------------------------------------
+searcher.fieldcache
+--------------------------------------------------
+
+NutchWAX contains a patch controlling the use of a "fieldcache" in the
+Nutch searcher. Without this patch Nutch will read the entire set of
+hostnames from the index into an in-memory cache. This cache is then
+consulted when performing de-duplication of results per the
+"hitsPerSite" feature.
+
+For small-to-medium indexes, this can improve performance as the
+de-duplication information is entirely in memory and no disk access is
+required.
+
+However, for large indexes, in the tens of gigabytes in size, reading
+the entire set of hostnames into an in-memory cache can exhaust the
+Java heap. In this case, omitting the cache altogether and just
+reading the values off disk as needed is better.
+
+The NutchWAX patch controls the use of this cache based on this property
+value. If set to false, then the cache is not used at all.
+
+<property>
+ <name>searcher.fieldcache</name>
+ <value>true</value>
+</property>
+
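The trade-off described for searcher.fieldcache can be sketched in a few lines. This is a simplified Python model, not the Nutch searcher code: `dedup_by_site`, `make_site_lookup` and the toy index are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Iterable, List


def dedup_by_site(hits: Iterable[int], site_of: Callable[[int], str],
                  hits_per_site: int) -> List[int]:
    """Keep at most hits_per_site results for each site, preserving rank order."""
    kept, counts = [], defaultdict(int)
    for doc_id in hits:
        site = site_of(doc_id)
        if counts[site] < hits_per_site:
            counts[site] += 1
            kept.append(doc_id)
    return kept


def make_site_lookup(read_site_from_index: Callable[[int], str],
                     doc_ids: Iterable[int],
                     use_fieldcache: bool) -> Callable[[int], str]:
    """Return a doc id -> site lookup.

    With the cache on, every site value is read up front and held in memory
    (fast lookups, memory grows with index size). With it off, each lookup
    goes back to the index (slower, but memory stays flat).
    """
    if use_fieldcache:
        cache = {doc_id: read_site_from_index(doc_id) for doc_id in doc_ids}
        return cache.__getitem__
    return read_site_from_index


if __name__ == "__main__":
    # Toy stand-in for the on-disk index: doc id -> site.
    index = {0: "a.example", 1: "a.example", 2: "b.example", 3: "a.example"}
    ranked_hits = [0, 1, 2, 3]
    for use_cache in (True, False):
        lookup = make_site_lookup(index.__getitem__, index.keys(), use_cache)
        print(use_cache, dedup_by_site(ranked_hits, lookup, hits_per_site=1))
```

Both settings produce the same de-duplicated results; only the memory and disk-access profile differs.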
Modified: tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt 2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/HOWTO-xslt.txt 2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,6 +1,6 @@
HOWTO-xslt.txt
-2008-12-18
+2009-06-25
Aaron Binns
Table of Contents
@@ -128,8 +128,5 @@
You can find sample 'web.xml' and 'search.xsl' files in
- contrib/archive/web
-
-in the compiled Nutch package. Or in this source tree under
-
- src/web
+ ./src/nutch/src/web/jsp/search.xsl
+ ./src/nutch/src/web/web.xml
Modified: tags/nutchwax-0_12_5/archive/HOWTO.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/HOWTO.txt 2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/HOWTO.txt 2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,6 +1,6 @@
HOWTO.txt
-2008-07-28
+2009-06-25
Aaron Binns
Table of Contents
@@ -26,7 +26,7 @@
This HOWTO assumes it is installed in
- /opt/nutchwax-0.12.4
+ /opt/nutchwax-0.12.5
2. ARC/WARC files.
@@ -96,9 +96,9 @@
$ cd ../
$ ls -F1
crawl/
- $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer
+ $ /opt/nutchwax-0.12.5/bin/nutch org.archive.nutchwax.NutchWaxBean computer
-This calls the NutchBean to execute a simple keyword search for
+This calls the NutchWaxBean to execute a simple keyword search for
"computer". Use whatever query term you think appears in the
documents you imported.
Modified: tags/nutchwax-0_12_5/archive/INSTALL.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/INSTALL.txt 2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/INSTALL.txt 2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,6 +1,6 @@
INSTALL.txt
-2009-03-08
+2009-06-25
Aaron Binns
Table of Contents
@@ -62,10 +62,12 @@
SVN: nutch-1.0-dev
------------------
As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
-Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.4 is
-built against is:
+Although the Nutch project released 1.0 in early 2009, there were so
+many changes that NutchWAX 0.12.5 is still built against the pre-1.0
+codebase.
+The specific SVN revision that NutchWAX 0.12.5 is built against is:
+
701524
To checkout this revision of Nutch, use:
@@ -79,14 +81,14 @@
SVN: NutchWAX
-------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.4
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.5
source into Nutch's "contrib" directory.
$ cd contrib
- $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_4/archive
+ $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_5/archive
This will create a sub-directory named "archive" containing the
-NutchWAX 0.12.4 sources.
+NutchWAX 0.12.5 sources.
Build and install
-----------------
@@ -113,7 +115,7 @@
$ cd /opt
$ tar xvfz nutch-1.0-dev.tar.gz
- $ mv nutch-1.0-dev nutchwax-0.12.4
+ $ mv nutch-1.0-dev nutchwax-0.12.5
======================================================================
@@ -126,24 +128,24 @@
Install it simply by untarring it, for example:
$ cd /opt
- $ tar xvfz nutchwax-0.12.4.tar.gz
+ $ tar xvfz nutchwax-0.12.5.tar.gz
======================================================================
Install start-up scripts
======================================================================
-NutchWAX 0.12.4 comes with a Unix init.d script which can be used to
+NutchWAX 0.12.5 comes with a Unix init.d script which can be used to
automatically start the searcher slaves for a multi-node search
configuration.
Assuming you installed NutchWAX as
- /opt/nutchwax-0.12.4
+ /opt/nutchwax-0.12.5
the script is found at
- /opt/nutchwax-0.12.4/contrib/archive/etc/init.d/searcher-slave
+ /opt/nutchwax-0.12.5/contrib/archive/etc/init.d/searcher-slave
This script can be placed in /etc/init.d then added to the list of
startup scripts to run at bootup by using commands appropriate to your
Modified: tags/nutchwax-0_12_5/archive/README.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/README.txt 2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/README.txt 2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,6 +1,6 @@
README.txt
-2009-05-05
+2009-06-25
Aaron Binns
Table of Contents
@@ -13,7 +13,7 @@
Introduction
======================================================================
-Welcome to NutchWAX 0.12.4!
+Welcome to NutchWAX 0.12.5!
NutchWAX is a set of add-ons to Nutch in order to index and search
archived web data.
Modified: tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt 2009-06-25 20:23:20 UTC (rev 2746)
+++ tags/nutchwax-0_12_5/archive/RELEASE-NOTES.txt 2009-06-25 22:00:14 UTC (rev 2747)
@@ -1,9 +1,9 @@
RELEASE-NOTES.TXT
-2009-05-05
+2009-06-25
Aaron Binns
-Release notes for NutchWAX 0.12.4
+Release notes for NutchWAX 0.12.5
For the most recent updates and information on NutchWAX,
please visit the project wiki at:
@@ -15,17 +15,75 @@
Overview
======================================================================
-NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
+NutchWAX 0.12.5 contains numerous enhancements and fixes to 0.12.4
- o Option to omit storing of content during import.
- o Support for per-collection segments in master/slave config.
- o Additional diagnostic/log messages to help troubleshoot common
- deployment mistakes.
- o PageRankDb similar to LinkDb but only keeping inlink counts.
- o Improved paging through results, handling "paging past the end".
+ o Command-line options for NutchWaxBean to configure number of
+ results to emit and how many hits per site to allow.
+ o Change default configuration to use NutchWAX indexing and query
+ filters instead of Nutch-provided ones. This gives more consistent
+ control over indexing and query behavior.
+ o No longer store the unique document key (URL+digest) in a separate
+ field in the index. Since the URL and digest are stored, just use
+ them to synthesize the unique document key as needed.
+
+ o Trimmed down the default configuration of indexing and query
+ filters to only store and index the minimum information needed for
+ typical NutchWAX installations.
+
+
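Synthesizing the document key from the stored URL and digest, rather than storing it in its own field, might look like the following. This is a sketch only: the function names are hypothetical, and the single-space separator is an assumption, since the notes do not state the key format.

```python
from typing import Tuple


def document_key(url: str, digest: str, sep: str = " ") -> str:
    """Build a unique document key from the stored URL and digest.

    Assumption: the separator is a single space; the real format is not
    given in these notes.
    """
    return f"{url}{sep}{digest}"


def split_document_key(key: str, sep: str = " ") -> Tuple[str, str]:
    """Recover (url, digest) by splitting on the last separator."""
    url, _, digest = key.rpartition(sep)
    return url, digest


if __name__ == "__main__":
    key = document_key("http://example.org/", "sha1:ABC")
    print(key)
    print(split_document_key(key))
```

Because both inputs are already stored fields, the key can be rebuilt whenever it is needed at no extra index cost.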
======================================================================
+Configuration changes
+======================================================================
+
+As mentioned in the overview, NutchWAX 0.12.5 has some important
+changes to the default configuration.
+
+Previously, the indexing and query filter configuration utilized a
+combination of filters from Nutch and NutchWAX. This was in line with
+our goal of NutchWAX being a set of add-ons to Nutch.
+
+However, in practice, the mixing of these filters often led to
+confusion since the NutchWAX filters could be configured via
+properties in the Nutch configuration files whereas the Nutch filters
+were hard-coded and less powerful.
+
+Now, all the Nutch indexing filters have been removed and are replaced
+with the single NutchWAX indexing filter. Similarly, all but one
+Nutch query filter are removed, replaced by the configurable NutchWAX
+query filter. We do retain the Nutch 'query-basic' filter as it
+contains the logic for automatically applying a query to multiple
+fields with proportionate weights; something not subsumed by the
+NutchWAX query filter.
+
+
+In addition to removing the Nutch filters, the NutchWAX index and
+query filters are streamlined to only index and store the minimum set
+of metadata fields for typical deployments.
+
+In previous versions of NutchWAX, the indexing filters were configured
+to index and store nearly every piece of metadata available. Although
+this seems desirable, it adds a lot of storage overhead to the index,
+and can hamper run-time query speed just by having unnecessary
+information in the index (more junk for the disk to seek around).
+
+The NutchWAX 0.12.5 configuration omits the typically unnecessary
+metadata fields from the index and only indexes those fields we think
+are needed for typical searches.
+
+For example, while we do store the digest, we do not index it as it's
+very unusual for someone to search for a document with a specific
+SHA-1 digest value. You could decide you want that, in which case you
+can edit the configuration and re-index the data. You would have to
+correspondingly edit the query filter and its configuration to allow
+for searching on that field as well.
+
+We have found that this streamlined indexing configuration yields
+Lucene indexes about 25% smaller than with NutchWAX 0.12.4.
+
+
+======================================================================
Issues
======================================================================
@@ -35,23 +93,16 @@
Issues resolved in this release:
-WAX-27 Sensible output for requesting page of results past the end.
+WAX-45 Add ability to store but not index a field via
+ ConfigurableIndexingFilter.
-WAX-34 Add option to omit storing of content in segment
+WAX-46 Add option to DumpParallelIndex to output only single field.
-WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
- rather than actual inlinks.
+WAX-47 Stop storing document key in "orig" field in index, synthesize
+ it as needed from the "url" and "digest" fields.
-WAX-36 Some additional diagnostics on connecting results to segments
- and snippets would be very helpful.
+WAX-48 Use NutchWAX configurable query filter for site and url fields.
-WAX-37 Per-collection segments not supported in distributed
- master-slave configuration.
+WAX-49 Add "hitsPerSite" option to NutchWaxBean.
-WAX-38 Build omits neessary libraries from .job file.
-
-WAX-39 Write more efficient, specialized segment parse_text merging.
-
-WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher
-
-WAX-42 Add option to continue importing if an arcfile cannot be read.
+WAX-50 Add "num hits to find" option to NutchWaxBean.