From: <bi...@us...> - 2010-03-18 19:26:54
Revision: 2973 http://archive-access.svn.sourceforge.net/archive-access/?rev=2973&view=rev Author: binzino Date: 2010-03-18 19:26:44 +0000 (Thu, 18 Mar 2010) Log Message: ----------- Updated to match NW 0.13. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2010-03-16 21:37:14 UTC (rev 2972) +++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2010-03-18 19:26:44 UTC (rev 2973) @@ -1,6 +1,6 @@ BUILD-NOTES.txt -2008-12-18 +2010-02-13 Aaron Binns ====================================================================== @@ -13,15 +13,15 @@ ====================================================================== -This 0.12.x release of NutchWAX is radically different in source-code +This 0.13 release of NutchWAX is radically different in source-code form compared to the previous release, 0.10. -One of the design goals of 0.12.x was to reduce or even eliminate the +One of the design goals of 0.13 was to reduce or even eliminate the "copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX releases had to copy/paste/edit large chunks of Nutch source code in order to add the NutchWAX features. -Also, the NutchWAX 0.12.x sources and build are designed to one day be +Also, the NutchWAX 0.13 sources and build are designed to one day be added into mainline Nutch as a proper "contrib" package; then eventually be fully integrated into the core Nutch source code. @@ -77,47 +77,7 @@ to the Nutch source and configuration files. ---------------------------------------------------------------------- -The file - /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml - -contains two errors: one where a mimetype is referenced before it is -defined; and a second where a definition has an illegal character. 
- -These errors cause Nutch to not recognize certain mimetypes and -therefore will ignore documents matching those mimetypes. - -There are two fixes: - - 1. Move - - <mime-type type="application/xml"> - <alias type="text/xml" /> - <glob pattern="*.xml" /> - </mime-type> - - definition higher up in the file, before the reference to it. - - 2. Remove - - <mime-type type="application/x-ms-dos-executable"> - <alias type="application/x-dosexec;exe" /> - </mime-type> - - as the ';' character is illegal according to the comments in the - Nutch code. - -You can either apply these patches yourself, or copy an already-patched -copy from: - - /opt/nutchwax-0.12.4/contrib/archive/conf/tika-mimetypes.xml - -to - - /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml - ----------------------------------------------------------------------- - In the file 'conf/nutch-site.xml' we define some properties to over-ride the values in 'conf/nutch-default.xml'. @@ -130,27 +90,37 @@ to - protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax + protocol-http|parse-(text|html|js|pdf)|index-nutchwax|query-(basic|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax In short, we add: - index-nutchwax - query-nutchwax - urlfilter-nutchwax - parse-pdf + parse-pdf + index-nutchwax + query-nutchwax + urlfilter-nutchwax and remove: - urlfilter-regex - urlnormalizer-(pass|regex|basic) + index-basic + index-anchor + query-site + query-url + urlfilter-regex + urlnormalizer-(pass|regex|basic) -The only *required* changes are the additions of the NutchWAX index -and query plugins. The rest are optional, but recommended. The "parse-pdf" plugin is added simply because we have lots of PDFs in our archives and we want to index them. We sometimes remove the "parse-js" plugin if we don't care to index JavaScript files. 
+The Nutch index-basic and index-anchor filters are removed and +replaced with the NutchWAX index-nutchwax filter. Similarly, we +remove the Nutch query-site and query-url filters, replacing them with +the single NutchWAX query-nutchwax filter. By using the configurable +NutchWAX filters for indexing and querying, we get more powerful and +consistent behavior across metadata fields. Note that we do retain +the Nutch query-basic filter however. + We also remove the default Nutch URL filtering and normalizing plugins because we do not need the URLs normalized nor filtered. We trust that the tool that produced the ARC/WARC file will have normalized the @@ -166,6 +136,14 @@ -------------------------------------------------- indexingfilter.order -------------------------------------------------- +If we use the indexing filters as specified in the previous section, +then this property can remain unset. However, if you choose to use +the Nutch index-basic filter, then you *must* specify the order in +which the filters will be used. If you don't then the filters will be +applied in a random order (per Nutch's design) and since one may +over-write the values of another you won't know what values will +result. In that case, you need to specify the order. + Add this property with a value of org.apache.nutch.indexer.basic.BasicIndexingFilter @@ -174,8 +152,6 @@ So that the NutchWAX indexing filter is run after the Nutch basic indexing filter. -A full explanation is given in "README-dedup.txt". 
- -------------------------------------------------- mime.type.magic -------------------------------------------------- @@ -205,37 +181,44 @@ The specifications here are of the form: - src-key:lowercase:store:tokenize:exclusive:dest-key + src-key:lowercase:store:index:exclusive:dest-key where the only required part is the "src-key", the rest will assume the following defaults: lowercase = true store = true - tokenize = false + index = tokenized exclusive = true dest-key = src-key +For the 'index' property, the possible values are: + tokenized + untokenized + no_norms + no + +corresponding to the Lucene options of the same names. + We recommend: <property> <name>nutchwax.filter.index</name> <value> - url:false:true:true - url:false:true:false:true:exacturl - orig:false - digest:false - filename:false - fileoffset:false - collection - date - type - length + title:false:true:tokenized + content:false:false:tokenized + site:false:false:untokenized + + url:false:true:tokenized + digest:false:true:no + + collection:true:true:no_norms + date:true:true:no_norms + type:true:true:no_norms + length:false:true:no </value> </property> -The "url", "orig" and "digest" values are required, the rest are -optional, but strongly recommended. -------------------------------------------------- nutchwax.filter.query @@ -274,15 +257,10 @@ <property> <name>nutchwax.filter.query</name> <value> - raw:digest:false - raw:filename:false - raw:fileoffset:false - raw:exacturl:false group:collection + group:site:false group:type - field:anchor field:content - field:host field:title </value> </property> @@ -428,3 +406,31 @@ <value>false</value> </property> + +-------------------------------------------------- +searcher.fieldcache +-------------------------------------------------- + +NutchWAX contains a patch controlling the use of a "fieldcache" in the +Nutch searcher. Without this patch Nutch will read the entire set of +hostnames from the index into an in-memory cache. 
This cache is then +consulted when performing de-duplication of results per the +"hitsPerSite" feature. + +For small-to-medium indexes, this can improve performance as the +de-duplication information is entirely in memory and no disk access is +required. + +However, for large indexes, in the tens of gigabytes in size, reading +the entire set of hostnames into an in-memory cache can exhaust the +Java heap. In this case, omitting the cache all together and just +reading the values off disk as needed is better. + +The NutchWAX patch controls the use of this cache based on this property +value. If set to false, then the cache is not used at all. + +<property> + <name>searcher.fieldcache</name> + <value>false</value> +</property> +
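The "nutchwax.filter.index" entries described in the diff above follow the form src-key:lowercase:store:index:exclusive:dest-key, where only "src-key" is required. As an illustration of how the documented defaults fill in the omitted parts, here is a minimal sketch in Python. It is not the NutchWAX parser (which is Java); the names FieldSpec and parse_index_spec are hypothetical, and treating an empty part as "use the default" is an assumption, not something the notes state.

```python
from typing import NamedTuple

# Values the notes list for the 'index' part of a specification.
INDEX_MODES = {"tokenized", "untokenized", "no_norms", "no"}


class FieldSpec(NamedTuple):
    """One parsed entry (hypothetical type, for illustration only)."""
    src_key: str
    lowercase: bool
    store: bool
    index: str
    exclusive: bool
    dest_key: str


def _flag(text: str, default: bool) -> bool:
    # Assumption: an omitted or empty part falls back to its default.
    return default if text == "" else text.lower() == "true"


def parse_index_spec(line: str) -> FieldSpec:
    """Parse 'src-key:lowercase:store:index:exclusive:dest-key'."""
    parts = line.strip().split(":")
    if not parts[0]:
        raise ValueError("src-key is required")
    if len(parts) > 6:
        raise ValueError("too many parts in: %r" % line)
    # Pad the omitted trailing parts so the defaults below apply.
    parts += [""] * (6 - len(parts))
    src, lowercase, store, index, exclusive, dest = parts
    index = index or "tokenized"
    if index not in INDEX_MODES:
        raise ValueError("unknown index mode: %r" % index)
    return FieldSpec(
        src_key=src,
        lowercase=_flag(lowercase, True),
        store=_flag(store, True),
        index=index,
        exclusive=_flag(exclusive, True),
        dest_key=dest or src,
    )


if __name__ == "__main__":
    # Entries taken from the recommended value in the notes.
    for entry in ("title:false:true:tokenized", "digest:false:true:no"):
        print(parse_index_spec(entry))
```

Under these assumptions, an entry such as "digest:false:true:no" yields a field that is stored but not indexed, with "exclusive" and "dest-key" taking their defaults (true and "digest").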