From: <bi...@us...> - 2009-07-09 17:34:59
Revision: 2754
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2754&view=rev
Author:   binzino
Date:     2009-07-09 17:34:57 +0000 (Thu, 09 Jul 2009)

Log Message:
-----------
Updated for 0.12.6 release.

Modified Paths:
--------------
    tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt
    tags/nutchwax-0_12_6/archive/HOWTO.txt
    tags/nutchwax-0_12_6/archive/INSTALL.txt
    tags/nutchwax-0_12_6/archive/README.txt
    tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt

Modified: tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt 2009-07-09 00:50:58 UTC (rev 2753)
+++ tags/nutchwax-0_12_6/archive/BUILD-NOTES.txt 2009-07-09 17:34:57 UTC (rev 2754)
@@ -1,6 +1,6 @@
 BUILD-NOTES.txt
-2009-06-25
+2009-07-09
 Aaron Binns
 ======================================================================
@@ -79,7 +79,7 @@
 ----------------------------------------------------------------------
 The file
-  /opt/nutchwax-0.12.5/conf/tika-mimetypes.xml
+  /opt/nutchwax-0.12.6/conf/tika-mimetypes.xml
 contains two errors: one where a mimetype is referenced before it is
 defined; and a second where a definition has an illegal character.
@@ -110,11 +110,11 @@
 You can either apply these patches yourself, or copy an
 already-patched copy from:
-  /opt/nutchwax-0.12.5/contrib/archive/conf/tika-mimetypes.xml
+  /opt/nutchwax-0.12.6/contrib/archive/conf/tika-mimetypes.xml
 to
-  /opt/nutchwax-0.12.5/conf/tika-mimetypes.xml
+  /opt/nutchwax-0.12.6/conf/tika-mimetypes.xml
 ----------------------------------------------------------------------

Modified: tags/nutchwax-0_12_6/archive/HOWTO.txt
===================================================================
--- tags/nutchwax-0_12_6/archive/HOWTO.txt 2009-07-09 00:50:58 UTC (rev 2753)
+++ tags/nutchwax-0_12_6/archive/HOWTO.txt 2009-07-09 17:34:57 UTC (rev 2754)
@@ -1,6 +1,6 @@
 HOWTO.txt
-2009-06-25
+2009-07-09
 Aaron Binns
 Table of Contents
@@ -26,7 +26,7 @@
 This HOWTO assumes it is installed in
-  /opt/nutchwax-0.12.5
+  /opt/nutchwax-0.12.6
 2. ARC/WARC files.
@@ -68,10 +68,10 @@
   $ mkdir crawl
   $ cd crawl
-  $ /opt/nutchwax-0.12.5/bin/nutchwax import ../manifest
-  $ /opt/nutchwax-0.12.5/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutchwax-0.12.5/bin/nutch invertlinks linkdb -dir segments
-  $ /opt/nutchwax-0.12.5/bin/nutch index indexes crawldb linkdb segments/*
+  $ /opt/nutchwax-0.12.6/bin/nutchwax import ../manifest
+  $ /opt/nutchwax-0.12.6/bin/nutch updatedb crawldb -dir segments
+  $ /opt/nutchwax-0.12.6/bin/nutch invertlinks linkdb -dir segments
+  $ /opt/nutchwax-0.12.6/bin/nutch index indexes crawldb linkdb segments/*
   $ ls -F1
   crawldb/
   indexes/
@@ -96,7 +96,7 @@
   $ cd ../
   $ ls -F1
   crawl/
-  $ /opt/nutchwax-0.12.5/bin/nutch org.archive.nutchwax.NutchWaxBean computer
+  $ /opt/nutchwax-0.12.6/bin/nutchwax search computer
 This calls the NutchWaxBean to execute a simple keyword search for
 "computer".
 Use whatever query term you think appears in the
@@ -109,7 +109,7 @@
 The Nutch(WAX) web application is bundled with NutchWAX as
-  /opt/nutchwax-0.12.5/nutch-1.0-dev.war
+  /opt/nutchwax-0.12.6/nutch-1.0-dev.war
 Simply deploy that web application in the same fashion as with Nutch.

Modified: tags/nutchwax-0_12_6/archive/INSTALL.txt
===================================================================
--- tags/nutchwax-0_12_6/archive/INSTALL.txt 2009-07-09 00:50:58 UTC (rev 2753)
+++ tags/nutchwax-0_12_6/archive/INSTALL.txt 2009-07-09 17:34:57 UTC (rev 2754)
@@ -1,6 +1,6 @@
 INSTALL.txt
-2009-06-25
+2009-07-09
 Aaron Binns
 Table of Contents
@@ -63,10 +63,10 @@
 ------------------
 As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
 Although the Nutch project released 1.0 in early 2009, there were so
-many changes that NutchWAX 0.12.5 is still built against pre-1.0
+many changes that NutchWAX 0.12.6 is still built against pre-1.0
 codebase.
-The specific SVN revision that NutchWAX 0.12.5 is built against is:
+The specific SVN revision that NutchWAX 0.12.6 is built against is:
   701524
@@ -81,14 +81,14 @@
 SVN: NutchWAX
 -------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.5
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.6
 source into Nutch's "contrib" directory.
   $ cd contrib
-  $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_5/archive
+  $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_6/archive
 This will create a sub-directory named "archive" containing the
-NutchWAX 0.12.5 sources.
+NutchWAX 0.12.6 sources.
 Build and install
 -----------------
@@ -115,7 +115,7 @@
   $ cd /opt
   $ tar xvfz nutch-1.0-dev.tar.gz
-  $ mv nutch-1.0-dev nutchwax-0.12.5
+  $ mv nutch-1.0-dev nutchwax-0.12.6
 ======================================================================
@@ -128,24 +128,24 @@
 Install it simply by untarring it, for example:
   $ cd /opt
-  $ tar xvfz nutchwax-0.12.5.tar.gz
+  $ tar xvfz nutchwax-0.12.6.tar.gz
 ======================================================================
 Install start-up scripts
 ======================================================================
-NutchWAX 0.12.5 comes with a Unix init.d script which can be used to
+NutchWAX 0.12.6 comes with a Unix init.d script which can be used to
 automatically start the searcher slaves for a multi-node search
 configuration.
 Assuming you installed NutchWAX as
-  /opt/nutchwax-0.12.5
+  /opt/nutchwax-0.12.6
 the script is found at
-  /opt/nutchwax-0.12.5/contrib/archive/etc/init.d/searcher-slave
+  /opt/nutchwax-0.12.6/contrib/archive/etc/init.d/searcher-slave
 This script can be placed in /etc/init.d then added to the list of
 startup scripts to run at bootup by using commands appropriate to your

Modified: tags/nutchwax-0_12_6/archive/README.txt
===================================================================
--- tags/nutchwax-0_12_6/archive/README.txt 2009-07-09 00:50:58 UTC (rev 2753)
+++ tags/nutchwax-0_12_6/archive/README.txt 2009-07-09 17:34:57 UTC (rev 2754)
@@ -1,6 +1,6 @@
 README.txt
-2009-06-25
+2009-07-09
 Aaron Binns
 Table of Contents
@@ -13,7 +13,7 @@
 Introduction
 ======================================================================
-Welcome to NutchWAX 0.12.5!
+Welcome to NutchWAX 0.12.6!
 NutchWAX is a set of add-ons to Nutch in order to index and search
 archived web data.
Modified: tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt 2009-07-09 00:50:58 UTC (rev 2753)
+++ tags/nutchwax-0_12_6/archive/RELEASE-NOTES.txt 2009-07-09 17:34:57 UTC (rev 2754)
@@ -1,9 +1,9 @@
 RELEASE-NOTES.TXT
-2009-06-25
+2009-07-09
 Aaron Binns
-Release notes for NutchWAX 0.12.5
+Release notes for NutchWAX 0.12.6
 For the most recent updates and information on NutchWAX,
 please visit the project wiki at:
@@ -15,74 +15,44 @@
 Overview
 ======================================================================
-NutchWAX 0.12.5 contains numerous enhancements and fixes to 0.12.4
+NutchWAX 0.12.6 contains a few convenient enhancements to 0.12.5
-  o Command-line options for NutchWaxBean to configure number of
-    results to emit and how many hits per site to allow.
+  o Addition of 'search' and 'merge' commands to the 'nutchwax'
+    command-line driver. Now one can do
-  o Change default configuration to use NutchWAX indexing and query
-    filters instead of Nutch-provided ones. This give more consistent
-    control over indexing and query behavior.
+      nutchwax search foo
-  o No longer store the unique document key (URL+digest) in a separate
-    field in the index. Since the URL and digest are stored, just use
-    them to synthesize the unique document key as needed.
+    instead of
-  o Trimmed down the default configuration of indexing and query
-    filters to only store and index the minimum information needed for
-    typical NutchWAX installations.
+      nutch org.archive.nutchwax.NutchWaxBean foo
+    Similarly, the new NutchWAX index merging, which supports
+    parallel indexes, can be invoked via
-======================================================================
-Configuration changes
-======================================================================
+      nutchwax merge output-index input-index...
-As mentioned in the overview, NutchWAX 0.12.5 has some important
-changes to the default configuration.
+  o Merging of parallel indexes into a single index.
-Previously, the indexing and query filter configuration utilized a
-combination of filters from Nutch and NutchWAX. This was in line with
-our goal of NutchWAX being a set of add-ons to Nutch.
+    NutchWAX has a copy/paste/enhanced version of the Nutch index
+    merger that now supports parallel indexes. This allows parallel
+    indexes to be merged into a single index. To use this feature,
+    add the "-p" option to the NutchWAX 'merge' command indicating the
+    input index directories contain parallel index sub-dirs.
-However, in practice, the mixing of these filters often lead to
-confusion since the NutchWAX filters could be configured via
-properties in the Nutch configuration files whereas the Nutch filters
-were hard-coded and less powerful.
+      nutchwax merge -p output-index input-pindexes...
-Now, all the Nutch indexing filters have been removed and are replaced
-with the single NutchWAX indexing filter. Similarly, all but one
-Nutch query filter are removed, replaced by the configurable NutchWAX
-query filter. We do retain the Nutch 'query-basic' filter as it
-contains the logic for automatically applying a query to multiple
-fields with proportionate weights; something not subsumed by the
-NutchWAX query filter.
+  o Option to specify the directory where the index(es) and segments
+    live when doing a command-line search.
+    Previously the directory was obtained from the nutch-default.xml
+    configuration file. This is inconvenient when testing different
+    indexes as one would have to edit the config file each time to
+    specify a different index to search.
-In addition to removing the Nutch filters, the NutchWAX index and
-query filters are streamlined to only index and store the minimum set
-of metadata fields for typical deployments.
+    Now, the directory can be specified on the command line:
-In previous versions of NutchWAX, the indexing filters were configured
-to index and store nearly every piece of metadata available. Although
-this seems desirable, it adds a lot of storage overhead to the index,
-and can hamper run-time query speed just by having unnecessary
-information in the index (more junk for the disk to seek around).
+      nutchwax search -d <dir> <query>
-The NutchWAX 0.12.5 configuration omits the typically unnecessary
-metadata fields from the index and only indexes those fields we think
-are needed for typical searches.
-
-For example, while we do store the digest, we do not index it as it's
-very unusual for someone to search for a document with a specific
-SHA-1 digest value. You could decide you want that, in which case you
-can edit the configuration and re-index the data. You would have to
-correspondingly edit the query filter and its configuration to allow
-for searching on that field as well.
-
-We have found that this streamlined indexing configuration yields
-Lucene indexes about 25% smaller than with NutchWAX 0.12.4.
-
-
 ======================================================================
 Issues
 ======================================================================
@@ -93,16 +63,9 @@
 Issues resolved in this release:
-WAX-45 Add ability to store but not index a field via
-       ConfigurableIndexingFilter.
+WAX-51 Enhance index merging to combine parallel indexes.
-WAX-46 Add option to DumpParallelIndex to output only single field.
+WAX-52 Add option to NutchWaxBean to specify directory where
+       index+segments are to be found.
-WAX-47 Stop storing document key in "orig" field in index, synthesize
-       it as needed from the "url" and "digest" fields.
-
-WAX-48 Use NutchWAX configurable query filter for site and url fields.
-
-WAX-49 Add "hitsPerSite" option to NutchWaxBean.
-
-WAX-50 Add "num hits to find" option to NutchWaxBean.
+WAX-53 IndexMerging parallel indexes fails when index is empty.