From: <bi...@us...> - 2009-03-08 21:43:45
Revision: 2693
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2693&view=rev
Author:   binzino
Date:     2009-03-08 21:43:33 +0000 (Sun, 08 Mar 2009)

Log Message:
-----------
Updated documentation for 0.12.4 release.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
    trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
    trunk/archive-access/projects/nutchwax/archive/README.txt
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Modified: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -79,7 +79,7 @@
 ----------------------------------------------------------------------
 The file

-  /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+  /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml

 contains two errors: one where a mimetype is referenced before it is
 defined; and a second where a definition has an illegal character.
@@ -110,11 +110,11 @@
 You can either apply these patches yourself, or copy an
 already-patched copy from:

-  /opt/nutchwax-0.12.3/contrib/archive/conf/tika-mimetypes.xml
+  /opt/nutchwax-0.12.4/contrib/archive/conf/tika-mimetypes.xml

 to

-  /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml
+  /opt/nutchwax-0.12.4/conf/tika-mimetypes.xml


 ----------------------------------------------------------------------
@@ -166,7 +166,6 @@
 --------------------------------------------------
 indexingfilter.order
 --------------------------------------------------
-
 Add this property with a value of

   org.apache.nutch.indexer.basic.BasicIndexingFilter
@@ -300,7 +299,6 @@
 --------------------------------------------------
 nutchwax.urlfilter.wayback.canonicalizer
 --------------------------------------------------
-
 For CDX-based de-duplication, the same URL canonicalization algorithm
 must be used here as was used to generate the CDX files.

@@ -390,3 +388,43 @@
 capacity of the computers performing the import.  Something in the
 1-4MB range is typical.

+--------------------------------------------------
+nutchwax.FetchedSegments.perCollection
+--------------------------------------------------
+Enable per-collection segment sub-dirs, e.g.
+
+  segments/<collectionId>/segment1
+                         /segment2
+                         ...
+
+Default value: false
+
+For example,
+
+  <property>
+    <name>nutchwax.FetchedSegments.perCollection</name>
+    <value>true</value>
+  </property>
+
+--------------------------------------------------
+nutchwax.import.content.store
+--------------------------------------------------
+Whether or not we store the full content in the segment's "content"
+directory.  Most NutchWAX users are also using Wayback to serve the
+archived content, so there's no need for NutchWAX to keep a "cached"
+copy as well.
+
+Setting to 'true' yields the same behavior as in previous versions of
+NutchWAX, and as in Nutch.  The content is stored in the segment's
+"content" directory.
+
+Setting to 'false' results in an empty "content" directory in the
+segment.  The content is not stored.
+
+Default value is 'false'.
+
+  <property>
+    <name>nutchwax.import.store.content</name>
+    <value>false</value>
+  </property>
+

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -26,7 +26,7 @@

   This HOWTO assumes it is installed in

-    /opt/nutchwax-0.12.3
+    /opt/nutchwax-0.12.4

 2. ARC/WARC files.

@@ -68,10 +68,10 @@

   $ mkdir crawl
   $ cd crawl
-  $ /opt/nutchwax-0.12.3/bin/nutchwax import ../manifest
-  $ /opt/nutchwax-0.12.3/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutchwax-0.12.3/bin/nutch invertlinks linkdb -dir segments
-  $ /opt/nutchwax-0.12.3/bin/nutch index indexes crawldb linkdb segments/*
+  $ /opt/nutchwax-0.12.4/bin/nutchwax import ../manifest
+  $ /opt/nutchwax-0.12.4/bin/nutch updatedb crawldb -dir segments
+  $ /opt/nutchwax-0.12.4/bin/nutch invertlinks linkdb -dir segments
+  $ /opt/nutchwax-0.12.4/bin/nutch index indexes crawldb linkdb segments/*
   $ ls -F1
   crawldb/
   indexes/
@@ -96,7 +96,7 @@
   $ cd ../
   $ ls -F1
   crawl/
-  $ /opt/nutchwax-0.12.3/bin/nutch org.apache.nutch.searcher.NutchBean computer
+  $ /opt/nutchwax-0.12.4/bin/nutch org.apache.nutch.searcher.NutchBean computer

 This calls the NutchBean to execute a simple keyword search for
 "computer".  Use whatever query term you think appears in the
@@ -109,7 +109,7 @@

 The Nutch(WAX) web application is bundled with NutchWAX as

-  /opt/nutchwax-0.12.3/nutch-1.0-dev.war
+  /opt/nutchwax-0.12.4/nutch-1.0-dev.war

 Simply deploy that web application in the same fashion as with Nutch.

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -1,6 +1,6 @@

 INSTALL.txt
-2008-12-18
+2009-03-08
 Aaron Binns

 Table of Contents
@@ -10,6 +10,7 @@
     - SVN: NutchWAX
     - Build and Install
   o Install binary package
+  o Install start-up scripts


 ======================================================================
@@ -62,7 +63,7 @@
 ------------------
 As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
 Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12.3 is
+Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12.4 is
 built against is:

   701524
@@ -78,14 +79,14 @@

 SVN: NutchWAX
 -------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into
-Nutch's "contrib" directory.
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.4
+source into Nutch's "contrib" directory.

   $ cd contrib
-  $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nutchwax/archive
+  $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_4/archive

 This will create a sub-directory named "archive" containing the
-NutchWAX sources.
+NutchWAX 0.12.4 sources.

 Build and install
 -----------------
@@ -112,7 +113,7 @@

   $ cd /opt
   $ tar xvfz nutch-1.0-dev.tar.gz
-  $ mv nutch-1.0-dev nutchwax-0.12.3
+  $ mv nutch-1.0-dev nutchwax-0.12.4


 ======================================================================
@@ -125,5 +126,50 @@
 Install it simply by untarring it, for example:

   $ cd /opt
-  $ tar xvfz nutchwax-0.12.3.tar.gz
+  $ tar xvfz nutchwax-0.12.4.tar.gz
+
+======================================================================
+Install start-up scripts
+======================================================================
+
+NutchWAX 0.12.4 comes with a Unix init.d script which can be used to
+automatically start the searcher slaves for a multi-node search
+configuration.
+
+Assuming you installed NutchWAX as
+
+  /opt/nutchwax-0.12.4
+
+the script is found at
+
+  /opt/nutchwax-0.12.4/contrib/archive/etc/init.d/searcher-slave
+
+This script can be placed in /etc/init.d then added to the list of
+startup scripts to run at bootup by using commands appropriate to your
+Linux distribution.
+
+You must edit a few of the environment variables defined in the
+'searcher-slave' specifying where NutchWAX is installed and where the
+index(s) are deployed.  In 'searcher-slave' you will find the:
+
+  export NUTCH_HOME=TODO
+  export DEPLOYMENT_DIR=TODO
+
+edit those appropriately for your system.
+
+
+The "master" in the multi-node search deployment is the NutchWAX
+webapp running in a webapp server, such as Tomcat or Jetty.
+
+Jetty comes with a start/stop script appropriate for use as an init.d
+script, similar to the 'searcher-slave' script described above.  If you
+use Jetty, create a symlink
+
+  /etc/init.d/jetty.sh -> /opt/jetty/bin/jetty.sh
+
+Then add this script to the list of startup scripts to run at bootup
+by using commands appropriate to your Linux distribution.
+
+Follow the instructions from Jetty on the deployment of the NutchWAX
+webapp (nutch-1.0-dev.war) in the Jetty web application server.

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -1,6 +1,6 @@

 README.txt
-2008-12-18
+2008-03-08
 Aaron Binns

 Table of Contents
@@ -13,7 +13,7 @@
 Introduction
 ======================================================================

-Welcome to NutchWAX 0.12.3!
+Welcome to NutchWAX 0.12.4!

 NutchWAX is a set of add-ons to Nutch in order to index and search
 archived web data.

Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2009-03-08 20:44:25 UTC (rev 2692)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2009-03-08 21:43:33 UTC (rev 2693)
@@ -1,9 +1,9 @@

 RELEASE-NOTES.TXT
-2008-12-18
+2008-03-08
 Aaron Binns

-Release notes for NutchWAX 0.12.3
+Release notes for NutchWAX 0.12.4

 For the most recent updates and information on NutchWAX,
 please visit the project wiki at:
@@ -15,61 +15,44 @@
 Overview
 ======================================================================

-NutchWAX 0.12.3 contains numerous enhancements and fixes to 0.12.2
+NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3

-  o PageRank calculation and scoring
-  o Enhanced OpenSearchServlet
-  o Improved XSLT sample for OpenSearch
-  o System init.d script for searcher slaves
-  o Enhanced searcher slave which supports NutchWAX extensions
+  o Option to omit storing of content during import.
+  o Support for per-collection segments in master/slave config.
+  o Additional diagnostic/log messages to help troubleshoot common
+    deployment mistakes.
+  o PageRankDb similar to LinkDb but only keeping inlink counts.
+  o Improved paging through results, handling "paging past the end".

-One of the major changes to 0.12.3 is not a feature, enhancement or
-bug-fix, but the way the NutchWAX source is "integrated" into the
-Nutch source.
+======================================================================
+Issues
+======================================================================

-Yes, the NutchWAX source is still kept in the contrib/archive
-sub-directory, but when you invoke a build command from the
-NutchWAX directory, such as
+For an up-to-date list of NutchWAX issues:

-  $ cd nutch/contrib/archive
-  $ ant tar
+  http://webteam.archive.org/jira/browse/WAX

-Many files from the NutchWAX source tree are copied directly into the
-Nutch source tree before the build process begins.
+Issues resolved in this release:

-The reason for this is to make NutchWAX easier to use.
+WAX-27 Sensible output for requesting page of results past the end.

-In previous versions of NutchWAX, once 'ant' build command was
-finished, the operator had to manually patch configuration files in
-the Nutch directory.  Upon a subsequent build, the files would be
-over-written by Nutch's and would have to be patched again.
+WAX-34 Add option to omit storing of content in segment

-It was a major hassle and complication.
+WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
+       rather than actual inlinks.

-Another impetus for copying files into the Nutch source was to patch
-bugs and make enhancements in the Nutch Java code which couldn't be
-effectively done keeping the sources separate.  When an 'ant' build
-command is run a few Java files are copied from the NutchWAX source
-tree into the Nutch source tree.
+WAX-36 Some additional diagnostics on connecting results to segments
+       and snippets would be very helpful.

-In release 0.12.3, the NutchWAX build file: 'build.xml' handles all of
-this.  Simply execute your build commands from 'contrib/archive' as
-instructed in the HOWTO and no longer worry about patching
-configuration files.  If you wish to alter the NutchWAX configuration
-file, make those changes in the NutchWAX source tree.
+WAX-37 Per-collection segments not supported in distributed
+       master-slave configuration.

+WAX-38 Build omits necessary libraries from .job file.

-======================================================================
-Issues
-======================================================================
+WAX-39 Write more efficient, specialized segment parse_text merging.

-For an up-to-date list of NutchWAX issues:

-  http://webteam.archive.org/jira/browse/WAX

-Issues resolved in this release:

-WAX-26
-  Add XML elements containing all search URL params for self-link
-  generation
+