From: <bi...@us...> - 2010-01-13 00:26:52
Revision: 2947
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2947&view=rev
Author:   binzino
Date:     2010-01-13 00:26:44 +0000 (Wed, 13 Jan 2010)

Log Message:
-----------
Updated for 0.12.9 release.

Modified Paths:
--------------
    tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt
    tags/nutchwax-0_12_9/archive/HOWTO.txt
    tags/nutchwax-0_12_9/archive/INSTALL.txt
    tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt

Modified: tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt	2010-01-13 00:14:06 UTC (rev 2946)
+++ tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt	2010-01-13 00:26:44 UTC (rev 2947)
@@ -1,6 +1,6 @@
 
 BUILD-NOTES.txt
-2009-09-18
+2010-01-13
 Aaron Binns
 
 ======================================================================
@@ -79,7 +79,7 @@
 ----------------------------------------------------------------------
 
 The file
-  /opt/nutchwax-0.12.8/conf/tika-mimetypes.xml
+  /opt/nutchwax-0.12.9/conf/tika-mimetypes.xml
 
 contains two errors: one where a mimetype is referenced before it is
 defined; and a second where a definition has an illegal character.
@@ -110,11 +110,11 @@
 
 You can either apply these patches yourself, or copy an already-patched
 copy from:
 
-  /opt/nutchwax-0.12.8/contrib/archive/conf/tika-mimetypes.xml
+  /opt/nutchwax-0.12.9/contrib/archive/conf/tika-mimetypes.xml
 
 to
 
-  /opt/nutchwax-0.12.8/conf/tika-mimetypes.xml
+  /opt/nutchwax-0.12.9/conf/tika-mimetypes.xml
 
 ----------------------------------------------------------------------

Modified: tags/nutchwax-0_12_9/archive/HOWTO.txt
===================================================================
--- tags/nutchwax-0_12_9/archive/HOWTO.txt	2010-01-13 00:14:06 UTC (rev 2946)
+++ tags/nutchwax-0_12_9/archive/HOWTO.txt	2010-01-13 00:26:44 UTC (rev 2947)
@@ -1,6 +1,6 @@
 
 HOWTO.txt
-2009-09-18
+2010-01-13
 Aaron Binns
 
 Table of Contents
@@ -26,7 +26,7 @@
 
 This HOWTO assumes it is installed in
 
-  /opt/nutchwax-0.12.8
+  /opt/nutchwax-0.12.9
 
 
 2. ARC/WARC files.
@@ -68,14 +68,10 @@
 
   $ mkdir crawl
   $ cd crawl
-  $ /opt/nutchwax-0.12.8/bin/nutchwax import ../manifest
-  $ /opt/nutchwax-0.12.8/bin/nutch updatedb crawldb -dir segments
-  $ /opt/nutchwax-0.12.8/bin/nutch invertlinks linkdb -dir segments
-  $ /opt/nutchwax-0.12.8/bin/nutch index indexes crawldb linkdb segments/*
+  $ /opt/nutchwax-0.12.9/bin/nutchwax import ../manifest
+  $ /opt/nutchwax-0.12.9/bin/nutchwax index indexes segments/*
   $ ls -F1
-  crawldb/
   indexes/
-  linkdb/
   segments/
 
 To those already familiar with Nutch, these steps should be quite
@@ -96,7 +92,7 @@
   $ cd ../
   $ ls -F1
   crawl/
-  $ /opt/nutchwax-0.12.8/bin/nutchwax search computer
+  $ /opt/nutchwax-0.12.9/bin/nutchwax search computer
 
 This calls the NutchWaxBean to execute a simple keyword search for
 "computer".  Use whatever query term you think appears in the
@@ -109,7 +105,7 @@
 
 The Nutch(WAX) web application is bundled with NutchWAX as
 
-  /opt/nutchwax-0.12.8/nutch-1.0-dev.war
+  /opt/nutchwax-0.12.9/nutch-1.0-dev.war
 
 Simply deploy that web application in the same fashion as with Nutch.
 
Modified: tags/nutchwax-0_12_9/archive/INSTALL.txt
===================================================================
--- tags/nutchwax-0_12_9/archive/INSTALL.txt	2010-01-13 00:14:06 UTC (rev 2946)
+++ tags/nutchwax-0_12_9/archive/INSTALL.txt	2010-01-13 00:26:44 UTC (rev 2947)
@@ -1,6 +1,6 @@
 
 INSTALL.txt
-2009-09-18
+2010-01-13
 Aaron Binns
 
 Table of Contents
@@ -63,10 +63,10 @@
 ------------------
 As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
 Although the Nutch project released 1.0 in early 2009, there were so
-many changes that NutchWAX 0.12.8 is still built against pre-1.0
+many changes that NutchWAX 0.12.9 is still built against pre-1.0
 codebase.
 
-The specific SVN revision that NutchWAX 0.12.8 is built against is:
+The specific SVN revision that NutchWAX 0.12.9 is built against is:
 
   701524
 
@@ -81,14 +81,14 @@
 
 SVN: NutchWAX
 -------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.8
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.9
 source into Nutch's "contrib" directory.
 
   $ cd contrib
   $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_8/archive
 
 This will create a sub-directory named "archive" containing the
-NutchWAX 0.12.8 sources.
+NutchWAX 0.12.9 sources.
 
 Build and install
 -----------------
@@ -115,7 +115,7 @@
 
   $ cd /opt
   $ tar xvfz nutch-1.0-dev.tar.gz
-  $ mv nutch-1.0-dev nutchwax-0.12.8
+  $ mv nutch-1.0-dev nutchwax-0.12.9
 
 
 ======================================================================
@@ -128,24 +128,24 @@
 Install it simply by untarring it, for example:
 
   $ cd /opt
-  $ tar xvfz nutchwax-0.12.8.tar.gz
+  $ tar xvfz nutchwax-0.12.9.tar.gz
 
 
 ======================================================================
 Install start-up scripts
 ======================================================================
 
-NutchWAX 0.12.8 comes with a Unix init.d script which can be used to
+NutchWAX 0.12.9 comes with a Unix init.d script which can be used to
 automatically start the searcher slaves for a multi-node search
 configuration.
 
 Assuming you installed NutchWAX as
 
-  /opt/nutchwax-0.12.8
+  /opt/nutchwax-0.12.9
 
 the script is found at
 
-  /opt/nutchwax-0.12.8/contrib/archive/etc/init.d/searcher-slave
+  /opt/nutchwax-0.12.9/contrib/archive/etc/init.d/searcher-slave
 
 This script can be placed in /etc/init.d then added to the list of
 startup scripts to run at bootup by using commands appropriate to your

Modified: tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt	2010-01-13 00:14:06 UTC (rev 2946)
+++ tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt	2010-01-13 00:26:44 UTC (rev 2947)
@@ -1,70 +1,50 @@
 
 RELEASE-NOTES.TXT
-2009-09-18
+2010-01-13
 Aaron Binns
 
-Release notes for NutchWAX 0.12.8
+Release notes for NutchWAX 0.12.9
 
 For the most recent updates and information on NutchWAX,
 please visit the project wiki at:
 
-  http://webteam.archive.org/confluence/display/search/NutchWAX
+  http://webarchive.jira.com/wiki/display/search/NutchWAX
 
-
 ======================================================================
 Overview
 ======================================================================
 
-The main enhancement in NutchWAX 0.12.8 is the ability to configure
-HTTP headers to support caching.
+The main enhancement in NutchWAX 0.12.9 is the ability to search
+indexes created with NutchWAX 0.10.
 
-The Archive is starting to use Squid to cache the HTTP responses from
-NutchWAX and some explicit HTTP response headers were needed to enable
-this.  Rather than relying on the servlet container (Tomcat/Jetty) to
-add the response headers, we added a servlet filter to NutchWAX.
+In the segments directory, create a "versions" file and in it
+list the names of the segments and their version, e.g.
 
-Right now the filter is very basic, in the web.xml file we now have
+  foo-segment 10
+  bar-segment 12
 
-  <filter>
-    <filter-name>Cache Settings</filter-name>
-    <filter-class>org.archive.nutchwax.CacheSettingsFilter</filter-class>
-    <init-param>
-      <param-name>max-age</param-name>
-      <param-value>259200</param-value> <!-- 72 hours (in seconds) -->
-    </init-param>
-  </filter>
+where the version number is either 10 or 12.  If a segment is not
+listed in the "versions" file, it is assumed to be version 12.
 
-  <filter-mapping>
-    <filter-name>Cache Settings</filter-name>
-    <servlet-name>OpenSearch</servlet-name>
-  </filter-mapping>
+Also, a minor, but convenient enhancement is to no longer require the
+crawldb and linkdb to be present at index time.  Neither one of these
+are actually used for indexing and the fact that they were required to
+be given to the index step was a legecy of Nutch.  Now, there is a
+NutchWAX 'index' command which only requires the segment(s) to be
+present.
 
-which configures the filter to add a 'max-age' header with a 72 hour
-limit.  This filter is then applied to all instances of the OpenSearch
-servlet.
-
-This allows browsers to cache the OpenSearch response for up to 72
-hours.  It also enables any proxies between the browser and server to
-cache the response as well.  With the addition of Squid into our
-deployment, we let Squid serve cached responses to repeat queries.
-
-Since our deployment updates every 4 days, a 72-hour expiration works
-well.
-
 ======================================================================
 Issues
 ======================================================================
 
 For an up-to-date list of NutchWAX issues:
 
-  http://webteam.archive.org/jira/browse/WAX
+  http://webarchive.jira.com/browse/WAX
 
 Issues resolved in this release:
 
-WAX-61 Change mime-type of OpenSearch XML response from text/xml to
-       application/xml.
+WAX-66 Index documents without crawldb nor linkdb.
 
-WAX-62 Add ability to configure HTTP headers to support caching.
+WAX-67 Nutch OpenOffice parser does not pass along metadata.
 
-WAX-63 LengthNormUpdater returning error code if no fields in index
-       have norms is inconvenient.
+WAX-68 Compatibility with {index+segment}s created by NutchWAX 0.10.
 
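
The "versions" file introduced in the RELEASE-NOTES.txt change above (one segment name and one version per line, with unlisted segments treated as version 12) can be illustrated with a small sketch. This is not NutchWAX's own code, which is Java and is not shown in this commit. It assumes the two fields are separated by whitespace, as in the example from the release notes, and the names parse_versions and segment_version are hypothetical.

```python
# Illustrative sketch only; not taken from NutchWAX.
# Assumption: each non-blank line is "<segment-name> <version>",
# separated by whitespace, as in the release-notes example.

DEFAULT_VERSION = 12        # release notes: unlisted segments are version 12
VALID_VERSIONS = {10, 12}   # release notes: version is either 10 or 12


def parse_versions(text):
    """Return a dict mapping segment name to version number."""
    versions = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError("expected '<segment> <version>': %r" % line)
        name, version = fields[0], int(fields[1])
        if version not in VALID_VERSIONS:
            raise ValueError("unsupported version in line: %r" % line)
        versions[name] = version
    return versions


def segment_version(versions, name):
    """Look up a segment's version, falling back to the default."""
    return versions.get(name, DEFAULT_VERSION)


example = "foo-segment 10\nbar-segment 12\n"
versions = parse_versions(example)
print(segment_version(versions, "foo-segment"))
print(segment_version(versions, "unlisted-segment"))
```

Rejecting any version other than 10 or 12 is a choice made for this sketch; how NutchWAX handles a malformed "versions" file is not stated in the release notes.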