From: <bi...@us...> - 2010-01-13 00:26:52
Revision: 2947
http://archive-access.svn.sourceforge.net/archive-access/?rev=2947&view=rev
Author: binzino
Date: 2010-01-13 00:26:44 +0000 (Wed, 13 Jan 2010)
Log Message:
-----------
Updated for 0.12.9 release.
Modified Paths:
--------------
tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt
tags/nutchwax-0_12_9/archive/HOWTO.txt
tags/nutchwax-0_12_9/archive/INSTALL.txt
tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt
Modified: tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt 2010-01-13 00:14:06 UTC (rev 2946)
+++ tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt 2010-01-13 00:26:44 UTC (rev 2947)
@@ -1,6 +1,6 @@
BUILD-NOTES.txt
-2009-09-18
+2010-01-13
Aaron Binns
======================================================================
@@ -79,7 +79,7 @@
----------------------------------------------------------------------
The file
- /opt/nutchwax-0.12.8/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.9/conf/tika-mimetypes.xml
contains two errors: one where a mimetype is referenced before it is
defined; and a second where a definition has an illegal character.
@@ -110,11 +110,11 @@
You can either apply these patches yourself, or copy an already-patched
copy from:
- /opt/nutchwax-0.12.8/contrib/archive/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.9/contrib/archive/conf/tika-mimetypes.xml
to
- /opt/nutchwax-0.12.8/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.9/conf/tika-mimetypes.xml
----------------------------------------------------------------------
Modified: tags/nutchwax-0_12_9/archive/HOWTO.txt
===================================================================
--- tags/nutchwax-0_12_9/archive/HOWTO.txt 2010-01-13 00:14:06 UTC (rev 2946)
+++ tags/nutchwax-0_12_9/archive/HOWTO.txt 2010-01-13 00:26:44 UTC (rev 2947)
@@ -1,6 +1,6 @@
HOWTO.txt
-2009-09-18
+2010-01-13
Aaron Binns
Table of Contents
@@ -26,7 +26,7 @@
This HOWTO assumes it is installed in
- /opt/nutchwax-0.12.8
+ /opt/nutchwax-0.12.9
2. ARC/WARC files.
@@ -68,14 +68,10 @@
$ mkdir crawl
$ cd crawl
- $ /opt/nutchwax-0.12.8/bin/nutchwax import ../manifest
- $ /opt/nutchwax-0.12.8/bin/nutch updatedb crawldb -dir segments
- $ /opt/nutchwax-0.12.8/bin/nutch invertlinks linkdb -dir segments
- $ /opt/nutchwax-0.12.8/bin/nutch index indexes crawldb linkdb segments/*
+ $ /opt/nutchwax-0.12.9/bin/nutchwax import ../manifest
+ $ /opt/nutchwax-0.12.9/bin/nutchwax index indexes segments/*
$ ls -F1
- crawldb/
indexes/
- linkdb/
segments/
To those already familiar with Nutch, these steps should be quite
@@ -96,7 +92,7 @@
$ cd ../
$ ls -F1
crawl/
- $ /opt/nutchwax-0.12.8/bin/nutchwax search computer
+ $ /opt/nutchwax-0.12.9/bin/nutchwax search computer
This calls the NutchWaxBean to execute a simple keyword search for
"computer". Use whatever query term you think appears in the
@@ -109,7 +105,7 @@
The Nutch(WAX) web application is bundled with NutchWAX as
- /opt/nutchwax-0.12.8/nutch-1.0-dev.war
+ /opt/nutchwax-0.12.9/nutch-1.0-dev.war
Simply deploy that web application in the same fashion as with
Nutch.
Modified: tags/nutchwax-0_12_9/archive/INSTALL.txt
===================================================================
--- tags/nutchwax-0_12_9/archive/INSTALL.txt 2010-01-13 00:14:06 UTC (rev 2946)
+++ tags/nutchwax-0_12_9/archive/INSTALL.txt 2010-01-13 00:26:44 UTC (rev 2947)
@@ -1,6 +1,6 @@
INSTALL.txt
-2009-09-18
+2010-01-13
Aaron Binns
Table of Contents
@@ -63,10 +63,10 @@
------------------
As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
Although the Nutch project released 1.0 in early 2009, there were so
-many changes that NutchWAX 0.12.8 is still built against pre-1.0
+many changes that NutchWAX 0.12.9 is still built against pre-1.0
codebase.
-The specific SVN revision that NutchWAX 0.12.8 is built against is:
+The specific SVN revision that NutchWAX 0.12.9 is built against is:
701524
@@ -81,14 +81,14 @@
SVN: NutchWAX
-------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.8
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.9
source into Nutch's "contrib" directory.
$ cd contrib
$ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_8/archive
This will create a sub-directory named "archive" containing the
-NutchWAX 0.12.8 sources.
+NutchWAX 0.12.9 sources.
Build and install
-----------------
@@ -115,7 +115,7 @@
$ cd /opt
$ tar xvfz nutch-1.0-dev.tar.gz
- $ mv nutch-1.0-dev nutchwax-0.12.8
+ $ mv nutch-1.0-dev nutchwax-0.12.9
======================================================================
@@ -128,24 +128,24 @@
Install it simply by untarring it, for example:
$ cd /opt
- $ tar xvfz nutchwax-0.12.8.tar.gz
+ $ tar xvfz nutchwax-0.12.9.tar.gz
======================================================================
Install start-up scripts
======================================================================
-NutchWAX 0.12.8 comes with a Unix init.d script which can be used to
+NutchWAX 0.12.9 comes with a Unix init.d script which can be used to
automatically start the searcher slaves for a multi-node search
configuration.
Assuming you installed NutchWAX as
- /opt/nutchwax-0.12.8
+ /opt/nutchwax-0.12.9
the script is found at
- /opt/nutchwax-0.12.8/contrib/archive/etc/init.d/searcher-slave
+ /opt/nutchwax-0.12.9/contrib/archive/etc/init.d/searcher-slave
This script can be placed in /etc/init.d then added to the list of
startup scripts to run at bootup by using commands appropriate to your
Modified: tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt 2010-01-13 00:14:06 UTC (rev 2946)
+++ tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt 2010-01-13 00:26:44 UTC (rev 2947)
@@ -1,70 +1,50 @@
RELEASE-NOTES.TXT
-2009-09-18
+2010-01-13
Aaron Binns
-Release notes for NutchWAX 0.12.8
+Release notes for NutchWAX 0.12.9
For the most recent updates and information on NutchWAX,
please visit the project wiki at:
- http://webteam.archive.org/confluence/display/search/NutchWAX
+ http://webarchive.jira.com/wiki/display/search/NutchWAX
-
======================================================================
Overview
======================================================================
-The main enhancement in NutchWAX 0.12.8 is the ability to configure
-HTTP headers to support caching.
+The main enhancement in NutchWAX 0.12.9 is the ability to search
+indexes created with NutchWAX 0.10.
-The Archive is starting to use Squid to cache the HTTP responses from
-NutchWAX and some explicit HTTP response headers were needed to enable
-this. Rather than relying on the servlet container (Tomcat/Jetty) to
-add the response headers, we added a servlet filter to NutchWAX.
+In the segments directory, create a "versions" file and in it
+list the names of the segments and their version, e.g.
-Right now the filter is very basic, in the web.xml file we now have
+ foo-segment 10
+ bar-segment 12
- <filter>
- <filter-name>Cache Settings</filter-name>
- <filter-class>org.archive.nutchwax.CacheSettingsFilter</filter-class>
- <init-param>
- <param-name>max-age</param-name>
- <param-value>259200</param-value> <!-- 72 hours (in seconds) -->
- </init-param>
- </filter>
+where the version number is either 10 or 12. If a segment is not
+listed in the "versions" file, it is assumed to be version 12.
- <filter-mapping>
- <filter-name>Cache Settings</filter-name>
- <servlet-name>OpenSearch</servlet-name>
- </filter-mapping>
+Also, a minor but convenient enhancement is that the crawldb and
+linkdb no longer need to be present at index time. Neither of these
+is actually used for indexing; the requirement to pass them to the
+index step was a legacy of Nutch. Now, there is a NutchWAX 'index'
+command which only requires the segment(s) to be present.
-which configures the filter to add a 'max-age' header with a 72 hour
-limit. This filter is then applied to all instances of the OpenSearch
-servlet.
-
-This allows browsers to cache the OpenSearch response for up to 72
-hours. It also enables any proxies between the browser and server to
-cache the response as well. With the addition of Squid into our
-deployment, we let Squid serve cached responses to repeat queries.
-
-Since our deployment updates every 4 days, a 72-hour expiration works
-well.
-
======================================================================
Issues
======================================================================
For an up-to-date list of NutchWAX issues:
- http://webteam.archive.org/jira/browse/WAX
+ http://webarchive.jira.com/browse/WAX
Issues resolved in this release:
-WAX-61 Change mime-type of OpenSearch XML response from text/xml to
- application/xml.
+WAX-66 Index documents without crawldb nor linkdb.
-WAX-62 Add ability to configure HTTP headers to support caching.
+WAX-67 Nutch OpenOffice parser does not pass along metadata.
-WAX-63 LengthNormUpdater returning error code if no fields in index
- have norms is inconvenient.
+WAX-68 Compatibility with {index+segment}s created by NutchWAX 0.10.
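
For illustration only, the "versions" file described in the release notes
above could be read along the following lines. This is a minimal sketch, not
NutchWAX's actual (Java) implementation: the function names are hypothetical,
and whitespace-separated fields and the skipping of blank lines are
assumptions. Only the allowed versions (10 or 12) and the default of 12 for
unlisted segments come from the release notes.

```python
# Sketch of reading a NutchWAX-style "versions" file.
# Assumptions (not from the source): fields are whitespace-separated,
# blank lines are ignored, and malformed lines are rejected.

DEFAULT_VERSION = 12          # release notes: unlisted segments are version 12
ALLOWED_VERSIONS = {10, 12}   # release notes: version is either 10 or 12


def parse_versions(text):
    """Return {segment_name: version} parsed from the file contents."""
    versions = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines carry no meaning
        if len(fields) != 2:
            raise ValueError(f"line {line_number}: expected '<segment> <version>'")
        name, raw_version = fields
        version = int(raw_version)  # raises ValueError if not a number
        if version not in ALLOWED_VERSIONS:
            raise ValueError(f"line {line_number}: unsupported version {version}")
        versions[name] = version
    return versions


def segment_version(versions, segment_name):
    """Look up a segment's version, falling back to the documented default."""
    return versions.get(segment_name, DEFAULT_VERSION)
```

With the example from the release notes, parse_versions("foo-segment 10\nbar-segment 12\n")
maps foo-segment to 10 and bar-segment to 12, and any segment not listed
resolves to 12 through segment_version.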