From: <bi...@us...> - 2009-09-19 02:57:20
Revision: 2804
http://archive-access.svn.sourceforge.net/archive-access/?rev=2804&view=rev
Author: binzino
Date: 2009-09-19 02:57:12 +0000 (Sat, 19 Sep 2009)
Log Message:
-----------
Updated documents for 0.12.8 release.
Modified Paths:
--------------
tags/nutchwax-0_12_8/archive/BUILD-NOTES.txt
tags/nutchwax-0_12_8/archive/HOWTO.txt
tags/nutchwax-0_12_8/archive/INSTALL.txt
tags/nutchwax-0_12_8/archive/README.txt
tags/nutchwax-0_12_8/archive/RELEASE-NOTES.txt
Modified: tags/nutchwax-0_12_8/archive/BUILD-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_8/archive/BUILD-NOTES.txt 2009-08-27 23:56:42 UTC (rev 2803)
+++ tags/nutchwax-0_12_8/archive/BUILD-NOTES.txt 2009-09-19 02:57:12 UTC (rev 2804)
@@ -1,6 +1,6 @@
BUILD-NOTES.txt
-2009-07-24
+2009-09-18
Aaron Binns
======================================================================
@@ -79,7 +79,7 @@
----------------------------------------------------------------------
The file
- /opt/nutchwax-0.12.7/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.8/conf/tika-mimetypes.xml
contains two errors: one where a mimetype is referenced before it is
defined; and a second where a definition has an illegal character.
@@ -110,11 +110,11 @@
You can either apply these patches yourself, or copy an already-patched
copy from:
- /opt/nutchwax-0.12.7/contrib/archive/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.8/contrib/archive/conf/tika-mimetypes.xml
to
- /opt/nutchwax-0.12.7/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.8/conf/tika-mimetypes.xml
----------------------------------------------------------------------
Modified: tags/nutchwax-0_12_8/archive/HOWTO.txt
===================================================================
--- tags/nutchwax-0_12_8/archive/HOWTO.txt 2009-08-27 23:56:42 UTC (rev 2803)
+++ tags/nutchwax-0_12_8/archive/HOWTO.txt 2009-09-19 02:57:12 UTC (rev 2804)
@@ -1,6 +1,6 @@
HOWTO.txt
-2009-07-24
+2009-09-18
Aaron Binns
Table of Contents
@@ -26,7 +26,7 @@
This HOWTO assumes it is installed in
- /opt/nutchwax-0.12.7
+ /opt/nutchwax-0.12.8
2. ARC/WARC files.
@@ -68,10 +68,10 @@
$ mkdir crawl
$ cd crawl
- $ /opt/nutchwax-0.12.7/bin/nutchwax import ../manifest
- $ /opt/nutchwax-0.12.7/bin/nutch updatedb crawldb -dir segments
- $ /opt/nutchwax-0.12.7/bin/nutch invertlinks linkdb -dir segments
- $ /opt/nutchwax-0.12.7/bin/nutch index indexes crawldb linkdb segments/*
+ $ /opt/nutchwax-0.12.8/bin/nutchwax import ../manifest
+ $ /opt/nutchwax-0.12.8/bin/nutch updatedb crawldb -dir segments
+ $ /opt/nutchwax-0.12.8/bin/nutch invertlinks linkdb -dir segments
+ $ /opt/nutchwax-0.12.8/bin/nutch index indexes crawldb linkdb segments/*
$ ls -F1
crawldb/
indexes/
@@ -96,7 +96,7 @@
$ cd ../
$ ls -F1
crawl/
- $ /opt/nutchwax-0.12.7/bin/nutchwax search computer
+ $ /opt/nutchwax-0.12.8/bin/nutchwax search computer
This calls the NutchWaxBean to execute a simple keyword search for
"computer". Use whatever query term you think appears in the
@@ -109,7 +109,7 @@
The Nutch(WAX) web application is bundled with NutchWAX as
- /opt/nutchwax-0.12.7/nutch-1.0-dev.war
+ /opt/nutchwax-0.12.8/nutch-1.0-dev.war
Simply deploy that web application in the same fashion as with
Nutch.
Modified: tags/nutchwax-0_12_8/archive/INSTALL.txt
===================================================================
--- tags/nutchwax-0_12_8/archive/INSTALL.txt 2009-08-27 23:56:42 UTC (rev 2803)
+++ tags/nutchwax-0_12_8/archive/INSTALL.txt 2009-09-19 02:57:12 UTC (rev 2804)
@@ -1,6 +1,6 @@
INSTALL.txt
-2009-07-24
+2009-09-18
Aaron Binns
Table of Contents
@@ -63,10 +63,10 @@
------------------
As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
Although the Nutch project released 1.0 in early 2009, there were so
-many changes that NutchWAX 0.12.7 is still built against pre-1.0
+many changes that NutchWAX 0.12.8 is still built against pre-1.0
codebase.
-The specific SVN revision that NutchWAX 0.12.7 is built against is:
+The specific SVN revision that NutchWAX 0.12.8 is built against is:
701524
@@ -81,14 +81,14 @@
SVN: NutchWAX
-------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.7
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.8
source into Nutch's "contrib" directory.
$ cd contrib
- $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_7/archive
+ $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_8/archive
This will create a sub-directory named "archive" containing the
-NutchWAX 0.12.7 sources.
+NutchWAX 0.12.8 sources.
Build and install
-----------------
@@ -115,7 +115,7 @@
$ cd /opt
$ tar xvfz nutch-1.0-dev.tar.gz
- $ mv nutch-1.0-dev nutchwax-0.12.7
+ $ mv nutch-1.0-dev nutchwax-0.12.8
======================================================================
@@ -128,24 +128,24 @@
Install it simply by untarring it, for example:
$ cd /opt
- $ tar xvfz nutchwax-0.12.7.tar.gz
+ $ tar xvfz nutchwax-0.12.8.tar.gz
======================================================================
Install start-up scripts
======================================================================
-NutchWAX 0.12.7 comes with a Unix init.d script which can be used to
+NutchWAX 0.12.8 comes with a Unix init.d script which can be used to
automatically start the searcher slaves for a multi-node search
configuration.
Assuming you installed NutchWAX as
- /opt/nutchwax-0.12.7
+ /opt/nutchwax-0.12.8
the script is found at
- /opt/nutchwax-0.12.7/contrib/archive/etc/init.d/searcher-slave
+ /opt/nutchwax-0.12.8/contrib/archive/etc/init.d/searcher-slave
This script can be placed in /etc/init.d then added to the list of
startup scripts to run at bootup by using commands appropriate to your
Modified: tags/nutchwax-0_12_8/archive/README.txt
===================================================================
--- tags/nutchwax-0_12_8/archive/README.txt 2009-08-27 23:56:42 UTC (rev 2803)
+++ tags/nutchwax-0_12_8/archive/README.txt 2009-09-19 02:57:12 UTC (rev 2804)
@@ -1,6 +1,6 @@
README.txt
-2009-07-24
+2009-09-18
Aaron Binns
Table of Contents
@@ -13,7 +13,7 @@
Introduction
======================================================================
-Welcome to NutchWAX 0.12.7!
+Welcome to NutchWAX 0.12.8!
NutchWAX is a set of add-ons to Nutch in order to index and search
archived web data.
Modified: tags/nutchwax-0_12_8/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_8/archive/RELEASE-NOTES.txt 2009-08-27 23:56:42 UTC (rev 2803)
+++ tags/nutchwax-0_12_8/archive/RELEASE-NOTES.txt 2009-09-19 02:57:12 UTC (rev 2804)
@@ -1,9 +1,9 @@
RELEASE-NOTES.TXT
-2009-07-24
+2009-09-18
Aaron Binns
-Release notes for NutchWAX 0.12.7
+Release notes for NutchWAX 0.12.8
For the most recent updates and information on NutchWAX,
please visit the project wiki at:
@@ -15,26 +15,42 @@
Overview
======================================================================
-Aside from the bugs listen in the following section, the main
-feature enhancement in Nutchwax 0.12.7 is the addition of a tool
-to update the boost values in an index. A new command has been
-added to the 'nutchwax' command-line driver:
+The main enhancement in NutchWAX 0.12.8 is the ability to configure
+HTTP headers to support caching.
- nutchwax reboost
+The Archive is starting to use Squid to cache the HTTP responses from
+NutchWAX and some explicit HTTP response headers were needed to enable
+this. Rather than relying on the servlet container (Tomcat/Jetty) to
+add the response headers, we added a servlet filter to NutchWAX.
-This command takes a pagerank.txt file and an index, calculates (new)
-boost values based on the pagerank.txt file and applies them to the
-index. The boost values are modified in-place in the index.
+Right now the filter is very basic. In the web.xml file we now have
-This feaure is used by the Archive in our deployments where web data
-is continuously crawled and archived over time, with pagerank values
-updating accordingly.
+ <filter>
+ <filter-name>Cache Settings</filter-name>
+ <filter-class>org.archive.nutchwax.CacheSettingsFilter</filter-class>
+ <init-param>
+ <param-name>max-age</param-name>
+ <param-value>259200</param-value> <!-- 72 hours (in seconds) -->
+ </init-param>
+ </filter>
-Before NW 0.12.7, content had to be re-indexed to take into account
-updated pagerank information -- an expensive operation for large
-indexes. Now, the pagerank-based boost values can be updated in
-place.
+ <filter-mapping>
+ <filter-name>Cache Settings</filter-name>
+ <servlet-name>OpenSearch</servlet-name>
+ </filter-mapping>
+which configures the filter to add a 'max-age' header with a 72 hour
+limit. This filter is then applied to all instances of the OpenSearch
+servlet.
+
+This allows browsers to cache the OpenSearch response for up to 72
+hours. It also enables any proxies between the browser and server to
+cache the response as well. With the addition of Squid into our
+deployment, we let Squid serve cached responses to repeat queries.
+
+Since our deployment updates every 4 days, a 72-hour expiration works
+well.
+
======================================================================
Issues
======================================================================
@@ -45,12 +61,10 @@
Issues resolved in this release:
-WAX-55 NutchWaxBean's command-line searching should emit title along with other document metadata.
+WAX-61 Change mime-type of OpenSearch XML response from text/xml to
+ application/xml.
-WAX-56 Date-adder allows for duplicate dates to be added to a record.
+WAX-62 Add ability to configure HTTP headers to support caching.
-WAX-57 nutchwax command-driver doesn't properly enclose arguments in quotes.
-
-WAX-58 Need tool to update an existing index's norms based on pagerank information.
-
-WAX-59 Wrong log() function used in PageRankScoringFilter.
+WAX-63 LengthNormUpdater returning error code if no fields in index
+ have norms is inconvenient.
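
Note on the Cache Settings filter configured in the RELEASE-NOTES.txt hunk above: the source of org.archive.nutchwax.CacheSettingsFilter is not part of this commit, so the following is only a minimal sketch of how such a filter might turn its 'max-age' init-param into a response header value. It assumes the filter emits a standard "Cache-Control: max-age=<seconds>" header. The class and method names (CacheHeaderSketch, cacheControlValue) are hypothetical, and the servlet API wiring is left out so the sketch runs on its own.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration only; this is NOT the actual
// org.archive.nutchwax.CacheSettingsFilter implementation.
final class CacheHeaderSketch {
    private CacheHeaderSketch() {}

    /**
     * Converts the raw 'max-age' init-param (a number of seconds) into a
     * Cache-Control header value. Returns null when the parameter is
     * missing, not an integer, or negative, so a caller can skip the header.
     */
    static String cacheControlValue(String maxAgeParam) {
        if (maxAgeParam == null) {
            return null;
        }
        try {
            long seconds = Long.parseLong(maxAgeParam.trim());
            return seconds < 0 ? null : "max-age=" + seconds;
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // 72 hours expressed in seconds, matching the param-value in web.xml.
        long seventyTwoHours = TimeUnit.HOURS.toSeconds(72);
        System.out.println(seventyTwoHours);
        System.out.println(cacheControlValue(Long.toString(seventyTwoHours)));
    }
}
```

In a real servlet filter, the value would presumably be read once from FilterConfig in init() and set on each response before the request is passed down the chain. Returning null for a bad parameter lets the filter fall back to sending no caching header at all.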