From: <bi...@us...> - 2009-07-24 20:52:02
Revision: 2795
http://archive-access.svn.sourceforge.net/archive-access/?rev=2795&view=rev
Author: binzino
Date: 2009-07-24 20:51:49 +0000 (Fri, 24 Jul 2009)
Log Message:
-----------
Updated documentation for NutchWAX 0.12.7 release.
Modified Paths:
--------------
tags/nutchwax-0_12_7/archive/BUILD-NOTES.txt
tags/nutchwax-0_12_7/archive/HOWTO.txt
tags/nutchwax-0_12_7/archive/INSTALL.txt
tags/nutchwax-0_12_7/archive/README.txt
tags/nutchwax-0_12_7/archive/RELEASE-NOTES.txt
Modified: tags/nutchwax-0_12_7/archive/BUILD-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_7/archive/BUILD-NOTES.txt 2009-07-24 19:13:40 UTC (rev 2794)
+++ tags/nutchwax-0_12_7/archive/BUILD-NOTES.txt 2009-07-24 20:51:49 UTC (rev 2795)
@@ -1,6 +1,6 @@
BUILD-NOTES.txt
-2009-07-09
+2009-07-24
Aaron Binns
======================================================================
@@ -79,7 +79,7 @@
----------------------------------------------------------------------
The file
- /opt/nutchwax-0.12.6/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.7/conf/tika-mimetypes.xml
contains two errors: one where a mimetype is referenced before it is
defined; and a second where a definition has an illegal character.
@@ -110,11 +110,11 @@
You can either apply these patches yourself, or copy an already-patched
copy from:
- /opt/nutchwax-0.12.6/contrib/archive/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.7/contrib/archive/conf/tika-mimetypes.xml
to
- /opt/nutchwax-0.12.6/conf/tika-mimetypes.xml
+ /opt/nutchwax-0.12.7/conf/tika-mimetypes.xml
----------------------------------------------------------------------
Modified: tags/nutchwax-0_12_7/archive/HOWTO.txt
===================================================================
--- tags/nutchwax-0_12_7/archive/HOWTO.txt 2009-07-24 19:13:40 UTC (rev 2794)
+++ tags/nutchwax-0_12_7/archive/HOWTO.txt 2009-07-24 20:51:49 UTC (rev 2795)
@@ -1,6 +1,6 @@
HOWTO.txt
-2009-07-09
+2009-07-24
Aaron Binns
Table of Contents
@@ -26,7 +26,7 @@
This HOWTO assumes it is installed in
- /opt/nutchwax-0.12.6
+ /opt/nutchwax-0.12.7
2. ARC/WARC files.
@@ -68,10 +68,10 @@
$ mkdir crawl
$ cd crawl
- $ /opt/nutchwax-0.12.6/bin/nutchwax import ../manifest
- $ /opt/nutchwax-0.12.6/bin/nutch updatedb crawldb -dir segments
- $ /opt/nutchwax-0.12.6/bin/nutch invertlinks linkdb -dir segments
- $ /opt/nutchwax-0.12.6/bin/nutch index indexes crawldb linkdb segments/*
+ $ /opt/nutchwax-0.12.7/bin/nutchwax import ../manifest
+ $ /opt/nutchwax-0.12.7/bin/nutch updatedb crawldb -dir segments
+ $ /opt/nutchwax-0.12.7/bin/nutch invertlinks linkdb -dir segments
+ $ /opt/nutchwax-0.12.7/bin/nutch index indexes crawldb linkdb segments/*
$ ls -F1
crawldb/
indexes/
@@ -96,7 +96,7 @@
$ cd ../
$ ls -F1
crawl/
- $ /opt/nutchwax-0.12.6/bin/nutchwax search computer
+ $ /opt/nutchwax-0.12.7/bin/nutchwax search computer
This calls the NutchWaxBean to execute a simple keyword search for
"computer". Use whatever query term you think appears in the
@@ -109,7 +109,7 @@
The Nutch(WAX) web application is bundled with NutchWAX as
- /opt/nutchwax-0.12.6/nutch-1.0-dev.war
+ /opt/nutchwax-0.12.7/nutch-1.0-dev.war
Simply deploy that web application in the same fashion as with
Nutch.
Modified: tags/nutchwax-0_12_7/archive/INSTALL.txt
===================================================================
--- tags/nutchwax-0_12_7/archive/INSTALL.txt 2009-07-24 19:13:40 UTC (rev 2794)
+++ tags/nutchwax-0_12_7/archive/INSTALL.txt 2009-07-24 20:51:49 UTC (rev 2795)
@@ -1,6 +1,6 @@
INSTALL.txt
-2009-07-09
+2009-07-24
Aaron Binns
Table of Contents
@@ -63,10 +63,10 @@
------------------
As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
Although the Nutch project released 1.0 in early 2009, there were so
-many changes that NutchWAX 0.12.6 is still built against pre-1.0
+many changes that NutchWAX 0.12.7 is still built against pre-1.0
codebase.
-The specific SVN revision that NutchWAX 0.12.6 is built against is:
+The specific SVN revision that NutchWAX 0.12.7 is built against is:
701524
@@ -81,14 +81,14 @@
SVN: NutchWAX
-------------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.6
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.7
source into Nutch's "contrib" directory.
$ cd contrib
- $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_6/archive
+ $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_7/archive
This will create a sub-directory named "archive" containing the
-NutchWAX 0.12.6 sources.
+NutchWAX 0.12.7 sources.
Build and install
-----------------
@@ -115,7 +115,7 @@
$ cd /opt
$ tar xvfz nutch-1.0-dev.tar.gz
- $ mv nutch-1.0-dev nutchwax-0.12.6
+ $ mv nutch-1.0-dev nutchwax-0.12.7
======================================================================
@@ -128,24 +128,24 @@
Install it simply by untarring it, for example:
$ cd /opt
- $ tar xvfz nutchwax-0.12.6.tar.gz
+ $ tar xvfz nutchwax-0.12.7.tar.gz
======================================================================
Install start-up scripts
======================================================================
-NutchWAX 0.12.6 comes with a Unix init.d script which can be used to
+NutchWAX 0.12.7 comes with a Unix init.d script which can be used to
automatically start the searcher slaves for a multi-node search
configuration.
Assuming you installed NutchWAX as
- /opt/nutchwax-0.12.6
+ /opt/nutchwax-0.12.7
the script is found at
- /opt/nutchwax-0.12.6/contrib/archive/etc/init.d/searcher-slave
+ /opt/nutchwax-0.12.7/contrib/archive/etc/init.d/searcher-slave
This script can be placed in /etc/init.d then added to the list of
startup scripts to run at bootup by using commands appropriate to your
Modified: tags/nutchwax-0_12_7/archive/README.txt
===================================================================
--- tags/nutchwax-0_12_7/archive/README.txt 2009-07-24 19:13:40 UTC (rev 2794)
+++ tags/nutchwax-0_12_7/archive/README.txt 2009-07-24 20:51:49 UTC (rev 2795)
@@ -1,6 +1,6 @@
README.txt
-2009-07-09
+2009-07-24
Aaron Binns
Table of Contents
@@ -13,7 +13,7 @@
Introduction
======================================================================
-Welcome to NutchWAX 0.12.6!
+Welcome to NutchWAX 0.12.7!
NutchWAX is a set of add-ons to Nutch in order to index and search
archived web data.
Modified: tags/nutchwax-0_12_7/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_7/archive/RELEASE-NOTES.txt 2009-07-24 19:13:40 UTC (rev 2794)
+++ tags/nutchwax-0_12_7/archive/RELEASE-NOTES.txt 2009-07-24 20:51:49 UTC (rev 2795)
@@ -1,9 +1,9 @@
RELEASE-NOTES.TXT
-2009-07-09
+2009-07-24
Aaron Binns
-Release notes for NutchWAX 0.12.6
+Release notes for NutchWAX 0.12.7
For the most recent updates and information on NutchWAX,
please visit the project wiki at:
@@ -15,44 +15,26 @@
Overview
======================================================================
-NutchWAX 0.12.6 contains a few convenient enhancements to 0.12.5
+Aside from the bugs listed in the following section, the main
+feature enhancement in NutchWAX 0.12.7 is the addition of a tool
+to update the boost values in an index. A new command has been
+added to the 'nutchwax' command-line driver:
- o Addition of 'search' and 'merge' commands to the 'nutchwax'
- command-line driver. Now one can do
+ nutchwax reboost
- nutchwax search foo
+This command takes a pagerank.txt file and an index, calculates (new)
+boost values based on the pagerank.txt file and applies them to the
+index. The boost values are modified in-place in the index.
- instead of
+This feature is used by the Archive in our deployments where web data
+is continuously crawled and archived over time, with pagerank values
+updating accordingly.
- nutch org.archive.nutchwax.NutchWaxBean foo
+Before NW 0.12.7, content had to be re-indexed to take into account
+updated pagerank information -- an expensive operation for large
+indexes. Now, the pagerank-based boost values can be updated in
+place.
- Similarly, the new NutchWAX index merging, which supports
- parallel indexes, can be invoked via
-
- nutchwax merge output-index input-index...
-
- o Merging of parallel indexes into a single index.
-
- NutchWAX has a copy/paste/enhanced version of the Nutch index
- merger that now supports parallel indexes. This allows parallel
- indexes to be merged into a single index. To use this feature,
- add the "-p" option to the NutchWAX 'merge' command indicating the
- input index directories contain parallel index sub-dirs.
-
- nutchwax merge -p output-index input-pindexes...
-
- o Option to specify the directory where the index(es) and segments
- live when doing a command-line search.
-
- Previously the directory was obtained from the nutch-default.xml
- configuration file. This is inconvenient when testing different
- indexes as one would have to edit the config file each time to
- specify a different index to search.
-
- Now, the directory can be specified on the command line:
-
- nutchwax search -d <dir> <query>
-
======================================================================
Issues
======================================================================
@@ -63,9 +45,12 @@
Issues resolved in this release:
-WAX-51 Enhance index merging to combine parallel indexes.
+WAX-55 NutchWaxBean's command-line searching should emit title along with other document metadata.
-WAX-52 Add option to NutchWaxBean to specify directory where
- index+segments are to be found.
+WAX-56 Date-adder allows for duplicate dates to be added to a record.
-WAX-53 IndexMerging parallel indexes fails when index is empty.
+WAX-57 nutchwax command-driver doesn't properly enclose arguments in quotes.
+
+WAX-58 Need tool to update an existing index's norms based on pagerank information.
+
+WAX-59 Wrong log() function used in PageRankScoringFilter.
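
Illustration (not part of revision 2795): the in-place boost update
described in the release notes can be pictured with the small Python
sketch below. Everything in it is an assumption made for illustration:
the "count url" line layout for pagerank.txt, the 1 + log10(count)
formula, and the list-of-dicts stand-in for an index. The actual
'nutchwax reboost' tool operates on a Lucene index and may use a
different file format and formula.

```python
import math


def parse_pagerank(lines):
    """Parse hypothetical 'count<whitespace>url' lines into {url: count}.

    The real pagerank.txt layout is not given in the notes above; this
    format is assumed purely for the sketch.
    """
    ranks = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        count, url = line.split(None, 1)
        ranks[url] = int(count)
    return ranks


def boost_for(count):
    """Illustrative boost: 1 + log10(count), or 1.0 when count is 0.

    This formula is an assumption, not the one NutchWAX uses.
    """
    return 1.0 + math.log10(count) if count > 0 else 1.0


def reboost(index, ranks):
    """Overwrite each record's 'boost' in place; return how many changed.

    'index' is a list of dicts standing in for index documents.
    """
    changed = 0
    for doc in index:
        new_boost = boost_for(ranks.get(doc["url"], 0))
        if doc["boost"] != new_boost:
            doc["boost"] = new_boost
            changed += 1
    return changed


if __name__ == "__main__":
    ranks = parse_pagerank(["100 http://a.example/", "1 http://b.example/"])
    index = [
        {"url": "http://a.example/", "boost": 1.0},
        {"url": "http://b.example/", "boost": 1.0},
    ]
    print(reboost(index, ranks), index)
```

The point of the sketch is only that the records themselves are never
rebuilt: a new value is computed per document from the pagerank data
and written over the old one, which is why no re-indexing is needed.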