From: <bi...@us...> - 2008-07-28 19:49:58
Revision: 2509
http://archive-access.svn.sourceforge.net/archive-access/?rev=2509&view=rev
Author: binzino
Date: 2008-07-28 19:50:07 +0000 (Mon, 28 Jul 2008)
Log Message:
-----------
Updated for NutchWAX 0.12.1 release.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
trunk/archive-access/projects/nutchwax/archive/README.txt
trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-07-28 19:43:10 UTC (rev 2508)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-07-28 19:50:07 UTC (rev 2509)
@@ -1,6 +1,6 @@
HOWTO.txt
-2008-05-20
+2008-07-28
Aaron Binns
Table of Contents
Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-07-28 19:43:10 UTC (rev 2508)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-07-28 19:50:07 UTC (rev 2509)
@@ -1,6 +1,6 @@
INSTALL.txt
-2008-07-02
+2008-07-28
Aaron Binns
This installation guide assumes the reader is already familiar with
@@ -46,11 +46,11 @@
Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12 is
built against is:
- 673823
+ 676736
To checkout this revision of Nutch, use:
- $ svn checkout -r 673823 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
+ $ svn checkout -r 676736 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
$ cd nutch
Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt 2008-07-28 19:43:10 UTC (rev 2508)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt 2008-07-28 19:50:07 UTC (rev 2509)
@@ -1,9 +1,9 @@
README.txt
-2008-07-02
+2008-07-25
Aaron Binns
-Welcome to NutchWAX 0.12!
+Welcome to NutchWAX 0.12.1!
NutchWAX is a set of add-ons to Nutch in order to index and search
archived web data.
@@ -76,15 +76,15 @@
======================================================================
-This 0.12 release of NutchWAX is radically different in source-code
+This 0.12.x release of NutchWAX is radically different in source-code
form compared to the previous release, 0.10.
-One of the design goals of 0.12 was to reduce or even eliminate the
+One of the design goals of 0.12.x was to reduce or even eliminate the
"copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX
releases had to copy/paste/edit large chunks of Nutch source code in
order to add the NutchWAX features.
-Also, the NutchWAX 0.12 sources and build are designed to one day be
+Also, the NutchWAX 0.12.x sources and build are designed to one day be
added into mainline Nutch as a proper "contrib" package; then
eventually be fully integrated into the core Nutch source code.
Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-07-28 19:43:10 UTC (rev 2508)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-07-28 19:50:07 UTC (rev 2509)
@@ -1,9 +1,9 @@
RELEASE-NOTES.TXT
-2007-07-03
+2007-07-25
Aaron Binns
-Release notes for NutchWAX 0.12
+Release notes for NutchWAX 0.12.1
For the most recent updates and information on NutchWAX,
please visit the project wiki at:
@@ -15,28 +15,10 @@
Overview
======================================================================
-NutchWAX 0.12-beta-1 was released on June 2, 2008. We anticipated
-releasing another beta mid-June with bug fixes and some minor
-enhancements based on feedback from the community.
+NutchWAX 0.12.1 contains some minor enhancements and fixes to NutchWAX
+0.12. One of the driving forces behind some of the enhancements was
+integration with the Wayback machine.
-During internal testing by the Internet Archive Web Team, a few
-serious problems were found, the most critical being the failure to
-store different copies of the same URL when importing large batches of
-archive files.
-
-The NutchWAX team canceled the mid-month release in order to focus on
-fixing this problem.
-
-The good news is that not only has that problem been fixed, but the
-solution is part of a broader enhancement to manage the de-duplication
-of archive contnet during import and indexing.
-
-For more details on de-duplication in NutchWAX, please see
-
- HOWTO-dedup.txt
- README-dedup.txt
-
-
======================================================================
Issues
======================================================================
@@ -47,16 +29,24 @@
Issues resolved in this release:
-WAX-9 Entire file not imported
-WAX-8 Investigate why so many PDFs fail to parse
+WAX-16
+ Option to skip ARC record import based on HTTP status code of
+ content
- Fixing the first one caused nearly all of the PDF parsing errors to
- disappear.
+WAX-12
+ Add metadata field "fileoffset"
-WAX-7 Change config to that URL filters are not applied during link inversion
+WAX-11
+ Change metadata field name in search results from "arcname" to
+ "filename"
- This is easily achieved by using command-line options when invoking
- the Nutch "invertlinks" command.
+WAX-10
+ Add "exacturl" metadata field to indexing so it can be searched
+ as-is, not parsed/tokenized like the "url" field.
-WAX-3 Observe content size limit on importing
-WAX-2 Date queries cause TooManyClauses exceptions
+WAX-6
+ Change DateAdder to allow for implementation of URLCanonicalizer to
+ be defined in property.
+
+WAX-4
+ Implementor/user-provided XSLT for OpenSearch results