[Archive-access-cvs] SF.net SVN: archive-access:[2704] tags/nutchwax-0_12_4/archive

Revision: 2704
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2704&view=rev
Author:   binzino
Date:     2009-05-05 22:17:48 +0000 (Tue, 05 May 2009)

Log Message:
-----------
Oops, didn't have the updated versions checked-in when I did the
release copy.  Fixed.

Added Paths:
-----------
    tags/nutchwax-0_12_4/archive/README.txt
    tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt

Removed Paths:
-------------
    tags/nutchwax-0_12_4/archive/README.txt
    tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt

Deleted: tags/nutchwax-0_12_4/archive/README.txt
===================================================================

--- tags/nutchwax-0_12_4/archive/README.txt	2009-05-05 21:46:40 UTC (rev 2703)
+++ tags/nutchwax-0_12_4/archive/README.txt	2009-05-05 22:17:48 UTC (rev 2704)
@@ -1,104 +0,0 @@
-
-README.txt
-2008-03-08
-Aaron Binns
-
-Table of Contents
- o Introduction
- o Build and Install
- o Tutorial
-
-
-======================================================================
-Introduction
-======================================================================
-
-Welcome to NutchWAX 0.12.4!
-
-NutchWAX is a set of add-ons to Nutch in order to index and search
-archived web data.
-
-These add-ons are developed and maintained by the Internet Archive Web
-Team in conjunction with a broad community of contributors, partners
-and end-users.
-
-The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
-
-Since NutchWAX is a set of add-ons to Nutch, you should already be
-familiar with Nutch before using NutchWAX.
-
-
-The goal of NutchWAX is to enable full-text indexing and searching of
-documents stored in web archive file formats (ARC and WARC).
-
-The way we achieve that goal is by providing plugins and add-on tools
-to Nutch to read documents directly from ARC/WARC files.  We call this
-process "importing" archive files.
-
-Importing produces a Nutch segment, the same as when Nutch is used to
-crawl documents itself.  In essence, document importing replaces the
-conventional "generate/fetch/update" cycle of Nutch.
-
-Once the archival documents have been imported into a segment, the
-regular Nutch commands to index the document contents can proceed as
-normal.
-
-======================================================================
-
-The main NutchWAX add-ons are:
-
- bin/nutchwax
-
-   A shell script that is used to run the NutchWAX commands, such as
-   document importing.
-
-   This is patterned after the 'bin/nutch' shell script.
-
- plugins/index-nutchwax
-
-   Indexing plugin which adds NutchWAX-specific metadata fields to the
-   indexed document.
-
- plugins/query-nutchwax
-
-   Query plugin which allows for querying against the metadata fields
-   added by 'index-nutchwax'.
-
- plugins/urlfilter-nutchwax
-
-   Filtering plugin which can be used to exclude URLs from import.  It
-   can be used as part of a NutchWAX de-duplication scheme.
-
- plugins/scoring-nutchwax
-
-   Scoring plugin for use at index-time which reads from an external
-   "pagerank.txt" file for scoring documents based on the log10 of the
-   number of inlinks to a document.
-
-   The use of this plugin is optional but can improve the quality of
-   search results, especially for very large collections.
-
- conf/nutch-site.xml
-
-   Additional configuration properties for NutchWAX, including
-   over-rides for properties defined in 'nutch-default.xml'
-
-There is no separate 'lib/nutchwax.jar' file for NutchWAX.  NutchWAX
-is distributed in source code form and is intended to be built in
-conjunction with Nutch.
-
-
-======================================================================
-Build and Install
-======================================================================
-
-See "INSTALL.txt" for detailed instructions to build NutchWAX from
-source or install a binary package.
-
-
-======================================================================
-Tutorial
-======================================================================
-
-See "HOWTO.txt" for a quick tutorial on importing, indexing and
-searching a set of documents in a web archive file.

Copied: tags/nutchwax-0_12_4/archive/README.txt (from rev 2703, trunk/archive-access/projects/nutchwax/archive/README.txt)
===================================================================
--- tags/nutchwax-0_12_4/archive/README.txt	                        (rev 0)
+++ tags/nutchwax-0_12_4/archive/README.txt	2009-05-05 22:17:48 UTC (rev 2704)
@@ -0,0 +1,104 @@
+
+README.txt
+2009-05-05
+Aaron Binns
+
+Table of Contents
+ o Introduction
+ o Build and Install
+ o Tutorial
+
+
+======================================================================
+Introduction
+======================================================================
+
+Welcome to NutchWAX 0.12.4!
+
+NutchWAX is a set of add-ons to Nutch in order to index and search
+archived web data.
+
+These add-ons are developed and maintained by the Internet Archive Web
+Team in conjunction with a broad community of contributors, partners
+and end-users.
+
+The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
+
+Since NutchWAX is a set of add-ons to Nutch, you should already be
+familiar with Nutch before using NutchWAX.
+
+
+The goal of NutchWAX is to enable full-text indexing and searching of
+documents stored in web archive file formats (ARC and WARC).
+
+The way we achieve that goal is by providing plugins and add-on tools
+to Nutch to read documents directly from ARC/WARC files.  We call this
+process "importing" archive files.
+
+Importing produces a Nutch segment, the same as when Nutch is used to
+crawl documents itself.  In essence, document importing replaces the
+conventional "generate/fetch/update" cycle of Nutch.
+
+Once the archival documents have been imported into a segment, the
+regular Nutch commands to index the document contents can proceed as
+normal.
+
+======================================================================
+
+The main NutchWAX add-ons are:
+
+ bin/nutchwax
+
+   A shell script that is used to run the NutchWAX commands, such as
+   document importing.
+
+   This is patterned after the 'bin/nutch' shell script.
+
+ plugins/index-nutchwax
+
+   Indexing plugin which adds NutchWAX-specific metadata fields to the
+   indexed document.
+
+ plugins/query-nutchwax
+
+   Query plugin which allows for querying against the metadata fields
+   added by 'index-nutchwax'.
+
+ plugins/urlfilter-nutchwax
+
+   Filtering plugin which can be used to exclude URLs from import.  It
+   can be used as part of a NutchWAX de-duplication scheme.
+
+ plugins/scoring-nutchwax
+
+   Scoring plugin for use at index-time which reads from an external
+   "pagerank.txt" file for scoring documents based on the log10 of the
+   number of inlinks to a document.
+
+   The use of this plugin is optional but can improve the quality of
+   search results, especially for very large collections.
+
+ conf/nutch-site.xml
+
+   Additional configuration properties for NutchWAX, including
+   over-rides for properties defined in 'nutch-default.xml'
+
+There is no separate 'lib/nutchwax.jar' file for NutchWAX.  NutchWAX
+is distributed in source code form and is intended to be built in
+conjunction with Nutch.
+
+
+======================================================================
+Build and Install
+======================================================================
+
+See "INSTALL.txt" for detailed instructions to build NutchWAX from
+source or install a binary package.
+
+
+======================================================================
+Tutorial
+======================================================================
+
+See "HOWTO.txt" for a quick tutorial on importing, indexing and
+searching a set of documents in a web archive file.

Deleted: tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt	2009-05-05 21:46:40 UTC (rev 2703)
+++ tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt	2009-05-05 22:17:48 UTC (rev 2704)
@@ -1,58 +0,0 @@
-
-RELEASE-NOTES.TXT
-2008-03-08
-Aaron Binns
-
-Release notes for NutchWAX 0.12.4
-
-For the most recent updates and information on NutchWAX,
-please visit the project wiki at:
-
-  http://webteam.archive.org/confluence/display/search/NutchWAX
-
-
-======================================================================
-Overview
-======================================================================
-
-NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
-
-  o Option to omit storing of content during import.
-  o Support for per-collection segments in master/slave config.
-  o Additional diagnostic/log messages to help troubleshoot common
-    deployment mistakes.
-  o PageRankDb similar to LinkDb but only keeping inlink counts.
-  o Improved paging through results, handling "paging past the end".
-
-
-======================================================================
-Issues
-======================================================================
-
-For an up-to-date list of NutchWAX issues:
-
-  http://webteam.archive.org/jira/browse/WAX
-
-Issues resolved in this release:
-
-WAX-27 Sensible output for requesting page of results past the end.
-
-WAX-34 Add option to omit storing of content in segment
-
-WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
-       rather than actual inlinks.
-
-WAX-36 Some additional diagnostics on connecting results to segments
-       and snippets would be very helpful.
-
-WAX-37 Per-collection segments not supported in distributed
-       master-slave configuration.
-
-WAX-38 Build omits necessary libraries from .job file.
-
-WAX-39 Write more efficient, specialized segment parse_text merging.
-
-
-
-
-

Copied: tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt (from rev 2703, trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt)
===================================================================
--- tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt	                        (rev 0)
+++ tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt	2009-05-05 22:17:48 UTC (rev 2704)
@@ -0,0 +1,57 @@
+
+RELEASE-NOTES.TXT
+2009-05-05
+Aaron Binns
+
+Release notes for NutchWAX 0.12.4
+
+For the most recent updates and information on NutchWAX,
+please visit the project wiki at:
+
+  http://webteam.archive.org/confluence/display/search/NutchWAX
+
+
+======================================================================
+Overview
+======================================================================
+
+NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
+
+  o Option to omit storing of content during import.
+  o Support for per-collection segments in master/slave config.
+  o Additional diagnostic/log messages to help troubleshoot common
+    deployment mistakes.
+  o PageRankDb similar to LinkDb but only keeping inlink counts.
+  o Improved paging through results, handling "paging past the end".
+
+
+======================================================================
+Issues
+======================================================================
+
+For an up-to-date list of NutchWAX issues:
+
+  http://webteam.archive.org/jira/browse/WAX
+
+Issues resolved in this release:
+
+WAX-27 Sensible output for requesting page of results past the end.
+
+WAX-34 Add option to omit storing of content in segment
+
+WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
+       rather than actual inlinks.
+
+WAX-36 Some additional diagnostics on connecting results to segments
+       and snippets would be very helpful.
+
+WAX-37 Per-collection segments not supported in distributed
+       master-slave configuration.
+
+WAX-38 Build omits necessary libraries from .job file.
+
+WAX-39 Write more efficient, specialized segment parse_text merging.
+
+WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher
+
+WAX-42 Add option to continue importing if an arcfile cannot be read.

