[Archive-access-cvs] SF.net SVN: archive-access: [2399] trunk/archive-access/projects/nutchwax/ arc

Revision: 2399
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2399&view=rev
Author:   binzino
Date:     2008-07-02 19:03:41 -0700 (Wed, 02 Jul 2008)

Log Message:
-----------
Updated with changes in RC-1.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/README.txt

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================

--- trunk/archive-access/projects/nutchwax/archive/README.txt	2008-07-03 02:03:09 UTC (rev 2398)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt	2008-07-03 02:03:41 UTC (rev 2399)
@@ -1,6 +1,6 @@
 
 README.txt
-2008-05-20
+2008-07-02
 Aaron Binns
 
 Welcome to NutchWAX 0.12!
@@ -22,13 +22,13 @@
 The goal of NutchWAX is to enable full-text indexing and searching of
 documents stored in web archive file formats (ARC and WARC).
 
-The way we achieve that goal is by providing add-on tools and plugins
+The way we achieve that goal is by providing plugins and add-on tools
 to Nutch to read documents directly from ARC/WARC files.  We call this
 process "importing" archive files.
 
-Importing produces a Nutch segment, the same as if Nutch had actually
-crawled the documents itself.  In this scenario, document importing
-replaces the conventional "generate/fetch/update" cycle of Nutch.
+Importing produces a Nutch segment, similar to Nutch crawling the
+documents itself.  In this scenario, document importing replaces the
+conventional "generate/fetch/update" cycle of Nutch.
 
 Once the archival documents have been imported into a segment, the
 regular Nutch commands to update the 'crawldb', invert the links and
@@ -36,12 +36,12 @@
 
 ======================================================================
 
-The NutchWAX add-ons consist of:
+The main NutchWAX add-ons are:
 
  bin/nutchwax
 
-   A shell script that is used to run the NutchWAX command-line tools,
-   such as document importing.
+   A shell script that is used to run the NutchWAX commands, such as
+   document importing.
 
    This is patterned after the 'bin/nutch' shell script.
 
@@ -55,6 +55,16 @@
    Query plugin which allows for querying against the metadata fields
    added by 'index-nutchwax'.
 
+ plugins/urlfilter-nutchwax
+
+   Filtering plugin which can be used to exclude URLs from import.  It
+   can be used as part of a NutchWAX de-duplication scheme.
+
+ conf/nutch-site.xml
+
+   Sample configuration properties file showing suggested settings for
+   Nutch and NutchWAX.
+
 There is no separate 'lib/nutchwax.jar' file for NutchWAX.  NutchWAX
 is distributed in source code form and is intended to be built in
 conjunction with Nutch.
@@ -84,7 +94,7 @@
 already familiar with the inner workings of Nutch.  Still, special
 attention on one class is worth while:
 
-  src/java/org/archive/nutchwax/ArcsToSegment.java
+  src/java/org/archive/nutchwax/Importer.java
 
 This is where ARC/WARC files are read and their documents are imported
 into a Nutch segment.
@@ -113,10 +123,14 @@
   o We add metadata fields to the document, which are then available
     to the "index-nutchwax" plugin at indexing-time.
 
-    ArcsToSegment.importRecord()
+    Importer.importRecord()
       ...
       contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype()          );
       contentMetadata.set( NutchWax.ARCNAME_KEY,      meta.getArcFile().getName() );
       contentMetadata.set( NutchWax.COLLECTION_KEY,   collectionName              );
       contentMetadata.set( NutchWax.DATE_KEY,         meta.getDate()              );
       ...
+
+
+======================================================================
+
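
The importRecord() excerpt in the diff above attaches four metadata values to each imported document so that the "index-nutchwax" plugin can read them at indexing time. A minimal sketch of that pattern follows, using a plain Map as a stand-in for the Nutch metadata object. The class name, the buildMetadata helper, the key strings and the sample values are all assumptions made for illustration; the real constants are defined in NutchWax and their values do not appear in this commit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ImportMetadataSketch {
    // Hypothetical key strings: the diff only shows the constant names
    // (NutchWax.CONTENT_TYPE_KEY and so on), not their values.
    static final String CONTENT_TYPE_KEY = "content-type";
    static final String ARCNAME_KEY      = "arcname";
    static final String COLLECTION_KEY   = "collection";
    static final String DATE_KEY         = "date";

    // Collects the four per-document values the excerpt sets at import time.
    // A Map stands in for the Nutch metadata object (an assumption).
    static Map<String, String> buildMetadata(String mimetype,
                                             String arcFileName,
                                             String collectionName,
                                             String date) {
        Map<String, String> contentMetadata = new LinkedHashMap<>();
        contentMetadata.put(CONTENT_TYPE_KEY, mimetype);
        contentMetadata.put(ARCNAME_KEY,      arcFileName);
        contentMetadata.put(COLLECTION_KEY,   collectionName);
        contentMetadata.put(DATE_KEY,         date);
        return contentMetadata;
    }

    public static void main(String[] args) {
        // Sample values are invented for the demonstration.
        Map<String, String> meta = buildMetadata(
                "text/html", "example.arc.gz", "my-collection", "20080702190341");
        meta.forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

An indexing-time consumer would then look up the same keys on each document, which is why the importer and the plugin have to agree on the key names.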





