Revision: 2399
http://archive-access.svn.sourceforge.net/archive-access/?rev=2399&view=rev
Author: binzino
Date: 2008-07-02 19:03:41 -0700 (Wed, 02 Jul 2008)
Log Message:
-----------
Updated with changes in RC-1.
Modified Paths:
--------------
trunk/archive-access/projects/nutchwax/archive/README.txt
Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt 2008-07-03 02:03:09 UTC (rev 2398)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt 2008-07-03 02:03:41 UTC (rev 2399)
@@ -1,6 +1,6 @@
README.txt
-2008-05-20
+2008-07-02
Aaron Binns
Welcome to NutchWAX 0.12!
@@ -22,13 +22,13 @@
The goal of NutchWAX is to enable full-text indexing and searching of
documents stored in web archive file formats (ARC and WARC).
-The way we achieve that goal is by providing add-on tools and plugins
+The way we achieve that goal is by providing plugins and add-on tools
to Nutch to read documents directly from ARC/WARC files. We call this
process "importing" archive files.
-Importing produces a Nutch segment, the same as if Nutch had actually
-crawled the documents itself. In this scenario, document importing
-replaces the conventional "generate/fetch/update" cycle of Nutch.
+Importing produces a Nutch segment, similar to Nutch crawling the
+documents itself. In this scenario, document importing replaces the
+conventional "generate/fetch/update" cycle of Nutch.
Once the archival documents have been imported into a segment, the
regular Nutch commands to update the 'crawldb', invert the links and
@@ -36,12 +36,12 @@
======================================================================
-The NutchWAX add-ons consist of:
+The main NutchWAX add-ons are:
bin/nutchwax
- A shell script that is used to run the NutchWAX command-line tools,
- such as document importing.
+ A shell script that is used to run the NutchWAX commands, such as
+ document importing.
This is patterned after the 'bin/nutch' shell script.
@@ -55,6 +55,16 @@
Query plugin which allows for querying against the metadata fields
added by 'index-nutchwax'.
+ plugins/urlfilter-nutchwax
+
+ Filtering plugin which can be used to exclude URLs from import. It
+ can be used as part of a NutchWAX de-duplication scheme.
+
+ conf/nutch-site.xml
+
+ Sample configuration properties file showing suggested settings for
+ Nutch and NutchWAX.
+
There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX
is distributed in source code form and is intended to be built in
conjunction with Nutch.
@@ -84,7 +94,7 @@
already familiar with the inner workings of Nutch. Still, special
attention on one class is worth while:
- src/java/org/archive/nutchwax/ArcsToSegment.java
+ src/java/org/archive/nutchwax/Importer.java
This is where ARC/WARC files are read and their documents are imported
into a Nutch segment.
@@ -113,10 +123,14 @@
o We add metadata fields to the document, which are then available
to the "index-nutchwax" plugin at indexing-time.
- ArcsToSegment.importRecord()
+ Importer.importRecord()
...
contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() );
contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() );
contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName );
contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() );
...
+
+
+======================================================================
+
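Note on the Importer.importRecord() excerpt in the last hunk: the
four contentMetadata.set() calls attach per-document fields at import
time so that the "index-nutchwax" plugin can read them at indexing
time. The following is a minimal, self-contained sketch of that idea
only. It uses a plain java.util.Map in place of the Nutch metadata
object, and the class name, method name and key strings are all
hypothetical placeholders; the real values of the NutchWax.*_KEY
constants are not shown in this diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; this is not the NutchWAX source.
class ImportMetadataSketch {

    // Hypothetical key strings standing in for NutchWax.CONTENT_TYPE_KEY,
    // NutchWax.ARCNAME_KEY, NutchWax.COLLECTION_KEY and NutchWax.DATE_KEY.
    static final String CONTENT_TYPE_KEY = "content-type";
    static final String ARCNAME_KEY      = "arcname";
    static final String COLLECTION_KEY   = "collection";
    static final String DATE_KEY         = "date";

    // Collects the four per-document fields into one map, mirroring the
    // order of the set() calls in the excerpt above.
    static Map<String, String> buildMetadata(String mimetype,
                                             String arcFileName,
                                             String collectionName,
                                             String date) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(CONTENT_TYPE_KEY, mimetype);
        metadata.put(ARCNAME_KEY, arcFileName);
        metadata.put(COLLECTION_KEY, collectionName);
        metadata.put(DATE_KEY, date);
        return metadata;
    }

    public static void main(String[] args) {
        // Example values are made up for illustration.
        Map<String, String> metadata = buildMetadata(
            "text/html", "example.arc.gz", "my-collection", "20080702190341");
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
    }
}
```

An indexing-time consumer would then look the fields up by the same
keys, which is why the import side and the index side need to agree
on the constant names.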