From: <bi...@us...> - 2008-07-03 02:03:32
Revision: 2399
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2399&view=rev
Author:   binzino
Date:     2008-07-02 19:03:41 -0700 (Wed, 02 Jul 2008)

Log Message:
-----------
Updated with changes in RC-1.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/README.txt

Modified: trunk/archive-access/projects/nutchwax/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/README.txt 2008-07-03 02:03:09 UTC (rev 2398)
+++ trunk/archive-access/projects/nutchwax/archive/README.txt 2008-07-03 02:03:41 UTC (rev 2399)
@@ -1,6 +1,6 @@
 
 README.txt
-2008-05-20
+2008-07-02
 Aaron Binns
 
 Welcome to NutchWAX 0.12!
@@ -22,13 +22,13 @@
 The goal of NutchWAX is to enable full-text indexing and searching of
 documents stored in web archive file formats (ARC and WARC).
 
-The way we achieve that goal is by providing add-on tools and plugins
+The way we achieve that goal is by providing plugins and add-on tools
 to Nutch to read documents directly from ARC/WARC files. We call this
 process "importing" archive files.
 
-Importing produces a Nutch segment, the same as if Nutch had actually
-crawled the documents itself. In this scenario, document importing
-replaces the conventional "generate/fetch/update" cycle of Nutch.
+Importing produces a Nutch segment, similar to Nutch crawling the
+documents itself. In this scenario, document importing replaces the
+conventional "generate/fetch/update" cycle of Nutch.
 
 Once the archival documents have been imported into a segment, the
 regular Nutch commands to update the 'crawldb', invert the links and
@@ -36,12 +36,12 @@
 
 ======================================================================
 
-The NutchWAX add-ons consist of:
+The main NutchWAX add-ons are:
 
   bin/nutchwax
 
-    A shell script that is used to run the NutchWAX command-line tools,
-    such as document importing.
+    A shell script that is used to run the NutchWAX commands, such as
+    document importing.
 
     This is patterned after the 'bin/nutch' shell script.
 
@@ -55,6 +55,16 @@
     Query plugin which allows for querying against the metadata fields
     added by 'index-nutchwax'.
 
+  plugins/urlfilter-nutchwax
+
+    Filtering plugin which can be used to exclude URLs from import. It
+    can be used as part of a NutchWAX de-duplication scheme.
+
+  conf/nutch-site.xml
+
+    Sample configuration properties file showing suggested settings for
+    Nutch and NutchWAX.
+
 There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX
 is distributed in source code form and is intended to be built in
 conjunction with Nutch.
@@ -84,7 +94,7 @@
 already familiar with the inner workings of Nutch. Still, special
 attention on one class is worth while:
 
-  src/java/org/archive/nutchwax/ArcsToSegment.java
+  src/java/org/archive/nutchwax/Importer.java
 
 This is where ARC/WARC files are read and their documents are
 imported into a Nutch segment.
@@ -113,10 +123,14 @@
  o We add metadata fields to the document, which are then available
    to the "index-nutchwax" plugin at indexing-time.
 
-    ArcsToSegment.importRecord()
+    Importer.importRecord()
     ...
     contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() );
     contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() );
     contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName );
     contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() );
     ...
+
+
+======================================================================
+