From: <bi...@us...> - 2009-05-05 22:18:28
Revision: 2704
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2704&view=rev
Author:   binzino
Date:     2009-05-05 22:17:48 +0000 (Tue, 05 May 2009)

Log Message:
-----------
Oops, didn't have the updated versions checked-in when I did the release copy. Fixed.

Added Paths:
-----------
    tags/nutchwax-0_12_4/archive/README.txt
    tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt

Removed Paths:
-------------
    tags/nutchwax-0_12_4/archive/README.txt
    tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt

Deleted: tags/nutchwax-0_12_4/archive/README.txt
===================================================================
--- tags/nutchwax-0_12_4/archive/README.txt	2009-05-05 21:46:40 UTC (rev 2703)
+++ tags/nutchwax-0_12_4/archive/README.txt	2009-05-05 22:17:48 UTC (rev 2704)
@@ -1,104 +0,0 @@
-
-README.txt
-2008-03-08
-Aaron Binns
-
-Table of Contents
-  o Introduction
-  o Build and Install
-  o Tutorial
-
-
-======================================================================
-Introduction
-======================================================================
-
-Welcome to NutchWAX 0.12.4!
-
-NutchWAX is a set of add-ons to Nutch in order to index and search
-archived web data.
-
-These add-ons are developed and maintained by the Internet Archive Web
-Team in conjunction with a broad community of contributors, partners
-and end-users.
-
-The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
-
-Since NutchWAX is a set of add-ons to Nutch, you should already be
-familiar with Nutch before using NutchWAX.
-
-
-The goal of NutchWAX is to enable full-text indexing and searching of
-documents stored in web archive file formats (ARC and WARC).
-
-The way we achieve that goal is by providing plugins and add-on tools
-to Nutch to read documents directly from ARC/WARC files. We call this
-process "importing" archive files.
-
-Importing produces a Nutch segment, the same as when Nutch is used to
-crawl documents itself. In essence, document importing replaces the
-conventional "generate/fetch/update" cycle of Nutch.
-
-Once the archival documents have been imported into a segment, the
-regular Nutch commands to index the document contents can proceed as
-normal.
-
-======================================================================
-
-The main NutchWAX add-ons are:
-
-  bin/nutchwax
-
-    A shell script that is used to run the NutchWAX commands, such as
-    document importing.
-
-    This is patterned after the 'bin/nutch' shell script.
-
-  plugins/index-nutchwax
-
-    Indexing plugin which adds NutchWAX-specific metadata fields to the
-    indexed document.
-
-  plugins/query-nutchwax
-
-    Query plugin which allows for querying against the metadata fields
-    added by 'index-nutchwax'.
-
-  plugins/urlfilter-nutchwax
-
-    Filtering plugin which can be used to exclude URLs from import. It
-    can be used as part of a NutchWAX de-duplication scheme.
-
-  plugins/scoring-nutchwax
-
-    Scoring plugin for use at index-time which reads from an external
-    "pagerank.txt" file for scoring documents based on the log10 of the
-    number of inlinks to a document.
-
-    The use of this plugin is optional but can improve the quality of
-    search results, especially for very large collections.
-
-  conf/nutch-site.xml
-
-    Additional configuration properties for NutchWAX, including
-    over-rides for properties defined in 'nutch-default.xml'
-
-There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX
-is distributed in source code form and is intended to be built in
-conjunction with Nutch.
-
-
-======================================================================
-Build and Install
-======================================================================
-
-See "INSTALL.txt" for detailed instructions to build NutchWAX from
-source or install a binary package.
-
-
-======================================================================
-Tutorial
-======================================================================
-
-See "HOWTO.txt" for a quick tutorial on importing, indexing and
-searching a set of documents in a web archive file.

Copied: tags/nutchwax-0_12_4/archive/README.txt (from rev 2703, trunk/archive-access/projects/nutchwax/archive/README.txt)
===================================================================
--- tags/nutchwax-0_12_4/archive/README.txt	                        (rev 0)
+++ tags/nutchwax-0_12_4/archive/README.txt	2009-05-05 22:17:48 UTC (rev 2704)
@@ -0,0 +1,104 @@
+
+README.txt
+2009-05-05
+Aaron Binns
+
+Table of Contents
+  o Introduction
+  o Build and Install
+  o Tutorial
+
+
+======================================================================
+Introduction
+======================================================================
+
+Welcome to NutchWAX 0.12.4!
+
+NutchWAX is a set of add-ons to Nutch in order to index and search
+archived web data.
+
+These add-ons are developed and maintained by the Internet Archive Web
+Team in conjunction with a broad community of contributors, partners
+and end-users.
+
+The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
+
+Since NutchWAX is a set of add-ons to Nutch, you should already be
+familiar with Nutch before using NutchWAX.
+
+
+The goal of NutchWAX is to enable full-text indexing and searching of
+documents stored in web archive file formats (ARC and WARC).
+
+The way we achieve that goal is by providing plugins and add-on tools
+to Nutch to read documents directly from ARC/WARC files. We call this
+process "importing" archive files.
+
+Importing produces a Nutch segment, the same as when Nutch is used to
+crawl documents itself. In essence, document importing replaces the
+conventional "generate/fetch/update" cycle of Nutch.
+
+Once the archival documents have been imported into a segment, the
+regular Nutch commands to index the document contents can proceed as
+normal.
+
+======================================================================
+
+The main NutchWAX add-ons are:
+
+  bin/nutchwax
+
+    A shell script that is used to run the NutchWAX commands, such as
+    document importing.
+
+    This is patterned after the 'bin/nutch' shell script.
+
+  plugins/index-nutchwax
+
+    Indexing plugin which adds NutchWAX-specific metadata fields to the
+    indexed document.
+
+  plugins/query-nutchwax
+
+    Query plugin which allows for querying against the metadata fields
+    added by 'index-nutchwax'.
+
+  plugins/urlfilter-nutchwax
+
+    Filtering plugin which can be used to exclude URLs from import. It
+    can be used as part of a NutchWAX de-duplication scheme.
+
+  plugins/scoring-nutchwax
+
+    Scoring plugin for use at index-time which reads from an external
+    "pagerank.txt" file for scoring documents based on the log10 of the
+    number of inlinks to a document.
+
+    The use of this plugin is optional but can improve the quality of
+    search results, especially for very large collections.
+
+  conf/nutch-site.xml
+
+    Additional configuration properties for NutchWAX, including
+    over-rides for properties defined in 'nutch-default.xml'
+
+There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX
+is distributed in source code form and is intended to be built in
+conjunction with Nutch.
+
+
+======================================================================
+Build and Install
+======================================================================
+
+See "INSTALL.txt" for detailed instructions to build NutchWAX from
+source or install a binary package.
+
+
+======================================================================
+Tutorial
+======================================================================
+
+See "HOWTO.txt" for a quick tutorial on importing, indexing and
+searching a set of documents in a web archive file.

Deleted: tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt
===================================================================
--- tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt	2009-05-05 21:46:40 UTC (rev 2703)
+++ tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt	2009-05-05 22:17:48 UTC (rev 2704)
@@ -1,58 +0,0 @@
-
-RELEASE-NOTES.TXT
-2008-03-08
-Aaron Binns
-
-Release notes for NutchWAX 0.12.4
-
-For the most recent updates and information on NutchWAX,
-please visit the project wiki at:
-
-  http://webteam.archive.org/confluence/display/search/NutchWAX
-
-
-======================================================================
-Overview
-======================================================================
-
-NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
-
-  o Option to omit storing of content during import.
-  o Support for per-collection segments in master/slave config.
-  o Additional diagnostic/log messages to help troubleshoot common
-    deployment mistakes.
-  o PageRankDb similar to LinkDb but only keeping inlink counts.
-  o Improved paging through results, handling "paging past the end".
-
-
-======================================================================
-Issues
-======================================================================
-
-For an up-to-date list of NutchWAX issues:
-
-  http://webteam.archive.org/jira/browse/WAX
-
-Issues resolved in this release:
-
-WAX-27 Sensible output for requesting page of results past the end.
-
-WAX-34 Add option to omit storing of content in segment
-
-WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
-       rather than actual inlinks.
-
-WAX-36 Some additional diagnostics on connecting results to segments
-       and snippets would be very helpful.
-
-WAX-37 Per-collection segments not supported in distributed
-       master-slave configuration.
-
-WAX-38 Build omits neessary libraries from .job file.
-
-WAX-39 Write more efficient, specialized segment parse_text merging.
-
-
-
-
-

Copied: tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt (from rev 2703, trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt)
===================================================================
--- tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt	                        (rev 0)
+++ tags/nutchwax-0_12_4/archive/RELEASE-NOTES.txt	2009-05-05 22:17:48 UTC (rev 2704)
@@ -0,0 +1,57 @@
+
+RELEASE-NOTES.TXT
+2009-05-05
+Aaron Binns
+
+Release notes for NutchWAX 0.12.4
+
+For the most recent updates and information on NutchWAX,
+please visit the project wiki at:
+
+  http://webteam.archive.org/confluence/display/search/NutchWAX
+
+
+======================================================================
+Overview
+======================================================================
+
+NutchWAX 0.12.4 contains numerous enhancements and fixes to 0.12.3
+
+  o Option to omit storing of content during import.
+  o Support for per-collection segments in master/slave config.
+  o Additional diagnostic/log messages to help troubleshoot common
+    deployment mistakes.
+  o PageRankDb similar to LinkDb but only keeping inlink counts.
+  o Improved paging through results, handling "paging past the end".
+
+
+======================================================================
+Issues
+======================================================================
+
+For an up-to-date list of NutchWAX issues:
+
+  http://webteam.archive.org/jira/browse/WAX
+
+Issues resolved in this release:
+
+WAX-27 Sensible output for requesting page of results past the end.
+
+WAX-34 Add option to omit storing of content in segment
+
+WAX-35 Add pagerankdb similar to linkdb but which only keeps counts
+       rather than actual inlinks.
+
+WAX-36 Some additional diagnostics on connecting results to segments
+       and snippets would be very helpful.
+
+WAX-37 Per-collection segments not supported in distributed
+       master-slave configuration.
+
+WAX-38 Build omits neessary libraries from .job file.
+
+WAX-39 Write more efficient, specialized segment parse_text merging.
+
+WAX-41 Option to enable/disable the FIELDCACHE in the Nutch IndexSearcher
+
+WAX-42 Add option to continue importing if an arcfile cannot be read.

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
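
Note on the scoring description in the README above: the scoring-nutchwax
plugin is said to score documents by the log10 of their inlink count, read
from an external "pagerank.txt" file. The sketch below only illustrates that
arithmetic; it is not the plugin's code. The "<inlink_count> <url>" line
layout, the function names, and the choice to treat zero inlinks as a score
of 0.0 are all assumptions made for this example and are not taken from the
commit.

```python
import math


def parse_pagerank_lines(lines):
    """Parse lines of the ASSUMED form '<inlink_count> <url>' into a dict.

    The real pagerank.txt layout is not given in the README; this layout
    is hypothetical.
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        count_text, url = line.split(None, 1)
        counts[url] = int(count_text)
    return counts


def inlink_score(inlink_count):
    """Return log10 of the inlink count.

    Assumption: counts below 1 are clamped to 1 so that a document with
    no inlinks scores 0.0 instead of raising a math domain error.
    """
    return math.log10(max(inlink_count, 1))


# Hypothetical sample data, not real NutchWAX output.
sample_lines = [
    "1000 http://example.org/",
    "10 http://example.org/about",
    "0 http://example.org/orphan",
]

counts = parse_pagerank_lines(sample_lines)
scores = {url: inlink_score(n) for url, n in counts.items()}

for url, score in sorted(scores.items()):
    print(f"{score:.2f} {url}")
```

Because the score grows with the logarithm of the count, a page with 1000
inlinks scores only three times as high as one with 10, which keeps heavily
linked pages from dominating the ranking.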