From: <bi...@us...> - 2008-12-18 18:37:45
Revision: 2678 http://archive-access.svn.sourceforge.net/archive-access/?rev=2678&view=rev Author: binzino Date: 2008-12-18 18:37:40 +0000 (Thu, 18 Dec 2008) Log Message: ----------- Updated documentation for 0.12.3 release. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt trunk/archive-access/projects/nutchwax/archive/HOWTO.txt trunk/archive-access/projects/nutchwax/archive/INSTALL.txt trunk/archive-access/projects/nutchwax/archive/README.txt trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt Added: trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/BUILD-NOTES.txt 2008-12-18 18:37:40 UTC (rev 2678) @@ -0,0 +1,392 @@ + +BUILD-NOTES.txt +2008-12-18 +Aaron Binns + +====================================================================== +Build notes +====================================================================== + +This document contains supplemental notes regarding the NutchWAX +build, expanding upon the information found in the various READMEs and +HOWTOs. + +====================================================================== + +This 0.12.x release of NutchWAX is radically different in source-code +form compared to the previous release, 0.10. + +One of the design goals of 0.12.x was to reduce or even eliminate the +"copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX +releases had to copy/paste/edit large chunks of Nutch source code in +order to add the NutchWAX features. + +Also, the NutchWAX 0.12.x sources and build are designed to one day be +added into mainline Nutch as a proper "contrib" package; then +eventually be fully integrated into the core Nutch source code. 
+ +====================================================================== + +Most of the NutchWAX source code is relatively straightforward to those +already familiar with the inner workings of Nutch. Still, special +attention to one class is worthwhile: + + src/java/org/archive/nutchwax/Importer.java + +This is where ARC/WARC files are read and their documents are imported +into a Nutch segment. + +It is inspired by: + + nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java + +on the Nutch SVN head. + +Our implementation differs in a few important ways: + + o Rather than taking a directory with ARC files as input, we take a + manifest file with URLs to ARC files. This way, the manifest is + split up among the distributed Hadoop jobs and the ARC files are + processed in whole by each worker. + + In the Nutch SVN, the ArcSegmentCreator.java expects the input + directory to contain the ARC files and (AFAICT) splits them up and + distributes them across the Hadoop workers. + + o We use the standard Internet Archive ARCReader and WARCReader + classes. Thus, NutchWAX can read both ARC and WARC files, whereas + the ArcSegmentCreator class can only read ARC files. + + o We add metadata fields to the document, which are then available + to the "index-nutchwax" plugin at indexing-time. + + Importer.importRecord() + ... + contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() ); + contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() ); + contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName ); + contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() ); + ... + + +====================================================================== +Patching +====================================================================== + +When NutchWAX is built, a number of patches are automatically applied +to the Nutch source and configuration files. 
+ +---------------------------------------------------------------------- +The file + + /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml + +contains two errors: one where a mimetype is referenced before it is +defined; and a second where a definition has an illegal character. + +These errors cause Nutch to not recognize certain mimetypes and +therefore will ignore documents matching those mimetypes. + +There are two fixes: + + 1. Move + + <mime-type type="application/xml"> + <alias type="text/xml" /> + <glob pattern="*.xml" /> + </mime-type> + + definition higher up in the file, before the reference to it. + + 2. Remove + + <mime-type type="application/x-ms-dos-executable"> + <alias type="application/x-dosexec;exe" /> + </mime-type> + + as the ';' character is illegal according to the comments in the + Nutch code. + +You can either apply these patches yourself, or copy an already-patched +copy from: + + /opt/nutchwax-0.12.3/contrib/archive/conf/tika-mimetypes.xml + +to + + /opt/nutchwax-0.12.3/conf/tika-mimetypes.xml + +---------------------------------------------------------------------- + +In the file 'conf/nutch-site.xml' we define some properties to +over-ride the values in 'conf/nutch-default.xml'. + +-------------------------------------------------- +plugin.includes +-------------------------------------------------- +Change the list of plugins from: + + protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) + +to + + protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax + +In short, we add: + + index-nutchwax + query-nutchwax + urlfilter-nutchwax + parse-pdf + +and remove: + + urlfilter-regex + urlnormalizer-(pass|regex|basic) + +The only *required* changes are the additions of the NutchWAX index +and query plugins. The rest are optional, but recommended. 
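The effect of the 'plugin.includes' change described above can be checked with a short sketch. It assumes that Nutch treats the property value as a regular expression that must match a whole plugin id; the list of plugin ids below is only an illustrative sample, not a full inventory of the Nutch plugin directory.

```python
import re

# Old and new 'plugin.includes' values, copied from the text above.
OLD = ("protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)"
       "|query-(basic|site|url)|summary-basic|scoring-opic"
       "|urlnormalizer-(pass|regex|basic)")
NEW = ("protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)"
       "|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic"
       "|urlfilter-nutchwax")

# Illustrative sample of plugin ids (an assumption, not the complete list).
PLUGIN_IDS = [
    "protocol-http", "parse-html", "parse-pdf", "index-basic",
    "index-nutchwax", "query-nutchwax", "urlfilter-regex",
    "urlfilter-nutchwax", "urlnormalizer-pass", "urlnormalizer-regex",
    "urlnormalizer-basic",
]


def enabled(pattern, plugin_ids):
    """Return the plugin ids that the whole pattern matches."""
    rx = re.compile(pattern)
    return {p for p in plugin_ids if rx.fullmatch(p)}


old_set = enabled(OLD, PLUGIN_IDS)
new_set = enabled(NEW, PLUGIN_IDS)
added = sorted(new_set - old_set)
removed = sorted(old_set - new_set)
print("added:  ", added)
print("removed:", removed)
```

Running the same comparison against the plugin ids actually present in an installation is a quick way to confirm that an edited 'plugin.includes' value enables what was intended.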
+ +The "parse-pdf" plugin is added simply because we have lots of PDFs in +our archives and we want to index them. We sometimes remove the +"parse-js" plugin if we don't care to index JavaScript files. + +We also remove the default Nutch URL filtering and normalizing plugins +because we do not need the URLs normalized nor filtered. We trust +that the tool that produced the ARC/WARC file will have normalized the +URLs contained therein according to its own rules so there's no need +to normalize here. Also, we don't filter by URL since we want to +index as much of the ARC/WARC file as we have parsers for. + +We do, however, add the NutchWAX URL filter. If de-duplication is +being performed upon import, this plugin is required. It performs URL +filtering of the list of ARC records to exclude based on +URL+digest+date. + +-------------------------------------------------- +indexingfilter.order +-------------------------------------------------- + +Add this property with a value of + + org.apache.nutch.indexer.basic.BasicIndexingFilter + org.archive.nutchwax.index.ConfigurableIndexingFilter + +So that the NutchWAX indexing filter is run after the Nutch basic +indexing filter. + +A full explanation is given in "README-dedup.txt". + +-------------------------------------------------- +mime.type.magic +-------------------------------------------------- +We disable mimetype detection in Nutch for two reasons: + +1. The ARC/WARC file specifies the Content-Type of the document. We + trust that the tool that created the ARC/WARC file got it right. + +2. The implementation in Nutch can use a lot of memory as the *entire* + document is read into memory as a byte[], then converted to a + String, then checked against the MIME database. This can lead to + out of memory errors for large files, such as music and video. + +To disable, simply set the property value to false. 
+ + <property> + <name>mime.type.magic</name> + <value>false</value> + </property> + +-------------------------------------------------- +nutchwax.filter.index +-------------------------------------------------- +Configure the 'index-nutchwax' plugin. Specify how the metadata +fields added by the Importer are mapped to the Lucene documents during +indexing. + +The specifications here are of the form: + + src-key:lowercase:store:tokenize:exclusive:dest-key + +where the only required part is the "src-key", the rest will assume +the following defaults: + + lowercase = true + store = true + tokenize = false + exclusive = true + dest-key = src-key + +We recommend: + +<property> + <name>nutchwax.filter.index</name> + <value> + url:false:true:true + url:false:true:false:true:exacturl + orig:false + digest:false + filename:false + fileoffset:false + collection + date + type + length + </value> +</property> + +The "url", "orig" and "digest" values are required, the rest are +optional, but strongly recommended. + +-------------------------------------------------- +nutchwax.filter.query +-------------------------------------------------- +Configure the 'query-nutchwax' plugin. Specify which fields to make +searchable via "field:[term|phrase]" query syntax, and whether they +are "raw" fields or not. + +The specification format is one of: + + field:<name>:<boost> + raw:<name>:<lowercase>:<boost> + group:<name>:<lowercase>:<delimiter>:<boost> + +Default values are + + lowercase = true + delimiter = "," + boost = 1.0f + +There is no "lowercase" property for "field" specification because the +Nutch FieldQueryFilter doesn't expose the option, unlike the +RawFieldQueryFilter. + +The "group" fields are raw fields that can accept multiple values, +separated by a delimiter. 
Multiple values appearing in a query are +automagically translated into required OR-groups, such as + + collection:"193,221,36" => +(collection:193 collection:221 collection:36) + +NOTE: We do *not* use this filter for handling "date" queries; there +is a specific filter for that: DateQueryFilter + +We recommend: + +<property> + <name>nutchwax.filter.query</name> + <value> + raw:digest:false + raw:filename:false + raw:fileoffset:false + raw:exacturl:false + group:collection + group:type + field:anchor + field:content + field:host + field:title + </value> +</property> + + +-------------------------------------------------- +nutchwax.urlfilter.wayback.exclusions +-------------------------------------------------- +File containing the exclusion list for importing. + +Normally, this is specified on the command line when the NutchWAX +Importer is invoked. It can be specified here if preferred. + +-------------------------------------------------- +nutchwax.urlfilter.wayback.canonicalizer +-------------------------------------------------- + +For CDX-based de-duplication, the same URL canonicalization algorithm +must be used here as was used to generate the CDX files. + +The default canonicalizer in Wayback's '(w)arc-indexer' utility +is + + org.archive.wayback.util.url.AggressiveUrlCanonicalizer + +which is the value provided in "nutch-site.xml". + +If the '(w)arc-indexer' is executed with the "-i" (identity) +command-line option, then the matching canonicalizer + + org.archive.wayback.util.url.IdentityUrlCanonicalizer + +must be specified here. + +-------------------------------------------------- +nutchwax.filter.http.status +-------------------------------------------------- +This property configures a filter with a list of ranges +of HTTP status codes to allow. + +Typically, most NutchWAX implementors do not wish to import and index +404, 500, 302 and other non-success pages. 
This is an inclusion +filter, meaning that only ARC records with an HTTP status code +matching any of the values will be imported. + +There is a special "unknown" value which can be used to include ARC +records that don't have an HTTP status code (for whatever reason). + +The default setting provided in nutch-site.xml is to allow any 2XX +success code: + + <property> + <name>nutchwax.filter.http.status</name> + <value> + 200-299 + </value> + </property> + +But some other examples are: + + Allow any 2XX success code *and* redirects, use: + <property> + <name>nutchwax.filter.http.status</name> + <value> + 200-299 + 300-399 + </value> + </property> + + Be really strict about only certain codes, use: + <property> + <name>nutchwax.filter.http.status</name> + <value> + 200 + 301 + 302 + 304 + </value> + </property> + + Mix of ranges and specific codes, including the "unknown" + <property> + <name>nutchwax.filter.http.status</name> + <value> + Unknown + 200 + 300-399 + </value> + </property> + +-------------------------------------------------- +nutchwax.import.content.limit +-------------------------------------------------- +Similar to Nutch's + + file.content.limit + http.content.limit + ftp.content.limit + +properties, this specifies a limit on the size of a document imported +via NutchWAX. + +We recommend setting this to a size compatible with the memory +capacity of the computers performing the import. Something in the +1-4MB range is typical. + Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt 2008-12-16 19:53:25 UTC (rev 2677) +++ trunk/archive-access/projects/nutchwax/archive/HOWTO-pagerank.txt 2008-12-18 18:37:40 UTC (rev 2678) @@ -31,7 +31,7 @@ in the full-text search index. Nutch's 'invertlinks' step inverts links and stores them in the -'linkdb' directory. 
We use the inlinks to boost the Lucene score of +'linkdb' directory. We use these inlinks to boost the Lucene score of documents in proportion to the number of inlinks. Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-12-16 19:53:25 UTC (rev 2677) +++ trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-12-18 18:37:40 UTC (rev 2678) @@ -5,9 +5,8 @@ Table of Contents o Prerequisites - - Nutch(WAX) installation + - NutchWAX installation - ARC/WARC files - o Configuration & Patching o Create a manifest o Import, Invert and Index o Search @@ -27,7 +26,7 @@ This HOWTO assumes it is installed in - /opt/nutch-1.0-dev + /opt/nutchwax-0.12.3 2. ARC/WARC files. @@ -40,348 +39,6 @@ ====================================================================== -Patching -====================================================================== - -The vanilla NutchWAX as built according to the INSTALL.txt guide is -not quite ready to be used out-of-the-box. - -Before you can use NutchWAX, you must first patch a bug that exists in -the current Nutch SVN head. - -The file - - /opt/nutch-1.0-dev/conf/tika-mimetypes.xml - -contains two errors: one where a mimetype is referenced before it is -defined; and a second where a definition has an illegal character. - -These errors cause Nutch to not recognize certain mimetypes and -therefore will ignore documents matching those mimetypes. - -There are two fixes: - - 1. Move - - <mime-type type="application/xml"> - <alias type="text/xml" /> - <glob pattern="*.xml" /> - </mime-type> - - definition higher up in the file, before the reference to it. - - 2. Remove - - <mime-type type="application/x-ms-dos-executable"> - <alias type="application/x-dosexec;exe" /> - </mime-type> - - as the ';' character is illegal according to the comments in the - Nutch code. 
- -You can either apply these patches yourself, or copy an already-patched -copy from: - - /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml - -to - - /opt/nutch-1.0-dev/conf/tika-mimetypes.xml - - -====================================================================== -Configuring -====================================================================== - -Since we assume that you are already familiar with Nutch, then you -should already be familiar with configuring it. The configuration -is mainly defined in - - /opt/nutch-1.0-dev/conf/nutch-default.xml - -NutchWAX requires the modification of two existing properties and the -addition of two new ones. - -All of the modifications described below can be found in: - - /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml - -You can either apply the configuration changes yourself, or copy that -file to - - /opt/nutch-1.0-dev/conf/nutch-site.xml - --------------------------------------------------- -plugin.includes --------------------------------------------------- -Change the list of plugins from: - - protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) - -to - - protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax - -In short, we add: - - index-nutchwax - query-nutchwax - urlfilter-nutchwax - parse-pdf - -and remove: - - urlfilter-regex - urlnormalizer-(pass|regex|basic) - -The only *required* changes are the additions of the NutchWAX index -and query plugins. The rest are optional, but recommended. - -The "parse-pdf" plugin is added simply because we have lots of PDFs in -our archives and we want to index them. We sometimes remove the -"parse-js" plugin if we don't care to index JavaScript files. 
- -We also remove the default Nutch URL filtering and normalizing plugins -because we do not need the URLs normalized nor filtered. We trust -that the tool that produced the ARC/WARC file will have normalized the -URLs contained therein according to its own rules so there's no need -to normalize here. Also, we don't filter by URL since we want to -index as much of the ARC/WARC file as we have parsers for. - -We do, however, add the NutchWAX URL filter. If de-duplication is -being performed upon import, this plugin is required. It performs URL -filtering of the list of ARC records to exclude based on -URL+digest+date. - --------------------------------------------------- -indexingfilter.order --------------------------------------------------- - -Add this property with a value of - - org.apache.nutch.indexer.basic.BasicIndexingFilter - org.archive.nutchwax.index.ConfigurableIndexingFilter - -So that the NutchWAX indexing filter is run after the Nutch basic -indexing filter. - -A full explanation is given in "README-dedup.txt". - --------------------------------------------------- -mime.type.magic --------------------------------------------------- -We disable mimetype detection in Nutch for two reasons: - -1. The ARC/WARC file specifies the Content-Type of the document. We - trust that the tool that created the ARC/WARC file got it right. - -2. The implementation in Nutch can use a lot of memory as the *entire* - document is read into memory as a byte[], then converted to a - String, then checked against the MIME database. This can lead to - out of memory errors for large files, such as music and video. - -To disable, simply set the property value to false. - - <property> - <name>mime.type.magic</name> - <value>false</value> - </property> - --------------------------------------------------- -nutchwax.filter.index --------------------------------------------------- -Configure the 'index-nutchwax' plugin. 
Specify how the metadata -fields added by the Importer are mapped to the Lucene documents during -indexing. - -The specifications here are of the form: - - src-key:lowercase:store:tokenize:exclusive:dest-key - -where the only required part is the "src-key", the rest will assume -the following defaults: - - lowercase = true - store = true - tokenize = false - exclusive = true - dest-key = src-key - -We recommend: - -<property> - <name>nutchwax.filter.index</name> - <value> - url:false:true:true - url:flase:true:false:true:exacturl - orig:false - digest:false - filename:false - fileoffset:false - collection - date - type - length - </value> -</property> - -The "url", "orig" and "digest" values are required, the rest are -optional, but strongly recommended. - --------------------------------------------------- -nutchwax.filter.query --------------------------------------------------- -Configure the 'query-nutchwax' plugin. Specify which fields to make -searchable via "field:[term|phrase]" query syntax, and whether they -are "raw" fields or not. - -The specification format is one of: - - field:<name>:<boost> - raw:<name>:<lowercase>:<boost> - group:<name>:<lowercase>:<delimiter>:<boost> - -Default values are - - lowercase = true - delimiter = "," - boost = 1.0f - -There is no "lowercase" property for "field" specification because the -Nutch FieldQueryFilter doesn't expose the option, unlike the -RawFieldQueryFilter. - -The "group" fields are raw fields that can accept multiple values, -separated by a delimiter. 
Multiple values appearing in a query are -automagically translated into required OR-groups, such as - - collection:"193,221,36" => +(collection:193 collection:221 collection:36) - -NOTE: We do *not* use this filter for handling "date" queries, there -is a specific filter for that: DateQueryFilter - -We recommend: - -<property> - <name>nutchwax.filter.query</name> - <value> - raw:digest:false - raw:filename:false - raw:fileoffset:false - raw:exacturl:false - group:collection - group:type - field:anchor - field:content - field:host - field:title - </value> -</property> - - --------------------------------------------------- -nutchwax.urlfilter.wayback.exclusions --------------------------------------------------- -File containing the exclusion list for importing. - -Normally, this is specified on the command line with the NutchWAX -Importer is invoked. It can be specified here if preferred. - --------------------------------------------------- -nutchwax.urlfilter.wayback.canonicalizer --------------------------------------------------- - -For CDX-based de-duplication, the same URL canonicalization algorithm -must be used here as was used to generate the CDX files. - -The default canonicalizer in Wayback's '(w)arc-indexer' utility -is - - org.archive.wayback.util.url.AggressiveUrlCanonicalizer - -which is the value provided in "nutch-site.xml". - -If the '(w)arc-indexer' is executed with the "-i" (identity) -command-line option, then the matching canonicalizer - - org.archive.wayback.util.url.IdentityUrlCanonicalizer - -must be specified here. - --------------------------------------------------- -nutchwax.filter.http.status --------------------------------------------------- -This property configures a filter with a list of ranges -of HTTP status codes to allow. - -Typically, most NutchWAX implementors do not wish to import and index -404, 500, 302 and other non-success pages. 
This is an inclusion -filter, meaning that only ARC records with an HTTP status code -matching any of the values will be imported. - -There is a special "unknown" value which can be used to include ARC -records that don't have an HTTP status code (for whatever reason). - -The default setting provided in nutch-site.xml is to allow any 2XX -success code: - - <property> - <name>nutchwax.filter.http.status</name> - <value> - 200-299 - </value> - </property> - -But some other examples are: - - Allow any 2XX success code *and* redirects, use: - <property> - <name>nutchwax.filter.http.status</name> - <value> - 200-299 - 300-399 - </value> - </property> - - Be really strict about only certain codes, use: - <property> - <name>nutchwax.filter.http.status</name> - <value> - 200 - 301 - 302 - 304 - </value> - </property> - - Mix of ranges and specific codes, including the "unknown" - <property> - <name>nutchwax.filter.http.status</name> - <value> - Unknown - 200 - 300-399 - </value> - </property> - --------------------------------------------------- -nutchwax.import.content.limit --------------------------------------------------- -Similar to Nutch's - - file.content.limit - http.content.limit - ftp.content.limit - -properties, this specifies a limit on the size of a document imported -via NutchWAX. - -We recommend setting this to a size compatible with the memory -capacity of the computers performing the import. Something in the -1-4MB range is typical. 
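The inclusion semantics of 'nutchwax.filter.http.status' described above can be sketched as follows. This is only an illustration of the documented behaviour, not the NutchWAX implementation: the function names are hypothetical, and matching the "unknown" keyword case-insensitively is an assumption based on the "Unknown" spelling used in the last example.

```python
def parse_status_filter(value):
    """Parse whitespace-separated entries: '200', '200-299' or 'unknown'."""
    ranges = []
    allow_unknown = False
    for token in value.split():
        if token.lower() == "unknown":
            # Assumed case-insensitive; lets through records without a status.
            allow_unknown = True
        elif "-" in token:
            low, high = token.split("-", 1)
            ranges.append((int(low), int(high)))
        else:
            # A single code is treated as a range of length one.
            ranges.append((int(token), int(token)))
    return ranges, allow_unknown


def is_allowed(status, ranges, allow_unknown):
    """Inclusion filter: keep a record only if its status matches an entry."""
    if status is None:
        return allow_unknown
    return any(low <= status <= high for low, high in ranges)


# The "mix of ranges and specific codes" example from the text.
ranges, allow_unknown = parse_status_filter("""
    Unknown
    200
    300-399
""")

for status in (200, 204, 302, 404, None):
    print(status, is_allowed(status, ranges, allow_unknown))
```

With this value, 204 is rejected because only the exact code 200 is listed rather than the 200-299 range, which shows why the default setting uses a range.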
- - -====================================================================== Create a manifest ====================================================================== @@ -411,10 +68,10 @@ $ mkdir crawl $ cd crawl - $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest - $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments - $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments - $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/* + $ /opt/nutchwax-0.12.3/bin/nutchwax import ../manifest + $ /opt/nutchwax-0.12.3/bin/nutch updatedb crawldb -dir segments + $ /opt/nutchwax-0.12.3/bin/nutch invertlinks linkdb -dir segments + $ /opt/nutchwax-0.12.3/bin/nutch index indexes crawldb linkdb segments/* $ ls -F1 crawldb/ indexes/ @@ -439,7 +96,7 @@ $ cd ../ $ ls -F1 crawl/ - $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer + $ /opt/nutchwax-0.12.3/bin/nutch org.apache.nutch.searcher.NutchBean computer This calls the NutchBean to execute a simple keyword search for "computer". Use whatever query term you think appears in the @@ -450,17 +107,9 @@ Web Deployment ====================================================================== -As users of Nutch are aware, the web application (nutch-1.0-dev.war) -bundled with Nutch contains duplicate copies of the configuration -files. +The Nutch(WAX) web application is bundled with NutchWAX as -So, all patches and configuration changes that we made to the -files in + /opt/nutchwax-0.12.3/nutch-1.0-dev.war - /opt/nutch-1.0-dev/conf - -will have to be duplicated in the Nutch webapp when it is deployed. - -This is not due to NutchWAX, this is a "feature" of regular Nutch. I -just thought it would be good to remind everyone since we did make -configuration changes for NutchWAX. +Simply deploy that web application in the same fashion as with +Nutch. 
Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-12-16 19:53:25 UTC (rev 2677) +++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-12-18 18:37:40 UTC (rev 2678) @@ -3,10 +3,22 @@ 2008-12-18 Aaron Binns +Table of Contents + o Introduction + o Build from source + - SVN: Nutch 1.0-dev + - SVN: NutchWAX + - Build and Install + o Install binary package + + +====================================================================== +Introduction +====================================================================== + This installation guide assumes the reader is already familiar with building, packaging and deploying Nutch 1.0-dev. - The NutchWAX 0.12 source and build system are designed to integrate into the existing Nutch 1.0-dev source and build. @@ -20,12 +32,12 @@ proper, then builds NutchWAX components and integrates them into the Nutch build directory. -We recommend that you execute all build commands from the NutchWAX -directory. This way, NutchWAX will ensure that any and all +In order to build NutchWAX, execute all build commands from the +NutchWAX directory. This way, NutchWAX will ensure that any and all dependencies in Nutch will be properly built and kept up-to-date. Towards this goal, we have duplicated the most common build targets -from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file, -such as: +from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file, such +as: o compile o jar @@ -39,8 +51,15 @@ sub-directory as normal. -Nutch-1.0-dev -------------- +====================================================================== +Build from Source +====================================================================== + +To build from source, you must check-out the Nutch and NutchWAX sources +from their respective 'subversion' source control servers. 
+ +SVN: nutch-1.0-dev +------------------ As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Nutch doesn't have a 1.0 release package yet, so we have to use the Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.3 is @@ -53,9 +72,12 @@ $ svn checkout -r 701524 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch $ cd nutch +Please be sure to check-out this specific version of the Nutch source. +If you just grab the head of the trunk, there may be newer and +incompatible changes to Nutch. -NutchWAX --------- +SVN: NutchWAX +------------- Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into Nutch's "contrib" directory. @@ -65,7 +87,6 @@ This will create a sub-directory named "archive" containing the NutchWAX sources. - Build and install ----------------- Assuming you already have the required tool-set for building Nutch, @@ -91,3 +112,18 @@ $ cd /opt $ tar xvfz nutch-1.0-dev.tar.gz + $ mv nutch-1.0-dev nutchwax-0.12.3 + + +====================================================================== +Install binary package +====================================================================== + +Alternatively, grab a "binary" release package from the Internet +Archive's NutchWAX home page. + +Install it simply by untarring it, for example: + + $ cd /opt + $ tar xvfz nutchwax-0.12.3.tar.gz + Modified: trunk/archive-access/projects/nutchwax/archive/README.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/README.txt 2008-12-16 19:53:25 UTC (rev 2677) +++ trunk/archive-access/projects/nutchwax/archive/README.txt 2008-12-18 18:37:40 UTC (rev 2678) @@ -3,6 +3,16 @@ 2008-12-18 Aaron Binns +Table of Contents + o Introduction + o Build and Install + o Tutorial + + +====================================================================== +Introduction +====================================================================== + Welcome to NutchWAX 0.12.3! 
NutchWAX is a set of add-ons to Nutch in order to index and search @@ -17,7 +27,6 @@ Since NutchWAX is a set of add-ons to Nutch, you should already be familiar with Nutch before using NutchWAX. -====================================================================== The goal of NutchWAX is to enable full-text indexing and searching of documents stored in web archive file formats (ARC and WARC). @@ -26,13 +35,13 @@ to Nutch to read documents directly from ARC/WARC files. We call this process "importing" archive files. -Importing produces a Nutch segment, similar to Nutch crawling the -documents itself. In this scenario, document importing replaces the +Importing produces a Nutch segment, the same as when Nutch is used to +crawl documents itself. In essence, document importing replaces the conventional "generate/fetch/update" cycle of Nutch. Once the archival documents have been imported into a segment, the -regular Nutch commands to update the 'crawldb', invert the links and -index the document contents can proceed as normal. +regular Nutch commands to index the document contents can proceed as +normal. ====================================================================== @@ -71,73 +80,25 @@ conf/nutch-site.xml - Sample configuration properties file showing suggested settings for - Nutch and NutchWAX. + Additional configuration properties for NutchWAX, including + over-rides for properties defined in 'nutch-default.xml' There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX is distributed in source code form and is intended to be built in conjunction with Nutch. -See "INSTALL.txt" for details on building NutchWAX and Nutch. -See "HOWTO.txt" for a quick tutorial on importing, indexing and -searching a set of documents in a web archive file. - ====================================================================== - -This 0.12.x release of NutchWAX is radically different in source-code -form compared to the previous release, 0.10. 
- -One of the design goals of 0.12.x was to reduce or even eliminate the -"copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX -releases had to copy/paste/edit large chunks of Nutch source code in -order to add the NutchWAX features. - -Also, the NutchWAX 0.12.x sources and build are designed to one day be -added into mainline Nutch as a proper "contrib" package; then -eventually be fully integrated into the core Nutch source code. - +Build and Install ====================================================================== -Most of the NutchWAX source code is relatively straightfoward to those -already familiar with the inner workings of Nutch. Still, special -attention on one class is worth while: +See "INSTALL.txt" for detailed instructions to build NutchWAX from +source or install a binary package. - src/java/org/archive/nutchwax/Importer.java -This is where ARC/WARC files are read and their documents are imported -into a Nutch segment. - -It is inspired by: - - nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java - -on the Nutch SVN head. - -Our implementation differs in a few important ways: - - o Rather than taking a directory with ARC files as input, we take a - manifest file with URLs to ARC files. This way, the manifest is - split up among the distributed Hadoop jobs and the ARC files are - processed in whole by each worker. - - In the Nutch SVN, the ArcSegmentCreator.java expects the input - directory to contain the ARC files and (AFAICT) splits them up and - distributes them across the Hadoop workers. - - o We use the standard Internet Archive ARCReader and WARCReader - classes. Thus, NutchWAX can read both ARC and WARC files, whereas - the ArcSegmentCreator class can only read ARC files. - - o We add metadata fields to the document, which are then available - to the "index-nutchwax" plugin at indexing-time. - - Importer.importRecord() - ... 
- contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() ); - contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() ); - contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName ); - contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() ); - ... - ====================================================================== +Tutorial +====================================================================== + +See "HOWTO.txt" for a quick tutorial on importing, indexing and +searching a set of documents in a web archive file. Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-12-16 19:53:25 UTC (rev 2677) +++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-12-18 18:37:40 UTC (rev 2678) @@ -21,8 +21,45 @@ o Enhanced OpenSearchServlet o Improved XSLT sample for OpenSearch o System init.d script for searcher slaves - o Enhanced searcher slave aware of NutchWAX extensions + o Enhanced searcher slave which supports NutchWAX extensions + +One of the major changes to 0.12.3 is not a feature, enhancement or +bug-fix, but the way the NutchWAX source is "integrated" into the +Nutch source. + +Yes, the NutchWAX source is still kept in the contrib/archive +sub-directory, but when you invoke a build command from the +NutchWAX directory, such as + + $ cd nutch/contrib/archive + $ ant tar + +Many files from the NutchWAX source tree are copied directly into the +Nutch source tree before the build process begins. + +The reason for this is to make NutchWAX easier to use. + +In previous versions of NutchWAX, once 'ant' build command was +finished, the operator had to manually patch configuration files in +the Nutch directory. Upon a subsequent build, the files would be +over-written by Nutch's and would have to be patched again. + +It was a major hassle and complication. 
+ +Another impetus for copying files into the Nutch source was to patch +bugs and make enhancements in the Nutch Java code which couldn't be +effectively done keeping the sources separate. When an 'ant' build +command is run a few Java files are copied from the NutchWAX source +tree into the Nutch source tree. + +In release 0.12.3, the NutchWAX build file: 'build.xml' handles all of +this. Simply execute your build commands from 'contrib/archive' as +instructed in the HOWTO and no longer worry about patching +configuration files. If you wish to alter the NutchWAX configuration +file, make those changes in the NutchWAX source tree. + + ====================================================================== Issues ======================================================================