From: <bi...@us...> - 2008-05-21 00:01:54
Revision: 2266
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2266&view=rev
Author:   binzino
Date:     2008-05-20 17:02:01 -0700 (Tue, 20 May 2008)

Log Message:
-----------
Total re-write of install, readme and howto documents.

Modified Paths:
--------------
    trunk/archive-access/projects/nat/archive/INSTALL.txt
    trunk/archive-access/projects/nat/archive/README.txt

Added Paths:
-----------
    trunk/archive-access/projects/nat/archive/HOWTO.txt

Added: trunk/archive-access/projects/nat/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/HOWTO.txt (rev 0)
+++ trunk/archive-access/projects/nat/archive/HOWTO.txt 2008-05-21 00:02:01 UTC (rev 2266)
@@ -0,0 +1,325 @@
+
+HOWTO.txt
+2008-05-20
+Aaron Binns
+
+Table of Contents
+  o Prerequisites
+    - Nutch(WAX) installation
+    - ARC/WARC files
+  o Configuration & Patching
+  o Create a manifest
+  o Import, Invert and Index
+  o Search
+  o Web deployment
+    - Don't forget to config & patch again
+
+======================================================================
+Prerequisites
+======================================================================
+
+In order to use Nutch(WAX) you need the following prerequisites:
+
+  1. NutchWAX installed.
+
+     See INSTALL.txt for instructions on building and installing
+     NutchWAX.
+
+     This HOWTO assumes it is installed in
+
+       /opt/nutch-1.0-dev
+
+  2. ARC/WARC files.
+
+     The whole purpose of NutchWAX is to index ARC/WARC files. These
+     files are not produced by Nutch nor NutchWAX; they are produced by
+     other tools, such as Heritrix.
+
+     If you don't have any ARC/WARC files, you have no need for
+     NutchWAX.
+
+
+======================================================================
+Patching
+======================================================================
+
+The vanilla NutchWAX as built according to the INSTALL.txt guide is
+not quite ready to be used out-of-the-box.
+
+Before you can use NutchWAX, you must first patch a bug that exists in
+the current Nutch SVN head.
+
+The file
+
+  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
+
+contains two errors: one where a mimetype is referenced before it is
+defined; and a second where a definition has an illegal character.
+
+These errors cause Nutch to not recognize certain mimetypes and
+therefore to ignore documents matching those mimetypes.
+
+There are two fixes:
+
+  1. Move
+
+     <mime-type type="application/xml">
+       <alias type="text/xml" />
+       <glob pattern="*.xml" />
+     </mime-type>
+
+     definition higher up in the file, before the reference to it.
+
+  2. Remove
+
+     <mime-type type="application/x-ms-dos-executable">
+       <alias type="application/x-dosexec;exe" />
+     </mime-type>
+
+     as the ';' character is illegal according to the comments in the
+     Nutch code.
+
+You can either apply these patches yourself, or copy an already-patched
+copy from:
+
+  /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml
+
+to
+
+  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
+
+
+======================================================================
+Configuring
+======================================================================
+
+Since we assume that you are already familiar with Nutch, you
+should already be familiar with configuring it. The configuration
+is mainly defined in
+
+  /opt/nutch-1.0-dev/conf/nutch-default.xml
+
+NutchWAX requires the modification of two existing properties and the
+addition of two new ones.
+
+All of the modifications described below can be found in:
+
+  /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml
+
+You can either apply the configuration changes yourself, or copy that
+file to
+
+  /opt/nutch-1.0-dev/conf/nutch-site.xml
+
+--------------------------------------------------
+plugin.includes
+--------------------------------------------------
+Change the list of plugins from:
+
+  protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
+
+to
+
+  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic
+
+In short, we add:
+
+  index-nutchwax
+  query-nutchwax
+  parse-pdf
+
+and remove:
+
+  urlfilter-regex
+  urlnormalizer-(pass|regex|basic)
+
+The only *required* changes are the additions of the NutchWAX index
+and query plugins. The rest are optional, but recommended.
+
+The addition of the "parse-pdf" plugin is simply because we have lots
+of PDFs in our archives and we want to index them. We sometimes
+remove the "parse-js" plugin if we don't care to index JavaScript
+files.
+
+We also remove the URL filtering and normalizing plugins because we do
+not need the URLs normalized nor filtered. We trust that the tool
+that produced the ARC/WARC file will have normalized the URLs
+contained therein according to its own rules so there's no need to
+normalize here. Also, we don't filter by URL since we want to index
+as much of the ARC/WARC file as we have parsers for.
+
+--------------------------------------------------
+mime.type.magic
+--------------------------------------------------
+We disable mimetype detection in Nutch for two reasons:
+
+1. The ARC/WARC file specifies the Content-Type of the document. We
+   trust that the tool that created the ARC/WARC file got it right.
+
+2. The implementation in Nutch can use a lot of memory as the *entire*
+   document is read into memory as a byte[], then converted to a
+   String, then checked against the MIME database. This can lead to
+   out of memory errors for large files, such as music and video.
+
+To disable, simply set the property value to false.
+
+  <property>
+    <name>mime.type.magic</name>
+    <value>false</value>
+  </property>
+
+--------------------------------------------------
+nutchwax.filter.index
+--------------------------------------------------
+Configure the 'index-nutchwax' plugin. Specify how the metadata
+fields added by ArcsToSegment are mapped to the Lucene documents
+during indexing.
+
+The specifications here are of the form:
+
+  src-key:lowercase:store:tokenize:dest-key
+
+where the only required part is the "src-key"; the rest will assume
+the following defaults:
+
+  lowercase = true
+  store     = true
+  tokenize  = false
+  dest-key  = src-key
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.index</name>
+  <value>
+    arcname:false
+    collection
+    date
+    type
+  </value>
+</property>
+
+--------------------------------------------------
+nutchwax.filter.query
+--------------------------------------------------
+Configure the 'query-nutchwax' plugin. Specify which fields to make
+searchable via "[field]:[term|phrase]" query syntax, and whether they
+are "raw" fields or not.
+
+The specification format is
+
+  raw:name:lowercase:boost
+or
+  field:name:boost
+
+Default values are
+
+  lowercase = true
+  boost     = 1.0f
+
+There is no "lowercase" property for "field" specification because the
+Nutch FieldQueryFilter doesn't expose the option, unlike the
+RawFieldQueryFilter.
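
Editor's sketch (not part of the committed HOWTO.txt): the query
specification format and its defaults, as described above, can be
illustrated with a small POSIX shell function. The function name
parse_query_spec is hypothetical and exists only for this
illustration; it is not a NutchWAX tool, and the real parsing is done
inside the 'query-nutchwax' plugin. The default boost is printed as
1.0 rather than the Java literal 1.0f.

```shell
# Hypothetical helper: expand one nutchwax.filter.query entry, filling in
# the documented defaults (lowercase = true, boost = 1.0).
parse_query_spec() {
    # Split the spec on ':' into at most four parts.
    IFS=: read -r kind name a b <<EOF
$1
EOF
    case "$kind" in
        raw)
            # raw:name:lowercase:boost
            printf 'kind=raw name=%s lowercase=%s boost=%s\n' \
                "$name" "${a:-true}" "${b:-1.0}"
            ;;
        field)
            # field:name:boost (no "lowercase" part for "field" specs)
            printf 'kind=field name=%s boost=%s\n' "$name" "${a:-1.0}"
            ;;
        *)
            printf 'unknown spec: %s\n' "$1" >&2
            return 1
            ;;
    esac
}

parse_query_spec raw:arcname:false
parse_query_spec field:title
```

Under this reading, "raw:arcname:false" only overrides the lowercase
setting, while "field:title" takes the default boost.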
+
+NOTE: We do *not* use this filter for handling "date" queries; there is a
+specific filter for that: DateQueryFilter
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.query</name>
+  <value>
+    raw:arcname:false
+    raw:collection
+    raw:type
+    field:anchor
+    field:content
+    field:host
+    field:title
+  </value>
+</property>
+
+
+======================================================================
+Create a manifest
+======================================================================
+
+The input to NutchWAX's import tool is a manifest file. This is a
+simple text file where each line contains a URL to an ARC/WARC file
+and an optional collection name.
+
+For example:
+
+  $ cat > manifest
+  http://someserver/somepath/somearchive.arc.gz mycollection
+  ^D
+
+This creates a simple manifest file with one ARC file and a collection
+name of "mycollection".
+
+You don't have to use collections at all. If you don't know how you
+would use them, then simply leave the collection name out here.
+
+
+======================================================================
+Import, Invert and Index
+======================================================================
+
+The steps to import the files, invert the links and index the documents
+are rather simple:
+
+  $ mkdir crawl
+  $ cd crawl
+  $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest
+  $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
+  $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments
+  $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
+  $ ls -F1
+  crawldb/
+  indexes/
+  linkdb/
+  segments/
+
+To those already familiar with Nutch, these steps should be quite
+familiar.
+
+In the first step, we call NutchWAX's "import" command, which creates the
+Nutch segment containing the documents in the ARC/WARC files listed in
+the manifest. The rest is the same as regular Nutch.
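
Editor's sketch (not part of the committed HOWTO.txt): when there are
many ARC/WARC files, the manifest described above can be generated
rather than typed by hand. The function name make_manifest and the
input file urls.txt are hypothetical; the only thing taken from the
HOWTO is the line format, a URL followed by an optional collection
name.

```shell
# Hypothetical helper: read ARC/WARC URLs from stdin, one per line, and
# write manifest lines of the form "URL [collection]" to stdout.
make_manifest() {
    collection="$1"            # optional; may be empty
    while IFS= read -r url; do
        # Skip blank input lines.
        [ -n "$url" ] || continue
        if [ -n "$collection" ]; then
            printf '%s %s\n' "$url" "$collection"
        else
            printf '%s\n' "$url"
        fi
    done
}

printf '%s\n' http://someserver/somepath/somearchive.arc.gz | make_manifest mycollection
```

A typical use would be "make_manifest mycollection < urls.txt > manifest",
followed by the "nutchwax import ../manifest" step shown above.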
+
+
+======================================================================
+Search
+======================================================================
+The resulting indexes can be searched in exactly the same manner as in
+regular Nutch. For example, assuming you just completed the steps
+above, now:
+
+  $ cd ../
+  $ ls -F1
+  crawl/
+  $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer
+
+This calls the NutchBean to execute a simple keyword search for
+"computer". Use whatever query term you think appears in the
+documents you imported.
+
+
+======================================================================
+Web Deployment
+======================================================================
+
+As users of Nutch are aware, the web application (nutch-1.0-dev.war)
+bundled with Nutch contains duplicate copies of the configuration
+files.
+
+So, all patches and configuration changes that we made to the
+files in
+
+  /opt/nutch-1.0-dev/conf
+
+will have to be duplicated in the Nutch webapp when it is deployed.
+
+This is not due to NutchWAX, this is a "feature" of regular Nutch. I
+just thought it would be good to remind everyone since we did make
+configuration changes for NutchWAX.

Modified: trunk/archive-access/projects/nat/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/INSTALL.txt 2008-05-14 00:20:24 UTC (rev 2265)
+++ trunk/archive-access/projects/nat/archive/INSTALL.txt 2008-05-21 00:02:01 UTC (rev 2266)
@@ -1,236 +1,93 @@
 INSTALL.txt
-2008-05-06
+2008-05-20
 Aaron Binns

+This installation guide assumes the reader is already familiar with
+building, packaging and deploying Nutch 1.0-dev.

-The NutchWAX 0.12 build and installation is as an "add-on" to an
-existing Nutch 1.0-dev installation.

-NutchWAX 0.12 uses a simple 'ant' build script. The script compiles
-the NutchWAX sources, using the libraries in the installed
-Nutch-1.0-dev.
+The NutchWAX 0.12 source and build system are designed to integrate
+into the existing Nutch 1.0-dev source and build.

-We strongly recommend having *two* Nutch-1.0-dev installation
-directories: one that you build NutchWAX against, and another into
-which NutchWAX is deployed.
+The long-term goal is for the NutchWAX components to be fully
+integrated into mainline Nutch. As a stepping-stone toward that goal,
+we have packaged the NutchWAX source to be dropped into the Nutch
+"contrib" directory and built from there.

-NutchWAX is deployed by un-tar'ing the nutchwax-0.12.tar.gz file
-*into* an existing Nutch-1.0-dev installation. Think of NutchWAX as
-an add-on. We over-write a few Nutch config files, but the rest is
-simply added to the existing Nutch-1.0-dev installation.
+Like Nutch, NutchWAX 0.12 uses a simple 'ant' build script. The
+NutchWAX build script calls out to the Nutch script to build Nutch
+proper, then builds NutchWAX components and integrates them into the
+Nutch build directory.

+We recommend that you execute all build commands from the NutchWAX
+directory. This way, NutchWAX will ensure that any and all
+dependencies in Nutch will be properly built and kept up-to-date.
+Towards this goal, we have duplicated the most common build targets
+from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file,
+such as:

+  o compile
+  o jar
+  o job
+  o tar
+  o clean
+
+Again, the idea is that if you're already used to building Nutch, you
+can easily transition to building Nutch and NutchWAX together. All of
+the build artifacts will still be placed in Nutch's 'build'
+sub-directory as normal.
+
+
 Nutch-1.0-dev
 -------------
-
-As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Now
+As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
 Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12 is
+Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12 is
 built against is:

   650739

 To checkout this revision of Nutch, use:

-  $ mkdir nutch
+  $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
   $ cd nutch
-  $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk

-To build the nutch-1.0-dev.tar.gz package, use 'ant'

-  $ cd trunk
-  $ ant tar
+NutchWAX
+--------
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into
+Nutch's "contrib" directory.

-This produces
+  $ cd contrib
+  $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nat/archive

-  build/nutch-1.0-dev.tar.gz
+This will create a sub-directory named "archive" containing the
+NutchWAX sources.

-Which we then install *twice*

-  $ mkdir -p ~/nutchwax-0.12/nutch-1.0-dev
-  $ tar xfz -C ~/nutchwax-0.12/nutch-1.0-dev build/nutch-1.0-dev.tar.gz
-  $ mkdir -p /opt/nutch-1.0-dev
-  $ tar xfz -C /opt/nutch-1.0-dev build/nutch-1.0-dev.tar.gz
-
-The idea is that we keep /opt/nutch-1.0-dev as our pristine copy which
-we compile against, then, when we want to test NutchWAX, we deploy it
-into ~/nutchwax-0.12/nutch-1.0-dev.
-
-Why can't we just use one installation of Nutch? Mainly to avoid
-weirdness where we are compiling NutchWAX source against the same set
-of libraries where we would be installing NutchWAX. Consider, when we
-deploy NutchWAX, we copy the nutchwax.jar into the Nutch 'lib'
-directory. If we use that same 'lib' directory for dependencies when
-compiling the source, 'ant'/'javac' will likely get confused when
-calculating dependencies.
-
-It's possible that you could successfully go through the
-build/test/release cycle using one Nutch-1.0-dev directory, but these
-instructions assume you will have two.
-
-
 Build and install
 -----------------
+Assuming you already have the required tool-set for building Nutch,
+building NutchWAX is a snap.

-  1. Install two Nutch-1.0-dev packages per the instructions above.
+Simply execute the same 'ant' build command in

-  2. Edit build.xml to point to the "pristine" installation of Nutch-1.0-dev
+  nutch/contrib/archive

-     <!-- NOTE: Point this to your Nutch 1.0-dev directory -->
-     <property name="nutch.dir" value="/opt/nutch-1.0-dev" />
+as you normally would and everything will build as normal.

-  3. Build NutchWAX-0.12
+For example

-     $ ant
+  $ cd nutch/contrib/archive
+  $ ant tar

-     The default build rule is "package" which will compile all the source
-     and build an installation tarball: nutchwax-0.12.tar.gz
+This command will build all of Nutch, then the NutchWAX add-ons and
+finally will package everything up into the "nutch-1.0-dev.tar.gz"
+release package.

-     The "build.xml" file is pretty straightforward and just grepping
-     for the targets should be pretty obvious: compile, clean, etc.
-
-  4. Install NutchWAX into the build/test Nutch installation
-
-     $ tar xfz -C ~/nutchwax-0.12/nutch-1.0-dev nutchwax-0.12.tar.gz

-That's it!
-
-All we do is add our libraries (nutchwax.jar and dependencies), the
-'nutchwax' helper script, plugins for indexing and querying, and a few
-config files.
-
-Except for the config files, no files in the Nutch-1.0-dev
-installation are over-written, only added. The "nutch-site.xml" file
-is over-written, but that file is empty in a vanilla Nutch
-installation, so there's small risk of over-writing something.
-
-
-HOWTO run and test
-------------------
-
-The 'nutchwax' helper script is installed in the Nutch-1.0-dev 'bin'
-directory next to the 'nutch' helper script.
-
-The 'nutchwax' script is used to run the NutchWAX-specific tools, use
-the regular 'nutch' script for regular Nutch activities.
-
-The 'nutchwax' script runs two tools
-
-  "import"    Import a set of .arc/.warc files from a manifest, creating
-              a Nutch segment.
-
-  "dumpindex" Debug tool that dumps a Lucene index, such as the ones
-              created by Nutch's "index" tool.
-
-The idea is that the NutchWAX "import" tool supplants the Nutch
-generate and fetch cycle. Rather than generating and fetching
-segments, we import the .arc/.warc files directly into a newly created
-segment. Then, we process that segment just as we normally would with
-Nutch.
-
-For example,
-
-  $ cd nutch-test
-  $ cat > manifest
-  http://someserver/foo-bar-baz.arc mycollection
-  ^D
-  $ nutch-1.0-dev/bin/nutchwax import manifest
-
-This will import the arc file listed in the manifest into a newly
-created segment. The segment is created by default in a directory
-hierarchy of the form:
-
-  segments/[date-timestamp]
-
-This mirrors the way segments are created in vanilla Nutch by the
-"generate" command.
-
-You can explicitly name the segment if you want, e.g.
-
-  nutchwax import manifest mysegment
-
-Once the segment is created by the importing of ARC files with
-NutchWAX, you can use Nutch to perform the rest of the steps. For
+Then, install the "nutch-1.0-dev.tar.gz" tarball as normal. For
 example:

-  $ nutch-1.0-dev/bin/nutchwax import manifest
-  $ nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
-  $ nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments
-  $ nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
-  $ nutch-1.0-dev/bin/nutch merge index indexes
-
-This is pretty much the minimal set of steps to import and index a set
-of ARC files. The crawldb update and link inversion steps are pro
-forma and don't have anything to do with NutchWAX specifically, but
-are a part of regular Nutch processing.
-
-Now you have a Nutch "index" directory and are ready to search!
-
-Searching is done as in vanilla Nutch. Either launch the Nutch webapp
-or use the command-line interface to NutchBean to run some test
-searches. Nothing NutchWAX-specific here.
-
-
-Miscellaneous notes
--------------------
-
-1. Plugins
-
-There are two plugins bundled with NutchWAX:
-
-  index-nutchwax
-  query-nutchwax
-
-See the "plugin.includes" property in nutch-site.xml to see where
-these plugins are added to the filter chain.
-
-The index-nutchwax plugin ensures that WAX-specific metadata is
-transferred from the Nutch Content object to the Lucene Document
-object, which is placed in the Lucene index.
-
-The query-nutchwax plugin is used to process query requests against
-those same meta-data fields. It also expands the capabilities of
-searching the basic Nutch fields as well.
-
-2. URL filters
-
-Nutch's URL filter by default filters-out many common URL oddities
-that would normally trip-up Nutch's crawler. However, when importing
-content from ARC files, filtering out content probably doesn't make
-sense. That is, whatever content made it into the ARC file should be
-imported, no matter what the URL looks like.
-
-To change the URL filter, edit the Nutch file 'conf/regex-urlfilter.txt'.
-To pass all content through the filter, remove all filter rules except
-for the last one:
-
-  # accept anything else
-  +.
-
-3. conf/tika-mimetypes.xml
-
-NutchWAX comes with a fixed copy of tika-mimetypes.xml. The version
-in Nutch revision 650739 has a few bugs in it which cause parsing to
-fail for many document types. The bugs are:
-
-  o Move
-
-    <mime-type type="application/xml">
-      <alias type="text/xml" />
-      <glob pattern="*.xml" />
-    </mime-type>
-
-    definition higher up in the file, before the reference to it.
-
-  o Remove
-
-    <mime-type type="application/x-ms-dos-executable">
-      <alias type="application/x-dosexec;exe" />
-    </mime-type>
-
-    as the ';' character is illegal according to the comments in the
-    Nutch code.
-
-The copy of "conf/tika-mimetypes.xml" bundled with NutchWAX fixes
-these two bugs.
+  $ cd /opt
+  $ tar xvfz nutch-1.0-dev.tar.gz

Modified: trunk/archive-access/projects/nat/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/README.txt 2008-05-14 00:20:24 UTC (rev 2265)
+++ trunk/archive-access/projects/nat/archive/README.txt 2008-05-21 00:02:01 UTC (rev 2266)
@@ -1,105 +1,122 @@
 README.txt
-2008-05-06
+2008-05-20
 Aaron Binns

+Welcome to NutchWAX 0.12!

-This is the NutchWAX-0.12 source that John Lee handed-off to me. It
-is a work-in-progress.
+NutchWAX is a set of add-ons to Nutch in order to index and search
+archived web data.

-Compared to NutchWAX-0.10 (and earlier) it is *much* simpler. The
-main WAX-specific code is in just a few files really:
+These add-ons are developed and maintained by the Internet Archive Web
+Team in conjunction with a broad community of contributors, partners
+and end-users.

-src/java/org/archive/nutchwax/ArcsToSegment.java
+The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".

-  This is the meat of the WAX logic for processing .arc files and
-  generating Nutch segments. Once we use this to generate a set of
-  segments for the .arc files, we can use the rest of vanilla
-  Nutch-1.0-dev to invert links and index the content with Lucene.
+Since NutchWAX is a set of add-ons to Nutch, you should already be
+familiar with Nutch before using NutchWAX.

-  This conversion code is heavily edited from:
+======================================================================

-    nutch-1.0-dev/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+The goal of NutchWAX is to enable full-text indexing and searching of
+documents stored in web archive file formats (ARC and WARC).

-  taken from the Nutch SVN head (a.k.a the "1.0-dev" in-development).
+The way we achieve that goal is by providing add-on tools and plugins
+to Nutch to read documents directly from ARC/WARC files. We call this
+process "importing" archive files.

-  Ours differs in a few important ways:
+Importing produces a Nutch segment, the same as if Nutch had actually
+crawled the documents itself. In this scenario, document importing
+replaces the conventional "generate/fetch/update" cycle of Nutch.

-    o Rather than taking a directory with .arc files as input, we take
-      a manifest file with URLs to .arc files. This way, the manifest
-      is split up among the distributed Hadoop jobs and the .arc files
-      are processed in whole by each worker.
+Once the archival documents have been imported into a segment, the
+regular Nutch commands to update the 'crawldb', invert the links and
+index the document contents can proceed as normal.

-      In the Nutch-1.0-dev, the ArcSegmentCreator.java expects the
-      input directory to contain the .arc files and (AFAICT) splits
-      them up and distributes them across the Hadoop workers. This
-      seems really inefficient to me, I think our approach is much
-      better -- at least for us.
+======================================================================

-    o Related to the way input files are split and processed, we use
-      the standard Archive ARCReader class just like Heritrix and
-      Wayback.
+The NutchWAX add-ons consist of:

-      The ArcSegmentCreator.java in Nutch-1.0-dev doesn't use our
-      ARCReader because of licensing incompatibility. Ours is under
-      GPL and Nutch-1.0-dev forbids the use of GPL code.
-
-      We are in the process of re-licensing or dual-licensing with
-      Apache License, but until then, our ARCReader code won't be included
-      in mainline Nutch.
+  bin/nutchwax

-      This isn't a problem per se, but worth noting in case anyone
-      looks at the Nutch-1.0-dev code and wonders why they built their
-      own (horribly inefficient) .arc reader.
+    A shell script that is used to run the NutchWAX command-line tools,
+    such as document importing.

-    o We add metadata fields to the processed document for WAX-specific
-      purposes:
+    This is patterned after the 'bin/nutch' shell script.

-        content.getMetadata().set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() );
-        content.getMetadata().set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() ) ;
-        content.getMetadata().set( NutchWax.COLLECTION_KEY, collection);
-        content.getMetadata().set( NutchWax.ARCHIVE_DATE_KEY, meta.getDate() );
+  plugins/index-nutchwax

-      The addition of the arcname and collection key is pretty
-      obvious. I don't know why the content-type isn't added in the
-      vanilla Nutch-1.0-dev.
-
-      Also, we should review the use of the ARCHIVE_DATE_KEY in that
-      John Lee mentioned to me that there was possibly duplicate date
-      fields put in the index: one that is a plain old Java date, and
-      one that is a 14-digit date string for use with Wayback.
+    Indexing plugin which adds NutchWAX-specific metadata fields to the
+    indexed document.

-src/java/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/NutchWaxIndexingFilter.java
-src/java/plugin/index-nutchwax/plugin.xml
+  plugins/query-nutchwax

-  This filter is pretty straightforward. All it does is take the
-  metadata fields that were added to the document (as described above)
-  and placed in the Lucene index so that we can make use of them at
-  search-time.
+    Query plugin which allows for querying against the metadata fields
+    added by 'index-nutchwax'.

-src/java/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/MultipleFieldQueryFilter.java
-src/java/plugin/query-nutchwax/plugin.xml
+There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX
+is distributed in source code form and is intended to be built in
+conjunction with Nutch.

-  This is a single query filter that can be used for querying single
-  fields from a single implementation. It does *not* allow for
-  querying multiple fields as you can already do that via Nutch.
+See "INSTALL.txt" for details on building NutchWAX and Nutch.

-  What this filter does is allow one to more-or-less create query
-  filters in a data-driven manner rather than having to code-up a new
-  class for each field. That is, before one would have to create a
-  CollectionQueryFilter class to filter on the "collection" field.
-  With the MultipleFieldQueryFilter class, you can specify that the
-  "collection" field is to be filterable via the plugin.xml file and
-  "nutchwax.filter.query" configuration property.
+See "HOWTO.txt" for a quick tutorial on importing, indexing and
+searching a set of documents in a web archive file.

-src/java/org/archive/nutchwax/NutchWax.java
+======================================================================

-  Just a simple enum used by the above two classes for the metadata
-  keys.
+This 0.12 release of NutchWAX is radically different in source-code
+form compared to the previous release, 0.10.

-src/java/org/archive/nutchwax/tools/DumpIndex.java
+One of the design goals of 0.12 was to reduce or even eliminate the
+"copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX
+releases had to copy/paste/edit large chunks of Nutch source code in
+order to add the NutchWAX features.

-  A simple command-line utility to dump the contents of a Lucene
-  index. Used for debugging.
+Also, the NutchWAX 0.12 sources and build are designed to one day be
+added into mainline Nutch as a proper "contrib" package; then
+eventually be fully integrated into the core Nutch source code.

+======================================================================

+Most of the NutchWAX source code is relatively straightforward to those
+already familiar with the inner workings of Nutch. Still, special
+attention to one class is worthwhile:
+
+  src/java/org/archive/nutchwax/ArcsToSegment.java
+
+This is where ARC/WARC files are read and their documents are imported
+into a Nutch segment.
+
+It is inspired by:
+
+  nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+
+on the Nutch SVN head.
+
+Our implementation differs in a few important ways:
+
+  o Rather than taking a directory with ARC files as input, we take a
+    manifest file with URLs to ARC files. This way, the manifest is
+    split up among the distributed Hadoop jobs and the ARC files are
+    processed in whole by each worker.
+
+    In the Nutch SVN, the ArcSegmentCreator.java expects the input
+    directory to contain the ARC files and (AFAICT) splits them up and
+    distributes them across the Hadoop workers.
+
+  o We use the standard Internet Archive ARCReader and WARCReader
+    classes. Thus, NutchWAX can read both ARC and WARC files, whereas
+    the ArcSegmentCreator class can only read ARC files.
+
+  o We add metadata fields to the document, which are then available
+    to the "index-nutchwax" plugin at indexing-time.
+
+    ArcsToSegment.importRecord()
+      ...
+      contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() );
+      contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() );
+      contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName );
+      contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() );
+      ...

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.