From: <bi...@us...> - 2008-05-21 00:01:54
Revision: 2266 http://archive-access.svn.sourceforge.net/archive-access/?rev=2266&view=rev Author: binzino Date: 2008-05-20 17:02:01 -0700 (Tue, 20 May 2008) Log Message: ----------- Total re-write of install, readme and howto documents. Modified Paths: -------------- trunk/archive-access/projects/nat/archive/INSTALL.txt trunk/archive-access/projects/nat/archive/README.txt Added Paths: ----------- trunk/archive-access/projects/nat/archive/HOWTO.txt Added: trunk/archive-access/projects/nat/archive/HOWTO.txt =================================================================== --- trunk/archive-access/projects/nat/archive/HOWTO.txt (rev 0) +++ trunk/archive-access/projects/nat/archive/HOWTO.txt 2008-05-21 00:02:01 UTC (rev 2266) @@ -0,0 +1,325 @@ + +HOWTO.txt +2008-05-20 +Aaron Binns + +Table of Contents + o Prerequisites + - Nutch(WAX) installation + - ARC/WARC files + o Configuration & Patching + o Create a manifest + o Import, Invert and Index + o Search + o Web deployment + - Don't forget to config & patch again + +====================================================================== +Prerequisites +====================================================================== + +In order to use Nutch(WAX) you need the following prerequisites: + + 1. NutchWAX installed. + + See INSTALL.txt for instruction on building and installing + NutchWAX. + + This HOWTO assumes it is installed in + + /opt/nutch-1.0-dev + + 2. ARC/WARC files. + + The whole purpose of NutchWAX is to index ARC/WARC files. These + files are not produced by Nutch nor NutchWAX, they are produced by + other tools, such as Heritrix. + + If you don't have any ARC/WARC files, you have no need for + NutchWAX. + + +====================================================================== +Patching +====================================================================== + +The vanilla NutchWAX as built according to the INSTALL.txt guide is +not quite ready to be used out-of-the-box. 
+ +Before you can use NutchWAX, you must first patch a bug that exists in +the current Nutch SVN head. + +The file + + /opt/nutch-1.0-dev/conf/tika-mimetypes.xml + +contains two errors: one where a mimetype is referenced before it is +defined; and a second where a definition has an illegal character. + +These errors cause Nutch to not recognize certain mimetypes and +therefore will ignore documents matching those mimetypes. + +There are two fixes: + + 1. Move + + <mime-type type="application/xml"> + <alias type="text/xml" /> + <glob pattern="*.xml" /> + </mime-type> + + definition higher up in the file, before the reference to it. + + 2. Remove + + <mime-type type="application/x-ms-dos-executable"> + <alias type="application/x-dosexec;exe" /> + </mime-type> + + as the ';' character is illegal according to the comments in the + Nutch code. + +You can either apply these patches yourself, or copy an already-patched +copy from: + + /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml + +to + + /opt/nutch-1.0-dev/conf/tika-mimetypes.xml + + +====================================================================== +Configuring +====================================================================== + +Since we assume that you are already familiar with Nutch, then you +should already be familiar with configuring it. The configuration +is mainly defined in + + /opt/nutch-1.0-dev/conf/nutch-default.xml + +NutchWAX requires the modification of two existing properties and the +addition of two new ones. 
+ +All of the modifications described below can be found in: + + /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml + +You can either apply the configuration changes yourself, or copy that +file to + + /opt/nutch-1.0-dev/conf/nutch-site.xml + +-------------------------------------------------- +plugin.includes +-------------------------------------------------- +Change the list of plugins from: + + protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) + +to + + protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic + +In short, we add: + + index-nutchwax + query-nutchwax + parse-pdf + +and remove: + + urlfilter-regex + urlnormalizer-(pass|regex|basic) + +The only *required* changes are the additions of the NutchWAX index +and query plugins. The rest are optional, but recommended. + +The addition of the "parse-pdf" plugin is simply because we have lots +of PDFs in our archives and we want to index them. We sometimes +remove the "parse-js" plugin if we don't care to index JavaScript +files. + +We also remove the URL filtering and normalizing plugins because we do +not need the URLs normalized nor filtered. We trust that the tool +that produced the ARC/WARC file will have normalized the URLs +contained therein according to its own rules so there's no need to +normalize here. Also, we don't filter by URL since we want to index +as much of the ARC/WARC file as we have parsers for. + +-------------------------------------------------- +mime.type.magic +-------------------------------------------------- +We disable mimetype detection in Nutch for two reasons: + +1. The ARC/WARC file specifies the Content-Type of the document. We + trust that the tool that created the ARC/WARC file got it right. + +2. 
The implementation in Nutch can use a lot of memory as the *entire* + document is read into memory as a byte[], then converted to a + String, then checked against the MIME database. This can lead to + out of memory errors for large files, such as music and video. + +To disable, simply set the property value to false. + + <property> + <name>mime.type.magic</name> + <value>false</value> + </property> + +-------------------------------------------------- +nutchwax.filter.index +-------------------------------------------------- +Configure the 'index-nutchwax' plugin. Specify how the metadata +fields added by the ArcsToSegment are mapped to the Lucene documents +during indexing. + +The specifications here are of the form: + + src-key:lowercase:store:tokenize:dest-key + +where the only required part is the "src-key", the rest will assume +the following defaults: + + lowercase = true + store = true + tokenize = false + dest-key = src-key + +We recommend: + +<property> + <name>nutchwax.filter.index</name> + <value> + arcname:false + collection + date + type + </value> +</property> + +-------------------------------------------------- +nutchwax.filter.query +-------------------------------------------------- +Configure the 'query-nutchwax' plugin. Specify which fields to make +searchable via "[field]:[term|phrase]" query syntax, and whether they +are "raw" fields or not. + +The specification format is + + raw:name:lowercase:boost +or + field:name:boost + +Default values are + + lowercase = true + boost = 1.0f + +There is no "lowercase" property for "field" specification because the +Nutch FieldQueryFilter doesn't expose the option, unlike the +RawFieldQueryFilter. 
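The "nutchwax.filter.query" specification format described above ("raw:name:lowercase:boost" or "field:name:boost", with the defaults lowercase = true and boost = 1.0f) can be illustrated with a small stand-alone Java sketch. This is not the parser used by the 'query-nutchwax' plugin; the class name QueryFieldSpec and its methods are hypothetical and exist only to show how the defaults apply to each entry.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical holder for one entry of the nutchwax.filter.query value. */
final class QueryFieldSpec {
    final boolean raw;        // true for "raw:...", false for "field:..."
    final String name;        // name of the searchable field
    final boolean lowercase;  // only meaningful for "raw" entries
    final float boost;

    QueryFieldSpec(boolean raw, String name, boolean lowercase, float boost) {
        this.raw = raw;
        this.name = name;
        this.lowercase = lowercase;
        this.boost = boost;
    }

    /** Parses one entry, filling in the documented defaults for omitted parts. */
    static QueryFieldSpec parse(String spec) {
        String[] parts = spec.trim().split(":");
        if (parts.length < 2 || parts[1].isEmpty()) {
            throw new IllegalArgumentException("Missing field name: " + spec);
        }
        boolean lowercase = true; // documented default
        float boost = 1.0f;       // documented default
        if ("raw".equals(parts[0])) {
            // raw:name:lowercase:boost
            if (parts.length > 2) lowercase = Boolean.parseBoolean(parts[2]);
            if (parts.length > 3) boost = Float.parseFloat(parts[3]);
            return new QueryFieldSpec(true, parts[1], lowercase, boost);
        }
        if ("field".equals(parts[0])) {
            // field:name:boost ("field" entries have no lowercase part)
            if (parts.length > 2) boost = Float.parseFloat(parts[2]);
            return new QueryFieldSpec(false, parts[1], lowercase, boost);
        }
        throw new IllegalArgumentException("Unknown entry type: " + spec);
    }

    /** Parses a whole property value; entries are separated by whitespace. */
    static List<QueryFieldSpec> parseAll(String value) {
        List<QueryFieldSpec> specs = new ArrayList<QueryFieldSpec>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                specs.add(parse(token));
            }
        }
        return specs;
    }

    public static void main(String[] args) {
        for (QueryFieldSpec s : parseAll("raw:arcname:false raw:collection field:title")) {
            System.out.println((s.raw ? "raw" : "field") + " name=" + s.name
                + " lowercase=" + s.lowercase + " boost=" + s.boost);
        }
    }
}
```

With this reading, "raw:arcname:false" is a raw field that is not lowercased, while "raw:collection" and "field:title" fall back to the defaults.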
+ +NOTE: We do *not* use this filter for handling "date" queries; there is a +specific filter for that: DateQueryFilter. + +We recommend: + +<property> + <name>nutchwax.filter.query</name> + <value> + raw:arcname:false + raw:collection + raw:type + field:anchor + field:content + field:host + field:title + </value> +</property> + + +====================================================================== +Create a manifest +====================================================================== + +The input to NutchWAX's import tool is a manifest file. This is a +simple text file where each line contains a URL to an ARC/WARC file +and an optional collection name. + +For example: + + $ cat > manifest + http://someserver/somepath/somearchive.arc.gz mycollection + ^D + +This creates a simple manifest file with one ARC file and a collection +name of "mycollection". + +You don't have to use collections at all. If you don't know how you +would use them, then simply leave the collection name out here. + + +====================================================================== +Import, Invert and Index +====================================================================== + +The steps to import the files, invert the links and index the documents +are rather simple: + + $ mkdir crawl + $ cd crawl + $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest + $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments + $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments + $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/* + $ ls -F1 + crawldb/ + indexes/ + linkdb/ + segments/ + +To those already familiar with Nutch, these steps should be quite +familiar. + +In the first step, we call NutchWAX's "import" command, which creates the +Nutch segment containing the documents in the ARC/WARC files listed in +the manifest. The rest is the same as regular Nutch. 
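The manifest format described above (one ARC/WARC URL per line, optionally followed by a collection name) can be read with a few lines of Java. This is only an illustrative sketch, not the code used by the NutchWAX "import" tool: the class ManifestEntry is hypothetical, and treating any run of whitespace as the separator and skipping blank lines are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical representation of one manifest line. */
final class ManifestEntry {
    final String url;         // URL of the ARC/WARC file
    final String collection;  // null when the line names no collection

    ManifestEntry(String url, String collection) {
        this.url = url;
        this.collection = collection;
    }

    /** Parses one line; returns null for a blank line (an assumption). */
    static ManifestEntry parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts[0].isEmpty()) {
            return null;
        }
        // The collection name is optional, so it may be absent.
        String collection = parts.length > 1 ? parts[1] : null;
        return new ManifestEntry(parts[0], collection);
    }

    /** Parses all lines of a manifest, skipping blank ones. */
    static List<ManifestEntry> parse(List<String> lines) {
        List<ManifestEntry> entries = new ArrayList<ManifestEntry>();
        for (String line : lines) {
            ManifestEntry entry = parseLine(line);
            if (entry != null) {
                entries.add(entry);
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "http://someserver/somepath/somearchive.arc.gz mycollection",
            "http://someserver/somepath/other.warc.gz");
        for (ManifestEntry e : parse(lines)) {
            System.out.println(e.url + " -> "
                + (e.collection == null ? "(no collection)" : e.collection));
        }
    }
}
```

The second line in the example shows a manifest entry without a collection name, which the HOWTO explicitly allows.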
+ + +====================================================================== +Search +====================================================================== +The resulting indexes can be searched in exactly the same manner as in +regular Nutch. For example, assuming you just completed the steps +above, now: + + $ cd ../ + $ ls -F1 + crawl/ + $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer + +This calls the NutchBean to execute a simple keyword search for +"computer". Use whatever query term you think appears in the +documents you imported. + + +====================================================================== +Web Deployment +====================================================================== + +As users of Nutch are aware, the web application (nutch-1.0-dev.war) +bundled with Nutch contains duplicate copies of the configuration +files. + +So, all patches and configuration changes that we made to the +files in + + /opt/nutch-1.0-dev/conf + +will have to be duplicated in the Nutch webapp when it is deployed. + +This is not due to NutchWAX, this is a "feature" of regular Nutch. I +just thought it would be good to remind everyone since we did make +configuration changes for NutchWAX. Modified: trunk/archive-access/projects/nat/archive/INSTALL.txt =================================================================== --- trunk/archive-access/projects/nat/archive/INSTALL.txt 2008-05-14 00:20:24 UTC (rev 2265) +++ trunk/archive-access/projects/nat/archive/INSTALL.txt 2008-05-21 00:02:01 UTC (rev 2266) @@ -1,236 +1,93 @@ INSTALL.txt -2008-05-06 +2008-05-20 Aaron Binns +This installation guide assumes the reader is already familiar with +building, packaging and deploying Nutch 1.0-dev. -The NutchWAX 0.12 build and installation is as an "add-on" to an -existing Nutch 1.0-dev installation. -NutchWAX 0.12 uses a simple 'ant' build script. The script compiles -the NutchWAX sources, using the libraries in the installed -Nutch-1.0-dev. 
+The NutchWAX 0.12 source and build system are designed to integrate +into the existing Nutch 1.0-dev source and build. -We strongly recommend having *two* Nutch-1.0-dev installation -directories: one that you build NutchWAX against, and another into -which NutchWAX is deployed. +The long-term goal is for the NutchWAX components to be fully +integrated into mainline Nutch. As a stepping-stone toward that goal, +we have packaged the NutchWAX source to be dropped into the Nutch +"contrib" directory and built from there. -NutchWAX is deployed by un-tar'ing the nutchwax-0.12.tar.gz file -*into* an existing Nutch-1.0-dev installation. Think of NutchWAX as -an add-on. We over-write a few Nutch config files, but the rest is -simply added to the existing Nutch-1.0-dev installation. +Like Nutch, NutchWAX 0.12 uses a simple 'ant' build script. The +NutchWAX build script calls out to the Nutch script to build Nutch +proper, then builds NutchWAX components and integrates them into the +Nutch build directory. +We recommend that you execute all build commands from the NutchWAX +directory. This way, NutchWAX will ensure that any and all +dependencies in Nutch will be properly built and kept up-to-date. +Towards this goal, we have duplicated the most common build targets +from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file, +such as: + o compile + o jar + o job + o tar + o clean + +Again, the idea is that if you're already used to building Nutch, you +can easily transition to building Nutch and NutchWAX together. All of +the build artifacts will still be placed in Nutch's 'build' +sub-directory as normal. + + Nutch-1.0-dev ------------- - -As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Now +As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Nutch doesn't have a 1.0 release package yet, so we have to use the -Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12 is +Nutch SVN trunk. 
The specific SVN revision that NutchWAX 0.12 is built against is: 650739 To checkout this revision of Nutch, use: - $ mkdir nutch + $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch $ cd nutch - $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk -To build the nutch-1.0-dev.tar.gz package, use 'ant' - $ cd trunk - $ ant tar +NutchWAX +-------- +Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into +Nutch's "contrib" directory. -This produces + $ cd contrib + $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nat/archive - build/nutch-1.0-dev.tar.gz +This will create a sub-directory named "archive" containing the +NutchWAX sources. -Which we then install *twice* - $ mkdir -p ~/nutchwax-0.12/nutch-1.0-dev - $ tar xfz build/nutch-1.0-dev.tar.gz -C ~/nutchwax-0.12/nutch-1.0-dev - $ mkdir -p /opt/nutch-1.0-dev - $ tar xfz build/nutch-1.0-dev.tar.gz -C /opt/nutch-1.0-dev - -The idea is that we keep /opt/nutch-1.0-dev as our pristine copy which -we compile against, then, when we want to test NutchWAX, we deploy it -into ~/nutchwax-0.12/nutch-1.0-dev. - -Why can't we just use one installation of Nutch? Mainly to avoid -weirdness where we are compiling NutchWAX source against the same set -of libraries where we would be installing NutchWAX. Consider, when we -deploy NutchWAX, we copy the nutchwax.jar into the Nutch 'lib' -directory. If we use that same 'lib' directory for dependencies when -compiling the source, 'ant'/'javac' will likely get confused when -calculating dependencies. - -It's possible that you could successfully go through the -build/test/release cycle using one Nutch-1.0-dev directory, but these -instructions assume you will have two. - - Build and install ----------------- +Assuming you already have the required tool-set for building Nutch, +building NutchWAX is a snap. - 1. 
Install two Nutch-1.0-dev packages per the instructions above. +Simply execute the same 'ant' build command in - 2. Edit build.xml to point to the "pristine" installation of Nutch-1.0-dev + nutch/contrib/archive - <!-- NOTE: Point this to your Nutch 1.0-dev directory --> - <property name="nutch.dir" value="/opt/nutch-1.0-dev" /> +as you normally would and everything will build as normal. - 3. Build NutchWAX-0.12 +For example - $ ant + $ cd nutch/contrib/archive + $ ant tar - The default build rule is "package" which will compile all the source - and build an installation tarball: nutchwax-0.12.tar.gz +This command will build all of Nutch, then the NutchWAX add-ons and +finally will package everything up into the "nutch-1.0-dev.tar.gz" +release package. - The "build.xml" file is pretty straightforward and just grepping - for the targets should be pretty obvious: compile, clean, etc. - - 4. Install NutchWAX into the build/test Nutch installation - - $ tar xfz nutchwax-0.12.tar.gz -C ~/nutchwax-0.12/nutch-1.0-dev - -That's it! - -All we do is add our libraries (nutchwax.jar and dependencies), the -'nutchwax' helper script, plugins for indexing and querying, and a few -config files. - -Except for the config files, no files in the Nutch-1.0-dev -installation are over-written, only added. The "nutch-site.xml" file -is over-written, but that file is empty in a vanilla Nutch -installation, so there's little risk of over-writing something. - - -HOWTO run and test ------------------- - -The 'nutchwax' helper script is installed in the Nutch-1.0-dev 'bin' -directory next to the 'nutch' helper script. - -The 'nutchwax' script is used to run the NutchWAX-specific tools, use -the regular 'nutch' script for regular Nutch activities. - -The 'nutchwax' script runs two tools - - "import" Import a set of .arc/.warc files from a manifest, creating - a Nutch segment. - - "dumpindex" Debug tool that dumps a Lucene index, such as the ones - created by Nutch's "index" tool. 
- -The idea is that the NutchWAX "import" tool supplants the Nutch -generate and fetch cycle. Rather than generating and fetching -segments, we import the .arc/.warc files directly into a newly created -segment. Then, we process that segment just as we normally would with -Nutch. - -For example, - - $ cd nutch-test - $ cat > manifest - http://someserver/foo-bar-baz.arc mycollection - ^D - $ nutch-1.0-dev/bin/nutchwax import manifest - -This will import the arc file listed in the manifest into a newly -created segment. The segment is created by default in a directory -hierarchy of the form: - - segments/[date-timestamp] - -This mirrors the way segments are created in vanilla Nutch by the -"generate" command. - -You can explicitly name the segment if you want, e.g. - - nutchwax import manifest mysegment - -Once the segment is created by the importing of ARC files with -NutchWAX, you can use Nutch to perform the rest of the steps. For +Then, install the "nutch-1.0-dev.tar.gz" tarball as normal. For example: - $ nutch-1.0-dev/bin/nutchwax import manifest - $ nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments - $ nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments - $ nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/* - $ nutch-1.0-dev/bin/nutch merge index indexes - -This is pretty much the minimal set of steps to import and index a set -of ARC files. The crawldb update and link inversion steps are pro -forma and don't have anything to do with NutchWAX specifically, but -are a part of regular Nutch processing. - -Now you have a Nutch "index" directory and are ready to search! - -Searching is done as in vanilla Nutch. Either launch the Nutch webapp -or use the command-line interface to NutchBean to run some test -searches. Nothing NutchWAX-specific here. - - -Miscellaneous notes -------------------- - -1. 
Plugins - -There are two plugins bundled with NutchWAX: - - index-nutchwax - query-nutchwax - -See the "plugin.includes" property in nutch-site.xml to see where -these plugins are added to the filter chain. - -The index-nutchwax plugin ensures that WAX-specific metadata is -transferred from the Nutch Content object to the Lucene Document -object, which is placed in the Lucene index. - -The query-nutchwax plugin is used to process query requests against -those same meta-data fields. It also expands the capabilities of -searching the basic Nutch fields as well. - -2. URL filters - -Nutch's URL filter by default filters-out many common URL oddities -that would normally trip-up Nutch's crawler. However, when importing -content from ARC files, filtering out content probably doesn't make -sense. That is, whatever content made it into the ARC file should be -imported, no matter what the URL looks like. - -To change the URL filter, edit the Nutch file 'conf/regex-urlfilter.txt'. -To pass all content through the filter, remove all filter rules except -for the last one: - - # accept anything else - +. - -3. conf/tika-mimetypes.xml - -NutchWAX comes with a fixed copy of tika-mimetypes.xml. The version -in Nutch revision 650739 has a few bugs in it which cause parsing to -fail for many document types. The fixes are: - - o Move - - <mime-type type="application/xml"> - <alias type="text/xml" /> - <glob pattern="*.xml" /> - </mime-type> - - definition higher up in the file, before the reference to it. - - o Remove - - <mime-type type="application/x-ms-dos-executable"> - <alias type="application/x-dosexec;exe" /> - </mime-type> - - as the ';' character is illegal according to the comments in the - Nutch code. - -The copy of "conf/tika-mimetypes.xml" bundled with NutchWAX fixes -these two bugs. 
+ $ cd /opt + $ tar xvfz nutch-1.0-dev.tar.gz Modified: trunk/archive-access/projects/nat/archive/README.txt =================================================================== --- trunk/archive-access/projects/nat/archive/README.txt 2008-05-14 00:20:24 UTC (rev 2265) +++ trunk/archive-access/projects/nat/archive/README.txt 2008-05-21 00:02:01 UTC (rev 2266) @@ -1,105 +1,122 @@ README.txt -2008-05-06 +2008-05-20 Aaron Binns +Welcome to NutchWAX 0.12! -This is the NutchWAX-0.12 source that John Lee handed-off to me. It -is a work-in-progress. +NutchWAX is a set of add-ons to Nutch in order to index and search +archived web data. -Compared to NutchWAX-0.10 (and earlier) it is *much* simpler. The -main WAX-specific code is in just a few files really: +These add-ons are developed and maintained by the Internet Archive Web +Team in conjunction with a broad community of contributors, partners +and end-users. -src/java/org/archive/nutchwax/ArcsToSegment.java +The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions". - This is the meat of the WAX logic for processing .arc files and - generating Nutch segments. Once we use this to generate a set of - segments for the .arc files, we can use the rest of vanilla - Nutch-1.0-dev to invert links and index the content with Lucene. +Since NutchWAX is a set of add-ons to Nutch, you should already be +familiar with Nutch before using NutchWAX. - This conversion code is heavily edited from: +====================================================================== - nutch-1.0-dev/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java +The goal of NutchWAX is to enable full-text indexing and searching of +documents stored in web archive file formats (ARC and WARC). - taken from the Nutch SVN head (a.k.a the "1.0-dev" in-development). +The way we achieve that goal is by providing add-on tools and plugins +to Nutch to read documents directly from ARC/WARC files. We call this +process "importing" archive files. 
- Ours differs in a few important ways: +Importing produces a Nutch segment, the same as if Nutch had actually +crawled the documents itself. In this scenario, document importing +replaces the conventional "generate/fetch/update" cycle of Nutch. - o Rather than taking a directory with .arc files as input, we take - a manifest file with URLs to .arc files. This way, the manifest - is split up among the distributed Hadoop jobs and the .arc files - are processed in whole by each worker. +Once the archival documents have been imported into a segment, the +regular Nutch commands to update the 'crawldb', invert the links and +index the document contents can proceed as normal. - In the Nutch-1.0-dev, the ArcSegmentCreator.java expects the - input directory to contain the .arc files and (AFAICT) splits - them up and distributes them across the Hadoop workers. This - seems really inefficient to me, I think our approach is much - better -- at least for us. +====================================================================== - o Related to the way input files are split and processed, we use - the standard Archive ARCReader class just like Heritrix and - Wayback. +The NutchWAX add-ons consist of: - The ArcSegmentCreator.java in Nutch-1.0-dev doesn't use our - ARCReader because of licensing incompatibility. Ours is under - GPL and Nutch-1.0-dev forbids the use of GPL code. - - We are in the process of re-licensing or dual-licensing with - Apache License, but until then, our ARCReader code won't be included - in mainline Nutch. + bin/nutchwax - This isn't a problem per se, but worth noting in case anyone - looks at the Nutch-1.0-dev code and wonders why they built their - own (horribly inefficient) .arc reader. + A shell script that is used to run the NutchWAX command-line tools, + such as document importing. - o We add metadata fields to the processed document for WAX-specific - purposes: + This is patterned after the 'bin/nutch' shell script. 
- content.getMetadata().set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() ); - content.getMetadata().set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() ) ; - content.getMetadata().set( NutchWax.COLLECTION_KEY, collection); - content.getMetadata().set( NutchWax.ARCHIVE_DATE_KEY, meta.getDate() ); + plugins/index-nutchwax - The addition of the arcname and collection key is pretty - obvious. I don't know why the content-type isn't added in the - vanilla Nutch-1.0-dev. - - Also, we should review the use of the ARCHIVE_DATE_KEY in that - John Lee mentioned to me that there was possibly duplicate date - fields put in the index: one that is a plain old Java date, and - one that is a 14-digit date string for use with Wayback. + Indexing plugin which adds NutchWAX-specific metadata fields to the + indexed document. -src/java/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/NutchWaxIndexingFilter.java -src/java/plugin/index-nutchwax/plugin.xml + plugins/query-nutchwax - This filter is pretty straightforward. All it does is take the - metadata fields that were added to the document (as described above) - and placed in the Lucene index so that we can make use of them at - search-time. + Query plugin which allows for querying against the metadata fields + added by 'index-nutchwax'. -src/java/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/MultipleFieldQueryFilter.java -src/java/plugin/query-nutchwax/plugin.xml +There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX +is distributed in source code form and is intended to be built in +conjunction with Nutch. - This is a single query filter that can be used for querying single - fields from a single implementation. It does *not* allow for - querying multiple fields as you can already do that via Nutch. +See "INSTALL.txt" for details on building NutchWAX and Nutch. 
- What this filter does is allows one to more-or-less create query - filters in a data-driven manner rather than having to code-up a new - class for each field. That is, before one would have to create a - CollectionQueryFilter class to filter on the "collection" field. - With the MultipleFieldQueryFilter class, you can specify that the - "collection" field is to be filterable via the plugin.xml file and - "nutchwax.filter.query" configuration property. +See "HOWTO.txt" for a quick tutorial on importing, indexing and +searching a set of documents in a web archive file. -src/java/org/archive/nutchwax/NutchWax.java +====================================================================== - Just a simple enum used by the above two classes for the metadata - keys. +This 0.12 release of NutchWAX is radically different in source-code +form compared to the previous release, 0.10. -src/java/org/archive/nutchwax/tools/DumpIndex.java +One of the design goals of 0.12 was to reduce or even eliminate the +"copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX +releases had to copy/paste/edit large chunks of Nutch source code in +order to add the NutchWAX features. - A simple command-line utility to dump the contents of a Lucene - index. Used for debugging. +Also, the NutchWAX 0.12 sources and build are designed to one day be +added into mainline Nutch as a proper "contrib" package; then +eventually be fully integrated into the core Nutch source code. +====================================================================== +Most of the NutchWAX source code is relatively straightfoward to those +already familiar with the inner workings of Nutch. Still, special +attention on one class is worth while: + + src/java/org/archive/nutchwax/ArcsToSegment.java + +This is where ARC/WARC files are read and their documents are imported +into a Nutch segment. + +It is inspired by: + + nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java + +on the Nutch SVN head. 
+ +Our implementation differs in a few important ways: + + o Rather than taking a directory with ARC files as input, we take a + manifest file with URLs to ARC files. This way, the manifest is + split up among the distributed Hadoop jobs and the ARC files are + processed in whole by each worker. + + In the Nutch SVN, the ArcSegmentCreator.java expects the input + directory to contain the ARC files and (AFAICT) splits them up and + distributes them across the Hadoop workers. + + o We use the standard Internet Archive ARCReader and WARCReader + classes. Thus, NutchWAX can read both ARC and WARC files, whereas + the ArcSegmentCreator class can only read ARC files. + + o We add metadata fields to the document, which are then available + to the "index-nutchwax" plugin at indexing-time. + + ArcsToSegment.importRecord() + ... + contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() ); + contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() ); + contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName ); + contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() ); + ... This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-05-27 18:58:15
Revision: 2268 http://archive-access.svn.sourceforge.net/archive-access/?rev=2268&view=rev Author: binzino Date: 2008-05-27 11:58:18 -0700 (Tue, 27 May 2008) Log Message: ----------- Updated license information: header comments, .LICENSE files, LICENSE.txt, etc. Modified Paths: -------------- trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/NutchWax.java trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/tools/DumpIndex.java trunk/archive-access/projects/nat/archive/src/plugin/build.xml trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/plugin.xml trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/plugin.xml trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/DateQueryFilter.java Added Paths: ----------- trunk/archive-access/projects/nat/archive/LICENSE.txt trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE trunk/archive-access/projects/nat/archive/lib/fastutil-5.0.3.LICENSE Added: trunk/archive-access/projects/nat/archive/LICENSE.txt =================================================================== --- trunk/archive-access/projects/nat/archive/LICENSE.txt (rev 0) +++ trunk/archive-access/projects/nat/archive/LICENSE.txt 2008-05-27 18:58:18 UTC (rev 2268) @@ -0,0 +1,519 @@ + +NutchWAX is free software. Except as noted, it is licensed under the +terms of the GNU Lesser General Public License (LGPL), reproduced below. + +Source code derived from Nutch retains the Apache License, as +stipulated by that license. 
+ +Libraries used by NutchWAX are redistributed under their respective +licenses, which can be found in a file with the same name as the +library, suffixed by ".LICENSE". For example, the license for +"foo.jar" can be found in "foo.LICENSE". + +All other files not carrying an explicit license are licensed under +the GNU Lesser General Public License version 2.1 (included below). + +====================================================================== + + GNU LESSER GENERAL PUBLIC LICENSE + Version 2.1, February 1999 + + Copyright (C) 1991, 1999 Free Software Foundation, Inc. + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + +[This is the first released version of the Lesser GPL. It also counts + as the successor of the GNU Library Public License, version 2, hence + the version number 2.1.] + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +Licenses are intended to guarantee your freedom to share and change +free software--to make sure the software is free for all its users. + + This license, the Lesser General Public License, applies to some +specially designated software packages--typically libraries--of the +Free Software Foundation and other authors who decide to use it. You +can use it too, but we suggest you first think carefully about whether +this license or the ordinary General Public License is the better +strategy to use in any particular case, based on the explanations below. + + When we speak of free software, we are referring to freedom of use, +not price. 
Our General Public Licenses are designed to make sure that +you have the freedom to distribute copies of free software (and charge +for this service if you wish); that you receive source code or can get +it if you want it; that you can change the software and use pieces of +it in new free programs; and that you are informed that you can do +these things. + + To protect your rights, we need to make restrictions that forbid +distributors to deny you these rights or to ask you to surrender these +rights. These restrictions translate to certain responsibilities for +you if you distribute copies of the library or if you modify it. + + For example, if you distribute copies of the library, whether gratis +or for a fee, you must give the recipients all the rights that we gave +you. You must make sure that they, too, receive or can get the source +code. If you link other code with the library, you must provide +complete object files to the recipients, so that they can relink them +with the library after making changes to the library and recompiling +it. And you must show them these terms so they know their rights. + + We protect your rights with a two-step method: (1) we copyright the +library, and (2) we offer you this license, which gives you legal +permission to copy, distribute and/or modify the library. + + To protect each distributor, we want to make it very clear that +there is no warranty for the free library. Also, if the library is +modified by someone else and passed on, the recipients should know +that what they have is not the original version, so that the original +author's reputation will not be affected by problems that might be +introduced by others. + + Finally, software patents pose a constant threat to the existence of +any free program. We wish to make sure that a company cannot +effectively restrict the users of a free program by obtaining a +restrictive license from a patent holder. 
Therefore, we insist that +any patent license obtained for a version of the library must be +consistent with the full freedom of use specified in this license. + + Most GNU software, including some libraries, is covered by the +ordinary GNU General Public License. This license, the GNU Lesser +General Public License, applies to certain designated libraries, and +is quite different from the ordinary General Public License. We use +this license for certain libraries in order to permit linking those +libraries into non-free programs. + + When a program is linked with a library, whether statically or using +a shared library, the combination of the two is legally speaking a +combined work, a derivative of the original library. The ordinary +General Public License therefore permits such linking only if the +entire combination fits its criteria of freedom. The Lesser General +Public License permits more lax criteria for linking other code with +the library. + + We call this license the "Lesser" General Public License because it +does Less to protect the user's freedom than the ordinary General +Public License. It also provides other free software developers Less +of an advantage over competing non-free programs. These disadvantages +are the reason we use the ordinary General Public License for many +libraries. However, the Lesser license provides advantages in certain +special circumstances. + + For example, on rare occasions, there may be a special need to +encourage the widest possible use of a certain library, so that it becomes +a de-facto standard. To achieve this, non-free programs must be +allowed to use the library. A more frequent case is that a free +library does the same job as widely used non-free libraries. In this +case, there is little to gain by limiting the free library to free +software only, so we use the Lesser General Public License. 
+ + In other cases, permission to use a particular library in non-free +programs enables a greater number of people to use a large body of +free software. For example, permission to use the GNU C Library in +non-free programs enables many more people to use the whole GNU +operating system, as well as its variant, the GNU/Linux operating +system. + + Although the Lesser General Public License is Less protective of the +users' freedom, it does ensure that the user of a program that is +linked with the Library has the freedom and the wherewithal to run +that program using a modified version of the Library. + + The precise terms and conditions for copying, distribution and +modification follow. Pay close attention to the difference between a +"work based on the library" and a "work that uses the library". The +former contains code derived from the library, whereas the latter must +be combined with the library in order to run. + + GNU LESSER GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License Agreement applies to any software library or other +program which contains a notice placed by the copyright holder or +other authorized party saying it may be distributed under the terms of +this Lesser General Public License (also called "this License"). +Each licensee is addressed as "you". + + A "library" means a collection of software functions and/or data +prepared so as to be conveniently linked with application programs +(which use some of those functions and data) to form executables. + + The "Library", below, refers to any such software library or work +which has been distributed under these terms. A "work based on the +Library" means either the Library or any derivative work under +copyright law: that is to say, a work containing the Library or a +portion of it, either verbatim or with modifications and/or translated +straightforwardly into another language. 
(Hereinafter, translation is +included without limitation in the term "modification".) + + "Source code" for a work means the preferred form of the work for +making modifications to it. For a library, complete source code means +all the source code for all modules it contains, plus any associated +interface definition files, plus the scripts used to control compilation +and installation of the library. + + Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running a program using the Library is not restricted, and output from +such a program is covered only if its contents constitute a work based +on the Library (independent of the use of the Library in a tool for +writing it). Whether that is true depends on what the Library does +and what the program that uses the Library does. + + 1. You may copy and distribute verbatim copies of the Library's +complete source code as you receive it, in any medium, provided that +you conspicuously and appropriately publish on each copy an +appropriate copyright notice and disclaimer of warranty; keep intact +all the notices that refer to this License and to the absence of any +warranty; and distribute a copy of this License along with the +Library. + + You may charge a fee for the physical act of transferring a copy, +and you may at your option offer warranty protection in exchange for a +fee. + + 2. You may modify your copy or copies of the Library or any portion +of it, thus forming a work based on the Library, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) The modified work must itself be a software library. + + b) You must cause the files modified to carry prominent notices + stating that you changed the files and the date of any change. 
+ + c) You must cause the whole of the work to be licensed at no + charge to all third parties under the terms of this License. + + d) If a facility in the modified Library refers to a function or a + table of data to be supplied by an application program that uses + the facility, other than as an argument passed when the facility + is invoked, then you must make a good faith effort to ensure that, + in the event an application does not supply such function or + table, the facility still operates, and performs whatever part of + its purpose remains meaningful. + + (For example, a function in a library to compute square roots has + a purpose that is entirely well-defined independent of the + application. Therefore, Subsection 2d requires that any + application-supplied function or table used by this function must + be optional: if the application does not supply it, the square + root function must still compute square roots.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Library, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Library, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote +it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Library. 
+ +In addition, mere aggregation of another work not based on the Library +with the Library (or with a work based on the Library) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may opt to apply the terms of the ordinary GNU General Public +License instead of this License to a given copy of the Library. To do +this, you must alter all the notices that refer to this License, so +that they refer to the ordinary GNU General Public License, version 2, +instead of to this License. (If a newer version than version 2 of the +ordinary GNU General Public License has appeared, then you can specify +that version instead if you wish.) Do not make any other change in +these notices. + + Once this change is made in a given copy, it is irreversible for +that copy, so the ordinary GNU General Public License applies to all +subsequent copies and derivative works made from that copy. + + This option is useful when you wish to copy part of the code of +the Library into a program that is not a library. + + 4. You may copy and distribute the Library (or a portion or +derivative of it, under Section 2) in object code or executable form +under the terms of Sections 1 and 2 above provided that you accompany +it with the complete corresponding machine-readable source code, which +must be distributed under the terms of Sections 1 and 2 above on a +medium customarily used for software interchange. + + If distribution of object code is made by offering access to copy +from a designated place, then offering equivalent access to copy the +source code from the same place satisfies the requirement to +distribute the source code, even though third parties are not +compelled to copy the source along with the object code. + + 5. A program that contains no derivative of any portion of the +Library, but is designed to work with the Library by being compiled or +linked with it, is called a "work that uses the Library". 
Such a +work, in isolation, is not a derivative work of the Library, and +therefore falls outside the scope of this License. + + However, linking a "work that uses the Library" with the Library +creates an executable that is a derivative of the Library (because it +contains portions of the Library), rather than a "work that uses the +library". The executable is therefore covered by this License. +Section 6 states terms for distribution of such executables. + + When a "work that uses the Library" uses material from a header file +that is part of the Library, the object code for the work may be a +derivative work of the Library even though the source code is not. +Whether this is true is especially significant if the work can be +linked without the Library, or if the work is itself a library. The +threshold for this to be true is not precisely defined by law. + + If such an object file uses only numerical parameters, data +structure layouts and accessors, and small macros and small inline +functions (ten lines or less in length), then the use of the object +file is unrestricted, regardless of whether it is legally a derivative +work. (Executables containing this object code plus portions of the +Library will still fall under Section 6.) + + Otherwise, if the work is a derivative of the Library, you may +distribute the object code for the work under the terms of Section 6. +Any executables containing that work also fall under Section 6, +whether or not they are linked directly with the Library itself. + + 6. As an exception to the Sections above, you may also combine or +link a "work that uses the Library" with the Library to produce a +work containing portions of the Library, and distribute that work +under terms of your choice, provided that the terms permit +modification of the work for the customer's own use and reverse +engineering for debugging such modifications. 
+ + You must give prominent notice with each copy of the work that the +Library is used in it and that the Library and its use are covered by +this License. You must supply a copy of this License. If the work +during execution displays copyright notices, you must include the +copyright notice for the Library among them, as well as a reference +directing the user to the copy of this License. Also, you must do one +of these things: + + a) Accompany the work with the complete corresponding + machine-readable source code for the Library including whatever + changes were used in the work (which must be distributed under + Sections 1 and 2 above); and, if the work is an executable linked + with the Library, with the complete machine-readable "work that + uses the Library", as object code and/or source code, so that the + user can modify the Library and then relink to produce a modified + executable containing the modified Library. (It is understood + that the user who changes the contents of definitions files in the + Library will not necessarily be able to recompile the application + to use the modified definitions.) + + b) Use a suitable shared library mechanism for linking with the + Library. A suitable mechanism is one that (1) uses at run time a + copy of the library already present on the user's computer system, + rather than copying library functions into the executable, and (2) + will operate properly with a modified version of the library, if + the user installs one, as long as the modified version is + interface-compatible with the version that the work was made with. + + c) Accompany the work with a written offer, valid for at + least three years, to give the same user the materials + specified in Subsection 6a, above, for a charge no more + than the cost of performing this distribution. 
+ + d) If distribution of the work is made by offering access to copy + from a designated place, offer equivalent access to copy the above + specified materials from the same place. + + e) Verify that the user has already received a copy of these + materials or that you have already sent this user a copy. + + For an executable, the required form of the "work that uses the +Library" must include any data and utility programs needed for +reproducing the executable from it. However, as a special exception, +the materials to be distributed need not include anything that is +normally distributed (in either source or binary form) with the major +components (compiler, kernel, and so on) of the operating system on +which the executable runs, unless that component itself accompanies +the executable. + + It may happen that this requirement contradicts the license +restrictions of other proprietary libraries that do not normally +accompany the operating system. Such a contradiction means you cannot +use both them and the Library together in an executable that you +distribute. + + 7. You may place library facilities that are a work based on the +Library side-by-side in a single library together with other library +facilities not covered by this License, and distribute such a combined +library, provided that the separate distribution of the work based on +the Library and of the other library facilities is otherwise +permitted, and provided that you do these two things: + + a) Accompany the combined library with a copy of the same work + based on the Library, uncombined with any other library + facilities. This must be distributed under the terms of the + Sections above. + + b) Give prominent notice with the combined library of the fact + that part of it is a work based on the Library, and explaining + where to find the accompanying uncombined form of the same work. + + 8. 
You may not copy, modify, sublicense, link with, or distribute +the Library except as expressly provided under this License. Any +attempt otherwise to copy, modify, sublicense, link with, or +distribute the Library is void, and will automatically terminate your +rights under this License. However, parties who have received copies, +or rights, from you under this License will not have their licenses +terminated so long as such parties remain in full compliance. + + 9. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Library or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Library (or any work based on the +Library), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Library or works based on it. + + 10. Each time you redistribute the Library (or any work based on the +Library), the recipient automatically receives a license from the +original licensor to copy, distribute, link with or modify the Library +subject to these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties with +this License. + + 11. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Library at all. 
For example, if a patent +license would not permit royalty-free redistribution of the Library by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Library. + +If any portion of this section is held invalid or unenforceable under any +particular circumstance, the balance of the section is intended to apply, +and the section as a whole is intended to apply in other circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 12. If the distribution and/or use of the Library is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Library under this License may add +an explicit geographical distribution limitation excluding those countries, +so that distribution is permitted only in or among countries not thus +excluded. In such case, this License incorporates the limitation as if +written in the body of this License. + + 13. The Free Software Foundation may publish revised and/or new +versions of the Lesser General Public License from time to time. 
+Such new versions will be similar in spirit to the present version, +but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library +specifies a version number of this License which applies to it and +"any later version", you have the option of following the terms and +conditions either of that version or of any later version published by +the Free Software Foundation. If the Library does not specify a +license version number, you may choose any version ever published by +the Free Software Foundation. + + 14. If you wish to incorporate parts of the Library into other free +programs whose distribution conditions are incompatible with these, +write to the author to ask for permission. For software which is +copyrighted by the Free Software Foundation, write to the Free +Software Foundation; we sometimes make exceptions for this. Our +decision will be guided by the two goals of preserving the free status +of all derivatives of our free software and of promoting the sharing +and reuse of software generally. + + NO WARRANTY + + 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO +WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. +EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR +OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY +KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE +LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME +THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN +WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY +AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU +FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR +CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE +LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING +RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A +FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF +SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH +DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Libraries + + If you develop a new library, and you want it to be of the greatest +possible use to the public, we recommend making it free software that +everyone can redistribute and change. You can do so by permitting +redistribution under these terms (or, alternatively, under the terms of the +ordinary General Public License). + + To apply these terms, attach the following notices to the library. It is +safest to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least the +"copyright" line and a pointer to where the full notice is found. + + <one line to give the library's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + +Also add information on how to contact you by electronic and paper mail. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the library, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the + library `Frob' (a library for tweaking knobs) written by James Random Hacker. + + <signature of Ty Coon>, 1 April 1990 + Ty Coon, President of Vice + +That's all there is to it! Added: trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE =================================================================== --- trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE (rev 0) +++ trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE 2008-05-27 18:58:18 UTC (rev 2268) @@ -0,0 +1,504 @@ + GNU LESSER GENERAL PUBLIC LICENSE + Version 2.1, February 1999 + + Copyright (C) 1991, 1999 Free Software Foundation, Inc. + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + +[This is the first released version of the Lesser GPL. It also counts + as the successor of the GNU Library Public License, version 2, hence + the version number 2.1.] + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +Licenses are intended to guarantee your freedom to share and change +free software--to make sure the software is free for all its users. 
+ + This license, the Lesser General Public License, applies to some +specially designated software packages--typically libraries--of the +Free Software Foundation and other authors who decide to use it. You +can use it too, but we suggest you first think carefully about whether +this license or the ordinary General Public License is the better +strategy to use in any particular case, based on the explanations below. + + When we speak of free software, we are referring to freedom of use, +not price. Our General Public Licenses are designed to make sure that +you have the freedom to distribute copies of free software (and charge +for this service if you wish); that you receive source code or can get +it if you want it; that you can change the software and use pieces of +it in new free programs; and that you are informed that you can do +these things. + + To protect your rights, we need to make restrictions that forbid +distributors to deny you these rights or to ask you to surrender these +rights. These restrictions translate to certain responsibilities for +you if you distribute copies of the library or if you modify it. + + For example, if you distribute copies of the library, whether gratis +or for a fee, you must give the recipients all the rights that we gave +you. You must make sure that they, too, receive or can get the source +code. If you link other code with the library, you must provide +complete object files to the recipients, so that they can relink them +with the library after making changes to the library and recompiling +it. And you must show them these terms so they know their rights. + + We protect your rights with a two-step method: (1) we copyright the +library, and (2) we offer you this license, which gives you legal +permission to copy, distribute and/or modify the library. + + To protect each distributor, we want to make it very clear that +there is no warranty for the free library. 
Also, if the library is +modified by someone else and passed on, the recipients should know +that what they have is not the original version, so that the original +author's reputation will not be affected by problems that might be +introduced by others. + + Finally, software patents pose a constant threat to the existence of +any free program. We wish to make sure that a company cannot +effectively restrict the users of a free program by obtaining a +restrictive license from a patent holder. Therefore, we insist that +any patent license obtained for a version of the library must be +consistent with the full freedom of use specified in this license. + + Most GNU software, including some libraries, is covered by the +ordinary GNU General Public License. This license, the GNU Lesser +General Public License, applies to certain designated libraries, and +is quite different from the ordinary General Public License. We use +this license for certain libraries in order to permit linking those +libraries into non-free programs. + + When a program is linked with a library, whether statically or using +a shared library, the combination of the two is legally speaking a +combined work, a derivative of the original library. The ordinary +General Public License therefore permits such linking only if the +entire combination fits its criteria of freedom. The Lesser General +Public License permits more lax criteria for linking other code with +the library. + + We call this license the "Lesser" General Public License because it +does Less to protect the user's freedom than the ordinary General +Public License. It also provides other free software developers Less +of an advantage over competing non-free programs. These disadvantages +are the reason we use the ordinary General Public License for many +libraries. However, the Lesser license provides advantages in certain +special circumstances. 
+ + For example, on rare occasions, there may be a special need to +encourage the widest possible use of a certain library, so that it becomes +a de-facto standard. To achieve this, non-free programs must be +allowed to use the library. A more frequent case is that a free +library does the same job as widely used non-free libraries. In this +case, there is little to gain by limiting the free library to free +software only, so we use the Lesser General Public License. + + In other cases, permission to use a particular library in non-free +programs enables a greater number of people to use a large body of +free software. For example, permission to use the GNU C Library in +non-free programs enables many more people to use the whole GNU +operating system, as well as its variant, the GNU/Linux operating +system. + + Although the Lesser General Public License is Less protective of the +users' freedom, it does ensure that the user of a program that is +linked with the Library has the freedom and the wherewithal to run +that program using a modified version of the Library. + + The precise terms and conditions for copying, distribution and +modification follow. Pay close attention to the difference between a +"work based on the library" and a "work that uses the library". The +former contains code derived from the library, whereas the latter must +be combined with the library in order to run. + + GNU LESSER GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License Agreement applies to any software library or other +program which contains a notice placed by the copyright holder or +other authorized party saying it may be distributed under the terms of +this Lesser General Public License (also called "this License"). +Each licensee is addressed as "you". 
+ + A "library" means a collection of software functions and/or data +prepared so as to be conveniently linked with application programs +(which use some of those functions and data) to form executables. + + The "Library", below, refers to any such software library or work +which has been distributed under these terms. A "work based on the +Library" means either the Library or any derivative work under +copyright law: that is to say, a work containing the Library or a +portion of it, either verbatim or with modifications and/or translated +straightforwardly into another language. (Hereinafter, translation is +included without limitation in the term "modification".) + + "Source code" for a work means the preferred form of the work for +making modifications to it. For a library, complete source code means +all the source code for all modules it contains, plus any associated +interface definition files, plus the scripts used to control compilation +and installation of the library. + + Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running a program using the Library is not restricted, and output from +such a program is covered only if its contents constitute a work based +on the Library (independent of the use of the Library in a tool for +writing it). Whether that is true depends on what the Library does +and what the program that uses the Library does. + + 1. You may copy and distribute verbatim copies of the Library's +complete source code as you receive it, in any medium, provided that +you conspicuously and appropriately publish on each copy an +appropriate copyright notice and disclaimer of warranty; keep intact +all the notices that refer to this License and to the absence of any +warranty; and distribute a copy of this License along with the +Library. 
+ + You may charge a fee for the physical act of transferring a copy, +and you may at your option offer warranty protection in exchange for a +fee. + + 2. You may modify your copy or copies of the Library or any portion +of it, thus forming a work based on the Library, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) The modified work must itself be a software library. + + b) You must cause the files modified to carry prominent notices + stating that you changed the files and the date of any change. + + c) You must cause the whole of the work to be licensed at no + charge to all third parties under the terms of this License. + + d) If a facility in the modified Library refers to a function or a + table of data to be supplied by an application program that uses + the facility, other than as an argument passed when the facility + is invoked, then you must make a good faith effort to ensure that, + in the event an application does not supply such function or + table, the facility still operates, and performs whatever part of + its purpose remains meaningful. + + (For example, a function in a library to compute square roots has + a purpose that is entirely well-defined independent of the + application. Therefore, Subsection 2d requires that any + application-supplied function or table used by this function must + be optional: if the application does not supply it, the square + root function must still compute square roots.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Library, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. 
But when you +distribute the same sections as part of a whole which is a work based +on the Library, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote +it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Library. + +In addition, mere aggregation of another work not based on the Library +with the Library (or with a work based on the Library) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may opt to apply the terms of the ordinary GNU General Public +License instead of this License to a given copy of the Library. To do +this, you must alter all the notices that refer to this License, so +that they refer to the ordinary GNU General Public License, version 2, +instead of to this License. (If a newer version than version 2 of the +ordinary GNU General Public License has appeared, then you can specify +that version instead if you wish.) Do not make any other change in +these notices. + + Once this change is made in a given copy, it is irreversible for +that copy, so the ordinary GNU General Public License applies to all +subsequent copies and derivative works made from that copy. + + This option is useful when you wish to copy part of the code of +the Library into a program that is not a library. + + 4. 
You may copy and distribute the Library (or a portion or +derivative of it, under Section 2) in object code or executable form +under the terms of Sections 1 and 2 above provided that you accompany +it with the complete corresponding machine-readable source code, which +must be distributed under the terms of Sections 1 and 2 above on a +medium customarily used for software interchange. + + If distribution of object code is made by offering access to copy +from a designated place, then offering equivalent access to copy the +source code from the same place satisfies the requirement to +distribute the source code, even though third parties are not +compelled to copy the source along with the object code. + + 5. A program that contains no derivative of any portion of the +Library, but is designed to work with the Library by being compiled or +linked with it, is called a "work that uses the Library". Such a +work, in isolation, is not a derivative work of the Library, and +therefore falls outside the scope of this License. + + However, linking a "work that uses the Library" with the Library +creates an executable that is a derivative of the Library (because it +contains portions of the Library), rather than a "work that uses the +library". The executable is therefore covered by this License. +Section 6 states terms for distribution of such executables. + + When a "work that uses the Library" uses material from a header file +that is part of the Library, the object code for the work may be a +derivative work of the Library even though the source code is not. +Whether this is true is especially significant if the work can be +linked without the Library, or if the work is itself a library. The +threshold for this to be true is not precisely defined by law. 
+ + If such an object file uses only numerical parameters, data +structure layouts and accessors, and small macros and small inline +functions (ten lines or less in length), then the use of the object +file is unrestricted, regardless of whether it is legally a derivative +work. (Executables containing this object code plus portions of the +Library will still fall under Section 6.) + + Otherwise, if the work is a derivative of the Library, you may +distribute the object code for the work under the terms of Section 6. +Any executables containing that work also fall under Section 6, +whether or not they are linked directly with the Library itself. + + 6. As an exception to the Sections above, you may also combine or +link a "work that uses the Library" with the Library to produce a +work containing portions of the Library, and distribute that work +under terms of your choice, provided that the terms permit +modification of the work for the customer's own use and reverse +engineering for debugging such modifications. + + You must give prominent notice with each copy of the work that the +Library is used in it and that the Library and its use are covered by +this License. You must supply a copy of this License. If the work +during execution displays copyright notices, you must include the +copyright notice for the Library among them, as well as a reference +directing the user to the copy of this License. Also, you must do one +of these things: + + a) Accompany the work with the complete corresponding + machine-readable source code for the Library including whatever + changes were used in the work (which must be distributed under + Sections 1 and 2 above); and, if the work is an executable linked + with the Library, with the complete machine-readable "work that + uses the Library", as object code and/or source code, so that the + user can modify the Library and then relink to produce a modified + executable containing the modified Library. 
(It is understood + that the user who changes the contents of definitions files in the + Library will not necessarily be able to recompile the application + to use the modified definitions.) + + b) Use a suitable shared library mechanism for linking with the + Library. A suitable mechanism is one that (1) uses at run time a + copy of the library already present on the user's computer system, + rather than copying library functions into the executable, and (2) + will operate properly with a modified version of the library, if + the user installs one, as long as the modified version is + interface-compatible with the version that the work was made with. + + c) Accompany the work with a written offer, valid for at + least three years, to give the same user the materials + specified in Subsection 6a, above, for a charge no more + than the cost of performing this distribution. + + d) If distribution of the work is made by offering access to copy + from a designated place, offer equivalent access to copy the above + specified materials from the same place. + + e) Verify that the user has already received a copy of these + materials or that you have already sent this user a copy. + + For an executable, the required form of the "work that uses the +Library" must include any data and utility programs needed for +reproducing the executable from it. However, as a special exception, +the materials to be distributed need not include anything that is +normally distributed (in either source or binary form) with the major +components (compiler, kernel, and so on) of the operating system on +which the executable runs, unless that component itself accompanies +the executable. + + It may happen that this requirement contradicts the license +restrictions of other proprietary libraries that do not normally +accompany the operating system. Such a contradiction means you cannot +use both them and the Library together in an executable that you +distribute. + + 7. 
You may place library facilities that are a work based on the +Library side-by-side in a single library together with other library +facilities not covered by this License, and distribute such a combined +library, provided that the separate distribution of the work based on +the Library and of the other library facilities is otherwise +permitted, and provided that you do these two things: + + a) Accompany the combined library with a copy of the same work + based on the Library, uncombined with any other library + facilities. This must be distributed under the terms of the + Sections above. + + b) Give prominent notice with the combined library of the fact + that part of it is a work based on the Library, and explaining + where to find the accompanying uncombined form of the same work. + + 8. You may not copy, modify, sublicense, link with, or distribute +the Library except as expressly provided under this License. Any +attempt otherwise to copy, modify, sublicense, link with, or +distribute the Library is void, and will automatically terminate your +rights under this License. However, parties who have received copies, +or rights, from you under this License will not have their licenses +terminated so long as such parties remain in full compliance. + + 9. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Library or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Library (or any work based on the +Library), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Library or works based on it. + + 10. 
Each time you redistribute the Library (or any work based on the +Library), the recipient automatically receives a license from the +original licensor to copy, distribute, link with or modify the Library +subject to these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties with +this License. + + 11. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Library at all. For example, if a patent +license would not permit royalty-free redistribution of the Library by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Library. + +If any portion of this section is held invalid or unenforceable under any +particular circumstance, the balance of the section is intended to apply, +and the section as a whole is intended to apply in other circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system which is +implemented by public license practices. 
Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 12. If the distribution and/or use of the Library is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Library under this License may add +an explicit geographical distribution limitation excluding those countries, +so that distribution is permitted only in or among countries not thus +excluded. In such case, this License incorporates the limitation as if +written in the body of this License. + + 13. The Free Software Foundation may publish revised and/or new +versions of the Lesser General Public License from time to time. +Such new versions will be similar in spirit to the present version, +but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library +specifies a version number of this License which applies to it and +"any later version", you have the option of following the terms and +conditions either of that version or of any later version published by +the Free Software Foundation. If the Library does not specify a +license version number, you may choose any version ever published by +the Free Software Foundation. + + 14. If you wish to incorporate parts of the Library into other free +programs whose distribution conditions are incompatible with these, +write to the author to ask for permission. For software which is +copyrighted by the Free Software Foundation, write to the Free +Software Foundation; we sometimes make exceptions for this. 
Our +decision will be guided by the two goals of preserving the free status +of all derivatives of our free software and of promoting the sharing +and reuse of software generally. + + NO WARRANTY + + 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO +WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. +EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR +OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY +KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE +LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME +THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN +WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY +AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU +FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR +CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE +LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING +RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A +FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF +SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH +DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Libraries + + If you develop a new library, and you want it to be of the greatest +possible use to the public, we recommend making it free software that +everyone can redistribute and change. You can do so by permitting +redistribution under these terms (or, alternatively, under the terms of the +ordinary General Public License). + + To apply these terms, attach the following notices to the library. 
It is +safest to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least the +"copyright" line and a pointer to where the full notice is found. + + <one line to give the library's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +Also add information on how to contact you by electronic and paper mail. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the library, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the + library `Frob' (a library for tweaking knobs) written by James Random Hacker. + + <signature of Ty Coon>, 1 April 1990 + Ty Coon, President of Vice + +That's all there is to it! 
+ + Added: trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE =================================================================== --- trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE (rev 0) +++ trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE 2008-05-27 18:58:18 UTC (rev 2268) @@ -0,0 +1,176 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. 
+ + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the ... [truncated message content] |