From: <bi...@us...> - 2008-07-28 21:58:22
|
Revision: 2515 http://archive-access.svn.sourceforge.net/archive-access/?rev=2515&view=rev Author: binzino Date: 2008-07-28 21:58:30 +0000 (Mon, 28 Jul 2008) Log Message: ----------- Creation of NutchWAX 0.12.1 release tag. Added Paths: ----------- tags/nutchwax-0_12_1/ tags/nutchwax-0_12_1/archive/ |
From: <bi...@us...> - 2008-10-13 18:08:38
|
Revision: 2610 http://archive-access.svn.sourceforge.net/archive-access/?rev=2610&view=rev Author: binzino Date: 2008-10-13 18:08:29 +0000 (Mon, 13 Oct 2008) Log Message: ----------- Ooops, trying to abort this copy/delete... Added Paths: ----------- tags/nutchwax-0_12_2/ Removed Paths: ------------- tags/nutchwax-0_12_2/archive/HOWTO-dedup.txt tags/nutchwax-0_12_2/archive/HOWTO-xslt.txt tags/nutchwax-0_12_2/archive/HOWTO.txt tags/nutchwax-0_12_2/archive/INSTALL.txt tags/nutchwax-0_12_2/archive/LICENSE.txt tags/nutchwax-0_12_2/archive/README-dedup.txt tags/nutchwax-0_12_2/archive/README.txt tags/nutchwax-0_12_2/archive/RELEASE-NOTES.txt tags/nutchwax-0_12_2/archive/bin/ tags/nutchwax-0_12_2/archive/build.xml tags/nutchwax-0_12_2/archive/conf/ tags/nutchwax-0_12_2/archive/lib/ tags/nutchwax-0_12_2/archive/src/ tags/nutchwax-0_12_2/imagesearch/README.txt tags/nutchwax-0_12_2/imagesearch/bin/ tags/nutchwax-0_12_2/imagesearch/build.xml tags/nutchwax-0_12_2/imagesearch/conf/ tags/nutchwax-0_12_2/imagesearch/lib/ tags/nutchwax-0_12_2/imagesearch/src/ Property changes on: tags/nutchwax-0_12_2 ___________________________________________________________________ Added: svn:mergeinfo + Deleted: tags/nutchwax-0_12_2/archive/HOWTO-dedup.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt 2008-10-13 17:48:52 UTC (rev 2609) +++ tags/nutchwax-0_12_2/archive/HOWTO-dedup.txt 2008-10-13 18:08:29 UTC (rev 2610) @@ -1,323 +0,0 @@ - -HOWTO-dedup.txt -2008-07-03 -Aaron Binns - -Table of Contents - o Prerequisites - - NutchWAX HOWTO.txt - - Wayback 1.2.1 - o Overview - o Generate CDX - o Generate DUP - o Import - o Update and Invert - o Index - o Add Revisit Dates - o Search - o Web deployment - - -====================================================================== -Prerequisites -====================================================================== - -This de-duplication HOWTO assumes you've 
already read the main HOWTO -and are familiar with importing and indexing archive files with -NutchWAX. - -For de-duplication, the Wayback Machine tools are required. This guide -assumes you have Wayback 1.2.1 installed in - - /opt/wayback-1.2.1 - - -====================================================================== -Overview -====================================================================== - -The README-dedup.txt explains the de-duplication process in greater -detail, including implementation details. - -NutchWAX does not automagically detect and eliminate duplicate records -when importing and indexing. However, tools are provided to help the -user implement a system to perform de-duplication. - -This guide describes one such system using the tools provided by -NutchWAX and Wayback. - - -====================================================================== -Generate CDX -====================================================================== - -The first step is to generate a list of duplicate records for a set of -ARC files. - -This step is not necessary if your archive files are in WARC format -and de-duplication was performed during the crawl. - -To generate the list of duplicates, we use the Wayback 'arc-indexer' -with the NutchWAX 'dedup-cdx' utility. The CDX files *must* be -sorted. - - $ arc-indexer foo.arc.gz | sort > foo.cdx - $ arc-indexer bar.arc.gz | sort > bar.cdx - $ arc-indexer baz.arc.gz | sort > baz.cdx - -Then we combine the CDX files into one sorted CDX containing all the -records: - - $ sort -m foo.cdx bar.cdx baz.cdx > all.cdx - -The "-m" option speeds up the sort by merging the already-sorted -files. 
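A note on the merge step: 'sort -m' only produces a correctly sorted
result when every input file was sorted under the same collation
rules. The following is a minimal sketch, assuming byte-wise ordering
via LC_ALL=C suits your CDX tooling (that setting is an assumption,
not something this HOWTO states). The sample files and their contents
are made up and merely stand in for real 'arc-indexer' output.

```shell
# Hypothetical sample data standing in for 'arc-indexer' output.
workdir=$(mktemp -d)
printf '%s\n' 'example.org/b 20080618133034' 'example.org/a 20080613194800' > "$workdir/foo.raw"
printf '%s\n' 'example.org/c 20080616061312' 'example.org/a 20080619132911' > "$workdir/bar.raw"

# Sort each file under a fixed locale so all files share one ordering.
LC_ALL=C sort "$workdir/foo.raw" > "$workdir/foo.cdx"
LC_ALL=C sort "$workdir/bar.raw" > "$workdir/bar.cdx"

# Merge the pre-sorted files under the same locale.
LC_ALL=C sort -m "$workdir/foo.cdx" "$workdir/bar.cdx" > "$workdir/all.cdx"

# 'sort -c' exits non-zero if the merged file is not in sorted order.
LC_ALL=C sort -c "$workdir/all.cdx" && echo "all.cdx is sorted"
```

Running 'sort -c' on the merged file is a cheap check to make before
handing it to 'dedup-cdx'.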
- - -====================================================================== -Generate DUP/Revisits -====================================================================== - -Now that we have 'all.cdx' containing a sorted list of all the records -in the ARC files, we can generate a list of duplicates therein: - - $ dedup-cdx all.cdx > all.dup - -This "all.dup" file contains lines of the form - - example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911 - -Where each line is - - URL digest date - -This file is then used as an exclusion filter for importing. - - -WARC ----- -If we are using WARC files with revisit records instead of ARC files, -then we don't generate a list of duplicate records because there -shouldn't be any. - -However, the revisit records in the WARC files do have the dates when -a URL was revisited and seen to have not changed -- which is more or -less the same thing as our "dup" lines above. - -For extracting these revisits from WARC CDX files, we use the -'revisits' utility provided by NutchWAX - - $ revisits all-warc.cdx > all-warc.dup - -The output of 'revisits' is in the same format as 'dedup-cdx'. - - -====================================================================== -Import -====================================================================== - -The import process is essentially the same as in NutchWAX, but now -we use "all.dup" as our exclusion list. 
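Before importing, it can be worth sanity-checking the exclusion list.
The following is a minimal sketch using only standard tools (awk and
sort, not NutchWAX utilities). It assumes the three-field
"URL digest date" layout shown above and reuses the sample lines from
this HOWTO; the temporary file is only for illustration.

```shell
# Sample lines copied from the "all.dup" example in this HOWTO.
dupfile=$(mktemp)
cat > "$dupfile" <<'EOF'
example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213
example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911
EOF

# Count lines that do not have exactly three fields (URL digest date).
malformed=$(awk 'NF != 3 { n++ } END { print n + 0 }' "$dupfile")
echo "malformed lines: $malformed"

# Count how many dates are listed for each URL+digest pair, largest first.
summary=$(awk '{ count[$1 " " $2]++ } END { for (k in count) print count[k], k }' "$dupfile" | LC_ALL=C sort -rn)
echo "$summary"
```

A non-zero malformed count, or a summary that looks nothing like the
crawl you expect, suggests the CDX input was not what 'dedup-cdx'
expected.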
- -First, we create a manifest - - $ cat > manifest - foo.arc.gz test-collection - bar.arc.gz test-collection - baz.arc.gz test-collection - ^D - - $ nutchwax import -e all.dup manifest - -The result will be a newly-created Nutch segment, same as importing -without de-duplication. - -If you examine the Nutch "hadoop.log" file, you will see INFO-level -lines from the NutchWAX Importer showing which URLs were excluded. - -WARC ----- -If you are importing WARC files with revisit records, then you -typically won't need to provide an exclusion file as the WARC files -were de-duplicated during the crawl. - - -====================================================================== -Update and Invert -====================================================================== - -Perform the Nutch "updatedb" and "invertlinks" steps as normal. - -Nothing special/different to do here with respect to de-duplication. - - -====================================================================== -Index -====================================================================== - -The only change we make to the indexing step is the destination of the -index directory. - -By default, Nutch expects the per-segment index directory to live in a -sub-directory called 'indexes' and the index command is accordingly - - $ nutch index indexes crawldb linkdb segments/* - -Resulting in an index directory structure of the form - - indexes/part-00000 - -For de-duplication, we use a slightly different directory structure, -which will be used by a de-duplication-aware NutchWaxBean at -search-time. The directory structure we use is: - - pindexes/<segment>/part-00000 - -Using the segment name is not strictly required, but it is a good -practice and is strongly recommended. This way the segment and its -corresponding index directory are easily matched. 
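When there are several segments, the per-segment naming can be
scripted. The following is a minimal sketch that only prints a
'nutch index' command for each segment rather than running it. The
temporary crawl directory and the second segment name are made up for
illustration.

```shell
# Hypothetical crawl directory containing two segment directories.
crawl=$(mktemp -d)
mkdir -p "$crawl/segments/20080703050349" "$crawl/segments/20080704060000"
cd "$crawl" || exit 1

# Print one index command per segment, using the segment name as the
# name of its index directory under 'pindexes'.
for segment in segments/*/; do
  name=$(basename "$segment")
  echo "nutch index pindexes/$name crawldb linkdb segments/$name"
done > index-commands.txt

cat index-commands.txt
```

Printing the commands first makes it easy to review the
segment-to-index pairing before any index is built.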
- -Let's assume that the segment directory created during the import is -named - - segments/20080703050349 - -In that case, our index command becomes: - - $ nutch index pindexes/20080703050349 crawldb linkdb segments/20080703050349 - -Upon completion, the Lucene index is created in - - pindexes/20080703050349/part-0000 - -This index is exactly the same as one normally created by Nutch, the -only difference is the location. - - -====================================================================== -Add Revisit Dates -====================================================================== - -Now that we have the Nutch index, we add the revisit dates to it. - -Examine the "all.dup" file again, it has lines of the form - - example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911 - -These are the revisit dates that need to be added to the records in -the Lucene index. When we generated the index, only the date of the -first visit was put in the index. Now we have to add these. - -As explained in README-dedup.txt, modifying the Lucene index to -actually add these dates is infeasible. What we do is create a -parallel index next to the main index (the part-00000 created above) -that contains all the dates for each record. - -The NutchWAX 'add-dates' command creates this parallel index for us. - - $ nutchwax add-dates pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/dates \ - all.dup - -Yes, the part-0000 argument does appear twice. This is because it is -both the "key" index and the "source" index. 
- - -Suppose we did another crawl and had even more dates to add to the -existing index. In that case we would run - - $ nutchwax add-dates pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/dates \ - pindexes/20080703050349/new-dates \ - new-crawl.dup - $ rm -r pindexes/20080703050349/dates - $ mv pindexes/20080703050349/new-dates pindexes/20080703050349/dates - -This copies the existing dates from "dates" to "new-dates" and adds -additional ones from "new-crawl.dup" along the way. Then we replace -the previous "dates" index with the new one. - - -WARC ----- -This step is the same for ARCs and WARCs. - -The only difference is that our "all.dup" file containing the list of -revisit dates was created by different utilities: 'dedup-cdx' for ARCs -and 'revisits' for WARCs. - - -====================================================================== -Search -====================================================================== - -Test/debug searches can be run from the command-line, but instead of -using the 'NutchBean' we use 'NutchWaxBean'. - -The "NutchWaxBean" extends NutchBean by adding support for parallel -indexes. - - $ nutch org.archive.nutchwax.NutchWaxBean <query> - -The "NutchWaxBean" also gives slightly more verbose and useful output: - - $ nutch org.archive.nutchwax.NutchWaxBean carolina - Total hits: 247338 - 0 [20080702053119] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [20080618133218, 20080618133218] - ... Studios Blue Ridge Motion Pictures Carolina Pinnacle Creative Network EUE/Screen ... Trailblazer Studios Federal Tax Incentive Carolina Pinnacle Studios ... 
- 1 [20080703023605] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [20080613200046, 20080618133218] - -The output consists of - - hit number - segment - url - key (which is url + digest) - digest - dates - -The most useful bit here for testing de-duplication is the list of -dates. - - -====================================================================== -Web Deployment -====================================================================== - -As noted in the HOWTO.txt document, when the nutch(wax) webapp is -deployed, changes made to the configuration must be also applied to -the deployed webapp. - -In addition to those configuration changes, the "web.xml" file must -also be modified. - -In Nutch, the "web.xml" file contains a directive to call a static -method on 'NutchBean' to initialize it. In order to search the -parallel indexes we have to use 'NutchWaxBean'. This is done by -modifying the "web.xml" to call a NutchWaxBean initializer after the -NutchBean initializer. 
- -Change "web.xml" from - - <listener> - <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> - </listener> - -to: - - <listener> - <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> - <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class> - </listener> - Deleted: tags/nutchwax-0_12_2/archive/HOWTO-xslt.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt 2008-10-13 17:48:52 UTC (rev 2609) +++ tags/nutchwax-0_12_2/archive/HOWTO-xslt.txt 2008-10-13 18:08:29 UTC (rev 2610) @@ -1,105 +0,0 @@ - -HOWTO-xslt.txt -2008-07-25 -Aaron Binns - -Table of Contents - o Prerequisites - - NutchWAX HOWTO.txt - o Overview - o XSLTFilter and web.xml - - -====================================================================== -Prerequisites -====================================================================== - -This HOWTO assumes you've already read the main NutchWAX HOWTO and are -familiar with importing and indexing archive files with NutchWAX. - -Also, we assume that you are familiar with deploying the Nutch(WAX) -web application into a servlet container such as Tomcat. - - -====================================================================== -Overview -====================================================================== - -Nutch is bundled with two search interfaces - - JSP pages: search.jsp, refine-query.jsp, etc. - Servlet : OpenSearchServlet - -If you read the OpenSearchServlet.java source code and the search.jsp -page, you'll notice a lot of similarity, if not duplication of code. - -The Internet Archive Web Team plans to improve and expand upon the -existing OpenSearchServlet interface as well as adding more XML-based -capabilities, including replacements for the existing JSP pages. In -short, moving away from JSP and toward XML. 
- -But by favoring XML over JSP, how does one make an HTML UI? By adding -XSLT to the XML interfaces. - -This HOWTO describes the process for adding an XSL transformation to -the OpenSearch XML output. - -This shall be the blueprint for future XML-based interfaces as well. - - -====================================================================== -XSLTFilter and web.xml -====================================================================== - -Adding an XSL transformation to an XML-based interface, such as the -OpenSearchServlet is straightforward. Simply add the XSLTFilter to -the servlet's path and specify the XSL transform to apply. - -For example, consider the default Nutch web.xml - - <servlet> - <servlet-name>OpenSearch</servlet-name> - <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class> - </servlet> - - <servlet-mapping> - <servlet-name>OpenSearch</servlet-name> - <url-pattern>/opensearch</url-pattern> - </servlet-mapping> - -Let's say we want to retain the '/opensearch' path for the XML output, -and add the human-friendly HTML page at '/coolsearch' - -First, we add an additional 'servlet-mapping' for our new path: - - <servlet-mapping> - <servlet-name>OpenSearch</servlet-name> - <url-pattern>/coolsearch</url-pattern> - </servlet-mapping> - -Then, we add the XSLTFilter, passing it a URL to the XSLT file - - <filter> - <filter-name>XSLT Filter</filter-name> - <filter-class>org.archive.nutchwax.XSLTFilter</filter-class> - <init-param> - <param-name>xsltUrl</param-name> - <param-value>[URL to XSLT file]</param-value> - </init-param> - </filter> - -Lastly, we apply the filter to the same path as our human-friendly -HTML path: - - <filter-mapping> - <filter-name>XSLT Filter</filter-name> - <url-pattern>/coolsearch</url-pattern> - </filter-mapping> - -This way, we have two URLs, which run the exact same -OpenSearchServlet, but one produces the unperturbed OpenSearch XML -output whereas the other produces human-friendly HTML output. 
- - OpenSearch XML : http://someserver/opensearch?query=foo - Human-friendly HTML : http://someserver/coolsearch?query=foo - Deleted: tags/nutchwax-0_12_2/archive/HOWTO.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-10-13 17:48:52 UTC (rev 2609) +++ tags/nutchwax-0_12_2/archive/HOWTO.txt 2008-10-13 18:08:29 UTC (rev 2610) @@ -1,466 +0,0 @@ - -HOWTO.txt -2008-07-28 -Aaron Binns - -Table of Contents - o Prerequisites - - Nutch(WAX) installation - - ARC/WARC files - o Configuration & Patching - o Create a manifest - o Import, Invert and Index - o Search - o Web deployment - - Don't forget to config & patch again - -====================================================================== -Prerequisites -====================================================================== - -In order to use Nutch(WAX) you need the following prerequisites: - - 1. NutchWAX installed. - - See INSTALL.txt for instruction on building and installing - NutchWAX. - - This HOWTO assumes it is installed in - - /opt/nutch-1.0-dev - - 2. ARC/WARC files. - - The whole purpose of NutchWAX is to index ARC/WARC files. These - files are not produced by Nutch nor NutchWAX, they are produced by - other tools, such as Heritrix. - - If you don't have any ARC/WARC files, you have no need for - NutchWAX. - - -====================================================================== -Patching -====================================================================== - -The vanilla NutchWAX as built according to the INSTALL.txt guide is -not quite ready to be used out-of-the-box. - -Before you can use NutchWAX, you must first patch a bug that exists in -the current Nutch SVN head. - -The file - - /opt/nutch-1.0-dev/conf/tika-mimetypes.xml - -contains two errors: one where a mimetype is referenced before it is -defined; and a second where a definition has an illegal character. 
- -These errors cause Nutch to not recognize certain mimetypes and -therefore will ignore documents matching those mimetypes. - -There are two fixes: - - 1. Move - - <mime-type type="application/xml"> - <alias type="text/xml" /> - <glob pattern="*.xml" /> - </mime-type> - - definition higher up in the file, before the reference to it. - - 2. Remove - - <mime-type type="application/x-ms-dos-executable"> - <alias type="application/x-dosexec;exe" /> - </mime-type> - - as the ';' character is illegal according to the comments in the - Nutch code. - -You can either apply these patches yourself, or copy an already-patched -copy from: - - /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml - -to - - /opt/nutch-1.0-dev/conf/tika-mimetypes.xml - - -====================================================================== -Configuring -====================================================================== - -Since we assume that you are already familiar with Nutch, then you -should already be familiar with configuring it. The configuration -is mainly defined in - - /opt/nutch-1.0-dev/conf/nutch-default.xml - -NutchWAX requires the modification of two existing properties and the -addition of two new ones. 
- -All of the modifications described below can be found in: - - /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml - -You can either apply the configuration changes yourself, or copy that -file to - - /opt/nutch-1.0-dev/conf/nutch-site.xml - --------------------------------------------------- -plugin.includes --------------------------------------------------- -Change the list of plugins from: - - protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) - -to - - protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax - -In short, we add: - - index-nutchwax - query-nutchwax - urlfilter-nutchwax - parse-pdf - -and remove: - - urlfilter-regex - urlnormalizer-(pass|regex|basic) - -The only *required* changes are the additions of the NutchWAX index -and query plugins. The rest are optional, but recommended. - -The "parse-pdf" plugin is added simply because we have lots of PDFs in -our archives and we want to index them. We sometimes remove the -"parse-js" plugin if we don't care to index JavaScript files. - -We also remove the default Nutch URL filtering and normalizing plugins -because we do not need the URLs normalized nor filtered. We trust -that the tool that produced the ARC/WARC file will have normalized the -URLs contained therein according to its own rules so there's no need -to normalize here. Also, we don't filter by URL since we want to -index as much of the ARC/WARC file as we have parsers for. - -We do, however, add the NutchWAX URL filter. If de-duplication is -being performed upon import, this plugin is required. It performs URL -filtering of the list of ARC records to exclude based on -URL+digest+date. 
- --------------------------------------------------- -indexingfilter.order --------------------------------------------------- - -Add this property with a value of - - org.apache.nutch.indexer.basic.BasicIndexingFilter - org.archive.nutchwax.index.ConfigurableIndexingFilter - -So that the NutchWAX indexing filter is run after the Nutch basic -indexing filter. - -A full explanation is given in "README-dedup.txt". - --------------------------------------------------- -mime.type.magic --------------------------------------------------- -We disable mimetype detection in Nutch for two reasons: - -1. The ARC/WARC file specifies the Content-Type of the document. We - trust that the tool that created the ARC/WARC file got it right. - -2. The implementation in Nutch can use a lot of memory as the *entire* - document is read into memory as a byte[], then converted to a - String, then checked against the MIME database. This can lead to - out of memory errors for large files, such as music and video. - -To disable, simply set the property value to false. - - <property> - <name>mime.type.magic</name> - <value>false</value> - </property> - --------------------------------------------------- -nutchwax.filter.index --------------------------------------------------- -Configure the 'index-nutchwax' plugin. Specify how the metadata -fields added by the Importer are mapped to the Lucene documents during -indexing. 
- -The specifications here are of the form: - - src-key:lowercase:store:tokenize:exclusive:dest-key - -where the only required part is the "src-key", the rest will assume -the following defaults: - - lowercase = true - store = true - tokenize = false - exclusive = true - dest-key = src-key - -We recommend: - -<property> - <name>nutchwax.filter.index</name> - <value> - url:false:true:true - url:false:true:false:true:exacturl - orig:false - digest:false - filename:false - fileoffset:false - collection - date - type - length - </value> -</property> - -The "url", "orig" and "digest" values are required, the rest are -optional, but strongly recommended. - --------------------------------------------------- -nutchwax.filter.query --------------------------------------------------- -Configure the 'query-nutchwax' plugin. Specify which fields to make -searchable via "field:[term|phrase]" query syntax, and whether they -are "raw" fields or not. - -The specification format is one of: - - field:<name>:<boost> - raw:<name>:<lowercase>:<boost> - group:<name>:<lowercase>:<delimiter>:<boost> - -Default values are - - lowercase = true - delimiter = "," - boost = 1.0f - -There is no "lowercase" property for "field" specification because the -Nutch FieldQueryFilter doesn't expose the option, unlike the -RawFieldQueryFilter. - -The "group" fields are raw fields that can accept multiple values, -separated by a delimiter. 
Multiple values appearing in a query are -automagically translated into required OR-groups, such as - - collection:"193,221,36" => +(collection:193 collection:221 collection:36) - -NOTE: We do *not* use this filter for handling "date" queries, there -is a specific filter for that: DateQueryFilter - -We recommend: - -<property> - <name>nutchwax.filter.query</name> - <value> - raw:digest:false - raw:filename:false - raw:fileoffset:false - raw:exacturl:false - group:collection - group:type - field:anchor - field:content - field:host - field:title - </value> -</property> - - --------------------------------------------------- -nutchwax.urlfilter.wayback.exclusions --------------------------------------------------- -File containing the exclusion list for importing. - -Normally, this is specified on the command line when the NutchWAX -Importer is invoked. It can be specified here if preferred. - --------------------------------------------------- -nutchwax.urlfilter.wayback.canonicalizer --------------------------------------------------- - -For CDX-based de-duplication, the same URL canonicalization algorithm -must be used here as was used to generate the CDX files. - -The default canonicalizer in Wayback's '(w)arc-indexer' utility -is - - org.archive.wayback.util.url.AggressiveUrlCanonicalizer - -which is the value provided in "nutch-site.xml". - -If the '(w)arc-indexer' is executed with the "-i" (identity) -command-line option, then the matching canonicalizer - - org.archive.wayback.util.url.IdentityUrlCanonicalizer - -must be specified here. - --------------------------------------------------- -nutchwax.filter.http.status --------------------------------------------------- -This property configures a filter with a list of ranges -of HTTP status codes to allow. - -Typically, most NutchWAX implementors do not wish to import and index -404, 500, 302 and other non-success pages. 
This is an inclusion -filter, meaning that only ARC records with an HTTP status code -matching any of the values will be imported. - -There is a special "unknown" value which can be used to include ARC -records that don't have an HTTP status code (for whatever reason). - -The default setting provided in nutch-site.xml is to allow any 2XX -success code: - - <property> - <name>nutchwax.filter.http.status</name> - <value> - 200-299 - </value> - </property> - -But some other examples are: - - Allow any 2XX success code *and* redirects, use: - <property> - <name>nutchwax.filter.http.status</name> - <value> - 200-299 - 300-399 - </value> - </property> - - Be really strict about only certain codes, use: - <property> - <name>nutchwax.filter.http.status</name> - <value> - 200 - 301 - 302 - 304 - </value> - </property> - - Mix of ranges and specific codes, including the "unknown" - <property> - <name>nutchwax.filter.http.status</name> - <value> - Unknown - 200 - 300-399 - </value> - </property> - --------------------------------------------------- -nutchwax.import.content.limit --------------------------------------------------- -Similar to Nutch's - - file.content.limit - http.content.limit - ftp.content.limit - -properties, this specifies a limit on the size of a document imported -via NutchWAX. - -We recommend setting this to a size compatible with the memory -capacity of the computers performing the import. Something in the -1-4MB range is typical. - - -====================================================================== -Create a manifest -====================================================================== - -The input to NutchWAX's import tool is a manifest file. This is a -simple text file where each line contains a URL to an ARC/WARC file -and an optional collection name. 
- -For example: - - $ cat > manifest - http://someserver/somepath/somearchive.arc.gz mycollection - ^D - -Creates a simple manifest file with one ARC file and a collection -name of "mycollection". - -You don't have to use collections at all. If you don't know how you -would use it, then simply leave it out here. - - -====================================================================== -Import, Invert and Index -====================================================================== - -The steps to import the files, invert the link and index the documents -are rather simple: - - $ mkdir crawl - $ cd crawl - $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest - $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments - $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments - $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/* - $ ls -F1 - crawldb/ - indexes/ - linkdb/ - segments/ - -To those already familiar with Nutch, these steps should be quite -familiar. - -The first step, we call NutchWAX's "import" command which creates the -Nutch segment containing the documents in the ARC/WARC files listed in -the manifest. The rest is the same as regular Nutch. - - -====================================================================== -Search -====================================================================== -The resulting indexes can be searched in exactly the same manner as in -regular Nutch. For example, assuming you just completed the steps -above, now: - - $ cd ../ - $ ls -F1 - crawl/ - $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer - -This calls the NutchBean to execute a simple keyword search for -"computer". Use whatever query term you think appears in the -documents you imported. 
- - -====================================================================== -Web Deployment -====================================================================== - -As users of Nutch are aware, the web application (nutch-1.0-dev.war) -bundled with Nutch contains duplicate copies of the configuration -files. - -So, all patches and configuration changes that we made to the -files in - - /opt/nutch-1.0-dev/conf - -will have to be duplicated in the Nutch webapp when it is deployed. - -This is not due to NutchWAX, this is a "feature" of regular Nutch. I -just thought it would be good to remind everyone since we did make -configuration changes for NutchWAX. Deleted: tags/nutchwax-0_12_2/archive/INSTALL.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-10-13 17:48:52 UTC (rev 2609) +++ tags/nutchwax-0_12_2/archive/INSTALL.txt 2008-10-13 18:08:29 UTC (rev 2610) @@ -1,93 +0,0 @@ - -INSTALL.txt -2008-10-01 -Aaron Binns - -This installation guide assumes the reader is already familiar with -building, packaging and deploying Nutch 1.0-dev. - - -The NutchWAX 0.12 source and build system are designed to integrate -into the existing Nutch 1.0-dev source and build. - -The long-term goal is for the NutchWAX components to be fully -integrated into mainline Nutch. As a stepping-stone toward that goal, -we have packaged the NutchWAX source to be dropped into the Nutch -"contrib" directory and built from there. - -Like Nutch, NutchWAX 0.12 uses a simple 'ant' build script. The -NutchWAX build script calls out to the Nutch script to build Nutch -proper, then builds NutchWAX components and integrates them into the -Nutch build directory. - -We recommend that you execute all build commands from the NutchWAX -directory. This way, NutchWAX will ensure that any and all -dependencies in Nutch will be properly built and kept up-to-date. 
-Towards this goal, we have duplicated the most common build targets
-from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file,
-such as:
-
-  o compile
-  o jar
-  o job
-  o tar
-  o clean
-
-Again, the idea is that if you're already used to building Nutch, you
-can easily transition to building Nutch and NutchWAX together. All of
-the build artifacts will still be placed in Nutch's 'build'
-sub-directory as normal.
-
-
-Nutch-1.0-dev
--------------
-As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
-Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.2 is
-built against is:
-
-  701524
-
-To checkout this revision of Nutch, use:
-
-  $ svn checkout -r 701524 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
-  $ cd nutch
-
-
-NutchWAX
---------
-Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into
-Nutch's "contrib" directory.
-
-  $ cd contrib
-  $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nutchwax/archive
-
-This will create a sub-directory named "archive" containing the
-NutchWAX sources.
-
-
-Build and install
------------------
-Assuming you already have the required tool-set for building Nutch,
-building NutchWAX is a snap.
-
-Simply execute the same 'ant' build command in
-
-  nutch/contrib/archive
-
-as you normally would and everything will build as normal.
-
-For example
-
-  $ cd nutch/contrib/archive
-  $ ant tar
-
-This command will build all of Nutch, then the NutchWAX add-ons and
-finally will package everything up into the "nutch-1.0-dev.tar.gz"
-release package.
-
-Then, install the "nutch-1.0-dev.tar.gz" tarball as normal. For
-example:
-
-  $ cd /opt
-  $ tar xvfz nutch-1.0-dev.tar.gz

Deleted: tags/nutchwax-0_12_2/archive/LICENSE.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/LICENSE.txt 2008-10-13 17:48:52 UTC (rev 2609)
+++ tags/nutchwax-0_12_2/archive/LICENSE.txt 2008-10-13 18:08:29 UTC (rev 2610)
@@ -1,519 +0,0 @@
-
-NutchWAX is free software. Except as noted, it is licensed under the
-terms of the GNU Lesser Public License (LGPL), reproduced below.
-
-Source code derived from Nutch retains the Apache License, as
-stipulated by that license.
-
-Libraries used by NutchWAX are redistributed under their respective
-licenses, which can be found in a file with the same name as the
-library, suffixed by ".LICENSE". For example, the license for
-"foo.jar" can be found in "foo.LICENSE".
-
-All other files not carrying an explicit license are licensed under
-the GNU Lesser General Public License version 2.1 (included below)
-
-======================================================================
-
-          GNU LESSER GENERAL PUBLIC LICENSE
-               Version 2.1, February 1999
-
- Copyright (C) 1991, 1999 Free Software Foundation, Inc.
-     59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
- Everyone is permitted to copy and distribute verbatim copies
- of this license document, but changing it is not allowed.
-
-[This is the first released version of the Lesser GPL. It also counts
- as the successor of the GNU Library Public License, version 2, hence
- the version number 2.1.]
-
-                Preamble
-
-  The licenses for most software are designed to take away your
-freedom to share and change it. By contrast, the GNU General Public
-Licenses are intended to guarantee your freedom to share and change
-free software--to make sure the software is free for all its users.
- - This license, the Lesser General Public License, applies to some -specially designated software packages--typically libraries--of the -Free Software Foundation and other authors who decide to use it. You -can use it too, but we suggest you first think carefully about whether -this license or the ordinary General Public License is the better -strategy to use in any particular case, based on the explanations below. - - When we speak of free software, we are referring to freedom of use, -not price. Our General Public Licenses are designed to make sure that -you have the freedom to distribute copies of free software (and charge -for this service if you wish); that you receive source code or can get -it if you want it; that you can change the software and use pieces of -it in new free programs; and that you are informed that you can do -these things. - - To protect your rights, we need to make restrictions that forbid -distributors to deny you these rights or to ask you to surrender these -rights. These restrictions translate to certain responsibilities for -you if you distribute copies of the library or if you modify it. - - For example, if you distribute copies of the library, whether gratis -or for a fee, you must give the recipients all the rights that we gave -you. You must make sure that they, too, receive or can get the source -code. If you link other code with the library, you must provide -complete object files to the recipients, so that they can relink them -with the library after making changes to the library and recompiling -it. And you must show them these terms so they know their rights. - - We protect your rights with a two-step method: (1) we copyright the -library, and (2) we offer you this license, which gives you legal -permission to copy, distribute and/or modify the library. - - To protect each distributor, we want to make it very clear that -there is no warranty for the free library. 
Also, if the library is -modified by someone else and passed on, the recipients should know -that what they have is not the original version, so that the original -author's reputation will not be affected by problems that might be -introduced by others. - - Finally, software patents pose a constant threat to the existence of -any free program. We wish to make sure that a company cannot -effectively restrict the users of a free program by obtaining a -restrictive license from a patent holder. Therefore, we insist that -any patent license obtained for a version of the library must be -consistent with the full freedom of use specified in this license. - - Most GNU software, including some libraries, is covered by the -ordinary GNU General Public License. This license, the GNU Lesser -General Public License, applies to certain designated libraries, and -is quite different from the ordinary General Public License. We use -this license for certain libraries in order to permit linking those -libraries into non-free programs. - - When a program is linked with a library, whether statically or using -a shared library, the combination of the two is legally speaking a -combined work, a derivative of the original library. The ordinary -General Public License therefore permits such linking only if the -entire combination fits its criteria of freedom. The Lesser General -Public License permits more lax criteria for linking other code with -the library. - - We call this license the "Lesser" General Public License because it -does Less to protect the user's freedom than the ordinary General -Public License. It also provides other free software developers Less -of an advantage over competing non-free programs. These disadvantages -are the reason we use the ordinary General Public License for many -libraries. However, the Lesser license provides advantages in certain -special circumstances. 
- - For example, on rare occasions, there may be a special need to -encourage the widest possible use of a certain library, so that it becomes -a de-facto standard. To achieve this, non-free programs must be -allowed to use the library. A more frequent case is that a free -library does the same job as widely used non-free libraries. In this -case, there is little to gain by limiting the free library to free -software only, so we use the Lesser General Public License. - - In other cases, permission to use a particular library in non-free -programs enables a greater number of people to use a large body of -free software. For example, permission to use the GNU C Library in -non-free programs enables many more people to use the whole GNU -operating system, as well as its variant, the GNU/Linux operating -system. - - Although the Lesser General Public License is Less protective of the -users' freedom, it does ensure that the user of a program that is -linked with the Library has the freedom and the wherewithal to run -that program using a modified version of the Library. - - The precise terms and conditions for copying, distribution and -modification follow. Pay close attention to the difference between a -"work based on the library" and a "work that uses the library". The -former contains code derived from the library, whereas the latter must -be combined with the library in order to run. - - GNU LESSER GENERAL PUBLIC LICENSE - TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION - - 0. This License Agreement applies to any software library or other -program which contains a notice placed by the copyright holder or -other authorized party saying it may be distributed under the terms of -this Lesser General Public License (also called "this License"). -Each licensee is addressed as "you". 
- - A "library" means a collection of software functions and/or data -prepared so as to be conveniently linked with application programs -(which use some of those functions and data) to form executables. - - The "Library", below, refers to any such software library or work -which has been distributed under these terms. A "work based on the -Library" means either the Library or any derivative work under -copyright law: that is to say, a work containing the Library or a -portion of it, either verbatim or with modifications and/or translated -straightforwardly into another language. (Hereinafter, translation is -included without limitation in the term "modification".) - - "Source code" for a work means the preferred form of the work for -making modifications to it. For a library, complete source code means -all the source code for all modules it contains, plus any associated -interface definition files, plus the scripts used to control compilation -and installation of the library. - - Activities other than copying, distribution and modification are not -covered by this License; they are outside its scope. The act of -running a program using the Library is not restricted, and output from -such a program is covered only if its contents constitute a work based -on the Library (independent of the use of the Library in a tool for -writing it). Whether that is true depends on what the Library does -and what the program that uses the Library does. - - 1. You may copy and distribute verbatim copies of the Library's -complete source code as you receive it, in any medium, provided that -you conspicuously and appropriately publish on each copy an -appropriate copyright notice and disclaimer of warranty; keep intact -all the notices that refer to this License and to the absence of any -warranty; and distribute a copy of this License along with the -Library. 
- - You may charge a fee for the physical act of transferring a copy, -and you may at your option offer warranty protection in exchange for a -fee. - - 2. You may modify your copy or copies of the Library or any portion -of it, thus forming a work based on the Library, and copy and -distribute such modifications or work under the terms of Section 1 -above, provided that you also meet all of these conditions: - - a) The modified work must itself be a software library. - - b) You must cause the files modified to carry prominent notices - stating that you changed the files and the date of any change. - - c) You must cause the whole of the work to be licensed at no - charge to all third parties under the terms of this License. - - d) If a facility in the modified Library refers to a function or a - table of data to be supplied by an application program that uses - the facility, other than as an argument passed when the facility - is invoked, then you must make a good faith effort to ensure that, - in the event an application does not supply such function or - table, the facility still operates, and performs whatever part of - its purpose remains meaningful. - - (For example, a function in a library to compute square roots has - a purpose that is entirely well-defined independent of the - application. Therefore, Subsection 2d requires that any - application-supplied function or table used by this function must - be optional: if the application does not supply it, the square - root function must still compute square roots.) - -These requirements apply to the modified work as a whole. If -identifiable sections of that work are not derived from the Library, -and can be reasonably considered independent and separate works in -themselves, then this License, and its terms, do not apply to those -sections when you distribute them as separate works. 
But when you -distribute the same sections as part of a whole which is a work based -on the Library, the distribution of the whole must be on the terms of -this License, whose permissions for other licensees extend to the -entire whole, and thus to each and every part regardless of who wrote -it. - -Thus, it is not the intent of this section to claim rights or contest -your rights to work written entirely by you; rather, the intent is to -exercise the right to control the distribution of derivative or -collective works based on the Library. - -In addition, mere aggregation of another work not based on the Library -with the Library (or with a work based on the Library) on a volume of -a storage or distribution medium does not bring the other work under -the scope of this License. - - 3. You may opt to apply the terms of the ordinary GNU General Public -License instead of this License to a given copy of the Library. To do -this, you must alter all the notices that refer to this License, so -that they refer to the ordinary GNU General Public License, version 2, -instead of to this License. (If a newer version than version 2 of the -ordinary GNU General Public License has appeared, then you can specify -that version instead if you wish.) Do not make any other change in -these notices. - - Once this change is made in a given copy, it is irreversible for -that copy, so the ordinary GNU General Public License applies to all -subsequent copies and derivative works made from that copy. - - This option is useful when you wish to copy part of the code of -the Library into a program that is not a library. - - 4. 
You may copy and distribute the Library (or a portion or -derivative of it, under Section 2) in object code or executable form -under the terms of Sections 1 and 2 above provided that you accompany -it with the complete corresponding machine-readable source code, which -must be distributed under the terms of Sections 1 and 2 above on a -medium customarily used for software interchange. - - If distribution of object code is made by offering access to copy -from a designated place, then offering equivalent access to copy the -source code from the same place satisfies the requirement to -distribute the source code, even though third parties are not -compelled to copy the source along with the object code. - - 5. A program that contains no derivative of any portion of the -Library, but is designed to work with the Library by being compiled or -linked with it, is called a "work that uses the Library". Such a -work, in isolation, is not a derivative work of the Library, and -therefore falls outside the scope of this License. - - However, linking a "work that uses the Library" with the Library -creates an executable that is a derivative of the Library (because it -contains portions of the Library), rather than a "work that uses the -library". The executable is therefore covered by this License. -Section 6 states terms for distribution of such executables. - - When a "work that uses the Library" uses material from a header file -that is part of the Library, the object code for the work may be a -derivative work of the Library even though the source code is not. -Whether this is true is especially significant if the work can be -linked without the Library, or if the work is itself a library. The -threshold for this to be true is not precisely defined by law. 
- - If such an object file uses only numerical parameters, data -structure layouts and accessors, and small macros and small inline -functions (ten lines or less in length), then the use of the object -file is unrestricted, regardless of whether it is legally a derivative -work. (Executables containing this object code plus portions of the -Library will still fall under Section 6.) - - Otherwise, if the work is a derivative of the Library, you may -distribute the object code for the work under the terms of Section 6. -Any executables containing that work also fall under Section 6, -whether or not they are linked directly with the Library itself. - - 6. As an exception to the Sections above, you may also combine or -link a "work that uses the Library" with the Library to produce a -work containing portions of the Library, and distribute that work -under terms of your choice, provided that the terms permit -modification of the work for the customer's own use and reverse -engineering for debugging such modifications. - - You must give prominent notice with each copy of the work that the -Library is used in it and that the Library and its use are covered by -this License. You must supply a copy of this License. If the work -during execution displays copyright notices, you must include the -copyright notice for the Library among them, as well as a reference -directing the user to the copy of this License. Also, you must do one -of these things: - - a) Accompany the work with the complete corresponding - machine-readable source code for the Library including whatever - changes were used in the work (which must be distributed under - Sections 1 and 2 above); and, if the work is an executable linked - with the Library, with the complete machine-readable "work that - uses the Library", as object code and/or source code, so that the - user can modify the Library and then relink to produce a modified - executable containing the modified Library. 
(It is understood - that the user who changes the contents of definitions files in the - Library will not necessarily be able to recompile the application - to use the modified definitions.) - - b) Use a suitable shared library mechanism for linking with the - Library. A suitable mechanism is one that (1) uses at run time a - copy of the library already present on the user's computer system, - rather than copying library functions into the executable, and (2) - will operate properly with a modified version of the library, if - the user installs one, as long as the modified version is - interface-compatible with the version that the work was made with. - - c) Accompany the work with a written offer, valid for at - least three years, to give the same user the materials - specified in Subsection 6a, above, for a charge no more - than the cost of performing this distribution. - - d) If distribution of the work is made by offering access to copy - from a designated place, offer equivalent access to copy the above - specified materials from the same place. - - e) Verify that the user has already received a copy of these - materials or that you have already sent this user a copy. - - For an executable, the required form of the "work that uses the -Library" must include any data and utility programs needed for -reproducing the executable from it. However, as a special exception, -the materials to be distributed need not include anything that is -normally distributed (in either source or binary form) with the major -components (compiler, kernel, and so on) of the operating system on -which the executable runs, unless that component itself accompanies -the executable. - - It may happen that this requirement contradicts the license -restrictions of other proprietary libraries that do not normally -accompany the operating system. Such a contradiction means you cannot -use both them and the Library together in an executable that you -distribute. - - 7. 
You may place library facilities that are a work based on the -Library side-by-side in a single library together with other library -facilities not covered by this License, and distribute such a combined -library, provided that the separate distribution of the work based on -the Library and of the other library facilities is otherwise -permitted, and provided that you do these two things: - - a) Accompany the combined library with a copy of the same work - based on the Library, uncombined with any other library - facilities. This must be distributed under the terms of the - Sections above. - - b) Give prominent notice with the combined library of the fact - that part of it is a work based on the Library, and explaining - where to find the accompanying uncombined form of the same work. - - 8. You may not copy, modify, sublicense, link with, or distribute -the Library except as expressly provided under this License. Any -attempt otherwise to copy, modify, sublicense, link with, or -distribute the Library is void, and will automatically terminate your -rights under this License. However, parties who have received copies, -or rights, from you under this License will not have their licenses -terminated so long as such parties remain in full compliance. - - 9. You are not required to accept this License, since you have not -signed it. However, nothing else grants you permission to modify or -distribute the Library or its derivative works. These actions are -prohibited by law if you do not accept this License. Therefore, by -modifying or distributing the Library (or any work based on the -Library), you indicate your acceptance of this License to do so, and -all its terms and conditions for copying, distributing or modifying -the Library or works based on it. - - 10. 
Each time you redistribute the Library (or any work based on the -Library), the recipient automatically receives a license from the -original licensor to copy, distribute, link with or modify the Library -subject to these terms and conditions. You may not impose any further -restrictions on the recipients' exercise of the rights granted herein. -You are not responsible for enforcing compliance by third parties with -this License. - - 11. If, as a consequence of a court judgment or allegation of patent -infringement or for any other reason (not limited to patent issues), -conditions are imposed on you (whether by court order, agreement or -otherwise) that contradict the conditions of this License, they do not -excuse you from the conditions of this License. If you cannot -distribute so as to satisfy simultaneously your obligations under this -License and any other pertinent obligations, then as a consequence you -may not distribute the Library at all. For example, if a patent -license would not permit royalty-free redistribution of the Library by -all those who receive copies directly or indirectly through you, then -the only way you could satisfy both it and this License would be to -refrain entirely from distribution of the Library. - -If any portion of this section is held invalid or unenforceable under any -particular circumstance, the balance of the section is intended to apply, -and the section as a whole is intended to apply in other circumstances. - -It is not the purpose of this section to induce you to infringe any -patents or other property right claims or to contest validity of any -such claims; this section has the sole purpose of protecting the -integrity of the free software distribution system which is -implemented by public license practices. 
Many people have made -generous contributions to the wide range of software distributed -through that system in reliance on consistent application of that -system; it is up to the author/donor to decide if he or she is willing -to distribute software through any other system and a licensee cannot -impose that choice. - -This section is intended to make thoroughly clear what is believed to -be a consequence of the rest of this License. - - 12. If the distribution and/or use of the Library is restricted in -certain countries either by patents or by copyrighted interfaces, the -original copyright holder who places the Library under this License may add -an explicit geographical distribution limitation excluding those countries, -so that distribution is permitted only in or among countries not thus -excluded. In such case, this License incorporates the limitation as if -written in the body of this License. - - 13. The Free Software Foundation may publish revised and/or new -versions of the Lesser General Public License from time to time. -Such new versions will be similar in spirit to the present version, -but may differ in detail to address new problems or concerns. - -Each version is given a distinguishing version number. If the Library -specifies a version number of this License which applies to it and -"any later version", you have the option of following the terms and -conditions either of that version or of any later version published by -the Free Software Foundation. If the Library does not specify a -license version number, you may choose any version ever published by -the Free Software Foundation. - - 14. If you wish to incorporate parts of the Library into other free -programs whose distribution conditions are incompatible with these, -write to the author to ask for permission. For software which is -copyrighted by the Free Software Foundation, write to the Free -Software Foundation; we sometimes make exceptions for this. 
Our -decision will be guided by the two goals of preserving the free status -of all derivatives of our free software and of promoting the sharing -and reuse of software generally. - - NO WARRANTY - - 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO -WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. -EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR -OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY -KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE -IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE -LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME -THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. - - 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN -WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY -AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU -FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR -CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE -LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING -RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A -FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF -SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH -DAMAGES. - - END OF TERMS AND CONDITIONS - - How to Apply These Terms to Your New Libraries - - If you develop a new library, and you want it to be of the greatest -possible use to the public, we recommend making it free software that -everyone can redistribute and change. You can do so by permitting -redistribution under these terms (or, alternatively, under the terms of the -ordinary General Public License). - - To apply these terms, attach the following notices to the library. 
It is -safest to attach them to the start of each source file to most effectively -convey the exclusion of warranty; and each file should have at least the -"copyright" line and a pointer to where the full notice is found. - - <one line to give the library's name and a brief idea of what it does.> - Copyright (C) <year> <name of author> - - This... [truncated message content] |
From: <bi...@us...> - 2008-12-18 19:12:56
|
Revision: 2679 http://archive-access.svn.sourceforge.net/archive-access/?rev=2679&view=rev Author: binzino Date: 2008-12-18 19:12:47 +0000 (Thu, 18 Dec 2008) Log Message: ----------- Make NutchWAX 0.12.3 release tag. Added Paths: ----------- tags/nutchwax-0_12_3/ tags/nutchwax-0_12_3/archive/ Property changes on: tags/nutchwax-0_12_3/archive ___________________________________________________________________ Added: svn:mergeinfo + |
From: <bi...@us...> - 2009-05-05 21:44:37
|
Revision: 2702 http://archive-access.svn.sourceforge.net/archive-access/?rev=2702&view=rev Author: binzino Date: 2009-05-05 21:44:29 +0000 (Tue, 05 May 2009) Log Message: ----------- NutchWAX 0.12.4 release. Added Paths: ----------- tags/nutchwax-0_12_4/ tags/nutchwax-0_12_4/archive/ |
From: <bi...@us...> - 2010-03-18 23:05:50
|
Revision: 2981 http://archive-access.svn.sourceforge.net/archive-access/?rev=2981&view=rev Author: binzino Date: 2010-03-18 23:05:40 +0000 (Thu, 18 Mar 2010) Log Message: ----------- NutchWAX 0.13 release tag/branch. Added Paths: ----------- tags/nutchwax-0_13/ tags/nutchwax-0_13/archive/ |