From: <bra...@us...> - 2008-06-05 20:35:00
Revision: 2280
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2280&view=rev
Author:   bradtofel
Date:     2008-06-05 13:34:57 -0700 (Thu, 05 Jun 2008)

Log Message:
-----------
FEATURE: added method to return iterator from a pathOrUrl (String)

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/ArcIndexer.java
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/WarcIndexer.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/ArcIndexer.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/ArcIndexer.java 2008-06-04 00:08:01 UTC (rev 2279)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/ArcIndexer.java 2008-06-05 20:34:57 UTC (rev 2280)
@@ -59,7 +59,7 @@
     public ArcIndexer() {
         canonicalizer = new AggressiveUrlCanonicalizer();
     }
-
+
     /**
      * @param arc
      * @return Iterator of SearchResults for input arc File
@@ -67,7 +67,26 @@
      */
     public CloseableIterator<SearchResult> iterator(File arc)
             throws IOException {
-        ARCReader arcReader = ARCReaderFactory.get(arc);
+        return iterator(ARCReaderFactory.get(arc));
+    }
+
+    /**
+     * @param pathOrUrl
+     * @return Iterator of SearchResults for input pathOrUrl
+     * @throws IOException
+     */
+    public CloseableIterator<SearchResult> iterator(String pathOrUrl)
+            throws IOException {
+        return iterator(ARCReaderFactory.get(pathOrUrl));
+    }
+
+    /**
+     * @param arcReader
+     * @return Iterator of SearchResults for input ARCReader
+     * @throws IOException
+     */
+    public CloseableIterator<SearchResult> iterator(ARCReader arcReader)
+            throws IOException {
         arcReader.setParseHttpHeaders(true);

         Adapter<ArchiveRecord,ARCRecord> adapter1 =

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/WarcIndexer.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/WarcIndexer.java 2008-06-04 00:08:01 UTC (rev 2279)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/WarcIndexer.java 2008-06-05 20:34:57 UTC (rev 2280)
@@ -31,21 +31,37 @@
     }

     /**
-     * @param arc
+     * @param warc
      * @return Iterator of SearchResults for input arc File
      * @throws IOException
      */
     public CloseableIterator<SearchResult> iterator(File warc)
             throws IOException {
+        return iterator(WARCReaderFactory.get(warc));
+    }
+    /**
+     * @param pathOrUrl
+     * @return Iterator of SearchResults for input pathOrUrl
+     * @throws IOException
+     */
+    public CloseableIterator<SearchResult> iterator(String pathOrUrl)
+            throws IOException {
+        return iterator(WARCReaderFactory.get(pathOrUrl));
+    }
+    /**
+     * @param arc
+     * @return Iterator of SearchResults for input arc File
+     * @throws IOException
+     */
+    public CloseableIterator<SearchResult> iterator(WARCReader reader)
+            throws IOException {

         Adapter<ArchiveRecord, WARCRecord> adapter1 =
             new ArchiveRecordToWARCRecordAdapter();

         WARCRecordToSearchResultAdapter adapter2 =
             new WARCRecordToSearchResultAdapter();
         adapter2.setCanonicalizer(canonicalizer);
-
-        WARCReader reader = WARCReaderFactory.get(warc);
-
+
         ArchiveReaderCloseableIterator itr1 =
             new ArchiveReaderCloseableIterator(reader,reader.iterator());

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
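The pattern in rev 2280 is overload delegation: the File and String entry points only open a reader and hand it to one core method. A minimal, self-contained sketch of that shape follows. FakeReader and IndexerSketch are hypothetical stand-ins invented for illustration; they are not the ARCReader/WARCReader or ARCReaderFactory APIs, and the record strings are made up.

```java
import java.io.File;
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for an archive reader; NOT the real ARCReader/WARCReader.
final class FakeReader {
    private final String source;

    FakeReader(String source) {
        this.source = source;
    }

    // Pretend every source holds exactly two records.
    Iterator<String> records() {
        return Arrays.asList(source + "#0", source + "#1").iterator();
    }
}

// Hypothetical indexer showing the delegation shape used in the commit.
final class IndexerSketch {
    // Stand-ins for the factory's get(File) and get(String) overloads.
    private static FakeReader open(File file) {
        return new FakeReader(file.getPath());
    }

    private static FakeReader open(String pathOrUrl) {
        return new FakeReader(pathOrUrl);
    }

    // Convenience overloads: open the reader, then delegate.
    Iterator<String> iterator(File file) {
        return iterator(open(file));
    }

    Iterator<String> iterator(String pathOrUrl) {
        return iterator(open(pathOrUrl));
    }

    // Single core method: all record-to-result logic lives here.
    Iterator<String> iterator(FakeReader reader) {
        return reader.records();
    }
}
```

Keeping the adapter setup in the reader-based overload means a new input type only needs a one-line method.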
From: <bra...@us...> - 2008-06-04 00:07:57
Revision: 2279
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2279&view=rev
Author:   bradtofel
Date:     2008-06-03 17:08:01 -0700 (Tue, 03 Jun 2008)

Log Message:
-----------
BUGFIX(unreported): was appending file.toString() not itr.next() in store()

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java 2008-06-04 00:05:34 UTC (rev 2278)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java 2008-06-04 00:08:01 UTC (rev 2279)
@@ -183,7 +183,7 @@
     public void store(Iterator<String> itr) throws IOException {
         PrintWriter pw = new PrintWriter(file);
         while(itr.hasNext()) {
-            pw.println(file);
+            pw.println(itr.next());
         }
         pw.close();
     }
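Because the pre-fix loop never called itr.next(), a typical non-empty iterator would keep returning true from hasNext(), so the loop would keep writing the file name rather than terminate. The sketch below shows the fixed loop in isolation. FlatFileSketch is a hypothetical class written for illustration, and the try-with-resources block is an addition of this sketch, not part of the commit (which closes the writer explicitly).

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Iterator;

// Hypothetical, cut-down stand-in for FlatFile; only store() is sketched.
final class FlatFileSketch {
    private final File file;

    FlatFileSketch(File file) {
        this.file = file;
    }

    // Writes one line per element, consuming the iterator as it goes.
    void store(Iterator<String> itr) throws IOException {
        // try-with-resources closes the writer even if iteration throws.
        try (PrintWriter pw = new PrintWriter(file)) {
            while (itr.hasNext()) {
                pw.println(itr.next()); // next() advances the iterator
            }
        }
    }
}
```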
Revision: 2278
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2278&view=rev
Author:   bradtofel
Date:     2008-06-03 17:05:34 -0700 (Tue, 03 Jun 2008)

Log Message:
-----------
FEATURE: added handling of ASX format XML documents.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlReplayDispatcher.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlReplayDispatcher.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlReplayDispatcher.java 2008-06-04 00:04:33 UTC (rev 2277)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlReplayDispatcher.java 2008-06-04 00:05:34 UTC (rev 2278)
@@ -50,6 +50,9 @@
     private final static String TEXT_HTML_MIME = "text/html";
     private final static String TEXT_XHTML_MIME = "application/xhtml";
     private final static String TEXT_CSS_MIME = "text/css";
+    private final static String ASX_MIME = "video/x-ms-asf";
+    private final static String ASX_EXTENSION = ".asx";
+
     // TODO: make this configurable
     private final static long MAX_HTML_MARKUP_LENGTH = 1024 * 1024 * 5;

@@ -62,6 +65,8 @@
         new ArchivalUrlReplayRenderer();
     private ArchivalUrlCSSReplayRenderer archivalCSS =
         new ArchivalUrlCSSReplayRenderer();
+    private ArchivalUrlASXReplayRenderer archivalASX =
+        new ArchivalUrlASXReplayRenderer();

     /* (non-Javadoc)
      * @see org.archive.wayback.replay.ReplayRendererDispatcher#getRenderer(org.archive.wayback.core.WaybackRequest, org.archive.wayback.core.SearchResult, org.archive.wayback.core.Resource)
@@ -82,20 +87,30 @@

         // only bother attempting markup on pages smaller than some size:
         if (resource.getRecordLength() < MAX_HTML_MARKUP_LENGTH) {
+            String resultMime = result.get(WaybackConstants.RESULT_MIME_TYPE);
             // HTML and XHTML docs get marked up as HTML
-            if (-1 != result.get(WaybackConstants.RESULT_MIME_TYPE).indexOf(
-                    TEXT_HTML_MIME)) {
+            if (-1 != resultMime.indexOf(TEXT_HTML_MIME)) {
                 return archivalHTML;
             }
-            if (-1 != result.get(WaybackConstants.RESULT_MIME_TYPE).indexOf(
-                    TEXT_XHTML_MIME)) {
+            if (-1 != resultMime.indexOf(TEXT_XHTML_MIME)) {
                 return archivalHTML;
             }
             // CSS docs get marked up as CSS
-            if (-1 != result.get(WaybackConstants.RESULT_MIME_TYPE).indexOf(
-                    TEXT_CSS_MIME)) {
+            if (-1 != resultMime.indexOf(TEXT_CSS_MIME)) {
                 return archivalCSS;
             }
+            if (-1 != resultMime.indexOf(ASX_MIME)) {
+                return archivalASX;
+            }
+            String resultPath = result.get(WaybackConstants.RESULT_URL_KEY);
+            resultPath = resultPath.substring(resultPath.indexOf('/'));
+            int queryIdx = resultPath.indexOf('?');
+            if(queryIdx > 0) {
+                resultPath = resultPath.substring(0,queryIdx-1);
+            }
+            if(resultPath.endsWith(ASX_EXTENSION)) {
+                return archivalASX;
+            }
         }

         // everything else goes transparently:
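The extension fallback in rev 2278 can be exercised on its own. Note that String.substring(begin, end) already excludes the character at end, so substring(0, queryIdx-1) as committed also drops the last character before the '?': a path like "/media/clip.asx?id=1" becomes "/media/clip.as" and no longer matches ".asx". The sketch below uses queryIdx as the end index and guards against a key without '/'. AsxPathSketch and looksLikeAsx are hypothetical names, and the assumption that the URL key has the form host/path (no scheme) is a guess from the diff, not something the commit states.

```java
// Hypothetical helper isolating the ".asx" path check; not part of wayback.
final class AsxPathSketch {
    private static final String ASX_EXTENSION = ".asx";

    private AsxPathSketch() {
    }

    // Assumes urlKey looks like "host/path?query" (assumption, see lead-in).
    static boolean looksLikeAsx(String urlKey) {
        int slash = urlKey.indexOf('/');
        // No '/' means there is no path component to inspect.
        String path = (slash >= 0) ? urlKey.substring(slash) : "";
        int queryIdx = path.indexOf('?');
        if (queryIdx >= 0) {
            // End index is exclusive, so this keeps everything before '?'.
            path = path.substring(0, queryIdx);
        }
        return path.endsWith(ASX_EXTENSION);
    }
}
```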
Revision: 2277
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2277&view=rev
Author:   bradtofel
Date:     2008-06-03 17:04:33 -0700 (Tue, 03 Jun 2008)

Log Message:
-----------
INITIAL REV: ReplayRenderer responsible for rewriting ASX format XML documents as they are replayed.

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlASXReplayRenderer.java

Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlASXReplayRenderer.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlASXReplayRenderer.java (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlASXReplayRenderer.java 2008-06-04 00:04:33 UTC (rev 2277)
@@ -0,0 +1,50 @@
+package org.archive.wayback.archivalurl;
+
+import java.io.IOException;
+import java.util.Map;
+
+import javax.servlet.ServletException;
+import javax.servlet.http.HttpServletRequest;
+import javax.servlet.http.HttpServletResponse;
+
+import org.archive.wayback.ResultURIConverter;
+import org.archive.wayback.core.Resource;
+import org.archive.wayback.core.SearchResult;
+import org.archive.wayback.core.SearchResults;
+import org.archive.wayback.core.WaybackRequest;
+import org.archive.wayback.exception.BadContentException;
+import org.archive.wayback.replay.HTMLPage;
+import org.archive.wayback.replay.HttpHeaderOperation;
+
+public class ArchivalUrlASXReplayRenderer extends ArchivalUrlReplayRenderer {
+    /* (non-Javadoc)
+     * @see org.archive.wayback.ReplayRenderer#renderResource(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, org.archive.wayback.core.WaybackRequest, org.archive.wayback.core.SearchResult, org.archive.wayback.core.Resource, org.archive.wayback.ResultURIConverter, org.archive.wayback.core.SearchResults)
+     */
+    public void renderResource(HttpServletRequest httpRequest,
+            HttpServletResponse httpResponse, WaybackRequest wbRequest,
+            SearchResult result, Resource resource,
+            ResultURIConverter uriConverter, SearchResults results)
+            throws ServletException, IOException, BadContentException {
+
+
+        HttpHeaderOperation.copyHTTPMessageHeader(resource, httpResponse);
+
+        Map<String,String> headers = HttpHeaderOperation.processHeaders(
+                resource, result, uriConverter, this);
+
+        // Load content into an HTML page, and resolve embedded HREF urls:
+        HTMLPage page = new HTMLPage(resource,result,uriConverter);
+        page.readFully();
+
+        page.resolveASXRefUrls();
+
+        // set the corrected length:
+        int bytes = page.getBytes().length;
+        headers.put(HTTP_LENGTH_HEADER, String.valueOf(bytes));
+
+        // send back the headers:
+        HttpHeaderOperation.sendHeaders(headers, httpResponse);
+
+        page.writeToOutputStream(httpResponse.getOutputStream());
+    }
+}
From: <bra...@us...> - 2008-06-04 00:03:34
Revision: 2276
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2276&view=rev
Author:   bradtofel
Date:     2008-06-03 17:03:40 -0700 (Tue, 03 Jun 2008)

Log Message:
-----------
BUGFIX: moved extract HTTP request call to beginning of fixup.
FEATURE: added keySet() to get Set of request filters.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java 2008-06-04 00:02:04 UTC (rev 2275)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java 2008-06-04 00:03:40 UTC (rev 2276)
@@ -27,6 +27,7 @@
 import java.util.Iterator;
 import java.util.Locale;
 import java.util.ResourceBundle;
+import java.util.Set;

 import java.io.UnsupportedEncodingException;
 import java.net.URLEncoder;
@@ -259,6 +260,7 @@
      * @param httpRequest
      */
     public void fixup(HttpServletRequest httpRequest) {
+        extractHttpRequestInfo(httpRequest);
         String startDate = get(WaybackConstants.REQUEST_START_DATE);
         String endDate = get(WaybackConstants.REQUEST_END_DATE);
         String exactDate = get(WaybackConstants.REQUEST_EXACT_DATE);
@@ -287,7 +289,6 @@
             put(WaybackConstants.REQUEST_EXACT_DATE, Timestamp
                     .padEndDateStr(exactDate));
         }
-        extractHttpRequestInfo(httpRequest);
     }

     /**
@@ -408,4 +409,8 @@
     public void setExclusionFilter(ObjectFilter<SearchResult> exclusionFilter) {
         this.exclusionFilter = exclusionFilter;
     }
+
+    public Set<String> keySet() {
+        return filters.keySet();
+    }
 }
\ No newline at end of file
From: <bra...@us...> - 2008-06-04 00:02:04
Revision: 2275
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2275&view=rev
Author:   bradtofel
Date:     2008-06-03 17:02:04 -0700 (Tue, 03 Jun 2008)

Log Message:
-----------
FEATURE: added ASX markup method, which rewrites ASX XML documents, converting mms:// to http:// as it rewrites urls.. This might even be the "right thing" to do for mms://...

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/HTMLPage.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/HTMLPage.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/HTMLPage.java 2008-06-02 22:01:49 UTC (rev 2274)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/HTMLPage.java 2008-06-04 00:02:04 UTC (rev 2275)
@@ -309,6 +309,18 @@
         TagMagix.markupCSSImports(sb,uriConverter, captureDate, pageUrl);
     }

+    public void resolveASXRefUrls() {
+
+        // TODO: get url from Resource instead of SearchResult?
+        String pageUrl = result.getAbsoluteUrl();
+        String captureDate = result.getCaptureDate();
+        ResultURIConverter ruc = new MMSToHTTPResultURIConverter(uriConverter);
+
+        TagMagix.markupTagREURIC(sb, ruc, captureDate, pageUrl,
+                "REF", "HREF");
+    }
+
+
     /**
      * @param charSet
      * @throws IOException
@@ -475,4 +487,20 @@
             return base.makeReplayURI(datespec, url);
         }
     }
+
+    private class MMSToHTTPResultURIConverter implements ResultURIConverter {
+        private static final String MMS_PROTOCOL_PREFIX = "mms://";
+        private static final String HTTP_PROTOCOL_PREFIX = "http://";
+        private ResultURIConverter base = null;
+        public MMSToHTTPResultURIConverter(ResultURIConverter base) {
+            this.base = base;
+        }
+        public String makeReplayURI(String datespec, String url) {
+            if(url.startsWith(MMS_PROTOCOL_PREFIX)) {
+                url = HTTP_PROTOCOL_PREFIX +
+                    url.substring(MMS_PROTOCOL_PREFIX.length());
+            }
+            return base.makeReplayURI(datespec, url);
+        }
+    }
 }
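MMSToHTTPResultURIConverter is a small decorator: it rewrites the scheme and then defers to the wrapped converter. The standalone sketch below reproduces that behavior so it can be run outside wayback. UriConverterSketch is a hypothetical one-method stand-in for ResultURIConverter, and the "/wayback/<date>/<url>" layout used in the usage note is invented for illustration only.

```java
// Hypothetical stand-in for org.archive.wayback.ResultURIConverter.
interface UriConverterSketch {
    String makeReplayURI(String datespec, String url);
}

// Decorator: rewrites mms:// to http://, then delegates to the base converter.
final class MmsToHttpSketch implements UriConverterSketch {
    private static final String MMS_PROTOCOL_PREFIX = "mms://";
    private static final String HTTP_PROTOCOL_PREFIX = "http://";

    private final UriConverterSketch base;

    MmsToHttpSketch(UriConverterSketch base) {
        this.base = base;
    }

    @Override
    public String makeReplayURI(String datespec, String url) {
        if (url.startsWith(MMS_PROTOCOL_PREFIX)) {
            // Keep host and path, swap only the scheme prefix.
            url = HTTP_PROTOCOL_PREFIX + url.substring(MMS_PROTOCOL_PREFIX.length());
        }
        return base.makeReplayURI(datespec, url);
    }
}
```

With a base converter such as `(d, u) -> "/wayback/" + d + "/" + u`, an mms:// URL comes back as a replay URI wrapping the http:// form, while URLs with other schemes pass through unchanged.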
From: <bi...@us...> - 2008-06-02 22:01:44
Revision: 2274
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2274&view=rev
Author:   binzino
Date:     2008-06-02 15:01:49 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
NutchWAX 0.12 Beta-1 release tag.

Added Paths:
-----------
    tags/nutchwax-0_12-beta1/

Copied: tags/nutchwax-0_12-beta1 (from rev 2273, trunk/archive-access/projects/nutchwax)
From: <bi...@us...> - 2008-06-02 19:00:38
Revision: 2273
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2273&view=rev
Author:   binzino
Date:     2008-06-02 11:58:52 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
Updated with current location of NutchWAX source in SVN.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-06-02 18:53:46 UTC (rev 2272)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-06-02 18:58:52 UTC (rev 2273)
@@ -1,6 +1,6 @@

 INSTALL.txt
-2008-05-20
+2008-06-02
 Aaron Binns

 This installation guide assumes the reader is already familiar with
@@ -60,7 +60,7 @@
 Nutch's "contrib" directory.

   $ cd contrib
-  $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nat/archive
+  $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nutchwax/archive

 This will create a sub-directory named "archive" containing the
 NutchWAX sources.
From: <bi...@us...> - 2008-06-02 18:55:10
Revision: 2272
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2272&view=rev
Author:   binzino
Date:     2008-06-02 11:53:46 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
Move NutchWAX 0.12 from 'projects/nat' to 'projects/nutchwax'.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/

Removed Paths:
-------------
    trunk/archive-access/projects/nat/

Copied: trunk/archive-access/projects/nutchwax (from rev 2271, trunk/archive-access/projects/nat)
From: <bi...@us...> - 2008-06-02 18:36:45
Revision: 2271
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2271&view=rev
Author:   binzino
Date:     2008-06-02 11:36:19 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
Move pre-0.12 NutchWAX code to branches/nutchwax-pre-0.12.

Removed Paths:
-------------
    trunk/archive-access/projects/nutchwax/
From: <bi...@us...> - 2008-06-02 18:33:31
Revision: 2270
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2270&view=rev
Author:   binzino
Date:     2008-06-02 11:32:32 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
Removed svn:externals property since NutchWAX 0.12 doesn't require it.

Property Changed:
----------------
    trunk/archive-access/projects/nutchwax/nutchwax-thirdparty/

Property changes on: trunk/archive-access/projects/nutchwax/nutchwax-thirdparty
___________________________________________________________________
Name: svn:externals
   - nutch http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.9
From: <bi...@us...> - 2008-06-02 18:11:48
Revision: 2269
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2269&view=rev
Author:   binzino
Date:     2008-06-02 11:11:05 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
Moving all pre-0.12 code from 'projects/nutchwax' to here for posterity.

Added Paths:
-----------
    branches/nutchwax-pre-0_12/

Copied: branches/nutchwax-pre-0_12 (from rev 2268, trunk/archive-access/projects/nutchwax)
From: <bi...@us...> - 2008-05-27 18:58:15
Revision: 2268 http://archive-access.svn.sourceforge.net/archive-access/?rev=2268&view=rev Author: binzino Date: 2008-05-27 11:58:18 -0700 (Tue, 27 May 2008) Log Message: ----------- Updated license information: header comments, .LICENSE files, LICENSE.txt, etc. Modified Paths: -------------- trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/NutchWax.java trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/tools/DumpIndex.java trunk/archive-access/projects/nat/archive/src/plugin/build.xml trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/plugin.xml trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/plugin.xml trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/DateQueryFilter.java Added Paths: ----------- trunk/archive-access/projects/nat/archive/LICENSE.txt trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE trunk/archive-access/projects/nat/archive/lib/fastutil-5.0.3.LICENSE Added: trunk/archive-access/projects/nat/archive/LICENSE.txt =================================================================== --- trunk/archive-access/projects/nat/archive/LICENSE.txt (rev 0) +++ trunk/archive-access/projects/nat/archive/LICENSE.txt 2008-05-27 18:58:18 UTC (rev 2268) @@ -0,0 +1,519 @@ + +NutchWAX is free software. Except as noted, it is licensed under the +terms of the GNU Lesser Public License (LGPL), reproduced below. + +Source code derived from Nutch retains the Apache License, as +stipulated by that license. 
+ +Libraries used by NutchWAX are redistributed under their respective +liceneses, which can be found in a file with the same name as the +library, suffixed by ".LICENSE". For example, the license for +"foo.jar" can be found in "foo.LICENSE". + +All other files not carrying an explicit license are licensed under +the GNU Lesser General Public License version 2.1 (included below) + +====================================================================== + + GNU LESSER GENERAL PUBLIC LICENSE + Version 2.1, February 1999 + + Copyright (C) 1991, 1999 Free Software Foundation, Inc. + 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + +[This is the first released version of the Lesser GPL. It also counts + as the successor of the GNU Library Public License, version 2, hence + the version number 2.1.] + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +Licenses are intended to guarantee your freedom to share and change +free software--to make sure the software is free for all its users. + + This license, the Lesser General Public License, applies to some +specially designated software packages--typically libraries--of the +Free Software Foundation and other authors who decide to use it. You +can use it too, but we suggest you first think carefully about whether +this license or the ordinary General Public License is the better +strategy to use in any particular case, based on the explanations below. + + When we speak of free software, we are referring to freedom of use, +not price. 
Our General Public Licenses are designed to make sure that +you have the freedom to distribute copies of free software (and charge +for this service if you wish); that you receive source code or can get +it if you want it; that you can change the software and use pieces of +it in new free programs; and that you are informed that you can do +these things. + + To protect your rights, we need to make restrictions that forbid +distributors to deny you these rights or to ask you to surrender these +rights. These restrictions translate to certain responsibilities for +you if you distribute copies of the library or if you modify it. + + For example, if you distribute copies of the library, whether gratis +or for a fee, you must give the recipients all the rights that we gave +you. You must make sure that they, too, receive or can get the source +code. If you link other code with the library, you must provide +complete object files to the recipients, so that they can relink them +with the library after making changes to the library and recompiling +it. And you must show them these terms so they know their rights. + + We protect your rights with a two-step method: (1) we copyright the +library, and (2) we offer you this license, which gives you legal +permission to copy, distribute and/or modify the library. + + To protect each distributor, we want to make it very clear that +there is no warranty for the free library. Also, if the library is +modified by someone else and passed on, the recipients should know +that what they have is not the original version, so that the original +author's reputation will not be affected by problems that might be +introduced by others. + + Finally, software patents pose a constant threat to the existence of +any free program. We wish to make sure that a company cannot +effectively restrict the users of a free program by obtaining a +restrictive license from a patent holder. 
Therefore, we insist that +any patent license obtained for a version of the library must be +consistent with the full freedom of use specified in this license. + + Most GNU software, including some libraries, is covered by the +ordinary GNU General Public License. This license, the GNU Lesser +General Public License, applies to certain designated libraries, and +is quite different from the ordinary General Public License. We use +this license for certain libraries in order to permit linking those +libraries into non-free programs. + + When a program is linked with a library, whether statically or using +a shared library, the combination of the two is legally speaking a +combined work, a derivative of the original library. The ordinary +General Public License therefore permits such linking only if the +entire combination fits its criteria of freedom. The Lesser General +Public License permits more lax criteria for linking other code with +the library. + + We call this license the "Lesser" General Public License because it +does Less to protect the user's freedom than the ordinary General +Public License. It also provides other free software developers Less +of an advantage over competing non-free programs. These disadvantages +are the reason we use the ordinary General Public License for many +libraries. However, the Lesser license provides advantages in certain +special circumstances. + + For example, on rare occasions, there may be a special need to +encourage the widest possible use of a certain library, so that it becomes +a de-facto standard. To achieve this, non-free programs must be +allowed to use the library. A more frequent case is that a free +library does the same job as widely used non-free libraries. In this +case, there is little to gain by limiting the free library to free +software only, so we use the Lesser General Public License. 
+ + In other cases, permission to use a particular library in non-free +programs enables a greater number of people to use a large body of +free software. For example, permission to use the GNU C Library in +non-free programs enables many more people to use the whole GNU +operating system, as well as its variant, the GNU/Linux operating +system. + + Although the Lesser General Public License is Less protective of the +users' freedom, it does ensure that the user of a program that is +linked with the Library has the freedom and the wherewithal to run +that program using a modified version of the Library. + + The precise terms and conditions for copying, distribution and +modification follow. Pay close attention to the difference between a +"work based on the library" and a "work that uses the library". The +former contains code derived from the library, whereas the latter must +be combined with the library in order to run. + + GNU LESSER GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License Agreement applies to any software library or other +program which contains a notice placed by the copyright holder or +other authorized party saying it may be distributed under the terms of +this Lesser General Public License (also called "this License"). +Each licensee is addressed as "you". + + A "library" means a collection of software functions and/or data +prepared so as to be conveniently linked with application programs +(which use some of those functions and data) to form executables. + + The "Library", below, refers to any such software library or work +which has been distributed under these terms. A "work based on the +Library" means either the Library or any derivative work under +copyright law: that is to say, a work containing the Library or a +portion of it, either verbatim or with modifications and/or translated +straightforwardly into another language. 
(Hereinafter, translation is +included without limitation in the term "modification".) + + "Source code" for a work means the preferred form of the work for +making modifications to it. For a library, complete source code means +all the source code for all modules it contains, plus any associated +interface definition files, plus the scripts used to control compilation +and installation of the library. + + Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running a program using the Library is not restricted, and output from +such a program is covered only if its contents constitute a work based +on the Library (independent of the use of the Library in a tool for +writing it). Whether that is true depends on what the Library does +and what the program that uses the Library does. + + 1. You may copy and distribute verbatim copies of the Library's +complete source code as you receive it, in any medium, provided that +you conspicuously and appropriately publish on each copy an +appropriate copyright notice and disclaimer of warranty; keep intact +all the notices that refer to this License and to the absence of any +warranty; and distribute a copy of this License along with the +Library. + + You may charge a fee for the physical act of transferring a copy, +and you may at your option offer warranty protection in exchange for a +fee. + + 2. You may modify your copy or copies of the Library or any portion +of it, thus forming a work based on the Library, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) The modified work must itself be a software library. + + b) You must cause the files modified to carry prominent notices + stating that you changed the files and the date of any change. 
+ + c) You must cause the whole of the work to be licensed at no + charge to all third parties under the terms of this License. + + d) If a facility in the modified Library refers to a function or a + table of data to be supplied by an application program that uses + the facility, other than as an argument passed when the facility + is invoked, then you must make a good faith effort to ensure that, + in the event an application does not supply such function or + table, the facility still operates, and performs whatever part of + its purpose remains meaningful. + + (For example, a function in a library to compute square roots has + a purpose that is entirely well-defined independent of the + application. Therefore, Subsection 2d requires that any + application-supplied function or table used by this function must + be optional: if the application does not supply it, the square + root function must still compute square roots.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Library, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Library, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote +it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Library. 
+ +In addition, mere aggregation of another work not based on the Library +with the Library (or with a work based on the Library) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may opt to apply the terms of the ordinary GNU General Public +License instead of this License to a given copy of the Library. To do +this, you must alter all the notices that refer to this License, so +that they refer to the ordinary GNU General Public License, version 2, +instead of to this License. (If a newer version than version 2 of the +ordinary GNU General Public License has appeared, then you can specify +that version instead if you wish.) Do not make any other change in +these notices. + + Once this change is made in a given copy, it is irreversible for +that copy, so the ordinary GNU General Public License applies to all +subsequent copies and derivative works made from that copy. + + This option is useful when you wish to copy part of the code of +the Library into a program that is not a library. + + 4. You may copy and distribute the Library (or a portion or +derivative of it, under Section 2) in object code or executable form +under the terms of Sections 1 and 2 above provided that you accompany +it with the complete corresponding machine-readable source code, which +must be distributed under the terms of Sections 1 and 2 above on a +medium customarily used for software interchange. + + If distribution of object code is made by offering access to copy +from a designated place, then offering equivalent access to copy the +source code from the same place satisfies the requirement to +distribute the source code, even though third parties are not +compelled to copy the source along with the object code. + + 5. A program that contains no derivative of any portion of the +Library, but is designed to work with the Library by being compiled or +linked with it, is called a "work that uses the Library". 
Such a +work, in isolation, is not a derivative work of the Library, and +therefore falls outside the scope of this License. + + However, linking a "work that uses the Library" with the Library +creates an executable that is a derivative of the Library (because it +contains portions of the Library), rather than a "work that uses the +library". The executable is therefore covered by this License. +Section 6 states terms for distribution of such executables. + + When a "work that uses the Library" uses material from a header file +that is part of the Library, the object code for the work may be a +derivative work of the Library even though the source code is not. +Whether this is true is especially significant if the work can be +linked without the Library, or if the work is itself a library. The +threshold for this to be true is not precisely defined by law. + + If such an object file uses only numerical parameters, data +structure layouts and accessors, and small macros and small inline +functions (ten lines or less in length), then the use of the object +file is unrestricted, regardless of whether it is legally a derivative +work. (Executables containing this object code plus portions of the +Library will still fall under Section 6.) + + Otherwise, if the work is a derivative of the Library, you may +distribute the object code for the work under the terms of Section 6. +Any executables containing that work also fall under Section 6, +whether or not they are linked directly with the Library itself. + + 6. As an exception to the Sections above, you may also combine or +link a "work that uses the Library" with the Library to produce a +work containing portions of the Library, and distribute that work +under terms of your choice, provided that the terms permit +modification of the work for the customer's own use and reverse +engineering for debugging such modifications. 
+ + You must give prominent notice with each copy of the work that the +Library is used in it and that the Library and its use are covered by +this License. You must supply a copy of this License. If the work +during execution displays copyright notices, you must include the +copyright notice for the Library among them, as well as a reference +directing the user to the copy of this License. Also, you must do one +of these things: + + a) Accompany the work with the complete corresponding + machine-readable source code for the Library including whatever + changes were used in the work (which must be distributed under + Sections 1 and 2 above); and, if the work is an executable linked + with the Library, with the complete machine-readable "work that + uses the Library", as object code and/or source code, so that the + user can modify the Library and then relink to produce a modified + executable containing the modified Library. (It is understood + that the user who changes the contents of definitions files in the + Library will not necessarily be able to recompile the application + to use the modified definitions.) + + b) Use a suitable shared library mechanism for linking with the + Library. A suitable mechanism is one that (1) uses at run time a + copy of the library already present on the user's computer system, + rather than copying library functions into the executable, and (2) + will operate properly with a modified version of the library, if + the user installs one, as long as the modified version is + interface-compatible with the version that the work was made with. + + c) Accompany the work with a written offer, valid for at + least three years, to give the same user the materials + specified in Subsection 6a, above, for a charge no more + than the cost of performing this distribution. 
+ + d) If distribution of the work is made by offering access to copy + from a designated place, offer equivalent access to copy the above + specified materials from the same place. + + e) Verify that the user has already received a copy of these + materials or that you have already sent this user a copy. + + For an executable, the required form of the "work that uses the +Library" must include any data and utility programs needed for +reproducing the executable from it. However, as a special exception, +the materials to be distributed need not include anything that is +normally distributed (in either source or binary form) with the major +components (compiler, kernel, and so on) of the operating system on +which the executable runs, unless that component itself accompanies +the executable. + + It may happen that this requirement contradicts the license +restrictions of other proprietary libraries that do not normally +accompany the operating system. Such a contradiction means you cannot +use both them and the Library together in an executable that you +distribute. + + 7. You may place library facilities that are a work based on the +Library side-by-side in a single library together with other library +facilities not covered by this License, and distribute such a combined +library, provided that the separate distribution of the work based on +the Library and of the other library facilities is otherwise +permitted, and provided that you do these two things: + + a) Accompany the combined library with a copy of the same work + based on the Library, uncombined with any other library + facilities. This must be distributed under the terms of the + Sections above. + + b) Give prominent notice with the combined library of the fact + that part of it is a work based on the Library, and explaining + where to find the accompanying uncombined form of the same work. + + 8. 
You may not copy, modify, sublicense, link with, or distribute +the Library except as expressly provided under this License. Any +attempt otherwise to copy, modify, sublicense, link with, or +distribute the Library is void, and will automatically terminate your +rights under this License. However, parties who have received copies, +or rights, from you under this License will not have their licenses +terminated so long as such parties remain in full compliance. + + 9. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Library or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Library (or any work based on the +Library), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Library or works based on it. + + 10. Each time you redistribute the Library (or any work based on the +Library), the recipient automatically receives a license from the +original licensor to copy, distribute, link with or modify the Library +subject to these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties with +this License. + + 11. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Library at all. 
For example, if a patent +license would not permit royalty-free redistribution of the Library by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Library. + +If any portion of this section is held invalid or unenforceable under any +particular circumstance, the balance of the section is intended to apply, +and the section as a whole is intended to apply in other circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 12. If the distribution and/or use of the Library is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Library under this License may add +an explicit geographical distribution limitation excluding those countries, +so that distribution is permitted only in or among countries not thus +excluded. In such case, this License incorporates the limitation as if +written in the body of this License. + + 13. The Free Software Foundation may publish revised and/or new +versions of the Lesser General Public License from time to time. 
+Such new versions will be similar in spirit to the present version, +but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library +specifies a version number of this License which applies to it and +"any later version", you have the option of following the terms and +conditions either of that version or of any later version published by +the Free Software Foundation. If the Library does not specify a +license version number, you may choose any version ever published by +the Free Software Foundation. + + 14. If you wish to incorporate parts of the Library into other free +programs whose distribution conditions are incompatible with these, +write to the author to ask for permission. For software which is +copyrighted by the Free Software Foundation, write to the Free +Software Foundation; we sometimes make exceptions for this. Our +decision will be guided by the two goals of preserving the free status +of all derivatives of our free software and of promoting the sharing +and reuse of software generally. + + NO WARRANTY + + 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO +WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. +EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR +OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY +KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE +LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME +THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN +WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY +AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU +FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR +CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE +LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING +RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A +FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF +SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH +DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Libraries + + If you develop a new library, and you want it to be of the greatest +possible use to the public, we recommend making it free software that +everyone can redistribute and change. You can do so by permitting +redistribution under these terms (or, alternatively, under the terms of the +ordinary General Public License). + + To apply these terms, attach the following notices to the library. It is +safest to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least the +"copyright" line and a pointer to where the full notice is found. + + <one line to give the library's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. 
+ + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + +Also add information on how to contact you by electronic and paper mail. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the library, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the + library `Frob' (a library for tweaking knobs) written by James Random Hacker. + + <signature of Ty Coon>, 1 April 1990 + Ty Coon, President of Vice + +That's all there is to it! Added: trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE =================================================================== --- trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE (rev 0) +++ trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE 2008-05-27 18:58:18 UTC (rev 2268) @@ -0,0 +1,504 @@ + GNU LESSER GENERAL PUBLIC LICENSE + Version 2.1, February 1999 + + Copyright (C) 1991, 1999 Free Software Foundation, Inc. + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + +[This is the first released version of the Lesser GPL. It also counts + as the successor of the GNU Library Public License, version 2, hence + the version number 2.1.] + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +Licenses are intended to guarantee your freedom to share and change +free software--to make sure the software is free for all its users. 
+ + This license, the Lesser General Public License, applies to some +specially designated software packages--typically libraries--of the +Free Software Foundation and other authors who decide to use it. You +can use it too, but we suggest you first think carefully about whether +this license or the ordinary General Public License is the better +strategy to use in any particular case, based on the explanations below. + + When we speak of free software, we are referring to freedom of use, +not price. Our General Public Licenses are designed to make sure that +you have the freedom to distribute copies of free software (and charge +for this service if you wish); that you receive source code or can get +it if you want it; that you can change the software and use pieces of +it in new free programs; and that you are informed that you can do +these things. + + To protect your rights, we need to make restrictions that forbid +distributors to deny you these rights or to ask you to surrender these +rights. These restrictions translate to certain responsibilities for +you if you distribute copies of the library or if you modify it. + + For example, if you distribute copies of the library, whether gratis +or for a fee, you must give the recipients all the rights that we gave +you. You must make sure that they, too, receive or can get the source +code. If you link other code with the library, you must provide +complete object files to the recipients, so that they can relink them +with the library after making changes to the library and recompiling +it. And you must show them these terms so they know their rights. + + We protect your rights with a two-step method: (1) we copyright the +library, and (2) we offer you this license, which gives you legal +permission to copy, distribute and/or modify the library. + + To protect each distributor, we want to make it very clear that +there is no warranty for the free library. 
Also, if the library is +modified by someone else and passed on, the recipients should know +that what they have is not the original version, so that the original +author's reputation will not be affected by problems that might be +introduced by others. + + Finally, software patents pose a constant threat to the existence of +any free program. We wish to make sure that a company cannot +effectively restrict the users of a free program by obtaining a +restrictive license from a patent holder. Therefore, we insist that +any patent license obtained for a version of the library must be +consistent with the full freedom of use specified in this license. + + Most GNU software, including some libraries, is covered by the +ordinary GNU General Public License. This license, the GNU Lesser +General Public License, applies to certain designated libraries, and +is quite different from the ordinary General Public License. We use +this license for certain libraries in order to permit linking those +libraries into non-free programs. + + When a program is linked with a library, whether statically or using +a shared library, the combination of the two is legally speaking a +combined work, a derivative of the original library. The ordinary +General Public License therefore permits such linking only if the +entire combination fits its criteria of freedom. The Lesser General +Public License permits more lax criteria for linking other code with +the library. + + We call this license the "Lesser" General Public License because it +does Less to protect the user's freedom than the ordinary General +Public License. It also provides other free software developers Less +of an advantage over competing non-free programs. These disadvantages +are the reason we use the ordinary General Public License for many +libraries. However, the Lesser license provides advantages in certain +special circumstances. 
+ + For example, on rare occasions, there may be a special need to +encourage the widest possible use of a certain library, so that it becomes +a de-facto standard. To achieve this, non-free programs must be +allowed to use the library. A more frequent case is that a free +library does the same job as widely used non-free libraries. In this +case, there is little to gain by limiting the free library to free +software only, so we use the Lesser General Public License. + + In other cases, permission to use a particular library in non-free +programs enables a greater number of people to use a large body of +free software. For example, permission to use the GNU C Library in +non-free programs enables many more people to use the whole GNU +operating system, as well as its variant, the GNU/Linux operating +system. + + Although the Lesser General Public License is Less protective of the +users' freedom, it does ensure that the user of a program that is +linked with the Library has the freedom and the wherewithal to run +that program using a modified version of the Library. + + The precise terms and conditions for copying, distribution and +modification follow. Pay close attention to the difference between a +"work based on the library" and a "work that uses the library". The +former contains code derived from the library, whereas the latter must +be combined with the library in order to run. + + GNU LESSER GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License Agreement applies to any software library or other +program which contains a notice placed by the copyright holder or +other authorized party saying it may be distributed under the terms of +this Lesser General Public License (also called "this License"). +Each licensee is addressed as "you". 
+ + A "library" means a collection of software functions and/or data +prepared so as to be conveniently linked with application programs +(which use some of those functions and data) to form executables. + + The "Library", below, refers to any such software library or work +which has been distributed under these terms. A "work based on the +Library" means either the Library or any derivative work under +copyright law: that is to say, a work containing the Library or a +portion of it, either verbatim or with modifications and/or translated +straightforwardly into another language. (Hereinafter, translation is +included without limitation in the term "modification".) + + "Source code" for a work means the preferred form of the work for +making modifications to it. For a library, complete source code means +all the source code for all modules it contains, plus any associated +interface definition files, plus the scripts used to control compilation +and installation of the library. + + Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running a program using the Library is not restricted, and output from +such a program is covered only if its contents constitute a work based +on the Library (independent of the use of the Library in a tool for +writing it). Whether that is true depends on what the Library does +and what the program that uses the Library does. + + 1. You may copy and distribute verbatim copies of the Library's +complete source code as you receive it, in any medium, provided that +you conspicuously and appropriately publish on each copy an +appropriate copyright notice and disclaimer of warranty; keep intact +all the notices that refer to this License and to the absence of any +warranty; and distribute a copy of this License along with the +Library. 
+ + You may charge a fee for the physical act of transferring a copy, +and you may at your option offer warranty protection in exchange for a +fee. + + 2. You may modify your copy or copies of the Library or any portion +of it, thus forming a work based on the Library, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) The modified work must itself be a software library. + + b) You must cause the files modified to carry prominent notices + stating that you changed the files and the date of any change. + + c) You must cause the whole of the work to be licensed at no + charge to all third parties under the terms of this License. + + d) If a facility in the modified Library refers to a function or a + table of data to be supplied by an application program that uses + the facility, other than as an argument passed when the facility + is invoked, then you must make a good faith effort to ensure that, + in the event an application does not supply such function or + table, the facility still operates, and performs whatever part of + its purpose remains meaningful. + + (For example, a function in a library to compute square roots has + a purpose that is entirely well-defined independent of the + application. Therefore, Subsection 2d requires that any + application-supplied function or table used by this function must + be optional: if the application does not supply it, the square + root function must still compute square roots.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Library, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. 
But when you +distribute the same sections as part of a whole which is a work based +on the Library, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote +it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Library. + +In addition, mere aggregation of another work not based on the Library +with the Library (or with a work based on the Library) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may opt to apply the terms of the ordinary GNU General Public +License instead of this License to a given copy of the Library. To do +this, you must alter all the notices that refer to this License, so +that they refer to the ordinary GNU General Public License, version 2, +instead of to this License. (If a newer version than version 2 of the +ordinary GNU General Public License has appeared, then you can specify +that version instead if you wish.) Do not make any other change in +these notices. + + Once this change is made in a given copy, it is irreversible for +that copy, so the ordinary GNU General Public License applies to all +subsequent copies and derivative works made from that copy. + + This option is useful when you wish to copy part of the code of +the Library into a program that is not a library. + + 4. 
You may copy and distribute the Library (or a portion or +derivative of it, under Section 2) in object code or executable form +under the terms of Sections 1 and 2 above provided that you accompany +it with the complete corresponding machine-readable source code, which +must be distributed under the terms of Sections 1 and 2 above on a +medium customarily used for software interchange. + + If distribution of object code is made by offering access to copy +from a designated place, then offering equivalent access to copy the +source code from the same place satisfies the requirement to +distribute the source code, even though third parties are not +compelled to copy the source along with the object code. + + 5. A program that contains no derivative of any portion of the +Library, but is designed to work with the Library by being compiled or +linked with it, is called a "work that uses the Library". Such a +work, in isolation, is not a derivative work of the Library, and +therefore falls outside the scope of this License. + + However, linking a "work that uses the Library" with the Library +creates an executable that is a derivative of the Library (because it +contains portions of the Library), rather than a "work that uses the +library". The executable is therefore covered by this License. +Section 6 states terms for distribution of such executables. + + When a "work that uses the Library" uses material from a header file +that is part of the Library, the object code for the work may be a +derivative work of the Library even though the source code is not. +Whether this is true is especially significant if the work can be +linked without the Library, or if the work is itself a library. The +threshold for this to be true is not precisely defined by law. 
+ + If such an object file uses only numerical parameters, data +structure layouts and accessors, and small macros and small inline +functions (ten lines or less in length), then the use of the object +file is unrestricted, regardless of whether it is legally a derivative +work. (Executables containing this object code plus portions of the +Library will still fall under Section 6.) + + Otherwise, if the work is a derivative of the Library, you may +distribute the object code for the work under the terms of Section 6. +Any executables containing that work also fall under Section 6, +whether or not they are linked directly with the Library itself. + + 6. As an exception to the Sections above, you may also combine or +link a "work that uses the Library" with the Library to produce a +work containing portions of the Library, and distribute that work +under terms of your choice, provided that the terms permit +modification of the work for the customer's own use and reverse +engineering for debugging such modifications. + + You must give prominent notice with each copy of the work that the +Library is used in it and that the Library and its use are covered by +this License. You must supply a copy of this License. If the work +during execution displays copyright notices, you must include the +copyright notice for the Library among them, as well as a reference +directing the user to the copy of this License. Also, you must do one +of these things: + + a) Accompany the work with the complete corresponding + machine-readable source code for the Library including whatever + changes were used in the work (which must be distributed under + Sections 1 and 2 above); and, if the work is an executable linked + with the Library, with the complete machine-readable "work that + uses the Library", as object code and/or source code, so that the + user can modify the Library and then relink to produce a modified + executable containing the modified Library. 
(It is understood + that the user who changes the contents of definitions files in the + Library will not necessarily be able to recompile the application + to use the modified definitions.) + + b) Use a suitable shared library mechanism for linking with the + Library. A suitable mechanism is one that (1) uses at run time a + copy of the library already present on the user's computer system, + rather than copying library functions into the executable, and (2) + will operate properly with a modified version of the library, if + the user installs one, as long as the modified version is + interface-compatible with the version that the work was made with. + + c) Accompany the work with a written offer, valid for at + least three years, to give the same user the materials + specified in Subsection 6a, above, for a charge no more + than the cost of performing this distribution. + + d) If distribution of the work is made by offering access to copy + from a designated place, offer equivalent access to copy the above + specified materials from the same place. + + e) Verify that the user has already received a copy of these + materials or that you have already sent this user a copy. + + For an executable, the required form of the "work that uses the +Library" must include any data and utility programs needed for +reproducing the executable from it. However, as a special exception, +the materials to be distributed need not include anything that is +normally distributed (in either source or binary form) with the major +components (compiler, kernel, and so on) of the operating system on +which the executable runs, unless that component itself accompanies +the executable. + + It may happen that this requirement contradicts the license +restrictions of other proprietary libraries that do not normally +accompany the operating system. Such a contradiction means you cannot +use both them and the Library together in an executable that you +distribute. + + 7. 
You may place library facilities that are a work based on the +Library side-by-side in a single library together with other library +facilities not covered by this License, and distribute such a combined +library, provided that the separate distribution of the work based on +the Library and of the other library facilities is otherwise +permitted, and provided that you do these two things: + + a) Accompany the combined library with a copy of the same work + based on the Library, uncombined with any other library + facilities. This must be distributed under the terms of the + Sections above. + + b) Give prominent notice with the combined library of the fact + that part of it is a work based on the Library, and explaining + where to find the accompanying uncombined form of the same work. + + 8. You may not copy, modify, sublicense, link with, or distribute +the Library except as expressly provided under this License. Any +attempt otherwise to copy, modify, sublicense, link with, or +distribute the Library is void, and will automatically terminate your +rights under this License. However, parties who have received copies, +or rights, from you under this License will not have their licenses +terminated so long as such parties remain in full compliance. + + 9. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Library or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Library (or any work based on the +Library), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Library or works based on it. + + 10. 
Each time you redistribute the Library (or any work based on the +Library), the recipient automatically receives a license from the +original licensor to copy, distribute, link with or modify the Library +subject to these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties with +this License. + + 11. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Library at all. For example, if a patent +license would not permit royalty-free redistribution of the Library by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Library. + +If any portion of this section is held invalid or unenforceable under any +particular circumstance, the balance of the section is intended to apply, +and the section as a whole is intended to apply in other circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system which is +implemented by public license practices. 
Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 12. If the distribution and/or use of the Library is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Library under this License may add +an explicit geographical distribution limitation excluding those countries, +so that distribution is permitted only in or among countries not thus +excluded. In such case, this License incorporates the limitation as if +written in the body of this License. + + 13. The Free Software Foundation may publish revised and/or new +versions of the Lesser General Public License from time to time. +Such new versions will be similar in spirit to the present version, +but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library +specifies a version number of this License which applies to it and +"any later version", you have the option of following the terms and +conditions either of that version or of any later version published by +the Free Software Foundation. If the Library does not specify a +license version number, you may choose any version ever published by +the Free Software Foundation. + + 14. If you wish to incorporate parts of the Library into other free +programs whose distribution conditions are incompatible with these, +write to the author to ask for permission. For software which is +copyrighted by the Free Software Foundation, write to the Free +Software Foundation; we sometimes make exceptions for this. 
Our +decision will be guided by the two goals of preserving the free status +of all derivatives of our free software and of promoting the sharing +and reuse of software generally. + + NO WARRANTY + + 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO +WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. +EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR +OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY +KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE +LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME +THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN +WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY +AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU +FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR +CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE +LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING +RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A +FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF +SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH +DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Libraries + + If you develop a new library, and you want it to be of the greatest +possible use to the public, we recommend making it free software that +everyone can redistribute and change. You can do so by permitting +redistribution under these terms (or, alternatively, under the terms of the +ordinary General Public License). + + To apply these terms, attach the following notices to the library. 
It is +safest to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least the +"copyright" line and a pointer to where the full notice is found. + + <one line to give the library's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +Also add information on how to contact you by electronic and paper mail. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the library, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the + library `Frob' (a library for tweaking knobs) written by James Random Hacker. + + <signature of Ty Coon>, 1 April 1990 + Ty Coon, President of Vice + +That's all there is to it! 
+ + Added: trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE =================================================================== --- trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE (rev 0) +++ trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE 2008-05-27 18:58:18 UTC (rev 2268) @@ -0,0 +1,176 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. 
+ + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the ... [truncated message content] |
From: <bi...@us...> - 2008-05-22 23:27:25
|
Revision: 2267 http://archive-access.svn.sourceforge.net/archive-access/?rev=2267&view=rev Author: binzino Date: 2008-05-22 16:27:32 -0700 (Thu, 22 May 2008) Log Message: ----------- Added 0.12 pre-announcement. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/xdocs/index.xml Modified: trunk/archive-access/projects/nutchwax/xdocs/index.xml =================================================================== --- trunk/archive-access/projects/nutchwax/xdocs/index.xml 2008-05-21 00:02:01 UTC (rev 2266) +++ trunk/archive-access/projects/nutchwax/xdocs/index.xml 2008-05-22 23:27:32 UTC (rev 2267) @@ -60,6 +60,17 @@ </table> </section> <section name="News"> + <subsection name="Upcoming Release 0.12.0 - 05/22/2007"> + <p> + With this upcoming release, NutchWAX 0.12 will "catch-up" to + Nutch 1.0-dev (which uses Hadoop 0.16), thereby benefiting from + numerous bug fixes and enhancements contained therein. + </p> + <p> + We are on target for releasing a public beta on June 2nd. + Watch this space for further announcements. + </p> + </subsection> <subsection name="Release 0.10.0 - 01/17/2007"> <p> Bug fixes and improvements in the quality of search results |
From: <bi...@us...> - 2008-05-21 00:01:54
|
Revision: 2266 http://archive-access.svn.sourceforge.net/archive-access/?rev=2266&view=rev Author: binzino Date: 2008-05-20 17:02:01 -0700 (Tue, 20 May 2008) Log Message: ----------- Total re-write of install, readme and howto documents. Modified Paths: -------------- trunk/archive-access/projects/nat/archive/INSTALL.txt trunk/archive-access/projects/nat/archive/README.txt Added Paths: ----------- trunk/archive-access/projects/nat/archive/HOWTO.txt Added: trunk/archive-access/projects/nat/archive/HOWTO.txt =================================================================== --- trunk/archive-access/projects/nat/archive/HOWTO.txt (rev 0) +++ trunk/archive-access/projects/nat/archive/HOWTO.txt 2008-05-21 00:02:01 UTC (rev 2266) @@ -0,0 +1,325 @@ + +HOWTO.txt +2008-05-20 +Aaron Binns + +Table of Contents + o Prerequisites + - Nutch(WAX) installation + - ARC/WARC files + o Configuration & Patching + o Create a manifest + o Import, Invert and Index + o Search + o Web deployment + - Don't forget to config & patch again + +====================================================================== +Prerequisites +====================================================================== + +In order to use Nutch(WAX) you need the following prerequisites: + + 1. NutchWAX installed. + + See INSTALL.txt for instruction on building and installing + NutchWAX. + + This HOWTO assumes it is installed in + + /opt/nutch-1.0-dev + + 2. ARC/WARC files. + + The whole purpose of NutchWAX is to index ARC/WARC files. These + files are not produced by Nutch nor NutchWAX, they are produced by + other tools, such as Heritrix. + + If you don't have any ARC/WARC files, you have no need for + NutchWAX. + + +====================================================================== +Patching +====================================================================== + +The vanilla NutchWAX as built according to the INSTALL.txt guide is +not quite ready to be used out-of-the-box. 
+ +Before you can use NutchWAX, you must first patch a bug that exists in +the current Nutch SVN head. + +The file + + /opt/nutch-1.0-dev/conf/tika-mimetypes.xml + +contains two errors: one where a mimetype is referenced before it is +defined; and a second where a definition has an illegal character. + +These errors cause Nutch to not recognize certain mimetypes and +therefore will ignore documents matching those mimetypes. + +There are two fixes: + + 1. Move + + <mime-type type="application/xml"> + <alias type="text/xml" /> + <glob pattern="*.xml" /> + </mime-type> + + definition higher up in the file, before the reference to it. + + 2. Remove + + <mime-type type="application/x-ms-dos-executable"> + <alias type="application/x-dosexec;exe" /> + </mime-type> + + as the ';' character is illegal according to the comments in the + Nutch code. + +You can either apply these patches yourself, or copy an already-patched +copy from: + + /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml + +to + + /opt/nutch-1.0-dev/conf/tika-mimetypes.xml + + +====================================================================== +Configuring +====================================================================== + +Since we assume that you are already familiar with Nutch, then you +should already be familiar with configuring it. The configuration +is mainly defined in + + /opt/nutch-1.0-dev/conf/nutch-default.xml + +NutchWAX requires the modification of two existing properties and the +addition of two new ones. 
+ +All of the modifications described below can be found in: + + /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml + +You can either apply the configuration changes yourself, or copy that +file to + + /opt/nutch-1.0-dev/conf/nutch-site.xml + +-------------------------------------------------- +plugin.includes +-------------------------------------------------- +Change the list of plugins from: + + protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) + +to + + protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic + +In short, we add: + + index-nutchwax + query-nutchwax + parse-pdf + +and remove: + + urlfilter-regex + urlnormalizer-(pass|regex|basic) + +The only *required* changes are the additions of the NutchWAX index +and query plugins. The rest are optional, but recommended. + +The addition of the "parse-pdf" plugin is simply because we have lots +of PDFs in our archives and we want to index them. We sometimes +remove the "parse-js" plugin if we don't care to index JavaScript +files. + +We also remove the URL filtering and normalizing plugins because we do +not need the URLs normalized nor filtered. We trust that the tool +that produced the ARC/WARC file will have normalized the URLs +contained therein according to its own rules so there's no need to +normalize here. Also, we don't filter by URL since we want to index +as much of the ARC/WARC file as we have parsers for. + +-------------------------------------------------- +mime.type.magic +-------------------------------------------------- +We disable mimetype detection in Nutch for two reasons: + +1. The ARC/WARC file specifies the Content-Type of the document. We + trust that the tool that created the ARC/WARC file got it right. + +2. 
The implementation in Nutch can use a lot of memory as the *entire* + document is read into memory as a byte[], then converted to a + String, then checked against the MIME database. This can lead to + out of memory errors for large files, such as music and video. + +To disable, simply set the property value to false. + + <property> + <name>mime.type.magic</name> + <value>false</value> + </property> + +-------------------------------------------------- +nutchwax.filter.index +-------------------------------------------------- +Configure the 'index-nutchwax' plugin. Specify how the metadata +fields added by the ArcsToSegment are mapped to the Lucene documents +during indexing. + +The specifications here are of the form: + + src-key:lowercase:store:tokenize:dest-key + +where the only required part is the "src-key", the rest will assume +the following defaults: + + lowercase = true + store = true + tokenize = false + dest-key = src-key + +We recommend: + +<property> + <name>nutchwax.filter.index</name> + <value> + arcname:false + collection + date + type + </value> +</property> + +-------------------------------------------------- +nutchwax.filter.query +-------------------------------------------------- +Configure the 'query-nutchwax' plugin. Specify which fields to make +searchable via "[field]:[term|phrase]" query syntax, and whether they +are "raw" fields or not. + +The specification format is + + raw:name:lowercase:boost +or + field:name:boost + +Default values are + + lowercase = true + boost = 1.0f + +There is no "lowercase" property for "field" specification because the +Nutch FieldQueryFilter doesn't expose the option, unlike the +RawFieldQueryFilter. 
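To make the "nutchwax.filter.query" format above concrete, here is a minimal Java sketch that parses entries of the form raw:name:lowercase:boost and field:name:boost, applying the stated defaults (lowercase = true, boost = 1.0f). The class QueryFieldSpec and its methods are hypothetical names used only for illustration; they are not taken from the query-nutchwax plugin, and the assumption that entries are separated by whitespace comes from the layout of the recommended property value, not from the plugin source.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical holder for one "nutchwax.filter.query" entry.
 * Illustration only; this is NOT the query-nutchwax plugin's own class.
 */
final class QueryFieldSpec {
    final boolean raw;        // true for "raw:...", false for "field:..."
    final String name;        // field name, e.g. "arcname" or "title"
    final boolean lowercase;  // only meaningful for raw entries
    final float boost;

    private QueryFieldSpec(boolean raw, String name, boolean lowercase, float boost) {
        this.raw = raw;
        this.name = name;
        this.lowercase = lowercase;
        this.boost = boost;
    }

    /** Parses one entry, filling in the documented defaults for omitted parts. */
    static QueryFieldSpec parse(String spec) {
        String[] p = spec.trim().split(":");
        if (p.length < 2 || p[1].isEmpty()) {
            throw new IllegalArgumentException("expected raw:name or field:name, got: " + spec);
        }
        if (p[0].equals("raw")) {
            // raw:name:lowercase:boost
            boolean lowercase = p.length > 2 && !p[2].isEmpty() ? Boolean.parseBoolean(p[2]) : true;
            float boost = p.length > 3 ? Float.parseFloat(p[3]) : 1.0f;
            return new QueryFieldSpec(true, p[1], lowercase, boost);
        }
        if (p[0].equals("field")) {
            // field:name:boost (no lowercase part for "field" entries)
            float boost = p.length > 2 ? Float.parseFloat(p[2]) : 1.0f;
            return new QueryFieldSpec(false, p[1], true, boost);
        }
        throw new IllegalArgumentException("unknown entry kind: " + p[0]);
    }

    /** Parses a whole property value; entries are assumed to be whitespace-separated. */
    static List<QueryFieldSpec> parseAll(String value) {
        List<QueryFieldSpec> specs = new ArrayList<QueryFieldSpec>();
        for (String token : value.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                specs.add(parse(token));
            }
        }
        return specs;
    }

    public static void main(String[] args) {
        for (QueryFieldSpec s : parseAll("raw:arcname:false raw:collection field:title")) {
            System.out.println((s.raw ? "raw " : "field ") + s.name
                    + " lowercase=" + s.lowercase + " boost=" + s.boost);
        }
    }
}
```

Under these assumptions, an entry such as raw:arcname:false keeps the archive name case-sensitive, while raw:collection and field:title fall back to the defaults.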
+ +NOTE: We do *not* use this filter for handling "date" queries; there is a +specific filter for that: DateQueryFilter + +We recommend: + +<property> + <name>nutchwax.filter.query</name> + <value> + raw:arcname:false + raw:collection + raw:type + field:anchor + field:content + field:host + field:title + </value> +</property> + + +====================================================================== +Create a manifest +====================================================================== + +The input to NutchWAX's import tool is a manifest file. This is a +simple text file where each line contains a URL to an ARC/WARC file +and an optional collection name. + +For example: + + $ cat > manifest + http://someserver/somepath/somearchive.arc.gz mycollection + ^D + +This creates a simple manifest file with one ARC file and a collection +name of "mycollection". + +You don't have to use collections at all. If you don't know how you +would use them, then simply leave the collection name out here. + + +====================================================================== +Import, Invert and Index +====================================================================== + +The steps to import the files, invert the links and index the documents +are rather simple: + + $ mkdir crawl + $ cd crawl + $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest + $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments + $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments + $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/* + $ ls -F1 + crawldb/ + indexes/ + linkdb/ + segments/ + +To those already familiar with Nutch, these steps should be quite +familiar. + +In the first step, we call NutchWAX's "import" command, which creates the +Nutch segment containing the documents in the ARC/WARC files listed in +the manifest. The rest is the same as regular Nutch.
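The manifest format described above (one ARC/WARC URL per line, optionally followed by a collection name) can be sketched in Java as follows. This is only an illustration under stated assumptions: the class ManifestEntry is a hypothetical name, not part of NutchWAX; the whitespace separator is inferred from the example manifest; and skipping blank lines and rejecting lines with more than two tokens are choices made for this sketch, not documented behaviour of the import tool.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical representation of one manifest line.
 * Illustration only; NOT the class used by the NutchWAX import tool.
 */
final class ManifestEntry {
    final String url;
    final String collection;  // null when the line names no collection

    private ManifestEntry(String url, String collection) {
        this.url = url;
        this.collection = collection;
    }

    /** Parses "url" or "url collection"; the separator is assumed to be whitespace. */
    static ManifestEntry parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts[0].isEmpty()) {
            throw new IllegalArgumentException("empty manifest line");
        }
        if (parts.length > 2) {
            // Assumption of this sketch: anything beyond URL + collection is an error.
            throw new IllegalArgumentException("too many fields: " + line);
        }
        return new ManifestEntry(parts[0], parts.length == 2 ? parts[1] : null);
    }

    /** Parses every non-blank line of a manifest (blank lines are skipped by assumption). */
    static List<ManifestEntry> parseAll(String manifest) {
        List<ManifestEntry> entries = new ArrayList<ManifestEntry>();
        for (String line : manifest.split("\n")) {
            if (!line.trim().isEmpty()) {
                entries.add(parseLine(line));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        String manifest =
                "http://someserver/somepath/somearchive.arc.gz mycollection\n"
              + "http://someserver/somepath/other.warc.gz\n";
        for (ManifestEntry e : parseAll(manifest)) {
            System.out.println(e.url + " -> "
                    + (e.collection == null ? "(no collection)" : e.collection));
        }
    }
}
```

The second line in the usage example has no collection name, which mirrors the statement that collections are optional.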
+ + +====================================================================== +Search +====================================================================== +The resulting indexes can be searched in exactly the same manner as in +regular Nutch. For example, assuming you just completed the steps +above, now: + + $ cd ../ + $ ls -F1 + crawl/ + $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer + +This calls the NutchBean to execute a simple keyword search for +"computer". Use whatever query term you think appears in the +documents you imported. + + +====================================================================== +Web Deployment +====================================================================== + +As users of Nutch are aware, the web application (nutch-1.0-dev.war) +bundled with Nutch contains duplicate copies of the configuration +files. + +So, all patches and configuration changes that we made to the +files in + + /opt/nutch-1.0-dev/conf + +will have to be duplicated in the Nutch webapp when it is deployed. + +This is not due to NutchWAX, this is a "feature" of regular Nutch. I +just thought it would be good to remind everyone since we did make +configuration changes for NutchWAX. Modified: trunk/archive-access/projects/nat/archive/INSTALL.txt =================================================================== --- trunk/archive-access/projects/nat/archive/INSTALL.txt 2008-05-14 00:20:24 UTC (rev 2265) +++ trunk/archive-access/projects/nat/archive/INSTALL.txt 2008-05-21 00:02:01 UTC (rev 2266) @@ -1,236 +1,93 @@ INSTALL.txt -2008-05-06 +2008-05-20 Aaron Binns +This installation guide assumes the reader is already familiar with +building, packaging and deploying Nutch 1.0-dev. -The NutchWAX 0.12 build and installation is as an "add-on" to an -existing Nutch 1.0-dev installation. -NutchWAX 0.12 uses a simple 'ant' build script. The script compiles -the NutchWAX sources, using the libraries in the installed -Nutch-1.0-dev. 
+The NutchWAX 0.12 source and build system are designed to integrate +into the existing Nutch 1.0-dev source and build. -We strongly recommend having *two* Nutch-1.0-dev installation -directories: one that you build NutchWAX against, and another into -which NutchWAX is deployed. +The long-term goal is for the NutchWAX components to be fully +integrated into mainline Nutch. As a stepping-stone toward that goal, +we have packaged the NutchWAX source to be dropped into the Nutch +"contrib" directory and built from there. -NutchWAX is deployed by un-tar'ing the nutchwax-0.12.tar.gz file -*into* an existing Nutch-1.0-dev installation. Think of NutchWAX as -an add-on. We over-write a few Nutch config files, but the rest is -simply added to the existing Nutch-1.0-dev installation. +Like Nutch, NutchWAX 0.12 uses a simple 'ant' build script. The +NutchWAX build script calls out to the Nutch script to build Nutch +proper, then builds NutchWAX components and integrates them into the +Nutch build directory. +We recommend that you execute all build commands from the NutchWAX +directory. This way, NutchWAX will ensure that any and all +dependencies in Nutch will be properly built and kept up-to-date. +Towards this goal, we have duplicated the most common build targets +from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file, +such as: + o compile + o jar + o job + o tar + o clean + +Again, the idea is that if you're already used to building Nutch, you +can easily transition to building Nutch and NutchWAX together. All of +the build artifacts will still be placed in Nutch's 'build' +sub-directory as normal. + + Nutch-1.0-dev ------------- - -As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Now +As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Nutch doesn't have a 1.0 release package yet, so we have to use the -Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12 is +Nutch SVN trunk. 
The specific SVN revision that NutchWAX 0.12 is built against is: 650739 To check out this revision of Nutch, use: - $ mkdir nutch + $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch $ cd nutch - $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk -To build the nutch-1.0-dev.tar.gz package, use 'ant' - $ cd trunk - $ ant tar +NutchWAX +-------- +Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into +Nutch's "contrib" directory. -This produces + $ cd contrib + $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nat/archive - build/nutch-1.0-dev.tar.gz +This will create a sub-directory named "archive" containing the +NutchWAX sources. -Which we then install *twice* - $ mkdir -p ~/nutchwax-0.12/nutch-1.0-dev - $ tar xfz -C ~/nutchwax-0.12/nutch-1.0-dev build/nutch-1.0-dev.tar.gz - $ mkdir -p /opt/nutch-1.0-dev - $ tar xfz -C /opt/nutch-1.0-dev build/nutch-1.0-dev.tar.gz - -The idea is that we keep /opt/nutch-1.0-dev as our pristine copy which -we compile against, then, when we want to test NutchWAX, we deploy it -into ~/nutchwax-0.12/nutch-1.0-dev. - -Why can't we just use one installation of Nutch? Mainly to avoid -weirdness where we are compiling NutchWAX source against the same set -of libraries where we would be installing NutchWAX. Consider, when we -deploy NutchWAX, we copy the nutchwax.jar into the Nutch 'lib' -directory. If we use that same 'lib' directory for dependencies when -compiling the source, 'ant'/'javac' will likely get confused when -calculating dependencies. - -It's possible that you could successfully go through the -build/test/release cycle using one Nutch-1.0-dev directory, but these -instructions assume you will have two. - - Build and install ----------------- +Assuming you already have the required tool-set for building Nutch, +building NutchWAX is a snap. - 1.
Install two Nutch-1.0-dev packages per the instructions above. +Simply execute the same 'ant' build command in - 2. Edit build.xml to point to the "pristine" installation of Nutch-1.0-dev + nutch/contrib/archive - <!-- NOTE: Point this to your Nutch 1.0-dev directory --> - <property name="nutch.dir" value="/opt/nutch-1.0-dev" /> +as you normally would and everything will build as normal. - 3. Build NutchWAX-0.12 +For example - $ ant + $ cd nutch/contrib/archive + $ ant tar - The default build rule is "package" which will compile all the source - and build an installation tarball: nutchwax-0.12.tar.gz +This command will build all of Nutch, then the NutchWAX add-ons and +finally will package everything up into the "nutch-1.0-dev.tar.gz" +release package. - The "build.xml" file is pretty straightforward and just grepping - for the targets should be pretty obvious: compile, clean, etc. - - 4. Install NutchWAX into the build/test Nutch installation - - $ tar xfz -C ~/nutchwax-0.12/nutch-1.0-dev nutchwax-0.12.tar.gz - -That's it! - -All we do is add our libraries (nutchwax.jar and dependencies), the -'nutchwax' helper script, plugins for indexing and querying, and a few -config files. - -Except for the config files, no files in the Nutch-1.0-dev -installation are over-written, only added. The "nutch-site.xml" file -is over-written, but that file is empty in a vanilla Nutch -installation, so there's little risk of over-writing something. - - -HOWTO run and test ------------------- - -The 'nutchwax' helper script is installed in the Nutch-1.0-dev 'bin' -directory next to the 'nutch' helper script. - -The 'nutchwax' script is used to run the NutchWAX-specific tools, use -the regular 'nutch' script for regular Nutch activities. - -The 'nutchwax' script runs two tools - - "import" Import a set of .arc/.warc files from a manifest, creating - a Nutch segment. - - "dumpindex" Debug tool that dumps a Lucene index, such as the ones - created by Nutch's "index" tool.
- -The idea is that the NutchWAX "import" tool supplants the Nutch -generate and fetch cycle. Rather than generating and fetching -segments, we import the .arc/.warc files directly into a newly created -segment. Then, we process that segment just as we normally would with -Nutch. - -For example, - - $ cd nutch-test - $ cat > manifest - http://someserver/foo-bar-baz.arc mycollection - ^D - $ nutch-1.0-dev/bin/nutchwax import manifest - -This will import the arc file listed in the manifest into a newly -created segment. The segment is created by default in a directory -hierarchy of the form: - - segments/[date-timestamp] - -This mirrors the way segments are created in vanilla Nutch by the -"generate" command. - -You can explicitly name the segment if you want, e.g. - - nutchwax import manifest mysegment - -Once the segment is created by the importing of ARC files with -NutchWAX, you can use Nutch to perform the rest of the steps. For +Then, install the "nutch-1.0-dev.tar.gz" tarball as normal. For example: - $ nutch-1.0-dev/bin/nutchwax import manifest - $ nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments - $ nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments - $ nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/* - $ nutch-1.0-dev/bin/nutch merge index indexes - -This is pretty much the minimal set of steps to import and index a set -of ARC files. The crawldb update and link inversion steps are pro -forma and don't have anything to do with NutchWAX specifically, but -are a part of regular Nutch processing. - -Now you have a Nutch "index" directory and are ready to search! - -Searching is done as in vanilla Nutch. Either launch the Nutch webapp -or use the command-line interface to NutchBean to run some test -searches. Nothing NutchWAX-specific here. - - -Miscellaneous notes -------------------- - -1. 
Plugins - -There are two plugins bundled with NutchWAX: - - index-nutchwax - query-nutchwax - -See the "plugin.includes" property in nutch-site.xml to see where -these plugins are added to the filter chain. - -The index-nutchwax plugin ensures that WAX-specifici metadata is -transferred from the Nutch Content object to the Lucene Document -object, which is placed in the Lucene index. - -The query-nutchwax plugin is used to process query requests against -those same meta-data fields. It also expands the capabilities of -searching the basic Nutch fields as well. - -2. URL filters - -Nutch's URL filter by default filters-out many common URL oddities -that would normally trip-up Nutch's crawler. However, when importing -content from ARC files, filtering out content probably doens't make -sense. That is, whatever content made it into the ARC file should be -imported, no matter what the URL looks like. - -To change the URL filter, edit the Nutch file 'conf/regex-urlfilter.txt'. -To pass all content through the filter, remove all filter rules except -for the last one: - - # accept anything else - +. - -3. conf/tika-mimetypes.xml - -NutchWAX comes with a fixed copy of tika-mimetypes.xml. The version -in Nutch revision 650739 has a few bugs in it which cause parsing to -fail for many document types. The bugs are: - - o Move - - <mime-type type="application/xml"> - <alias type="text/xml" /> - <glob pattern="*.xml" /> - </mime-type> - - definition higher up in the file, before the reference to it. - - o Remove - - <mime-type type="application/x-ms-dos-executable"> - <alias type="application/x-dosexec;exe" /> - </mime-type> - - as the ';' character is illegal according to the comments in the - Nutch code. - -The copy of "conf/tika-mimetypes.xml" bundled with NutchWAX fixes -these two bugs. 
+ $ cd /opt + $ tar xvfz nutch-1.0-dev.tar.gz Modified: trunk/archive-access/projects/nat/archive/README.txt =================================================================== --- trunk/archive-access/projects/nat/archive/README.txt 2008-05-14 00:20:24 UTC (rev 2265) +++ trunk/archive-access/projects/nat/archive/README.txt 2008-05-21 00:02:01 UTC (rev 2266) @@ -1,105 +1,122 @@ README.txt -2008-05-06 +2008-05-20 Aaron Binns +Welcome to NutchWAX 0.12! -This is the NutchWAX-0.12 source that John Lee handed-off to me. It -is a work-in-progress. +NutchWAX is a set of add-ons to Nutch in order to index and search +archived web data. -Compared to NutchWAX-0.10 (and earlier) it is *much* simpler. The -main WAX-specific code is in just a few files really: +These add-ons are developed and maintained by the Internet Archive Web +Team in conjunction with a broad community of contributors, partners +and end-users. -src/java/org/archive/nutchwax/ArcsToSegment.java +The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions". - This is the meat of the WAX logic for processing .arc files and - generating Nutch segments. Once we use this to generate a set of - segments for the .arc files, we can use the rest of vanilla - Nutch-1.0-dev to invert links and index the content with Lucene. +Since NutchWAX is a set of add-ons to Nutch, you should already be +familiar with Nutch before using NutchWAX. - This conversion code is heavily edited from: +====================================================================== - nutch-1.0-dev/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java +The goal of NutchWAX is to enable full-text indexing and searching of +documents stored in web archive file formats (ARC and WARC). - taken from the Nutch SVN head (a.k.a the "1.0-dev" in-development). +The way we achieve that goal is by providing add-on tools and plugins +to Nutch to read documents directly from ARC/WARC files. We call this +process "importing" archive files. 
- Ours differs in a few important ways: +Importing produces a Nutch segment, the same as if Nutch had actually +crawled the documents itself. In this scenario, document importing +replaces the conventional "generate/fetch/update" cycle of Nutch. - o Rather than taking a directory with .arc files as input, we take - a manifest file with URLs to .arc files. This way, the manifest - is split up among the distributed Hadoop jobs and the .arc files - are processed in whole by each worker. +Once the archival documents have been imported into a segment, the +regular Nutch commands to update the 'crawldb', invert the links and +index the document contents can proceed as normal. - In the Nutch-1.0-dev, the ArcSegmentCreator.java expects the - input directory to contain the .arc files and (AFAICT) splits - them up and distributes them across the Hadoop workers. This - seems really inefficient to me, I think our approach is much - better -- at least for us. +====================================================================== - o Related to the way input files are split and processed, we use - the standard Archive ARCReader class just like Heritrix and - Wayback. +The NutchWAX add-ons consist of: - The ArcSegmentCreator.java in Nutch-1.0-dev doesn't use our - ARCReader because of licensing imcompatibility. Ours is under - GPL and Nutch-1.0-dev forbids the use of GPL code. - - We are in the process of re-licensing or dual-licensing with - Apache License, but until then, our ARCReader code won't be incldued - in mainline Nutch. + bin/nutchwax - This isn's a problem per se, but worth noting in case anyone - looks at the Nutch-1.0-dev code and wonders why they built their - own (horribly inefficient) .arc reader. + A shell script that is used to run the NutchWAX command-line tools, + such as document importing. - o We add metadata fields to the processed document for WAX-specific - purposes: + This is patterned after the 'bin/nutch' shell script. 
- content.getMetadata().set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() ); - content.getMetadata().set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() ) ; - content.getMetadata().set( NutchWax.COLLECTION_KEY, collection); - content.getMetadata().set( NutchWax.ARCHIVE_DATE_KEY, meta.getDate() ); + plugins/index-nutchwax - The addition of the arcname and collection key is pretty - obvious. I don't know why the content-type isn't added in the - vanilla Nutch-1.0-dev. - - Also, we should review the use of the ARCHIVE_DATE_KEY in that - John Lee mentioned to me that there was possibly duplicate date - fields put in the index: one that is a plain old Java date, and - one that is a 14-digit date string for use with Wayback. + Indexing plugin which adds NutchWAX-specific metadata fields to the + indexed document. -src/java/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/NutchWaxIndexingFilter.java -src/java/plugin/index-nutchwax/plugin.xml + plugins/query-nutchwax - This filter is pretty straightforward. All it does is take the - metadata fields that were added to the document (as described above) - and placed in the Lucene index so that we can make use of them at - search-time. + Query plugin which allows for querying against the metadata fields + added by 'index-nutchwax'. -src/java/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/MultipleFieldQueryFilter.java -src/java/plugin/query-nutchwax/plugin.xml +There is no separate 'lib/nutchwax.jar' file for NutchWAX. NutchWAX +is distributed in source code form and is intended to be built in +conjunction with Nutch. - This is a single query filter that can be used for querying single - fields from a single implementation. It does *not* allow for - querying multiple fields as you can already do that via Nutch. +See "INSTALL.txt" for details on building NutchWAX and Nutch. 
- What this filter does is allows one to more-or-less create query - filters in a data-driven manner rather than having to code-up a new - class for each field. That is, before one would have to create a - CollectionQueryFilter class to filter on the "collection" field. - With the MultipleFieldQueryFilter class, you can specify that the - "collection" field is to be filterable via the plugin.xml file and - "nutchwax.filter.query" configuration property. +See "HOWTO.txt" for a quick tutorial on importing, indexing and +searching a set of documents in a web archive file. -src/java/org/archive/nutchwax/NutchWax.java +====================================================================== - Just a simple enum used by the above two classes for the metadata - keys. +This 0.12 release of NutchWAX is radically different in source-code +form compared to the previous release, 0.10. -src/java/org/archive/nutchwax/tools/DumpIndex.java +One of the design goals of 0.12 was to reduce or even eliminate the +"copy/paste/edit" approach of 0.10. The 0.10 (and prior) NutchWAX +releases had to copy/paste/edit large chunks of Nutch source code in +order to add the NutchWAX features. - A simple command-line utility to dump the contents of a Lucene - index. Used for debugging. +Also, the NutchWAX 0.12 sources and build are designed to one day be +added into mainline Nutch as a proper "contrib" package; then +eventually be fully integrated into the core Nutch source code. +====================================================================== +Most of the NutchWAX source code is relatively straightfoward to those +already familiar with the inner workings of Nutch. Still, special +attention on one class is worth while: + + src/java/org/archive/nutchwax/ArcsToSegment.java + +This is where ARC/WARC files are read and their documents are imported +into a Nutch segment. + +It is inspired by: + + nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java + +on the Nutch SVN head. 
+ +Our implementation differs in a few important ways: + + o Rather than taking a directory with ARC files as input, we take a + manifest file with URLs to ARC files. This way, the manifest is + split up among the distributed Hadoop jobs and the ARC files are + processed in whole by each worker. + + In the Nutch SVN, the ArcSegmentCreator.java expects the input + directory to contain the ARC files and (AFAICT) splits them up and + distributes them across the Hadoop workers. + + o We use the standard Internet Archive ARCReader and WARCReader + classes. Thus, NutchWAX can read both ARC and WARC files, whereas + the ArcSegmentCreator class can only read ARC files. + + o We add metadata fields to the document, which are then available + to the "index-nutchwax" plugin at indexing-time. + + ArcsToSegment.importRecord() + ... + contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() ); + contentMetadata.set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() ); + contentMetadata.set( NutchWax.COLLECTION_KEY, collectionName ); + contentMetadata.set( NutchWax.DATE_KEY, meta.getDate() ); + ... This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-05-14 00:20:16
Revision: 2265
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2265&view=rev
Author:   binzino
Date:     2008-05-13 17:20:24 -0700 (Tue, 13 May 2008)

Log Message:
-----------
Initial checkin of NutchWAX 0.12, a.k.a Nutch Archive Tools (NAT).

Added Paths:
-----------
    trunk/archive-access/projects/nat/
    trunk/archive-access/projects/nat/archive/
    trunk/archive-access/projects/nat/archive/INSTALL.txt
    trunk/archive-access/projects/nat/archive/README.txt
    trunk/archive-access/projects/nat/archive/bin/
    trunk/archive-access/projects/nat/archive/bin/nutchwax
    trunk/archive-access/projects/nat/archive/build.xml
    trunk/archive-access/projects/nat/archive/conf/
    trunk/archive-access/projects/nat/archive/conf/nutch-site.xml
    trunk/archive-access/projects/nat/archive/conf/search-servers.txt
    trunk/archive-access/projects/nat/archive/conf/tika-mimetypes.xml
    trunk/archive-access/projects/nat/archive/lib/
    trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.jar
    trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.jar
    trunk/archive-access/projects/nat/archive/lib/fastutil-5.0.3.jar
    trunk/archive-access/projects/nat/archive/src/
    trunk/archive-access/projects/nat/archive/src/java/
    trunk/archive-access/projects/nat/archive/src/java/org/
    trunk/archive-access/projects/nat/archive/src/java/org/archive/
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcsToSegment.java
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/NutchWax.java
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/tools/
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/tools/DumpIndex.java
    trunk/archive-access/projects/nat/archive/src/plugin/
    trunk/archive-access/projects/nat/archive/src/plugin/build-plugin.xml
    trunk/archive-access/projects/nat/archive/src/plugin/build.xml
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/build.xml
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/plugin.xml
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/build.xml
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/plugin.xml
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/ConfigurableQueryFilter.java
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/DateQueryFilter.java

Added: trunk/archive-access/projects/nat/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/INSTALL.txt    (rev 0)
+++ trunk/archive-access/projects/nat/archive/INSTALL.txt    2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,236 @@
+
+INSTALL.txt
+2008-05-06
+Aaron Binns
+
+
+The NutchWAX 0.12 build and installation is as an "add-on" to an
+existing Nutch 1.0-dev installation.
+
+NutchWAX 0.12 uses a simple 'ant' build script. The script compiles
+the NutchWAX sources, using the libraries in the installed
+Nutch-1.0-dev.
+
+We strongly recommend having *two* Nutch-1.0-dev installation
+directories: one that you build NutchWAX against, and another into
+which NutchWAX is deployed.
+
+NutchWAX is deployed by un-tar'ing the nutchwax-0.12.tar.gz file
+*into* an existing Nutch-1.0-dev installation. Think of NutchWAX as
+an add-on. We over-write a few Nutch config files, but the rest is
+simply added to the existing Nutch-1.0-dev installation.
+
+
+Nutch-1.0-dev
+-------------
+
+As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Now
+Nutch doesn't have a 1.0 release package yet, so we have to use the
+Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12 is
+built against is:
+
+  650739
+
+To checkout this revision of Nutch, use:
+
+  $ mkdir nutch
+  $ cd nutch
+  $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk
+
+To build the nutch-1.0-dev.tar.gz package, use 'ant'
+
+  $ cd trunk
+  $ ant tar
+
+This produces
+
+  build/nutch-1.0-dev.tar.gz
+
+Which we then install *twice*
+
+  $ mkdir -p ~/nutchwax-0.12/nutch-1.0-dev
+  $ tar xfz -C ~/nutchwax-0.12/nutch-1.0-dev build/nutch-1.0-dev.tar.gz
+  $ mkdir -p /opt/nutch-1.0-dev
+  $ tar xfz -C /opt/nutch-1.0-dev build/nutch-1.0-dev.tar.gz
+
+The idea is that we keep /opt/nutch-1.0-dev as our pristine copy which
+we compile against, then, when we want to test NutcWAX, we deploy it
+into ~/nutchwax-0.12/nutch-1.0-dev.
+
+Why can't we just use one installation of Nutch? Mainly to avoid
+weirdness where we are compiling NutchWAX source against the same set
+of libraries where we would be installing NutchWAX. Consider, when we
+deploy NutchWAX, we copy the nutchwax.jar into the Nutch 'lib'
+directory. If we use that same 'lib' directory for dependencies when
+compiling the source, 'ant'/'javac' will likely get confused when
+calculating dependencies.
+
+It's possible that you could successfully go through the
+build/test/release cycle using one Nutch-1.0-dev directory, but these
+instructions assume you will have two.
+
+
+Build and install
+-----------------
+
+  1. Install two Nutch-1.0-dev packages per the instructions above.
+
+  2. Edit build.xml to point to the "pristine" installation of Nutch-1.0-dev
+
+     <!-- NOTE: Point this to your Nutch 1.0-dev directory -->
+     <property name="nutch.dir" value="/opt/nutch-1.0-dev" />
+
+  3. Build NutchWAX-0.12
+
+     $ ant
+
+     The default build rule is "package" which will compile all the source
+     and build an intallation tarball: nutchwax-0.12.tar.gz
+
+     The "build.xml" file is pretty straightforward and just grepping
+     for the targets should be pretty obvious: compile, clean, etc.
+
+  4. Install NutchWAX into the build/test Nutch installation
+
+     $ tar xfz -C ~/nutchwax-0.12/nutch-1.0-dev nutchwax-0.12.tar.gz
+
+That's it!
+
+All we do is add our libraries (nutchwax.jar and dependencies), the
+'nutchwax' helper script, plugins for indexing and querying, and a few
+config files.
+
+Except for the config files, no files in the Nutch-1.0-dev
+installation are over-written, only added. The "nutch-site.xml" file
+is over-written, but that file is empty in a vanilla Nutch
+installation, so there's small risk of over-writing something.
+
+
+HOWTO run and test
+------------------
+
+The 'nutchwax' helper script is installed in the Nutch-1.0-dev 'bin'
+directory next to the 'nutch' helper script.
+
+The 'nutchwax' script is used to run the NutchWAX-specific tools, use
+the regular 'nutch' script for regular Nutch activities.
+
+The 'nutchwax' script runs two tools
+
+  "import"    Import a set of .arc/.warc files from a manifest, creating
+              a Nutch segment.
+
+  "dumpindex" Debug tool that dumps a Lucene index, such as the ones
+              created by Nutch's "index" tool.
+
+The idea is that the NutchWAX "import" tool supplants the Nutch
+generate and fetch cycle. Rather than generating and fetching
+segments, we import the .arc/.warc files directly into a newly created
+segment. Then, we process that segment just as we normally would with
+Nutch.
+
+For example,
+
+  $ cd nutch-test
+  $ cat > manifest
+  http://someserver/foo-bar-baz.arc mycollection
+  ^D
+  $ nutch-1.0-dev/bin/nutchwax import manifest
+
+This will import the arc file listed in the manifest into a newly
+created segment. The segment is created by default in a directory
+hierarchy of the form:
+
+  segments/[date-timestamp]
+
+This mirrors the way segments are created in vanilla Nutch by the
+"generate" command.
+
+You can explicitly name the segment if you want, e.g.
+
+  nutchwax import manifest mysegment
+
+Once the segment is created by the importing of ARC files with
+NutchWAX, you can use Nutch to perform the rest of the steps. For
+example:
+
+  $ nutch-1.0-dev/bin/nutchwax import manifest
+  $ nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
+  $ nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments
+  $ nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
+  $ nutch-1.0-dev/bin/nutch merge index indexes
+
+This is pretty much the minimal set of steps to import and index a set
+of ARC files. The crawldb update and link inversion steps are pro
+forma and don't have anything to do with NutchWAX specifically, but
+are a part of regular Nutch processing.
+
+Now you have a Nutch "index" directory and are ready to search!
+
+Searching is done as in vanilla Nutch. Either launch the Nutch webapp
+or use the command-line interface to NutchBean to run some test
+searches. Nothing NutchWAX-specific here.
+
+
+Miscellaneous notes
+-------------------
+
+1. Plugins
+
+There are two plugins bundled with NutchWAX:
+
+  index-nutchwax
+  query-nutchwax
+
+See the "plugin.includes" property in nutch-site.xml to see where
+these plugins are added to the filter chain.
+
+The index-nutchwax plugin ensures that WAX-specifici metadata is
+transferred from the Nutch Content object to the Lucene Document
+object, which is placed in the Lucene index.
+
+The query-nutchwax plugin is used to process query requests against
+those same meta-data fields. It also expands the capabilities of
+searching the basic Nutch fields as well.
+
+2. URL filters
+
+Nutch's URL filter by default filters-out many common URL oddities
+that would normally trip-up Nutch's crawler. However, when importing
+content from ARC files, filtering out content probably doens't make
+sense. That is, whatever content made it into the ARC file should be
+imported, no matter what the URL looks like.
+
+To change the URL filter, edit the Nutch file 'conf/regex-urlfilter.txt'.
+To pass all content through the filter, remove all filter rules except
+for the last one:
+
+  # accept anything else
+  +.
+
+3. conf/tika-mimetypes.xml
+
+NutchWAX comes with a fixed copy of tika-mimetypes.xml. The version
+in Nutch revision 650739 has a few bugs in it which cause parsing to
+fail for many document types. The bugs are:
+
+  o Move
+
+      <mime-type type="application/xml">
+        <alias type="text/xml" />
+        <glob pattern="*.xml" />
+      </mime-type>
+
+    definition higher up in the file, before the reference to it.
+
+  o Remove
+
+      <mime-type type="application/x-ms-dos-executable">
+        <alias type="application/x-dosexec;exe" />
+      </mime-type>
+
+    as the ';' character is illegal according to the comments in the
+    Nutch code.
+
+The copy of "conf/tika-mimetypes.xml" bundled with NutchWAX fixes
+these two bugs.
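
The following sketch is not part of the commit. The HOWTO section of
INSTALL.txt above shows a manifest with one line per archive: a URL
followed by a collection name. Assuming that two-field layout is what
the "import" tool expects, a small pre-flight check along these lines
might catch malformed lines before a long import starts. The function
name check_manifest and the strict two-field rule are hypothetical and
chosen for illustration only; NutchWAX itself ships no such helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of NutchWAX: sanity-check an import manifest.
# Assumption: every non-blank line is "<url-or-path> <collection>", as in the
# INSTALL.txt example; the real "import" tool may accept other layouts.

check_manifest() {
  # Blank lines are skipped. Each remaining line that does not have exactly
  # two whitespace-separated fields is reported on stderr, and the function
  # returns non-zero if any such line was seen.
  awk '
    NF == 0 { next }
    NF != 2 { printf "%s:%d: expected 2 fields, found %d\n", FILENAME, NR, NF; bad = 1 }
    END     { exit (bad ? 1 : 0) }
  ' "$1" >&2
}
```

With the layout from the HOWTO, the check could gate the import, e.g.
"check_manifest manifest && nutch-1.0-dev/bin/nutchwax import manifest".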
Added: trunk/archive-access/projects/nat/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/README.txt    (rev 0)
+++ trunk/archive-access/projects/nat/archive/README.txt    2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,105 @@
+
+README.txt
+2008-05-06
+Aaron Binns
+
+
+This is the NutchWAX-0.12 source that John Lee handed-off to me. It
+is a work-in-progress.
+
+Compared to NutchWAX-0.10 (and earlier) it is *much* simpler. The
+main WAX-specific code is in just a few files really:
+
+src/java/org/archive/nutchwax/ArcsToSegment.java
+
+  This is the meat of the WAX logic for processing .arc files and
+  generating Nutch segments. Once we use this to generate a set of
+  segments for the .arc files, we can use the rest of vanilla
+  Nutch-1.0-dev to invert links and index the content with Lucene.
+
+  This conversion code is heavily edited from:
+
+    nutch-1.0-dev/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+
+  taken from the Nutch SVN head (a.k.a the "1.0-dev" in-development).
+
+  Ours differs in a few important ways:
+
+  o Rather than taking a directory with .arc files as input, we take
+    a manifest file with URLs to .arc files. This way, the manifest
+    is split up among the distributed Hadoop jobs and the .arc files
+    are processed in whole by each worker.
+
+    In the Nutch-1.0-dev, the ArcSegmentCreator.java expects the
+    input directory to contain the .arc files and (AFAICT) splits
+    them up and distributes them across the Hadoop workers. This
+    seems really inefficient to me, I think our approach is much
+    better -- at least for us.
+
+  o Related to the way input files are split and processed, we use
+    the standard Archive ARCReader class just like Heritrix and
+    Wayback.
+
+    The ArcSegmentCreator.java in Nutch-1.0-dev doesn't use our
+    ARCReader because of licensing imcompatibility. Ours is under
+    GPL and Nutch-1.0-dev forbids the use of GPL code.
+
+    We are in the process of re-licensing or dual-licensing with
+    Apache License, but until then, our ARCReader code won't be incldued
+    in mainline Nutch.
+
+    This isn's a problem per se, but worth noting in case anyone
+    looks at the Nutch-1.0-dev code and wonders why they built their
+    own (horribly inefficient) .arc reader.
+
+  o We add metadata fields to the processed document for WAX-specific
+    purposes:
+
+      content.getMetadata().set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() );
+      content.getMetadata().set( NutchWax.ARCNAME_KEY, meta.getArcFile().getName() ) ;
+      content.getMetadata().set( NutchWax.COLLECTION_KEY, collection);
+      content.getMetadata().set( NutchWax.ARCHIVE_DATE_KEY, meta.getDate() );
+
+    The addition of the arcname and collection key is pretty
+    obvious. I don't know why the content-type isn't added in the
+    vanilla Nutch-1.0-dev.
+
+    Also, we should review the use of the ARCHIVE_DATE_KEY in that
+    John Lee mentioned to me that there was possibly duplicate date
+    fields put in the index: one that is a plain old Java date, and
+    one that is a 14-digit date string for use with Wayback.
+
+src/java/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/NutchWaxIndexingFilter.java
+src/java/plugin/index-nutchwax/plugin.xml
+
+  This filter is pretty straightforward. All it does is take the
+  metadata fields that were added to the document (as described above)
+  and placed in the Lucene index so that we can make use of them at
+  search-time.
+
+src/java/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/MultipleFieldQueryFilter.java
+src/java/plugin/query-nutchwax/plugin.xml
+
+  This is a single query filter that can be used for querying single
+  fields from a single implementation. It does *not* allow for
+  querying multiple fields as you can already do that via Nutch.
+
+  What this filter does is allows one to more-or-less create query
+  filters in a data-driven manner rather than having to code-up a new
+  class for each field. That is, before one would have to create a
+  CollectionQueryFilter class to filter on the "collection" field.
+  With the MultipleFieldQueryFilter class, you can specify that the
+  "collection" field is to be filterable via the plugin.xml file and
+  "nutchwax.filter.query" configuration property.
+
+src/java/org/archive/nutchwax/NutchWax.java
+
+  Just a simple enum used by the above two classes for the metadata
+  keys.
+
+src/java/org/archive/nutchwax/tools/DumpIndex.java
+
+  A simple command-line utility to dump the contents of a Lucene
+  index. Used for debugging.
+
+

Added: trunk/archive-access/projects/nat/archive/bin/nutchwax
===================================================================
--- trunk/archive-access/projects/nat/archive/bin/nutchwax    (rev 0)
+++ trunk/archive-access/projects/nat/archive/bin/nutchwax    2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,70 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed
+# with this work for additional information regarding copyright
+# ownership. The ASF licenses this file to You under the Apache
+# License, Version 2.0 (the "License"); you may not use this file
+# except in compliance with the License. You may obtain a copy of the
+# License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+# implied. See the License for the specific language governing
+# permissions and limitations under the License.
+
+
+# The following is cribbed from the 'nutch' script to ascertain the
+# location of Nutch so we can call its scripts.
+#
+# resolve links - $0 may be a softlink
+THIS="$0"
+while [ -h "$THIS" ]; do
+  ls=`ls -ld "$THIS"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '.*/.*' > /dev/null; then
+    THIS="$link"
+  else
+    THIS=`dirname "$THIS"`/"$link"
+  fi
+done
+
+THIS_DIR=`dirname "$THIS"`
+NUTCH_HOME=`cd "$THIS_DIR/.." ; pwd`
+
+# Now that we have NUTCH_HOME, process the command-line.
+
+case "$1" in
+  import)
+    shift
+    if [ $# -eq 0 ]; then
+      ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.ArcsToSegment
+      exit 1
+    fi
+    if [ -z "$2" ]; then
+      segment=`date +"%Y%m%d%H%M%S"`
+      segment="segments/${segment}"
+    else
+      segment="$2"
+    fi
+    ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.ArcsToSegment "$1" "${segment}"
+    ;;
+  dumpindex)
+    shift
+    ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DumpIndex $@
+    ;;
+  *)
+    echo ""
+    echo "Usage: nutchwax COMMAND"
+    echo "where COMMAND is one of:"
+    echo " import Import ARCs into a new Nutch segment"
+    echo " dumpindex Dump an index to the screen"
+    echo ""
+    exit 1
+    ;;
+esac
+
+exit 0

Property changes on: trunk/archive-access/projects/nat/archive/bin/nutchwax
___________________________________________________________________
Name: svn:executable
   + *

Added: trunk/archive-access/projects/nat/archive/build.xml
===================================================================
--- trunk/archive-access/projects/nat/archive/build.xml    (rev 0)
+++ trunk/archive-access/projects/nat/archive/build.xml    2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,138 @@
+<?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<project name="nutchwax" default="job">
+
+  <property name="nutch.dir" value="../../" />
+
+  <property name="src.dir" value="src" />
+  <property name="lib.dir" value="lib" />
+  <property name="build.dir" value="${nutch.dir}/build" />
+  <!-- HACK: Need to import default.properties like Nutch does -->
+  <property name="dist.dir" value="${build.dir}/nutch-1.0-dev" />
+
+  <target name="nutch-compile-core">
+    <ant dir="${nutch.dir}" target="compile-core" inheritAll="false" />
+  </target>
+
+  <target name="nutch-compile-plugins">
+    <ant dir="${nutch.dir}" target="compile-plugins" inheritAll="false" />
+  </target>
+
+  <target name="compile-core" depends="nutch-compile-core">
+    <javac
+      destdir="${build.dir}/classes"
+      debug="true"
+      verbose="false"
+      source="1.5"
+      target="1.5"
+      encoding="UTF-8"
+      fork="true"
+      nowarn="true"
+      deprecation="false">
+      <src path="${src.dir}/java" />
+      <include name="**/*.java" />
+      <classpath>
+        <pathelement location="${build.dir}/classes" />
+        <fileset dir="${lib.dir}">
+          <include name="*.jar"/>
+        </fileset>
+        <fileset dir="${nutch.dir}/lib">
+          <include name="*.jar"/>
+        </fileset>
+      </classpath>
+    </javac>
+  </target>
+
+  <target name="compile-plugins">
+    <ant dir="src/plugin" target="deploy" inheritAll="false" />
+  </target>
+
+  <!--
+    These targets all call down to the corresponding target in the
+    Nutch build.xml file. This way all of the 'ant' build commands
+    can be executed from this directory and everything should get
+    built as expected.
+ --> + <target name="compile" depends="compile-core, compile-plugins, nutch-compile-plugins"> + </target> + + <target name="jar" depends="compile-core"> + <ant dir="${nutch.dir}" target="jar" inheritAll="false" /> + </target> + + <target name="job" depends="compile"> + <ant dir="${nutch.dir}" target="job" inheritAll="false" /> + </target> + + <target name="war" depends="compile"> + <ant dir="${nutch.dir}" target="war" inheritAll="false" /> + </target> + + <target name="javadoc" depends="compile"> + <ant dir="${nutch.dir}" target="javadoc" inheritAll="false" /> + </target> + + <target name="tar" depends="package"> + <ant dir="${nutch.dir}" target="tar" inheritAll="false" /> + </target> + + <target name="clean"> + <ant dir="${nutch.dir}" target="clean" inheritAll="false" /> + </target> + + <!-- This one does a little more after calling down to the relevant + Nutch target. After Nutch has copied everything into the + distribution directory, we add our script, libraries, etc. + + Rather than over-write the standard Nutch configuration files, + we place ours in a newly created directory + + contrib/archive/conf + + and let the individual user decide whether or not to + incorporate our modifications. 
+ --> + <target name="package" depends="jar, job, war, javadoc"> + <ant dir="${nutch.dir}" target="package" inheritAll="false" /> + + <copy todir="${dist.dir}/lib" includeEmptyDirs="false"> + <fileset dir="lib"/> + </copy> + + <copy todir="${dist.dir}/bin"> + <fileset dir="bin"/> + </copy> + + <chmod perm="ugo+x" type="file"> + <fileset dir="${dist.dir}/bin"/> + </chmod> + + <mkdir dir="${dist.dir}/contrib/archive/conf"/> + <copy todir="${dist.dir}/contrib/archive/conf"> + <fileset dir="conf" /> + </copy> + + <copy todir="${dist.dir}/contrib/archive"> + <fileset dir="."> + <include name="*.txt" /> + </fileset> + </copy> + + </target> + +</project> Added: trunk/archive-access/projects/nat/archive/conf/nutch-site.xml =================================================================== --- trunk/archive-access/projects/nat/archive/conf/nutch-site.xml (rev 0) +++ trunk/archive-access/projects/nat/archive/conf/nutch-site.xml 2008-05-14 00:20:24 UTC (rev 2265) @@ -0,0 +1,65 @@ +<?xml version="1.0"?> +<?xml-stylesheet type="text/xsl" href="configuration.xsl"?> + +<!-- Put site-specific property overrides in this file. --> + +<configuration> + +<property> + <name>plugin.includes</name> + <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. --> + <!-- Also, add 'parse-pdf' --> + <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' --> + <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic</value> +</property> + +<property> + <!-- Configure the 'index-nutchwax' plugin. Specify how the metadata fields added by the ArcsToSegment are mapped to the Lucene documents during indexing. 
+ The specifications here are of the form "src-key:lowercase:store:tokenize:dest-key" + Where the only required part is the "src-key", the rest will assume the following defaults: + lowercase = true + store = true + tokenize = false + dest-key = src-key + --> + <name>nutchwax.filter.index</name> + <value> + arcname:false + collection + date + type + </value> +</property> + +<property> + <!-- Configure the 'query-nutchwax' plugin. Specify which fields to make searchable via "field:[term|phrase]" query syntax, and whether they are "raw" fields or not. + The specification format is "raw:name:lowercase:boost" or "field:name:boost". Default values are + lowercase = true + boost = 1.0f + There is no "lowercase" property for "field" specification because the Nutch FieldQueryFilter doesn't expose the option, unlike the RawFieldQueryFilter. + AFAICT, the order isn't important. --> + <!-- We do *not* use this filter for handling "date" queries, there is a specific filter for that: DateQueryFilter --> + <name>nutchwax.filter.query</name> + <value> + raw:arcname:false + raw:collection + raw:type + field:anchor + field:content + field:host + field:title + </value> +</property> + +<!-- Over-ride setting in Nutch "nutch-default.xml" file. We do *not* want Content-Type detection via magic resolution because the implementation + in Nutch reads in the entire content body (which could be a 1GB MPG movie), then converts it to a String before examining the first dozen or + so bytes/characters for magic matching. Since we archive large files, this is bad, and OOMs occur. So, we disable this feature and keep + the Content-Type that is already in the (W)ARC file. --> +<property> + <name>mime.type.magic</name> + <value>false</value> + <description>Defines if the mime content type detector uses magic resolution.
+ </description> +</property> + +</configuration> Added: trunk/archive-access/projects/nat/archive/conf/search-servers.txt =================================================================== --- trunk/archive-access/projects/nat/archive/conf/search-servers.txt (rev 0) +++ trunk/archive-access/projects/nat/archive/conf/search-servers.txt 2008-05-14 00:20:24 UTC (rev 2265) @@ -0,0 +1 @@ +localhost 9000 Added: trunk/archive-access/projects/nat/archive/conf/tika-mimetypes.xml =================================================================== --- trunk/archive-access/projects/nat/archive/conf/tika-mimetypes.xml (rev 0) +++ trunk/archive-access/projects/nat/archive/conf/tika-mimetypes.xml 2008-05-14 00:20:24 UTC (rev 2265) @@ -0,0 +1,364 @@ +<?xml version="1.0" encoding="UTF-8"?> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. + + Description: This xml file defines the valid mime types used by Tika. + The mime types within this file are based on the types in the mime-types.xml + file available in Apache Nutch. 
+--> + +<mime-info> + + <mime-type type="text/plain"> + <magic priority="50"> + <match value="This is TeX," type="string" offset="0" /> + <match value="This is METAFONT," type="string" offset="0" /> + </magic> + <glob pattern="*.txt" /> + <glob pattern="*.asc" /> + </mime-type> + + <mime-type type="text/html"> + <magic priority="50"> + <match value="&lt;!DOCTYPE HTML" type="string" + offset="0:64" /> + <match value="&lt;!doctype html" type="string" + offset="0:64" /> + <match value="&lt;HEAD" type="string" offset="0:64" /> + <match value="&lt;head" type="string" offset="0:64" /> + <match value="&lt;TITLE" type="string" offset="0:64" /> + <match value="&lt;title" type="string" offset="0:64" /> + <match value="&lt;html" type="string" offset="0:64" /> + <match value="&lt;HTML" type="string" offset="0:64" /> + <match value="&lt;BODY" type="string" offset="0" /> + <match value="&lt;body" type="string" offset="0" /> + <match value="&lt;TITLE" type="string" offset="0" /> + <match value="&lt;title" type="string" offset="0" /> + <match value="&lt;!--" type="string" offset="0" /> + <match value="&lt;h1" type="string" offset="0" /> + <match value="&lt;H1" type="string" offset="0" /> + <match value="&lt;!doctype HTML" type="string" offset="0" /> + <match value="&lt;!DOCTYPE html" type="string" offset="0" /> + </magic> + <glob pattern="*.html" /> + <glob pattern="*.htm" /> + </mime-type> + + <mime-type type="application/xml"> + <alias type="text/xml" /> + <glob pattern="*.xml" /> + </mime-type> + + <mime-type type="application/xhtml+xml"> + <sub-class-of type="text/xml" /> + <glob pattern="*.xhtml" /> + <root-XML namespaceURI='http://www.w3.org/1999/xhtml' + localName='html' /> + </mime-type> + + <mime-type type="application/vnd.ms-powerpoint"> + <glob pattern="*.ppz" /> + <glob pattern="*.ppt" /> + <glob pattern="*.pps" /> + <glob pattern="*.pot" /> + <magic priority="50"> + <match value="0xcfd0e011" type="little32" offset="0" /> + </magic> + </mime-type> + + <mime-type type="application/vnd.ms-excel"> + <magic priority="50">
+ <match value="Microsoft Excel 5.0 Worksheet" type="string" + offset="2080" /> + </magic> + <glob pattern="*.xls" /> + <glob pattern="*.xlc" /> + <glob pattern="*.xll" /> + <glob pattern="*.xlm" /> + <glob pattern="*.xlw" /> + <glob pattern="*.xla" /> + <glob pattern="*.xlt" /> + <glob pattern="*.xld" /> + <alias type="application/msexcel" /> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.text"> + <glob pattern="*.odt" /> + </mime-type> + + + <mime-type type="application/zip"> + <alias type="application/x-zip-compressed" /> + <magic priority="40"> + <match value="PK\003\004" type="string" offset="0" /> + </magic> + <glob pattern="*.zip" /> + </mime-type> + + <mime-type type="application/vnd.oasis.opendocument.text"> + <glob pattern="*.oth" /> + </mime-type> + + <mime-type type="application/msword"> + <magic priority="50"> + <match value="\x31\xbe\x00\x00" type="string" offset="0" /> + <match value="PO^Q`" type="string" offset="0" /> + <match value="\376\067\0\043" type="string" offset="0" /> + <match value="\333\245-\0\0\0" type="string" offset="0" /> + <match value="Microsoft Word 6.0 Document" type="string" + offset="2080" /> + <match value="Microsoft Word document data" type="string" + offset="2112" /> + </magic> + <glob pattern="*.doc" /> + <alias type="application/vnd.ms-word" /> + </mime-type> + + <mime-type type="application/octet-stream"> + <magic priority="50"> + <match value="\037\036" type="string" offset="0" /> + <match value="017437" type="host16" offset="0" /> + <match value="0x1fff" type="host16" offset="0" /> + <match value="\377\037" type="string" offset="0" /> + <match value="0145405" type="host16" offset="0" /> + </magic> + <glob pattern="*.bin" /> + </mime-type> + + <mime-type type="application/pdf"> + <magic priority="50"> + <match value="%PDF-" type="string" offset="0" /> + </magic> + <glob pattern="*.pdf" /> + <alias type="application/x-pdf" /> + </mime-type> + + <mime-type type="application/atom+xml"> + <root-XML 
localName="feed" + namespaceURI="http://purl.org/atom/ns#" /> + </mime-type> + + <mime-type type="application/mac-binhex40"> + <glob pattern="*.hqx" /> + </mime-type> + + <mime-type type="application/mac-compactpro"> + <glob pattern="*.cpt" /> + </mime-type> + + <mime-type type="application/rtf"> + <glob pattern="*.rtf"/> + <alias type="text/rtf" /> + </mime-type> + + <mime-type type="application/rss+xml"> + <alias type="text/rss" /> + <root-XML localName="rss" /> + <root-XML namespaceURI="http://purl.org/rss/1.0/" /> + <glob pattern="*.rss" /> + </mime-type> + + <!-- added in by mattmann --> + <mime-type type="application/x-mif"> + <alias type="application/vnd.mif" /> + </mime-type> + + <mime-type type="application/vnd.wap.wbxml"> + <glob pattern="*.wbxml" /> + </mime-type> + + <mime-type type="application/vnd.wap.wmlc"> + <_comment>Compiled WML Document</_comment> + <glob pattern="*.wmlc" /> + </mime-type> + + <mime-type type="application/vnd.wap.wmlscriptc"> + <_comment>Compiled WML Script</_comment> + <glob pattern="*.wmlsc" /> + </mime-type> + + <mime-type type="text/vnd.wap.wmlscript"> + <_comment>WML Script</_comment> + <glob pattern="*.wmls" /> + </mime-type> + + <mime-type type="application/x-bzip"> + <alias type="application/x-bzip2" /> + </mime-type> + + <mime-type type="application/x-bzip-compressed-tar"> + <glob pattern="*.tbz" /> + <glob pattern="*.tbz2" /> + </mime-type> + + <mime-type type="application/x-cdlink"> + <_comment>Virtual CD-ROM CD Image File</_comment> + <glob pattern="*.vcd" /> + </mime-type> + + <mime-type type="application/x-director"> + <_comment>Shockwave Movie</_comment> + <glob pattern="*.dcr" /> + <glob pattern="*.dir" /> + <glob pattern="*.dxr" /> + </mime-type> + + <mime-type type="application/x-futuresplash"> + <_comment>Macromedia FutureSplash File</_comment> + <glob pattern="*.spl" /> + </mime-type> + + <mime-type type="application/x-java"> + <alias type="application/java" /> + </mime-type> + + <mime-type 
type="application/x-koan"> + <_comment>SSEYO Koan File</_comment> + <glob pattern="*.skp" /> + <glob pattern="*.skd" /> + <glob pattern="*.skt" /> + <glob pattern="*.skm" /> + </mime-type> + + <mime-type type="application/x-latex"> + <_comment>LaTeX Source Document</_comment> + <glob pattern="*.latex" /> + </mime-type> + + <!-- JC CHANGED + <mime-type type="application/x-mif"> + <_comment>FrameMaker MIF document</_comment> + <glob pattern="*.mif"/> + </mime-type> --> + + <mime-type type="application/ogg"> + <alias type="application/x-ogg" /> + </mime-type> + + <mime-type type="application/x-rar"> + <alias type="application/x-rar-compressed" /> + </mime-type> + + <mime-type type="application/x-shellscript"> + <alias type="application/x-sh" /> + </mime-type> + + <mime-type type="application/xhtml+xml"> + <glob pattern="*.xht" /> + </mime-type> + + <mime-type type="audio/midi"> + <glob pattern="*.kar" /> + </mime-type> + + <mime-type type="audio/x-pn-realaudio"> + <alias type="audio/x-realaudio" /> + </mime-type> + + <mime-type type="image/tiff"> + <magic priority="50"> + <match value="0x4d4d2a00" type="string" offset="0" /> + <match value="0x49492a00" type="string" offset="0" /> + </magic> + </mime-type> + + <mime-type type="message/rfc822"> + <magic priority="50"> + <match type="string" value="Relay-Version:" offset="0" /> + <match type="string" value="#! rnews" offset="0" /> + <match type="string" value="N#! 
rnews" offset="0" /> + <match type="string" value="Forward to" offset="0" /> + <match type="string" value="Pipe to" offset="0" /> + <match type="string" value="Return-Path:" offset="0" /> + <match type="string" value="From:" offset="0" /> + <match type="string" value="Message-ID:" offset="0" /> + <match type="string" value="Date:" offset="0" /> + </magic> + </mime-type> + + <mime-type type="application/x-javascript"> + <glob pattern="*.js" /> + </mime-type> + + + <mime-type type="image/vnd.wap.wbmp"> + <_comment>Wireless Bitmap File Format</_comment> + <glob pattern="*.wbmp" /> + </mime-type> + + <mime-type type="image/x-psd"> + <alias type="image/photoshop" /> + </mime-type> + + <mime-type type="image/x-xcf"> + <alias type="image/xcf" /> + <magic priority="50"> + <match type="string" value="gimp xcf " offset="0" /> + </magic> + </mime-type> + + <mime-type type="application/x-shockwave-flash"> + <glob pattern="*.swf"/> + <magic priority="50"> + <match type="string" value="FWS" offset="0"/> + <match type="string" value="CWS" offset="0"/> + </magic> + </mime-type> + + <mime-type type="model/iges"> + <_comment> + Initial Graphics Exchange Specification Format + </_comment> + <glob pattern="*.igs" /> + <glob pattern="*.iges" /> + </mime-type> + + <mime-type type="model/mesh"> + <glob pattern="*.msh" /> + <glob pattern="*.mesh" /> + <glob pattern="*.silo" /> + </mime-type> + + <mime-type type="model/vrml"> + <glob pattern="*.vrml" /> + </mime-type> + + <mime-type type="text/x-tcl"> + <alias type="application/x-tcl" /> + </mime-type> + + <mime-type type="text/x-tex"> + <alias type="application/x-tex" /> + </mime-type> + + <mime-type type="text/x-texinfo"> + <alias type="application/x-texinfo" /> + </mime-type> + + <mime-type type="text/x-troff-me"> + <alias type="application/x-troff-me" /> + </mime-type> + + <mime-type type="video/vnd.mpegurl"> + <glob pattern="*.mxu" /> + </mime-type> + + <mime-type type="x-conference/x-cooltalk"> + <_comment>Cooltalk Audio</_comment> + 
<glob pattern="*.ice" /> + </mime-type> + +</mime-info> Added: trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.jar =================================================================== (Binary files differ) Property changes on: trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.jar ___________________________________________________________________ Name: svn:mime-type + application/octet-stream Added: trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.jar =================================================================== (Binary files differ) Property changes on: trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.jar ___________________________________________________________________ Name: svn:mime-type + application/octet-stream Added: trunk/archive-access/projects/nat/archive/lib/fastutil-5.0.3.jar =================================================================== (Binary files differ) Property changes on: trunk/archive-access/projects/nat/archive/lib/fastutil-5.0.3.jar ___________________________________________________________________ Name: svn:mime-type + application/octet-stream Added: trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java =================================================================== --- trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java (rev 0) +++ trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java 2008-05-14 00:20:24 UTC (rev 2265) @@ -0,0 +1,273 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. 
You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.archive.nutchwax; + +import java.util.Iterator; +import java.util.Map; +import java.util.HashMap; +import java.io.IOException; + +import org.archive.io.ArchiveReader; +import org.archive.io.ArchiveReaderFactory; +import org.archive.io.ArchiveRecord; +import org.archive.io.ArchiveRecordHeader; + +import org.archive.io.arc.ARCConstants; +import org.archive.io.arc.ARCReader; +import org.archive.io.arc.ARCRecord; +import org.archive.io.arc.ARCRecordMetaData; +import org.archive.io.warc.WARCConstants; +import org.archive.io.warc.WARCRecord; + +import org.apache.commons.httpclient.Header; + + +/** + * <p> + * Reader of both ARC and WARC format archive files. This is not a + * general-purpose archive file reader, but is written specifically + * for NutchWAX. It's possible that this could become a + * general-purpose archive file reader, but for now, consider it + * custom-tailored to the needs of NutchWAX. + * </p> + * <p> + * <code>ArcReader</code> is a wrapper around the underlying + * <code>ArchiveReader</code> implementation + * (<code>ARCReader</code>/<code>WARCReader</code>) which converts + * <code>WARCRecord</code>s to <code>ARCRecord</code>s on the fly. + * </p> + * <p> + * If an <code>ARCReader</code> is being wrapped, then the + * underlying <code>ARCRecord</code>s are read and passed-through + * unmolested. + * </p> + * <p> + * If a <code>WARCReader</code> is being wrapped, then the + * <code>WARCRecord</code>s are converted to <code>ARCRecord</code>s + * on the fly. 
+ * </p> + * <p> + * <strong>WARNING:</strong> We only convert WARC + * <code>response</code> records. All other WARC record types are + * returned as <code>null</code> by the iterator's + * <code>next()</code> method. So, when using the iterator, don't + * forget to check for a <code>null</code> value returned by + * <code>next()</code>. + * </p> + */ +public class ArcReader implements Iterable<ARCRecord> +{ + private ArchiveReader reader; + + /** + * Construct an <code>ArcReader</code> wrapping an + * <code>ArchiveReader</code> instance. + * + * @param reader the ArchiveReader instance to wrap + */ + public ArcReader( ArchiveReader reader ) + { + this.reader = reader; + } + + /** + * Returns an iterator over <code>ARCRecord</code>s in the wrapped + * <code>ArchiveReader</code>, converting <code>WARCRecords</code> + * to <code>ARCRecords</code> on-the-fly. + * + * @return an iterator + */ + public Iterator<ARCRecord> iterator( ) + { + return new ArcIterator( ); + } + + /** + * + */ + private class ArcIterator implements Iterator<ARCRecord> + { + private Iterator<ArchiveRecord> i; + + /** + * Construct an <code>ArcIterator</code>, skipping the header + * record if the wrapped reader is an <code>ARCReader</code>. + */ + public ArcIterator( ) + { + this.i = ArcReader.this.reader.iterator( ); + + if ( ArcReader.this.reader instanceof ARCReader ) + { + // Skip the first record, which is a "filedesc://" + // record describing the ARC file. + if ( this.i.hasNext( ) ) this.i.next( ); + } + } + + /** + * Returns <code>true</code> if the iteration has more elements. + * Will return <code>true</code> even if the next call to + * <code>next()</code> returns <code>null</code>. + * + * @return <code>true</code> if the iterator has more elements. + */ + public boolean hasNext( ) + { + return this.i.hasNext( ); + } + + /** + * Returns the next element in the iteration.
Calling this method + * repeatedly until the <code>hasNext()</code> method returns + * <code>false</code> will return each element in the underlying + * collection exactly once. + * + * @return the next element in the iteration, which can be <code>null</code> + */ + public ARCRecord next( ) + { + try + { + ArchiveRecord record = this.i.next( ); + + if ( record instanceof ARCRecord ) + { + // Just return the ARCRecord as-is. + ARCRecord arc = (ARCRecord) record; + + return arc; + } + + if ( record instanceof WARCRecord ) + { + WARCRecord warc = (WARCRecord) record; + + ARCRecord arc = convert( warc ); + + return arc; + } + + // If we get here then the record we read in was neither an ARC + // nor a WARC record. What is a good exception to throw? + throw new RuntimeException( "Record neither ARC nor WARC: " + record.getClass( ) ); + } + catch ( IOException ioe ) + { + throw new RuntimeException( ioe ); + } + } + + /** + * Unsupported optional operation. + * + * @throws UnsupportedOperationException + */ + public void remove( ) + { + throw new UnsupportedOperationException( ); + } + + /** + * Convert a WARCRecord to an ARCRecord. Only "response" + * WARCRecords are converted to meaningful ARCRecords. All other + * WARCRecord types are converted to <code>null</code>. + * + * @param warc the WARCRecord to convert + * @return the corresponding ARCRecord, <code>null</code> if WARCRecord not a "response" record + */ + private ARCRecord convert( WARCRecord warc ) + throws IOException + { + ArchiveRecordHeader header = warc.getHeader( ); + + // We only care about "response" WARC records. + if ( ! WARCConstants.RESPONSE.equals( header.getHeaderValue( WARCConstants.HEADER_KEY_TYPE ) ) ) + { + return null; + } + + // Construct an ARCRecordMetaData object based on the info in + // the ArchiveRecordHeader.
+ Map arcMetadataFields = new HashMap( ); + arcMetadataFields.put( ARCConstants.URL_FIELD_KEY, header.getHeaderValue( WARCConstants.HEADER_KEY_URI ) ); + arcMetadataFields.put( ARCConstants.IP_HEADER_FIELD_KEY, header.getHeaderValue( WARCConstants.HEADER_KEY_IP ) ); + arcMetadataFields.put( ARCConstants.DATE_FIELD_KEY, header.getHeaderValue( WARCConstants.HEADER_KEY_DATE ) ); + arcMetadataFields.put( ARCConstants.MIMETYPE_FIELD_KEY, header.getHeaderValue( null ) ); // We don't know the MIME type of the *payload* in a WARC (yet) + arcMetadataFields.put( ARCConstants.LENGTH_FIELD_KEY, header.getHeaderValue( WARCConstants.CONTENT_LENGTH ) ); + arcMetadataFields.put( ARCConstants.VERSION_FIELD_KEY, header.getHeaderValue( null ) ); // FIXME: Do we need actual values for these? + arcMetadataFields.put( ARCConstants.ABSOLUTE_OFFSET_KEY, header.getHeaderValue( null ) ); // FIXME: Do we need actual values for these? + + ARCRecordMetaData metadata = new ARCRecordMetaData( header.getReaderIdentifier( ), arcMetadataFields ); + + // Then, create an ARCRecord using the WARCRecord and the + // ARCRecordMetaData object we just created. + ARCRecord arc = new ARCRecord( warc, + metadata, + 0, // offset + ArcReader.this.reader.isDigest( ), + ArcReader.this.reader.isStrict( ), + true // parse HTTP headers + ); + + // Now that we've created the ARCRecord, we get the HTTP headers + // from it. From these HTTP headers, we obtain the Content-Type + // of the ARCRecord's payload, then set value as the MIME-type + // of the ARCRecord itself. + + // If the response is something other than HTTP + // (like DNS) there are no HTTP headers. 
+ if ( arc.getHttpHeaders( ) != null ) + { + for ( Header h : arc.getHttpHeaders( ) ) + { + if ( h.getName( ).equals( "Content-Type" ) ) + { + arc.getMetaData( ).getHeaderFields( ).put( ARCConstants.MIMETYPE_FIELD_KEY, h.getValue( ) ); + } + } + } + + return arc; + } + + } + + /** + * Simple test/debug driver to read an archive file and print out + * the header for each record. + */ + public static void main( String args[] ) throws Exception + { + if ( args.length != 1 ) + { + System.out.println( "ReaderTest <(w)arc file>" ); + System.exit( 1 ); + } + + String arcName = args[0]; + + ArchiveReader r = ArchiveReaderFactory.get( arcName ); + + ArcReader reader = new ArcReader( r ); + + for ( ARCRecord rec : reader ) + { + if ( rec != null ) System.out.println( rec.getHeader( ) ); + } + } +} Added: trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcsToSegment.java =================================================================== --- trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcsToSegment.java (rev 0) +++ trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcsToSegment.java 2008-05-14 00:20:24 UTC (rev 2265) @@ -0,0 +1,553 @@ +/** + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+ * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.archive.nutchwax; + +import java.io.IOException; +import java.net.MalformedURLException; +import java.util.Map.Entry; +import java.util.Iterator; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.conf.Configured; +import org.apache.hadoop.fs.Path; +import org.apache.hadoop.io.Text; +import org.apache.hadoop.io.Writable; +import org.apache.hadoop.io.WritableComparable; +import org.apache.hadoop.mapred.JobClient; +import org.apache.hadoop.mapred.JobConf; +import org.apache.hadoop.mapred.Mapper; +import org.apache.hadoop.mapred.OutputCollector; +import org.apache.hadoop.mapred.Reporter; +import org.apache.hadoop.mapred.TextInputFormat; +import org.apache.hadoop.mapred.TextOutputFormat; +import org.apache.hadoop.util.StringUtils; +import org.apache.hadoop.util.Tool; +import org.apache.hadoop.util.ToolRunner; +import org.apache.nutch.crawl.CrawlDatum; +import org.apache.nutch.crawl.NutchWritable; +import org.apache.nutch.crawl.SignatureFactory; +import org.apache.nutch.fetcher.FetcherOutputFormat; +import org.apache.nutch.metadata.Metadata; +import org.apache.nutch.metadata.Nutch; +import org.apache.nutch.net.URLFilters; +import org.apache.nutch.net.URLFilterException; +import org.apache.nutch.net.URLNormalizers; +import org.apache.nutch.parse.Parse; +import org.apache.nutch.parse.ParseImpl; +import org.apache.nutch.parse.ParseResult; +import org.apache.nutch.parse.ParseStatus; +import org.apache.nutch.parse.ParseText; +import org.apache.nutch.parse.ParseUtil; +import org.apache.nutch.protocol.Content; +import org.apache.nutch.protocol.ProtocolStatus; +import org.apache.nutch.scoring.ScoringFilters; +import org.apache.nutch.util.LogUtil; +import org.apache.nutch.util.NutchConfiguration; +import org.apache.nutch.util.NutchJob; +import 
org.apache.nutch.util.StringUtil; + +import org.archive.io.ArchiveReader; +import org.archive.io.ArchiveReaderFactory; +import org.archive.io.arc.ARCRecord; +import org.archive.io.arc.ARCRecordMetaData; + + +/** + * Convert archive files (.arc/.warc) to a Nutch segment. This + * is sometimes called "importing" and other times "converting"; the + * terms are equivalent. + * + * <code>ArcsToSegment</code> is coded as a Hadoop job and is intended + * to be run within the Hadoop framework, or at least started by the + * Hadoop launcher incorporated into Nutch. Although there is a + * <code>main</code> driver, the Nutch launcher script is strongly + * recommended. + * + * This class was initially adapted from the Nutch + * <code>Fetcher</code> class. The premise is that, since the Nutch + * fetching process acquires external content and places it in a Nutch + * segment, we can perform a similar activity by taking content from + * the ARC files and placing that content in a Nutch segment in a + * similar fashion. Ideally, once <code>ArcsToSegment</code> is + * used to import a set of ARCs into a Nutch segment, the resulting + * segment should be more-or-less the same as one created by Nutch's + * own Fetcher. + * + * Since we are mimicking the Nutch Fetcher, we have to be careful + * about some implementation details that might not seem relevant + * to the importing of ARC files. I've noted those details with + * comments prefaced with "?:". + */ +public class ArcsToSegment extends Configured implements Tool, Mapper +{ + + public static final Log LOG = LogFactory.getLog( ArcsToSegment.class ); + + private JobConf jobConf; + private URLFilters urlFilters; + private ScoringFilters scfilters; + private ParseUtil parseUtil; + private URLNormalizers normalizers; + private int interval; + + private long numSkipped; + private long numImported; + private long bytesSkipped; + private long bytesImported; + + /** + * ?: Is this necessary?
+ */ + public ArcsToSegment() + { + + } + + /** + * <p>Constructor that sets the job configuration.</p> + * + * @param conf + */ + public ArcsToSegment( Configuration conf ) + { + setConf( conf ); + } + + /** + * <p>Configures the job. Sets the url filters, scoring filters, url normalizers + * and other relevant data.</p> + * + * @param job The job configuration. + */ + public void configure( JobConf job ) + { + // set the url filters, scoring filters the parse util and the url + // normalizers + this.jobConf = job; + this.urlFilters = new URLFilters ( jobConf ); + this.scfilters = new ScoringFilters( jobConf ); + this.parseUtil = new ParseUtil ( jobConf ); + this.normalizers = new URLNormalizers( jobConf, URLNormalizers.SCOPE_FETCHER ); + this.interval = jobConf.getInt( "db.fetch.interval.default", 2592000 ); + } + + /** + * In Mapper interface. + * @inherit + */ + public void close() + { + + } + + /** + * <p>Runs the Map job to translate an arc file into output for Nutch + * segments.</p> + * + * @param key Line number in manifest corresponding to the <code>value</code> + * @param value A line from the manifest + * @param output The output collecter. + * @param reporter The progress reporter. 
+ */ + public void map( WritableComparable key, + Writable value, + OutputCollector output, + Reporter reporter ) + throws IOException + { + String arcUrl = ""; + String collection = ""; + String segmentName = getConf().get( Nutch.SEGMENT_NAME_KEY ); + + // Each line of the manifest is "<url> <collection>" where <collection> is optional + String[] line = value.toString().split( " " ); + arcUrl = line[0]; + + if ( line.length > 1 ) + { + collection = line[1]; + } + + if ( LOG.isInfoEnabled() ) LOG.info( "Importing ARC: " + arcUrl ); + + ArchiveReader r = ArchiveReaderFactory.get( arcUrl ); + + ArcReader reader = new ArcReader( r ); + + try + { + for ( ARCRecord record : reader ) + { + // When reading WARC files, records of type other than + // "response" are returned as 'null' by the Iterator, so + // we skip them. + if ( record == null ) continue ; + + importRecord( record, segmentName, collection, output ); + + // FIXME: What does this do exactly? + reporter.progress(); + } + } + finally + { + r.close(); + + if ( LOG.isInfoEnabled() ) + { + LOG.info( "Completed ARC: " + arcUrl ); + LOG.info( "URLs skipped : " + this.numSkipped ); + LOG.info( "URLs imported: " + this.numImported ); + LOG.info( "URLs total : " + ( this.numSkipped + this.numImported ) ); + } + } + + } + + /** + * Import an ARCRecord. + * + * @param record + * @param segmentName + * @param collectionName + * @param output + * @return whether record was imported or not (i.e. filtered out due to URL filtering rules, etc.) + */ + private boolean importRecord( ARCRecord record, String segmentName, String collectionName, OutputCollector output ) + { + ARCRecordMetaData meta = record.getMetaData(); + + if ( LOG.isInfoEnabled() ) LOG.info( "Consider URL: " + meta.getUrl() + " (" + meta.getMimetype() + ")" ); + + /* ?: On second thought, DON'T do this. Even if we don't have a + parser registered for a content-type, we still want to index + its URL and possibly other meta-data. 
+ */ + /* + // First, check to see if we have a parser registered for the + // URL's Content-Type, so we don't read in some huge video file + // only to discover we don't have a parser for it. + if ( ! this.hasRegisteredParser( meta.getMimetype() ) ) + { + if ( LOG.isInfoEnabled() ) LOG.info( "No parser registered for: " + meta.getMimetype() ); + + this.numSkipped++; + this.bytesSkipped += meta.getLength(); + + return false ; + } + */ + + // ?: Arguably, we shouldn't be normalizing nor filtering based + // on the URL. If the document made it into the (W)ARC file, then + // it should be indexed. But then again, the normalizers and + // filters can be disabled in the Nutch configuration files. + String url = this.normalizeAndFilterUrl( meta.getUrl() ); + + if ( url == null ) + { + if ( LOG.isInfoEnabled() ) LOG.info( "Skip URL: " + meta.getUrl() ); + + this.numSkipped++; + this.bytesSkipped += meta.getLength(); + + return false; + } + + // URL is good, let's import the content. + if ( LOG.isInfoEnabled() ) LOG.info( "Import URL: " + meta.getUrl() ); + this.numImported++; + this.bytesImported += meta.getLength(); + + try + { + ... [truncated message content] |
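A note on the manifest format used by map() above: each line is "<url> <collection>", with the collection optional, and the committed code splits on a single space. In Java, "a  b".split(" ") yields an empty middle token, so a line with two spaces between fields would end up with an empty collection. The following is only a minimal sketch of a whitespace-tolerant parse; the class and method names (ManifestLineSketch, parse) are hypothetical and are not part of NutchWAX.

```java
public class ManifestLineSketch {

    /**
     * Hypothetical helper: returns { url, collection }.
     * The collection is "" when the line carries only a URL.
     */
    static String[] parse(String line) {
        // Split on runs of whitespace so repeated spaces or tabs
        // do not produce empty tokens.
        String[] fields = line.trim().split("\\s+");
        String url = fields[0];
        String collection = fields.length > 1 ? fields[1] : "";
        return new String[] { url, collection };
    }

    public static void main(String[] args) {
        String[] withCollection = parse("http://example.org/a.arc.gz  mycoll");
        String[] urlOnly = parse("http://example.org/b.warc.gz");
        System.out.println(withCollection[0] + " -> [" + withCollection[1] + "]");
        System.out.println(urlOnly[0] + " -> [" + urlOnly[1] + "]");
    }
}
```

Whether the real manifests ever contain repeated spaces or tabs is an assumption; if they never do, the single-space split in the commit behaves the same way.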
From: <al...@us...> - 2008-05-12 01:00:30
|
Revision: 2264 http://archive-access.svn.sourceforge.net/archive-access/?rev=2264&view=rev Author: alexoz Date: 2008-05-11 17:59:30 -0700 (Sun, 11 May 2008) Log Message: ----------- Added work in progress administrator manual and developer manual stub. Added Paths: ----------- trunk/archive-access/projects/access-control/dist/ trunk/archive-access/projects/access-control/dist/src/ trunk/archive-access/projects/access-control/dist/src/site/ trunk/archive-access/projects/access-control/dist/src/site/xdoc/ trunk/archive-access/projects/access-control/dist/src/site/xdoc/administrator_manual.xml trunk/archive-access/projects/access-control/dist/src/site/xdoc/developer_manual.xml Added: trunk/archive-access/projects/access-control/dist/src/site/xdoc/administrator_manual.xml =================================================================== --- trunk/archive-access/projects/access-control/dist/src/site/xdoc/administrator_manual.xml (rev 0) +++ trunk/archive-access/projects/access-control/dist/src/site/xdoc/administrator_manual.xml 2008-05-12 00:59:30 UTC (rev 2264) @@ -0,0 +1,104 @@ +<document> + <properties> + <title>Stayback Administrator Manual</title> + <author email="aosborne nla gov au">Alex Osborne</author> + </properties> + <body> + + <section name="Requirements"> + <ul> + <li>Java 1.5 or later</li> + <li>A servlet container such + as <a href="http://tomcat.apache.org/">Tomcat</a></li> + <li>A database that + supported by <a href="http://hibernate.org">Hibernate</a> (all the + usual suspects do).</li> + </ul> + </section> + + <section name="Download"> + <p>A first release of the project has not yet been made, however + in the meantime you should be able to get a recent snapshot + from the Internet + Archive's <a href="http://builds.archive.org:8081/">build + server</a>. Select "Show projects", "Access control: Oracle + Webapp", "Working copy" and + download <tt>target/oracle-0.0.1-SNAPSHOT.war</tt>. 
+ Alternatively you can build the project from source, see the + <a href="developer_manual.html">Developer Manual</a> for + instructions.</p> + </section> + + <section name="Installation"> + <subsection name="General"> + <ul> + <li>Create a user and database to store the access control rules.</li> + <li>Deploy the oracle webapp to your application server (eg Apache Tomcat).</li> + <li>Download the appropriate JDBC connector for your database and drop it + in WEB-INF/lib.</li> + <li>Configure the database in the dataSource and sessionFactory beans in + WEB-INF/applicationContext.xml.</li> + </ul> + </subsection> + <subsection name="MySQL"> + + <p>Create a user and database to store the access control rules:</p> + + <pre> + CREATE USER 'stayback'@ 'localhost' IDENTIFIED BY 'password'; + GRANT USAGE ON * . * TO 'stayback'@ 'localhost' IDENTIFIED BY 'password'; + CREATE DATABASE `stayback`; + GRANT ALL PRIVILEGES ON `stayback` . * TO 'stayback'@ 'localhost'; + </pre> + + <p>Deploy the oracle webapp to tomcat</p> + + <p>Download <a href="http://www.mysql.com/products/connector-j">MySQL Connector/J</a> and copy + <tt>mysql-connector-java-*-bin.jar</tt> to <tt>WEB-INF/lib</tt>.</p> + + + <p>Configure the database in <tt>WEB-INF/applicationContext.xml</tt></p>: + + + <pre> + <bean id="dataSource" + class="org.apache.commons.dbcp.BasicDataSource" + destroy-method="close"> + <property name="driverClassName" value="com.mysql.jdbc.Driver" /> + <property name="url" value="jdbc:mysql://localhost/stayback" /> + <property name="username" value="stayback" /> + <property name="password" value="password" /> + </bean> + + + <bean id="sessionFactory" [...] > + [...] 
+ <property name="hibernateProperties"> + <value> + hibernate.dialect=org.hibernate.dialect.MySQLDialect + hibernate.hbm2ddl.auto=create + </value> + </property> + </bean> + </pre> + + <p>The <tt>hibernate.hbm2ddl.auto=create</tt> option will cause Hibernate to + automatically create the tables in the database.</p> + </subsection> + </section> + + <section name="Configuring clients"> + <subsection name="Wayback"> + TODO: Write this. For now see the oracle section in the example wayback.xml. + </subsection> + <subsection name="NutchWAX"> + Stayback client has not yet been integrated into NutchWAX. + </subsection> + <subsection name="Others"> + <p>See the <a href="developer_manual.html">developer + manual</a> for information about integrating Stayback into + other software. + </subsection> + </section> + </body> +</document> Added: trunk/archive-access/projects/access-control/dist/src/site/xdoc/developer_manual.xml =================================================================== --- trunk/archive-access/projects/access-control/dist/src/site/xdoc/developer_manual.xml (rev 0) +++ trunk/archive-access/projects/access-control/dist/src/site/xdoc/developer_manual.xml 2008-05-12 00:59:30 UTC (rev 2264) @@ -0,0 +1,11 @@ +<document> + <properties> + <title>Stayback Developer Manual</title> + <author email="aosborne nla gov au">Alex Osborne</author> + </properties> + <body> + <p>For now see + the <a href="http://webteam.archive.org/confluence/display/wayback/Exclusions+API">notes</a> + on the Wayback wiki.</p> + </body> +</document> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2263 http://archive-access.svn.sourceforge.net/archive-access/?rev=2263&view=rev Author: bradtofel Date: 2008-05-05 14:36:23 -0700 (Mon, 05 May 2008) Log Message: ----------- INITIAL REV: more than enough rope to hang yourself with this class -- allows for dynamic setting of the active ExclusionFilterFactory per request, based on whatever logic is set in the BooleanOperator. Added Paths: ----------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/authenticationcontrol/AccessControlSettingOperation.java Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/authenticationcontrol/AccessControlSettingOperation.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/authenticationcontrol/AccessControlSettingOperation.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/authenticationcontrol/AccessControlSettingOperation.java 2008-05-05 21:36:23 UTC (rev 2263) @@ -0,0 +1,34 @@ +package org.archive.wayback.authenticationcontrol; + +import org.archive.wayback.accesscontrol.ExclusionFilterFactory; +import org.archive.wayback.core.WaybackRequest; +import org.archive.wayback.util.operator.BooleanOperator; + +public class AccessControlSettingOperation implements BooleanOperator<WaybackRequest> { + + private ExclusionFilterFactory factory = null; + private BooleanOperator<WaybackRequest> operator = null; + + public boolean isTrue(WaybackRequest value) { + if(operator.isTrue(value)) { + value.setExclusionFilter(factory.get()); + } + return true; + } + + public ExclusionFilterFactory getFactory() { + return factory; + } + + public void setFactory(ExclusionFilterFactory factory) { + this.factory = factory; + } + + public BooleanOperator<WaybackRequest> getOperator() { + return operator; + } + + public void 
setOperator(BooleanOperator<WaybackRequest> operator) { + this.operator = operator; + } +} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
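To illustrate the behaviour described in the log message for rev 2263, here is a small self-contained sketch. It does not use the Wayback classes: BooleanOperator, Request and SettingOperation below are hypothetical stand-ins written for this example, and a plain String takes the place of the exclusion filter. The point it shows is the one visible in the diff: isTrue() always returns true, and only sets the filter as a side effect when the wrapped operator matches.

```java
import java.util.function.Supplier;

public class AccessControlSketch {

    /** Hypothetical stand-in for Wayback's BooleanOperator interface. */
    interface BooleanOperator<T> {
        boolean isTrue(T value);
    }

    /** Hypothetical stand-in for WaybackRequest, reduced to two fields. */
    static class Request {
        final String remoteAddress;
        String exclusionFilter = "default";

        Request(String remoteAddress) {
            this.remoteAddress = remoteAddress;
        }
    }

    /** Same shape as the committed class: conditional side effect, always true. */
    static class SettingOperation implements BooleanOperator<Request> {
        private final BooleanOperator<Request> operator;
        private final Supplier<String> factory;

        SettingOperation(BooleanOperator<Request> operator, Supplier<String> factory) {
            this.operator = operator;
            this.factory = factory;
        }

        public boolean isTrue(Request value) {
            if (operator.isTrue(value)) {
                value.exclusionFilter = factory.get();
            }
            return true;
        }
    }

    public static void main(String[] args) {
        // Assumed rule for the example: requests from 10.x get a different filter.
        BooleanOperator<Request> internalOnly = r -> r.remoteAddress.startsWith("10.");
        SettingOperation op = new SettingOperation(internalOnly, () -> "staff-filter");

        Request inside = new Request("10.0.0.5");
        Request outside = new Request("192.0.2.7");
        op.isTrue(inside);
        op.isTrue(outside);

        System.out.println(inside.exclusionFilter);   // staff-filter
        System.out.println(outside.exclusionFilter);  // default
    }
}
```

Because the operation never returns false, it can sit in a chain of operators without vetoing the request; the address-prefix rule here is only an example of "whatever logic is set in the BooleanOperator".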
From: <bra...@us...> - 2008-05-05 21:34:09
|
Revision: 2262 http://archive-access.svn.sourceforge.net/archive-access/?rev=2262&view=rev Author: bradtofel Date: 2008-05-05 14:34:10 -0700 (Mon, 05 May 2008) Log Message: ----------- INITIAL REV: command line tool for accessing main() in FileLocationDB Added Paths: ----------- trunk/archive-access/projects/wayback/dist/src/scripts/location-db Added: trunk/archive-access/projects/wayback/dist/src/scripts/location-db =================================================================== --- trunk/archive-access/projects/wayback/dist/src/scripts/location-db (rev 0) +++ trunk/archive-access/projects/wayback/dist/src/scripts/location-db 2008-05-05 21:34:10 UTC (rev 2262) @@ -0,0 +1,82 @@ +#!/usr/bin/env sh +## +## This script allows querying and updating of a remote LocationDB from the +## command line, including syncronizing the LocationDB with an entire directory +## of ARCs files +## +## Optional environment variables +## +## JAVA_HOME Point at a JDK install to use. +## +## WAYBACK_HOME Pointer to your wayback install. If not present, we +## make an educated guess based of position relative to this +## script. +## +## JAVA_OPTS Java runtime options. Default setting is '-Xmx256m'. +## + +# Resolve links - $0 may be a softlink +PRG="$0" +while [ -h "$PRG" ]; do + ls=`ls -ld "$PRG"` + link=`expr "$ls" : '.*-> \(.*\)$'` + if expr "$link" : '.*/.*' > /dev/null; then + PRG="$link" + else + PRG=`dirname "$PRG"`/"$link" + fi +done +PRGDIR=`dirname "$PRG"` + +# Set WAYBACK_HOME. +if [ -z "$WAYBACK_HOME" ] +then + WAYBACK_HOME=`cd "$PRGDIR/.." ; pwd` +fi + +# Find JAVA_HOME. +if [ -z "$JAVA_HOME" ] +then + JAVA=`which java` + if [ -z "$JAVA" ] + then + echo "Cannot find JAVA. Please set JAVA_HOME or your PATH." + exit 1 + fi + JAVA_BINDIR=`dirname $JAVA` + JAVA_HOME=$JAVA_BINDIR/.. +fi + +if [ -z "$JAVACMD" ] +then + # It may be defined in env - including flags!! + JAVACMD=$JAVA_HOME/bin/java +fi + +# Ignore previous classpath. 
Build one that contains heritrix jar and content +# of the lib directory into the variable CP. +for jar in `ls $WAYBACK_HOME/lib/*.jar $WAYBACK_HOME/*.jar 2> /dev/null` +do + CP=${CP}:${jar} +done + +# cygwin path translation +if expr `uname` : 'CYGWIN*' > /dev/null; then + CP=`cygpath -p -w "$CP"` + WAYBACK_HOME=`cygpath -p -w "$WAYBACK_HOME"` +fi + +# Make sure of java opts. +if [ -z "$JAVA_OPTS" ] +then + JAVA_OPTS=" -Xmx256m" +fi + +# Main ArcIndexer class. +if [ -z "$CLASS_MAIN" ] +then + CLASS_MAIN='org.archive.wayback.resourcestore.http.FileLocationDB' +fi + +CLASSPATH=${CP} $JAVACMD ${JAVA_OPTS} $CLASS_MAIN "$@" + This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2261 http://archive-access.svn.sourceforge.net/archive-access/?rev=2261&view=rev Author: bradtofel Date: 2008-05-05 14:33:02 -0700 (Mon, 05 May 2008) Log Message: ----------- FEATURE: command-line code to populate an offline FileLocationDB. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/http/FileLocationDB.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/http/FileLocationDB.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/http/FileLocationDB.java 2008-04-19 00:37:34 UTC (rev 2260) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/http/FileLocationDB.java 2008-05-05 21:33:02 UTC (rev 2261) @@ -24,7 +24,9 @@ */ package org.archive.wayback.resourcestore.http; +import java.io.BufferedReader; import java.io.IOException; +import java.io.InputStreamReader; import org.archive.wayback.bdb.BDBRecordSet; import org.archive.wayback.exception.ConfigurationException; @@ -238,4 +240,66 @@ public void setBdbName(String bdbName) { this.bdbName = bdbName; } + private static void USAGE(String message) { + System.err.print("USAGE: " + message + "\n" + + "\tDBDIR DBNAME LOGPATH\n" + + "\n" + + "\t\tread lines from STDIN formatted like:\n" + + "\t\t\tNAME<SPACE>URL\n" + + "\t\tand for each line, add to locationDB that file NAME is\n" + + "\t\tlocated at URL. 
Use locationDB in DBDIR at DBNAME, \n" + + "\t\tcreating if it does not exist.\n" + ); + System.exit(2); + } + + /** + * @param args + */ + public static void main(String[] args) { + if(args.length != 3) { + USAGE(""); + System.exit(1); + } + String bdbPath = args[0]; + String bdbName = args[1]; + String logPath = args[2]; + FileLocationDB db = new FileLocationDB(); + db.setBdbPath(bdbPath); + db.setBdbName(bdbName); + db.setLogPath(logPath); + BufferedReader r = new BufferedReader( + new InputStreamReader(System.in)); + String line; + int exitCode = 0; + try { + db.init(); + while((line = r.readLine()) != null) { + String parts[] = line.split(" "); + if(parts.length != 2) { + System.err.println("Bad input(" + line + ")"); + System.exit(2); + } + db.addArcUrl(parts[0],parts[1]); + System.out.println("Added\t" + parts[0] + "\t" + parts[1]); + } + } catch (IOException e) { + e.printStackTrace(); + exitCode = 1; + } catch (DatabaseException e) { + e.printStackTrace(); + exitCode = 1; + } catch (ConfigurationException e) { + e.printStackTrace(); + exitCode = 1; + } finally { + try { + db.shutdownDB(); + } catch (DatabaseException e) { + e.printStackTrace(); + exitCode = 1; + } + } + System.exit(exitCode); + } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
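One observation on the main() added in rev 2261, offered as a sketch rather than a patch: System.exit() does not run pending finally blocks, so the System.exit(2) on a malformed line would skip the db.shutdownDB() call in the finally clause. A possible alternative is to have the read loop return an exit code and call System.exit() only after cleanup. The example below is self-contained and does not touch FileLocationDB; LocationInputSketch and load() are hypothetical names, and a Map stands in for the location database.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocationInputSketch {

    /**
     * Hypothetical helper: stores each "NAME URL" line in target.
     * Returns 0 on success, or 2 at the first malformed line.
     */
    static int load(List<String> lines, Map<String, String> target) {
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 2) {
                System.err.println("Bad input(" + line + ")");
                return 2; // report the failure instead of exiting inside the loop
            }
            target.put(parts[0], parts[1]);
        }
        return 0;
    }

    public static void main(String[] args) {
        Map<String, String> db = new LinkedHashMap<>();
        int exitCode = 1;
        try {
            exitCode = load(Arrays.asList(
                    "a.arc.gz http://example.org/a.arc.gz",
                    "broken-line"), db);
        } finally {
            // Stand-in for db.shutdownDB(); this runs even for bad input.
            System.out.println("cleanup ran, entries stored: " + db.size());
        }
        // A real tool would call System.exit(exitCode) here, after cleanup.
        System.out.println("exit code would be " + exitCode);
    }
}
```

The sample input lines and URLs are made up for the example.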
From: <bra...@us...> - 2008-04-19 00:37:36
|
Revision: 2260 http://archive-access.svn.sourceforge.net/archive-access/?rev=2260&view=rev Author: bradtofel Date: 2008-04-18 17:37:34 -0700 (Fri, 18 Apr 2008) Log Message: ----------- DOCS: wrong version specified Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml 2008-04-19 00:36:41 UTC (rev 2259) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml 2008-04-19 00:37:34 UTC (rev 2260) @@ -11,7 +11,7 @@ <section name="Releases"> <p> Full listing of changes and bug fixes are not currently available prior - to release 1.2.1. + to release 1.2.0. </p> </section> <section name="Release 1.2.1"> |
From: <bra...@us...> - 2008-04-19 00:36:48
|
Revision: 2259 http://archive-access.svn.sourceforge.net/archive-access/?rev=2259&view=rev Author: bradtofel Date: 2008-04-18 17:36:41 -0700 (Fri, 18 Apr 2008) Log Message: ----------- POST-RELEASE: 1.3.0-SNAPSHOT Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/pom.xml trunk/archive-access/projects/wayback/pom.xml trunk/archive-access/projects/wayback/wayback-core/pom.xml trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml trunk/archive-access/projects/wayback/wayback-webapp/pom.xml Modified: trunk/archive-access/projects/wayback/dist/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/pom.xml 2008-04-19 00:32:29 UTC (rev 2258) +++ trunk/archive-access/projects/wayback/dist/pom.xml 2008-04-19 00:36:41 UTC (rev 2259) @@ -3,7 +3,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.2.1</version> + <version>1.3.0-SNAPSHOT</version> </parent> <modelVersion>4.0.0</modelVersion> @@ -54,13 +54,13 @@ <dependency> <groupId>org.archive.wayback</groupId> <artifactId>wayback-webapp</artifactId> - <version>1.2.1</version> + <version>1.3.0-SNAPSHOT</version> <type>war</type> </dependency> <dependency> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce</artifactId> - <version>1.2.1</version> + <version>1.3.0-SNAPSHOT</version> </dependency> </dependencies> Modified: trunk/archive-access/projects/wayback/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/pom.xml 2008-04-19 00:32:29 UTC (rev 2258) +++ trunk/archive-access/projects/wayback/pom.xml 2008-04-19 00:36:41 UTC (rev 2259) @@ -16,7 +16,7 @@ <modelVersion>4.0.0</modelVersion> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.2.1</version> + <version>1.3.0-SNAPSHOT</version> <packaging>pom</packaging> 
<name>Wayback</name> Modified: trunk/archive-access/projects/wayback/wayback-core/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/pom.xml 2008-04-19 00:32:29 UTC (rev 2258) +++ trunk/archive-access/projects/wayback/wayback-core/pom.xml 2008-04-19 00:36:41 UTC (rev 2259) @@ -17,7 +17,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.2.1</version> + <version>1.3.0-SNAPSHOT</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-core</artifactId> Modified: trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml 2008-04-19 00:32:29 UTC (rev 2258) +++ trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml 2008-04-19 00:36:41 UTC (rev 2259) @@ -12,7 +12,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.2.1</version> + <version>1.3.0-SNAPSHOT</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce</artifactId> Modified: trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml 2008-04-19 00:32:29 UTC (rev 2258) +++ trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml 2008-04-19 00:36:41 UTC (rev 2259) @@ -10,7 +10,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.2.1</version> + <version>1.3.0-SNAPSHOT</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce-prereq</artifactId> Modified: trunk/archive-access/projects/wayback/wayback-webapp/pom.xml =================================================================== --- 
trunk/archive-access/projects/wayback/wayback-webapp/pom.xml 2008-04-19 00:32:29 UTC (rev 2258) +++ trunk/archive-access/projects/wayback/wayback-webapp/pom.xml 2008-04-19 00:36:41 UTC (rev 2259) @@ -3,7 +3,7 @@ <parent> <artifactId>wayback</artifactId> <groupId>org.archive</groupId> - <version>1.2.1</version> + <version>1.3.0-SNAPSHOT</version> </parent> <modelVersion>4.0.0</modelVersion> <groupId>org.archive.wayback</groupId> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2008-04-19 00:32:52
|
Revision: 2258 http://archive-access.svn.sourceforge.net/archive-access/?rev=2258&view=rev Author: bradtofel Date: 2008-04-18 17:32:29 -0700 (Fri, 18 Apr 2008) Log Message: ----------- RELEASE 1.2.1 Added Paths: ----------- branches/wayback-1_2_1/wayback/ Copied: branches/wayback-1_2_1/wayback (from rev 2257, trunk/archive-access/projects/wayback) |
From: <bra...@us...> - 2008-04-19 00:32:22
|
Revision: 2257 http://archive-access.svn.sourceforge.net/archive-access/?rev=2257&view=rev Author: bradtofel Date: 2008-04-18 17:31:58 -0700 (Fri, 18 Apr 2008) Log Message: ----------- Removing bad branch... Removed Paths: ------------- branches/wayback-1_2_1/wayback/ |
From: <bra...@us...> - 2008-04-17 23:02:02
|
Revision: 2256 http://archive-access.svn.sourceforge.net/archive-access/?rev=2256&view=rev Author: bradtofel Date: 2008-04-17 16:02:06 -0700 (Thu, 17 Apr 2008) Log Message: ----------- DOC: added a bit of info indicating that adding ARCs/WARCs to 'dataDir' will get them added to Wayback iff autoindexing is enabled. Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-04-17 20:54:16 UTC (rev 2255) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-04-17 23:02:06 UTC (rev 2256) @@ -138,7 +138,11 @@ implementation also includes the capability to run a background thread to automatically notice new ARC/WARC files appearing, index those files, and hand off the index data for merging with - a BDBResourceIndex. + a BDBResourceIndex. When using automatic indexing, any files added to + the 'dataDir' will automatically be indexed and queued for merging + with the ResourceIndex. Please see documentation for the + BDBResourceIndex for information on configuring automatic merging of + indexed data with a BDBResourceIndex. </p> <p> The XML configuration template for a LocalResourceStore follows: |