archive-access-cvs Mailing List for Web Archive Access Utilities (Page 62)

Brought to you by: binzino, bradtofel, gojomo, ia_igor, and 5 others

archive-access-cvs — CVS commits

[Archive-access-cvs] SF.net SVN: archive-access: [2280] trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore

From: <bra...@us...> - 2008-06-05 20:35:00

Revision: 2280
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2280&view=rev
Author:   bradtofel
Date:     2008-06-05 13:34:57 -0700 (Thu, 05 Jun 2008)

Log Message:
-----------
FEATURE: added method to return iterator from a pathOrUrl (String)

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/ArcIndexer.java
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/WarcIndexer.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/ArcIndexer.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/ArcIndexer.java	2008-06-04 00:08:01 UTC (rev 2279)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/ArcIndexer.java	2008-06-05 20:34:57 UTC (rev 2280)
@@ -59,7 +59,7 @@
 	public ArcIndexer() {
 		canonicalizer = new AggressiveUrlCanonicalizer();
 	}
-	
+
 	/**
 	 * @param arc
 	 * @return Iterator of SearchResults for input arc File
@@ -67,7 +67,26 @@
 	 */
 	public CloseableIterator<SearchResult> iterator(File arc)
 	throws IOException {
-		ARCReader arcReader = ARCReaderFactory.get(arc);
+		return iterator(ARCReaderFactory.get(arc));
+	}
+
+	/**
+	 * @param pathOrUrl
+	 * @return Iterator of SearchResults for input pathOrUrl
+	 * @throws IOException
+	 */
+	public CloseableIterator<SearchResult> iterator(String pathOrUrl)
+	throws IOException {
+		return iterator(ARCReaderFactory.get(pathOrUrl));
+	}
+	
+	/**
+	 * @param arcReader
+	 * @return Iterator of SearchResults for input ARCReader
+	 * @throws IOException
+	 */
+	public CloseableIterator<SearchResult> iterator(ARCReader arcReader)
+	throws IOException {
 		arcReader.setParseHttpHeaders(true);
 
 		Adapter<ArchiveRecord,ARCRecord> adapter1 =

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/WarcIndexer.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/WarcIndexer.java	2008-06-04 00:08:01 UTC (rev 2279)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/WarcIndexer.java	2008-06-05 20:34:57 UTC (rev 2280)
@@ -31,21 +31,37 @@
 	}
 	
 	/**
-	 * @param arc
+	 * @param warc
 	 * @return Iterator of SearchResults for input arc File
 	 * @throws IOException
 	 */
 	public CloseableIterator<SearchResult> iterator(File warc)
 			throws IOException {
+		return iterator(WARCReaderFactory.get(warc));
+	}
+	/**
+	 * @param pathOrUrl
+	 * @return Iterator of SearchResults for input pathOrUrl
+	 * @throws IOException
+	 */
+	public CloseableIterator<SearchResult> iterator(String pathOrUrl)
+			throws IOException {
+		return iterator(WARCReaderFactory.get(pathOrUrl));
+	}
+	/**
+	 * @param arc
+	 * @return Iterator of SearchResults for input arc File
+	 * @throws IOException
+	 */
+	public CloseableIterator<SearchResult> iterator(WARCReader reader)
+			throws IOException {
 
 		Adapter<ArchiveRecord, WARCRecord> adapter1 = new ArchiveRecordToWARCRecordAdapter();
 
 		WARCRecordToSearchResultAdapter adapter2 = 
 			new WARCRecordToSearchResultAdapter();
 		adapter2.setCanonicalizer(canonicalizer);
-		
-		WARCReader reader = WARCReaderFactory.get(warc);
-		
+
 		ArchiveReaderCloseableIterator itr1 = 
 			new ArchiveReaderCloseableIterator(reader,reader.iterator());
 


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

[Archive-access-cvs] SF.net SVN: archive-access: [2279] trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java

From: <bra...@us...> - 2008-06-04 00:07:57

Revision: 2279
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2279&view=rev
Author:   bradtofel
Date:     2008-06-03 17:08:01 -0700 (Tue, 03 Jun 2008)

Log Message:
-----------
BUGFIX(unreported): was appending file.toString() not itr.next() in store()

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java	2008-06-04 00:05:34 UTC (rev 2278)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java	2008-06-04 00:08:01 UTC (rev 2279)
@@ -183,7 +183,7 @@
 	public void store(Iterator<String> itr) throws IOException {
 		PrintWriter pw = new PrintWriter(file);
 		while(itr.hasNext()) {
-			pw.println(file);
+			pw.println(itr.next());
 		}
 		pw.close();
 	}
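
Editor's note: in the removed line the loop body never called itr.next(), so besides writing the file's path instead of the data, the loop would apparently never advance a non-empty iterator. Below is a minimal, self-contained sketch of the corrected loop. The class name, the temp-file round trip and the roundTrip() helper are illustrative assumptions, not part of FlatFile; try-with-resources is used here in place of the explicit pw.close() in the commit.

```java
import java.io.File;
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class FlatFileStoreSketch {
    // Writes one line per element; itr.next() both yields the value
    // and advances the iterator so the loop can terminate.
    static void store(File file, Iterator<String> itr) throws IOException {
        try (PrintWriter pw = new PrintWriter(file)) {
            while (itr.hasNext()) {
                pw.println(itr.next());
            }
        }
    }

    // Hypothetical helper: store the lines in a temp file and read them back.
    static List<String> roundTrip(List<String> lines) {
        try {
            File tmp = File.createTempFile("flatfile-sketch", ".txt");
            tmp.deleteOnExit();
            store(tmp, lines.iterator());
            return Files.readAllLines(tmp.toPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(Arrays.asList("alpha", "beta")));
    }
}
```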



[Archive-access-cvs] SF.net SVN: archive-access: [2278] trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlReplayDispatcher.java

From: <bra...@us...> - 2008-06-04 00:05:26

Revision: 2278
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2278&view=rev
Author:   bradtofel
Date:     2008-06-03 17:05:34 -0700 (Tue, 03 Jun 2008)

Log Message:
-----------
FEATURE: added handling of ASX format XML documents.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlReplayDispatcher.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlReplayDispatcher.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlReplayDispatcher.java	2008-06-04 00:04:33 UTC (rev 2277)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlReplayDispatcher.java	2008-06-04 00:05:34 UTC (rev 2278)
@@ -50,6 +50,9 @@
 	private final static String TEXT_HTML_MIME = "text/html";
 	private final static String TEXT_XHTML_MIME = "application/xhtml";
 	private final static String TEXT_CSS_MIME = "text/css";
+	private final static String ASX_MIME = "video/x-ms-asf";
+	private final static String ASX_EXTENSION = ".asx";
+	
 
 	// TODO: make this configurable
 	private final static long MAX_HTML_MARKUP_LENGTH = 1024 * 1024 * 5;
@@ -62,6 +65,8 @@
 		new ArchivalUrlReplayRenderer();
 	private ArchivalUrlCSSReplayRenderer archivalCSS =
 		new ArchivalUrlCSSReplayRenderer();
+	private ArchivalUrlASXReplayRenderer archivalASX =
+		new ArchivalUrlASXReplayRenderer();
 
 	/* (non-Javadoc)
 	 * @see org.archive.wayback.replay.ReplayRendererDispatcher#getRenderer(org.archive.wayback.core.WaybackRequest, org.archive.wayback.core.SearchResult, org.archive.wayback.core.Resource)
@@ -82,20 +87,30 @@
 		// only bother attempting  markup on pages smaller than some size:
 		if (resource.getRecordLength() < MAX_HTML_MARKUP_LENGTH) {
 
+			String resultMime = result.get(WaybackConstants.RESULT_MIME_TYPE);
 			// HTML and XHTML docs get marked up as HTML
-			if (-1 != result.get(WaybackConstants.RESULT_MIME_TYPE).indexOf(
-					TEXT_HTML_MIME)) {
+			if (-1 != resultMime.indexOf(TEXT_HTML_MIME)) {
 				return archivalHTML;
 			}
-			if (-1 != result.get(WaybackConstants.RESULT_MIME_TYPE).indexOf(
-					TEXT_XHTML_MIME)) {
+			if (-1 != resultMime.indexOf(TEXT_XHTML_MIME)) {
 				return archivalHTML;
 			}
 			// CSS docs get marked up as CSS
-			if (-1 != result.get(WaybackConstants.RESULT_MIME_TYPE).indexOf(
-					TEXT_CSS_MIME)) {
+			if (-1 != resultMime.indexOf(TEXT_CSS_MIME)) {
 				return archivalCSS;
 			}
+			if (-1 != resultMime.indexOf(ASX_MIME)) {
+				return archivalASX;
+			}
+			String resultPath = result.get(WaybackConstants.RESULT_URL_KEY);
+			resultPath = resultPath.substring(resultPath.indexOf('/'));
+			int queryIdx = resultPath.indexOf('?');
+			if(queryIdx > 0) {
+				resultPath = resultPath.substring(0,queryIdx-1);
+			}
+			if(resultPath.endsWith(ASX_EXTENSION)) {
+				return archivalASX;
+			}
 		}
 		
 		// everything else goes transparently:
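
Editor's note: the extension fallback added in rev 2278 trims the query string with substring(0, queryIdx-1). Because the end index of String.substring is exclusive, that also drops the character just before the '?', so a path such as "/clip.asx?id=1" would be cut to "/clip.as" and miss the ".asx" test. The sketch below reproduces that arithmetic next to a hedged alternative. It assumes the URL key looks like "host/path?query"; the class and method names are hypothetical and nothing here is taken from later wayback revisions.

```java
class AsxPathCheckSketch {
    static final String ASX_EXTENSION = ".asx";

    // The arithmetic as committed in rev 2278, reproduced for comparison.
    static boolean committedCheck(String urlKey) {
        String resultPath = urlKey.substring(urlKey.indexOf('/'));
        int queryIdx = resultPath.indexOf('?');
        if (queryIdx > 0) {
            resultPath = resultPath.substring(0, queryIdx - 1);
        }
        return resultPath.endsWith(ASX_EXTENSION);
    }

    // Hypothetical alternative: keep everything before '?', and treat a key
    // without any '/' as having no path instead of letting substring throw.
    static boolean hasAsxExtension(String urlKey) {
        int slashIdx = urlKey.indexOf('/');
        if (slashIdx < 0) {
            return false;
        }
        String path = urlKey.substring(slashIdx);
        int queryIdx = path.indexOf('?');
        if (queryIdx >= 0) {
            path = path.substring(0, queryIdx); // end index is exclusive
        }
        return path.endsWith(ASX_EXTENSION);
    }

    public static void main(String[] args) {
        String key = "example.com/clip.asx?id=1";
        System.out.println("committed:   " + committedCheck(key));
        System.out.println("alternative: " + hasAsxExtension(key));
    }
}
```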



[Archive-access-cvs] SF.net SVN: archive-access: [2277] trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlASXReplayRenderer.java

From: <bra...@us...> - 2008-06-04 00:04:32

Revision: 2277
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2277&view=rev
Author:   bradtofel
Date:     2008-06-03 17:04:33 -0700 (Tue, 03 Jun 2008)

Log Message:
-----------
INITIAL REV: ReplayRenderer responsible for rewriting ASX format XML documents as they are replayed.

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlASXReplayRenderer.java

Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlASXReplayRenderer.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlASXReplayRenderer.java	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlASXReplayRenderer.java	2008-06-04 00:04:33 UTC (rev 2277)
@@ -0,0 +1,50 @@
+package org.archive.wayback.archivalurl;
+
+import java.io.IOException;
+import java.util.Map;
+
+import javax.servlet.ServletException;
+import javax.servlet.http.HttpServletRequest;
+import javax.servlet.http.HttpServletResponse;
+
+import org.archive.wayback.ResultURIConverter;
+import org.archive.wayback.core.Resource;
+import org.archive.wayback.core.SearchResult;
+import org.archive.wayback.core.SearchResults;
+import org.archive.wayback.core.WaybackRequest;
+import org.archive.wayback.exception.BadContentException;
+import org.archive.wayback.replay.HTMLPage;
+import org.archive.wayback.replay.HttpHeaderOperation;
+
+public class ArchivalUrlASXReplayRenderer extends ArchivalUrlReplayRenderer {
+	/* (non-Javadoc)
+	 * @see org.archive.wayback.ReplayRenderer#renderResource(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse, org.archive.wayback.core.WaybackRequest, org.archive.wayback.core.SearchResult, org.archive.wayback.core.Resource, org.archive.wayback.ResultURIConverter, org.archive.wayback.core.SearchResults)
+	 */
+	public void renderResource(HttpServletRequest httpRequest,
+			HttpServletResponse httpResponse, WaybackRequest wbRequest,
+			SearchResult result, Resource resource,
+			ResultURIConverter uriConverter, SearchResults results)
+			throws ServletException, IOException, BadContentException {
+
+		
+		HttpHeaderOperation.copyHTTPMessageHeader(resource, httpResponse);
+
+		Map<String,String> headers = HttpHeaderOperation.processHeaders(
+				resource, result, uriConverter, this);
+	
+		// Load content into an HTML page, and resolve embedded HREF urls:
+		HTMLPage page = new HTMLPage(resource,result,uriConverter);
+		page.readFully();
+
+		page.resolveASXRefUrls();
+
+		// set the corrected length:
+		int bytes = page.getBytes().length;
+		headers.put(HTTP_LENGTH_HEADER, String.valueOf(bytes));
+
+		// send back the headers:
+		HttpHeaderOperation.sendHeaders(headers, httpResponse);
+
+		page.writeToOutputStream(httpResponse.getOutputStream());
+	}
+}



[Archive-access-cvs] SF.net SVN: archive-access: [2276] trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java

From: <bra...@us...> - 2008-06-04 00:03:34

Revision: 2276
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2276&view=rev
Author:   bradtofel
Date:     2008-06-03 17:03:40 -0700 (Tue, 03 Jun 2008)

Log Message:
-----------
BUGFIX: moved extract HTTP request call to beginning of fixup.
FEATURE: added keySet() to get Set of request filters.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java	2008-06-04 00:02:04 UTC (rev 2275)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java	2008-06-04 00:03:40 UTC (rev 2276)
@@ -27,6 +27,7 @@
 import java.util.Iterator;
 import java.util.Locale;
 import java.util.ResourceBundle;
+import java.util.Set;
 import java.io.UnsupportedEncodingException;
 import java.net.URLEncoder;
 
@@ -259,6 +260,7 @@
 	 * @param httpRequest
 	 */
 	public void fixup(HttpServletRequest httpRequest) {
+		extractHttpRequestInfo(httpRequest);
 		String startDate = get(WaybackConstants.REQUEST_START_DATE);
 		String endDate = get(WaybackConstants.REQUEST_END_DATE);
 		String exactDate = get(WaybackConstants.REQUEST_EXACT_DATE);
@@ -287,7 +289,6 @@
 			put(WaybackConstants.REQUEST_EXACT_DATE, Timestamp
 					.padEndDateStr(exactDate));
 		}
-		extractHttpRequestInfo(httpRequest);
 	}
 
 	/**
@@ -408,4 +409,8 @@
 	public void setExclusionFilter(ObjectFilter<SearchResult> exclusionFilter) {
 		this.exclusionFilter = exclusionFilter;
 	}
+
+	public Set<String> keySet() {
+		return filters.keySet();
+	}
 }
\ No newline at end of file



[Archive-access-cvs] SF.net SVN: archive-access: [2275] trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/HTMLPage.java

From: <bra...@us...> - 2008-06-04 00:02:04

Revision: 2275
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2275&view=rev
Author:   bradtofel
Date:     2008-06-03 17:02:04 -0700 (Tue, 03 Jun 2008)

Log Message:
-----------
FEATURE: added ASX markup method, which rewrites ASX XML documents, converting mms:// to http:// as it rewrites URLs. This might even be the "right thing" to do for mms://...

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/HTMLPage.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/HTMLPage.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/HTMLPage.java	2008-06-02 22:01:49 UTC (rev 2274)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/HTMLPage.java	2008-06-04 00:02:04 UTC (rev 2275)
@@ -309,6 +309,18 @@
 		TagMagix.markupCSSImports(sb,uriConverter, captureDate, pageUrl);
 	}
 
+	public void resolveASXRefUrls() {
+
+		// TODO: get url from Resource instead of SearchResult?
+		String pageUrl = result.getAbsoluteUrl();
+		String captureDate = result.getCaptureDate();
+		ResultURIConverter ruc = new MMSToHTTPResultURIConverter(uriConverter);
+		
+		TagMagix.markupTagREURIC(sb, ruc, captureDate, pageUrl,
+				"REF", "HREF");
+	}
+	
+	
 	/**
 	 * @param charSet
 	 * @throws IOException 
@@ -475,4 +487,20 @@
 			return base.makeReplayURI(datespec, url);
 		}
 	}
+
+	private class MMSToHTTPResultURIConverter implements ResultURIConverter {
+		private static final String MMS_PROTOCOL_PREFIX = "mms://";
+		private static final String HTTP_PROTOCOL_PREFIX = "http://";
+		private ResultURIConverter base = null;
+		public MMSToHTTPResultURIConverter(ResultURIConverter base) {
+			this.base = base;
+		}
+		public String makeReplayURI(String datespec, String url) {
+			if(url.startsWith(MMS_PROTOCOL_PREFIX)) {
+				url = HTTP_PROTOCOL_PREFIX + 
+					url.substring(MMS_PROTOCOL_PREFIX.length());
+			}
+			return base.makeReplayURI(datespec, url);
+		}
+	}	
 }
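
Editor's note: MMSToHTTPResultURIConverter wraps an existing converter and only changes the URL scheme before delegating. A minimal standalone sketch of that wrapping follows. The UriConverter interface mirrors the single makeReplayURI(datespec, url) method visible in the diff, while the base converter, the wayback.example host and the replay() helper are hypothetical and chosen only for illustration.

```java
class MmsRewriteSketch {
    // Mirrors the one method of ResultURIConverter shown in the diff.
    interface UriConverter {
        String makeReplayURI(String datespec, String url);
    }

    // Decorator: rewrite a leading mms:// to http://, then delegate to the base.
    static final class MmsToHttp implements UriConverter {
        private static final String MMS_PREFIX = "mms://";
        private static final String HTTP_PREFIX = "http://";
        private final UriConverter base;

        MmsToHttp(UriConverter base) {
            this.base = base;
        }

        public String makeReplayURI(String datespec, String url) {
            if (url.startsWith(MMS_PREFIX)) {
                url = HTTP_PREFIX + url.substring(MMS_PREFIX.length());
            }
            return base.makeReplayURI(datespec, url);
        }
    }

    // Hypothetical base converter producing archival-style replay URLs.
    static String replay(String datespec, String url) {
        UriConverter base = (d, u) -> "http://wayback.example/" + d + "/" + u;
        return new MmsToHttp(base).makeReplayURI(datespec, url);
    }

    public static void main(String[] args) {
        System.out.println(replay("20080604000204", "mms://media.example.org/clip.wmv"));
    }
}
```

URLs that do not start with mms:// pass through to the base converter unchanged, so the wrapper can be applied without first checking the scheme.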



[Archive-access-cvs] SF.net SVN: archive-access: [2274] tags/nutchwax-0_12-beta1/

From: <bi...@us...> - 2008-06-02 22:01:44

Revision: 2274
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2274&view=rev
Author:   binzino
Date:     2008-06-02 15:01:49 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
NutchWAX 0.12 Beta-1 release tag.

Added Paths:
-----------
    tags/nutchwax-0_12-beta1/

Copied: tags/nutchwax-0_12-beta1 (from rev 2273, trunk/archive-access/projects/nutchwax)



[Archive-access-cvs] SF.net SVN: archive-access: [2273] trunk/archive-access/projects/nutchwax/archive/INSTALL.txt

From: <bi...@us...> - 2008-06-02 19:00:38

Revision: 2273
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2273&view=rev
Author:   binzino
Date:     2008-06-02 11:58:52 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
Updated with current location of NutchWAX source in SVN.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-06-02 18:53:46 UTC (rev 2272)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-06-02 18:58:52 UTC (rev 2273)
@@ -1,6 +1,6 @@
 
 INSTALL.txt
-2008-05-20
+2008-06-02
 Aaron Binns
 
 This installation guide assumes the reader is already familiar with
@@ -60,7 +60,7 @@
 Nutch's "contrib" directory.
 
  $ cd contrib
- $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nat/archive
+ $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nutchwax/archive
 
 This will create a sub-directory named "archive" containing the
 NutchWAX sources.



[Archive-access-cvs] SF.net SVN: archive-access: [2272] trunk/archive-access/projects

From: <bi...@us...> - 2008-06-02 18:55:10

Revision: 2272
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2272&view=rev
Author:   binzino
Date:     2008-06-02 11:53:46 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
Move NutchWAX 0.12 from 'projects/nat' to 'projects/nutchwax'.

Added Paths:
-----------
    trunk/archive-access/projects/nutchwax/

Removed Paths:
-------------
    trunk/archive-access/projects/nat/

Copied: trunk/archive-access/projects/nutchwax (from rev 2271, trunk/archive-access/projects/nat)



[Archive-access-cvs] SF.net SVN: archive-access: [2271] trunk/archive-access/projects/nutchwax/

From: <bi...@us...> - 2008-06-02 18:36:45

Revision: 2271
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2271&view=rev
Author:   binzino
Date:     2008-06-02 11:36:19 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
Move pre-0.12 NutchWAX code to branches/nutchwax-pre-0_12.

Removed Paths:
-------------
    trunk/archive-access/projects/nutchwax/



[Archive-access-cvs] SF.net SVN: archive-access: [2270] trunk/archive-access/projects/nutchwax/nutchwax-thirdparty/

From: <bi...@us...> - 2008-06-02 18:33:31

Revision: 2270
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2270&view=rev
Author:   binzino
Date:     2008-06-02 11:32:32 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
Removed svn:externals property since NutchWAX 0.12 doesn't require it.

Property Changed:
----------------
    trunk/archive-access/projects/nutchwax/nutchwax-thirdparty/


Property changes on: trunk/archive-access/projects/nutchwax/nutchwax-thirdparty
___________________________________________________________________
Name: svn:externals
   - nutch http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.9




[Archive-access-cvs] SF.net SVN: archive-access: [2269] branches/nutchwax-pre-0_12/

From: <bi...@us...> - 2008-06-02 18:11:48

Revision: 2269
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2269&view=rev
Author:   binzino
Date:     2008-06-02 11:11:05 -0700 (Mon, 02 Jun 2008)

Log Message:
-----------
Moving all pre-0.12 code from 'projects/nutchwax' to here for posterity.

Added Paths:
-----------
    branches/nutchwax-pre-0_12/

Copied: branches/nutchwax-pre-0_12 (from rev 2268, trunk/archive-access/projects/nutchwax)



[Archive-access-cvs] SF.net SVN: archive-access: [2268] trunk/archive-access/projects/nat/archive

From: <bi...@us...> - 2008-05-27 18:58:15

Revision: 2268
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2268&view=rev
Author:   binzino
Date:     2008-05-27 11:58:18 -0700 (Tue, 27 May 2008)

Log Message:
-----------
Updated license information: header comments, .LICENSE files,
LICENSE.txt, etc.

Modified Paths:
--------------
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/NutchWax.java
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/tools/DumpIndex.java
    trunk/archive-access/projects/nat/archive/src/plugin/build.xml
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/plugin.xml
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/plugin.xml
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/DateQueryFilter.java

Added Paths:
-----------
    trunk/archive-access/projects/nat/archive/LICENSE.txt
    trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE
    trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE
    trunk/archive-access/projects/nat/archive/lib/fastutil-5.0.3.LICENSE

Added: trunk/archive-access/projects/nat/archive/LICENSE.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/LICENSE.txt	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/LICENSE.txt	2008-05-27 18:58:18 UTC (rev 2268)
@@ -0,0 +1,519 @@
+
+NutchWAX is free software.  Except as noted, it is licensed under the
+terms of the GNU Lesser Public License (LGPL), reproduced below.
+
+Source code derived from Nutch retains the Apache License, as
+stipulated by that license.
+
+Libraries used by NutchWAX are redistributed under their respective
+licenses, which can be found in a file with the same name as the
+library, suffixed by ".LICENSE".  For example, the license for
+"foo.jar" can be found in "foo.LICENSE".
+
+All other files not carrying an explicit license are licensed under
+the GNU Lesser General Public License version 2.1 (included below)
+
+======================================================================
+
+		  GNU LESSER GENERAL PUBLIC LICENSE
+		       Version 2.1, February 1999
+
+ Copyright (C) 1991, 1999 Free Software Foundation, Inc.
+     59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+[This is the first released version of the Lesser GPL.  It also counts
+ as the successor of the GNU Library Public License, version 2, hence
+ the version number 2.1.]
+
+			    Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+Licenses are intended to guarantee your freedom to share and change
+free software--to make sure the software is free for all its users.
+
+  This license, the Lesser General Public License, applies to some
+specially designated software packages--typically libraries--of the
+Free Software Foundation and other authors who decide to use it.  You
+can use it too, but we suggest you first think carefully about whether
+this license or the ordinary General Public License is the better
+strategy to use in any particular case, based on the explanations below.
+
+  When we speak of free software, we are referring to freedom of use,
+not price.  Our General Public Licenses are designed to make sure that
+you have the freedom to distribute copies of free software (and charge
+for this service if you wish); that you receive source code or can get
+it if you want it; that you can change the software and use pieces of
+it in new free programs; and that you are informed that you can do
+these things.
+
+  To protect your rights, we need to make restrictions that forbid
+distributors to deny you these rights or to ask you to surrender these
+rights.  These restrictions translate to certain responsibilities for
+you if you distribute copies of the library or if you modify it.
+
+  For example, if you distribute copies of the library, whether gratis
+or for a fee, you must give the recipients all the rights that we gave
+you.  You must make sure that they, too, receive or can get the source
+code.  If you link other code with the library, you must provide
+complete object files to the recipients, so that they can relink them
+with the library after making changes to the library and recompiling
+it.  And you must show them these terms so they know their rights.
+
+  We protect your rights with a two-step method: (1) we copyright the
+library, and (2) we offer you this license, which gives you legal
+permission to copy, distribute and/or modify the library.
+
+  To protect each distributor, we want to make it very clear that
+there is no warranty for the free library.  Also, if the library is
+modified by someone else and passed on, the recipients should know
+that what they have is not the original version, so that the original
+author's reputation will not be affected by problems that might be
+introduced by others.
+
+  Finally, software patents pose a constant threat to the existence of
+any free program.  We wish to make sure that a company cannot
+effectively restrict the users of a free program by obtaining a
+restrictive license from a patent holder.  Therefore, we insist that
+any patent license obtained for a version of the library must be
+consistent with the full freedom of use specified in this license.
+
+  Most GNU software, including some libraries, is covered by the
+ordinary GNU General Public License.  This license, the GNU Lesser
+General Public License, applies to certain designated libraries, and
+is quite different from the ordinary General Public License.  We use
+this license for certain libraries in order to permit linking those
+libraries into non-free programs.
+
+  When a program is linked with a library, whether statically or using
+a shared library, the combination of the two is legally speaking a
+combined work, a derivative of the original library.  The ordinary
+General Public License therefore permits such linking only if the
+entire combination fits its criteria of freedom.  The Lesser General
+Public License permits more lax criteria for linking other code with
+the library.
+
+  We call this license the "Lesser" General Public License because it
+does Less to protect the user's freedom than the ordinary General
+Public License.  It also provides other free software developers Less
+of an advantage over competing non-free programs.  These disadvantages
+are the reason we use the ordinary General Public License for many
+libraries.  However, the Lesser license provides advantages in certain
+special circumstances.
+
+  For example, on rare occasions, there may be a special need to
+encourage the widest possible use of a certain library, so that it becomes
+a de-facto standard.  To achieve this, non-free programs must be
+allowed to use the library.  A more frequent case is that a free
+library does the same job as widely used non-free libraries.  In this
+case, there is little to gain by limiting the free library to free
+software only, so we use the Lesser General Public License.
+
+  In other cases, permission to use a particular library in non-free
+programs enables a greater number of people to use a large body of
+free software.  For example, permission to use the GNU C Library in
+non-free programs enables many more people to use the whole GNU
+operating system, as well as its variant, the GNU/Linux operating
+system.
+
+  Although the Lesser General Public License is Less protective of the
+users' freedom, it does ensure that the user of a program that is
+linked with the Library has the freedom and the wherewithal to run
+that program using a modified version of the Library.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.  Pay close attention to the difference between a
+"work based on the library" and a "work that uses the library".  The
+former contains code derived from the library, whereas the latter must
+be combined with the library in order to run.
+
+		  GNU LESSER GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License Agreement applies to any software library or other
+program which contains a notice placed by the copyright holder or
+other authorized party saying it may be distributed under the terms of
+this Lesser General Public License (also called "this License").
+Each licensee is addressed as "you".
+
+  A "library" means a collection of software functions and/or data
+prepared so as to be conveniently linked with application programs
+(which use some of those functions and data) to form executables.
+
+  The "Library", below, refers to any such software library or work
+which has been distributed under these terms.  A "work based on the
+Library" means either the Library or any derivative work under
+copyright law: that is to say, a work containing the Library or a
+portion of it, either verbatim or with modifications and/or translated
+straightforwardly into another language.  (Hereinafter, translation is
+included without limitation in the term "modification".)
+
+  "Source code" for a work means the preferred form of the work for
+making modifications to it.  For a library, complete source code means
+all the source code for all modules it contains, plus any associated
+interface definition files, plus the scripts used to control compilation
+and installation of the library.
+
+  Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running a program using the Library is not restricted, and output from
+such a program is covered only if its contents constitute a work based
+on the Library (independent of the use of the Library in a tool for
+writing it).  Whether that is true depends on what the Library does
+and what the program that uses the Library does.
+  
+  1. You may copy and distribute verbatim copies of the Library's
+complete source code as you receive it, in any medium, provided that
+you conspicuously and appropriately publish on each copy an
+appropriate copyright notice and disclaimer of warranty; keep intact
+all the notices that refer to this License and to the absence of any
+warranty; and distribute a copy of this License along with the
+Library.
+
+  You may charge a fee for the physical act of transferring a copy,
+and you may at your option offer warranty protection in exchange for a
+fee.
+
+  2. You may modify your copy or copies of the Library or any portion
+of it, thus forming a work based on the Library, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) The modified work must itself be a software library.
+
+    b) You must cause the files modified to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    c) You must cause the whole of the work to be licensed at no
+    charge to all third parties under the terms of this License.
+
+    d) If a facility in the modified Library refers to a function or a
+    table of data to be supplied by an application program that uses
+    the facility, other than as an argument passed when the facility
+    is invoked, then you must make a good faith effort to ensure that,
+    in the event an application does not supply such function or
+    table, the facility still operates, and performs whatever part of
+    its purpose remains meaningful.
+
+    (For example, a function in a library to compute square roots has
+    a purpose that is entirely well-defined independent of the
+    application.  Therefore, Subsection 2d requires that any
+    application-supplied function or table used by this function must
+    be optional: if the application does not supply it, the square
+    root function must still compute square roots.)
+
+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Library,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Library, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote
+it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Library.
+
+In addition, mere aggregation of another work not based on the Library
+with the Library (or with a work based on the Library) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may opt to apply the terms of the ordinary GNU General Public
+License instead of this License to a given copy of the Library.  To do
+this, you must alter all the notices that refer to this License, so
+that they refer to the ordinary GNU General Public License, version 2,
+instead of to this License.  (If a newer version than version 2 of the
+ordinary GNU General Public License has appeared, then you can specify
+that version instead if you wish.)  Do not make any other change in
+these notices.
+
+  Once this change is made in a given copy, it is irreversible for
+that copy, so the ordinary GNU General Public License applies to all
+subsequent copies and derivative works made from that copy.
+
+  This option is useful when you wish to copy part of the code of
+the Library into a program that is not a library.
+
+  4. You may copy and distribute the Library (or a portion or
+derivative of it, under Section 2) in object code or executable form
+under the terms of Sections 1 and 2 above provided that you accompany
+it with the complete corresponding machine-readable source code, which
+must be distributed under the terms of Sections 1 and 2 above on a
+medium customarily used for software interchange.
+
+  If distribution of object code is made by offering access to copy
+from a designated place, then offering equivalent access to copy the
+source code from the same place satisfies the requirement to
+distribute the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  5. A program that contains no derivative of any portion of the
+Library, but is designed to work with the Library by being compiled or
+linked with it, is called a "work that uses the Library".  Such a
+work, in isolation, is not a derivative work of the Library, and
+therefore falls outside the scope of this License.
+
+  However, linking a "work that uses the Library" with the Library
+creates an executable that is a derivative of the Library (because it
+contains portions of the Library), rather than a "work that uses the
+library".  The executable is therefore covered by this License.
+Section 6 states terms for distribution of such executables.
+
+  When a "work that uses the Library" uses material from a header file
+that is part of the Library, the object code for the work may be a
+derivative work of the Library even though the source code is not.
+Whether this is true is especially significant if the work can be
+linked without the Library, or if the work is itself a library.  The
+threshold for this to be true is not precisely defined by law.
+
+  If such an object file uses only numerical parameters, data
+structure layouts and accessors, and small macros and small inline
+functions (ten lines or less in length), then the use of the object
+file is unrestricted, regardless of whether it is legally a derivative
+work.  (Executables containing this object code plus portions of the
+Library will still fall under Section 6.)
+
+  Otherwise, if the work is a derivative of the Library, you may
+distribute the object code for the work under the terms of Section 6.
+Any executables containing that work also fall under Section 6,
+whether or not they are linked directly with the Library itself.
+
+  6. As an exception to the Sections above, you may also combine or
+link a "work that uses the Library" with the Library to produce a
+work containing portions of the Library, and distribute that work
+under terms of your choice, provided that the terms permit
+modification of the work for the customer's own use and reverse
+engineering for debugging such modifications.
+
+  You must give prominent notice with each copy of the work that the
+Library is used in it and that the Library and its use are covered by
+this License.  You must supply a copy of this License.  If the work
+during execution displays copyright notices, you must include the
+copyright notice for the Library among them, as well as a reference
+directing the user to the copy of this License.  Also, you must do one
+of these things:
+
+    a) Accompany the work with the complete corresponding
+    machine-readable source code for the Library including whatever
+    changes were used in the work (which must be distributed under
+    Sections 1 and 2 above); and, if the work is an executable linked
+    with the Library, with the complete machine-readable "work that
+    uses the Library", as object code and/or source code, so that the
+    user can modify the Library and then relink to produce a modified
+    executable containing the modified Library.  (It is understood
+    that the user who changes the contents of definitions files in the
+    Library will not necessarily be able to recompile the application
+    to use the modified definitions.)
+
+    b) Use a suitable shared library mechanism for linking with the
+    Library.  A suitable mechanism is one that (1) uses at run time a
+    copy of the library already present on the user's computer system,
+    rather than copying library functions into the executable, and (2)
+    will operate properly with a modified version of the library, if
+    the user installs one, as long as the modified version is
+    interface-compatible with the version that the work was made with.
+
+    c) Accompany the work with a written offer, valid for at
+    least three years, to give the same user the materials
+    specified in Subsection 6a, above, for a charge no more
+    than the cost of performing this distribution.
+
+    d) If distribution of the work is made by offering access to copy
+    from a designated place, offer equivalent access to copy the above
+    specified materials from the same place.
+
+    e) Verify that the user has already received a copy of these
+    materials or that you have already sent this user a copy.
+
+  For an executable, the required form of the "work that uses the
+Library" must include any data and utility programs needed for
+reproducing the executable from it.  However, as a special exception,
+the materials to be distributed need not include anything that is
+normally distributed (in either source or binary form) with the major
+components (compiler, kernel, and so on) of the operating system on
+which the executable runs, unless that component itself accompanies
+the executable.
+
+  It may happen that this requirement contradicts the license
+restrictions of other proprietary libraries that do not normally
+accompany the operating system.  Such a contradiction means you cannot
+use both them and the Library together in an executable that you
+distribute.
+
+  7. You may place library facilities that are a work based on the
+Library side-by-side in a single library together with other library
+facilities not covered by this License, and distribute such a combined
+library, provided that the separate distribution of the work based on
+the Library and of the other library facilities is otherwise
+permitted, and provided that you do these two things:
+
+    a) Accompany the combined library with a copy of the same work
+    based on the Library, uncombined with any other library
+    facilities.  This must be distributed under the terms of the
+    Sections above.
+
+    b) Give prominent notice with the combined library of the fact
+    that part of it is a work based on the Library, and explaining
+    where to find the accompanying uncombined form of the same work.
+
+  8. You may not copy, modify, sublicense, link with, or distribute
+the Library except as expressly provided under this License.  Any
+attempt otherwise to copy, modify, sublicense, link with, or
+distribute the Library is void, and will automatically terminate your
+rights under this License.  However, parties who have received copies,
+or rights, from you under this License will not have their licenses
+terminated so long as such parties remain in full compliance.
+
+  9. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Library or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Library (or any work based on the
+Library), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Library or works based on it.
+
+  10. Each time you redistribute the Library (or any work based on the
+Library), the recipient automatically receives a license from the
+original licensor to copy, distribute, link with or modify the Library
+subject to these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties with
+this License.
+
+  11. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Library at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Library by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Library.
+
+If any portion of this section is held invalid or unenforceable under any
+particular circumstance, the balance of the section is intended to apply,
+and the section as a whole is intended to apply in other circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  12. If the distribution and/or use of the Library is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Library under this License may add
+an explicit geographical distribution limitation excluding those countries,
+so that distribution is permitted only in or among countries not thus
+excluded.  In such case, this License incorporates the limitation as if
+written in the body of this License.
+
+  13. The Free Software Foundation may publish revised and/or new
+versions of the Lesser General Public License from time to time.
+Such new versions will be similar in spirit to the present version,
+but may differ in detail to address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Library
+specifies a version number of this License which applies to it and
+"any later version", you have the option of following the terms and
+conditions either of that version or of any later version published by
+the Free Software Foundation.  If the Library does not specify a
+license version number, you may choose any version ever published by
+the Free Software Foundation.
+
+  14. If you wish to incorporate parts of the Library into other free
+programs whose distribution conditions are incompatible with these,
+write to the author to ask for permission.  For software which is
+copyrighted by the Free Software Foundation, write to the Free
+Software Foundation; we sometimes make exceptions for this.  Our
+decision will be guided by the two goals of preserving the free status
+of all derivatives of our free software and of promoting the sharing
+and reuse of software generally.
+
+			    NO WARRANTY
+
+  15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO
+WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
+EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR
+OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY
+KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
+LIBRARY IS WITH YOU.  SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME
+THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+  16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN
+WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
+AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU
+FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR
+CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
+LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
+RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
+FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
+SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
+DAMAGES.
+
+		     END OF TERMS AND CONDITIONS
+
+           How to Apply These Terms to Your New Libraries
+
+  If you develop a new library, and you want it to be of the greatest
+possible use to the public, we recommend making it free software that
+everyone can redistribute and change.  You can do so by permitting
+redistribution under these terms (or, alternatively, under the terms of the
+ordinary General Public License).
+
+  To apply these terms, attach the following notices to the library.  It is
+safest to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least the
+"copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the library's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This library is free software; you can redistribute it and/or
+    modify it under the terms of the GNU Lesser General Public
+    License as published by the Free Software Foundation; either
+    version 2.1 of the License, or (at your option) any later version.
+
+    This library is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+    Lesser General Public License for more details.
+
+    You should have received a copy of the GNU Lesser General Public
+    License along with this library; if not, write to the Free Software
+    Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+
+Also add information on how to contact you by electronic and paper mail.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the library, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the
+  library `Frob' (a library for tweaking knobs) written by James Random Hacker.
+
+  <signature of Ty Coon>, 1 April 1990
+  Ty Coon, President of Vice
+
+That's all there is to it!

Added: trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE
===================================================================
--- trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.LICENSE	2008-05-27 18:58:18 UTC (rev 2268)
@@ -0,0 +1,504 @@
+		  GNU LESSER GENERAL PUBLIC LICENSE
+		       Version 2.1, February 1999
+
+ Copyright (C) 1991, 1999 Free Software Foundation, Inc.
+ 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+[This is the first released version of the Lesser GPL.  It also counts
+ as the successor of the GNU Library Public License, version 2, hence
+ the version number 2.1.]
+
+			    Preamble
+
+  The licenses for most software are designed to take away your
+freedom to share and change it.  By contrast, the GNU General Public
+Licenses are intended to guarantee your freedom to share and change
+free software--to make sure the software is free for all its users.
+
+  This license, the Lesser General Public License, applies to some
+specially designated software packages--typically libraries--of the
+Free Software Foundation and other authors who decide to use it.  You
+can use it too, but we suggest you first think carefully about whether
+this license or the ordinary General Public License is the better
+strategy to use in any particular case, based on the explanations below.
+
+  When we speak of free software, we are referring to freedom of use,
+not price.  Our General Public Licenses are designed to make sure that
+you have the freedom to distribute copies of free software (and charge
+for this service if you wish); that you receive source code or can get
+it if you want it; that you can change the software and use pieces of
+it in new free programs; and that you are informed that you can do
+these things.
+
+  To protect your rights, we need to make restrictions that forbid
+distributors to deny you these rights or to ask you to surrender these
+rights.  These restrictions translate to certain responsibilities for
+you if you distribute copies of the library or if you modify it.
+
+  For example, if you distribute copies of the library, whether gratis
+or for a fee, you must give the recipients all the rights that we gave
+you.  You must make sure that they, too, receive or can get the source
+code.  If you link other code with the library, you must provide
+complete object files to the recipients, so that they can relink them
+with the library after making changes to the library and recompiling
+it.  And you must show them these terms so they know their rights.
+
+  We protect your rights with a two-step method: (1) we copyright the
+library, and (2) we offer you this license, which gives you legal
+permission to copy, distribute and/or modify the library.
+
+  To protect each distributor, we want to make it very clear that
+there is no warranty for the free library.  Also, if the library is
+modified by someone else and passed on, the recipients should know
+that what they have is not the original version, so that the original
+author's reputation will not be affected by problems that might be
+introduced by others.
+
+  Finally, software patents pose a constant threat to the existence of
+any free program.  We wish to make sure that a company cannot
+effectively restrict the users of a free program by obtaining a
+restrictive license from a patent holder.  Therefore, we insist that
+any patent license obtained for a version of the library must be
+consistent with the full freedom of use specified in this license.
+
+  Most GNU software, including some libraries, is covered by the
+ordinary GNU General Public License.  This license, the GNU Lesser
+General Public License, applies to certain designated libraries, and
+is quite different from the ordinary General Public License.  We use
+this license for certain libraries in order to permit linking those
+libraries into non-free programs.
+
+  When a program is linked with a library, whether statically or using
+a shared library, the combination of the two is legally speaking a
+combined work, a derivative of the original library.  The ordinary
+General Public License therefore permits such linking only if the
+entire combination fits its criteria of freedom.  The Lesser General
+Public License permits more lax criteria for linking other code with
+the library.
+
+  We call this license the "Lesser" General Public License because it
+does Less to protect the user's freedom than the ordinary General
+Public License.  It also provides other free software developers Less
+of an advantage over competing non-free programs.  These disadvantages
+are the reason we use the ordinary General Public License for many
+libraries.  However, the Lesser license provides advantages in certain
+special circumstances.
+
+  For example, on rare occasions, there may be a special need to
+encourage the widest possible use of a certain library, so that it becomes
+a de-facto standard.  To achieve this, non-free programs must be
+allowed to use the library.  A more frequent case is that a free
+library does the same job as widely used non-free libraries.  In this
+case, there is little to gain by limiting the free library to free
+software only, so we use the Lesser General Public License.
+
+  In other cases, permission to use a particular library in non-free
+programs enables a greater number of people to use a large body of
+free software.  For example, permission to use the GNU C Library in
+non-free programs enables many more people to use the whole GNU
+operating system, as well as its variant, the GNU/Linux operating
+system.
+
+  Although the Lesser General Public License is Less protective of the
+users' freedom, it does ensure that the user of a program that is
+linked with the Library has the freedom and the wherewithal to run
+that program using a modified version of the Library.
+
+  The precise terms and conditions for copying, distribution and
+modification follow.  Pay close attention to the difference between a
+"work based on the library" and a "work that uses the library".  The
+former contains code derived from the library, whereas the latter must
+be combined with the library in order to run.
+
+		  GNU LESSER GENERAL PUBLIC LICENSE
+   TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+  0. This License Agreement applies to any software library or other
+program which contains a notice placed by the copyright holder or
+other authorized party saying it may be distributed under the terms of
+this Lesser General Public License (also called "this License").
+Each licensee is addressed as "you".
+
+  A "library" means a collection of software functions and/or data
+prepared so as to be conveniently linked with application programs
+(which use some of those functions and data) to form executables.
+
+  The "Library", below, refers to any such software library or work
+which has been distributed under these terms.  A "work based on the
+Library" means either the Library or any derivative work under
+copyright law: that is to say, a work containing the Library or a
+portion of it, either verbatim or with modifications and/or translated
+straightforwardly into another language.  (Hereinafter, translation is
+included without limitation in the term "modification".)
+
+  "Source code" for a work means the preferred form of the work for
+making modifications to it.  For a library, complete source code means
+all the source code for all modules it contains, plus any associated
+interface definition files, plus the scripts used to control compilation
+and installation of the library.
+
+  Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope.  The act of
+running a program using the Library is not restricted, and output from
+such a program is covered only if its contents constitute a work based
+on the Library (independent of the use of the Library in a tool for
+writing it).  Whether that is true depends on what the Library does
+and what the program that uses the Library does.
+  
+  1. You may copy and distribute verbatim copies of the Library's
+complete source code as you receive it, in any medium, provided that
+you conspicuously and appropriately publish on each copy an
+appropriate copyright notice and disclaimer of warranty; keep intact
+all the notices that refer to this License and to the absence of any
+warranty; and distribute a copy of this License along with the
+Library.
+
+  You may charge a fee for the physical act of transferring a copy,
+and you may at your option offer warranty protection in exchange for a
+fee.
+
+  2. You may modify your copy or copies of the Library or any portion
+of it, thus forming a work based on the Library, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+    a) The modified work must itself be a software library.
+
+    b) You must cause the files modified to carry prominent notices
+    stating that you changed the files and the date of any change.
+
+    c) You must cause the whole of the work to be licensed at no
+    charge to all third parties under the terms of this License.
+
+    d) If a facility in the modified Library refers to a function or a
+    table of data to be supplied by an application program that uses
+    the facility, other than as an argument passed when the facility
+    is invoked, then you must make a good faith effort to ensure that,
+    in the event an application does not supply such function or
+    table, the facility still operates, and performs whatever part of
+    its purpose remains meaningful.
+
+    (For example, a function in a library to compute square roots has
+    a purpose that is entirely well-defined independent of the
+    application.  Therefore, Subsection 2d requires that any
+    application-supplied function or table used by this function must
+    be optional: if the application does not supply it, the square
+    root function must still compute square roots.)
+
+These requirements apply to the modified work as a whole.  If
+identifiable sections of that work are not derived from the Library,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works.  But when you
+distribute the same sections as part of a whole which is a work based
+on the Library, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote
+it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Library.
+
+In addition, mere aggregation of another work not based on the Library
+with the Library (or with a work based on the Library) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+  3. You may opt to apply the terms of the ordinary GNU General Public
+License instead of this License to a given copy of the Library.  To do
+this, you must alter all the notices that refer to this License, so
+that they refer to the ordinary GNU General Public License, version 2,
+instead of to this License.  (If a newer version than version 2 of the
+ordinary GNU General Public License has appeared, then you can specify
+that version instead if you wish.)  Do not make any other change in
+these notices.
+
+  Once this change is made in a given copy, it is irreversible for
+that copy, so the ordinary GNU General Public License applies to all
+subsequent copies and derivative works made from that copy.
+
+  This option is useful when you wish to copy part of the code of
+the Library into a program that is not a library.
+
+  4. You may copy and distribute the Library (or a portion or
+derivative of it, under Section 2) in object code or executable form
+under the terms of Sections 1 and 2 above provided that you accompany
+it with the complete corresponding machine-readable source code, which
+must be distributed under the terms of Sections 1 and 2 above on a
+medium customarily used for software interchange.
+
+  If distribution of object code is made by offering access to copy
+from a designated place, then offering equivalent access to copy the
+source code from the same place satisfies the requirement to
+distribute the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+  5. A program that contains no derivative of any portion of the
+Library, but is designed to work with the Library by being compiled or
+linked with it, is called a "work that uses the Library".  Such a
+work, in isolation, is not a derivative work of the Library, and
+therefore falls outside the scope of this License.
+
+  However, linking a "work that uses the Library" with the Library
+creates an executable that is a derivative of the Library (because it
+contains portions of the Library), rather than a "work that uses the
+library".  The executable is therefore covered by this License.
+Section 6 states terms for distribution of such executables.
+
+  When a "work that uses the Library" uses material from a header file
+that is part of the Library, the object code for the work may be a
+derivative work of the Library even though the source code is not.
+Whether this is true is especially significant if the work can be
+linked without the Library, or if the work is itself a library.  The
+threshold for this to be true is not precisely defined by law.
+
+  If such an object file uses only numerical parameters, data
+structure layouts and accessors, and small macros and small inline
+functions (ten lines or less in length), then the use of the object
+file is unrestricted, regardless of whether it is legally a derivative
+work.  (Executables containing this object code plus portions of the
+Library will still fall under Section 6.)
+
+  Otherwise, if the work is a derivative of the Library, you may
+distribute the object code for the work under the terms of Section 6.
+Any executables containing that work also fall under Section 6,
+whether or not they are linked directly with the Library itself.
+
+  6. As an exception to the Sections above, you may also combine or
+link a "work that uses the Library" with the Library to produce a
+work containing portions of the Library, and distribute that work
+under terms of your choice, provided that the terms permit
+modification of the work for the customer's own use and reverse
+engineering for debugging such modifications.
+
+  You must give prominent notice with each copy of the work that the
+Library is used in it and that the Library and its use are covered by
+this License.  You must supply a copy of this License.  If the work
+during execution displays copyright notices, you must include the
+copyright notice for the Library among them, as well as a reference
+directing the user to the copy of this License.  Also, you must do one
+of these things:
+
+    a) Accompany the work with the complete corresponding
+    machine-readable source code for the Library including whatever
+    changes were used in the work (which must be distributed under
+    Sections 1 and 2 above); and, if the work is an executable linked
+    with the Library, with the complete machine-readable "work that
+    uses the Library", as object code and/or source code, so that the
+    user can modify the Library and then relink to produce a modified
+    executable containing the modified Library.  (It is understood
+    that the user who changes the contents of definitions files in the
+    Library will not necessarily be able to recompile the application
+    to use the modified definitions.)
+
+    b) Use a suitable shared library mechanism for linking with the
+    Library.  A suitable mechanism is one that (1) uses at run time a
+    copy of the library already present on the user's computer system,
+    rather than copying library functions into the executable, and (2)
+    will operate properly with a modified version of the library, if
+    the user installs one, as long as the modified version is
+    interface-compatible with the version that the work was made with.
+
+    c) Accompany the work with a written offer, valid for at
+    least three years, to give the same user the materials
+    specified in Subsection 6a, above, for a charge no more
+    than the cost of performing this distribution.
+
+    d) If distribution of the work is made by offering access to copy
+    from a designated place, offer equivalent access to copy the above
+    specified materials from the same place.
+
+    e) Verify that the user has already received a copy of these
+    materials or that you have already sent this user a copy.
+
+  For an executable, the required form of the "work that uses the
+Library" must include any data and utility programs needed for
+reproducing the executable from it.  However, as a special exception,
+the materials to be distributed need not include anything that is
+normally distributed (in either source or binary form) with the major
+components (compiler, kernel, and so on) of the operating system on
+which the executable runs, unless that component itself accompanies
+the executable.
+
+  It may happen that this requirement contradicts the license
+restrictions of other proprietary libraries that do not normally
+accompany the operating system.  Such a contradiction means you cannot
+use both them and the Library together in an executable that you
+distribute.
+
+  7. You may place library facilities that are a work based on the
+Library side-by-side in a single library together with other library
+facilities not covered by this License, and distribute such a combined
+library, provided that the separate distribution of the work based on
+the Library and of the other library facilities is otherwise
+permitted, and provided that you do these two things:
+
+    a) Accompany the combined library with a copy of the same work
+    based on the Library, uncombined with any other library
+    facilities.  This must be distributed under the terms of the
+    Sections above.
+
+    b) Give prominent notice with the combined library of the fact
+    that part of it is a work based on the Library, and explaining
+    where to find the accompanying uncombined form of the same work.
+
+  8. You may not copy, modify, sublicense, link with, or distribute
+the Library except as expressly provided under this License.  Any
+attempt otherwise to copy, modify, sublicense, link with, or
+distribute the Library is void, and will automatically terminate your
+rights under this License.  However, parties who have received copies,
+or rights, from you under this License will not have their licenses
+terminated so long as such parties remain in full compliance.
+
+  9. You are not required to accept this License, since you have not
+signed it.  However, nothing else grants you permission to modify or
+distribute the Library or its derivative works.  These actions are
+prohibited by law if you do not accept this License.  Therefore, by
+modifying or distributing the Library (or any work based on the
+Library), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Library or works based on it.
+
+  10. Each time you redistribute the Library (or any work based on the
+Library), the recipient automatically receives a license from the
+original licensor to copy, distribute, link with or modify the Library
+subject to these terms and conditions.  You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties with
+this License.
+
+  11. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License.  If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Library at all.  For example, if a patent
+license would not permit royalty-free redistribution of the Library by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Library.
+
+If any portion of this section is held invalid or unenforceable under any
+particular circumstance, the balance of the section is intended to apply,
+and the section as a whole is intended to apply in other circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system which is
+implemented by public license practices.  Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+  12. If the distribution and/or use of the Library is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Library under this License may add
+an explicit geographical distribution limitation excluding those countries,
+so that distribution is permitted only in or among countries not thus
+excluded.  In such case, this License incorporates the limitation as if
+written in the body of this License.
+
+  13. The Free Software Foundation may publish revised and/or new
+versions of the Lesser General Public License from time to time.
+Such new versions will be similar in spirit to the present version,
+but may differ in detail to address new problems or concerns.
+
+Each version is given a distinguishing version number.  If the Library
+specifies a version number of this License which applies to it and
+"any later version", you have the option of following the terms and
+conditions either of that version or of any later version published by
+the Free Software Foundation.  If the Library does not specify a
+license version number, you may choose any version ever published by
+the Free Software Foundation.
+
+  14. If you wish to incorporate parts of the Library into other free
+programs whose distribution conditions are incompatible with these,
+write to the author to ask for permission.  For software which is
+copyrighted by the Free Software Foundation, write to the Free
+Software Foundation; we sometimes make exceptions for this.  Our
+decision will be guided by the two goals of preserving the free status
+of all derivatives of our free software and of promoting the sharing
+and reuse of software generally.
+
+			    NO WARRANTY
+
+  15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO
+WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
+EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR
+OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY
+KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE.  THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
+LIBRARY IS WITH YOU.  SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME
+THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+  16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN
+WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
+AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU
+FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR
+CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
+LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
+RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
+FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
+SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
+DAMAGES.
+
+		     END OF TERMS AND CONDITIONS
+
+           How to Apply These Terms to Your New Libraries
+
+  If you develop a new library, and you want it to be of the greatest
+possible use to the public, we recommend making it free software that
+everyone can redistribute and change.  You can do so by permitting
+redistribution under these terms (or, alternatively, under the terms of the
+ordinary General Public License).
+
+  To apply these terms, attach the following notices to the library.  It is
+safest to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least the
+"copyright" line and a pointer to where the full notice is found.
+
+    <one line to give the library's name and a brief idea of what it does.>
+    Copyright (C) <year>  <name of author>
+
+    This library is free software; you can redistribute it and/or
+    modify it under the terms of the GNU Lesser General Public
+    License as published by the Free Software Foundation; either
+    version 2.1 of the License, or (at your option) any later version.
+
+    This library is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+    Lesser General Public License for more details.
+
+    You should have received a copy of the GNU Lesser General Public
+    License along with this library; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301  USA
+
+Also add information on how to contact you by electronic and paper mail.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the library, if
+necessary.  Here is a sample; alter the names:
+
+  Yoyodyne, Inc., hereby disclaims all copyright interest in the
+  library `Frob' (a library for tweaking knobs) written by James Random Hacker.
+
+  <signature of Ty Coon>, 1 April 1990
+  Ty Coon, President of Vice
+
+That's all there is to it!
+
+

Added: trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE
===================================================================
--- trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.LICENSE	2008-05-27 18:58:18 UTC (rev 2268)
@@ -0,0 +1,176 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        http://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the ...
 
[truncated message content]

[Archive-access-cvs] SF.net SVN: archive-access: [2267] trunk/archive-access/projects/nutchwax/ xdocs/index.xml

From: <bi...@us...> - 2008-05-22 23:27:25

Revision: 2267
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2267&view=rev
Author:   binzino
Date:     2008-05-22 16:27:32 -0700 (Thu, 22 May 2008)

Log Message:
-----------
Added 0.12 pre-announcement.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/xdocs/index.xml

Modified: trunk/archive-access/projects/nutchwax/xdocs/index.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/xdocs/index.xml	2008-05-21 00:02:01 UTC (rev 2266)
+++ trunk/archive-access/projects/nutchwax/xdocs/index.xml	2008-05-22 23:27:32 UTC (rev 2267)
@@ -60,6 +60,17 @@
     </table>
     </section>
     <section name="News">
+    <subsection name="Upcoming Release 0.12.0 - 05/22/2008">
+    <p>
+      With this upcoming release, NutchWAX 0.12 will "catch-up" to
+      Nutch 1.0-dev (which uses Hadoop 0.16), thereby benefiting from
+      numerous bug fixes and enhancements contained therein.
+    </p>
+    <p>
+      We are on target for releasing a public beta on June 2nd.
+      Watch this space for further announcements.
+    </p>
+    </subsection>
     <subsection name="Release 0.10.0 - 01/17/2007">
     <p>
      Bug fixes and improvements in the quality of search results



[Archive-access-cvs] SF.net SVN: archive-access: [2266] trunk/archive-access/projects/nat/ archive

From: <bi...@us...> - 2008-05-21 00:01:54

Revision: 2266
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2266&view=rev
Author:   binzino
Date:     2008-05-20 17:02:01 -0700 (Tue, 20 May 2008)

Log Message:
-----------
Total re-write of install, readme and howto documents.

Modified Paths:
--------------
    trunk/archive-access/projects/nat/archive/INSTALL.txt
    trunk/archive-access/projects/nat/archive/README.txt

Added Paths:
-----------
    trunk/archive-access/projects/nat/archive/HOWTO.txt

Added: trunk/archive-access/projects/nat/archive/HOWTO.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/HOWTO.txt	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/HOWTO.txt	2008-05-21 00:02:01 UTC (rev 2266)
@@ -0,0 +1,325 @@
+
+HOWTO.txt
+2008-05-20
+Aaron Binns
+
+Table of Contents
+ o Prerequisites
+   - Nutch(WAX) installation
+   - ARC/WARC files
+ o Configuration & Patching
+ o Create a manifest
+ o Import, Invert and Index
+ o Search
+ o Web deployment
+   - Don't forget to config & patch again
+
+======================================================================
+Prerequisites
+======================================================================
+
+In order to use Nutch(WAX) you need the following prerequisites:
+
+ 1. NutchWAX installed.
+
+    See INSTALL.txt for instructions on building and installing
+    NutchWAX.
+
+    This HOWTO assumes it is installed in
+
+      /opt/nutch-1.0-dev
+
+ 2. ARC/WARC files.
+
+    The whole purpose of NutchWAX is to index ARC/WARC files.  These
+    files are produced by neither Nutch nor NutchWAX; they are produced
+    by other tools, such as Heritrix.
+
+    If you don't have any ARC/WARC files, you have no need for
+    NutchWAX.
+
+
+======================================================================
+Patching
+======================================================================
+
+The vanilla NutchWAX as built according to the INSTALL.txt guide is
+not quite ready to be used out-of-the-box.
+
+Before you can use NutchWAX, you must first patch a bug that exists in
+the current Nutch SVN head.
+
+The file
+
+  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
+
+contains two errors: one where a mimetype is referenced before it is
+defined; and a second where a definition has an illegal character.
+
+These errors cause Nutch to not recognize certain mimetypes and
+therefore to ignore documents matching those mimetypes.
+
+There are two fixes:
+
+ 1. Move
+
+	<mime-type type="application/xml">
+		<alias type="text/xml" />
+		<glob pattern="*.xml" />
+	</mime-type>
+
+    definition higher up in the file, before the reference to it.
+
+ 2. Remove
+
+	<mime-type type="application/x-ms-dos-executable">
+		<alias type="application/x-dosexec;exe" />
+	</mime-type>
+
+    as the ';' character is illegal according to the comments in the
+    Nutch code.
+
+You can either apply these patches yourself, or copy an already-patched
+copy from:
+
+  /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml
+
+to 
+
+  /opt/nutch-1.0-dev/conf/tika-mimetypes.xml
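If you apply the patches by hand, a quick check for the second error can
be sketched as below.  This is only an illustration, not part of Nutch or
NutchWAX: the function name and the sample document are made up for the
example, and the <mime-info> wrapper element is an assumption about the
file layout (the code only looks at <mime-type> and <alias> elements).

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the <mime-info> wrapper is an assumption.
SAMPLE = """<mime-info>
  <mime-type type="application/xml">
    <alias type="text/xml" />
    <glob pattern="*.xml" />
  </mime-type>
  <mime-type type="application/x-ms-dos-executable">
    <alias type="application/x-dosexec;exe" />
  </mime-type>
</mime-info>"""


def aliases_with_semicolon(xml_text):
    """Return (mime-type, alias) pairs whose alias type contains ';'."""
    root = ET.fromstring(xml_text)
    bad = []
    for mime in root.iter("mime-type"):
        for alias in mime.findall("alias"):
            if ";" in alias.get("type", ""):
                bad.append((mime.get("type"), alias.get("type")))
    return bad


# Print each offending definition so it can be removed by hand.
for mime_type, alias in aliases_with_semicolon(SAMPLE):
    print(mime_type, "has illegal alias", alias)
```

To check a real file, read its contents and pass them to
aliases_with_semicolon() instead of SAMPLE.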
+
+
+======================================================================
+Configuring
+======================================================================
+
+We assume that you are already familiar with Nutch and therefore
+with configuring it.  The configuration is mainly defined in
+
+  /opt/nutch-1.0-dev/conf/nutch-default.xml
+
+NutchWAX requires the modification of two existing properties and the
+addition of two new ones.
+
+All of the modifications described below can be found in:
+
+  /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml
+
+You can either apply the configuration changes yourself, or copy that
+file to
+
+  /opt/nutch-1.0-dev/conf/nutch-site.xml
+
+--------------------------------------------------
+plugin.includes
+--------------------------------------------------
+Change the list of plugins from:
+
+  protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)
+
+to
+
+  protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic
+
+In short, we add:
+
+  index-nutchwax
+  query-nutchwax
+  parse-pdf
+
+and remove:
+
+  urlfilter-regex
+  urlnormalizer-(pass|regex|basic)
+
+The only *required* changes are the additions of the NutchWAX index
+and query plugins.  The rest are optional, but recommended.
+
+The addition of the "parse-pdf" plugin is simply because we have lots
+of PDFs in our archives and we want to index them.  We sometimes
+remove the "parse-js" plugin if we don't care to index JavaScript
+files.
+
+We also remove the URL filtering and normalizing plugins because we do
+not need the URLs normalized or filtered.  We trust that the tool
+that produced the ARC/WARC file will have normalized the URLs
+contained therein according to its own rules so there's no need to
+normalize here.  Also, we don't filter by URL since we want to index
+as much of the ARC/WARC file as we have parsers for.
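The effect of the new plugin list can be sketched as follows.  The
sketch assumes that Nutch treats the plugin.includes value as a regular
expression that must match a whole plugin id; the helper name
is_included is hypothetical and only serves the illustration.

```python
import re

# The recommended plugin.includes value from the section above.
PLUGIN_INCLUDES = (
    "protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)"
    "|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic"
)


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """True if the whole plugin id matches the plugin.includes expression.

    Assumption: the expression is matched against the complete id.
    """
    return re.fullmatch(pattern, plugin_id) is not None


for plugin_id in ("index-nutchwax", "query-nutchwax", "parse-pdf",
                  "urlfilter-regex", "urlnormalizer-basic"):
    print(plugin_id, "included" if is_included(plugin_id) else "excluded")
```

Under that assumption, the three added plugins are included and the
URL filtering and normalizing plugins are excluded.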
+
+--------------------------------------------------
+mime.type.magic
+--------------------------------------------------
+We disable mimetype detection in Nutch for two reasons:
+
+1. The ARC/WARC file specifies the Content-Type of the document.  We
+   trust that the tool that created the ARC/WARC file got it right.
+
+2. The implementation in Nutch can use a lot of memory as the *entire*
+   document is read into memory as a byte[], then converted to a
+   String, then checked against the MIME database.  This can lead to
+   out of memory errors for large files, such as music and video.
+
+To disable, simply set the property value to false.
+
+  <property>
+    <name>mime.type.magic</name>
+    <value>false</value>
+  </property>
+
+--------------------------------------------------
+nutchwax.filter.index
+--------------------------------------------------
+Configure the 'index-nutchwax' plugin.  Specify how the metadata
+fields added by the ArcsToSegment are mapped to the Lucene documents
+during indexing.
+
+The specifications here are of the form:
+
+  src-key:lowercase:store:tokenize:dest-key
+
+where the only required part is the "src-key"; the remaining parts
+assume the following defaults:
+
+  lowercase = true
+  store     = true
+  tokenize  = false
+  dest-key  = src-key
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.index</name>
+  <value>
+    arcname:false
+    collection
+    date
+    type
+  </value>
+</property>
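To make the specification format and its defaults concrete, here is a
minimal parsing sketch.  It is not the NutchWAX implementation: the
names IndexFieldSpec, parse_index_spec and parse_index_specs are
hypothetical, and treating an empty part (as in "a::false") as "use the
default" is an assumption.

```python
from typing import NamedTuple


class IndexFieldSpec(NamedTuple):
    src_key: str
    lowercase: bool
    store: bool
    tokenize: bool
    dest_key: str


def parse_index_spec(spec):
    """Parse one 'src-key:lowercase:store:tokenize:dest-key' entry."""
    parts = spec.strip().split(":")
    src_key = parts[0]
    if not src_key:
        raise ValueError("src-key is required: %r" % spec)

    def flag(index, default):
        # Missing or empty parts fall back to the documented default.
        if index < len(parts) and parts[index]:
            return parts[index].lower() == "true"
        return default

    dest_key = parts[4] if len(parts) > 4 and parts[4] else src_key
    return IndexFieldSpec(src_key, flag(1, True), flag(2, True),
                          flag(3, False), dest_key)


def parse_index_specs(value):
    """Parse a whitespace-separated property value into a list of specs."""
    return [parse_index_spec(token) for token in value.split()]


RECOMMENDED = """
    arcname:false
    collection
    date
    type
"""

for field in parse_index_specs(RECOMMENDED):
    print(field)
```

With the recommended value, only "arcname" overrides a default
(lowercase = false); the other three fields use the defaults throughout.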
+
+--------------------------------------------------
+nutchwax.filter.query
+--------------------------------------------------
+Configure the 'query-nutchwax' plugin.  Specify which fields to make
+searchable via "[field]:[term|phrase]" query syntax, and whether they
+are "raw" fields or not.
+
+The specification format is 
+
+  raw:name:lowercase:boost 
+or
+  field:name:boost
+
+Default values are
+
+  lowercase = true
+  boost     = 1.0f
+
+There is no "lowercase" property for "field" specification because the
+Nutch FieldQueryFilter doesn't expose the option, unlike the
+RawFieldQueryFilter.
+
+NOTE: We do *not* use this filter for handling "date" queries; there is a
+specific filter for that: DateQueryFilter
+
+We recommend:
+
+<property>
+  <name>nutchwax.filter.query</name>
+  <value>
+    raw:arcname:false
+    raw:collection
+    raw:type
+    field:anchor
+    field:content
+    field:host
+    field:title
+  </value>
+</property>
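The two specification forms can be sketched with a small parser.  As
before, this is an illustration rather than the NutchWAX code: the names
QueryFieldSpec and parse_query_spec are hypothetical, and accepting a
trailing "f" on the boost value is an assumption made to mirror the
"1.0f" default shown above.

```python
from typing import NamedTuple, Optional


class QueryFieldSpec(NamedTuple):
    kind: str                  # "raw" or "field"
    name: str
    lowercase: Optional[bool]  # None for "field": no such option there
    boost: float


def parse_query_spec(spec):
    """Parse 'raw:name:lowercase:boost' or 'field:name:boost'."""
    parts = spec.strip().split(":")
    if len(parts) < 2 or parts[0] not in ("raw", "field") or not parts[1]:
        raise ValueError("bad query field spec: %r" % spec)
    kind, name = parts[0], parts[1]

    if kind == "raw":
        lowercase = True
        if len(parts) > 2 and parts[2]:
            lowercase = parts[2].lower() == "true"
        boost_text = parts[3] if len(parts) > 3 else ""
    else:
        lowercase = None
        boost_text = parts[2] if len(parts) > 2 else ""

    # Assumption: tolerate a Java-style float suffix such as "1.0f".
    boost = float(boost_text.rstrip("fF")) if boost_text else 1.0
    return QueryFieldSpec(kind, name, lowercase, boost)


for token in "raw:arcname:false raw:collection field:title".split():
    print(parse_query_spec(token))
```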
+
+
+======================================================================
+Create a manifest
+======================================================================
+
+The input to NutchWAX's import tool is a manifest file.  This is a
+simple text file where each line contains a URL to an ARC/WARC file
+and an optional collection name.
+
+For example:
+
+ $ cat > manifest
+ http://someserver/somepath/somearchive.arc.gz mycollection
+ ^D
+
+This creates a simple manifest file with one ARC file and a collection
+name of "mycollection".
+
+You don't have to use collections at all.  If you don't know how you
+would use them, then simply leave the collection name out here.
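The manifest format can be summarized by a short reader.  This is a
sketch under stated assumptions, not the NutchWAX import code: the
function name parse_manifest is hypothetical, the URL and collection
name are assumed to be separated by whitespace, and blank lines are
assumed to be ignorable.

```python
def parse_manifest(text):
    """Return a list of (url, collection) pairs; collection may be None."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # assumption: blank lines carry no meaning
        url = fields[0]
        collection = fields[1] if len(fields) > 1 else None
        entries.append((url, collection))
    return entries


MANIFEST = """\
http://someserver/somepath/somearchive.arc.gz mycollection
http://someserver/somepath/otherarchive.arc.gz
"""

for url, collection in parse_manifest(MANIFEST):
    print(url, collection)
```

The second line of the sample shows an entry with no collection name;
its URL is made up for the example.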
+
+
+======================================================================
+Import, Invert and Index
+======================================================================
+
+The steps to import the files, invert the links and index the documents
+are rather simple:
+
+  $ mkdir crawl
+  $ cd crawl
+  $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest
+  $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
+  $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb  -dir segments
+  $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
+  $ ls -F1
+  crawldb/
+  indexes/
+  linkdb/
+  segments/
+
+To those already familiar with Nutch, these steps should be quite
+familiar.
+
+In the first step, we call NutchWAX's "import" command, which creates the
+Nutch segment containing the documents in the ARC/WARC files listed in
+the manifest.  The rest is the same as regular Nutch.
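For scripting the same sequence, the commands can be assembled as
argument lists.  This is only a dry-run sketch: the function name
pipeline_commands and the segment directory name are hypothetical, and
the segment list must be supplied by the caller because the shell is
what expands segments/* in the commands above.

```python
import posixpath
import shlex


def pipeline_commands(nutch_home, manifest, segments):
    """Return the import, updatedb, invertlinks and index commands in order."""
    nutch = posixpath.join(nutch_home, "bin", "nutch")
    nutchwax = posixpath.join(nutch_home, "bin", "nutchwax")
    return [
        [nutchwax, "import", manifest],
        [nutch, "updatedb", "crawldb", "-dir", "segments"],
        [nutch, "invertlinks", "linkdb", "-dir", "segments"],
        [nutch, "index", "indexes", "crawldb", "linkdb"] + list(segments),
    ]


# Hypothetical segment name, standing in for the expansion of segments/*.
COMMANDS = pipeline_commands("/opt/nutch-1.0-dev", "../manifest",
                             ["segments/20080520000000"])

# Print the commands instead of running them.
for argv in COMMANDS:
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Each list could be passed to subprocess.run() from inside the crawl
directory once the printed commands look right.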
+
+
+======================================================================
+Search
+======================================================================
+The resulting indexes can be searched in exactly the same manner as in
+regular Nutch.  For example, assuming you just completed the steps
+above, now:
+
+  $ cd ../
+  $ ls -F1
+  crawl/
+  $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer
+
+This calls the NutchBean to execute a simple keyword search for
+"computer".  Use whatever query term you think appears in the
+documents you imported.
+
+
+======================================================================
+Web Deployment
+======================================================================
+
+As users of Nutch are aware, the web application (nutch-1.0-dev.war)
+bundled with Nutch contains duplicate copies of the configuration
+files.
+
+So, all patches and configuration changes that we made to the
+files in
+
+  /opt/nutch-1.0-dev/conf
+
+will have to be duplicated in the Nutch webapp when it is deployed.
+
+This is not due to NutchWAX; it is a "feature" of regular Nutch.  I
+just thought it would be good to remind everyone since we did make
+configuration changes for NutchWAX.
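
The import, invert and index steps from the HOWTO above can be collected into a small wrapper.  What follows is only a sketch: the function names and the NUTCH_HOME and DRY_RUN variables are assumptions made for this example and are not part of NutchWAX or Nutch.  The four commands themselves are the ones listed in the HOWTO.

```shell
#!/bin/sh
# Sketch only: wraps the four HOWTO commands.  NUTCH_HOME, DRY_RUN and the
# function names are assumptions for this example, not NutchWAX features.
NUTCH_HOME="${NUTCH_HOME:-/opt/nutch-1.0-dev}"

# Echo a command, then run it unless DRY_RUN is set to a non-empty value.
run_step() {
  echo "+ $*"
  [ -n "${DRY_RUN:-}" ] || "$@"
}

# Run the steps in order; the && chain stops at the first failing step.
# The segments/* glob is expanded only when the last step is reached,
# so it sees the segment created by the import step.
run_pipeline() {
  manifest="$1"
  run_step "$NUTCH_HOME/bin/nutchwax" import "$manifest" &&
  run_step "$NUTCH_HOME/bin/nutch" updatedb crawldb -dir segments &&
  run_step "$NUTCH_HOME/bin/nutch" invertlinks linkdb -dir segments &&
  run_step "$NUTCH_HOME/bin/nutch" index indexes crawldb linkdb segments/*
}
```

As in the HOWTO, the wrapper is meant to be called from inside the "crawl" directory, for example "run_pipeline ../manifest".  Setting DRY_RUN=1 first only prints the commands.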

Modified: trunk/archive-access/projects/nat/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/INSTALL.txt	2008-05-14 00:20:24 UTC (rev 2265)
+++ trunk/archive-access/projects/nat/archive/INSTALL.txt	2008-05-21 00:02:01 UTC (rev 2266)
@@ -1,236 +1,93 @@
 
 INSTALL.txt
-2008-05-06
+2008-05-20
 Aaron Binns
 
+This installation guide assumes the reader is already familiar with
+building, packaging and deploying Nutch 1.0-dev.
 
-The NutchWAX 0.12 build and installation is as an "add-on" to an
-existing Nutch 1.0-dev installation.
 
-NutchWAX 0.12 uses a simple 'ant' build script.  The script compiles
-the NutchWAX sources, using the libraries in the installed
-Nutch-1.0-dev.
+The NutchWAX 0.12 source and build system are designed to integrate
+into the existing Nutch 1.0-dev source and build.
 
-We strongly recommend having *two* Nutch-1.0-dev installation
-directories: one that you build NutchWAX against, and another into
-which NutchWAX is deployed.
+The long-term goal is for the NutchWAX components to be fully
+integrated into mainline Nutch.  As a stepping-stone toward that goal,
+we have packaged the NutchWAX source to be dropped into the Nutch
+"contrib" directory and built from there.
 
-NutchWAX is deployed by un-tar'ing the nutchwax-0.12.tar.gz file
-*into* an existing Nutch-1.0-dev installation.  Think of NutchWAX as
-an add-on.  We over-write a few Nutch config files, but the rest is
-simply added to the existing Nutch-1.0-dev installation.
+Like Nutch, NutchWAX 0.12 uses a simple 'ant' build script.  The
+NutchWAX build script calls out to the Nutch script to build Nutch
+proper, then builds NutchWAX components and integrates them into the
+Nutch build directory.
 
+We recommend that you execute all build commands from the NutchWAX
+directory.  This way, NutchWAX will ensure that any and all
+dependencies in Nutch will be properly built and kept up-to-date.
+Towards this goal, we have duplicated the most common build targets
+from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file,
+such as:
 
+  o compile
+  o jar
+  o job
+  o tar
+  o clean
+
+Again, the idea is that if you're already used to building Nutch, you
+can easily transition to building Nutch and NutchWAX together.  All of
+the build artifacts will still be placed in Nutch's 'build'
+sub-directory as normal.
+
+
 Nutch-1.0-dev
 -------------
-
-As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.  Now
+As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.
 Nutch doesn't have a 1.0 release package yet, so we have to use the
-Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12 is 
+Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12 is
 built against is:
 
   650739
 
 To checkout this revision of Nutch, use:
 
- $ mkdir nutch
+ $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
  $ cd nutch
- $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk
 
-To build the nutch-1.0-dev.tar.gz package, use 'ant'
 
- $ cd trunk
- $ ant tar
+NutchWAX
+--------
+Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into
+Nutch's "contrib" directory.
 
-This produces
+ $ cd contrib
+ $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nat/archive
 
-  build/nutch-1.0-dev.tar.gz
+This will create a sub-directory named "archive" containing the
+NutchWAX sources.
 
-Which we then install *twice*
 
- $ mkdir -p ~/nutchwax-0.12/nutch-1.0-dev
- $ tar xzf build/nutch-1.0-dev.tar.gz -C ~/nutchwax-0.12/nutch-1.0-dev
- $ mkdir -p /opt/nutch-1.0-dev
- $ tar xzf build/nutch-1.0-dev.tar.gz -C /opt/nutch-1.0-dev
-
-The idea is that we keep /opt/nutch-1.0-dev as our pristine copy which
-we compile against, then, when we want to test NutchWAX, we deploy it
-into ~/nutchwax-0.12/nutch-1.0-dev.
-
-Why can't we just use one installation of Nutch?  Mainly to avoid
-weirdness where we are compiling NutchWAX source against the same set
-of libraries where we would be installing NutchWAX.  Consider, when we
-deploy NutchWAX, we copy the nutchwax.jar into the Nutch 'lib'
-directory.  If we use that same 'lib' directory for dependencies when
-compiling the source, 'ant'/'javac' will likely get confused when
-calculating dependencies.
-
-It's possible that you could successfully go through the
-build/test/release cycle using one Nutch-1.0-dev directory, but these
-instructions assume you will have two.
-
-
 Build and install
 -----------------
+Assuming you already have the required tool-set for building Nutch,
+building NutchWAX is a snap.
 
-  1. Install two Nutch-1.0-dev packages per the instructions above.
+Simply execute the same 'ant' build command in
 
-  2. Edit build.xml to point to the "pristine" installation of Nutch-1.0-dev
+  nutch/contrib/archive
 
-       <!-- NOTE: Point this to your Nutch 1.0-dev directory -->
-       <property name="nutch.dir" value="/opt/nutch-1.0-dev" />
+as you normally would and everything will build as normal.
 
-  3. Build NutchWAX-0.12
+For example
 
-      $ ant
+  $ cd nutch/contrib/archive
+  $ ant tar
 
-     The default build rule is "package" which will compile all the source
-     and build an installation tarball: nutchwax-0.12.tar.gz
+This command will build all of Nutch, then the NutchWAX add-ons and
+finally will package everything up into the "nutch-1.0-dev.tar.gz"
+release package.
 
-     The "build.xml" file is pretty straightforward; grepping it for
-     targets turns up the obvious ones: compile, clean, etc.
-
-  4. Install NutchWAX into the build/test Nutch installation
-
-     $ tar xzf nutchwax-0.12.tar.gz -C ~/nutchwax-0.12/nutch-1.0-dev
-
-That's it!
-
-All we do is add our libraries (nutchwax.jar and dependencies), the
-'nutchwax' helper script, plugins for indexing and querying, and a few
-config files.
-
-Except for the config files, no files in the Nutch-1.0-dev
-installation are over-written, only added.  The "nutch-site.xml" file
-is over-written, but that file is empty in a vanilla Nutch
-installation, so there's little risk of over-writing something.
-
-
-HOWTO run and test
-------------------
-
-The 'nutchwax' helper script is installed in the Nutch-1.0-dev 'bin'
-directory next to the 'nutch' helper script.
-
-The 'nutchwax' script is used to run the NutchWAX-specific tools; use
-the regular 'nutch' script for regular Nutch activities.
-
-The 'nutchwax' script runs two tools
-
-  "import"     Import a set of .arc/.warc files from a manifest, creating
-               a Nutch segment.
-
-  "dumpindex"  Debug tool that dumps a Lucene index, such as the ones
-               created by Nutch's "index" tool.
-
-The idea is that the NutchWAX "import" tool supplants the Nutch
-generate and fetch cycle.  Rather than generating and fetching
-segments, we import the .arc/.warc files directly into a newly created
-segment.  Then, we process that segment just as we normally would with
-Nutch.
-
-For example,
-
-  $ cd nutch-test
-  $ cat > manifest
-    http://someserver/foo-bar-baz.arc mycollection
-    ^D
-  $ nutch-1.0-dev/bin/nutchwax import manifest
-
-This will import the arc file listed in the manifest into a newly
-created segment.  The segment is created by default in a directory
-hierarchy of the form:
-
-  segments/[date-timestamp]
-
-This mirrors the way segments are created in vanilla Nutch by the
-"generate" command.
-
-You can explicitly name the segment if you want, e.g.
-
-  nutchwax import manifest mysegment
-
-Once the segment is created by the importing of ARC files with
-NutchWAX, you can use Nutch to perform the rest of the steps.  For
+Then, install the "nutch-1.0-dev.tar.gz" tarball as normal.  For
 example:
 
-  $ nutch-1.0-dev/bin/nutchwax import manifest
-  $ nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
-  $ nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments
-  $ nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
-  $ nutch-1.0-dev/bin/nutch merge index indexes
-
-This is pretty much the minimal set of steps to import and index a set
-of ARC files.  The crawldb update and link inversion steps are pro
-forma and don't have anything to do with NutchWAX specifically, but
-are a part of regular Nutch processing.
-
-Now you have a Nutch "index" directory and are ready to search!
-
-Searching is done as in vanilla Nutch.  Either launch the Nutch webapp
-or use the command-line interface to NutchBean to run some test
-searches.  Nothing NutchWAX-specific here.
-
-
-Miscellaneous notes
--------------------
-
-1. Plugins
-
-There are two plugins bundled with NutchWAX: 
-
-   index-nutchwax
-   query-nutchwax
-
-See the "plugin.includes" property in nutch-site.xml to see where
-these plugins are added to the filter chain.
-
-The index-nutchwax plugin ensures that WAX-specific metadata is
-transferred from the Nutch Content object to the Lucene Document
-object, which is placed in the Lucene index.
-
-The query-nutchwax plugin is used to process query requests against
-those same meta-data fields.  It also expands the capabilities of
-searching the basic Nutch fields as well.
-
-2. URL filters
-
-Nutch's URL filter by default filters-out many common URL oddities
-that would normally trip-up Nutch's crawler.  However, when importing
-content from ARC files, filtering out content probably doesn't make
-sense.  That is, whatever content made it into the ARC file should be
-imported, no matter what the URL looks like.
-
-To change the URL filter, edit the Nutch file 'conf/regex-urlfilter.txt'.
-To pass all content through the filter, remove all filter rules except
-for the last one:
-
-  # accept anything else
-  +.
-
-3. conf/tika-mimetypes.xml
-
-NutchWAX comes with a fixed copy of tika-mimetypes.xml.  The version
-in Nutch revision 650739 has a few bugs in it which cause parsing to
-fail for many document types.  The bugs are:
-
- o Move
-
-	<mime-type type="application/xml">
-		<alias type="text/xml" />
-		<glob pattern="*.xml" />
-	</mime-type>
-
-   definition higher up in the file, before the reference to it.
-
- o Remove
-
-	<mime-type type="application/x-ms-dos-executable">
-		<alias type="application/x-dosexec;exe" />
-	</mime-type>
-
-   as the ';' character is illegal according to the comments in the
-   Nutch code.
-
-The copy of "conf/tika-mimetypes.xml" bundled with NutchWAX fixes
-these two bugs.
+  $ cd /opt
+  $ tar xvfz nutch-1.0-dev.tar.gz
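
Before running 'ant tar' it can help to confirm that NutchWAX really landed in Nutch's "contrib" directory.  A minimal sketch, assuming the checkout layout described above and assuming Nutch's own 'build.xml' sits at the top of the checkout; check_layout is a hypothetical helper and is not part of either build.

```shell
#!/bin/sh
# Sketch only: check_layout is a hypothetical pre-build check.
# It assumes Nutch was checked out into "nutch" and NutchWAX into
# "nutch/contrib/archive", as in the instructions above.
check_layout() {
  nutch_dir="${1:-nutch}"
  for f in "$nutch_dir/build.xml" "$nutch_dir/contrib/archive/build.xml"; do
    if [ ! -f "$f" ]; then
      echo "missing: $f" >&2
      return 1
    fi
  done
  echo "layout ok: run 'ant tar' from $nutch_dir/contrib/archive"
}
```

For example, "check_layout nutch && (cd nutch/contrib/archive && ant tar)" only starts the build when both build files are present.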

Modified: trunk/archive-access/projects/nat/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/README.txt	2008-05-14 00:20:24 UTC (rev 2265)
+++ trunk/archive-access/projects/nat/archive/README.txt	2008-05-21 00:02:01 UTC (rev 2266)
@@ -1,105 +1,122 @@
 
 README.txt
-2008-05-06
+2008-05-20
 Aaron Binns
 
+Welcome to NutchWAX 0.12!
 
-This is the NutchWAX-0.12 source that John Lee handed-off to me.  It
-is a work-in-progress.
+NutchWAX is a set of add-ons to Nutch in order to index and search
+archived web data.
 
-Compared to NutchWAX-0.10 (and earlier) it is *much* simpler.  The
-main WAX-specific code is in just a few files really:
+These add-ons are developed and maintained by the Internet Archive Web
+Team in conjunction with a broad community of contributors, partners
+and end-users.
 
-src/java/org/archive/nutchwax/ArcsToSegment.java
+The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
 
-  This is the meat of the WAX logic for processing .arc files and
-  generating Nutch segments.  Once we use this to generate a set of
-  segments for the .arc files, we can use the rest of vanilla
-  Nutch-1.0-dev to invert links and index the content with Lucene.
+Since NutchWAX is a set of add-ons to Nutch, you should already be
+familiar with Nutch before using NutchWAX.
 
-  This conversion code is heavily edited from:
+======================================================================
 
-    nutch-1.0-dev/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+The goal of NutchWAX is to enable full-text indexing and searching of
+documents stored in web archive file formats (ARC and WARC).
 
-  taken from the Nutch SVN head (a.k.a the "1.0-dev" in-development).
+The way we achieve that goal is by providing add-on tools and plugins
+to Nutch to read documents directly from ARC/WARC files.  We call this
+process "importing" archive files.
 
-  Ours differs in a few important ways:
+Importing produces a Nutch segment, the same as if Nutch had actually
+crawled the documents itself.  In this scenario, document importing
+replaces the conventional "generate/fetch/update" cycle of Nutch.
 
-    o Rather than taking a directory with .arc files as input, we take
-      a manifest file with URLs to .arc files.  This way, the manifest
-      is split up among the distributed Hadoop jobs and the .arc files
-      are processed in whole by each worker.
+Once the archival documents have been imported into a segment, the
+regular Nutch commands to update the 'crawldb', invert the links and
+index the document contents can proceed as normal.
 
-      In the Nutch-1.0-dev, the ArcSegmentCreator.java expects the
-      input directory to contain the .arc files and (AFAICT) splits
-      them up and distributes them across the Hadoop workers.  This
-      seems really inefficient to me, I think our approach is much
-      better -- at least for us.
+======================================================================
 
-    o Related to the way input files are split and processed, we use
-      the standard Archive ARCReader class just like Heritrix and
-      Wayback.
+The NutchWAX add-ons consist of:
 
-      The ArcSegmentCreator.java in Nutch-1.0-dev doesn't use our
-      ARCReader because of licensing incompatibility.  Ours is under
-      GPL and Nutch-1.0-dev forbids the use of GPL code.
-      
-      We are in the process of re-licensing or dual-licensing with
-      Apache License, but until then, our ARCReader code won't be included
-      in mainline Nutch.
+ bin/nutchwax
 
-      This isn't a problem per se, but worth noting in case anyone
-      looks at the Nutch-1.0-dev code and wonders why they built their
-      own (horribly inefficient) .arc reader.
+   A shell script that is used to run the NutchWAX command-line tools,
+   such as document importing.
 
-    o We add metadata fields to the processed document for WAX-specific
-      purposes:
+   This is patterned after the 'bin/nutch' shell script.
 
-        content.getMetadata().set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() );
-        content.getMetadata().set( NutchWax.ARCNAME_KEY,      meta.getArcFile().getName() ) ;
-        content.getMetadata().set( NutchWax.COLLECTION_KEY,   collection);
-        content.getMetadata().set( NutchWax.ARCHIVE_DATE_KEY, meta.getDate() );
+ plugins/index-nutchwax
 
-      The addition of the arcname and collection key is pretty
-      obvious.  I don't know why the content-type isn't added in the
-      vanilla Nutch-1.0-dev.
-      
-      Also, we should review the use of the ARCHIVE_DATE_KEY in that
-      John Lee mentioned to me that there was possibly duplicate date
-      fields put in the index: one that is a plain old Java date, and
-      one that is a 14-digit date string for use with Wayback.
+   Indexing plugin which adds NutchWAX-specific metadata fields to the
+   indexed document.
 
-src/java/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/NutchWaxIndexingFilter.java
-src/java/plugin/index-nutchwax/plugin.xml
+ plugins/query-nutchwax
 
-  This filter is pretty straightforward.  All it does is take the
-  metadata fields that were added to the document (as described above)
-  and place them in the Lucene index so that we can make use of them at
-  search-time.
+   Query plugin which allows for querying against the metadata fields
+   added by 'index-nutchwax'.
 
-src/java/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/MultipleFieldQueryFilter.java
-src/java/plugin/query-nutchwax/plugin.xml
+There is no separate 'lib/nutchwax.jar' file for NutchWAX.  NutchWAX
+is distributed in source code form and is intended to be built in
+conjunction with Nutch.
 
-  This is a single query filter that can be used for querying single
-  fields from a single implementation.  It does *not* allow for
-  querying multiple fields as you can already do that via Nutch.
+See "INSTALL.txt" for details on building NutchWAX and Nutch.
 
-  What this filter does is allow one to more-or-less create query
-  filters in a data-driven manner rather than having to code-up a new
-  class for each field.  That is, before one would have to create a
-  CollectionQueryFilter class to filter on the "collection" field.
-  With the MultipleFieldQueryFilter class, you can specify that the
-  "collection" field is to be filterable via the plugin.xml file and
-  "nutchwax.filter.query" configuration property.
+See "HOWTO.txt" for a quick tutorial on importing, indexing and
+searching a set of documents in a web archive file.
 
-src/java/org/archive/nutchwax/NutchWax.java
+======================================================================
 
-  Just a simple enum used by the above two classes for the metadata
-  keys.
+This 0.12 release of NutchWAX is radically different in source-code
+form compared to the previous release, 0.10.
 
-src/java/org/archive/nutchwax/tools/DumpIndex.java
+One of the design goals of 0.12 was to reduce or even eliminate the
+"copy/paste/edit" approach of 0.10.  The 0.10 (and prior) NutchWAX
+releases had to copy/paste/edit large chunks of Nutch source code in
+order to add the NutchWAX features.
 
-  A simple command-line utility to dump the contents of a Lucene
-  index.  Used for debugging.
+Also, the NutchWAX 0.12 sources and build are designed to one day be
+added into mainline Nutch as a proper "contrib" package; then
+eventually be fully integrated into the core Nutch source code.
 
+======================================================================
 
+Most of the NutchWAX source code is relatively straightforward to those
+already familiar with the inner workings of Nutch.  Still, one class
+deserves special attention:
+
+  src/java/org/archive/nutchwax/ArcsToSegment.java
+
+This is where ARC/WARC files are read and their documents are imported
+into a Nutch segment.
+
+It is inspired by:
+
+  nutch/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+
+on the Nutch SVN head.
+
+Our implementation differs in a few important ways:
+
+  o Rather than taking a directory with ARC files as input, we take a
+    manifest file with URLs to ARC files.  This way, the manifest is
+    split up among the distributed Hadoop jobs and the ARC files are
+    processed in whole by each worker.
+
+    In the Nutch SVN, the ArcSegmentCreator.java expects the input
+    directory to contain the ARC files and (AFAICT) splits them up and
+    distributes them across the Hadoop workers.
+
+  o We use the standard Internet Archive ARCReader and WARCReader
+    classes.  Thus, NutchWAX can read both ARC and WARC files, whereas
+    the ArcSegmentCreator class can only read ARC files.
+
+  o We add metadata fields to the document, which are then available
+    to the "index-nutchwax" plugin at indexing-time.
+
+    ArcsToSegment.importRecord()
+      ...
+      contentMetadata.set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype()          );
+      contentMetadata.set( NutchWax.ARCNAME_KEY,      meta.getArcFile().getName() );
+      contentMetadata.set( NutchWax.COLLECTION_KEY,   collectionName              );
+      contentMetadata.set( NutchWax.DATE_KEY,         meta.getDate()              );
+      ...


[Archive-access-cvs] SF.net SVN: archive-access: [2265] trunk/archive-access/projects

From: <bi...@us...> - 2008-05-14 00:20:16

Revision: 2265
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2265&view=rev
Author:   binzino
Date:     2008-05-13 17:20:24 -0700 (Tue, 13 May 2008)

Log Message:
-----------
Initial checkin of NutchWAX 0.12, a.k.a Nutch Archive Tools (NAT).

Added Paths:
-----------
    trunk/archive-access/projects/nat/
    trunk/archive-access/projects/nat/archive/
    trunk/archive-access/projects/nat/archive/INSTALL.txt
    trunk/archive-access/projects/nat/archive/README.txt
    trunk/archive-access/projects/nat/archive/bin/
    trunk/archive-access/projects/nat/archive/bin/nutchwax
    trunk/archive-access/projects/nat/archive/build.xml
    trunk/archive-access/projects/nat/archive/conf/
    trunk/archive-access/projects/nat/archive/conf/nutch-site.xml
    trunk/archive-access/projects/nat/archive/conf/search-servers.txt
    trunk/archive-access/projects/nat/archive/conf/tika-mimetypes.xml
    trunk/archive-access/projects/nat/archive/lib/
    trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.jar
    trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.jar
    trunk/archive-access/projects/nat/archive/lib/fastutil-5.0.3.jar
    trunk/archive-access/projects/nat/archive/src/
    trunk/archive-access/projects/nat/archive/src/java/
    trunk/archive-access/projects/nat/archive/src/java/org/
    trunk/archive-access/projects/nat/archive/src/java/org/archive/
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcsToSegment.java
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/NutchWax.java
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/tools/
    trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/tools/DumpIndex.java
    trunk/archive-access/projects/nat/archive/src/plugin/
    trunk/archive-access/projects/nat/archive/src/plugin/build-plugin.xml
    trunk/archive-access/projects/nat/archive/src/plugin/build.xml
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/build.xml
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/plugin.xml
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/
    trunk/archive-access/projects/nat/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/build.xml
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/plugin.xml
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/ConfigurableQueryFilter.java
    trunk/archive-access/projects/nat/archive/src/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/DateQueryFilter.java

Added: trunk/archive-access/projects/nat/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/INSTALL.txt	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/INSTALL.txt	2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,236 @@
+
+INSTALL.txt
+2008-05-06
+Aaron Binns
+
+
+The NutchWAX 0.12 build and installation is as an "add-on" to an
+existing Nutch 1.0-dev installation.
+
+NutchWAX 0.12 uses a simple 'ant' build script.  The script compiles
+the NutchWAX sources, using the libraries in the installed
+Nutch-1.0-dev.
+
+We strongly recommend having *two* Nutch-1.0-dev installation
+directories: one that you build NutchWAX against, and another into
+which NutchWAX is deployed.
+
+NutchWAX is deployed by un-tar'ing the nutchwax-0.12.tar.gz file
+*into* an existing Nutch-1.0-dev installation.  Think of NutchWAX as
+an add-on.  We over-write a few Nutch config files, but the rest is
+simply added to the existing Nutch-1.0-dev installation.
+
+
+Nutch-1.0-dev
+-------------
+
+As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev.  Now
+Nutch doesn't have a 1.0 release package yet, so we have to use the
+Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12 is 
+built against is:
+
+  650739
+
+To checkout this revision of Nutch, use:
+
+ $ mkdir nutch
+ $ cd nutch
+ $ svn checkout -r 650739 http://svn.apache.org/repos/asf/lucene/nutch/trunk
+
+To build the nutch-1.0-dev.tar.gz package, use 'ant'
+
+ $ cd trunk
+ $ ant tar
+
+This produces
+
+  build/nutch-1.0-dev.tar.gz
+
+Which we then install *twice*
+
+ $ mkdir -p ~/nutchwax-0.12/nutch-1.0-dev
+ $ tar xzf build/nutch-1.0-dev.tar.gz -C ~/nutchwax-0.12/nutch-1.0-dev
+ $ mkdir -p /opt/nutch-1.0-dev
+ $ tar xzf build/nutch-1.0-dev.tar.gz -C /opt/nutch-1.0-dev
+
+The idea is that we keep /opt/nutch-1.0-dev as our pristine copy which
+we compile against, then, when we want to test NutchWAX, we deploy it
+into ~/nutchwax-0.12/nutch-1.0-dev.
+
+Why can't we just use one installation of Nutch?  Mainly to avoid
+weirdness where we are compiling NutchWAX source against the same set
+of libraries where we would be installing NutchWAX.  Consider, when we
+deploy NutchWAX, we copy the nutchwax.jar into the Nutch 'lib'
+directory.  If we use that same 'lib' directory for dependencies when
+compiling the source, 'ant'/'javac' will likely get confused when
+calculating dependencies.
+
+It's possible that you could successfully go through the
+build/test/release cycle using one Nutch-1.0-dev directory, but these
+instructions assume you will have two.
+
+
+Build and install
+-----------------
+
+  1. Install two Nutch-1.0-dev packages per the instructions above.
+
+  2. Edit build.xml to point to the "pristine" installation of Nutch-1.0-dev
+
+       <!-- NOTE: Point this to your Nutch 1.0-dev directory -->
+       <property name="nutch.dir" value="/opt/nutch-1.0-dev" />
+
+  3. Build NutchWAX-0.12
+
+      $ ant
+
+     The default build rule is "package" which will compile all the source
+     and build an installation tarball: nutchwax-0.12.tar.gz
+
+     The "build.xml" file is pretty straightforward; grepping it for
+     targets turns up the obvious ones: compile, clean, etc.
+
+  4. Install NutchWAX into the build/test Nutch installation
+
+     $ tar xzf nutchwax-0.12.tar.gz -C ~/nutchwax-0.12/nutch-1.0-dev
+
+That's it!
+
+All we do is add our libraries (nutchwax.jar and dependencies), the
+'nutchwax' helper script, plugins for indexing and querying, and a few
+config files.
+
+Except for the config files, no files in the Nutch-1.0-dev
+installation are over-written, only added.  The "nutch-site.xml" file
+is over-written, but that file is empty in a vanilla Nutch
+installation, so there's little risk of over-writing something.
+
+
+HOWTO run and test
+------------------
+
+The 'nutchwax' helper script is installed in the Nutch-1.0-dev 'bin'
+directory next to the 'nutch' helper script.
+
+The 'nutchwax' script is used to run the NutchWAX-specific tools; use
+the regular 'nutch' script for regular Nutch activities.
+
+The 'nutchwax' script runs two tools
+
+  "import"     Import a set of .arc/.warc files from a manifest, creating
+               a Nutch segment.
+
+  "dumpindex"  Debug tool that dumps a Lucene index, such as the ones
+               created by Nutch's "index" tool.
+
+The idea is that the NutchWAX "import" tool supplants the Nutch
+generate and fetch cycle.  Rather than generating and fetching
+segments, we import the .arc/.warc files directly into a newly created
+segment.  Then, we process that segment just as we normally would with
+Nutch.
+
+For example,
+
+  $ cd nutch-test
+  $ cat > manifest
+    http://someserver/foo-bar-baz.arc mycollection
+    ^D
+  $ nutch-1.0-dev/bin/nutchwax import manifest
+
+This will import the arc file listed in the manifest into a newly
+created segment.  The segment is created by default in a directory
+hierarchy of the form:
+
+  segments/[date-timestamp]
+
+This mirrors the way segments are created in vanilla Nutch by the
+"generate" command.
+
+You can explicitly name the segment if you want, e.g.
+
+  nutchwax import manifest mysegment
+
+Once the segment is created by the importing of ARC files with
+NutchWAX, you can use Nutch to perform the rest of the steps.  For
+example:
+
+  $ nutch-1.0-dev/bin/nutchwax import manifest
+  $ nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments
+  $ nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments
+  $ nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/*
+  $ nutch-1.0-dev/bin/nutch merge index indexes
+
+This is pretty much the minimal set of steps to import and index a set
+of ARC files.  The crawldb update and link inversion steps are pro
+forma and don't have anything to do with NutchWAX specifically, but
+are a part of regular Nutch processing.
+
+Now you have a Nutch "index" directory and are ready to search!
+
+Searching is done as in vanilla Nutch.  Either launch the Nutch webapp
+or use the command-line interface to NutchBean to run some test
+searches.  Nothing NutchWAX-specific here.
+
+
+Miscellaneous notes
+-------------------
+
+1. Plugins
+
+There are two plugins bundled with NutchWAX: 
+
+   index-nutchwax
+   query-nutchwax
+
+See the "plugin.includes" property in nutch-site.xml to see where
+these plugins are added to the filter chain.
+
+The index-nutchwax plugin ensures that WAX-specific metadata is
+transferred from the Nutch Content object to the Lucene Document
+object, which is placed in the Lucene index.
+
+The query-nutchwax plugin is used to process query requests against
+those same meta-data fields.  It also expands the capabilities of
+searching the basic Nutch fields as well.
+
+2. URL filters
+
+Nutch's URL filter by default filters-out many common URL oddities
+that would normally trip-up Nutch's crawler.  However, when importing
+content from ARC files, filtering out content probably doesn't make
+sense.  That is, whatever content made it into the ARC file should be
+imported, no matter what the URL looks like.
+
+To change the URL filter, edit the Nutch file 'conf/regex-urlfilter.txt'.
+To pass all content through the filter, remove all filter rules except
+for the last one:
+
+  # accept anything else
+  +.
+
+3. conf/tika-mimetypes.xml
+
+NutchWAX comes with a fixed copy of tika-mimetypes.xml.  The version
+in Nutch revision 650739 has a few bugs in it which cause parsing to
+fail for many document types.  The fixes are:
+
+ o Move
+
+	<mime-type type="application/xml">
+		<alias type="text/xml" />
+		<glob pattern="*.xml" />
+	</mime-type>
+
+   definition higher up in the file, before the reference to it.
+
+ o Remove
+
+	<mime-type type="application/x-ms-dos-executable">
+		<alias type="application/x-dosexec;exe" />
+	</mime-type>
+
+   as the ';' character is illegal according to the comments in the
+   Nutch code.
+
+The copy of "conf/tika-mimetypes.xml" bundled with NutchWAX fixes
+these two bugs.
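
The URL filter note in the README above (item 2) relies on rule order in 'conf/regex-urlfilter.txt'.  The following is a minimal sketch of that behaviour, assuming the usual first-match-wins reading of the file: '+' accepts, '-' rejects, and a URL that matches no rule is rejected.  The class name UrlFilterSketch is hypothetical and this is a simplified model for illustration, not the Nutch implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified, hypothetical model of first-match-wins URL filtering. */
class UrlFilterSketch {

    /** One rule: accept ('+') or reject ('-') plus a regular expression. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<Rule>();

    /** Parses rule lines, skipping blank lines and '#' comments. */
    UrlFilterSketch(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Bad rule: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** The first rule whose pattern is found in the URL decides the outcome. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false; // assumption: no matching rule means the URL is rejected
    }
}
```

With only the "+." rule left in the file, every non-empty URL matches the first (and only) rule and is accepted, which is the behaviour the README recommends for ARC imports.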

Added: trunk/archive-access/projects/nat/archive/README.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/README.txt	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/README.txt	2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,105 @@
+
+README.txt
+2008-05-06
+Aaron Binns
+
+
+This is the NutchWAX-0.12 source that John Lee handed-off to me.  It
+is a work-in-progress.
+
+Compared to NutchWAX-0.10 (and earlier) it is *much* simpler.  The
+main WAX-specific code is in just a few files really:
+
+src/java/org/archive/nutchwax/ArcsToSegment.java
+
+  This is the meat of the WAX logic for processing .arc files and
+  generating Nutch segments.  Once we use this to generate a set of
+  segments for the .arc files, we can use the rest of vanilla
+  Nutch-1.0-dev to invert links and index the content with Lucene.
+
+  This conversion code is heavily edited from:
+
+    nutch-1.0-dev/src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
+
+  taken from the Nutch SVN head (a.k.a. the "1.0-dev" in-development).
+
+  Ours differs in a few important ways:
+
+    o Rather than taking a directory with .arc files as input, we take
+      a manifest file with URLs to .arc files.  This way, the manifest
+      is split up among the distributed Hadoop jobs and the .arc files
+      are processed in whole by each worker.
+
+      In the Nutch-1.0-dev, the ArcSegmentCreator.java expects the
+      input directory to contain the .arc files and (AFAICT) splits
+      them up and distributes them across the Hadoop workers.  This
+      seems really inefficient to me; I think our approach is much
+      better -- at least for us.
+
+    o Related to the way input files are split and processed, we use
+      the standard Archive ARCReader class just like Heritrix and
+      Wayback.
+
+      The ArcSegmentCreator.java in Nutch-1.0-dev doesn't use our
+      ARCReader because of licensing incompatibility.  Ours is under
+      GPL and Nutch-1.0-dev forbids the use of GPL code.
+
+      We are in the process of re-licensing or dual-licensing with
+      Apache License, but until then, our ARCReader code won't be included
+      in mainline Nutch.
+
+      This isn't a problem per se, but worth noting in case anyone
+      looks at the Nutch-1.0-dev code and wonders why they built their
+      own (horribly inefficient) .arc reader.
+
+    o We add metadata fields to the processed document for WAX-specific
+      purposes:
+
+        content.getMetadata().set( NutchWax.CONTENT_TYPE_KEY, meta.getMimetype() );
+        content.getMetadata().set( NutchWax.ARCNAME_KEY,      meta.getArcFile().getName() ) ;
+        content.getMetadata().set( NutchWax.COLLECTION_KEY,   collection);
+        content.getMetadata().set( NutchWax.ARCHIVE_DATE_KEY, meta.getDate() );
+
+      The addition of the arcname and collection key is pretty
+      obvious.  I don't know why the content-type isn't added in the
+      vanilla Nutch-1.0-dev.
+      
+      Also, we should review the use of the ARCHIVE_DATE_KEY, since
+      John Lee mentioned to me that there were possibly duplicate date
+      fields put in the index: one that is a plain old Java date, and
+      one that is a 14-digit date string for use with Wayback.
+
+src/java/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/NutchWaxIndexingFilter.java
+src/java/plugin/index-nutchwax/plugin.xml
+
+  This filter is pretty straightforward.  All it does is take the
+  metadata fields that were added to the document (as described above)
+  and place them in the Lucene index so that we can make use of them
+  at search-time.
+
+src/java/plugin/query-nutchwax/src/java/org/archive/nutchwax/query/MultipleFieldQueryFilter.java
+src/java/plugin/query-nutchwax/plugin.xml
+
+  This is a single query filter that can be used for querying single
+  fields from a single implementation.  It does *not* allow for
+  querying multiple fields as you can already do that via Nutch.
+
+  What this filter does is allow one to more-or-less create query
+  filters in a data-driven manner rather than having to code up a new
+  class for each field.  That is, previously one would have to create
+  a CollectionQueryFilter class to filter on the "collection" field.
+  With the MultipleFieldQueryFilter class, you can specify that the
+  "collection" field is to be filterable via the plugin.xml file and
+  the "nutchwax.filter.query" configuration property.
+
+src/java/org/archive/nutchwax/NutchWax.java
+
+  Just a simple enum used by the above two classes for the metadata
+  keys.
+
+src/java/org/archive/nutchwax/tools/DumpIndex.java
+
+  A simple command-line utility to dump the contents of a Lucene
+  index.  Used for debugging.
+
+
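
The README above mentions a 14-digit date string for use with Wayback alongside a plain Java date.  As a rough illustration of how the two relate, the sketch below converts between them, assuming the 14 digits are yyyyMMddHHmmss in UTC.  The class and method names are hypothetical and are not taken from the NutchWAX source.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Hypothetical helper converting between java.util.Date and 14-digit timestamps. */
class ArchiveDateSketch {

    /** Builds a strict yyyyMMddHHmmss formatter; the UTC time zone is an assumption. */
    private static SimpleDateFormat newFormat() {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss", Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        fmt.setLenient(false); // reject out-of-range values such as month 13
        return fmt;
    }

    /** Formats a date as a 14-digit string. */
    static String toTimestamp14(Date date) {
        return newFormat().format(date);
    }

    /** Parses a 14-digit string back into a date. */
    static Date fromTimestamp14(String timestamp) throws ParseException {
        if (timestamp == null || timestamp.length() != 14) {
            throw new ParseException("Expected 14 digits: " + timestamp, 0);
        }
        return newFormat().parse(timestamp);
    }
}
```

If both representations do end up in the index, a round trip like this is one way to check that they describe the same instant.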

Added: trunk/archive-access/projects/nat/archive/bin/nutchwax
===================================================================
--- trunk/archive-access/projects/nat/archive/bin/nutchwax	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/bin/nutchwax	2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,70 @@
+#!/usr/bin/env bash
+
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed
+# with this work for additional information regarding copyright
+# ownership.  The ASF licenses this file to You under the Apache
+# License, Version 2.0 (the "License"); you may not use this file
+# except in compliance with the License.  You may obtain a copy of the
+# License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+# implied.  See the License for the specific language governing
+# permissions and limitations under the License.
+
+
+# The following is cribbed from the 'nutch' script to ascertain the
+# location of Nutch so we can call its scripts.
+#
+# resolve links - $0 may be a softlink
+THIS="$0"
+while [ -h "$THIS" ]; do
+  ls=`ls -ld "$THIS"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '.*/.*' > /dev/null; then
+    THIS="$link"
+  else
+    THIS=`dirname "$THIS"`/"$link"
+  fi
+done
+
+THIS_DIR=`dirname "$THIS"`
+NUTCH_HOME=`cd "$THIS_DIR/.." ; pwd`
+
+# Now that we have NUTCH_HOME, process the command-line.
+
+case "$1" in
+  import)
+    shift
+    if [ $# -eq 0 ]; then
+        ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.ArcsToSegment
+        exit 1
+    fi
+    if [ -z "$2" ]; then
+        segment=`date +"%Y%m%d%H%M%S"`
+        segment="segments/${segment}"
+    else
+        segment="$2"
+    fi
+    ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.ArcsToSegment "$1" "${segment}"
+    ;;
+  dumpindex)
+    shift
+    ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.tools.DumpIndex "$@"
+    ;;
+  *)
+    echo ""
+    echo "Usage: nutchwax COMMAND"
+    echo "where COMMAND is one of:"
+    echo "  import       Import ARCs into a new Nutch segment"
+    echo "  dumpindex    Dump an index to the screen"
+    echo ""
+    exit 1
+    ;;
+esac
+
+exit 0


Property changes on: trunk/archive-access/projects/nat/archive/bin/nutchwax
___________________________________________________________________
Name: svn:executable
   + *

Added: trunk/archive-access/projects/nat/archive/build.xml
===================================================================
--- trunk/archive-access/projects/nat/archive/build.xml	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/build.xml	2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,138 @@
+<?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<project name="nutchwax" default="job">
+
+  <property name="nutch.dir" value="../../" />
+
+  <property name="src.dir"   value="src" />
+  <property name="lib.dir"   value="lib" />
+  <property name="build.dir" value="${nutch.dir}/build" />
+  <!-- HACK: Need to import default.properties like Nutch does -->
+  <property name="dist.dir"  value="${build.dir}/nutch-1.0-dev" />
+
+  <target name="nutch-compile-core">
+    <ant dir="${nutch.dir}" target="compile-core" inheritAll="false" />
+  </target>
+
+  <target name="nutch-compile-plugins">
+    <ant dir="${nutch.dir}" target="compile-plugins" inheritAll="false" />
+  </target>
+
+  <target name="compile-core" depends="nutch-compile-core">
+    <javac 
+           destdir="${build.dir}/classes"
+           debug="true"
+           verbose="false"
+           source="1.5"
+           target="1.5"
+           encoding="UTF-8"
+           fork="true"
+           nowarn="true"
+           deprecation="false">
+      <src path="${src.dir}/java" />
+      <include name="**/*.java" />
+      <classpath>
+        <pathelement location="${build.dir}/classes" />
+        <fileset dir="${lib.dir}">
+          <include name="*.jar"/>
+        </fileset>
+        <fileset dir="${nutch.dir}/lib">
+          <include name="*.jar"/>
+        </fileset>
+      </classpath>
+    </javac>
+  </target>
+
+  <target name="compile-plugins">
+    <ant dir="src/plugin" target="deploy" inheritAll="false" />
+  </target>
+
+  <!--
+      These targets all call down to the corresponding target in the
+      Nutch build.xml file.  This way all of the 'ant' build commands
+      can be executed from this directory and everything should get
+      built as expected.
+    -->
+  <target name="compile" depends="compile-core, compile-plugins, nutch-compile-plugins">
+  </target>
+
+  <target name="jar" depends="compile-core">
+    <ant dir="${nutch.dir}" target="jar" inheritAll="false" />
+  </target>
+
+  <target name="job" depends="compile">
+    <ant dir="${nutch.dir}" target="job" inheritAll="false" />
+  </target>
+
+  <target name="war" depends="compile">
+    <ant dir="${nutch.dir}" target="war" inheritAll="false" />
+  </target>
+
+  <target name="javadoc" depends="compile">
+    <ant dir="${nutch.dir}" target="javadoc" inheritAll="false" />
+  </target>
+
+  <target name="tar" depends="package">
+    <ant dir="${nutch.dir}" target="tar" inheritAll="false" />
+  </target>
+
+  <target name="clean">
+    <ant dir="${nutch.dir}" target="clean" inheritAll="false" />
+  </target>
+
+  <!-- This one does a little more after calling down to the relevant
+       Nutch target.  After Nutch has copied everything into the
+       distribution directory, we add our script, libraries, etc.
+       
+       Rather than over-write the standard Nutch configuration files,
+       we place ours in a newly created directory
+       
+         contrib/archive/conf
+
+       and let the individual user decide whether or not to
+       incorporate our modifications.
+    -->
+  <target name="package" depends="jar, job, war, javadoc">
+    <ant dir="${nutch.dir}" target="package" inheritAll="false" />
+
+    <copy todir="${dist.dir}/lib" includeEmptyDirs="false">
+      <fileset dir="lib"/>
+    </copy>
+
+    <copy todir="${dist.dir}/bin">
+      <fileset dir="bin"/>
+    </copy>
+
+    <chmod perm="ugo+x" type="file">
+        <fileset dir="${dist.dir}/bin"/>
+    </chmod>
+
+    <mkdir dir="${dist.dir}/contrib/archive/conf"/>
+    <copy todir="${dist.dir}/contrib/archive/conf">
+      <fileset dir="conf" />
+    </copy>
+
+    <copy todir="${dist.dir}/contrib/archive">
+      <fileset dir=".">
+        <include name="*.txt" />
+      </fileset>
+    </copy>
+
+  </target>
+
+</project>

Added: trunk/archive-access/projects/nat/archive/conf/nutch-site.xml
===================================================================
--- trunk/archive-access/projects/nat/archive/conf/nutch-site.xml	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/conf/nutch-site.xml	2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,65 @@
+<?xml version="1.0"?>
+<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+
+<!-- Put site-specific property overrides in this file. -->
+
+<configuration>
+
+<property>
+  <name>plugin.includes</name>
+  <!-- Add 'index-nutchwax' and 'query-nutchwax' to plugin list. -->
+  <!-- Also, add 'parse-pdf' -->
+  <!-- Remove 'urlfilter-regex' and 'normalizer-(pass|regex|basic)' -->
+  <value>protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic</value>
+</property>
+
+<property>
+  <!-- Configure the 'index-nutchwax' plugin.  Specify how the metadata fields added by the ArcsToSegment are mapped to the Lucene documents during indexing.
+       The specifications here are of the form "src-key:lowercase:store:tokenize:dest-key"
+       Where the only required part is the "src-key", the rest will assume the following defaults:
+          lowercase = true
+          store     = true
+          tokenize  = false
+          dest-key  = src-key
+    -->
+  <name>nutchwax.filter.index</name>
+  <value>
+    arcname:false
+    collection
+    date
+    type
+  </value>
+</property>
+
+<property>
+  <!-- Configure the 'query-nutchwax' plugin.  Specify which fields to make searchable via "field:[term|phrase]" query syntax, and whether they are "raw" fields or not.  
+       The specification format is "raw:name:lowercase:boost" or "field:name:boost".  Default values are
+          lowercase = true
+          boost     = 1.0f
+       There is no "lowercase" property for "field" specification because the Nutch FieldQueryFilter doesn't expose the option, unlike the RawFieldQueryFilter.
+       AFAICT, the order isn't important. -->
+  <!-- We do *not* use this filter for handling "date" queries, there is a specific filter for that: DateQueryFilter -->
+  <name>nutchwax.filter.query</name>
+  <value>
+    raw:arcname:false
+    raw:collection
+    raw:type
+    field:anchor
+    field:content
+    field:host
+    field:title
+  </value>
+</property>
+
+<!-- Over-ride setting in Nutch "nutch-default.xml" file.  We do *not* want Content-Type detection via magic resolution because the implementation 
+     in Nutch reads in the entire content body (which could be a 1GB MPG movie), then converts it to a String before examining the first dozen or
+     so bytes/characters for magic matching.  Since we archive large files, this is bad, and OOMs occur.  So, we disable this feature and keep
+     the Content-Type that is already in the (W)ARC file. -->
+<property>
+  <name>mime.type.magic</name>
+  <value>false</value>
+  <description>Defines if the mime content type detector uses magic resolution.
+  </description>
+</property>
+
+</configuration>
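
The "nutchwax.filter.index" comment in nutch-site.xml above describes specifications of the form "src-key:lowercase:store:tokenize:dest-key", where only the src-key is required.  The sketch below shows one way such a specification could be parsed with the stated defaults.  The class name is hypothetical, and treating an empty optional part as "use the default" is an assumption; the actual plugin may parse these values differently.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for "src-key:lowercase:store:tokenize:dest-key" specs. */
class IndexFieldSpecSketch {
    final String srcKey;
    final boolean lowercase;
    final boolean store;
    final boolean tokenize;
    final String destKey;

    private IndexFieldSpecSketch(String srcKey, boolean lowercase, boolean store,
                                 boolean tokenize, String destKey) {
        this.srcKey = srcKey;
        this.lowercase = lowercase;
        this.store = store;
        this.tokenize = tokenize;
        this.destKey = destKey;
    }

    /** Returns the boolean at the given position, or the default if absent or empty. */
    private static boolean flag(String[] parts, int index, boolean defaultValue) {
        if (index >= parts.length || parts[index].isEmpty()) {
            return defaultValue;
        }
        return Boolean.parseBoolean(parts[index]);
    }

    /** Parses one spec, applying the defaults listed in the config comment. */
    static IndexFieldSpecSketch parse(String spec) {
        String[] parts = spec.trim().split(":");
        if (parts.length == 0 || parts[0].isEmpty()) {
            throw new IllegalArgumentException("Missing src-key in: " + spec);
        }
        String src = parts[0];
        String dest = (parts.length > 4 && !parts[4].isEmpty()) ? parts[4] : src;
        return new IndexFieldSpecSketch(src,
                flag(parts, 1, true),   // lowercase defaults to true
                flag(parts, 2, true),   // store defaults to true
                flag(parts, 3, false),  // tokenize defaults to false
                dest);                  // dest-key defaults to src-key
    }

    /** Parses a whitespace-separated property value into a list of specs. */
    static List<IndexFieldSpecSketch> parseAll(String value) {
        List<IndexFieldSpecSketch> specs = new ArrayList<IndexFieldSpecSketch>();
        for (String spec : value.trim().split("\\s+")) {
            if (!spec.isEmpty()) {
                specs.add(parse(spec));
            }
        }
        return specs;
    }
}
```

Under this reading, the configured value "arcname:false" keeps the ARC file name in its original case while still storing it untokenized under the same key, and the bare "collection", "date" and "type" entries take all four defaults.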

Added: trunk/archive-access/projects/nat/archive/conf/search-servers.txt
===================================================================
--- trunk/archive-access/projects/nat/archive/conf/search-servers.txt	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/conf/search-servers.txt	2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1 @@
+localhost 9000

Added: trunk/archive-access/projects/nat/archive/conf/tika-mimetypes.xml
===================================================================
--- trunk/archive-access/projects/nat/archive/conf/tika-mimetypes.xml	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/conf/tika-mimetypes.xml	2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,364 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+	Licensed to the Apache Software Foundation (ASF) under one or more
+	contributor license agreements.  See the NOTICE file distributed with
+	this work for additional information regarding copyright ownership.
+	The ASF licenses this file to You under the Apache License, Version 2.0
+	(the "License"); you may not use this file except in compliance with
+	the License.  You may obtain a copy of the License at
+	
+	http://www.apache.org/licenses/LICENSE-2.0
+	
+	Unless required by applicable law or agreed to in writing, software
+	distributed under the License is distributed on an "AS IS" BASIS,
+	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+	See the License for the specific language governing permissions and
+	limitations under the License.
+	
+	Description: This xml file defines the valid mime types used by Tika.
+	The mime types within this file are based on the types in the mime-types.xml 
+	file available in Apache Nutch.
+-->
+
+<mime-info>
+
+	<mime-type type="text/plain">
+		<magic priority="50">
+			<match value="This is TeX," type="string" offset="0" />
+			<match value="This is METAFONT," type="string" offset="0" />
+		</magic>
+		<glob pattern="*.txt" />
+		<glob pattern="*.asc" />
+	</mime-type>
+
+	<mime-type type="text/html">
+		<magic priority="50">
+			<match value="&lt;!DOCTYPE HTML" type="string"
+				offset="0:64" />
+			<match value="&lt;!doctype html" type="string"
+				offset="0:64" />
+			<match value="&lt;HEAD" type="string" offset="0:64" />
+			<match value="&lt;head" type="string" offset="0:64" />
+			<match value="&lt;TITLE" type="string" offset="0:64" />
+			<match value="&lt;title" type="string" offset="0:64" />
+			<match value="&lt;html" type="string" offset="0:64" />
+			<match value="&lt;HTML" type="string" offset="0:64" />
+			<match value="&lt;BODY" type="string" offset="0" />
+			<match value="&lt;body" type="string" offset="0" />
+			<match value="&lt;TITLE" type="string" offset="0" />
+			<match value="&lt;title" type="string" offset="0" />
+			<match value="&lt;!--" type="string" offset="0" />
+			<match value="&lt;h1" type="string" offset="0" />
+			<match value="&lt;H1" type="string" offset="0" />
+			<match value="&lt;!doctype HTML" type="string" offset="0" />
+			<match value="&lt;!DOCTYPE html" type="string" offset="0" />
+		</magic>
+		<glob pattern="*.html" />
+		<glob pattern="*.htm" />
+	</mime-type>
+
+	<mime-type type="application/xml">
+		<alias type="text/xml" />
+		<glob pattern="*.xml" />
+	</mime-type>
+
+	<mime-type type="application/xhtml+xml">
+		<sub-class-of type="text/xml" />
+		<glob pattern="*.xhtml" />
+		<root-XML namespaceURI='http://www.w3.org/1999/xhtml'
+			localName='html' />
+	</mime-type>
+
+	<mime-type type="application/vnd.ms-powerpoint">
+		<glob pattern="*.ppz" />
+		<glob pattern="*.ppt" />
+		<glob pattern="*.pps" />
+		<glob pattern="*.pot" />
+		<magic priority="50">
+			<match value="0xcfd0e011" type="little32" offset="0" />
+		</magic>
+	</mime-type>
+
+	<mime-type type="application/vnd.ms-excel">
+		<magic priority="50">
+			<match value="Microsoft Excel 5.0 Worksheet" type="string"
+				offset="2080" />
+		</magic>
+		<glob pattern="*.xls" />
+		<glob pattern="*.xlc" />
+		<glob pattern="*.xll" />
+		<glob pattern="*.xlm" />
+		<glob pattern="*.xlw" />
+		<glob pattern="*.xla" />
+		<glob pattern="*.xlt" />
+		<glob pattern="*.xld" />
+		<alias type="application/msexcel" />
+	</mime-type>
+
+	<mime-type type="application/vnd.oasis.opendocument.text">
+		<glob pattern="*.odt" />
+	</mime-type>
+
+
+	<mime-type type="application/zip">
+		<alias type="application/x-zip-compressed" />
+		<magic priority="40">
+			<match value="PK\003\004" type="string" offset="0" />
+		</magic>
+		<glob pattern="*.zip" />
+	</mime-type>
+
+	<mime-type type="application/vnd.oasis.opendocument.text">
+		<glob pattern="*.oth" />
+	</mime-type>
+
+	<mime-type type="application/msword">
+		<magic priority="50">
+			<match value="\x31\xbe\x00\x00" type="string" offset="0" />
+			<match value="PO^Q`" type="string" offset="0" />
+			<match value="\376\067\0\043" type="string" offset="0" />
+			<match value="\333\245-\0\0\0" type="string" offset="0" />
+			<match value="Microsoft Word 6.0 Document" type="string"
+				offset="2080" />
+			<match value="Microsoft Word document data" type="string"
+				offset="2112" />
+		</magic>
+		<glob pattern="*.doc" />
+		<alias type="application/vnd.ms-word" />
+	</mime-type>
+
+	<mime-type type="application/octet-stream">
+		<magic priority="50">
+			<match value="\037\036" type="string" offset="0" />
+			<match value="017437" type="host16" offset="0" />
+			<match value="0x1fff" type="host16" offset="0" />
+			<match value="\377\037" type="string" offset="0" />
+			<match value="0145405" type="host16" offset="0" />
+		</magic>
+		<glob pattern="*.bin" />
+	</mime-type>
+
+	<mime-type type="application/pdf">
+		<magic priority="50">
+			<match value="%PDF-" type="string" offset="0" />
+		</magic>
+		<glob pattern="*.pdf" />
+		<alias type="application/x-pdf" />
+	</mime-type>
+
+	<mime-type type="application/atom+xml">
+		<root-XML localName="feed"
+			namespaceURI="http://purl.org/atom/ns#" />
+	</mime-type>
+
+	<mime-type type="application/mac-binhex40">
+		<glob pattern="*.hqx" />
+	</mime-type>
+
+	<mime-type type="application/mac-compactpro">
+		<glob pattern="*.cpt" />
+	</mime-type>
+
+	<mime-type type="application/rtf">
+	    <glob pattern="*.rtf"/>
+		<alias type="text/rtf" />
+	</mime-type>
+
+	<mime-type type="application/rss+xml">
+		<alias type="text/rss" />
+		<root-XML localName="rss" />
+		<root-XML namespaceURI="http://purl.org/rss/1.0/" />
+		<glob pattern="*.rss" />
+	</mime-type>
+
+	<!--  added in by mattmann -->
+	<mime-type type="application/x-mif">
+		<alias type="application/vnd.mif" />
+	</mime-type>
+
+	<mime-type type="application/vnd.wap.wbxml">
+		<glob pattern="*.wbxml" />
+	</mime-type>
+
+	<mime-type type="application/vnd.wap.wmlc">
+		<_comment>Compiled WML Document</_comment>
+		<glob pattern="*.wmlc" />
+	</mime-type>
+
+	<mime-type type="application/vnd.wap.wmlscriptc">
+		<_comment>Compiled WML Script</_comment>
+		<glob pattern="*.wmlsc" />
+	</mime-type>
+
+	<mime-type type="text/vnd.wap.wmlscript">
+		<_comment>WML Script</_comment>
+		<glob pattern="*.wmls" />
+	</mime-type>
+
+	<mime-type type="application/x-bzip">
+		<alias type="application/x-bzip2" />
+	</mime-type>
+
+	<mime-type type="application/x-bzip-compressed-tar">
+		<glob pattern="*.tbz" />
+		<glob pattern="*.tbz2" />
+	</mime-type>
+
+	<mime-type type="application/x-cdlink">
+		<_comment>Virtual CD-ROM CD Image File</_comment>
+		<glob pattern="*.vcd" />
+	</mime-type>
+
+	<mime-type type="application/x-director">
+		<_comment>Shockwave Movie</_comment>
+		<glob pattern="*.dcr" />
+		<glob pattern="*.dir" />
+		<glob pattern="*.dxr" />
+	</mime-type>
+
+	<mime-type type="application/x-futuresplash">
+		<_comment>Macromedia FutureSplash File</_comment>
+		<glob pattern="*.spl" />
+	</mime-type>
+
+	<mime-type type="application/x-java">
+		<alias type="application/java" />
+	</mime-type>
+
+	<mime-type type="application/x-koan">
+		<_comment>SSEYO Koan File</_comment>
+		<glob pattern="*.skp" />
+		<glob pattern="*.skd" />
+		<glob pattern="*.skt" />
+		<glob pattern="*.skm" />
+	</mime-type>
+
+	<mime-type type="application/x-latex">
+		<_comment>LaTeX Source Document</_comment>
+		<glob pattern="*.latex" />
+	</mime-type>
+
+	<!-- JC CHANGED
+		<mime-type type="application/x-mif">
+		<_comment>FrameMaker MIF document</_comment>
+		<glob pattern="*.mif"/>
+		</mime-type> -->
+
+	<mime-type type="application/ogg">
+		<alias type="application/x-ogg" />
+	</mime-type>
+
+	<mime-type type="application/x-rar">
+		<alias type="application/x-rar-compressed" />
+	</mime-type>
+
+	<mime-type type="application/x-shellscript">
+		<alias type="application/x-sh" />
+	</mime-type>
+
+	<mime-type type="application/xhtml+xml">
+		<glob pattern="*.xht" />
+	</mime-type>
+
+	<mime-type type="audio/midi">
+		<glob pattern="*.kar" />
+	</mime-type>
+
+	<mime-type type="audio/x-pn-realaudio">
+		<alias type="audio/x-realaudio" />
+	</mime-type>
+
+	<mime-type type="image/tiff">
+		<magic priority="50">
+			<match value="0x4d4d2a00" type="string" offset="0" />
+			<match value="0x49492a00" type="string" offset="0" />
+		</magic>
+	</mime-type>
+
+	<mime-type type="message/rfc822">
+		<magic priority="50">
+			<match type="string" value="Relay-Version:" offset="0" />
+			<match type="string" value="#! rnews" offset="0" />
+			<match type="string" value="N#! rnews" offset="0" />
+			<match type="string" value="Forward to" offset="0" />
+			<match type="string" value="Pipe to" offset="0" />
+			<match type="string" value="Return-Path:" offset="0" />
+			<match type="string" value="From:" offset="0" />
+			<match type="string" value="Message-ID:" offset="0" />
+			<match type="string" value="Date:" offset="0" />
+		</magic>
+	</mime-type>
+	
+	<mime-type type="application/x-javascript">
+        <glob pattern="*.js" />
+    </mime-type>
+    
+
+	<mime-type type="image/vnd.wap.wbmp">
+		<_comment>Wireless Bitmap File Format</_comment>
+		<glob pattern="*.wbmp" />
+	</mime-type>
+
+	<mime-type type="image/x-psd">
+		<alias type="image/photoshop" />
+	</mime-type>
+
+	<mime-type type="image/x-xcf">
+		<alias type="image/xcf" />
+		<magic priority="50">
+			<match type="string" value="gimp xcf " offset="0" />
+		</magic>
+	</mime-type>
+	
+	<mime-type type="application/x-shockwave-flash">
+      <glob pattern="*.swf"/>
+      <magic priority="50">
+        <match type="string" value="FWS" offset="0"/>
+        <match type="string" value="CWS" offset="0"/>
+      </magic>
+    </mime-type>
+
+	<mime-type type="model/iges">
+		<_comment>
+			Initial Graphics Exchange Specification Format
+		</_comment>
+		<glob pattern="*.igs" />
+		<glob pattern="*.iges" />
+	</mime-type>
+
+	<mime-type type="model/mesh">
+		<glob pattern="*.msh" />
+		<glob pattern="*.mesh" />
+		<glob pattern="*.silo" />
+	</mime-type>
+
+	<mime-type type="model/vrml">
+		<glob pattern="*.vrml" />
+	</mime-type>
+
+	<mime-type type="text/x-tcl">
+		<alias type="application/x-tcl" />
+	</mime-type>
+
+	<mime-type type="text/x-tex">
+		<alias type="application/x-tex" />
+	</mime-type>
+
+	<mime-type type="text/x-texinfo">
+		<alias type="application/x-texinfo" />
+	</mime-type>
+
+	<mime-type type="text/x-troff-me">
+		<alias type="application/x-troff-me" />
+	</mime-type>
+
+	<mime-type type="video/vnd.mpegurl">
+		<glob pattern="*.mxu" />
+	</mime-type>
+
+	<mime-type type="x-conference/x-cooltalk">
+		<_comment>Cooltalk Audio</_comment>
+		<glob pattern="*.ice" />
+	</mime-type>
+
+</mime-info>

Added: trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.jar
===================================================================
(Binary files differ)


Property changes on: trunk/archive-access/projects/nat/archive/lib/commons-2.0.1-SNAPSHOT.jar
___________________________________________________________________
Name: svn:mime-type
   + application/octet-stream

Added: trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.jar
===================================================================
(Binary files differ)


Property changes on: trunk/archive-access/projects/nat/archive/lib/commons-httpclient-3.0.1.jar
___________________________________________________________________
Name: svn:mime-type
   + application/octet-stream

Added: trunk/archive-access/projects/nat/archive/lib/fastutil-5.0.3.jar
===================================================================
(Binary files differ)


Property changes on: trunk/archive-access/projects/nat/archive/lib/fastutil-5.0.3.jar
___________________________________________________________________
Name: svn:mime-type
   + application/octet-stream

Added: trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java
===================================================================
--- trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcReader.java	2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,273 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.archive.nutchwax;
+
+import java.util.Iterator;
+import java.util.Map;
+import java.util.HashMap;
+import java.io.IOException;
+
+import org.archive.io.ArchiveReader;
+import org.archive.io.ArchiveReaderFactory;
+import org.archive.io.ArchiveRecord;
+import org.archive.io.ArchiveRecordHeader;
+
+import org.archive.io.arc.ARCConstants;
+import org.archive.io.arc.ARCReader;
+import org.archive.io.arc.ARCRecord;
+import org.archive.io.arc.ARCRecordMetaData;
+import org.archive.io.warc.WARCConstants;
+import org.archive.io.warc.WARCRecord;
+
+import org.apache.commons.httpclient.Header;
+
+
+/**
+ * <p>
+ *   Reader of both ARC and WARC format archive files.  This is not a
+ *   general-purpose archive file reader, but is written specifically
+ *   for NutchWAX.  It's possible that this could become a
+ *   general-purpose archive file reader, but for now, consider it
+ *   custom-tailored to the needs of NutchWAX.
+ * </p>
+ * <p>
+ *   <code>ArcReader</code> is a wrapper around the underlying
+ *   <code>ArchiveReader</code> implementation
+ *   (<code>ARCReader</code>/<code>WARCReader</code>) which converts
+ *   <code>WARCRecord</code>s to <code>ARCRecord</code>s on the fly.
+ * </p>
+ * <p>
+ *   If an <code>ARCReader</code> is being wrapped, then the
+ *   underlying <code>ARCRecord</code>s are read and passed-through
+ *   unmolested.
+ * </p>
+ * <p>
+ *   If a <code>WARCReader</code> is being wrapped, then the
+ *   <code>WARCRecord</code>s are converted to <code>ARCRecord</code>s
+ *   on the fly.
+ * </p>
+ * <p>
+ *   <strong>WARNING:</strong> We only convert WARC
+ *   <code>response</code> records.  All other WARC record types are
+ *   returned as <code>null</code> by the iterator's
+ *   <code>next()</code> method.  So, when using the iterator, don't
+ *   forget to check for a <code>null</code> value returned by
+ *   <code>next()</code>.
+ * </p>
+ */
+public class ArcReader implements Iterable<ARCRecord>
+{
+  private ArchiveReader reader;
+
+  /**
+   * Construct an <code>ArcReader</code> wrapping an
+   * <code>ArchiveReader</code> instance.
+   *
+   * @param reader the ArchiveReader instance to wrap
+   */
+  public ArcReader( ArchiveReader reader )
+  {
+    this.reader = reader;
+  }
+
+  /**
+   * Returns an iterator over <code>ARCRecord</code>s in the wrapped
+   * <code>ArchiveReader</code>, converting <code>WARCRecords</code>
+   * to <code>ARCRecords</code> on-the-fly.
+   *
+   * @return an iterator
+   */
+  public Iterator<ARCRecord> iterator( )
+  {
+    return new ArcIterator( );
+  }
+
+  /**
+   * 
+   */
+  private class ArcIterator implements Iterator<ARCRecord>
+  {
+    private Iterator<ArchiveRecord> i;
+
+    /**
+     * Construct an <code>ArcIterator</code>, skipping the header
+     * record if the wrapped reader is an <code>ARCReader</code>.
+     */
+    public ArcIterator( )
+    {
+      this.i = ArcReader.this.reader.iterator( );
+      
+      if ( ArcReader.this.reader instanceof ARCReader )
+        {
+          // Skip the first record, which is a "filedesc://"
+          // record describing the ARC file.
+          if ( this.i.hasNext( ) ) this.i.next( );
+        }
+    }
+
+    /**
+     * Returns <code>true</code> if the iteration has more elements.
+     * Will return <code>true</code> even if the next call to
+     * <code>next()</code> returns <code>null</code>.
+     *
+     * @return <code>true</code> if the iterator has more elements.
+     */
+    public boolean hasNext( )
+    {
+      return this.i.hasNext( );
+    }
+    
+    /**
+     * Returns the next element in the iteration. Calling this method
+     * repeatedly until the <code>hasNext()</code> method returns
+     * <code>false</code> will return each element in the underlying
+     * collection exactly once.
+     * 
+     * @return the next element in the iteration, which can be <code>null</code>
+     */
+    public ARCRecord next( )
+    {
+      try
+        {
+          ArchiveRecord record = this.i.next( );
+          
+          if ( record instanceof ARCRecord )
+            {
+              // Just return the ARCRecord as-is.
+              ARCRecord arc = (ARCRecord) record;
+              
+              return arc;
+            }
+          
+          if ( record instanceof WARCRecord )
+            {
+              WARCRecord warc = (WARCRecord) record;
+              
+              ARCRecord arc = convert( warc );
+
+              return arc;
+            }
+
+          // If we get here then the record we read in was neither an ARC
+          // nor a WARC record.  What is a good exception to throw?
+          throw new RuntimeException( "Record neither ARC nor WARC: " + record.getClass( ) );
+        }
+      catch ( IOException ioe )
+        {
+          throw new RuntimeException( ioe );
+        }
+    }
+
+    /**
+     * Unsupported optional operation.
+     *
+     * @throws UnsupportedOperationException
+     */
+    public void remove( )
+    {
+      throw new UnsupportedOperationException( );
+    }
+
+    /**
+     * Convert a WARCRecord to an ARCRecord.  Only "response"
+     * WARCRecords are converted to meaningful ARCRecords.  All other
+     * WARCRecord types are converted to <code>null</code>.
+     *
+     * @param warc the WARCRecord to convert
+     * @return the corresponding ARCRecord, or <code>null</code> if the WARCRecord is not a "response" record
+     */
+    private ARCRecord convert( WARCRecord warc )
+      throws IOException
+    {
+      ArchiveRecordHeader header = warc.getHeader( );
+      
+      // We only care about "response" WARC records.
+      if ( ! WARCConstants.RESPONSE.equals( header.getHeaderValue( WARCConstants.HEADER_KEY_TYPE ) ) )
+        {
+          return null;
+        }
+              
+      // Construct an ARCRecordMetadata object based on the info in
+      // the ArchiveRecordHeader.
+      Map arcMetadataFields = new HashMap( );
+      arcMetadataFields.put( ARCConstants.URL_FIELD_KEY,       header.getHeaderValue( WARCConstants.HEADER_KEY_URI  ) );
+      arcMetadataFields.put( ARCConstants.IP_HEADER_FIELD_KEY, header.getHeaderValue( WARCConstants.HEADER_KEY_IP   ) );
+      arcMetadataFields.put( ARCConstants.DATE_FIELD_KEY,      header.getHeaderValue( WARCConstants.HEADER_KEY_DATE ) );
+      arcMetadataFields.put( ARCConstants.MIMETYPE_FIELD_KEY,  header.getHeaderValue( null ) );  // We don't know the MIME type of the *payload* in a WARC (yet)
+      arcMetadataFields.put( ARCConstants.LENGTH_FIELD_KEY,    header.getHeaderValue( WARCConstants.CONTENT_LENGTH  ) );
+      arcMetadataFields.put( ARCConstants.VERSION_FIELD_KEY,   header.getHeaderValue( null ) );  // FIXME: Do we need actual values for these?
+      arcMetadataFields.put( ARCConstants.ABSOLUTE_OFFSET_KEY, header.getHeaderValue( null ) );  // FIXME: Do we need actual values for these?
+              
+      ARCRecordMetaData metadata = new ARCRecordMetaData( header.getReaderIdentifier( ), arcMetadataFields );
+              
+      // Then, create an ARCRecord using the WARCRecord and the
+      // ARCRecordMetaData object we just created.
+      ARCRecord arc = new ARCRecord( warc, 
+                                     metadata,
+                                     0,  // offset
+                                     ArcReader.this.reader.isDigest( ),
+                                     ArcReader.this.reader.isStrict( ),
+                                     true  // parse HTTP headers
+                                   );
+      
+      // Now that we've created the ARCRecord, we get the HTTP headers
+      // from it.  From these HTTP headers, we obtain the Content-Type
+      // of the ARCRecord's payload, then set that value as the MIME-type
+      // of the ARCRecord itself.
+      
+      // If the response is something other than HTTP
+      // (like DNS) there are no HTTP headers.  
+      if ( arc.getHttpHeaders( ) != null )
+        {
+          for ( Header h : arc.getHttpHeaders( ) )
+            {
+              if ( h.getName( ).equals( "Content-Type" ) )
+                {
+                  arc.getMetaData( ).getHeaderFields( ).put( ARCConstants.MIMETYPE_FIELD_KEY, h.getValue( ) );
+                }
+            }
+        }
+      
+      return arc;
+    }
+
+  }
+
+  /**
+   * Simple test/debug driver to read an archive file and print out
+   * the header for each record.
+   */
+  public static void main( String args[] ) throws Exception
+  {
+    if ( args.length != 1 )
+      {
+        System.out.println( "ArcReader <(w)arc file>" );
+        System.exit( 1 );
+      }
+
+    String arcName = args[0];
+
+    ArchiveReader r = ArchiveReaderFactory.get( arcName );
+
+    ArcReader reader = new ArcReader( r );
+
+    for ( ARCRecord rec : reader )
+      {
+        if ( rec != null ) System.out.println( rec.getHeader( ) );
+      }
+  }
+}
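
The Javadoc above warns that the iterator can return <code>null</code> for WARC records that are not "response" records, so every caller has to check for it. One way a caller could avoid repeating that check is to wrap the iterator in a filter that buffers the next non-null element. The following is only a sketch under that assumption: <code>SkipNullIterator</code> and <code>SkipNullDemo</code> are hypothetical names that are not part of this commit, and plain strings stand in for <code>ARCRecord</code>s.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical helper (not part of the commit): hides null elements of a wrapped iterator. */
class SkipNullIterator<T> implements Iterator<T> {
  private final Iterator<T> inner;
  private T lookahead;  // next non-null element, or null if nothing is buffered

  SkipNullIterator(Iterator<T> inner) {
    this.inner = inner;
  }

  public boolean hasNext() {
    // Advance until a non-null element is buffered or the source is exhausted.
    while (lookahead == null && inner.hasNext()) {
      lookahead = inner.next();
    }
    return lookahead != null;
  }

  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    T result = lookahead;
    lookahead = null;
    return result;
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}

class SkipNullDemo {
  public static void main(String[] args) {
    // Stand-in data: the nulls play the role of non-"response" WARC records.
    List<String> records = Arrays.asList("a", null, "b", null);
    Iterator<String> it = new SkipNullIterator<String>(records.iterator());
    while (it.hasNext()) {
      System.out.println(it.next());
    }
  }
}
```

With a wrapper like this, <code>hasNext()</code> only reports elements that will actually be returned, at the cost of reading ahead in the underlying iterator.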

Added: trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcsToSegment.java
===================================================================
--- trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcsToSegment.java	                        (rev 0)
+++ trunk/archive-access/projects/nat/archive/src/java/org/archive/nutchwax/ArcsToSegment.java	2008-05-14 00:20:24 UTC (rev 2265)
@@ -0,0 +1,553 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.archive.nutchwax;
+
+import java.io.IOException;
+import java.net.MalformedURLException;
+import java.util.Map.Entry;
+import java.util.Iterator;
+
+import org.apache.commons.logging.Log;
+import org.apache.commons.logging.LogFactory;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.conf.Configured;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.io.Text;
+import org.apache.hadoop.io.Writable;
+import org.apache.hadoop.io.WritableComparable;
+import org.apache.hadoop.mapred.JobClient;
+import org.apache.hadoop.mapred.JobConf;
+import org.apache.hadoop.mapred.Mapper;
+import org.apache.hadoop.mapred.OutputCollector;
+import org.apache.hadoop.mapred.Reporter;
+import org.apache.hadoop.mapred.TextInputFormat;
+import org.apache.hadoop.mapred.TextOutputFormat;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.hadoop.util.Tool;
+import org.apache.hadoop.util.ToolRunner;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.NutchWritable;
+import org.apache.nutch.crawl.SignatureFactory;
+import org.apache.nutch.fetcher.FetcherOutputFormat;
+import org.apache.nutch.metadata.Metadata;
+import org.apache.nutch.metadata.Nutch;
+import org.apache.nutch.net.URLFilters;
+import org.apache.nutch.net.URLFilterException;
+import org.apache.nutch.net.URLNormalizers;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseImpl;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.parse.ParseUtil;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.ProtocolStatus;
+import org.apache.nutch.scoring.ScoringFilters;
+import org.apache.nutch.util.LogUtil;
+import org.apache.nutch.util.NutchConfiguration;
+import org.apache.nutch.util.NutchJob;
+import org.apache.nutch.util.StringUtil;
+
+import org.archive.io.ArchiveReader;
+import org.archive.io.ArchiveReaderFactory;
+import org.archive.io.arc.ARCRecord;
+import org.archive.io.arc.ARCRecordMetaData;
+
+
+/**
+ * Convert archive files (.arc/.warc) to a Nutch segment.  This is
+ * sometimes called "importing" and other times "converting"; the
+ * terms are equivalent.
+ *
+ * <code>ArcsToSegment</code> is coded as a Hadoop job and is intended
+ * to be run within the Hadoop framework, or at least started by the
+ * Hadoop launcher incorporated into Nutch.  Although there is a
+ * <code>main</code> driver, the Nutch launcher script is strongly
+ * recommended.
+ *
+ * This class was initially adapted from the Nutch
+ * <code>Fetcher</code> class.  The premise is that, since the Nutch
+ * fetching process acquires external content and places it in a Nutch
+ * segment, we can perform a similar activity by taking content from
+ * the ARC files and placing that content in a Nutch segment in a
+ * similar fashion.  Ideally, once the <code>ArcsToSegment</code> is
+ * used to import a set of ARCs into a Nutch segment, the resulting
+ * segment should be more-or-less the same as one created by Nutch's
+ * own Fetcher.
+ * 
+ * Since we are mimicking the Nutch Fetcher, we have to be careful
+ * about some implementation details that might not seem relevant
+ * to the importing of ARC files.  I've noted those details with
+ * comments prefaced with "?:".
+ */
+public class ArcsToSegment extends Configured implements Tool, Mapper
+{
+
+  public static final Log LOG = LogFactory.getLog( ArcsToSegment.class );
+
+  private JobConf        jobConf;
+  private URLFilters     urlFilters;
+  private ScoringFilters scfilters;
+  private ParseUtil      parseUtil;
+  private URLNormalizers normalizers;
+  private int            interval;
+
+  private long           numSkipped;
+  private long           numImported;
+  private long           bytesSkipped;
+  private long           bytesImported;
+
+  /**
+   * ?: Is this necessary?
+   */
+  public ArcsToSegment()
+  {
+    
+  }
+
+  /**
+   * <p>Constructor that sets the job configuration.</p>
+   * 
+   * @param conf
+   */
+  public ArcsToSegment( Configuration conf )
+  {
+    setConf( conf );
+  }
+
+  /**
+   * <p>Configures the job.  Sets the url filters, scoring filters, url normalizers
+   * and other relevant data.</p>
+   * 
+   * @param job The job configuration.
+   */
+  public void configure( JobConf job )
+  {
+    // set the url filters, scoring filters, the parse util and the url
+    // normalizers
+    this.jobConf     = job;
+    this.urlFilters  = new URLFilters    ( jobConf );
+    this.scfilters   = new ScoringFilters( jobConf );
+    this.parseUtil   = new ParseUtil     ( jobConf );
+    this.normalizers = new URLNormalizers( jobConf, URLNormalizers.SCOPE_FETCHER );
+    this.interval    = jobConf.getInt( "db.fetch.interval.default", 2592000      );
+  }
+
+  /**
+   * In Mapper interface.
+   * {@inheritDoc}
+   */
+  public void close()
+  {
+    
+  }
+
+  /**
+   * <p>Runs the Map job to translate an arc file into output for Nutch 
+   * segments.</p>
+   * 
+   * @param key Line number in manifest corresponding to the <code>value</code>
+   * @param value A line from the manifest
+   * @param output The output collector.
+   * @param reporter The progress reporter.
+   */
+  public void map( WritableComparable key, 
+                   Writable           value, 
+                   OutputCollector    output, 
+                   Reporter           reporter )
+    throws IOException
+  {
+    String arcUrl      = "";
+    String collection  = "";
+    String segmentName = getConf().get( Nutch.SEGMENT_NAME_KEY );
+    
+    // Each line of the manifest is "<url> <collection>" where <collection> is optional
+    String[] line = value.toString().split( " " );
+    arcUrl = line[0];
+
+    if ( line.length > 1 )
+      {
+        collection = line[1];
+      }
+
+    if ( LOG.isInfoEnabled() ) LOG.info( "Importing ARC: " + arcUrl );
+
+    ArchiveReader r = ArchiveReaderFactory.get( arcUrl );
+
+    ArcReader reader = new ArcReader( r );
+
+    try
+      {
+        for ( ARCRecord record : reader )
+          {
+            // When reading WARC files, records of type other than
+            // "response" are returned as 'null' by the Iterator, so
+            // we skip them.
+            if ( record == null ) continue ;
+
+            importRecord( record, segmentName, collection, output );
+
+            // FIXME: What does this do exactly?
+            reporter.progress();
+          }
+      }
+    finally
+      {
+        r.close();
+
+        if ( LOG.isInfoEnabled() ) 
+          {
+            LOG.info( "Completed ARC: "  + arcUrl );
+            LOG.info( "URLs skipped : " + this.numSkipped  );
+            LOG.info( "URLs imported: " + this.numImported );
+            LOG.info( "URLs total   : " + ( this.numSkipped + this.numImported ) );
+          }
+      }
+    
+  }
+
+  /**
+   * Import an ARCRecord.
+   *
+   * @param record
+   * @param segmentName 
+   * @param collectionName
+   * @param output
+   * @return whether record was imported or not (i.e. filtered out due to URL filtering rules, etc.)
+   */
+  private boolean importRecord( ARCRecord record, String segmentName, String collectionName, OutputCollector output )
+  {
+    ARCRecordMetaData meta = record.getMetaData();
+    
+    if ( LOG.isInfoEnabled() ) LOG.info( "Consider URL: " + meta.getUrl() + " (" + meta.getMimetype() + ")" );
+
+    /* ?: On second thought, DON'T do this.  Even if we don't have a
+       parser registered for a content-type, we still want to index
+       its URL and possibly other meta-data.
+    */
+    /*
+    // First, check to see if we have a parser registered for the
+    // URL's Content-Type, so we don't read in some huge video file
+    // only to discover we don't have a parser for it.
+    if ( ! this.hasRegisteredParser( meta.getMimetype() ) )
+      {
+        if ( LOG.isInfoEnabled() ) LOG.info( "No parser registered for: "  + meta.getMimetype() );
+        
+        this.numSkipped++;
+        this.bytesSkipped += meta.getLength();
+        
+        return false ;
+      }
+    */
+
+    // ?: Arguably, we shouldn't be normalizing nor filtering based
+    // on the URL.  If the document made it into the (W)ARC file, then
+    // it should be indexed.  But then again, the normalizers and
+    // filters can be disabled in the Nutch configuration files.
+    String url = this.normalizeAndFilterUrl( meta.getUrl() );
+    
+    if ( url == null )
+      {
+        if ( LOG.isInfoEnabled() ) LOG.info( "Skip     URL: "  + meta.getUrl() );
+        
+        this.numSkipped++;
+        this.bytesSkipped += meta.getLength();
+        
+        return false;
+      }
+    
+    // URL is good, let's import the content.
+    if ( LOG.isInfoEnabled() ) LOG.info( "Import   URL: " + meta.getUrl() );
+    this.numImported++;
+    this.bytesImported += meta.getLength();
+    
+    try
+      {
+        ...
 
[truncated message content]

[Archive-access-cvs] SF.net SVN: archive-access: [2264] trunk/archive-access/projects/ access-control

From: <al...@us...> - 2008-05-12 01:00:30

Revision: 2264
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2264&view=rev
Author:   alexoz
Date:     2008-05-11 17:59:30 -0700 (Sun, 11 May 2008)

Log Message:
-----------
Added work in progress administrator manual and developer manual stub.

Added Paths:
-----------
    trunk/archive-access/projects/access-control/dist/
    trunk/archive-access/projects/access-control/dist/src/
    trunk/archive-access/projects/access-control/dist/src/site/
    trunk/archive-access/projects/access-control/dist/src/site/xdoc/
    trunk/archive-access/projects/access-control/dist/src/site/xdoc/administrator_manual.xml
    trunk/archive-access/projects/access-control/dist/src/site/xdoc/developer_manual.xml

Added: trunk/archive-access/projects/access-control/dist/src/site/xdoc/administrator_manual.xml
===================================================================
--- trunk/archive-access/projects/access-control/dist/src/site/xdoc/administrator_manual.xml	                        (rev 0)
+++ trunk/archive-access/projects/access-control/dist/src/site/xdoc/administrator_manual.xml	2008-05-12 00:59:30 UTC (rev 2264)
@@ -0,0 +1,104 @@
+<document>
+  <properties>
+    <title>Stayback Administrator Manual</title>
+    <author email="aosborne nla gov au">Alex Osborne</author>
+  </properties>
+  <body>
+
+    <section name="Requirements">
+      <ul>
+        <li>Java 1.5 or later</li>
+        <li>A servlet container such
+        as <a href="http://tomcat.apache.org/">Tomcat</a></li>
+        <li>A database that is
+        supported by <a href="http://hibernate.org">Hibernate</a> (all the
+        usual suspects are).</li>
+      </ul>
+    </section>
+
+    <section name="Download">
+      <p>A first release of the project has not yet been made.  In
+        the meantime you should be able to get a recent snapshot
+        from the Internet
+        Archive's <a href="http://builds.archive.org:8081/">build
+        server</a>.  Select "Show projects", "Access control: Oracle
+        Webapp", "Working copy" and
+        download <tt>target/oracle-0.0.1-SNAPSHOT.war</tt>.
+        Alternatively you can build the project from source, see the
+        <a href="developer_manual.html">Developer Manual</a> for
+        instructions.</p>
+    </section>
+
+    <section name="Installation">
+      <subsection name="General">
+        <ul>
+          <li>Create a user and database to store the access control rules.</li>
+          <li>Deploy the oracle webapp to your application server (e.g. Apache Tomcat).</li>
+          <li>Download the appropriate JDBC connector for your database and drop it
+            in WEB-INF/lib.</li>
+          <li>Configure the database in the dataSource and sessionFactory beans in
+            WEB-INF/applicationContext.xml.</li>
+        </ul>
+      </subsection>
+      <subsection name="MySQL">
+
+        <p>Create a user and database to store the access control rules:</p>
+
+        <pre>
+          CREATE USER 'stayback'@ 'localhost' IDENTIFIED BY 'password';
+          GRANT USAGE ON * . * TO 'stayback'@ 'localhost' IDENTIFIED BY 'password';
+          CREATE DATABASE `stayback`;
+          GRANT ALL PRIVILEGES ON `stayback` . * TO 'stayback'@ 'localhost';
+        </pre>
+
+        <p>Deploy the oracle webapp to Tomcat.</p>
+
+        <p>Download <a href="http://www.mysql.com/products/connector-j">MySQL Connector/J</a> and copy
+          <tt>mysql-connector-java-*-bin.jar</tt> to <tt>WEB-INF/lib</tt>.</p>
+
+
+        <p>Configure the database in <tt>WEB-INF/applicationContext.xml</tt>:</p>
+
+
+        <pre>
+    &lt;bean id="dataSource"
+          class="org.apache.commons.dbcp.BasicDataSource"
+          destroy-method="close"&gt;
+      &lt;property name="driverClassName" value="com.mysql.jdbc.Driver" /&gt;
+      &lt;property name="url" value="jdbc:mysql://localhost/stayback" /&gt;
+      &lt;property name="username" value="stayback" /&gt;
+      &lt;property name="password" value="password" /&gt;
+    &lt;/bean&gt;
+
+
+    &lt;bean id="sessionFactory" [...] &gt;
+      [...]
+      &lt;property name="hibernateProperties"&gt;
+        &lt;value&gt;
+          hibernate.dialect=org.hibernate.dialect.MySQLDialect
+          hibernate.hbm2ddl.auto=create
+        &lt;/value&gt;
+      &lt;/property&gt;
+    &lt;/bean&gt;
+        </pre>
+
+        <p>The <tt>hibernate.hbm2ddl.auto=create</tt> option will cause Hibernate to
+          automatically create the tables in the database.</p>
+      </subsection>
+    </section>
+
+    <section name="Configuring clients">
+      <subsection name="Wayback">
+        TODO: Write this.  For now see the oracle section in the example wayback.xml.
+      </subsection>
+      <subsection name="NutchWAX">
+        Stayback client has not yet been integrated into NutchWAX.
+      </subsection>
+      <subsection name="Others">
+        <p>See the <a href="developer_manual.html">developer
+        manual</a> for information about integrating Stayback into
+        other software.</p>
+      </subsection>
+    </section>
+  </body>
+</document>

Added: trunk/archive-access/projects/access-control/dist/src/site/xdoc/developer_manual.xml
===================================================================
--- trunk/archive-access/projects/access-control/dist/src/site/xdoc/developer_manual.xml	                        (rev 0)
+++ trunk/archive-access/projects/access-control/dist/src/site/xdoc/developer_manual.xml	2008-05-12 00:59:30 UTC (rev 2264)
@@ -0,0 +1,11 @@
+<document>
+  <properties>
+    <title>Stayback Developer Manual</title>
+    <author email="aosborne nla gov au">Alex Osborne</author>
+  </properties>
+  <body>
+    <p>For now see
+    the <a href="http://webteam.archive.org/confluence/display/wayback/Exclusions+API">notes</a>
+    on the Wayback wiki.</p>
+  </body>
+</document>


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.

[Archive-access-cvs] SF.net SVN: archive-access: [2263] trunk/archive-access/projects/wayback/ wayback-core/src/main/java/org/archive/wayback/authenticationcontrol/ AccessControlSettingOperation.java

From: <bra...@us...> - 2008-05-05 21:36:19

Revision: 2263
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2263&view=rev
Author:   bradtofel
Date:     2008-05-05 14:36:23 -0700 (Mon, 05 May 2008)

Log Message:
-----------
INITIAL REV: more than enough rope to hang yourself with this class -- allows for dynamic setting of the active ExclusionFilterFactory per request, based on whatever logic is set in the BooleanOperator.

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/authenticationcontrol/AccessControlSettingOperation.java

Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/authenticationcontrol/AccessControlSettingOperation.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/authenticationcontrol/AccessControlSettingOperation.java	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/authenticationcontrol/AccessControlSettingOperation.java	2008-05-05 21:36:23 UTC (rev 2263)
@@ -0,0 +1,34 @@
+package org.archive.wayback.authenticationcontrol;
+
+import org.archive.wayback.accesscontrol.ExclusionFilterFactory;
+import org.archive.wayback.core.WaybackRequest;
+import org.archive.wayback.util.operator.BooleanOperator;
+
+public class AccessControlSettingOperation implements BooleanOperator<WaybackRequest> {
+
+	private ExclusionFilterFactory factory = null;
+	private BooleanOperator<WaybackRequest> operator = null;
+	
+	public boolean isTrue(WaybackRequest value) {
+		if(operator.isTrue(value)) {
+			value.setExclusionFilter(factory.get());
+		}
+		return true;
+	}
+
+	public ExclusionFilterFactory getFactory() {
+		return factory;
+	}
+
+	public void setFactory(ExclusionFilterFactory factory) {
+		this.factory = factory;
+	}
+
+	public BooleanOperator<WaybackRequest> getOperator() {
+		return operator;
+	}
+
+	public void setOperator(BooleanOperator<WaybackRequest> operator) {
+		this.operator = operator;
+	}
+}
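
The class above uses a <code>BooleanOperator</code> purely for its side effect: when the configured operator matches, the request's exclusion filter is swapped, and <code>isTrue</code> returns <code>true</code> either way. The pattern can be sketched in isolation as follows. The interface and the <code>Request</code> class below are simplified, hypothetical stand-ins written for this example, not the real Wayback types, and a filter name string stands in for the object produced by <code>ExclusionFilterFactory</code>.

```java
/** Hypothetical stand-in for org.archive.wayback.util.operator.BooleanOperator. */
interface BooleanOperator<T> {
  boolean isTrue(T value);
}

/** Hypothetical stand-in for WaybackRequest: it only records which filter was chosen. */
class Request {
  final String remoteAddr;
  String exclusionFilter = "default";

  Request(String remoteAddr) {
    this.remoteAddr = remoteAddr;
  }
}

/** Mirrors the pattern in the commit: set a filter when the guard matches, always return true. */
class FilterSettingOperation implements BooleanOperator<Request> {
  private final BooleanOperator<Request> guard;
  private final String filterName;

  FilterSettingOperation(BooleanOperator<Request> guard, String filterName) {
    this.guard = guard;
    this.filterName = filterName;
  }

  public boolean isTrue(Request value) {
    if (guard.isTrue(value)) {
      value.exclusionFilter = filterName;
    }
    return true;  // never vetoes the request; it only has a side effect
  }
}

class FilterSettingDemo {
  public static void main(String[] args) {
    // Assumed rule for illustration: requests from 10.x addresses get the "staff" filter.
    FilterSettingOperation op =
        new FilterSettingOperation(r -> r.remoteAddr.startsWith("10."), "staff");

    Request inside = new Request("10.0.0.5");
    Request outside = new Request("192.0.2.1");
    op.isTrue(inside);
    op.isTrue(outside);

    System.out.println(inside.exclusionFilter);
    System.out.println(outside.exclusionFilter);
  }
}
```

Because the operation always evaluates to <code>true</code>, it can presumably be placed in a chain of operators without changing the chain's outcome, which matches the "enough rope" caveat in the log message.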



[Archive-access-cvs] SF.net SVN: archive-access: [2262] trunk/archive-access/projects/wayback/ dist/src/scripts/location-db

From: <bra...@us...> - 2008-05-05 21:34:09

Revision: 2262
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2262&view=rev
Author:   bradtofel
Date:     2008-05-05 14:34:10 -0700 (Mon, 05 May 2008)

Log Message:
-----------
INITIAL REV: command line tool for accessing main() in FileLocationDB

Added Paths:
-----------
    trunk/archive-access/projects/wayback/dist/src/scripts/location-db

Added: trunk/archive-access/projects/wayback/dist/src/scripts/location-db
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/scripts/location-db	                        (rev 0)
+++ trunk/archive-access/projects/wayback/dist/src/scripts/location-db	2008-05-05 21:34:10 UTC (rev 2262)
@@ -0,0 +1,82 @@
+#!/usr/bin/env sh
+##
+## This script allows querying and updating of a remote LocationDB from the
+## command line, including synchronizing the LocationDB with an entire directory 
+## of ARC files
+##
+## Optional environment variables
+##
+## JAVA_HOME        Point at a JDK install to use.
+## 
+## WAYBACK_HOME     Pointer to your wayback install.  If not present, we 
+##                  make an educated guess based on position relative to this
+##                  script.
+##
+## JAVA_OPTS        Java runtime options.  Default setting is '-Xmx256m'.
+##
+
+# Resolve links - $0 may be a softlink
+PRG="$0"
+while [ -h "$PRG" ]; do
+  ls=`ls -ld "$PRG"`
+  link=`expr "$ls" : '.*-> \(.*\)$'`
+  if expr "$link" : '.*/.*' > /dev/null; then
+    PRG="$link"
+  else
+    PRG=`dirname "$PRG"`/"$link"
+  fi
+done
+PRGDIR=`dirname "$PRG"`
+
+# Set WAYBACK_HOME.
+if [ -z "$WAYBACK_HOME" ]
+then
+    WAYBACK_HOME=`cd "$PRGDIR/.." ; pwd`
+fi
+
+# Find JAVA_HOME.
+if [ -z "$JAVA_HOME" ]
+then
+  JAVA=`which java`
+  if [ -z "$JAVA" ] 
+  then
+    echo "Cannot find JAVA. Please set JAVA_HOME or your PATH."
+    exit 1
+  fi
+  JAVA_BINDIR=`dirname $JAVA`
+  JAVA_HOME=$JAVA_BINDIR/..
+fi
+
+if [ -z "$JAVACMD" ] 
+then 
+   # It may be defined in env - including flags!!
+   JAVACMD=$JAVA_HOME/bin/java
+fi
+
+# Ignore previous classpath.  Build one that contains the wayback jars and content
+# of the lib directory into the variable CP.
+for jar in `ls $WAYBACK_HOME/lib/*.jar $WAYBACK_HOME/*.jar 2> /dev/null`
+do
+    CP=${CP}:${jar}
+done
+
+# cygwin path translation
+if expr `uname` : 'CYGWIN*' > /dev/null; then
+    CP=`cygpath -p -w "$CP"`
+    WAYBACK_HOME=`cygpath -p -w "$WAYBACK_HOME"`
+fi
+
+# Make sure of java opts.
+if [ -z "$JAVA_OPTS" ]
+then
+  JAVA_OPTS=" -Xmx256m"
+fi
+
+# Main FileLocationDB class.
+if [ -z "$CLASS_MAIN" ]
+then
+  CLASS_MAIN='org.archive.wayback.resourcestore.http.FileLocationDB'
+fi
+
+CLASSPATH=${CP} $JAVACMD ${JAVA_OPTS} $CLASS_MAIN "$@"
+



[Archive-access-cvs] SF.net SVN: archive-access: [2261] trunk/archive-access/projects/wayback/ wayback-core/src/main/java/org/archive/wayback/resourcestore/http/ FileLocationDB.java

From: <bra...@us...> - 2008-05-05 21:33:03

Revision: 2261
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2261&view=rev
Author:   bradtofel
Date:     2008-05-05 14:33:02 -0700 (Mon, 05 May 2008)

Log Message:
-----------
FEATURE: command-line code to populate an offline FileLocationDB.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/http/FileLocationDB.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/http/FileLocationDB.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/http/FileLocationDB.java	2008-04-19 00:37:34 UTC (rev 2260)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/http/FileLocationDB.java	2008-05-05 21:33:02 UTC (rev 2261)
@@ -24,7 +24,9 @@
  */
 package org.archive.wayback.resourcestore.http;
 
+import java.io.BufferedReader;
 import java.io.IOException;
+import java.io.InputStreamReader;
 
 import org.archive.wayback.bdb.BDBRecordSet;
 import org.archive.wayback.exception.ConfigurationException;
@@ -238,4 +240,66 @@
 	public void setBdbName(String bdbName) {
 		this.bdbName = bdbName;
 	}
+	private static void USAGE(String message) {
+		System.err.print("USAGE: " + message + "\n" +
+				"\tDBDIR DBNAME LOGPATH\n" +
+				"\n" +
+				"\t\tread lines from STDIN formatted like:\n" +
+				"\t\t\tNAME<SPACE>URL\n" +
+				"\t\tand for each line, add to locationDB that file NAME is\n" +
+				"\t\tlocated at URL. Use locationDB in DBDIR at DBNAME, \n" + 
+				"\t\tcreating if it does not exist.\n"
+				);
+		System.exit(2);
+	}
+	
+	/**
+	 * @param args
+	 */
+	public static void main(String[] args) {
+		if(args.length != 3) {
+			USAGE("");
+			System.exit(1);
+		}
+		String bdbPath = args[0];
+		String bdbName = args[1];
+		String logPath = args[2];
+		FileLocationDB db = new FileLocationDB();
+		db.setBdbPath(bdbPath);
+		db.setBdbName(bdbName);
+		db.setLogPath(logPath);
+		BufferedReader r = new BufferedReader(
+				new InputStreamReader(System.in));
+		String line;
+		int exitCode = 0;
+		try {
+			db.init();
+			while((line = r.readLine()) != null) {
+				String parts[] = line.split(" ");
+				if(parts.length != 2) {
+					System.err.println("Bad input(" + line + ")");
+					System.exit(2);
+				}
+				db.addArcUrl(parts[0],parts[1]);
+				System.out.println("Added\t" + parts[0] + "\t" + parts[1]);
+			}
+		} catch (IOException e) {
+			e.printStackTrace();
+			exitCode = 1;
+		} catch (DatabaseException e) {
+			e.printStackTrace();
+			exitCode = 1;
+		} catch (ConfigurationException e) {
+			e.printStackTrace();
+			exitCode = 1;
+		} finally {
+			try {
+				db.shutdownDB();
+			} catch (DatabaseException e) {
+				e.printStackTrace();
+				exitCode = 1;
+			}
+		}
+		System.exit(exitCode);
+	}	
 }
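
The <code>main()</code> added above reads <code>NAME&lt;SPACE&gt;URL</code> lines and calls <code>System.exit(2)</code> on the first malformed line. Since <code>System.exit</code> does not run enclosing <code>finally</code> blocks, that path appears to skip the <code>db.shutdownDB()</code> call. A variant that counts bad lines and leaves the exit decision to the caller could look like the sketch below. It is an illustration only: <code>LocationLoader</code> is a hypothetical class, and a <code>Map</code> stands in for the BDB-backed location DB.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch (not part of the commit): parse "NAME URL" lines into a map. */
class LocationLoader {

  /**
   * Adds every well-formed "NAME URL" line to db and returns the number
   * of malformed lines, so the caller can clean up before exiting.
   */
  static int load(Iterable<String> lines, Map<String, String> db) {
    int bad = 0;
    for (String line : lines) {
      String[] parts = line.split(" ");
      if (parts.length != 2) {
        bad++;       // same validity rule as the commit: exactly two fields
        continue;
      }
      db.put(parts[0], parts[1]);
    }
    return bad;
  }

  public static void main(String[] args) {
    // Stand-in input; the real tool reads these lines from STDIN.
    List<String> input = Arrays.asList(
        "a.arc.gz http://host/a.arc.gz",
        "broken",
        "b.arc.gz http://host/b.arc.gz");
    Map<String, String> db = new LinkedHashMap<String, String>();
    int bad = load(input, db);
    System.out.println(db.size() + " added, " + bad + " rejected");
  }
}
```

A caller could then set a non-zero exit code after its own cleanup when the returned count is greater than zero.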



[Archive-access-cvs] SF.net SVN: archive-access: [2260] trunk/archive-access/projects/wayback/ dist/src/site/xdoc/release_notes.xml

From: <bra...@us...> - 2008-04-19 00:37:36

Revision: 2260
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2260&view=rev
Author:   bradtofel
Date:     2008-04-18 17:37:34 -0700 (Fri, 18 Apr 2008)

Log Message:
-----------
DOCS: wrong version specified

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml	2008-04-19 00:36:41 UTC (rev 2259)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/release_notes.xml	2008-04-19 00:37:34 UTC (rev 2260)
@@ -11,7 +11,7 @@
     <section name="Releases">
       <p>
         Full listing of changes and bug fixes are not currently available prior
-        to release 1.2.1.
+        to release 1.2.0.
       </p>
     </section>
     <section name="Release 1.2.1">



[Archive-access-cvs] SF.net SVN: archive-access: [2259] trunk/archive-access/projects/wayback

From: <bra...@us...> - 2008-04-19 00:36:48

Revision: 2259
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2259&view=rev
Author:   bradtofel
Date:     2008-04-18 17:36:41 -0700 (Fri, 18 Apr 2008)

Log Message:
-----------
POST-RELEASE: 1.3.0-SNAPSHOT

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/dist/pom.xml
    trunk/archive-access/projects/wayback/pom.xml
    trunk/archive-access/projects/wayback/wayback-core/pom.xml
    trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml
    trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml
    trunk/archive-access/projects/wayback/wayback-webapp/pom.xml

Modified: trunk/archive-access/projects/wayback/dist/pom.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/pom.xml	2008-04-19 00:32:29 UTC (rev 2258)
+++ trunk/archive-access/projects/wayback/dist/pom.xml	2008-04-19 00:36:41 UTC (rev 2259)
@@ -3,7 +3,7 @@
   <parent>
 	  <groupId>org.archive</groupId>
 	  <artifactId>wayback</artifactId>
-    <version>1.2.1</version>
+    <version>1.3.0-SNAPSHOT</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
 
@@ -54,13 +54,13 @@
     <dependency>
       <groupId>org.archive.wayback</groupId>
       <artifactId>wayback-webapp</artifactId>
-      <version>1.2.1</version>
+      <version>1.3.0-SNAPSHOT</version>
       <type>war</type>
     </dependency>
     <dependency>
       <groupId>org.archive.wayback</groupId>
       <artifactId>wayback-mapreduce</artifactId>
-      <version>1.2.1</version>
+      <version>1.3.0-SNAPSHOT</version>
     </dependency>
   </dependencies>
     

Modified: trunk/archive-access/projects/wayback/pom.xml
===================================================================
--- trunk/archive-access/projects/wayback/pom.xml	2008-04-19 00:32:29 UTC (rev 2258)
+++ trunk/archive-access/projects/wayback/pom.xml	2008-04-19 00:36:41 UTC (rev 2259)
@@ -16,7 +16,7 @@
   <modelVersion>4.0.0</modelVersion>
   <groupId>org.archive</groupId>
   <artifactId>wayback</artifactId>
-  <version>1.2.1</version>
+  <version>1.3.0-SNAPSHOT</version>
   <packaging>pom</packaging>
   <name>Wayback</name>
 

Modified: trunk/archive-access/projects/wayback/wayback-core/pom.xml
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/pom.xml	2008-04-19 00:32:29 UTC (rev 2258)
+++ trunk/archive-access/projects/wayback/wayback-core/pom.xml	2008-04-19 00:36:41 UTC (rev 2259)
@@ -17,7 +17,7 @@
   <parent>
 	  <groupId>org.archive</groupId>
 	  <artifactId>wayback</artifactId>
-	  <version>1.2.1</version>
+	  <version>1.3.0-SNAPSHOT</version>
   </parent>
   <groupId>org.archive.wayback</groupId>
   <artifactId>wayback-core</artifactId>

Modified: trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml
===================================================================
--- trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml	2008-04-19 00:32:29 UTC (rev 2258)
+++ trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml	2008-04-19 00:36:41 UTC (rev 2259)
@@ -12,7 +12,7 @@
   <parent>
 	  <groupId>org.archive</groupId>
 	  <artifactId>wayback</artifactId>
-	  <version>1.2.1</version>
+	  <version>1.3.0-SNAPSHOT</version>
   </parent>
   <groupId>org.archive.wayback</groupId>
   <artifactId>wayback-mapreduce</artifactId>

Modified: trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml
===================================================================
--- trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml	2008-04-19 00:32:29 UTC (rev 2258)
+++ trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml	2008-04-19 00:36:41 UTC (rev 2259)
@@ -10,7 +10,7 @@
   <parent>
 	  <groupId>org.archive</groupId>
 	  <artifactId>wayback</artifactId>
-	  <version>1.2.1</version>
+	  <version>1.3.0-SNAPSHOT</version>
   </parent>
   <groupId>org.archive.wayback</groupId>
   <artifactId>wayback-mapreduce-prereq</artifactId>

Modified: trunk/archive-access/projects/wayback/wayback-webapp/pom.xml
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/pom.xml	2008-04-19 00:32:29 UTC (rev 2258)
+++ trunk/archive-access/projects/wayback/wayback-webapp/pom.xml	2008-04-19 00:36:41 UTC (rev 2259)
@@ -3,7 +3,7 @@
   <parent>
     <artifactId>wayback</artifactId>
     <groupId>org.archive</groupId>
-    <version>1.2.1</version>
+    <version>1.3.0-SNAPSHOT</version>
   </parent>
   <modelVersion>4.0.0</modelVersion>
   <groupId>org.archive.wayback</groupId>



[Archive-access-cvs] SF.net SVN: archive-access: [2258] branches/wayback-1_2_1/wayback/

From: <bra...@us...> - 2008-04-19 00:32:52

Revision: 2258
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2258&view=rev
Author:   bradtofel
Date:     2008-04-18 17:32:29 -0700 (Fri, 18 Apr 2008)

Log Message:
-----------
RELEASE 1.2.1

Added Paths:
-----------
    branches/wayback-1_2_1/wayback/

Copied: branches/wayback-1_2_1/wayback (from rev 2257, trunk/archive-access/projects/wayback)



[Archive-access-cvs] SF.net SVN: archive-access: [2257] branches/wayback-1_2_1/wayback/

From: <bra...@us...> - 2008-04-19 00:32:22

Revision: 2257
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2257&view=rev
Author:   bradtofel
Date:     2008-04-18 17:31:58 -0700 (Fri, 18 Apr 2008)

Log Message:
-----------
Removing bad branch...

Removed Paths:
-------------
    branches/wayback-1_2_1/wayback/



[Archive-access-cvs] SF.net SVN: archive-access: [2256] trunk/archive-access/projects/wayback/ dist/src/site/xdoc/administrator_manual.xml

From: <bra...@us...> - 2008-04-17 23:02:02

Revision: 2256
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2256&view=rev
Author:   bradtofel
Date:     2008-04-17 16:02:06 -0700 (Thu, 17 Apr 2008)

Log Message:
-----------
DOC: added a bit of info indicating that adding ARCs/WARCs to 'dataDir' will get them added to Wayback iff autoindexing is enabled.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml	2008-04-17 20:54:16 UTC (rev 2255)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml	2008-04-17 23:02:06 UTC (rev 2256)
@@ -138,7 +138,11 @@
 					implementation also includes the capability to run a background
 					thread to automatically notice new ARC/WARC files appearing, index
 					those files, and hand off the index data for merging with
-					a BDBResourceIndex.
+					a BDBResourceIndex. When using automatic indexing, any files added to
+					the 'dataDir' will automatically be indexed and queued for merging 
+					with the ResourceIndex. Please see documentation for the 
+					BDBResourceIndex for information on configuring automatic merging of
+					indexed data with a BDBResourceIndex.
         </p>
         <p>
           The XML configuration template for a LocalResourceStore follows:


