From: <bra...@us...> - 2008-11-06 22:51:27
Revision: 2630
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2630&view=rev
Author:   bradtofel
Date:     2008-11-06 22:51:24 +0000 (Thu, 06 Nov 2008)

Log Message:
-----------
DOC: clarified dependency on using url-client with -identity option on arc/warc-indexer

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-10-29 00:01:33 UTC (rev 2629)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-11-06 22:51:24 UTC (rev 2630)
@@ -1110,8 +1110,11 @@
  </p>
  <p>
  The <b>-identity</b> option causes the tools to skip canonicalization
- of URLs. See the documentation for the <b>url-client</b> tool, and
- the <a href="resource_index.html#URL_Canonicalization">
+ of URLs. When using this option, you will need to pass the CDX
+ records through the url-client tool before sorting them into a
+ production CDX index. See the documentation for the
+ <b>url-client</b> tool, and the
+ <a href="resource_index.html#URL_Canonicalization">
  URL Canonicalization
  </a> section for more information.
  </p>
@@ -1182,15 +1185,19 @@
  canonicalization function is applied to requested URLs.
  This tool will read space(" ") delimited lines from STDIN,
  and output the same lines on STDOUT, but with one column
- altered. The column that is changed is assumed to be a URL,
+ altered. The column that is changed is assumed to be an URL,
  and the version output is the canonicalized form of the input
  URL.
  </p>
  <p>
- This tool is mostly useful for debugging the
- canonicalization function, but can also be used, if the
- canonicalization function is altered, to update an existing
- CDX index, without recreating CDX files from original ARCs. See the
+ This tool is required when using the <b>arc-indexer</b> or
+ <b>warc-indexer</b> tools with the <b>-identity</b> option. Typical
+ usage involves generating an <i>identity</i> CDX index, then
+ passing the lines in that index through this tool to canonicalize the
+ record URL key for queries. If the <i>identity</i> CDX files are
+ kept, then canonicalization schemes can be swapped without
+ reindexing the original ARC/WARC content. This tool can also be
+ useful for debugging the canonicalization function. See the
  section
  <a href="resource_index.html#URL_Canonicalization">
  URL Canonicalization

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml 2008-10-29 00:01:33 UTC (rev 2629)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml 2008-11-06 22:51:24 UTC (rev 2630)
@@ -275,10 +275,14 @@
  </li>
  <li>
  <b>user info removal</b>
- http://us...@ex... => example.com,
- http://user:pas...@ex... => example.com,
+ http://us...@ex... => example.com,
+ http://user:pas...@ex... => example.com,
  </li>
  <li>
+ <b>default port removal</b>
+ http://example.com:80 => example.com,
+ </li>
+ <li>
  <b>session ID removal</b>
  http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx
  =>
@@ -313,12 +317,12 @@
  <p>
  At the IA, we have recently switched to building CDX files using
  the <b>-identity</b> option on the <b>arc-indexer</b> and
- <b>warc-indexer</b> tools, and have added an additional step in our
- CDX creation processes which uses the <b>url-client</b> tool before
- sorting and merging CDX files. By keeping the original "identity" CDX
- files, we have been able to test various URL canonicalization
- strategies without the overhead of re-processing all the source
- materials.
+ <b>warc-indexer</b> tools. The <b>-identity</b> option
+ <b>requires</b> passing records through the <b>url-client</b>
+ tool before sorting and merging into production CDX files. By keeping
+ the original "identity" CDX files, we have been able to test various
+ URL canonicalization strategies without the overhead of
+ re-processing all the ARC/WARC source materials.
  </p>
  </subsection>
  <subsection name="Future Directions within Wayback">

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <bra...@us...> - 2008-10-29 00:01:35
Revision: 2629
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2629&view=rev
Author:   bradtofel
Date:     2008-10-29 00:01:33 +0000 (Wed, 29 Oct 2008)

Log Message:
-----------
TWEAK: now outputs log message when failing to access a Resource, instead of dumping a stack trace.

Modified Paths:
--------------
    branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java

Modified: branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java
===================================================================
--- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java 2008-10-28 23:57:25 UTC (rev 2628)
+++ branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java 2008-10-29 00:01:33 UTC (rev 2629)
@@ -27,6 +27,7 @@
 import java.io.File;
 import java.io.IOException;
 import java.net.URL;
+import java.util.logging.Logger;

 import org.archive.wayback.ResourceStore;
 import org.archive.wayback.core.Resource;
@@ -43,6 +44,8 @@
  * @version $Date$, $Revision$
  */
 public class LocationDBResourceStore implements ResourceStore {
+    private static final Logger LOGGER =
+        Logger.getLogger(LocationDBResourceStore.class.getName());

     private ResourceFileLocationDB db = null;

@@ -89,7 +92,7 @@
                     // which means we've already read some

                 } catch (IOException e) {
-                    e.printStackTrace();
+                    LOGGER.warning("Unable to retrieve resource from " + url);
                 }
                 if(r != null) {
                     break;
From: <bra...@us...> - 2008-10-28 23:57:32
Revision: 2628
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2628&view=rev
Author:   bradtofel
Date:     2008-10-28 23:57:25 +0000 (Tue, 28 Oct 2008)

Log Message:
-----------
NEW FEATURE(ACC-43): allows filtering records based on arbitrarily configured ObjectFilter.

Modified Paths:
--------------
    branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java

Modified: branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java
===================================================================
--- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java 2008-10-28 23:53:49 UTC (rev 2627)
+++ branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java 2008-10-28 23:57:25 UTC (rev 2628)
@@ -86,6 +86,8 @@
     private boolean dedupeRecords = false;

     private ObjectFilter<CaptureSearchResult> annotater = null;
+
+    private ObjectFilter<CaptureSearchResult> filter = null;

     public LocalResourceIndex() {
         canonicalizer = new AggressiveUrlCanonicalizer();
@@ -122,7 +124,7 @@
         CaptureSearchResults results = new CaptureSearchResults();

         CaptureQueryFilterState filterState =
-            new CaptureQueryFilterState(wbRequest,canonicalizer, type);
+            new CaptureQueryFilterState(wbRequest,canonicalizer, type, filter);
         String keyUrl = filterState.getKeyUrl();

         CloseableIterator<CaptureSearchResult> itr = getCaptureIterator(keyUrl);
@@ -159,7 +161,7 @@

         CaptureQueryFilterState filterState =
             new CaptureQueryFilterState(wbRequest,canonicalizer,
-                CaptureQueryFilterState.TYPE_URL);
+                CaptureQueryFilterState.TYPE_URL, filter);
         String keyUrl = filterState.getKeyUrl();

         CloseableIterator<CaptureSearchResult> citr = getCaptureIterator(keyUrl);
@@ -287,6 +289,14 @@
     public void setAnnotater(ObjectFilter<CaptureSearchResult> annotater) {
         this.annotater = annotater;
     }
+
+    public ObjectFilter<CaptureSearchResult> getFilter() {
+        return filter;
+    }
+
+    public void setFilter(ObjectFilter<CaptureSearchResult> filter) {
+        this.filter = filter;
+    }

     private class CaptureQueryFilterState {
         public final static int TYPE_REPLAY = 0;
@@ -302,7 +312,8 @@
         private String exactDate;

         public CaptureQueryFilterState(WaybackRequest request,
-                UrlCanonicalizer canonicalizer, int type)
+                UrlCanonicalizer canonicalizer, int type,
+                ObjectFilter<CaptureSearchResult> genericFilter)
         throws BadQueryException {

             String searchUrl = request.getRequestUrl();
@@ -333,6 +344,9 @@
             preExclusionCounter = new CounterFilter();
             DateRangeFilter drFilter = new DateRangeFilter(startDate,endDate);

+            if(genericFilter != null) {
+                filter.addFilter(genericFilter);
+            }
             // has the user asked for only results on the exact host specified?
             ObjectFilter<CaptureSearchResult> exactHost =
                 getExactHostFilter(request);
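Note on revision 2628: the idea of the change, an optional, externally configured filter that is appended to the query's filter chain only when it has been set, might look something like the following in isolation. The names ObjectFilter, filterObject, addFilter, FILTER_INCLUDE and FILTER_EXCLUDE come from the diffs in this thread; everything else (the Chain class, buildChain, the numeric constant values, the use of String records) is a hypothetical stand-in and not Wayback's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

public class FilterChainSketch {
    // Constant values are assumptions made for this sketch only.
    static final int FILTER_INCLUDE = 0;
    static final int FILTER_EXCLUDE = 1;

    interface ObjectFilter<T> {
        int filterObject(T o);
    }

    // Hypothetical chain: a record is included only if every member includes it.
    static class Chain<T> implements ObjectFilter<T> {
        private final List<ObjectFilter<T>> filters = new ArrayList<>();

        void addFilter(ObjectFilter<T> f) {
            filters.add(f);
        }

        public int filterObject(T o) {
            for (ObjectFilter<T> f : filters) {
                if (f.filterObject(o) != FILTER_INCLUDE) {
                    return FILTER_EXCLUDE;
                }
            }
            return FILTER_INCLUDE;
        }
    }

    // Mirrors the null check in the commit: the generic filter is optional,
    // so an unconfigured index behaves exactly as it did before.
    static <T> Chain<T> buildChain(ObjectFilter<T> builtIn, ObjectFilter<T> generic) {
        Chain<T> chain = new Chain<>();
        chain.addFilter(builtIn);
        if (generic != null) {
            chain.addFilter(generic);
        }
        return chain;
    }

    public static void main(String[] args) {
        ObjectFilter<String> nonEmpty =
                s -> s.isEmpty() ? FILTER_EXCLUDE : FILTER_INCLUDE;
        ObjectFilter<String> noRobots =
                s -> s.endsWith("robots.txt") ? FILTER_EXCLUDE : FILTER_INCLUDE;

        Chain<String> plain = buildChain(nonEmpty, null);
        Chain<String> configured = buildChain(nonEmpty, noRobots);

        System.out.println(plain.filterObject("example.com/robots.txt"));
        System.out.println(configured.filterObject("example.com/robots.txt"));
    }
}
```

Passing the filter through a setter rather than hard-coding it keeps the choice in configuration, which is presumably why the commit adds getFilter/setFilter.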
From: <bra...@us...> - 2008-10-28 23:53:57
Revision: 2627
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2627&view=rev
Author:   bradtofel
Date:     2008-10-28 23:53:49 +0000 (Tue, 28 Oct 2008)

Log Message:
-----------
NEW FEATURE(ACC-43): allows filtering(include/exclude) of records based on HTTP response code.

Added Paths:
-----------
    branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/HttpCodeFilter.java

Added: branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/HttpCodeFilter.java
===================================================================
--- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/HttpCodeFilter.java (rev 0)
+++ branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/HttpCodeFilter.java 2008-10-28 23:53:49 UTC (rev 2627)
@@ -0,0 +1,74 @@
+package org.archive.wayback.resourceindex.filters;
+
+import java.util.ArrayList;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+import org.archive.wayback.core.CaptureSearchResult;
+import org.archive.wayback.util.ObjectFilter;
+
+/**
+ * ObjectFilter which allows including or excluding results based on the
+ * Http response code.
+ *
+ * @author brad
+ * @version $Date: 2008-10-10 18:59:57 -0700 (Fri, 10 Oct 2008) $, $Rev: 2606 $
+ */
+public class HttpCodeFilter implements ObjectFilter<CaptureSearchResult> {
+
+    private Map<String,Object> includes = null;
+    private Map<String,Object> excludes = null;
+
+    private static Map<String,Object> listToMap(List<String> list) {
+        if(list == null) {
+            return null;
+        }
+        HashMap<String, Object> map = new HashMap<String, Object>();
+        for(String s : list) {
+            map.put(s, null);
+        }
+        return map;
+    }
+    private static List<String> mapToList(Map<String,Object> map) {
+        if(map == null) {
+            return null;
+        }
+        List<String> list = new ArrayList<String>();
+        list.addAll(map.keySet());
+        return list;
+    }
+
+    public List<String> getIncludes() {
+        return mapToList(includes);
+    }
+
+    public void setIncludes(List<String> includes) {
+        this.includes = listToMap(includes);
+    }
+
+
+    public List<String> getExcludes() {
+        return mapToList(excludes);
+    }
+
+
+    public void setExcludes(List<String> excludes) {
+        this.excludes = listToMap(excludes);
+    }
+
+    public int filterObject(CaptureSearchResult o) {
+        String code = o.getHttpCode();
+        if(excludes != null) {
+            if(excludes.containsKey(code)) {
+                return FILTER_EXCLUDE;
+            }
+        }
+        if(includes != null) {
+            if(!includes.containsKey(code)) {
+                return FILTER_EXCLUDE;
+            }
+        }
+        return FILTER_INCLUDE;
+    }
+}
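Note on revision 2627: the include/exclude decision in HttpCodeFilter can be tried out without the rest of Wayback using a stand-alone sketch such as the one below. It is not the committed class. It works on a bare status-code string instead of a CaptureSearchResult, uses a Set where the commit uses a Map with null values, and the class name, the filterCode method and the numeric constant values are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HttpCodeFilterSketch {
    // Constant values are assumptions made for this sketch only.
    static final int FILTER_INCLUDE = 0;
    static final int FILTER_EXCLUDE = 1;

    private Set<String> includes = null; // null means "no include list"
    private Set<String> excludes = null; // null means "no exclude list"

    public void setIncludes(List<String> codes) {
        includes = (codes == null) ? null : new HashSet<>(codes);
    }

    public void setExcludes(List<String> codes) {
        excludes = (codes == null) ? null : new HashSet<>(codes);
    }

    // Same decision order as the committed filterObject: the exclude list is
    // checked first, then the include list (if any) acts as a whitelist.
    public int filterCode(String code) {
        if (excludes != null && excludes.contains(code)) {
            return FILTER_EXCLUDE;
        }
        if (includes != null && !includes.contains(code)) {
            return FILTER_EXCLUDE;
        }
        return FILTER_INCLUDE;
    }

    public static void main(String[] args) {
        HttpCodeFilterSketch filter = new HttpCodeFilterSketch();
        filter.setIncludes(Arrays.asList("200", "301", "302"));
        for (String code : Arrays.asList("200", "302", "404")) {
            boolean kept = filter.filterCode(code) == FILTER_INCLUDE;
            System.out.println(code + (kept ? " kept" : " dropped"));
        }
    }
}
```

Two consequences of the ordering are worth keeping in mind when configuring the real filter: a code present in both lists is excluded, and with neither list set every record passes.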
From: <bra...@us...> - 2008-10-28 18:28:11
Revision: 2626 http://archive-access.svn.sourceforge.net/archive-access/?rev=2626&view=rev Author: bradtofel Date: 2008-10-28 18:27:59 +0000 (Tue, 28 Oct 2008) Log Message: ----------- RELEASE - 1.4.1 Maintenance, fixes several known bugs in 1.4.0, including disabling of AnchorDate/Window critical bug. Modified Paths: -------------- branches/wayback-1_4_1/dist/pom.xml branches/wayback-1_4_1/pom.xml branches/wayback-1_4_1/wayback-core/pom.xml branches/wayback-1_4_1/wayback-mapreduce/pom.xml branches/wayback-1_4_1/wayback-mapreduce-prereq/pom.xml branches/wayback-1_4_1/wayback-webapp/pom.xml Modified: branches/wayback-1_4_1/dist/pom.xml =================================================================== --- branches/wayback-1_4_1/dist/pom.xml 2008-10-28 18:22:58 UTC (rev 2625) +++ branches/wayback-1_4_1/dist/pom.xml 2008-10-28 18:27:59 UTC (rev 2626) @@ -3,7 +3,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.0</version> + <version>1.4.1</version> </parent> <modelVersion>4.0.0</modelVersion> @@ -54,13 +54,13 @@ <dependency> <groupId>org.archive.wayback</groupId> <artifactId>wayback-webapp</artifactId> - <version>1.4.0</version> + <version>1.4.1</version> <type>war</type> </dependency> <dependency> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce</artifactId> - <version>1.4.0</version> + <version>1.4.1</version> </dependency> </dependencies> Modified: branches/wayback-1_4_1/pom.xml =================================================================== --- branches/wayback-1_4_1/pom.xml 2008-10-28 18:22:58 UTC (rev 2625) +++ branches/wayback-1_4_1/pom.xml 2008-10-28 18:27:59 UTC (rev 2626) @@ -17,9 +17,9 @@ <groupId>org.archive</groupId> <artifactId>wayback</artifactId> <properties> - <globalVersion>1.4.0</globalVersion> + <globalVersion>1.4.1</globalVersion> </properties> - <version>1.4.0</version> + <version>1.4.1</version> <packaging>pom</packaging> <name>Wayback</name> Modified: 
branches/wayback-1_4_1/wayback-core/pom.xml =================================================================== --- branches/wayback-1_4_1/wayback-core/pom.xml 2008-10-28 18:22:58 UTC (rev 2625) +++ branches/wayback-1_4_1/wayback-core/pom.xml 2008-10-28 18:27:59 UTC (rev 2626) @@ -17,7 +17,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.0</version> + <version>1.4.1</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-core</artifactId> @@ -57,7 +57,7 @@ <dependency> <groupId>org.archive.heritrix</groupId> <artifactId>commons</artifactId> - <version>2.0.1-SNAPSHOT</version> + <version>2.0.2-SNAPSHOT</version> </dependency> <dependency> <groupId>org.archive.access-control</groupId> Modified: branches/wayback-1_4_1/wayback-mapreduce/pom.xml =================================================================== --- branches/wayback-1_4_1/wayback-mapreduce/pom.xml 2008-10-28 18:22:58 UTC (rev 2625) +++ branches/wayback-1_4_1/wayback-mapreduce/pom.xml 2008-10-28 18:27:59 UTC (rev 2626) @@ -12,7 +12,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.0</version> + <version>1.4.1</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce</artifactId> Modified: branches/wayback-1_4_1/wayback-mapreduce-prereq/pom.xml =================================================================== --- branches/wayback-1_4_1/wayback-mapreduce-prereq/pom.xml 2008-10-28 18:22:58 UTC (rev 2625) +++ branches/wayback-1_4_1/wayback-mapreduce-prereq/pom.xml 2008-10-28 18:27:59 UTC (rev 2626) @@ -10,7 +10,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.0</version> + <version>1.4.1</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce-prereq</artifactId> Modified: branches/wayback-1_4_1/wayback-webapp/pom.xml =================================================================== --- 
branches/wayback-1_4_1/wayback-webapp/pom.xml 2008-10-28 18:22:58 UTC (rev 2625) +++ branches/wayback-1_4_1/wayback-webapp/pom.xml 2008-10-28 18:27:59 UTC (rev 2626) @@ -3,7 +3,7 @@ <parent> <artifactId>wayback</artifactId> <groupId>org.archive</groupId> - <version>1.4.0</version> + <version>1.4.1</version> </parent> <modelVersion>4.0.0</modelVersion> <groupId>org.archive.wayback</groupId> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2008-10-28 18:23:10
Revision: 2625
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2625&view=rev
Author:   bradtofel
Date:     2008-10-28 18:22:58 +0000 (Tue, 28 Oct 2008)

Log Message:
-----------
BUGFIX(ACC-34): Initial-Rev of NoCache.jsp replay insert, which just adds the an HTTP response header to attempt to prevent caching within proxy mode.

Modified Paths:
--------------
    branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/ProxyReplay.xml

Added Paths:
-----------
    branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/replay/NoCache.jsp
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/NoCache.jsp

Modified: branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/ProxyReplay.xml
===================================================================
--- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/ProxyReplay.xml 2008-10-28 18:15:41 UTC (rev 2624)
+++ branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/ProxyReplay.xml 2008-10-28 18:22:58 UTC (rev 2625)
@@ -15,6 +15,7 @@
  <list>
  <value>/WEB-INF/replay/ArchiveComment.jsp</value>
  <value>/WEB-INF/replay/Disclaimer.jsp</value>
+ <value>/WEB-INF/replay/NoCache.jsp</value>
  <!--
  <value>/replay/JSLessTimeline.jsp</value>
  -->

Added: branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/replay/NoCache.jsp
===================================================================
--- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/replay/NoCache.jsp (rev 0)
+++ branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/replay/NoCache.jsp 2008-10-28 18:22:58 UTC (rev 2625)
@@ -0,0 +1,3 @@
+<%
+ response.setHeader("Cache-Control","no-cache");
+%>
\ No newline at end of file

Added: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/NoCache.jsp
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/NoCache.jsp (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/NoCache.jsp 2008-10-28 18:22:58 UTC (rev 2625)
@@ -0,0 +1,3 @@
+<%
+ response.setHeader("Cache-Control","no-cache");
+%>
\ No newline at end of file
From: <bra...@us...> - 2008-10-28 18:15:47
Revision: 2624 http://archive-access.svn.sourceforge.net/archive-access/?rev=2624&view=rev Author: bradtofel Date: 2008-10-28 18:15:41 +0000 (Tue, 28 Oct 2008) Log Message: ----------- BUGFIX(ACC-36): Timeline was using replay date from WaybackRequest, not capture date from SearchResult. Modified Paths: -------------- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp Modified: branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp =================================================================== --- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp 2008-10-28 02:08:53 UTC (rev 2623) +++ branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp 2008-10-28 18:15:41 UTC (rev 2624) @@ -14,16 +14,15 @@ <jsp:include page="/WEB-INF/template/CookieJS.jsp" flush="true" /> <% -String contextRoot = request.getScheme() + "://" + request.getServerName() + ":" - + request.getServerPort() + request.getContextPath(); UIResults results = UIResults.extractReplay(request); +String contextRoot = results.getWbRequest().getContextPrefix(); WaybackRequest wbRequest = results.getWbRequest(); StringFormatter fmt = wbRequest.getFormatter(); CaptureSearchResults cResults = results.getCaptureResults(); -String exactDateStr = wbRequest.getReplayTimestamp(); -Date exactDate = wbRequest.getReplayDate(); +String exactDateStr = results.getResult().getCaptureTimestamp(); +Date exactDate = results.getResult().getCaptureDate(); String searchUrl = wbRequest.getRequestUrl(); String resolution = wbRequest.getTimelineResolution(); Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp =================================================================== --- 
trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp 2008-10-28 02:08:53 UTC (rev 2623) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp 2008-10-28 18:15:41 UTC (rev 2624) @@ -14,16 +14,15 @@ <jsp:include page="/WEB-INF/template/CookieJS.jsp" flush="true" /> <% -String contextRoot = request.getScheme() + "://" + request.getServerName() + ":" - + request.getServerPort() + request.getContextPath(); UIResults results = UIResults.extractReplay(request); +String contextRoot = results.getWbRequest().getContextPrefix(); WaybackRequest wbRequest = results.getWbRequest(); StringFormatter fmt = wbRequest.getFormatter(); CaptureSearchResults cResults = results.getCaptureResults(); -String exactDateStr = wbRequest.getReplayTimestamp(); -Date exactDate = wbRequest.getReplayDate(); +String exactDateStr = results.getResult().getCaptureTimestamp(); +Date exactDate = results.getResult().getCaptureDate(); String searchUrl = wbRequest.getRequestUrl(); String resolution = wbRequest.getTimelineResolution(); This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2008-10-28 02:08:54
Revision: 2623 http://archive-access.svn.sourceforge.net/archive-access/?rev=2623&view=rev Author: bradtofel Date: 2008-10-28 02:08:53 +0000 (Tue, 28 Oct 2008) Log Message: ----------- BUGFIX(ACC-37): added drop down to select urlquery/prefixquery search type, added place-holder text to all UI-properties files, need to get updates for German and Czech translations. Modified Paths: -------------- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI.properties branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_cs.properties branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_fr_CA.properties branches/wayback-1_4_1/wayback-webapp/src/main/webapp/advanced_search.jsp trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI.properties trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_cs.properties trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_fr_CA.properties trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/advanced_search.jsp Modified: branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI.properties =================================================================== --- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI.properties 2008-10-28 02:06:26 UTC (rev 2622) +++ branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI.properties 2008-10-28 02:08:53 UTC (rev 2623) @@ -114,3 +114,6 @@ AdvancedSearch.earliestDate=Earliest Date: AdvancedSearch.latestDate=Latest Date: AdvancedSearch.submitButton=Submit +AdvancedSearch.searchTypeLabel=Show urls 
+AdvancedSearch.searchTypeExactOption=exactly matching +AdvancedSearch.searchTypePrefixOption=beginning with Modified: branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_cs.properties =================================================================== --- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_cs.properties 2008-10-28 02:06:26 UTC (rev 2622) +++ branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_cs.properties 2008-10-28 02:08:53 UTC (rev 2623) @@ -91,6 +91,9 @@ AdvancedSearch.earliestDate=Nejd\u0159\u00edv\u011bj\u0161\u00ed datum: AdvancedSearch.latestDate=Nejpozd\u011bj\u0161\u00ed datum: AdvancedSearch.submitButton=Vyhledat +AdvancedSearch.searchTypeLabel=[Show urls] +AdvancedSearch.searchTypeExactOption=[exactly matching] +AdvancedSearch.searchTypePrefixOption=[beginning with] header.h1.title=WebArchiv - archiv \u010desk\u00e9ho webu header.h1.nkp=N\u00e1rodn\u00ed knihovna Modified: branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties =================================================================== --- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties 2008-10-28 02:06:26 UTC (rev 2622) +++ branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties 2008-10-28 02:08:53 UTC (rev 2623) @@ -114,3 +114,6 @@ AdvancedSearch.earliestDate=Frühestes Datum: AdvancedSearch.latestDate=Spätestes Datum: AdvancedSearch.submitButton=Suche +AdvancedSearch.searchTypeLabel=[Show urls] +AdvancedSearch.searchTypeExactOption=[exactly matching] +AdvancedSearch.searchTypePrefixOption=[beginning with] Modified: branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_fr_CA.properties =================================================================== --- 
branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_fr_CA.properties 2008-10-28 02:06:26 UTC (rev 2622) +++ branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_fr_CA.properties 2008-10-28 02:08:53 UTC (rev 2623) @@ -95,3 +95,6 @@ AdvancedSearch.earliestDate=[Earliest Date:] AdvancedSearch.latestDate=[Latest Date:] AdvancedSearch.submitButton=[Submit] +AdvancedSearch.searchTypeLabel=[Show urls] +AdvancedSearch.searchTypeExactOption=[exactly matching] +AdvancedSearch.searchTypePrefixOption=[beginning with] Modified: branches/wayback-1_4_1/wayback-webapp/src/main/webapp/advanced_search.jsp =================================================================== --- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/advanced_search.jsp 2008-10-28 02:06:26 UTC (rev 2622) +++ branches/wayback-1_4_1/wayback-webapp/src/main/webapp/advanced_search.jsp 2008-10-28 02:08:53 UTC (rev 2623) @@ -8,8 +8,12 @@ StringFormatter fmt = results.getWbRequest().getFormatter(); %> -<form action="../../replay"> -<%= fmt.format("AdvancedSearch.url") %> +<form action="query"> +<%= fmt.format("AdvancedSearch.searchTypeLabel") %> +<select name="type"> + <option value="urlquery"><%= fmt.format("AdvancedSearch.searchTypeExactOption") %></option> + <option value="prefixquery"><%= fmt.format("AdvancedSearch.searchTypePrefixOption") %></option> +</select> <input type="TEXT" name="url" width="80"> <br></br> <%= fmt.format("AdvancedSearch.exactDate") %> @@ -21,7 +25,6 @@ <%= fmt.format("AdvancedSearch.latestDate") %> <input type="TEXT" name="enddate" width="80"> <br></br> -<input type="HIDDEN" name="type" value="replay"> <input type="SUBMIT" value="<%= fmt.format("AdvancedSearch.submitButton") %>"> </form> <jsp:include page="/WEB-INF/template/UI-footer.jsp" flush="true" /> Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI.properties 
=================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI.properties 2008-10-28 02:06:26 UTC (rev 2622) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI.properties 2008-10-28 02:08:53 UTC (rev 2623) @@ -114,3 +114,6 @@ AdvancedSearch.earliestDate=Earliest Date: AdvancedSearch.latestDate=Latest Date: AdvancedSearch.submitButton=Submit +AdvancedSearch.searchTypeLabel=Show urls +AdvancedSearch.searchTypeExactOption=exactly matching +AdvancedSearch.searchTypePrefixOption=beginning with Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_cs.properties =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_cs.properties 2008-10-28 02:06:26 UTC (rev 2622) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_cs.properties 2008-10-28 02:08:53 UTC (rev 2623) @@ -91,6 +91,9 @@ AdvancedSearch.earliestDate=Nejd\u0159\u00edv\u011bj\u0161\u00ed datum: AdvancedSearch.latestDate=Nejpozd\u011bj\u0161\u00ed datum: AdvancedSearch.submitButton=Vyhledat +AdvancedSearch.searchTypeLabel=[Show urls] +AdvancedSearch.searchTypeExactOption=[exactly matching] +AdvancedSearch.searchTypePrefixOption=[beginning with] header.h1.title=WebArchiv - archiv \u010desk\u00e9ho webu header.h1.nkp=N\u00e1rodn\u00ed knihovna Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties 2008-10-28 02:06:26 UTC (rev 2622) +++ 
trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_de.properties 2008-10-28 02:08:53 UTC (rev 2623) @@ -114,3 +114,6 @@ AdvancedSearch.earliestDate=Frühestes Datum: AdvancedSearch.latestDate=Spätestes Datum: AdvancedSearch.submitButton=Suche +AdvancedSearch.searchTypeLabel=[Show urls] +AdvancedSearch.searchTypeExactOption=[exactly matching] +AdvancedSearch.searchTypePrefixOption=[beginning with] Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_fr_CA.properties =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_fr_CA.properties 2008-10-28 02:06:26 UTC (rev 2622) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/classes/WaybackUI_fr_CA.properties 2008-10-28 02:08:53 UTC (rev 2623) @@ -95,3 +95,6 @@ AdvancedSearch.earliestDate=[Earliest Date:] AdvancedSearch.latestDate=[Latest Date:] AdvancedSearch.submitButton=[Submit] +AdvancedSearch.searchTypeLabel=[Show urls] +AdvancedSearch.searchTypeExactOption=[exactly matching] +AdvancedSearch.searchTypePrefixOption=[beginning with] Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/advanced_search.jsp =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/advanced_search.jsp 2008-10-28 02:06:26 UTC (rev 2622) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/advanced_search.jsp 2008-10-28 02:08:53 UTC (rev 2623) @@ -8,8 +8,12 @@ StringFormatter fmt = results.getWbRequest().getFormatter(); %> -<form action="../../replay"> -<%= fmt.format("AdvancedSearch.url") %> +<form action="query"> +<%= fmt.format("AdvancedSearch.searchTypeLabel") %> +<select name="type"> + <option value="urlquery"><%= fmt.format("AdvancedSearch.searchTypeExactOption") %></option> 
+ <option value="prefixquery"><%= fmt.format("AdvancedSearch.searchTypePrefixOption") %></option> +</select> <input type="TEXT" name="url" width="80"> <br></br> <%= fmt.format("AdvancedSearch.exactDate") %> @@ -21,7 +25,6 @@ <%= fmt.format("AdvancedSearch.latestDate") %> <input type="TEXT" name="enddate" width="80"> <br></br> -<input type="HIDDEN" name="type" value="replay"> <input type="SUBMIT" value="<%= fmt.format("AdvancedSearch.submitButton") %>"> </form> <jsp:include page="/WEB-INF/template/UI-footer.jsp" flush="true" /> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2008-10-28 02:06:32
Revision: 2622 http://archive-access.svn.sourceforge.net/archive-access/?rev=2622&view=rev Author: bradtofel Date: 2008-10-28 02:06:26 +0000 (Tue, 28 Oct 2008) Log Message: ----------- BUGFIX(ACC-33): Fixed Search Result summary and counts. Modified Paths: -------------- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/query/HTMLUrlResults.jsp trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/HTMLUrlResults.jsp Modified: branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/query/HTMLUrlResults.jsp =================================================================== --- branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/query/HTMLUrlResults.jsp 2008-10-28 01:52:22 UTC (rev 2621) +++ branches/wayback-1_4_1/wayback-webapp/src/main/webapp/WEB-INF/query/HTMLUrlResults.jsp 2008-10-28 02:06:26 UTC (rev 2622) @@ -26,14 +26,13 @@ Date searchEndDate = wbRequest.getEndDate(); long firstResult = uResults.getFirstReturned(); -long resultCount = uResults.getReturnedCount(); -long lastResult = resultCount + firstResult; +long lastResult = uResults.getReturnedCount() + firstResult; long totalCaptures = uResults.getMatchingCount(); %> -<%= fmt.format("PathPrefixQuery.showingResults",firstResult,lastResult, - resultCount,searchString) %> +<%= fmt.format("PathPrefixQuery.showingResults",firstResult + 1,lastResult, + totalCaptures,searchString) %> <br/> <hr></hr> Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/HTMLUrlResults.jsp =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/HTMLUrlResults.jsp 2008-10-28 01:52:22 UTC (rev 2621) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/HTMLUrlResults.jsp 2008-10-28 02:06:26 UTC (rev 2622) @@ -26,14 +26,13 @@ Date searchEndDate = wbRequest.getEndDate(); long firstResult = 
uResults.getFirstReturned(); -long resultCount = uResults.getReturnedCount(); -long lastResult = resultCount + firstResult; +long lastResult = uResults.getReturnedCount() + firstResult; long totalCaptures = uResults.getMatchingCount(); %> -<%= fmt.format("PathPrefixQuery.showingResults",firstResult,lastResult, - resultCount,searchString) %> +<%= fmt.format("PathPrefixQuery.showingResults",firstResult + 1,lastResult, + totalCaptures,searchString) %> <br/> <hr></hr> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
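The count fix in r2622 comes down to a little arithmetic: the first index is shown one-based, and the total now comes from the matching count rather than the returned count. A minimal sketch of that arithmetic follows. It assumes getFirstReturned() is a zero-based offset (the "+ 1" in the diff suggests this, but the commit does not say so). The class name, method name and message wording are hypothetical; the real text comes from the PathPrefixQuery.showingResults property, which is not shown in this mail.

```java
final class ResultSummarySketch {

    /**
     * Builds a summary line from a zero-based offset, the number of results
     * returned on this page, and the total number of matches.
     * The wording is a placeholder, not the real property text.
     */
    static String summarize(long firstReturned, long returnedCount, long matchingCount) {
        long first = firstReturned + 1;             // display as one-based
        long last = firstReturned + returnedCount;  // last index shown on this page
        return "Showing " + first + " to " + last + " of " + matchingCount;
    }

    public static void main(String[] args) {
        // First page of 10 results out of 247 matches.
        System.out.println(summarize(0, 10, 247));
    }
}
```

Before the fix, the same page would have reported the page size (10) where the total (247) belongs.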
From: <bra...@us...> - 2008-10-28 01:55:21
Revision: 2620
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2620&view=rev
Author:   bradtofel
Date:     2008-10-28 01:32:15 +0000 (Tue, 28 Oct 2008)

Log Message:
-----------
BUGFIX (unreported): do not forward WaybackRequest fields with 'null' values.

Modified Paths:
--------------
    branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java

Modified: branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java
===================================================================
--- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java	2008-10-28 01:30:59 UTC (rev 2619)
+++ branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java	2008-10-28 01:32:15 UTC (rev 2620)
@@ -822,6 +822,7 @@
 			}
 			if(isStandard) continue;
 			String val = filters.get(key);
+			if(val == null) continue;
 			if (queryString.length() > 0) {
 				queryString.append(" ");
 			}
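The effect of the r2620 null check can be illustrated outside of Wayback. The sketch below is not the WaybackRequest code: the class and method names are hypothetical, and the "key:value" layout is an assumption made for illustration. Only the space separator and the skip-on-null behaviour are taken from the diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FilterQuerySketch {

    /** Joins non-null filter values with spaces; null-valued keys are skipped. */
    static String toQueryString(Map<String, String> filters) {
        StringBuilder queryString = new StringBuilder();
        for (Map.Entry<String, String> entry : filters.entrySet()) {
            String val = entry.getValue();
            if (val == null) continue;          // the check added in r2620
            if (queryString.length() > 0) {
                queryString.append(" ");
            }
            // "key:value" is an assumed layout, for illustration only.
            queryString.append(entry.getKey()).append(":").append(val);
        }
        return queryString.toString();
    }

    public static void main(String[] args) {
        Map<String, String> filters = new LinkedHashMap<>();
        filters.put("a", "1");
        filters.put("b", null);
        filters.put("c", "3");
        System.out.println(toQueryString(filters));
    }
}
```

Without the check, a null value would be appended as the literal text "null" and forwarded with the request.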
From: <bra...@us...> - 2008-10-28 01:55:18
Revision: 2619
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2619&view=rev
Author:   bradtofel
Date:     2008-10-28 01:30:59 +0000 (Tue, 28 Oct 2008)

Log Message:
-----------
BUGFIX(ACC-31): now escapes URLs as they are resolved in UrlOperations.

Modified Paths:
--------------
    branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java

Modified: branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java
===================================================================
--- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java	2008-10-28 01:29:30 UTC (rev 2618)
+++ branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java	2008-10-28 01:30:59 UTC (rev 2619)
@@ -83,7 +83,14 @@
 	public static String resolveUrl(String baseUrl, String url) {
 		// TODO: this only works for http://
 		if(url.startsWith("http://")) {
-			return url;
+			try {
+				return UURIFactory.getInstance(url).getEscapedURI();
+			} catch (URIException e) {
+				e.printStackTrace();
+				// can't let a space exist... send back close to whatever came
+				// in...
+				return url.replace(" ", "%20");
+			}
 		}
 		UURI absBaseURI;
 		UURI resolvedURI = null;
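The r2619 change follows a "try a strict escape, fall back to a minimal fix" pattern. The sketch below shows that pattern with the JDK's java.net.URI standing in for Heritrix's UURIFactory, so its escaping rules are not the same as the committed code. The class and method names are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class EscapeFallbackSketch {

    /**
     * Returns an escaped form of the URL. If strict parsing fails, falls
     * back to replacing spaces only, as the r2619 catch block does.
     */
    static String escapeOrFallback(String url) {
        try {
            // Stand-in for UURIFactory.getInstance(url).getEscapedURI().
            return new URI(url).toASCIIString();
        } catch (URISyntaxException e) {
            // Can't let a space exist: return something close to the input.
            return url.replace(" ", "%20");
        }
    }

    public static void main(String[] args) {
        System.out.println(escapeOrFallback("http://example.com/a"));
        System.out.println(escapeOrFallback("http://example.com/a b"));
    }
}
```

With java.net.URI, a raw space is rejected by the parser, so the second call takes the fallback branch. UURIFactory may well escape such input itself instead of throwing.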
From: <bra...@us...> - 2008-10-28 01:52:32
Revision: 2621 http://archive-access.svn.sourceforge.net/archive-access/?rev=2621&view=rev Author: bradtofel Date: 2008-10-28 01:52:22 +0000 (Tue, 28 Oct 2008) Log Message: ----------- BUGFIX(ACC-46): anchorDate adherence is now configured on AccessPoint, and is disabled by default. Modified Paths: -------------- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResults.java branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java Modified: branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResults.java =================================================================== --- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResults.java 2008-10-28 01:32:15 UTC (rev 2620) +++ branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/core/CaptureSearchResults.java 2008-10-28 01:52:22 UTC (rev 2621) @@ -100,17 +100,22 @@ } /** * @param wbRequest - * @param err if true, then check Request Anchor Window and Date, throwing - * exception if no Result is within the Window. + * @param useAnchor if true, then check Request Anchor Window and Date, + * throwing exception if no Result is within the Window. * @return The closest CaptureSearchResult to the request. */ - public CaptureSearchResult getClosest(WaybackRequest wbRequest, boolean err) + public CaptureSearchResult getClosest(WaybackRequest wbRequest, + boolean useAnchor) throws AnchorWindowTooSmallException { CaptureSearchResult closest = null; long closestDistance = 0; CaptureSearchResult cur = null; - String anchorDate = wbRequest.getAnchorTimestamp(); + String anchorDate = null; + // TODO: check if HTTP request referrer is set before using? 
+ if(useAnchor) { + anchorDate = wbRequest.getAnchorTimestamp(); + } long maxWindow = -1; long wantTime = wbRequest.getReplayDate().getTime(); if(anchorDate != null) { @@ -129,7 +134,7 @@ closestDistance = curDistance; } } - if(err && (maxWindow > 0)) { + if(useAnchor && (maxWindow > 0)) { if(closestDistance > maxWindow) { throw new AnchorWindowTooSmallException("Closest is " + closestDistance + " seconds away, Window is " + Modified: branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java =================================================================== --- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java 2008-10-28 01:32:15 UTC (rev 2620) +++ branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java 2008-10-28 01:52:22 UTC (rev 2621) @@ -79,6 +79,8 @@ AccessPoint.class.getName()); private boolean useServerName = false; + private boolean useAnchorWindow = false; + private int contextPort = 0; private String contextName = null; private String beanName = null; @@ -343,7 +345,7 @@ // TODO: check which versions are actually accessible right now? 
CaptureSearchResult closest = captureResults.getClosest(wbRequest, - true); + useAnchorWindow); resource = collection.getResourceStore().retrieveResource(closest); ReplayRenderer renderer = replay.getRenderer(wbRequest, closest, resource); renderer.renderResource(httpRequest, httpResponse, wbRequest, @@ -473,6 +475,20 @@ this.useServerName = useServerName; } + /** + * @return the useAnchorWindow + */ + public boolean isUseAnchorWindow() { + return useAnchorWindow; + } + + /** + * @param useAnchorWindow the useAnchorWindow to set + */ + public void setUseAnchorWindow(boolean useAnchorWindow) { + this.useAnchorWindow = useAnchorWindow; + } + public ExclusionFilterFactory getExclusionFactory() { return exclusionFactory; } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
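The selection logic touched by r2621 can be sketched on plain timestamps: pick the capture nearest the requested time, and enforce the anchor window only when the flag is on. This is an illustration under assumptions, not the CaptureSearchResults code. Times and the window share one arbitrary unit, IllegalStateException stands in for AnchorWindowTooSmallException, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class ClosestCaptureSketch {

    /**
     * Returns the capture time closest to wantTime. When useAnchor is true
     * and maxWindow is positive, fails if the closest capture is further
     * away than maxWindow.
     */
    static long closest(List<Long> captureTimes, long wantTime,
                        boolean useAnchor, long maxWindow) {
        if (captureTimes.isEmpty()) {
            throw new IllegalArgumentException("no captures");
        }
        long best = captureTimes.get(0);
        long bestDistance = Math.abs(best - wantTime);
        for (long t : captureTimes) {
            long distance = Math.abs(t - wantTime);
            if (distance < bestDistance) {
                best = t;
                bestDistance = distance;
            }
        }
        // Window check applies only when anchoring is enabled (off by default).
        if (useAnchor && maxWindow > 0 && bestDistance > maxWindow) {
            throw new IllegalStateException("Closest is " + bestDistance
                    + " away, window is " + maxWindow);
        }
        return best;
    }

    public static void main(String[] args) {
        List<Long> captures = Arrays.asList(100L, 200L, 900L);
        // Anchoring disabled: the window is ignored.
        System.out.println(closest(captures, 250L, false, 10L));
    }
}
```

With the flag off, which is the new AccessPoint default, a request never fails on window size alone.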
From: <bra...@us...> - 2008-10-28 01:29:33
Revision: 2618 http://archive-access.svn.sourceforge.net/archive-access/?rev=2618&view=rev Author: bradtofel Date: 2008-10-28 01:29:30 +0000 (Tue, 28 Oct 2008) Log Message: ----------- BUGFIX(unreported): If self-redirect filters cause all documents to be filtered from results, now throws ResourceNotInArchiveException. Modified Paths: -------------- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java Modified: branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java =================================================================== --- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java 2008-10-28 01:25:23 UTC (rev 2617) +++ branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java 2008-10-28 01:29:30 UTC (rev 2618) @@ -207,7 +207,8 @@ } protected SearchResults documentToSearchResults(Document document, - ObjectFilter<CaptureSearchResult> filter) { + ObjectFilter<CaptureSearchResult> filter) + throws ResourceNotInArchiveException { SearchResults results = null; NodeList filters = getRequestFilters(document); String resultsType = getResultsType(document); @@ -237,9 +238,11 @@ return results; } private CaptureSearchResults documentToCaptureSearchResults( - Document document, ObjectFilter<CaptureSearchResult> filter) { + Document document, ObjectFilter<CaptureSearchResult> filter) + throws ResourceNotInArchiveException { CaptureSearchResults results = new CaptureSearchResults(); NodeList xresults = getSearchResults(document); + int numAdded = 0; for(int i = 0; i < xresults.getLength(); i++) { Node xresult = xresults.item(i); CaptureSearchResult result = searchElementToCaptureSearchResult(xresult); @@ -252,9 +255,14 @@ if (ruling == ObjectFilter.FILTER_ABORT) { break; } else if (ruling == ObjectFilter.FILTER_INCLUDE) { + numAdded++; 
results.addSearchResult(result, true); } } + if(numAdded == 0) { + throw new ResourceNotInArchiveException("No documents matching" + + " filter"); + } return results; } private UrlSearchResult searchElementToUrlSearchResult(Node e) { This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
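The r2618 behaviour, where an empty filtered result is reported as an error instead of being returned silently, can be sketched with a generic predicate. This is a simplification under assumptions: the FILTER_ABORT branch is left out, NoSuchElementException stands in for ResourceNotInArchiveException, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

final class FilteredResultsSketch {

    /** Keeps the candidates accepted by the filter and fails if none survive. */
    static <T> List<T> includeOrFail(List<T> candidates, Predicate<T> include) {
        List<T> kept = new ArrayList<>();
        int numAdded = 0;
        for (T candidate : candidates) {
            if (include.test(candidate)) {
                numAdded++;
                kept.add(candidate);
            }
        }
        if (numAdded == 0) {
            // Stand-in for ResourceNotInArchiveException.
            throw new NoSuchElementException("No documents matching filter");
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a/", "http://b/");
        System.out.println(includeOrFail(urls, u -> u.contains("a")));
    }
}
```

Callers can then treat "everything was filtered out" the same way as "nothing was found".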
Revision: 2617
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2617&view=rev
Author:   bradtofel
Date:     2008-10-28 01:25:23 +0000 (Tue, 28 Oct 2008)

Log Message:
-----------
BUGFIX (ACC-30): now use original URL + timestamp as uniqueness key. This could still cause problems, in which case, we'll add digest perhaps.

Modified Paths:
--------------
    branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/DuplicateRecordFilter.java

Modified: branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/DuplicateRecordFilter.java
===================================================================
--- branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/DuplicateRecordFilter.java	2008-10-28 01:15:38 UTC (rev 2616)
+++ branches/wayback-1_4_1/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/DuplicateRecordFilter.java	2008-10-28 01:25:23 UTC (rev 2617)
@@ -18,7 +18,7 @@
 	 * @see org.archive.wayback.util.ObjectFilter#filterObject(java.lang.Object)
 	 */
 	public int filterObject(CaptureSearchResult o) {
-		String thisUrl = o.getUrlKey();
+		String thisUrl = o.getOriginalUrl();
 		String thisDate = o.getCaptureTimestamp();
 		int result = ObjectFilter.FILTER_INCLUDE;
 		if(lastUrl != null) {
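The uniqueness key described in r2617 can be sketched as a small stateful filter that drops a record only when both its original URL and its timestamp match the previous record. The comparison logic is an assumption, since the diff shows only the first lines of filterObject. The class and method names are hypothetical, and input is assumed to arrive sorted so that duplicates are adjacent.

```java
final class DuplicateFilterSketch {
    private String lastUrl;
    private String lastDate;

    /** Returns true if the record should be kept, false if it repeats the previous one. */
    boolean include(String originalUrl, String captureTimestamp) {
        boolean duplicate = originalUrl.equals(lastUrl)
                && captureTimestamp.equals(lastDate);
        lastUrl = originalUrl;
        lastDate = captureTimestamp;
        return !duplicate;
    }

    public static void main(String[] args) {
        DuplicateFilterSketch filter = new DuplicateFilterSketch();
        System.out.println(filter.include("http://example.org/", "20080618133034"));
        System.out.println(filter.include("http://example.org/", "20080618133034"));
        // Same canonical key, different original URL: kept under the new rule.
        System.out.println(filter.include("http://EXAMPLE.org/", "20080618133034"));
    }
}
```

Keying on the original URL means two captures whose URLs canonicalize to the same key are no longer collapsed into one, which appears to be the problem ACC-30 describes.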
From: <bra...@us...> - 2008-10-28 01:15:43
Revision: 2616
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2616&view=rev
Author:   bradtofel
Date:     2008-10-28 01:15:38 +0000 (Tue, 28 Oct 2008)

Log Message:
-----------
Maintenance release 1.4.1

Added Paths:
-----------
    branches/wayback-1_4_1/
From: <bra...@us...> - 2008-10-28 01:11:01
Revision: 2615
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2615&view=rev
Author:   bradtofel
Date:     2008-10-28 01:10:57 +0000 (Tue, 28 Oct 2008)

Log Message:
-----------
oops.

Removed Paths:
-------------
    branches/wayback-1_4_1/
From: <bra...@us...> - 2008-10-28 01:07:54
Revision: 2614
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2614&view=rev
Author:   bradtofel
Date:     2008-10-28 01:07:50 +0000 (Tue, 28 Oct 2008)

Log Message:
-----------
Maintenance release

Added Paths:
-----------
    branches/wayback-1_4_1/
From: <bi...@us...> - 2008-10-13 18:58:35
Revision: 2613
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2613&view=rev
Author:   binzino
Date:     2008-10-13 18:58:29 +0000 (Mon, 13 Oct 2008)

Log Message:
-----------
NutchWAX 0.12.2 release tag.

Added Paths:
-----------
    tags/nutchwax-0_12_2/archive/


Property changes on: tags/nutchwax-0_12_2/archive
___________________________________________________________________
Added: svn:mergeinfo
   +
From: <bi...@us...> - 2008-10-13 18:57:15
Revision: 2612
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2612&view=rev
Author:   binzino
Date:     2008-10-13 18:57:09 +0000 (Mon, 13 Oct 2008)

Log Message:
-----------
Removing these directories due to botched copy.

Removed Paths:
-------------
    tags/nutchwax-0_12_2/archive/
    tags/nutchwax-0_12_2/imagesearch/
From: <bi...@us...> - 2008-10-13 18:38:05
Revision: 2611
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2611&view=rev
Author:   binzino
Date:     2008-10-13 18:37:57 +0000 (Mon, 13 Oct 2008)

Log Message:
-----------
Added info about sample files.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt

Modified: trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt	2008-10-13 18:08:29 UTC (rev 2610)
+++ trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt	2008-10-13 18:37:57 UTC (rev 2611)
@@ -103,3 +103,15 @@
   OpenSearch XML      : http://someserver/opensearch?query=foo
   Human-friendly HTML : http://someserver/coolsearch?query=foo
 
+
+======================================================================
+Samples
+======================================================================
+
+You can find sample 'web.xml' and 'search.xsl' files in
+
+  contrib/archive/web
+
+in the compiled Nutch package.  Or in this source tree under
+
+  src/web
From: <bi...@us...> - 2008-10-13 18:08:38
Revision: 2610 http://archive-access.svn.sourceforge.net/archive-access/?rev=2610&view=rev Author: binzino Date: 2008-10-13 18:08:29 +0000 (Mon, 13 Oct 2008) Log Message: ----------- Ooops, trying to abort this copy/delete... Added Paths: ----------- tags/nutchwax-0_12_2/ Removed Paths: ------------- tags/nutchwax-0_12_2/archive/HOWTO-dedup.txt tags/nutchwax-0_12_2/archive/HOWTO-xslt.txt tags/nutchwax-0_12_2/archive/HOWTO.txt tags/nutchwax-0_12_2/archive/INSTALL.txt tags/nutchwax-0_12_2/archive/LICENSE.txt tags/nutchwax-0_12_2/archive/README-dedup.txt tags/nutchwax-0_12_2/archive/README.txt tags/nutchwax-0_12_2/archive/RELEASE-NOTES.txt tags/nutchwax-0_12_2/archive/bin/ tags/nutchwax-0_12_2/archive/build.xml tags/nutchwax-0_12_2/archive/conf/ tags/nutchwax-0_12_2/archive/lib/ tags/nutchwax-0_12_2/archive/src/ tags/nutchwax-0_12_2/imagesearch/README.txt tags/nutchwax-0_12_2/imagesearch/bin/ tags/nutchwax-0_12_2/imagesearch/build.xml tags/nutchwax-0_12_2/imagesearch/conf/ tags/nutchwax-0_12_2/imagesearch/lib/ tags/nutchwax-0_12_2/imagesearch/src/ Property changes on: tags/nutchwax-0_12_2 ___________________________________________________________________ Added: svn:mergeinfo + Deleted: tags/nutchwax-0_12_2/archive/HOWTO-dedup.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO-dedup.txt 2008-10-13 17:48:52 UTC (rev 2609) +++ tags/nutchwax-0_12_2/archive/HOWTO-dedup.txt 2008-10-13 18:08:29 UTC (rev 2610) @@ -1,323 +0,0 @@ - -HOWTO-dedup.txt -2008-07-03 -Aaron Binns - -Table of Contents - o Prerequisites - - NutchWAX HOWTO.txt - - Wayback 1.2.1 - o Overview - o Generate CDX - o Generate DUP - o Import - o Update and Invert - o Index - o Add Revisit Dates - o Search - o Web deployment - - -====================================================================== -Prerequisites -====================================================================== - -This de-duplication HOWTO assumes you've 
already read the main HOWTO -and are familiar with importing and indexing archive files with -NutchWAX. - -For de-duplication, the Wayback Machine tools are required. This guide -assumes you have Wayback 1.2.1 installed in - - /opt/wayback-1.2.1 - - -====================================================================== -Overview -====================================================================== - -The README-dedup.txt explains the de-duplication process in greater -detail, including implementation details. - -NutchWAX does not automagically detect and eliminate duplicate records -when importing and indexing. However, tools are provided to help the -user implement a system to perform de-duplication. - -This guide describes one such system using the tools provided by -NutchWAX and Wayback. - - -====================================================================== -Generate CDX -====================================================================== - -The first step is to generate a list of duplicate records for a set of -ARC files. - -This step is not necessary if your archive files are in WARC format -and de-duplication was performed during the crawl. - -To generate the list of duplicates, we use the Wayback 'arc-indexer' -with the NutchWAX 'dedup-cdx' utility. The CDX files *must* be -sorted. - - $ arc-indexer foo.arc.gz | sort > foo.cdx - $ arc-indexer bar.arc.gz | sort > bar.cdx - $ arc-indexer baz.arc.gz | sort > baz.cdx - -Then we combine the CDX files into one sorted CDX containing all the -records: - - $ sort -m foo.cdx bar.cdx baz.cdx > all.cdx - -The "-m" option speeds up the sort by merging the already-sorted -files. 
- - -====================================================================== -Generate DUP/Revisits -====================================================================== - -Now that we have 'all.cdx' containing a sorted list of all the records -in the ARC files, we can generate a list of duplicates therein: - - $ dedup-cdx all.cdx > all.dup - -This "all.dup" file contains lines of the form - - example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911 - -Where each line is - - URL digest date - -This file is then used as an exlusion filter for importing. - - -WARC ----- -If we are using WARC files with revisit records instead of ARC files, -then we don't generate a list of duplicate records because there -shouldn't be any. - -However, the revisit records in the WARC files do have the dates when -a URL was revisited and seen to have not changed -- which is more or -less the same thing as our "dup" lines above. - -For extracting these revisits from WARC CDX files, we use the -'revisits' utility provided by NutchWAX - - $ revisits all-warc.cdx > all-warc.dup - -The output of 'revisits' is in the same format as 'dedup-cdx'. - - -====================================================================== -Import -====================================================================== - -The import process is essentially the same as in NutchWAX, but now -we use "all.dup" as our exclusion list. 
- -First, we create a manifest - - $ cat > manifest - foo.arc.gz test-collection - bar.arc.gz test-collection - baz.arc.gz test-collection - ^D - - $ nutchwax import -e all.dup manifest - -The result will be a newly-created Nutch segment, same as importing -without de-duplication. - -If you examine the Nutch "hadoop.log" file, you will see INFO-level -lines from the NutchWAX Importer showing which URLs were excluded. - -WARC ----- -If you are importing WARC files with revisit records, then you -typically won't need to provide an exclusion file as the WARC files -were de-duplicated during the crawl. - - -====================================================================== -Update and Invert -====================================================================== - -Perform the Nutch "updatedb" and "invertlinks" steps as normal. - -Nothing special/different to do here with respect to de-duplication. - - -====================================================================== -Index -====================================================================== - -The only chage we make to the indexing step is the destination of the -index directory. - -By default, Nutch expects the per-segment index directory to live in a -sub-directory called 'indexes' and the index command is accordingly - - $ nutch index indexes crawldb linkdb segments/* - -Resulting in an index directory structure of the form - - indexes/part-00000 - -For de-duplication, we use a slightly different directory structure, -which will be used by a de-duplication-aware NutchWaxBean at -search-time. The directory structure we use is: - - pindexes/<segment>/part-00000 - -Using the segment name is not strictly required, but it is a good -practice and is strongly recommended. This way the segment and its -corresponding index directory are easily matched. 
- -Let's assume that the segment directory created during the import is -named - - segments/20080703050349 - -In that case, our index command becomes: - - $ nutch index pindexes/20080703050349 crawldb linkdb segments/20080703050349 - -Upon completion, the Lucene index is created in - - pindexes/20080703050349/part-0000 - -This index is exactly the same as one normally created by Nutch, the -only difference is the location. - - -====================================================================== -Add Revisit Dates -====================================================================== - -Now that we have the Nutch index, we add the revisit dates to it. - -Examine the "all.dup" file again, it has lines of the form - - example.org/robots.txt sha1:4G3PAROKCYJNRGZIHJO5PVLZ724FX3GN 20080618133034 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080613194800 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080616061312 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132204 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080618132213 - example.org/robots.txt sha1:AGW5DJIEUBL67473477TDVBBGDZ37AEZ 20080619132911 - -These are the revisit dates that need to be added to the records in -the Lucene index. When we generated the index, only the date of the -first visit was put in the index. Now we have to add these. - -As explained in README-dedup.txt, modifying the Lucene index to -actually add these dates is infeasible. What we do is create a -parallel index next to the main index (the part-00000 created above) -that contains all the dates for each record. - -The NutchWAX 'add-dates' command creates this parallel index for us. - - $ nutchwax add-dates pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/dates \ - all.dup - -Yes, the part-0000 argument does appear twice. This is beacuse it is -both the "key" index and the "source" index. 
- - -Suppose we did another crawl and had even more dates to add to the -existing index. In that case we would run - - $ nutchwax add-dates pindexes/20080703050349/part-0000 \ - pindexes/20080703050349/dates \ - pindexes/20080703050349/new-dates \ - new-crawl.dup - $ rm -r pindexes/20080703050349/dates - $ mv pindexes/20080703050349/new-dates pindexes/20080703050349/dates - -This copies the existing dates from "dates" to "new-dates" and adds -additional ones from "new-crawl.dup" along the way. Then we replace -the previous "dates" index with the new one. - - -WARC ----- -This step is the same for ARCs and WARCs. - -The only difference is that our "all.dup" file containing the list of -revisit dates was created by different utilities: 'dedup-cdx' for ARCs -and 'revisits' for WARCs. - - -====================================================================== -Search -====================================================================== - -Test/debug searches can be run from the command-line, but instead of -using the 'NutchBean' we use 'NutchWaxBean'. - -The "NutchWaxBean" extends NutchBean by adding support for parallel -indexes. - - $ nutch org.archive.nutchwax.NutchWaxBean <query> - -The "NutchWaxBean" also gives slightly more verbose and useful ouput, - - $ nutch org.archive.nutchwax.NutchWaxBean carolina - Total hits: 247338 - 0 [20080702053119] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [20080618133218, 20080618133218] - ... Studios Blue Ridge Motion Pictures Carolina Pinnacle Creative Network EUE/Screen ... Trailblazer Studios Federal Tax Incentive Carolina Pinnacle Studios ... 
- 1 [20080703023605] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html] [http://www.ncfilm.com/incentives-benefits/facilities/carolina-pinnacle-studios.html sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [sha1:WAMSFQPBRDMLOV3KETKCCTLJE3OTB23A] [20080613200046, 20080618133218] - -The output consists of - - hit number - segment - url - key (which is url + digest) - digest - dates - -The most useful bit here for testing de-duplication is the list of -dates. - - -====================================================================== -Web Deployment -====================================================================== - -As noted in the HOWTO.txt document, when the nutch(wax) webapp is -deployed, changes made to the configuration must be also applied to -the deployed webapp. - -In addition to those configuration changes, the "web.xml" file must -also be modified. - -In Nutch, the "web.xml" file contains a directive to call a static -method on 'NutchBean' to initialize it. In order to search the -parallel indexes we have to use 'NutchWaxBean'. This is done by -modifying the "web.xml" to call a NutchWaxBean initializer after the -NutchBean initializer. 
- -Change "web.xml" from - - <listener> - <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> - </listener> - -to: - - <listener> - <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> - <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class> - </listener> - Deleted: tags/nutchwax-0_12_2/archive/HOWTO-xslt.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO-xslt.txt 2008-10-13 17:48:52 UTC (rev 2609) +++ tags/nutchwax-0_12_2/archive/HOWTO-xslt.txt 2008-10-13 18:08:29 UTC (rev 2610) @@ -1,105 +0,0 @@ - -HOWTO-xslt.txt -2008-07-25 -Aaron Binns - -Table of Contents - o Prerequisites - - NutchWAX HOWTO.txt - o Overview - o XSLTFilter and web.xml - - -====================================================================== -Prerequisites -====================================================================== - -This HOWTO assumes you've already read the main NutchWAX HOWTO and are -familiar with importing and indexing archive files with NutchWAX. - -Also, we assume that you are familiar with deploying the Nutch(WAX) -web application into a servlet container such as Tomcat. - - -====================================================================== -Overview -====================================================================== - -Nutch is bundled with two search interfaces - - JSP pages: search.jsp, refine-query.jsp, etc. - Servlet : OpenSearchServlet - -If you read the OpenSearchServlet.java source code and the search.jsp -page, you'll notice a lot of similarity, if not duplication of code. - -The Internet Archive Web Team plans to improve and expand upon the -existing OpenSearchServlet interface as well as adding more XML-based -capabilities, including replacements for the existing JSP pages. In -short, moving away from JSP and toward XML. 
- -But by favoring XML over JSP, how does one make an HTML UI? By adding -XSLT to the XML interfaces. - -This HOWTO describes the process for adding an XSL transformation to -the OpenSearch XML output. - -This shall be the blueprint for future XML-based interfaces as well. - - -====================================================================== -XSLTFilter and web.xml -====================================================================== - -Adding an XSL transformation to an XML-based interface, such as the -OpenSearchServlet is straightforward. Simply add the XSLTFilter to -the servlet's path and specify the XSL transform to apply. - -For example, consider the default Nutch web.xml - - <servlet> - <servlet-name>OpenSearch</servlet-name> - <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class> - </servlet> - - <servlet-mapping> - <servlet-name>OpenSearch</servlet-name> - <url-pattern>/opensearch</url-pattern> - </servlet-mapping> - -Let's say we want to retain the '/opensearch' path for the XML output, -and add the human-friendly HTML page at '/coolsearch' - -First, we add an additional 'servlet-mapping' for our new path: - - <servlet-mapping> - <servlet-name>OpenSearch</servlet-name> - <url-pattern>/coolsearch</url-pattern> - </servlet-mapping> - -Then, we add the XSLTFilter, passing it a URL to the XSLT file - - <filter> - <filter-name>XSLT Filter</filter-name> - <filter-class>org.archive.nutchwax.XSLTFilter</filter-class> - <init-param> - <param-name>xsltUrl</param-name> - <param-value>[URL to XSLT file]</param-value> - </init-param> - </filter> - -Lastly, we apply the filter to the same path as the our human-friendly -HTML path: - - <filter-mapping> - <filter-name>XSLT Filter</filter-name> - <url-pattern>/coolsearch</url-pattern> - </filter-mapping> - -This way, we have two URLs, which run the exact same -OpenSearchServlet, but one produces the unperturbed OpenSearch XML -output whereas the other produces human-friendly HTML output. 
- - OpenSearch XML : http://someserver/opensearch?query=foo - Human-friendly HTML : http://someserver/coolsearch?query=foo - Deleted: tags/nutchwax-0_12_2/archive/HOWTO.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/HOWTO.txt 2008-10-13 17:48:52 UTC (rev 2609) +++ tags/nutchwax-0_12_2/archive/HOWTO.txt 2008-10-13 18:08:29 UTC (rev 2610) @@ -1,466 +0,0 @@ - -HOWTO.txt -2008-07-28 -Aaron Binns - -Table of Contents - o Prerequisites - - Nutch(WAX) installation - - ARC/WARC files - o Configuration & Patching - o Create a manifest - o Import, Invert and Index - o Search - o Web deployment - - Don't forget to config & patch again - -====================================================================== -Prerequisites -====================================================================== - -In order to use Nutch(WAX) you need the following prerequisites: - - 1. NutchWAX installed. - - See INSTALL.txt for instruction on building and installing - NutchWAX. - - This HOWTO assumes it is installed in - - /opt/nutch-1.0-dev - - 2. ARC/WARC files. - - The whole purpose of NutchWAX is to index ARC/WARC files. These - files are not produced by Nutch nor NutchWAX, they are produced by - other tools, such as Heritrix. - - If you don't have any ARC/WARC files, you have no need for - NutchWAX. - - -====================================================================== -Patching -====================================================================== - -The vanilla NutchWAX as built according to the INSTALL.txt guide is -not quite ready to be used out-of-the-box. - -Before you can use NutchWAX, you must first patch a bug that exists in -the current Nutch SVN head. - -The file - - /opt/nutch-1.0-dev/conf/tika-mimetypes.xml - -contains two errors: one where a mimetype is referenced before it is -defined; and a second where a definition has an illegal character. 
- -These errors cause Nutch to not recognize certain mimetypes and -therefore will ignore documents matching those mimetypes. - -There are two fixes: - - 1. Move - - <mime-type type="application/xml"> - <alias type="text/xml" /> - <glob pattern="*.xml" /> - </mime-type> - - definition higher up in the file, before the reference to it. - - 2. Remove - - <mime-type type="application/x-ms-dos-executable"> - <alias type="application/x-dosexec;exe" /> - </mime-type> - - as the ';' character is illegal according to the comments in the - Nutch code. - -You can either apply these patches yourself, or copy an already-patched -copy from: - - /opt/nutch-1.0-dev/contrib/archive/conf/tika-mimetypes.xml - -to - - /opt/nutch-1.0-dev/conf/tika-mimetypes.xml - - -====================================================================== -Configuring -====================================================================== - -Since we assume that you are already familiar with Nutch, then you -should already be familiar with configuring it. The configuration -is mainly defined in - - /opt/nutch-1.0-dev/conf/nutch-default.xml - -NutchWAX requires the modification of two existing properties and the -addition of two new ones. 
- -All of the modifications described below can be found in: - - /opt/nutch-1.0-dev/contrib/archive/conf/nutch-site.xml - -You can either apply the configuration changes yourself, or copy that -file to - - /opt/nutch-1.0-dev/conf/nutch-site.xml - --------------------------------------------------- -plugin.includes --------------------------------------------------- -Change the list of plugins from: - - protocol-http|urlfilter-regex|parse-(text|html|js)|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic) - -to - - protocol-http|parse-(text|html|js|pdf)|index-(basic|anchor|nutchwax)|query-(basic|site|url|nutchwax)|summary-basic|scoring-opic|urlfilter-nutchwax - -In short, we add: - - index-nutchwax - query-nutchwax - urlfilter-nutchwax - parse-pdf - -and remove: - - urlfilter-regex - urlnormalizer-(pass|regex|basic) - -The only *required* changes are the additions of the NutchWAX index -and query plugins. The rest are optional, but recommended. - -The "parse-pdf" plugin is added simply because we have lots of PDFs in -our archives and we want to index them. We sometimes remove the -"parse-js" plugin if we don't care to index JavaScript files. - -We also remove the default Nutch URL filtering and normalizing plugins -because we do not need the URLs normalized nor filtered. We trust -that the tool that produced the ARC/WARC file will have normalized the -URLs contained therein according to its own rules so there's no need -to normalize here. Also, we don't filter by URL since we want to -index as much of the ARC/WARC file as we have parsers for. - -We do, however, add the NutchWAX URL filter. If de-duplication is -being performed upon import, this plugin is required. It performs URL -filtering of the list of ARC records to exclude based on -URL+digest+date. 
- --------------------------------------------------- -indexingfilter.order --------------------------------------------------- - -Add this property with a value of - - org.apache.nutch.indexer.basic.BasicIndexingFilter - org.archive.nutchwax.index.ConfigurableIndexingFilter - -So that the NutchWAX indexing filter is run after the Nutch basic -indexing filter. - -A full explanation is given in "README-dedup.txt". - --------------------------------------------------- -mime.type.magic --------------------------------------------------- -We disable mimetype detection in Nutch for two reasons: - -1. The ARC/WARC file specifies the Content-Type of the document. We - trust that the tool that created the ARC/WARC file got it right. - -2. The implementation in Nutch can use a lot of memory as the *entire* - document is read into memory as a byte[], then converted to a - String, then checked against the MIME database. This can lead to - out of memory errors for large files, such as music and video. - -To disable, simply set the property value to false. - - <property> - <name>mime.type.magic</name> - <value>false</value> - </property> - --------------------------------------------------- -nutchwax.filter.index --------------------------------------------------- -Configure the 'index-nutchwax' plugin. Specify how the metadata -fields added by the Importer are mapped to the Lucene documents during -indexing. 
- -The specifications here are of the form: - - src-key:lowercase:store:tokenize:exclusive:dest-key - -where the only required part is the "src-key", the rest will assume -the following defaults: - - lowercase = true - store = true - tokenize = false - exclusive = true - dest-key = src-key - -We recommend: - -<property> - <name>nutchwax.filter.index</name> - <value> - url:false:true:true - url:false:true:false:true:exacturl - orig:false - digest:false - filename:false - fileoffset:false - collection - date - type - length - </value> -</property> - -The "url", "orig" and "digest" values are required, the rest are -optional, but strongly recommended. - --------------------------------------------------- -nutchwax.filter.query --------------------------------------------------- -Configure the 'query-nutchwax' plugin. Specify which fields to make -searchable via "field:[term|phrase]" query syntax, and whether they -are "raw" fields or not. - -The specification format is one of: - - field:<name>:<boost> - raw:<name>:<lowercase>:<boost> - group:<name>:<lowercase>:<delimiter>:<boost> - -Default values are - - lowercase = true - delimiter = "," - boost = 1.0f - -There is no "lowercase" property for "field" specification because the -Nutch FieldQueryFilter doesn't expose the option, unlike the -RawFieldQueryFilter. - -The "group" fields are raw fields that can accept multiple values, -separated by a delimiter.
Multiple values appearing in a query are -automagically translated into required OR-groups, such as - - collection:"193,221,36" => +(collection:193 collection:221 collection:36) - -NOTE: We do *not* use this filter for handling "date" queries, there -is a specific filter for that: DateQueryFilter - -We recommend: - -<property> - <name>nutchwax.filter.query</name> - <value> - raw:digest:false - raw:filename:false - raw:fileoffset:false - raw:exacturl:false - group:collection - group:type - field:anchor - field:content - field:host - field:title - </value> -</property> - - --------------------------------------------------- -nutchwax.urlfilter.wayback.exclusions --------------------------------------------------- -File containing the exclusion list for importing. - -Normally, this is specified on the command line when the NutchWAX -Importer is invoked. It can be specified here if preferred. - --------------------------------------------------- -nutchwax.urlfilter.wayback.canonicalizer --------------------------------------------------- - -For CDX-based de-duplication, the same URL canonicalization algorithm -must be used here as was used to generate the CDX files. - -The default canonicalizer in Wayback's '(w)arc-indexer' utility -is - - org.archive.wayback.util.url.AggressiveUrlCanonicalizer - -which is the value provided in "nutch-site.xml". - -If the '(w)arc-indexer' is executed with the "-i" (identity) -command-line option, then the matching canonicalizer - - org.archive.wayback.util.url.IdentityUrlCanonicalizer - -must be specified here. - --------------------------------------------------- -nutchwax.filter.http.status --------------------------------------------------- -This property configures a filter with a list of ranges -of HTTP status codes to allow. - -Typically, most NutchWAX implementors do not wish to import and index -404, 500, 302 and other non-success pages.
This is an inclusion -filter, meaning that only ARC records with an HTTP status code -matching any of the values will be imported. - -There is a special "unknown" value which can be used to include ARC -records that don't have an HTTP status code (for whatever reason). - -The default setting provided in nutch-site.xml is to allow any 2XX -success code: - - <property> - <name>nutchwax.filter.http.status</name> - <value> - 200-299 - </value> - </property> - -But some other examples are: - - Allow any 2XX success code *and* redirects, use: - <property> - <name>nutchwax.filter.http.status</name> - <value> - 200-299 - 300-399 - </value> - </property> - - Be really strict about only certain codes, use: - <property> - <name>nutchwax.filter.http.status</name> - <value> - 200 - 301 - 302 - 304 - </value> - </property> - - Mix of ranges and specific codes, including the "unknown" - <property> - <name>nutchwax.filter.http.status</name> - <value> - Unknown - 200 - 300-399 - </value> - </property> - --------------------------------------------------- -nutchwax.import.content.limit --------------------------------------------------- -Similar to Nutch's - - file.content.limit - http.content.limit - ftp.content.limit - -properties, this specifies a limit on the size of a document imported -via NutchWAX. - -We recommend setting this to a size compatible with the memory -capacity of the computers performing the import. Something in the -1-4MB range is typical. - - -====================================================================== -Create a manifest -====================================================================== - -The input to NutchWAX's import tool is a manifest file. This is a -simple text file where each line contains a URL to an ARC/WARC file -and an optional collection name. 
- -For example: - - $ cat > manifest - http://someserver/somepath/somearchive.arc.gz mycollection - ^D - -Creates a simple manifest file with one ARC file and a collection -name of "mycollection". - -You don't have to use collections at all. If you don't know how you -would use it, then simply leave it out here. - - -====================================================================== -Import, Invert and Index -====================================================================== - -The steps to import the files, invert the links and index the documents -are rather simple: - - $ mkdir crawl - $ cd crawl - $ /opt/nutch-1.0-dev/bin/nutchwax import ../manifest - $ /opt/nutch-1.0-dev/bin/nutch updatedb crawldb -dir segments - $ /opt/nutch-1.0-dev/bin/nutch invertlinks linkdb -dir segments - $ /opt/nutch-1.0-dev/bin/nutch index indexes crawldb linkdb segments/* - $ ls -F1 - crawldb/ - indexes/ - linkdb/ - segments/ - -To those already familiar with Nutch, these steps should be quite -familiar. - -In the first step, we call NutchWAX's "import" command, which creates the -Nutch segment containing the documents in the ARC/WARC files listed in -the manifest. The rest is the same as regular Nutch. - - -====================================================================== -Search -====================================================================== -The resulting indexes can be searched in exactly the same manner as in -regular Nutch. For example, assuming you just completed the steps -above, now: - - $ cd ../ - $ ls -F1 - crawl/ - $ /opt/nutch-1.0-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer - -This calls the NutchBean to execute a simple keyword search for -"computer". Use whatever query term you think appears in the -documents you imported.
- - -====================================================================== -Web Deployment -====================================================================== - -As users of Nutch are aware, the web application (nutch-1.0-dev.war) -bundled with Nutch contains duplicate copies of the configuration -files. - -So, all patches and configuration changes that we made to the -files in - - /opt/nutch-1.0-dev/conf - -will have to be duplicated in the Nutch webapp when it is deployed. - -This is not due to NutchWAX, this is a "feature" of regular Nutch. I -just thought it would be good to remind everyone since we did make -configuration changes for NutchWAX. Deleted: tags/nutchwax-0_12_2/archive/INSTALL.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-10-13 17:48:52 UTC (rev 2609) +++ tags/nutchwax-0_12_2/archive/INSTALL.txt 2008-10-13 18:08:29 UTC (rev 2610) @@ -1,93 +0,0 @@ - -INSTALL.txt -2008-10-01 -Aaron Binns - -This installation guide assumes the reader is already familiar with -building, packaging and deploying Nutch 1.0-dev. - - -The NutchWAX 0.12 source and build system are designed to integrate -into the existing Nutch 1.0-dev source and build. - -The long-term goal is for the NutchWAX components to be fully -integrated into mainline Nutch. As a stepping-stone toward that goal, -we have packaged the NutchWAX source to be dropped into the Nutch -"contrib" directory and built from there. - -Like Nutch, NutchWAX 0.12 uses a simple 'ant' build script. The -NutchWAX build script calls out to the Nutch script to build Nutch -proper, then builds NutchWAX components and integrates them into the -Nutch build directory. - -We recommend that you execute all build commands from the NutchWAX -directory. This way, NutchWAX will ensure that any and all -dependencies in Nutch will be properly built and kept up-to-date. 
-Towards this goal, we have duplicated the most common build targets -from the Nutch 'build.xml' file to the NutchWAX 'build.xml' file, -such as: - - o compile - o jar - o job - o tar - o clean - -Again, the idea is that if you're already used to building Nutch, you -can easily transition to building Nutch and NutchWAX together. All of -the build artifacts will still be placed in Nutch's 'build' -sub-directory as normal. - - -Nutch-1.0-dev -------------- -As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. -Nutch doesn't have a 1.0 release package yet, so we have to use the -Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.2 is -built against is: - - 701524 - -To checkout this revision of Nutch, use: - - $ svn checkout -r 701524 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch - $ cd nutch - - -NutchWAX --------- -Once you have Nutch-1.0-dev checked-out, check-out NutchWAX into -Nutch's "contrib" directory. - - $ cd contrib - $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nutchwax/archive - -This will create a sub-directory named "archive" containing the -NutchWAX sources. - - -Build and install ------------------ -Assuming you already have the required tool-set for building Nutch, -building NutchWAX is a snap. - -Simply execute the same 'ant' build command in - - nutch/contrib/archive - -as you normally would and everything will build as normal. - -For example - - $ cd nutch/contrib/archive - $ ant tar - -This command will build all of Nutch, then the NutchWAX add-ons and -finally will package everything up into the "nutch-1.0-dev.tar.gz" -release package. - -Then, install the "nutch-1.0-dev.tar.gz" tarball as normal. 
For -example: - - $ cd /opt - $ tar xvfz nutch-1.0-dev.tar.gz Deleted: tags/nutchwax-0_12_2/archive/LICENSE.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/LICENSE.txt 2008-10-13 17:48:52 UTC (rev 2609) +++ tags/nutchwax-0_12_2/archive/LICENSE.txt 2008-10-13 18:08:29 UTC (rev 2610) @@ -1,519 +0,0 @@ - -NutchWAX is free software. Except as noted, it is licensed under the -terms of the GNU Lesser General Public License (LGPL), reproduced below. - -Source code derived from Nutch retains the Apache License, as -stipulated by that license. - -Libraries used by NutchWAX are redistributed under their respective -licenses, which can be found in a file with the same name as the -library, suffixed by ".LICENSE". For example, the license for -"foo.jar" can be found in "foo.LICENSE". - -All other files not carrying an explicit license are licensed under -the GNU Lesser General Public License version 2.1 (included below) - -====================================================================== - - GNU LESSER GENERAL PUBLIC LICENSE - Version 2.1, February 1999 - - Copyright (C) 1991, 1999 Free Software Foundation, Inc. - 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA - Everyone is permitted to copy and distribute verbatim copies - of this license document, but changing it is not allowed. - -[This is the first released version of the Lesser GPL. It also counts - as the successor of the GNU Library Public License, version 2, hence - the version number 2.1.] - - Preamble - - The licenses for most software are designed to take away your -freedom to share and change it. By contrast, the GNU General Public -Licenses are intended to guarantee your freedom to share and change -free software--to make sure the software is free for all its users.
- - This license, the Lesser General Public License, applies to some -specially designated software packages--typically libraries--of the -Free Software Foundation and other authors who decide to use it. You -can use it too, but we suggest you first think carefully about whether -this license or the ordinary General Public License is the better -strategy to use in any particular case, based on the explanations below. - - When we speak of free software, we are referring to freedom of use, -not price. Our General Public Licenses are designed to make sure that -you have the freedom to distribute copies of free software (and charge -for this service if you wish); that you receive source code or can get -it if you want it; that you can change the software and use pieces of -it in new free programs; and that you are informed that you can do -these things. - - To protect your rights, we need to make restrictions that forbid -distributors to deny you these rights or to ask you to surrender these -rights. These restrictions translate to certain responsibilities for -you if you distribute copies of the library or if you modify it. - - For example, if you distribute copies of the library, whether gratis -or for a fee, you must give the recipients all the rights that we gave -you. You must make sure that they, too, receive or can get the source -code. If you link other code with the library, you must provide -complete object files to the recipients, so that they can relink them -with the library after making changes to the library and recompiling -it. And you must show them these terms so they know their rights. - - We protect your rights with a two-step method: (1) we copyright the -library, and (2) we offer you this license, which gives you legal -permission to copy, distribute and/or modify the library. - - To protect each distributor, we want to make it very clear that -there is no warranty for the free library. 
Also, if the library is -modified by someone else and passed on, the recipients should know -that what they have is not the original version, so that the original -author's reputation will not be affected by problems that might be -introduced by others. - - Finally, software patents pose a constant threat to the existence of -any free program. We wish to make sure that a company cannot -effectively restrict the users of a free program by obtaining a -restrictive license from a patent holder. Therefore, we insist that -any patent license obtained for a version of the library must be -consistent with the full freedom of use specified in this license. - - Most GNU software, including some libraries, is covered by the -ordinary GNU General Public License. This license, the GNU Lesser -General Public License, applies to certain designated libraries, and -is quite different from the ordinary General Public License. We use -this license for certain libraries in order to permit linking those -libraries into non-free programs. - - When a program is linked with a library, whether statically or using -a shared library, the combination of the two is legally speaking a -combined work, a derivative of the original library. The ordinary -General Public License therefore permits such linking only if the -entire combination fits its criteria of freedom. The Lesser General -Public License permits more lax criteria for linking other code with -the library. - - We call this license the "Lesser" General Public License because it -does Less to protect the user's freedom than the ordinary General -Public License. It also provides other free software developers Less -of an advantage over competing non-free programs. These disadvantages -are the reason we use the ordinary General Public License for many -libraries. However, the Lesser license provides advantages in certain -special circumstances. 
- - For example, on rare occasions, there may be a special need to -encourage the widest possible use of a certain library, so that it becomes -a de-facto standard. To achieve this, non-free programs must be -allowed to use the library. A more frequent case is that a free -library does the same job as widely used non-free libraries. In this -case, there is little to gain by limiting the free library to free -software only, so we use the Lesser General Public License. - - In other cases, permission to use a particular library in non-free -programs enables a greater number of people to use a large body of -free software. For example, permission to use the GNU C Library in -non-free programs enables many more people to use the whole GNU -operating system, as well as its variant, the GNU/Linux operating -system. - - Although the Lesser General Public License is Less protective of the -users' freedom, it does ensure that the user of a program that is -linked with the Library has the freedom and the wherewithal to run -that program using a modified version of the Library. - - The precise terms and conditions for copying, distribution and -modification follow. Pay close attention to the difference between a -"work based on the library" and a "work that uses the library". The -former contains code derived from the library, whereas the latter must -be combined with the library in order to run. - - GNU LESSER GENERAL PUBLIC LICENSE - TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION - - 0. This License Agreement applies to any software library or other -program which contains a notice placed by the copyright holder or -other authorized party saying it may be distributed under the terms of -this Lesser General Public License (also called "this License"). -Each licensee is addressed as "you". 
- - A "library" means a collection of software functions and/or data -prepared so as to be conveniently linked with application programs -(which use some of those functions and data) to form executables. - - The "Library", below, refers to any such software library or work -which has been distributed under these terms. A "work based on the -Library" means either the Library or any derivative work under -copyright law: that is to say, a work containing the Library or a -portion of it, either verbatim or with modifications and/or translated -straightforwardly into another language. (Hereinafter, translation is -included without limitation in the term "modification".) - - "Source code" for a work means the preferred form of the work for -making modifications to it. For a library, complete source code means -all the source code for all modules it contains, plus any associated -interface definition files, plus the scripts used to control compilation -and installation of the library. - - Activities other than copying, distribution and modification are not -covered by this License; they are outside its scope. The act of -running a program using the Library is not restricted, and output from -such a program is covered only if its contents constitute a work based -on the Library (independent of the use of the Library in a tool for -writing it). Whether that is true depends on what the Library does -and what the program that uses the Library does. - - 1. You may copy and distribute verbatim copies of the Library's -complete source code as you receive it, in any medium, provided that -you conspicuously and appropriately publish on each copy an -appropriate copyright notice and disclaimer of warranty; keep intact -all the notices that refer to this License and to the absence of any -warranty; and distribute a copy of this License along with the -Library. 
- - You may charge a fee for the physical act of transferring a copy, -and you may at your option offer warranty protection in exchange for a -fee. - - 2. You may modify your copy or copies of the Library or any portion -of it, thus forming a work based on the Library, and copy and -distribute such modifications or work under the terms of Section 1 -above, provided that you also meet all of these conditions: - - a) The modified work must itself be a software library. - - b) You must cause the files modified to carry prominent notices - stating that you changed the files and the date of any change. - - c) You must cause the whole of the work to be licensed at no - charge to all third parties under the terms of this License. - - d) If a facility in the modified Library refers to a function or a - table of data to be supplied by an application program that uses - the facility, other than as an argument passed when the facility - is invoked, then you must make a good faith effort to ensure that, - in the event an application does not supply such function or - table, the facility still operates, and performs whatever part of - its purpose remains meaningful. - - (For example, a function in a library to compute square roots has - a purpose that is entirely well-defined independent of the - application. Therefore, Subsection 2d requires that any - application-supplied function or table used by this function must - be optional: if the application does not supply it, the square - root function must still compute square roots.) - -These requirements apply to the modified work as a whole. If -identifiable sections of that work are not derived from the Library, -and can be reasonably considered independent and separate works in -themselves, then this License, and its terms, do not apply to those -sections when you distribute them as separate works. 
But when you -distribute the same sections as part of a whole which is a work based -on the Library, the distribution of the whole must be on the terms of -this License, whose permissions for other licensees extend to the -entire whole, and thus to each and every part regardless of who wrote -it. - -Thus, it is not the intent of this section to claim rights or contest -your rights to work written entirely by you; rather, the intent is to -exercise the right to control the distribution of derivative or -collective works based on the Library. - -In addition, mere aggregation of another work not based on the Library -with the Library (or with a work based on the Library) on a volume of -a storage or distribution medium does not bring the other work under -the scope of this License. - - 3. You may opt to apply the terms of the ordinary GNU General Public -License instead of this License to a given copy of the Library. To do -this, you must alter all the notices that refer to this License, so -that they refer to the ordinary GNU General Public License, version 2, -instead of to this License. (If a newer version than version 2 of the -ordinary GNU General Public License has appeared, then you can specify -that version instead if you wish.) Do not make any other change in -these notices. - - Once this change is made in a given copy, it is irreversible for -that copy, so the ordinary GNU General Public License applies to all -subsequent copies and derivative works made from that copy. - - This option is useful when you wish to copy part of the code of -the Library into a program that is not a library. - - 4. 
You may copy and distribute the Library (or a portion or -derivative of it, under Section 2) in object code or executable form -under the terms of Sections 1 and 2 above provided that you accompany -it with the complete corresponding machine-readable source code, which -must be distributed under the terms of Sections 1 and 2 above on a -medium customarily used for software interchange. - - If distribution of object code is made by offering access to copy -from a designated place, then offering equivalent access to copy the -source code from the same place satisfies the requirement to -distribute the source code, even though third parties are not -compelled to copy the source along with the object code. - - 5. A program that contains no derivative of any portion of the -Library, but is designed to work with the Library by being compiled or -linked with it, is called a "work that uses the Library". Such a -work, in isolation, is not a derivative work of the Library, and -therefore falls outside the scope of this License. - - However, linking a "work that uses the Library" with the Library -creates an executable that is a derivative of the Library (because it -contains portions of the Library), rather than a "work that uses the -library". The executable is therefore covered by this License. -Section 6 states terms for distribution of such executables. - - When a "work that uses the Library" uses material from a header file -that is part of the Library, the object code for the work may be a -derivative work of the Library even though the source code is not. -Whether this is true is especially significant if the work can be -linked without the Library, or if the work is itself a library. The -threshold for this to be true is not precisely defined by law. 
- - If such an object file uses only numerical parameters, data -structure layouts and accessors, and small macros and small inline -functions (ten lines or less in length), then the use of the object -file is unrestricted, regardless of whether it is legally a derivative -work. (Executables containing this object code plus portions of the -Library will still fall under Section 6.) - - Otherwise, if the work is a derivative of the Library, you may -distribute the object code for the work under the terms of Section 6. -Any executables containing that work also fall under Section 6, -whether or not they are linked directly with the Library itself. - - 6. As an exception to the Sections above, you may also combine or -link a "work that uses the Library" with the Library to produce a -work containing portions of the Library, and distribute that work -under terms of your choice, provided that the terms permit -modification of the work for the customer's own use and reverse -engineering for debugging such modifications. - - You must give prominent notice with each copy of the work that the -Library is used in it and that the Library and its use are covered by -this License. You must supply a copy of this License. If the work -during execution displays copyright notices, you must include the -copyright notice for the Library among them, as well as a reference -directing the user to the copy of this License. Also, you must do one -of these things: - - a) Accompany the work with the complete corresponding - machine-readable source code for the Library including whatever - changes were used in the work (which must be distributed under - Sections 1 and 2 above); and, if the work is an executable linked - with the Library, with the complete machine-readable "work that - uses the Library", as object code and/or source code, so that the - user can modify the Library and then relink to produce a modified - executable containing the modified Library. 
(It is understood - that the user who changes the contents of definitions files in the - Library will not necessarily be able to recompile the application - to use the modified definitions.) - - b) Use a suitable shared library mechanism for linking with the - Library. A suitable mechanism is one that (1) uses at run time a - copy of the library already present on the user's computer system, - rather than copying library functions into the executable, and (2) - will operate properly with a modified version of the library, if - the user installs one, as long as the modified version is - interface-compatible with the version that the work was made with. - - c) Accompany the work with a written offer, valid for at - least three years, to give the same user the materials - specified in Subsection 6a, above, for a charge no more - than the cost of performing this distribution. - - d) If distribution of the work is made by offering access to copy - from a designated place, offer equivalent access to copy the above - specified materials from the same place. - - e) Verify that the user has already received a copy of these - materials or that you have already sent this user a copy. - - For an executable, the required form of the "work that uses the -Library" must include any data and utility programs needed for -reproducing the executable from it. However, as a special exception, -the materials to be distributed need not include anything that is -normally distributed (in either source or binary form) with the major -components (compiler, kernel, and so on) of the operating system on -which the executable runs, unless that component itself accompanies -the executable. - - It may happen that this requirement contradicts the license -restrictions of other proprietary libraries that do not normally -accompany the operating system. Such a contradiction means you cannot -use both them and the Library together in an executable that you -distribute. - - 7. 
You may place library facilities that are a work based on the -Library side-by-side in a single library together with other library -facilities not covered by this License, and distribute such a combined -library, provided that the separate distribution of the work based on -the Library and of the other library facilities is otherwise -permitted, and provided that you do these two things: - - a) Accompany the combined library with a copy of the same work - based on the Library, uncombined with any other library - facilities. This must be distributed under the terms of the - Sections above. - - b) Give prominent notice with the combined library of the fact - that part of it is a work based on the Library, and explaining - where to find the accompanying uncombined form of the same work. - - 8. You may not copy, modify, sublicense, link with, or distribute -the Library except as expressly provided under this License. Any -attempt otherwise to copy, modify, sublicense, link with, or -distribute the Library is void, and will automatically terminate your -rights under this License. However, parties who have received copies, -or rights, from you under this License will not have their licenses -terminated so long as such parties remain in full compliance. - - 9. You are not required to accept this License, since you have not -signed it. However, nothing else grants you permission to modify or -distribute the Library or its derivative works. These actions are -prohibited by law if you do not accept this License. Therefore, by -modifying or distributing the Library (or any work based on the -Library), you indicate your acceptance of this License to do so, and -all its terms and conditions for copying, distributing or modifying -the Library or works based on it. - - 10. 
Each time you redistribute the Library (or any work based on the -Library), the recipient automatically receives a license from the -original licensor to copy, distribute, link with or modify the Library -subject to these terms and conditions. You may not impose any further -restrictions on the recipients' exercise of the rights granted herein. -You are not responsible for enforcing compliance by third parties with -this License. - - 11. If, as a consequence of a court judgment or allegation of patent -infringement or for any other reason (not limited to patent issues), -conditions are imposed on you (whether by court order, agreement or -otherwise) that contradict the conditions of this License, they do not -excuse you from the conditions of this License. If you cannot -distribute so as to satisfy simultaneously your obligations under this -License and any other pertinent obligations, then as a consequence you -may not distribute the Library at all. For example, if a patent -license would not permit royalty-free redistribution of the Library by -all those who receive copies directly or indirectly through you, then -the only way you could satisfy both it and this License would be to -refrain entirely from distribution of the Library. - -If any portion of this section is held invalid or unenforceable under any -particular circumstance, the balance of the section is intended to apply, -and the section as a whole is intended to apply in other circumstances. - -It is not the purpose of this section to induce you to infringe any -patents or other property right claims or to contest validity of any -such claims; this section has the sole purpose of protecting the -integrity of the free software distribution system which is -implemented by public license practices. 
Many people have made -generous contributions to the wide range of software distributed -through that system in reliance on consistent application of that -system; it is up to the author/donor to decide if he or she is willing -to distribute software through any other system and a licensee cannot -impose that choice. - -This section is intended to make thoroughly clear what is believed to -be a consequence of the rest of this License. - - 12. If the distribution and/or use of the Library is restricted in -certain countries either by patents or by copyrighted interfaces, the -original copyright holder who places the Library under this License may add -an explicit geographical distribution limitation excluding those countries, -so that distribution is permitted only in or among countries not thus -excluded. In such case, this License incorporates the limitation as if -written in the body of this License. - - 13. The Free Software Foundation may publish revised and/or new -versions of the Lesser General Public License from time to time. -Such new versions will be similar in spirit to the present version, -but may differ in detail to address new problems or concerns. - -Each version is given a distinguishing version number. If the Library -specifies a version number of this License which applies to it and -"any later version", you have the option of following the terms and -conditions either of that version or of any later version published by -the Free Software Foundation. If the Library does not specify a -license version number, you may choose any version ever published by -the Free Software Foundation. - - 14. If you wish to incorporate parts of the Library into other free -programs whose distribution conditions are incompatible with these, -write to the author to ask for permission. For software which is -copyrighted by the Free Software Foundation, write to the Free -Software Foundation; we sometimes make exceptions for this. 
Our -decision will be guided by the two goals of preserving the free status -of all derivatives of our free software and of promoting the sharing -and reuse of software generally. - - NO WARRANTY - - 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO -WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. -EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR -OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY -KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE -IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR -PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE -LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME -THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. - - 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN -WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY -AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU -FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR -CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE -LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING -RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A -FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF -SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH -DAMAGES. - - END OF TERMS AND CONDITIONS - - How to Apply These Terms to Your New Libraries - - If you develop a new library, and you want it to be of the greatest -possible use to the public, we recommend making it free software that -everyone can redistribute and change. You can do so by permitting -redistribution under these terms (or, alternatively, under the terms of the -ordinary General Public License). - - To apply these terms, attach the following notices to the library. 
It is -safest to attach them to the start of each source file to most effectively -convey the exclusion of warranty; and each file should have at least the -"copyright" line and a pointer to where the full notice is found. - - <one line to give the library's name and a brief idea of what it does.> - Copyright (C) <year> <name of author> - - This... [truncated message content] |
From: <bi...@us...> - 2008-10-13 17:48:57
Revision: 2609
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2609&view=rev
Author:   binzino
Date:     2008-10-13 17:48:52 +0000 (Mon, 13 Oct 2008)

Log Message:
-----------
Changed date to be date of release: October 13.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-10-13 17:48:17 UTC (rev 2608)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-10-13 17:48:52 UTC (rev 2609)
@@ -1,6 +1,6 @@
 RELEASE-NOTES.TXT
-2008-10-01
+2008-10-13
 Aaron Binns
 Release notes for NutchWAX 0.12.2

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <bi...@us...> - 2008-10-13 17:48:31
Revision: 2608
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2608&view=rev
Author:   binzino
Date:     2008-10-13 17:48:17 +0000 (Mon, 13 Oct 2008)

Log Message:
-----------
Updated with more issues resolved.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt

Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-10-11 02:12:37 UTC (rev 2607)
+++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-10-13 17:48:17 UTC (rev 2608)
@@ -28,16 +28,23 @@
 Issues resolved in this release:
+WAX-19
+  Add strict/loose option to DateAdder for revisit lines with extra
+  data on end.
+
+WAX-21
+  Allow for blank lines and comment lines in manifest file.
+
+WAX-22
+  Various code clean-ups based on code review using PMD tool.
+
 WAX-23
   Add a "field setter" filter to set a field to a static value in the
   Lucene document during indexing.
-WAX-22
-  Various code clean-ups based on code review using PMD tool.
+WAX-24
+  DateAdder fails due to uncaught exception in URL canonicalization
-WAX-21
-  Allow for blank lines and comment lines in manifest file.
+WAX-25
+  Add utility/tool to dump unique values of a field in an index.
-WAX-19
-  Add strict/loose option to DateAdder for revisit lines with extra
-  data on end.
From: <bra...@us...> - 2008-10-11 02:12:46
Revision: 2607
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2607&view=rev
Author:   bradtofel
Date:     2008-10-11 02:12:37 +0000 (Sat, 11 Oct 2008)

Log Message:
-----------
ENHANCEMENT(ACC-38): added timeouts to HTTP requests for remote index and remote ARC/WARC documents.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFactory.java

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/TimeoutArchiveReaderFactory.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java 2008-10-11 01:59:57 UTC (rev 2606)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java 2008-10-11 02:12:37 UTC (rev 2607)
@@ -26,6 +26,8 @@
 import java.io.File;
 import java.io.IOException;
+import java.net.URL;
+import java.net.URLConnection;
 import java.util.logging.Logger;
 import javax.xml.parsers.DocumentBuilder;
@@ -71,7 +73,10 @@
             .class.getName());
     private String searchUrlBase;
-
+    private int connectTimeout = 10000;
+    private int readTimeout = 10000;
+
+
     private DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
     private static final String WB_XML_REQUEST_TAGNAME = "request";
@@ -333,7 +338,11 @@
     // do an HTTP request, plus parse the result into an XML DOM
     protected Document getHttpDocument(String url) throws IOException,
             SAXException {
-        return (getDocumentBuilder()).parse(url);
+        URL u = new URL(url);
+        URLConnection conn = u.openConnection();
+        conn.setConnectTimeout(connectTimeout);
+        conn.setReadTimeout(readTimeout);
+        return (getDocumentBuilder()).parse(conn.getInputStream(),url);
     }
     protected Document getFileDocument(File f) throws IOException,
             SAXException {
@@ -365,4 +374,19 @@
     public void setCanonicalizer(UrlCanonicalizer canonicalizer) {
         this.canonicalizer = canonicalizer;
     }
+    public int getConnectTimeout() {
+        return connectTimeout;
+    }
+
+    public void setConnectTimeout(int connectTimeout) {
+        this.connectTimeout = connectTimeout;
+    }
+
+    public int getReadTimeout() {
+        return readTimeout;
+    }
+
+    public void setReadTimeout(int readTimeout) {
+        this.readTimeout = readTimeout;
+    }
 }
\ No newline at end of file

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFactory.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFactory.java 2008-10-11 01:59:57 UTC (rev 2606)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFactory.java 2008-10-11 02:12:37 UTC (rev 2607)
@@ -4,6 +4,7 @@
 import java.io.IOException;
 import java.net.URL;
+import org.archive.io.ArchiveReader;
 import org.archive.io.ArchiveRecord;
 import org.archive.io.arc.ARCReader;
 import org.archive.io.arc.ARCReaderFactory;
@@ -60,25 +61,27 @@
     }
     public static Resource getResource(URL url, long offset)
-    throws IOException, ResourceNotAvailableException {
+        throws IOException, ResourceNotAvailableException {
+
         Resource r = null;
-        String name = url.getFile();
-        if (isArc(name)) {
-
-            ARCReader reader = ARCReaderFactory.get(url, offset);
-            r = ARCArchiveRecordToResource(reader.get(),reader);
-
-        } else if (isWarc(name)) {
-
-            WARCReader reader = WARCReaderFactory.get(url, offset);
-            r = WARCArchiveRecordToResource(reader.get(),reader);
-
+        // TODO: allow configuration of timeouts -- now using defaults..
+        TimeoutArchiveReaderFactory tarf = new TimeoutArchiveReaderFactory();
+        ArchiveReader reader = tarf.getArchiveReader(url,offset);
+        if(reader instanceof ARCReader) {
+            ARCReader areader = (ARCReader) reader;
+            r = ARCArchiveRecordToResource(areader.get(),areader);
+
+        } else if(reader instanceof WARCReader) {
+            WARCReader wreader = (WARCReader) reader;
+            r = WARCArchiveRecordToResource(wreader.get(),wreader);
+
         } else {
-            throw new ResourceNotAvailableException("Unknown extension");
+            throw new ResourceNotAvailableException("Unknown ArchiveReader");
         }
         return r;
     }
-
+
+
     private static boolean isArc(final String name) {
         return (name.endsWith(ArcWarcFilenameFilter.ARC_SUFFIX)

Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/TimeoutArchiveReaderFactory.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/TimeoutArchiveReaderFactory.java (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/TimeoutArchiveReaderFactory.java 2008-10-11 02:12:37 UTC (rev 2607)
@@ -0,0 +1,58 @@
+package org.archive.wayback.resourcestore.resourcefile;
+
+import java.io.IOException;
+import java.net.HttpURLConnection;
+import java.net.URL;
+import java.net.URLConnection;
+
+import org.archive.io.ArchiveReader;
+import org.archive.io.ArchiveReaderFactory;
+
+/**
+ * Sad but needed subclass of the ArchiveReaderFactory, allows config of
+ * timeouts for connect and reads on underlying HTTP connections, and overrides
+ * the one getArchiveReader(URL,long) method to enable setting the timeouts.
+ *
+ * This functionality should be moved into the ArchiveReaderFactory.
+ *
+ * @author brad
+ *
+ */
+public class TimeoutArchiveReaderFactory extends ArchiveReaderFactory {
+
+    private final static int STREAM_ALL = -1;
+    private int connectTimeout = 10000;
+    private int readTimeout = 10000;
+    public TimeoutArchiveReaderFactory(int connectTimeout, int readTimeout) {
+        this.connectTimeout = connectTimeout;
+        this.readTimeout = readTimeout;
+    }
+
+    public TimeoutArchiveReaderFactory(int timeout) {
+        this.connectTimeout = timeout;
+        this.readTimeout = timeout;
+    }
+    public TimeoutArchiveReaderFactory() {
+    }
+    protected ArchiveReader getArchiveReader(final URL f, final long offset)
+    throws IOException {
+
+        // Get URL connection.
+        URLConnection connection = f.openConnection();
+        if (connection instanceof HttpURLConnection) {
+            addUserAgent((HttpURLConnection)connection);
+        }
+        if (offset != STREAM_ALL) {
+            // Use a Range request (Assumes HTTP 1.1 on other end). If
+            // length >= 0, add open-ended range header to the request. Else,
+            // because end-byte is inclusive, subtract 1.
+            connection.addRequestProperty("Range", "bytes=" + offset + "-");
+        }
+
+        connection.setConnectTimeout(connectTimeout);
+        connection.setReadTimeout(readTimeout);
+
+        return getArchiveReader(f.toString(), connection.getInputStream(),
+                (offset == 0));
+    }
+}
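
The timeout and Range-header handling in revision 2607 can be tried in isolation. The following is a minimal sketch, not code from the Wayback tree: the class name TimeoutConnectionSketch, the helper open(...), and the localhost URL are assumptions made for illustration. It uses only java.net calls that also appear in the diff (openConnection, addRequestProperty, setConnectTimeout, setReadTimeout) and never opens a socket, so it runs without a server.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

/** Illustrative only: not part of the Wayback source tree. */
class TimeoutConnectionSketch {

    /** Sentinel meaning "send no Range header", mirroring STREAM_ALL in the diff. */
    static final long STREAM_ALL = -1;

    /**
     * Hypothetical helper: builds a connection with both timeouts set and,
     * unless offset is STREAM_ALL, an open-ended Range header.
     */
    static URLConnection open(String url, long offset,
            int connectTimeoutMs, int readTimeoutMs) {
        try {
            // openConnection() only creates the connection object; nothing is
            // sent until connect() or getInputStream() is called.
            URLConnection conn = URI.create(url).toURL().openConnection();
            if (offset != STREAM_ALL) {
                conn.addRequestProperty("Range", "bytes=" + offset + "-");
            }
            conn.setConnectTimeout(connectTimeoutMs);
            conn.setReadTimeout(readTimeoutMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The URL is a placeholder; no request is actually made here.
        URLConnection conn =
                open("http://localhost:8080/files/sample.arc.gz", 1024L, 5000, 15000);
        System.out.println("connect timeout ms: " + conn.getConnectTimeout());
        System.out.println("read timeout ms:    " + conn.getReadTimeout());
        System.out.println("Range header:       " + conn.getRequestProperty("Range"));
    }
}
```

Once getInputStream() is called on such a connection, an expired connect or read timeout surfaces as a java.net.SocketTimeoutException, which is an IOException and so propagates through the throws clauses shown in the diff.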
From: <bra...@us...> - 2008-10-11 02:00:04
Revision: 2606
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2606&view=rev
Author:   bradtofel
Date:     2008-10-11 01:59:57 +0000 (Sat, 11 Oct 2008)

Log Message:
-----------
NEW FEATURE (ACC-43): Allow adding a generic ObjectFilter<CaptureSearchResult> on a LocalResourceIndex, implemented 2 new ObjectFilter<CaptureSearchResult>, one which allows include/exclude lists of HTTP response codes, and one highly experimental BeanShellFilter, which is too slow to use in any but small installations, but may provide the escape hatch needed for some installations where performance is not the crucial problem.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/BeanShellFilter.java

Property Changed:
----------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/HttpCodeFilter.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java 2008-10-11 01:50:43 UTC (rev 2605)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java 2008-10-11 01:59:57 UTC (rev 2606)
@@ -86,6 +86,8 @@
     private boolean dedupeRecords = false;
     private ObjectFilter<CaptureSearchResult> annotater = null;
+
+    private ObjectFilter<CaptureSearchResult> filter = null;
     public LocalResourceIndex() {
         canonicalizer = new AggressiveUrlCanonicalizer();
@@ -122,7 +124,7 @@
         CaptureSearchResults results = new CaptureSearchResults();
         CaptureQueryFilterState filterState =
-            new CaptureQueryFilterState(wbRequest,canonicalizer, type);
+            new CaptureQueryFilterState(wbRequest,canonicalizer, type, filter);
         String keyUrl = filterState.getKeyUrl();
         CloseableIterator<CaptureSearchResult> itr = getCaptureIterator(keyUrl);
@@ -159,7 +161,7 @@
         CaptureQueryFilterState filterState =
             new CaptureQueryFilterState(wbRequest,canonicalizer,
-                    CaptureQueryFilterState.TYPE_URL);
+                    CaptureQueryFilterState.TYPE_URL, filter);
         String keyUrl = filterState.getKeyUrl();
         CloseableIterator<CaptureSearchResult> citr = getCaptureIterator(keyUrl);
@@ -287,6 +289,14 @@
     public void setAnnotater(ObjectFilter<CaptureSearchResult> annotater) {
         this.annotater = annotater;
     }
+
+    public ObjectFilter<CaptureSearchResult> getFilter() {
+        return filter;
+    }
+
+    public void setFilter(ObjectFilter<CaptureSearchResult> filter) {
+        this.filter = filter;
+    }
     private class CaptureQueryFilterState {
         public final static int TYPE_REPLAY = 0;
@@ -302,7 +312,8 @@
         private String exactDate;
         public CaptureQueryFilterState(WaybackRequest request,
-                UrlCanonicalizer canonicalizer, int type)
+                UrlCanonicalizer canonicalizer, int type,
+                ObjectFilter<CaptureSearchResult> genericFilter)
         throws BadQueryException {
             String searchUrl = request.getRequestUrl();
@@ -333,6 +344,9 @@
             preExclusionCounter = new CounterFilter();
             DateRangeFilter drFilter = new DateRangeFilter(startDate,endDate);
+            if(genericFilter != null) {
+                filter.addFilter(genericFilter);
+            }
             // has the user asked for only results on the exact host specified?
             ObjectFilter<CaptureSearchResult> exactHost =
                 getExactHostFilter(request);

Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/BeanShellFilter.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/BeanShellFilter.java (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/BeanShellFilter.java 2008-10-11 01:59:57 UTC (rev 2606)
@@ -0,0 +1,80 @@
+package org.archive.wayback.resourceindex.filters;
+
+import org.archive.wayback.core.CaptureSearchResult;
+import org.archive.wayback.util.ObjectFilter;
+
+import bsh.EvalError;
+import bsh.Interpreter;
+
+public class BeanShellFilter implements ObjectFilter<CaptureSearchResult> {
+
+    private String expression = null;
+    private String method = null;
+    private String scriptPath = null;
+
+    @SuppressWarnings("unchecked")
+    private final ThreadLocal tl = new ThreadLocal() {
+        protected synchronized Object initialValue() {
+            return new Interpreter();
+        }
+    };
+    private Interpreter getInterpreter() {
+        Interpreter i = (Interpreter) tl.get();
+        if(method != null) {
+
+        }
+        return i;
+    }
+
+    public BeanShellFilter() {
+    }
+
+    public int filterObject(CaptureSearchResult o) {
+        int result = FILTER_EXCLUDE;
+        try {
+            boolean bResult = false;
+            Interpreter interpreter = getInterpreter();
+            interpreter.set("result", o);
+
+            if(expression != null) {
+                bResult = (Boolean) interpreter.eval(expression);
+            } else if(method != null) {
+                bResult = (Boolean) interpreter.eval("matches(result)");
+            } else if(scriptPath != null) {
+                bResult = (Boolean) interpreter.eval("matches(result)");
+            }
+
+            if(bResult) {
+                result = FILTER_INCLUDE;
+            }
+
+        } catch (EvalError e) {
+            e.printStackTrace();
+        }
+        return result;
+    }
+
+    public String getExpression() {
+        return expression;
+    }
+
+    public void setExpression(String expression) {
+        this.expression = expression;
+    }
+
+    public String getMethod() {
+        return method;
+    }
+
+    public void setMethod(String method) {
+        this.method = method;
+    }
+
+    public String getScriptPath() {
+        return scriptPath;
+    }
+
+    public void setScriptPath(String scriptPath) {
+        this.scriptPath = scriptPath;
+    }
+}

Property changes on: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/HttpCodeFilter.java
___________________________________________________________________
Added: svn:keywords
   + Date Rev
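
The log message for revision 2606 mentions a filter with include/exclude lists of HTTP response codes, but the HttpCodeFilter source is not part of this message. The sketch below shows one way such a filter could behave. Everything in it is an assumption for illustration: the nested ObjectFilter interface and Capture class are stand-ins for the real Wayback types, the constant values are guesses, and the rule that the exclude list takes priority is a design choice of this sketch, not something the commit confirms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: stand-in types, not the Wayback classes. */
class HttpCodeFilterSketch {

    /** Hypothetical stand-in for org.archive.wayback.util.ObjectFilter. */
    interface ObjectFilter<T> {
        int FILTER_INCLUDE = 0; // assumed value
        int FILTER_EXCLUDE = 1; // assumed value
        int filterObject(T o);
    }

    /** Hypothetical stand-in for CaptureSearchResult: just a URL and a status code. */
    static final class Capture {
        final String url;
        final String httpCode;

        Capture(String url, String httpCode) {
            this.url = url;
            this.httpCode = httpCode;
        }
    }

    /** Keeps a capture unless its code is excluded or missing from a non-empty include list. */
    static final class CodeFilter implements ObjectFilter<Capture> {
        private final Set<String> includes;
        private final Set<String> excludes;

        CodeFilter(Set<String> includes, Set<String> excludes) {
            this.includes = includes;
            this.excludes = excludes;
        }

        @Override
        public int filterObject(Capture c) {
            if (excludes.contains(c.httpCode)) {
                return FILTER_EXCLUDE;
            }
            if (!includes.isEmpty() && !includes.contains(c.httpCode)) {
                return FILTER_EXCLUDE;
            }
            return FILTER_INCLUDE;
        }
    }

    static CodeFilter includeOnly(String... codes) {
        return new CodeFilter(new HashSet<>(Arrays.asList(codes)), new HashSet<>());
    }

    static CodeFilter exclude(String... codes) {
        return new CodeFilter(new HashSet<>(), new HashSet<>(Arrays.asList(codes)));
    }

    public static void main(String[] args) {
        CodeFilter filter = exclude("404", "502");
        Capture[] captures = {
            new Capture("http://example.org/", "200"),
            new Capture("http://example.org/missing", "404"),
        };
        for (Capture c : captures) {
            boolean kept = filter.filterObject(c) == ObjectFilter.FILTER_INCLUDE;
            System.out.println(c.url + " " + c.httpCode + " kept=" + kept);
        }
    }
}
```

In the commit itself, a filter of this kind would be handed to LocalResourceIndex through the new setFilter(...) setter and added to the per-query filter chain only when it is non-null.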