From: <bra...@us...> - 2009-07-18 00:37:43
Revision: 2780
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2780&view=rev
Author:   bradtofel
Date:     2009-07-18 00:37:41 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
FEATURE(ACC-77): added example of prefix usage, and a comment describing the example.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/ArchivalUrlReplay.xml

Modified: branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/ArchivalUrlReplay.xml
===================================================================
--- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/ArchivalUrlReplay.xml 2009-07-18 00:36:37 UTC (rev 2779)
+++ branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/ArchivalUrlReplay.xml 2009-07-18 00:37:41 UTC (rev 2780)
@@ -6,6 +6,15 @@
 <bean id="archivalurlhttpheaderprocessor" class="org.archive.wayback.replay.RedirectRewritingHttpHeaderProcessor" />
 
+<!--
+  The following optional replacement HttpHeaderProcessor configuration
+  enables prefixing all original HTTP headers with a giving String:
+-->
+<!--
+  <bean id="archivalurlhttpheaderprocessor" class="org.archive.wayback.replay.RedirectRewritingHttpHeaderProcessor">
+    <property name="prefix" value="X-Archive-Orig-" />
+  </bean>
+-->
 <bean id="archivaldateredirectingreplayrenderer" class="org.archive.wayback.replay.DateRedirectReplayRenderer" />
 <bean id="archivalcssreplayrenderer" class="org.archive.wayback.archivalurl.ArchivalUrlCSSReplayRenderer">
   <constructor-arg><ref bean="archivalurlhttpheaderprocessor"/></constructor-arg>

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
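The commented-out bean in r2780 sets a "prefix" property so that original HTTP header names are emitted with that prefix. A minimal sketch of the idea follows. The class and method names are hypothetical and this is not the RedirectRewritingHttpHeaderProcessor source; the real processor may treat some headers (redirects, for example) differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Wayback.
class HeaderPrefixSketch {

    // Returns a copy of the headers with the prefix prepended to every name.
    // A null or empty prefix leaves the names unchanged.
    static Map<String, String> prefixHeaders(Map<String, String> original, String prefix) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : original.entrySet()) {
            String name = (prefix == null || prefix.isEmpty())
                    ? e.getKey()
                    : prefix + e.getKey();
            out.put(name, e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> orig = new LinkedHashMap<>();
        orig.put("Content-Type", "text/html");
        orig.put("Server", "Apache");
        // "X-Archive-Orig-" is the value used in the example bean above.
        System.out.println(prefixHeaders(orig, "X-Archive-Orig-"));
    }
}
```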
From: <bra...@us...> - 2009-07-18 00:36:38
Revision: 2779
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2779&view=rev
Author:   bradtofel
Date:     2009-07-18 00:36:37 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
FEATURE(ACC-56): new OpenSearch RSS query .jsp implementations.

Added Paths:
-----------
    branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/query/OpenSearchCaptureResults.jsp
    branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/query/OpenSearchUrlResults.jsp
    branches/wayback-1_4_2/wayback-webapp/src/main/webapp/opensearchdescription.xml

Added: branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/query/OpenSearchCaptureResults.jsp
===================================================================
--- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/query/OpenSearchCaptureResults.jsp (rev 0)
+++ branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/query/OpenSearchCaptureResults.jsp 2009-07-18 00:36:37 UTC (rev 2779)
@@ -0,0 +1,81 @@
+<?xml version="1.0" encoding="UTF-8"?><%@
+ page language="java" pageEncoding="utf-8" contentType="text/xml;charset=utf-8"
+%><%@
+ page import="java.util.Iterator"
+%><%@
+ page import="java.util.ArrayList"
+%><%@
+ page import="java.util.Map"
+%><%@
+ page import="java.util.Enumeration"
+%><%@
+ page import="org.archive.wayback.core.CaptureSearchResult"
+%><%@
+ page import="org.archive.wayback.core.CaptureSearchResults"
+%><%@
+ page import="org.archive.wayback.core.SearchResults"
+%><%@
+ page import="org.archive.wayback.core.UIResults"
+%><%@
+ page import="org.archive.wayback.core.WaybackRequest"
+%><%@
+ page import="org.archive.wayback.requestparser.OpenSearchRequestParser"
+%><%@
+ page import="org.archive.wayback.util.StringFormatter"
+%><%
+UIResults uiResults = UIResults.extractCaptureQuery(request);
+
+WaybackRequest wbRequest = uiResults.getWbRequest();
+StringFormatter fmt = wbRequest.getFormatter();
+CaptureSearchResults results = uiResults.getCaptureResults();
+Iterator<CaptureSearchResult> itr = results.iterator();
+String contextRoot = wbRequest.getContextPrefix();
+String searchString = wbRequest.getRequestUrl();
+long firstResult = results.getFirstReturned();
+long shownResultCount = results.getReturnedCount();
+long lastResult = results.getReturnedCount() + firstResult;
+long resultCount = results.getMatchingCount();
+String searchTerms = "";
+Map<String,String[]> queryMap = request.getParameterMap();
+String arr[] = queryMap.get(OpenSearchRequestParser.SEARCH_QUERY);
+if(arr != null && arr.length > 1) {
+    searchTerms = arr[0];
+}
+%>
+<rss version="2.0"
+    xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
+    xmlns:atom="http://www.w3.org/2005/Atom">
+  <channel>
+    <title>Wayback OpenSearch Results</title>
+    <link><%= contextRoot %>></link>
+    <description><%= fmt.format("PathQueryClassic.searchedFor",searchString) %></description>
+    <opensearch:totalResults><%= resultCount %></opensearch:totalResults>
+    <opensearch:startIndex><%= firstResult %></opensearch:startIndex>
+    <opensearch:itemsPerPage><%= shownResultCount %></opensearch:itemsPerPage>
+    <atom:link rel="search" type="application/opensearchdescription+xml" href="<%= contextRoot %>/opensearchdescription.xml"/>
+    <opensearch:Query role="request" searchTerms="<%= UIResults.encodeXMLContent(searchTerms) %>" startPage="<%= wbRequest.getPageNum() %>" />
+<%
+  while(itr.hasNext()) {
+    %>
+    <item>
+    <%
+    CaptureSearchResult result = itr.next();
+
+    String replayUrl = UIResults.encodeXMLEntity(
+        uiResults.resultToReplayUrl(result));
+
+    String prettyDate = UIResults.encodeXMLEntity(
+        fmt.format("MetaReplay.captureDateDisplay",result.getCaptureDate()));
+
+    String requestUrl = UIResults.encodeXMLEntity(
+        wbRequest.getRequestUrl());
+    %>
+      <title><%= prettyDate %></title>
+      <link><%= replayUrl %></link>
+      <description><%= requestUrl %></description>
+    </item>
+    <%
+  }
+%>
+  </channel>
+ </rss>

Added: branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/query/OpenSearchUrlResults.jsp
===================================================================
--- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/query/OpenSearchUrlResults.jsp (rev 0)
+++ branches/wayback-1_4_2/wayback-webapp/src/main/webapp/WEB-INF/query/OpenSearchUrlResults.jsp 2009-07-18 00:36:37 UTC (rev 2779)
@@ -0,0 +1,97 @@
+<?xml version="1.0" encoding="UTF-8"?><%@
+ page language="java" pageEncoding="utf-8" contentType="text/xml;charset=utf-8"
+%><%@
+ page import="java.util.Iterator"
+%><%@
+ page import="java.util.ArrayList"
+%><%@
+ page import="java.util.Date"
+%><%@
+ page import="java.util.Map"
+%><%@
+ page import="java.util.Enumeration"
+%><%@
+ page import="org.archive.wayback.core.UrlSearchResult"
+%><%@
+ page import="org.archive.wayback.core.UrlSearchResults"
+%><%@
+ page import="org.archive.wayback.core.SearchResults"
+%><%@
+ page import="org.archive.wayback.core.UIResults"
+%><%@
+ page import="org.archive.wayback.core.WaybackRequest"
+%><%@
+ page import="org.archive.wayback.requestparser.OpenSearchRequestParser"
+%><%@
+ page import="org.archive.wayback.util.StringFormatter"
+%><%
+UIResults uiResults = UIResults.extractUrlQuery(request);
+
+WaybackRequest wbRequest = uiResults.getWbRequest();
+StringFormatter fmt = wbRequest.getFormatter();
+UrlSearchResults results = uiResults.getUrlResults();
+Iterator<UrlSearchResult> itr = results.iterator();
+String contextRoot = wbRequest.getContextPrefix();
+String searchString = wbRequest.getRequestUrl();
+long firstResult = results.getFirstReturned();
+long shownResultCount = results.getReturnedCount();
+long lastResult = results.getReturnedCount() + firstResult;
+long resultCount = results.getMatchingCount();
+String searchTerms = "";
+Map<String,String[]> queryMap = request.getParameterMap();
+String arr[] = queryMap.get(OpenSearchRequestParser.SEARCH_QUERY);
+if(arr != null && arr.length > 1) {
+    searchTerms = arr[0];
+}
+%>
+<rss version="2.0"
+    xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/"
+    xmlns:atom="http://www.w3.org/2005/Atom">
+  <channel>
+    <title>Wayback OpenSearch Results</title>
+    <link><%= contextRoot %>></link>
+    <description><%= fmt.format("PathQueryClassic.searchedFor",searchString) %></description>
+    <opensearch:totalResults><%= resultCount %></opensearch:totalResults>
+    <opensearch:startIndex><%= firstResult %></opensearch:startIndex>
+    <opensearch:itemsPerPage><%= shownResultCount %></opensearch:itemsPerPage>
+    <atom:link rel="search" type="application/opensearchdescription+xml" href="<%= contextRoot %>/opensearchdescription.xml"/>
+    <opensearch:Query role="request" searchTerms="<%= UIResults.encodeXMLContent(searchTerms) %>" startPage="<%= wbRequest.getPageNum() %>" />
+<%
+  while(itr.hasNext()) {
+    %>
+    <item>
+    <%
+    UrlSearchResult result = itr.next();
+
+    String originalUrl = result.getOriginalUrl();
+    String title = UIResults.encodeXMLEntity(originalUrl);
+
+    String queryUrl = UIResults.encodeXMLEntity(
+        uiResults.makeCaptureQueryUrl(originalUrl));
+
+    String requestUrl = UIResults.encodeXMLEntity(
+        wbRequest.getRequestUrl());
+    long numCaptures = result.getNumCaptures();
+    long numVersions = result.getNumVersions();
+
+    Date firstDate = result.getFirstCaptureDate();
+    Date lastDate = result.getLastCaptureDate();
+    %>
+      <title><%= title %></title>
+      <link><%= queryUrl %></link>
+      <description>
+        <%= requestUrl %>
+        <span class="mainSearchText">
+          <%= fmt.format("PathPrefixQuery.versionCount",numVersions) %>
+        </span>
+        <span class="mainSearchText">
+          <%= fmt.format("PathPrefixQuery.multiCaptureDate",numCaptures,firstDate,lastDate) %>
+        </span>
+
+      </description>
+    </item>
+    <%
+  }
+%>
+  </channel>
+ </rss>

Added: branches/wayback-1_4_2/wayback-webapp/src/main/webapp/opensearchdescription.xml
===================================================================
--- branches/wayback-1_4_2/wayback-webapp/src/main/webapp/opensearchdescription.xml (rev 0)
+++ branches/wayback-1_4_2/wayback-webapp/src/main/webapp/opensearchdescription.xml 2009-07-18 00:36:37 UTC (rev 2779)
@@ -0,0 +1,9 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
+  <ShortName>Wayback</ShortName>
+  <Description>Wayback Search Result RSS feed.</Description>
+  <Tags>wayback rss</Tags>
+  <Contact>arc...@ar...</Contact>
+  <Url type="application/rss+xml"
+    template="http://wayback.archive-it.org/query?q={searchTerms}&start_page={startPage?}&count={count?}"/>
+</OpenSearchDescription>
\ No newline at end of file
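Both JSPs in r2779 read the search terms out of the servlet parameter map before echoing them in the opensearch:Query element. A minimal sketch of that lookup follows, using a plain Map in place of the servlet request. The class name, the helper name, and the "q" key are assumptions made for illustration ("q" is taken from the URL template in opensearchdescription.xml and is not confirmed to be the value of OpenSearchRequestParser.SEARCH_QUERY). As committed, the check is arr.length > 1, so a parameter supplied once leaves searchTerms empty; the sketch assumes accepting a single value was the intent.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not part of Wayback.
class SearchTermsSketch {

    // Assumed parameter name, taken from the "q={searchTerms}" URL template.
    static final String SEARCH_QUERY = "q";

    // Returns the first value of the named parameter, or "" if it is absent.
    static String firstValue(Map<String, String[]> params, String key) {
        String[] arr = params.get(key);
        if (arr != null && arr.length > 0) {
            return arr[0];
        }
        return "";
    }

    public static void main(String[] args) {
        Map<String, String[]> params = new HashMap<>();
        params.put(SEARCH_QUERY, new String[] { "example.com" });
        System.out.println(firstValue(params, SEARCH_QUERY));
    }
}
```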
From: <bra...@us...> - 2009-07-18 00:35:01
Revision: 2778
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2778&view=rev
Author:   bradtofel
Date:     2009-07-18 00:34:58 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
DOC: added mention of default port stripping

Modified Paths:
--------------
    branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml

Modified: branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml
===================================================================
--- branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml 2009-07-18 00:29:00 UTC (rev 2777)
+++ branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml 2009-07-18 00:34:58 UTC (rev 2778)
@@ -275,10 +275,14 @@
   </li>
   <li>
     <b>user info removal</b>
-    http://us...@ex... => example.com,
-    http://user:pas...@ex... => example.com,
+    http://us...@ex... => example.com,
+    http://user:pas...@ex... => example.com,
   </li>
   <li>
+    <b>default port removal</b>
+    http://example.com:80 => example.com,
+  </li>
+  <li>
     <b>session ID removal</b>
     http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx =>
@@ -313,12 +317,12 @@
   <p>
     At the IA, we have recently switched to building CDX files using the
     <b>-identity</b> option on the <b>arc-indexer</b> and
-    <b>warc-indexer</b> tools, and have added an additional step in our
-    CDX creation processes which uses the <b>url-client</b> tool before
-    sorting and merging CDX files. By keeping the original "identity" CDX
-    files, we have been able to test various URL canonicalization
-    strategies without the overhead of re-processing all the source
-    materials.
+    <b>warc-indexer</b> tools. The <b>-identity</b> option
+    <b>requires</b> passing records through the <b>url-client</b>
+    tool before sorting and merging into production CDX files. By keeping
+    the original "identity" CDX files, we have been able to test various
+    URL canonicalization strategies without the overhead of
+    re-processing all the ARC/WARC source materials.
   </p>
 </subsection>
 <subsection name="Future Directions within Wayback">
From: <bra...@us...> - 2009-07-18 00:29:05
Revision: 2777
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2777&view=rev
Author:   bradtofel
Date:     2009-07-18 00:29:00 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
DOC: added mention of 1.4.2 release, with link to release notes.

Modified Paths:
--------------
    branches/wayback-1_4_2/dist/src/site/xdoc/index.xml

Modified: branches/wayback-1_4_2/dist/src/site/xdoc/index.xml
===================================================================
--- branches/wayback-1_4_2/dist/src/site/xdoc/index.xml 2009-07-18 00:26:34 UTC (rev 2776)
+++ branches/wayback-1_4_2/dist/src/site/xdoc/index.xml 2009-07-18 00:29:00 UTC (rev 2777)
@@ -74,6 +74,23 @@
   </p>
 </section>
 <section name="News">
+  <subsection name="Maintenance Release - 1.4.2, 7/17/2009">
+    <p>
+      Release 1.4.2 fixes several problems discovered in the 1.4.1
+      release. Please see the <a href="release_notes.html">release notes</a> for
+      a detailed list of changes.
+    </p>
+  </subsection>
+  <subsection name="Maintenance Release - 1.4.1, 11/10/2008">
+    <p>
+      Release 1.4.1 fixes several problems discovered in the 1.4.0
+      release, and most notably disables by default the AnchorDate and
+      AnchorWindow features which generated some confusion. Please
+      see the <a href="release_notes.html">release notes</a> for
+      a detailed list of changes.
+    </p>
+  </subsection>
+
   <subsection name="New Release - 1.4.0, 8/20/2008">
     <p>
       Release 1.4.0 has several new features, as well as several
From: <bra...@us...> - 2009-07-18 00:26:37
Revision: 2776
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2776&view=rev
Author:   bradtofel
Date:     2009-07-18 00:26:34 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
DOC: clarified usage and semantics of -identity option on arc-indexer and warc-indexer, and how url-client needs to fit within an indexing process.

Modified Paths:
--------------
    branches/wayback-1_4_2/dist/src/site/xdoc/administrator_manual.xml

Modified: branches/wayback-1_4_2/dist/src/site/xdoc/administrator_manual.xml
===================================================================
--- branches/wayback-1_4_2/dist/src/site/xdoc/administrator_manual.xml 2009-07-18 00:24:51 UTC (rev 2775)
+++ branches/wayback-1_4_2/dist/src/site/xdoc/administrator_manual.xml 2009-07-18 00:26:34 UTC (rev 2776)
@@ -1110,8 +1110,11 @@
   </p>
   <p>
     The <b>-identity</b> option causes the tools to skip canonicalization
-    of URLs. See the documentation for the <b>url-client</b> tool, and
-    the <a href="resource_index.html#URL_Canonicalization">
+    of URLs. When using this option, you will need to pass the CDX
+    records through the url-client tool before sorting them into a
+    production CDX index. See the documentation for the
+    <b>url-client</b> tool, and the
+    <a href="resource_index.html#URL_Canonicalization">
     URL Canonicalization
     </a> section for more information.
   </p>
@@ -1182,15 +1185,19 @@
     canonicalization function is applied to requested URLs.
     This tool will read space(" ") delimited lines from
     STDIN, and output the same lines on STDOUT, but with one column
-    altered. The column that is changed is assumed to be a URL,
+    altered. The column that is changed is assumed to be an URL,
     and the version output is the canonicalized form of the input
     URL.
   </p>
   <p>
-    This tool is mostly useful for debugging the
-    canonicalization function, but can also be used, if the
-    canonicalization function is altered, to update an existing
-    CDX index, without recreating CDX files from original ARCs. See the
+    This tool is required when using the <b>arc-indexer</b> or
+    <b>warc-indexer</b> tools with the <b>-identity</b> option. Typical
+    usage involves generating an <i>identity</i> CDX index, then
+    passing the lines in that index through this tool to canonicalize the
+    record URL key for queries. If the <i>identity</i> CDX files are
+    kept, then canonicalization schemes can be swapped without
+    reindexing the original ARC/WARC content. This tool can also be
+    useful for debugging the canonicalization function. See the
     section
     <a href="resource_index.html#URL_Canonicalization">
     URL Canonicalization
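The url-client behaviour described in r2776 (read space-delimited lines, rewrite one column, emit the rest untouched) can be sketched as below. This is not the url-client source: the class name, the toy canonicalizer, and the sample line are all made up for illustration, and real CDX lines and the real canonicalization rules are more involved.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical illustration only; not the Wayback url-client tool.
class ColumnRewriteSketch {

    // Applies canon to the given zero-based column of a space-delimited line.
    // Lines with too few columns are returned unchanged.
    static String rewriteColumn(String line, int column, UnaryOperator<String> canon) {
        String[] parts = line.split(" ", -1);
        if (column >= 0 && column < parts.length) {
            parts[column] = canon.apply(parts[column]);
        }
        return String.join(" ", parts);
    }

    // Toy stand-in for a canonicalizer: lowercases and drops a leading "http://".
    static String toyCanonicalize(String url) {
        String s = url.toLowerCase(Locale.ROOT);
        return s.startsWith("http://") ? s.substring("http://".length()) : s;
    }

    public static void main(String[] args) {
        // Made-up "identity" style line: the first column still holds the raw URL.
        String line = "http://Example.com/ 20090718002634 http://Example.com/";
        System.out.println(rewriteColumn(line, 0, ColumnRewriteSketch::toyCanonicalize));
    }
}
```

Keeping the untouched identity lines on disk and re-running a step like this is what lets the canonicalization scheme change without re-indexing the ARC/WARC files.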
From: <bra...@us...> - 2009-07-18 00:24:58
Revision: 2775
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2775&view=rev
Author:   bradtofel
Date:     2009-07-18 00:24:51 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
TWEAK: added some tests for resolving URLs and extracting schemes from URLs

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/url/UrlOperationsTest.java

Modified: branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/url/UrlOperationsTest.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/url/UrlOperationsTest.java 2009-07-18 00:24:00 UTC (rev 2774)
+++ branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/url/UrlOperationsTest.java 2009-07-18 00:24:51 UTC (rev 2775)
@@ -62,7 +62,33 @@
         assertEquals("foo.com",UrlOperations.urlToHost("http://foo.com/path:/"));
         assertEquals("foo.com",UrlOperations.urlToHost("https://foo.com/path:/"));
         assertEquals("foo.com",UrlOperations.urlToHost("ftp://foo.com/path:/"));
+    }
+
+    public void testResolveUrl() {
+        for(String scheme : UrlOperations.ALL_SCHEMES) {
+
+            assertEquals(scheme + "a.org/1/2",
+                    UrlOperations.resolveUrl(scheme + "a.org/3/","/1/2"));
+
+            assertEquals(scheme + "b.org/1/2",
+                    UrlOperations.resolveUrl(scheme + "a.org/3/",
+                            scheme + "b.org/1/2"));
+
+            assertEquals(scheme + "a.org/3/1/2",
+                    UrlOperations.resolveUrl(scheme + "a.org/3/","1/2"));
+
+            assertEquals(scheme + "a.org/1/2",
+                    UrlOperations.resolveUrl(scheme + "a.org/3","1/2"));
+        }
     }
+    public void testUrlToScheme() {
+        assertEquals("http://",UrlOperations.urlToScheme("http://a.com/"));
+        assertEquals("https://",UrlOperations.urlToScheme("https://a.com/"));
+        assertEquals("ftp://",UrlOperations.urlToScheme("ftp://a.com/"));
+        assertEquals("rtsp://",UrlOperations.urlToScheme("rtsp://a.com/"));
+        assertEquals("mms://",UrlOperations.urlToScheme("mms://a.com/"));
+        assertNull(UrlOperations.urlToScheme("blah://a.com/"));
+    }
 }
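The four resolution cases exercised by testResolveUrl in r2775 follow ordinary relative-reference resolution. The sketch below reproduces them with the JDK's java.net.URI; it is not UrlOperations.resolveUrl, whose implementation is not shown in this commit, and the class and helper names are hypothetical.

```java
import java.net.URI;

// Hypothetical illustration only; uses the JDK resolver, not Wayback's UrlOperations.
class ResolveSketch {

    // Resolves rel against base using java.net.URI's standard algorithm.
    static String resolve(String base, String rel) {
        return URI.create(base).resolve(rel).toString();
    }

    public static void main(String[] args) {
        // Server-relative path replaces the whole base path.
        System.out.println(resolve("http://a.org/3/", "/1/2"));
        // An absolute reference wins outright.
        System.out.println(resolve("http://a.org/3/", "http://b.org/1/2"));
        // Base ending in "/" keeps its last segment.
        System.out.println(resolve("http://a.org/3/", "1/2"));
        // Base without trailing "/" drops its last segment.
        System.out.println(resolve("http://a.org/3", "1/2"));
    }
}
```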
From: <bra...@us...> - 2009-07-18 00:24:03
Revision: 2774
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2774&view=rev
Author:   bradtofel
Date:     2009-07-18 00:24:00 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
TWEAK: added tests for newly supported schemes.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/url/AggressiveUrlCanonicalizerTest.java

Modified: branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/url/AggressiveUrlCanonicalizerTest.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/url/AggressiveUrlCanonicalizerTest.java 2009-07-18 00:22:52 UTC (rev 2773)
+++ branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/url/AggressiveUrlCanonicalizerTest.java 2009-07-18 00:24:00 UTC (rev 2774)
@@ -45,16 +45,15 @@
         // simple strip of http://
         checkCanonicalization("http://foo.com/","foo.com/");
 
-//        would be nice to handle other protocols...
-//        // simple strip of https://
-//        checkCanonicalization("https://foo.com/","foo.com/");
-//
-//        // simple strip of ftp://
-//        checkCanonicalization("ftp://foo.com/","foo.com/");
-//
-//        // simple strip of rtsp://
-//        checkCanonicalization("rtsp://foo.com/","foo.com/");
+        // simple strip of https://
+        checkCanonicalization("https://foo.com/","foo.com/");
 
+        // simple strip of ftp://
+        checkCanonicalization("ftp://foo.com/","foo.com/");
+
+        // simple strip of rtsp://
+        checkCanonicalization("rtsp://foo.com/","foo.com/");
+
         // strip leading 'www.'
         checkCanonicalization("http://www.foo.com/","foo.com/");
 
@@ -63,6 +62,9 @@
 
         // strip leading 'www##.'
         checkCanonicalization("http://www12.foo.com/","foo.com/");
+
+        // strip leading 'www##.' with https
+        checkCanonicalization("https://www12.foo.com/","foo.com/");
 
         // strip leading 'www##.' with no protocol
         checkCanonicalization("www12.foo.com/","foo.com/");
@@ -174,13 +176,53 @@
         checkCanonicalization(
                 "http://legislature.mi.gov/(a(4hqa0555fwsecu455xqckv45)S(4hqa0555fwsecu455xqckv45)f(4hqa0555fwsecu455xqckv45))/mileg.aspx?page=sessionschedules",
                 "legislature.mi.gov/mileg.aspx?page=sessionschedules");
 
+
+
+
+
+        // default port stripping:
+        // FIRST the easy-on-the-eyes
+
         // strip port 80
         checkCanonicalization("http://www.chub.org:80/foo","chub.org/foo");
 
         // but not other ports...
         checkCanonicalization("http://www.chub.org:8080/foo","chub.org:8080/foo");
+
+        // but not other ports... with "www#." massage
+        checkCanonicalization("http://www232.chub.org:8080/foo","chub.org:8080/foo");
 
+        // default HTTP (:80) stripping without a scheme:
+        checkCanonicalization("www.chub.org:80/foo","chub.org/foo");
+
+        // no strip https port (443) without scheme:
+        checkCanonicalization("www.chub.org:443/foo","chub.org:443/foo");
+
+        // yes strip https port (443) with scheme:
+        checkCanonicalization("https://www.chub.org:443/foo","chub.org/foo");
+
+        // NEXT the exhaustive:
+        String origHost = "www.chub.org";
+        String massagedHost = "chub.org";
+        String path = "/foo";
+        for(String scheme : UrlOperations.ALL_SCHEMES) {
+
+            int defaultPort = UrlOperations.schemeToDefaultPort(scheme);
+            int nonDefaultPort = 19991;
+
+            String origDefault = scheme + origHost + ":" + defaultPort + path;
+            String canonDefault = massagedHost + path;
+
+            String origNonDefault =
+                scheme + origHost + ":" + nonDefaultPort + path;
+            String canonNonDefault =
+                massagedHost + ":" + nonDefaultPort + path;
+
+            checkCanonicalization(origDefault,canonDefault);
+            checkCanonicalization(origNonDefault,canonNonDefault);
+        }
+
     }
 
     private void checkCanonicalization(String orig, String want) {
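The expectations in r2774 (strip the scheme, strip a leading "www" or "www##" label, and drop the port only when it is the default for the URL's scheme, with http implied when no scheme is given) can be sketched as follows. This is not AggressiveUrlCanonicalizer: the class name is hypothetical, the scheme-to-port table is an assumption limited to four well-known ports, and the real canonicalizer applies further rules such as session ID removal.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Wayback's AggressiveUrlCanonicalizer.
class CanonSketch {

    // Assumed scheme -> default port table (well-known port numbers).
    static final Map<String, Integer> DEFAULT_PORTS = new LinkedHashMap<>();
    static {
        DEFAULT_PORTS.put("http://", 80);
        DEFAULT_PORTS.put("https://", 443);
        DEFAULT_PORTS.put("ftp://", 21);
        DEFAULT_PORTS.put("rtsp://", 554);
    }

    private static final Pattern WWW = Pattern.compile("^www\\d*\\.");
    private static final Pattern HOST_PORT = Pattern.compile("^([^/:]+):(\\d{1,5})(.*)$");

    static String canonicalize(String url) {
        String scheme = "http://"; // implied when the URL carries no known scheme
        String rest = url;
        for (String s : DEFAULT_PORTS.keySet()) {
            if (url.startsWith(s)) {
                scheme = s;
                rest = url.substring(s.length());
                break;
            }
        }
        // Drop a leading "www." or "www12." style label.
        rest = WWW.matcher(rest).replaceFirst("");
        // Drop the port only if it is the default for the (possibly implied) scheme.
        Matcher m = HOST_PORT.matcher(rest);
        if (m.matches() && Integer.parseInt(m.group(2)) == DEFAULT_PORTS.get(scheme)) {
            rest = m.group(1) + m.group(3);
        }
        return rest;
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("https://www.chub.org:443/foo"));
        System.out.println(canonicalize("www.chub.org:443/foo"));
    }
}
```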
From: <bra...@us...> - 2009-07-18 00:23:06
Revision: 2773
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2773&view=rev
Author:   bradtofel
Date:     2009-07-18 00:22:52 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
TWEAK: added currently failing test, which is commented out to prevent build breakage, but demonstrates a problem with current Timestamp implementation.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/TimestampTest.java

Modified: branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/TimestampTest.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/TimestampTest.java 2009-07-18 00:21:01 UTC (rev 2772)
+++ branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/util/TimestampTest.java 2009-07-18 00:22:52 UTC (rev 2773)
@@ -25,6 +25,7 @@
 package org.archive.wayback.util;
 
 import java.util.Calendar;
+import java.util.TimeZone;
 
 import junit.framework.TestCase;
 
@@ -42,7 +43,7 @@
      */
     public void testPadDateStr() {
 
-        String curYear = String.valueOf(Calendar.getInstance().get(Calendar.YEAR));
+        String curYear = String.valueOf(Calendar.getInstance(TimeZone.getTimeZone("gmt")).get(Calendar.YEAR));
 
         assertEquals("padStart '1'","19960101000000",Timestamp.padStartDateStr("1"));
         assertEquals("padEnd '1'","19991231235959",Timestamp.padEndDateStr("1"));
@@ -90,4 +91,19 @@
         String dateSpec = "20060518210548";
         assertEquals("bad fromSSe",dateSpec,Timestamp.fromSse(sse).getDateStr());
     }
+//    public void testConvertDate() {
+//        String timestamp = "20070101121212";
+//        long value = Timestamp.parseAfter(timestamp).getDate().getTime();
+//        for(int i = 0; i < 1000; i++) {
+//            long value2 = Timestamp.parseAfter(timestamp).getDate().getTime();
+//            assertEquals(value, value2);
+//            try {
+//                Thread.sleep(1);
+//            } catch (InterruptedException e) {
+//                // TODO Auto-generated catch block
+//                e.printStackTrace();
+//            }
+//        }
+//    }
+
 }
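The testPadDateStr change in r2773 switches the "current year" lookup from the JVM's default time zone to GMT. The sketch below shows why the zone matters near a year boundary; the class and method names are hypothetical, and the instant used is one second before 2009-01-01 00:00:00 UTC.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical illustration only; not part of Wayback.
class GmtYearSketch {

    // Returns the calendar year of the given instant as seen in the given zone.
    static int yearAt(long epochMillis, TimeZone tz) {
        Calendar c = new GregorianCalendar(tz);
        c.setTimeInMillis(epochMillis);
        return c.get(Calendar.YEAR);
    }

    public static void main(String[] args) {
        long justBeforeNewYearUtc = 1230767999000L; // 2008-12-31T23:59:59Z
        // Same instant, different year depending on the zone used.
        System.out.println(yearAt(justBeforeNewYearUtc, TimeZone.getTimeZone("GMT")));
        System.out.println(yearAt(justBeforeNewYearUtc, TimeZone.getTimeZone("GMT+01:00")));
    }
}
```

A test that derives the expected year from the default zone can therefore disagree with code that pads timestamps in GMT for a few hours around each New Year.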
From: <bra...@us...> - 2009-07-18 00:21:09
Revision: 2772
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2772&view=rev
Author:   bradtofel
Date:     2009-07-18 00:21:01 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
TWEAK

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/replay/TagMagixTest.java

Modified: branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/replay/TagMagixTest.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/replay/TagMagixTest.java 2009-07-18 00:14:42 UTC (rev 2771)
+++ branches/wayback-1_4_2/wayback-core/src/test/java/org/archive/wayback/replay/TagMagixTest.java 2009-07-18 00:21:01 UTC (rev 2772)
@@ -423,6 +423,13 @@
 //                "<table style=\"bg: url(\\\"http://w.a.org/wb/2004/http://f.au/css/b.gif\\\"); fg: url(\\\"http://w.a.org/wb/2004/http://f.au/css/f.gif\\\");\"></table>",
 //                "http://w.a.org/wb/","2004","http://f.au/");
 
+
+        checkStyleUrlMarkup("<td style=\"b-i:url(i/b.jpg);\n\"></td>",
+                "<td style=\"b-i:url(http://w.a.org/wb/2004/http://f.au/i/b.jpg);\n\"></td>",
+                "http://w.a.org/wb/","2004","http://f.au/");
+
+//        "<td style=\"background-image:url(images/banner.jpg);\n\"></td>"
+
     }
 
 
@@ -449,11 +456,12 @@
         ArchivalUrlResultURIConverter uriC = new ArchivalUrlResultURIConverter();
         uriC.setReplayURIPrefix(prefix);
         TagMagix.markupCSSImports(buf, uriC, ts, url);
+        TagMagix.markupStyleUrls(buf,uriC,ts,url);
         String marked = buf.toString();
         assertEquals(want,marked);
     }
 
-    private void checkStyleUrlMarkup(String orig, String want, String prefix, String ts, String url) {
+    private void checkStyleOnlyUrlMarkup(String orig, String want, String prefix, String ts, String url) {
         StringBuilder buf = new StringBuilder(orig);
         ArchivalUrlResultURIConverter uriC = new ArchivalUrlResultURIConverter();
         uriC.setReplayURIPrefix(prefix);
@@ -461,6 +469,16 @@
         String marked = buf.toString();
         assertEquals(want,marked);
     }
+
+    private void checkStyleUrlMarkup(String orig, String want,String prefix, String ts, String url) {
+        StringBuilder buf = new StringBuilder(orig);
+        ArchivalUrlResultURIConverter uriC = new ArchivalUrlResultURIConverter();
+        uriC.setReplayURIPrefix(prefix);
+        TagMagix.markupCSSImports(buf, uriC, ts, url);
+        TagMagix.markupStyleUrls(buf, uriC, ts, url);
+        String marked = buf.toString();
+        assertEquals(want,marked);
+    }
 
     private void checkMarkup(String orig, String want, String tag, String attr, String prefix, String ts, String url) {
         StringBuilder buf = new StringBuilder(orig);
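The new test case in r2772 expects a relative url(...) inside a style attribute to be made absolute against the page URL and then wrapped in the replay prefix and timestamp. A rough sketch of that transformation follows. It is not TagMagix.markupStyleUrls: the class name and regular expression are assumptions, and only unquoted url(...) values are handled.

```java
import java.net.URI;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not Wayback's TagMagix.
class StyleUrlSketch {

    // Matches unquoted CSS url(...) values; quoted forms are deliberately skipped.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\(\\s*([^)\\s'\"]+)\\s*\\)");

    static String rewrite(String html, String replayPrefix, String timestamp, String pageUrl) {
        Matcher m = CSS_URL.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement;
            try {
                // Resolve against the page URL, then wrap in prefix + timestamp.
                String absolute = URI.create(pageUrl).resolve(m.group(1)).toString();
                replacement = "url(" + replayPrefix + timestamp + "/" + absolute + ")";
            } catch (IllegalArgumentException e) {
                // Leave values that are not parseable as URIs untouched.
                replacement = m.group(0);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "<td style=\"b-i:url(i/b.jpg);\n\"></td>";
        System.out.println(rewrite(in, "http://w.a.org/wb/", "2004", "http://f.au/"));
    }
}
```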
From: <bra...@us...> - 2009-07-18 00:14:48
Revision: 2771
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2771&view=rev
Author:   bradtofel
Date:     2009-07-18 00:14:42 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
FEATURE(ACC-32): AccessPoints now have an exactSchemeMatch property. If set to true, only documents with the same scheme as the request URL will be returned within this AccessPoint.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/url/AggressiveUrlCanonicalizer.java
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java

Added Paths:
-----------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/SchemeMatchFilter.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java 2009-07-18 00:05:07 UTC (rev 2770)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java 2009-07-18 00:14:42 UTC (rev 2771)
@@ -39,6 +39,7 @@
 import org.archive.wayback.util.ObjectFilter;
 import org.archive.wayback.util.StringFormatter;
 import org.archive.wayback.util.Timestamp;
+import org.archive.wayback.util.url.UrlOperations;
 import org.archive.wayback.webapp.AccessPoint;
 
 /**
@@ -186,6 +187,12 @@
     public static final String REQUEST_EXACT_HOST_ONLY = "requestexacthost";
 
     /**
+     * Indicates user only wants results that were captured using the same
+     * scheme as that specified in REQUEST_URL.
+     */
+    public static final String REQUEST_EXACT_SCHEME_ONLY = "requestexactscheme";
+
+    /**
      * indicates positive value for any request boolean flag.
      */
     public static final String REQUEST_YES = "yes";
@@ -556,16 +563,27 @@
      * @param urlStr Request URL.
      */
     public void setRequestUrl(String urlStr) {
-        // TODO: fix this to use other schemes
-        if (!urlStr.startsWith("http://")) {
+
+        // This looks a little confusing: We're trying to fixup an incoming
+        // request URL that starts with:
+        //       "http:/www.archive.org"
+        // so it becomes:
+        //       "http://www.archive.org"
+        // (note the missing second "/" in the first)
+        //
+        // if that is not the case, then see if the incoming scheme
+        // is known, adding an implied "http://" scheme if there doesn't appear
+        // to be a scheme..
+        // TODO: make the default "http://" configurable.
+        if (!urlStr.startsWith(UrlOperations.HTTP_SCHEME)) {
             if(urlStr.startsWith("http:/")) {
-                urlStr = "http://" + urlStr.substring(6);
+                urlStr = UrlOperations.HTTP_SCHEME + urlStr.substring(6);
             } else {
-                urlStr = "http://" + urlStr;
+                if(UrlOperations.urlToScheme(urlStr) == null) {
+                    urlStr = UrlOperations.HTTP_SCHEME + urlStr;
+                }
             }
         }
-//        UURI requestURI = UURIFactory.getInstance(urlStr);
-//        put(REQUEST_URL_CLEANED, requestURI.toString());
         put(REQUEST_URL, urlStr);
     }
 
@@ -614,6 +632,13 @@
     public boolean isExactHost() {
         return getBoolean(REQUEST_EXACT_HOST_ONLY);
     }
+
+    public void setExactScheme(boolean isExactScheme) {
+        setBoolean(REQUEST_EXACT_SCHEME_ONLY,isExactScheme);
+    }
+    public boolean isExactScheme() {
+        return getBoolean(REQUEST_EXACT_SCHEME_ONLY);
+    }
 
     public String getAnchorTimestamp() {
         return get(REQUEST_ANCHOR_DATE);

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java 2009-07-18 00:05:07 UTC (rev 2770)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/LocalResourceIndex.java 2009-07-18 00:14:42 UTC (rev 2771)
@@ -51,6 +51,7 @@
 import org.archive.wayback.resourceindex.filters.EndDateFilter;
 import org.archive.wayback.resourceindex.filters.GuardRailFilter;
 import org.archive.wayback.resourceindex.filters.HostMatchFilter;
+import org.archive.wayback.resourceindex.filters.SchemeMatchFilter;
 import org.archive.wayback.resourceindex.filters.SelfRedirectFilter;
 import org.archive.wayback.resourceindex.filters.UrlMatchFilter;
 import org.archive.wayback.resourceindex.filters.UrlPrefixMatchFilter;
@@ -63,6 +64,7 @@
 import org.archive.wayback.util.ObjectFilterIterator;
 import org.archive.wayback.util.Timestamp;
 import org.archive.wayback.util.url.AggressiveUrlCanonicalizer;
+import org.archive.wayback.util.url.UrlOperations;
 
 /**
  *
@@ -370,6 +372,7 @@
             filter.addFilter(drFilter);
         } else if(type == TYPE_URL) {
             filter.addFilter(new UrlPrefixMatchFilter(keyUrl));
+            filter.addFilter(drFilter);
         } else {
             throw new BadQueryException("Unknown type");
         }
@@ -378,6 +381,10 @@
             filter.addFilter(exactHost);
         }
 
+        if(request.isExactScheme()) {
+            filter.addFilter(new SchemeMatchFilter(
+                    UrlOperations.urlToScheme(request.getRequestUrl())));
+        }
         // count how many results got to the ExclusionFilter:
         filter.addFilter(preExclusionCounter);
 

Added: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/SchemeMatchFilter.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/SchemeMatchFilter.java (rev 0)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/SchemeMatchFilter.java 2009-07-18 00:14:42 UTC (rev 2771)
@@ -0,0 +1,60 @@
+/* SchemeMatchFilter
+ *
+ * $Id$
+ *
+ * Created on 6:40:02 PM Nov 6, 2008.
+ *
+ * Copyright (C) 2008 Internet Archive.
+ *
+ * This file is part of wayback.
+ *
+ * wayback is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser Public License as published by
+ * the Free Software Foundation; either version 2.1 of the License, or
+ * any later version.
+ *
+ * wayback is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU Lesser Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser Public License
+ * along with wayback; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+package org.archive.wayback.resourceindex.filters;
+
+import org.archive.wayback.core.CaptureSearchResult;
+import org.archive.wayback.util.ObjectFilter;
+import org.archive.wayback.util.url.UrlOperations;
+
+/**
+ * ObjectFilter which omits CaptureSearchResult objects if their scheme does not
+ * match the specified scheme.
+ *
+ * @author brad
+ * @version $Date$, $Revision$
+ */
+
+public class SchemeMatchFilter implements ObjectFilter<CaptureSearchResult> {
+
+    private String scheme = null;
+
+    /**
+     * @param hostname String of original host to match
+     */
+    public SchemeMatchFilter(final String scheme) {
+        this.scheme = scheme;
+    }
+
+    /* (non-Javadoc)
+     * @see org.archive.wayback.util.ObjectFilter#filterObject(java.lang.Object)
+     */
+    public int filterObject(CaptureSearchResult r) {
+        String captureScheme = UrlOperations.urlToScheme(r.getOriginalUrl());
+        if(scheme == null) {
+            return captureScheme == null ? FILTER_INCLUDE : FILTER_EXCLUDE;
+        }
+        return scheme.equals(captureScheme) ?
FILTER_INCLUDE : FILTER_EXCLUDE; + } +} Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/url/AggressiveUrlCanonicalizer.java =================================================================== --- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/url/AggressiveUrlCanonicalizer.java 2009-07-18 00:05:07 UTC (rev 2770) +++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/url/AggressiveUrlCanonicalizer.java 2009-07-18 00:14:42 UTC (rev 2771) @@ -206,23 +206,17 @@ return urlString; } String searchUrl = canonicalize(urlString); - - // TODO: force https into http for the moment... - if(searchUrl.startsWith("https://")) { - searchUrl = searchUrl.substring(8); + String scheme = UrlOperations.urlToScheme(searchUrl); + if(scheme != null) { + searchUrl = searchUrl.substring(scheme.length()); + } else { + scheme = UrlOperations.HTTP_SCHEME; } - - // TODO: this will only work with http:// scheme. should work with all? - // force add of scheme and possible add '/' with empty path: - if (searchUrl.startsWith("http://")) { - if (-1 == searchUrl.indexOf('/', 8)) { - searchUrl = searchUrl + "/"; - } + + if (-1 == searchUrl.indexOf("/")) { + searchUrl = scheme + searchUrl + "/"; } else { - if (-1 == searchUrl.indexOf("/")) { - searchUrl = searchUrl + "/"; - } - searchUrl = "http://" + searchUrl; + searchUrl = scheme + searchUrl; } // TODO: These next few lines look crazy -- need to be reworked.. This @@ -250,23 +244,18 @@ // if((newPath.length() > 1) && newPath.endsWith("/")) { // newPath = newPath.substring(0,newPath.length()-1); // } -// searchURI.setEscapedPath(newPath); -// searchURI.setRawPath(newPath.toCharArray()); -// String query = searchURI.getEscapedQuery(); - // TODO: handle non HTTP port stripping, too. 
-// String portStr = ""; -// if(searchURI.getPort() != 80 && searchURI.getPort() != -1) { -// portStr = ":" + searchURI.getPort(); -// } -// return searchURI.getHostBasename() + portStr + -// searchURI.getEscapedPathQuery(); - StringBuilder sb = new StringBuilder(searchUrl.length()); sb.append(searchURI.getHostBasename()); - if(searchURI.getPort() != 80 && searchURI.getPort() != -1) { + + // omit port if scheme default: + int defaultSchemePort = UrlOperations.schemeToDefaultPort(scheme); + if(searchURI.getPort() != defaultSchemePort + && searchURI.getPort() != -1) { + sb.append(":").append(searchURI.getPort()); } + sb.append(newPath); if(searchURI.getEscapedQuery() != null) { sb.append("?").append(searchURI.getEscapedQuery()); Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java =================================================================== --- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java 2009-07-18 00:05:07 UTC (rev 2770) +++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java 2009-07-18 00:14:42 UTC (rev 2771) @@ -81,15 +81,16 @@ * @return url resolved against baseUrl, unless it is absolute already */ public static String resolveUrl(String baseUrl, String url) { - // TODO: this only works for http:// - if(url.startsWith("http://")) { - try { - return UURIFactory.getInstance(url).getEscapedURI(); - } catch (URIException e) { - e.printStackTrace(); - // can't let a space exist... send back close to whatever came - // in... - return url.replace(" ", "%20"); + for(final String scheme : ALL_SCHEMES) { + if(url.startsWith(scheme)) { + try { + return UURIFactory.getInstance(url).getEscapedURI(); + } catch (URIException e) { + e.printStackTrace(); + // can't let a space exist... send back close to whatever came + // in... 
+ return url.replace(" ", "%20"); + } } } UURI absBaseURI; @@ -99,11 +100,39 @@ resolvedURI = UURIFactory.getInstance(absBaseURI, url); } catch (URIException e) { e.printStackTrace(); - return url; + return url.replace(" ", "%20"); } return resolvedURI.getEscapedURI(); } + public static String urlToScheme(final String url) { + for(final String scheme : ALL_SCHEMES) { + if(url.startsWith(scheme)) { + return scheme; + } + } + return null; + } + + public static int schemeToDefaultPort(final String scheme) { + if(scheme.equals(HTTP_SCHEME)) { + return 80; + } + if(scheme.equals(HTTPS_SCHEME)) { + return 443; + } + if(scheme.equals(FTP_SCHEME)) { + return 21; + } + if(scheme.equals(RTSP_SCHEME)) { + return 554; + } + if(scheme.equals(MMS_SCHEME)) { + return 1755; + } + return -1; + } + public static String urlToHost(String url) { if(url.startsWith("dns:")) { return url.substring(4); Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java =================================================================== --- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java 2009-07-18 00:05:07 UTC (rev 2770) +++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/webapp/AccessPoint.java 2009-07-18 00:14:42 UTC (rev 2771) @@ -80,6 +80,7 @@ private boolean useServerName = false; private boolean useAnchorWindow = false; + private boolean exactSchemeMatch = true; private int contextPort = 0; private String contextName = null; @@ -217,11 +218,6 @@ prefix.append(":").append(waybackPort); } String contextPath = getContextPath(httpRequest); -// if(contextPath.length() > 1) { -// prefix.append(contextPath); -// } else { -// prefix.append(contextPath); -// } prefix.append(contextPath); return prefix.toString(); } @@ -264,19 +260,6 @@ } catch(IOException e) { // TODO: figure out if we got IO because of a missing dispatcher } -// uiResults.storeInRequest(httpRequest,translated); -// 
RequestDispatcher dispatcher = null; -// // special case for the front '/' page: -// if(translated.length() == 0) { -// translated = "/"; -// } else { -// translated = "/" + translated; -// } -// dispatcher = httpRequest.getRequestDispatcher(translated); -// if(dispatcher != null) { -// dispatcher.forward(httpRequest, httpResponse); -// return true; -// } return false; } @@ -299,9 +282,13 @@ if(wbRequest != null) { handled = true; + + // TODO: refactor this code into RequestParser implementations wbRequest.setAccessPoint(this); wbRequest.setContextPrefix(getAbsoluteLocalPrefix(httpRequest)); wbRequest.fixup(httpRequest); + // end of refactor + if(authentication != null) { if(!authentication.isTrue(wbRequest)) { throw new AuthenticationControlException("Not authorized"); @@ -311,6 +298,12 @@ if(exclusionFactory != null) { wbRequest.setExclusionFilter(exclusionFactory.get()); } + // TODO: refactor this into RequestParser implementations, so a + // user could alter requests to change the behavior within a + // single AccessPoint. For now, this is a simple way to expose + // the feature to configuration. + wbRequest.setExactScheme(exactSchemeMatch); + if(wbRequest.isReplayRequest()) { handleReplay(wbRequest,httpRequest,httpResponse); @@ -488,7 +481,21 @@ public void setUseAnchorWindow(boolean useAnchorWindow) { this.useAnchorWindow = useAnchorWindow; } + + /** + * @return the exactSchemeMatch + */ + public boolean isExactSchemeMatch() { + return exactSchemeMatch; + } + /** + * @param exactSchemeMatch the exactSchemeMatch to set + */ + public void setExactSchemeMatch(boolean exactSchemeMatch) { + this.exactSchemeMatch = exactSchemeMatch; + } + public ExclusionFilterFactory getExclusionFactory() { return exclusionFactory; } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
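For readers following r2771, the scheme handling can be approximated outside Wayback. This is a minimal sketch under stated assumptions, not the project's code: the class name `SchemeSketch` and the `fixupRequestUrl` and `schemeMatches` helpers are hypothetical, and the scheme strings are guesses, since the diff references `ALL_SCHEMES`, `HTTPS_SCHEME` and `FTP_SCHEME` without showing their values.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the scheme helpers touched in r2771.
class SchemeSketch {
    // Assumed values; the real constants live in UrlOperations.
    static final String HTTP_SCHEME = "http://";
    static final String HTTPS_SCHEME = "https://";
    static final String FTP_SCHEME = "ftp://";
    static final List<String> ALL_SCHEMES =
            Arrays.asList(HTTP_SCHEME, HTTPS_SCHEME, FTP_SCHEME);

    // Returns the recognized scheme prefix of url, or null if none matches.
    static String urlToScheme(String url) {
        for (String scheme : ALL_SCHEMES) {
            if (url.startsWith(scheme)) {
                return scheme;
            }
        }
        return null;
    }

    // Follows the fixup described in setRequestUrl: repair "http:/host",
    // keep URLs with a known scheme, otherwise imply "http://".
    static String fixupRequestUrl(String urlStr) {
        if (urlStr.startsWith(HTTP_SCHEME)) {
            return urlStr;
        }
        if (urlStr.startsWith("http:/")) {
            return HTTP_SCHEME + urlStr.substring(6);
        }
        if (urlToScheme(urlStr) == null) {
            return HTTP_SCHEME + urlStr;
        }
        return urlStr;
    }

    // True when a capture would pass an exact-scheme filter for the request.
    static boolean schemeMatches(String requestUrl, String captureUrl) {
        String wanted = urlToScheme(requestUrl);
        String actual = urlToScheme(captureUrl);
        return wanted == null ? actual == null : wanted.equals(actual);
    }
}
```

Under these assumptions, a request for an `https://` URL no longer matches an `http://` capture of the same host and path once exact scheme matching is on, which appears to be the behaviour the `exactSchemeMatch` property exposes.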
From: <bra...@us...> - 2009-07-18 00:05:09
|
Revision: 2770
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2770&view=rev
Author:   bradtofel
Date:     2009-07-18 00:05:07 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
BUGFIX(ACC-78): Now Wayback rewrites Content-Base HTTP headers.
FEATURE(ACC-77): Allow prefixing of original HTTP headers with a fixed string

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/HttpHeaderProcessor.java
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/HttpHeaderProcessor.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/HttpHeaderProcessor.java 2009-07-17 23:57:57 UTC (rev 2769)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/HttpHeaderProcessor.java 2009-07-18 00:05:07 UTC (rev 2770)
@@ -49,6 +49,10 @@
     public final static String HTTP_CONTENT_BASE_HEADER_UP =
         HTTP_CONTENT_BASE_HEADER.toUpperCase();
+    public final static String HTTP_CONTENT_LOCATION_HEADER = "Content-Location";
+    public final static String HTTP_CONTENT_LOCATION_HEADER_UP =
+        HTTP_CONTENT_LOCATION_HEADER.toUpperCase();
+
     public final static String HTTP_CONTENT_TYPE_HEADER = "Content-Type";
     public final static String HTTP_CONTENT_TYPE_HEADER_UP =
         HTTP_CONTENT_TYPE_HEADER.toUpperCase();

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java 2009-07-17 23:57:57 UTC (rev 2769)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/RedirectRewritingHttpHeaderProcessor.java 2009-07-18 00:05:07 UTC (rev 2770)
@@ -39,6 +39,18 @@
 public class RedirectRewritingHttpHeaderProcessor
     implements HttpHeaderProcessor {
+    private static String DEFAULT_PREFIX = null;
+    private String prefix = DEFAULT_PREFIX;
+
+    public String getPrefix() {
+        return prefix;
+    }
+
+    public void setPrefix(String prefix) {
+        this.prefix = prefix;
+    }
+
+
+
     /* (non-Javadoc)
      * @see org.archive.wayback.replay.HttpHeaderProcessor#filter(java.util.Map, java.lang.String, java.lang.String, org.archive.wayback.ResultURIConverter, org.archive.wayback.core.CaptureSearchResult)
      */
@@ -47,9 +59,20 @@
         String keyUp = key.toUpperCase();
+        // first stick it in as-is, or with prefix, then maybe we'll overwrite
+        // with the later logic.
+        if(prefix == null) {
+            if(!keyUp.equals(HTTP_LENGTH_HEADER_UP)) {
+                output.put(key, value);
+            }
+        } else {
+            output.put(prefix + key, value);
+        }
+
         // rewrite Location header URLs
         if (keyUp.startsWith(HTTP_LOCATION_HEADER_UP) ||
-                keyUp.startsWith(HTTP_CONTENT_BASE_HEADER_UP)) {
+                keyUp.startsWith(HTTP_CONTENT_LOCATION_HEADER_UP) ||
+                keyUp.startsWith(HTTP_CONTENT_BASE_HEADER_UP)) {
             String baseUrl = result.getOriginalUrl();
             String cd = result.getCaptureTimestamp();
@@ -57,13 +80,10 @@
             String u = UrlOperations.resolveUrl(baseUrl, value);
             output.put(key, uriConverter.makeReplayURI(cd,u));
-// } else if(keyUp.startsWith(HTTP_CONTENT_TYPE_HEADER_UP)) {
-// output.put("X-Wayback-Orig-" + key,value);
-// output.put(key,value);
-        } else {
-            // others go out as-is:
-            output.put(key, value);
+        } else if(keyUp.startsWith(HTTP_CONTENT_TYPE_HEADER_UP)) {
+            // let's leave this one as-is:
+            output.put(key,value);
         }
     }
 }
|
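The header handling in r2770 can be read as a small two-step rule: store the original header (prefixed when a prefix is configured), then overwrite URL-bearing headers with rewritten values. The following is a minimal sketch of that rule, not the Wayback class itself. `HeaderPrefixSketch` and the `rewriteUrl` callback are hypothetical, and the upper-cased header names are assumptions, because the diff does not show the values of `HTTP_LENGTH_HEADER_UP` or `HTTP_LOCATION_HEADER_UP`.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical illustration of the prefix-then-rewrite rule from r2770.
class HeaderPrefixSketch {
    // prefix may be null (no prefixing); rewriteUrl stands in for the
    // replay URL conversion done by the real ResultURIConverter.
    static void filter(Map<String, String> output, String key, String value,
                       String prefix, UnaryOperator<String> rewriteUrl) {
        String keyUp = key.toUpperCase(Locale.ROOT);

        // Step 1: keep the original header, prefixed if a prefix is set.
        if (prefix == null) {
            if (!keyUp.equals("CONTENT-LENGTH")) { // assumed header name
                output.put(key, value);
            }
        } else {
            output.put(prefix + key, value);
        }

        // Step 2: overwrite or add headers that must appear unprefixed.
        if (keyUp.startsWith("LOCATION")
                || keyUp.startsWith("CONTENT-LOCATION")
                || keyUp.startsWith("CONTENT-BASE")) {
            output.put(key, rewriteUrl.apply(value));
        } else if (keyUp.startsWith("CONTENT-TYPE")) {
            output.put(key, value);
        }
    }
}
```

With a prefix such as `X-Archive-Orig-`, a redirect would carry both the untouched original (`X-Archive-Orig-Location`) and the rewritten `Location`, while headers like `Server` would only appear in prefixed form.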
From: <bra...@us...> - 2009-07-17 23:58:00
|
Revision: 2769
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2769&view=rev
Author:   bradtofel
Date:     2009-07-17 23:57:57 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
BUGFIX(ACC-63): charset extraction from HTTP headers is now case-insensitive
BUGFIX(ACC-65): no longer adding content to HTML pages with FrameSet tags, as they were being broken.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/TextDocument.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/TextDocument.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/TextDocument.java 2009-07-17 23:53:43 UTC (rev 2768)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/replay/TextDocument.java 2009-07-17 23:57:57 UTC (rev 2769)
@@ -31,6 +31,7 @@
 import java.nio.charset.Charset;
 import java.nio.charset.IllegalCharsetNameException;
 import java.text.ParseException;
+import java.util.Iterator;
 import java.util.Map;
 import javax.servlet.ServletException;
@@ -102,7 +103,9 @@
     }
     private String contentTypeToCharset(final String contentType) {
-        int offset = contentType.indexOf(CHARSET_TOKEN);
+        int offset =
+            contentType.toUpperCase().indexOf(CHARSET_TOKEN.toUpperCase());
+
         if (offset != -1) {
             String cs = contentType.substring(offset + CHARSET_TOKEN.length());
             if(isCharsetSupported(cs)) {
@@ -135,7 +138,16 @@
         String charsetName = null;
         Map<String,String> httpHeaders = resource.getHttpHeaders();
-        String ctype = httpHeaders.get(HTTP_CONTENT_TYPE_HEADER);
+        Iterator<String> keys = httpHeaders.keySet().iterator();
+        String ctype = null;
+        while(keys.hasNext()) {
+            String headerKey = keys.next();
+            String keyCmp = headerKey.toUpperCase().trim();
+            if(keyCmp.equals(HTTP_CONTENT_TYPE_HEADER.toUpperCase())) {
+                ctype = httpHeaders.get(headerKey);
+                break;
+            }
+        }
         if (ctype != null) {
             charsetName = contentTypeToCharset(ctype);
         }
@@ -423,9 +435,15 @@
     public void insertAtStartOfBody(String toInsert) {
         int insertPoint = TagMagix.getEndOfFirstTag(sb,"body");
         if (-1 == insertPoint) {
-            insertPoint = 0;
+            // see if there's a frameset, and don't insert if there is.
+            int framesetPoint = TagMagix.getEndOfFirstTag(sb,"frameset");
+            if(-1 == framesetPoint) {
+                insertPoint = 0;
+            }
         }
-        sb.insert(insertPoint,toInsert);
+        if (-1 != insertPoint) {
+            sb.insert(insertPoint,toInsert);
+        }
     }
     /**
      * @param jspPath
|
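The case-insensitive lookup from r2769 (ACC-63) can be tried in isolation. This is a minimal sketch assuming ASCII header names and values: `CharsetSketch` and `findHeader` are hypothetical names, the `"charset="` token is an assumed value for `CHARSET_TOKEN`, and the charset-support check from the original method is left out.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of case-insensitive Content-Type handling.
class CharsetSketch {
    static final String CHARSET_TOKEN = "charset="; // assumed token value

    // Finds a header value whose name matches ignoring case and padding.
    static String findHeader(Map<String, String> headers, String name) {
        for (Map.Entry<String, String> e : headers.entrySet()) {
            if (e.getKey().trim().equalsIgnoreCase(name)) {
                return e.getValue();
            }
        }
        return null;
    }

    // Returns the text after "charset=" (any case), or null if absent.
    static String contentTypeToCharset(String contentType) {
        int offset = contentType.toUpperCase(Locale.ROOT)
                .indexOf(CHARSET_TOKEN.toUpperCase(Locale.ROOT));
        if (offset == -1) {
            return null;
        }
        return contentType.substring(offset + CHARSET_TOKEN.length()).trim();
    }
}
```

The sketch passes `Locale.ROOT` to `toUpperCase` so the comparison does not depend on the JVM's default locale; that is a choice made here, not something the commit does.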
From: <bra...@us...> - 2009-07-17 23:53:48
|
Revision: 2768
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2768&view=rev
Author:   bradtofel
Date:     2009-07-17 23:53:43 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
FEATURE(ACC-38): now times out requests to remote ResourceIndexes.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java 2009-07-17 23:52:26 UTC (rev 2767)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java 2009-07-17 23:53:43 UTC (rev 2768)
@@ -26,6 +26,8 @@
 import java.io.File;
 import java.io.IOException;
+import java.net.URL;
+import java.net.URLConnection;
 import java.util.logging.Logger;
 import javax.xml.parsers.DocumentBuilder;
@@ -71,7 +73,10 @@
             .class.getName());
     private String searchUrlBase;
-
+    private int connectTimeout = 10000;
+    private int readTimeout = 10000;
+
+
     private DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
     private static final String WB_XML_REQUEST_TAGNAME = "request";
@@ -333,7 +338,11 @@
     // do an HTTP request, plus parse the result into an XML DOM
     protected Document getHttpDocument(String url) throws IOException,
             SAXException {
-        return (getDocumentBuilder()).parse(url);
+        URL u = new URL(url);
+        URLConnection conn = u.openConnection();
+        conn.setConnectTimeout(connectTimeout);
+        conn.setReadTimeout(readTimeout);
+        return (getDocumentBuilder()).parse(conn.getInputStream(),url);
     }
     protected Document getFileDocument(File f) throws IOException,
             SAXException {
@@ -365,4 +374,19 @@
     public void setCanonicalizer(UrlCanonicalizer canonicalizer) {
         this.canonicalizer = canonicalizer;
     }
+    public int getConnectTimeout() {
+        return connectTimeout;
+    }
+
+    public void setConnectTimeout(int connectTimeout) {
+        this.connectTimeout = connectTimeout;
+    }
+
+    public int getReadTimeout() {
+        return readTimeout;
+    }
+
+    public void setReadTimeout(int readTimeout) {
+        this.readTimeout = readTimeout;
+    }
 }
\ No newline at end of file
|
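The timeout change in r2768 relies on two standard `java.net.URLConnection` setters. As a minimal sketch (the class and method names `TimeoutSketch` and `openWithTimeouts` are hypothetical, and wrapping the checked exception is a choice made here for brevity), the connection can be prepared like this before its stream is handed to a parser:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper showing connect/read timeouts on a URLConnection.
class TimeoutSketch {
    // Timeouts are in milliseconds; 0 would mean "wait forever".
    static URLConnection openWithTimeouts(String url, int connectTimeoutMs,
                                          int readTimeoutMs) {
        try {
            // openConnection() only creates the object; no network traffic
            // happens until the connection is actually used.
            URLConnection conn = URI.create(url).toURL().openConnection();
            conn.setConnectTimeout(connectTimeoutMs);
            conn.setReadTimeout(readTimeoutMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would then pass `conn.getInputStream()` to the XML parser, as the commit does; a stalled remote index should then surface as a `java.net.SocketTimeoutException` instead of blocking the request thread indefinitely.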
From: <bra...@us...> - 2009-07-17 23:53:35
|
Revision: 2762
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2762&view=rev
Author:   bradtofel
Date:     2009-07-17 23:22:42 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
BUGFIX(ACC-70) Now explicitly grabs a Calendar set with GMT timezone instead of setting JVM default timezone at startup.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/query/resultspartitioner/ResultsPartitioner.java
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/Timestamp.java
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/webapp/RequestFilter.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/query/resultspartitioner/ResultsPartitioner.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/query/resultspartitioner/ResultsPartitioner.java 2009-07-17 23:19:41 UTC (rev 2761)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/query/resultspartitioner/ResultsPartitioner.java 2009-07-17 23:22:42 UTC (rev 2762)
@@ -26,7 +26,6 @@
 import java.util.Calendar;
 import java.util.GregorianCalendar;
-import java.util.SimpleTimeZone;
 import java.util.TimeZone;
 import org.archive.wayback.core.WaybackRequest;
@@ -41,12 +40,7 @@
 public abstract class ResultsPartitioner {
     protected Calendar getCalendar() {
-        String[] ids = TimeZone.getAvailableIDs(0);
-        if (ids.length < 1) {
-            return null;
-        }
-        TimeZone gmt = new SimpleTimeZone(0, ids[0]);
-        return new GregorianCalendar(gmt);
+        return new GregorianCalendar(TimeZone.getTimeZone("gmt"));
     }
     protected Calendar dateStrToCalendar(String dateStr) {

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/Timestamp.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/Timestamp.java 2009-07-17 23:19:41 UTC (rev 2761)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/util/Timestamp.java 2009-07-17 23:22:42 UTC (rev 2762)
@@ -26,7 +26,6 @@
 import java.util.Calendar;
 import java.util.Date;
 import java.util.GregorianCalendar;
-import java.util.SimpleTimeZone;
 import java.util.TimeZone;
@@ -42,7 +41,7 @@
     private final static String UPPER_TIMESTAMP_LIMIT = "29991939295959";
     private final static String YEAR_LOWER_LIMIT = "1996";
     private final static String YEAR_UPPER_LIMIT =
-        String.valueOf(Calendar.getInstance().get(Calendar.YEAR));
+        String.valueOf(Calendar.getInstance(TimeZone.getTimeZone("gmt")).get(Calendar.YEAR));
     private final static String MONTH_LOWER_LIMIT = "01";
     private final static String MONTH_UPPER_LIMIT = "12";
     private final static String DAY_LOWER_LIMIT = "01";
@@ -256,12 +255,7 @@
     }
     private static Calendar getCalendar() {
-        String[] ids = TimeZone.getAvailableIDs(0);
-        if (ids.length < 1) {
-            return null;
-        }
-        TimeZone gmt = new SimpleTimeZone(0, ids[0]);
-        return new GregorianCalendar(gmt);
+        return new GregorianCalendar(TimeZone.getTimeZone("gmt"));
     }
     /**

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/webapp/RequestFilter.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/webapp/RequestFilter.java 2009-07-17 23:19:41 UTC (rev 2761)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/webapp/RequestFilter.java 2009-07-17 23:22:42 UTC (rev 2762)
@@ -25,7 +25,6 @@
 package org.archive.wayback.webapp;
 import java.io.IOException;
-import java.util.TimeZone;
 import java.util.logging.Logger;
 import javax.servlet.Filter;
@@ -39,10 +38,6 @@
 import org.archive.wayback.exception.ConfigurationException;
-//import org.archive.wayback.core.WaybackRequest;
-//import org.archive.wayback.exception.BadQueryException;
-//import org.archive.wayback.exception.ConfigurationException;
-
 /**
  *
  *
@@ -60,7 +55,6 @@
     public void init(FilterConfig config) throws ServletException {
         LOGGER.info("Wayback Filter initializing...");
-        TimeZone.setDefault(TimeZone.getTimeZone("GMT"));
         try {
             mapper = new RequestMapper(config.getServletContext());
         } catch (ConfigurationException e) {
|
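The approach in r2762 (ACC-70), asking for a GMT calendar explicitly instead of changing the JVM-wide default, can be illustrated on its own. This is a minimal sketch; `GmtCalendarSketch` and `hourOfDayGmt` are hypothetical names used only for the example.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical illustration of building calendars pinned to GMT.
class GmtCalendarSketch {
    // A calendar whose fields are computed in GMT, whatever the JVM default.
    static Calendar gmtCalendar() {
        return new GregorianCalendar(TimeZone.getTimeZone("GMT"));
    }

    // Hour of day (0-23) in GMT for an instant given in epoch milliseconds.
    static int hourOfDayGmt(long epochMillis) {
        Calendar cal = gmtCalendar();
        cal.setTimeInMillis(epochMillis);
        return cal.get(Calendar.HOUR_OF_DAY);
    }
}
```

The commit spells the ID as lower-case `"gmt"`. `TimeZone.getTimeZone` is documented to return the GMT zone when it cannot understand an ID, so the result should be GMT either way; the sketch uses the upper-case spelling to avoid relying on that fallback.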
From: <bra...@us...> - 2009-07-17 23:53:31
|
Revision: 2763
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2763&view=rev
Author:   bradtofel
Date:     2009-07-17 23:28:06 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
FEATURE(ACC-38): Now attempts to time out fetches

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFactory.java

Added Paths:
-----------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/TimeoutArchiveReaderFactory.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFactory.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFactory.java 2009-07-17 23:22:42 UTC (rev 2762)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/ResourceFactory.java 2009-07-17 23:28:06 UTC (rev 2763)
@@ -4,6 +4,7 @@
 import java.io.IOException;
 import java.net.URL;
+import org.archive.io.ArchiveReader;
 import org.archive.io.ArchiveRecord;
 import org.archive.io.arc.ARCReader;
 import org.archive.io.arc.ARCReaderFactory;
@@ -60,21 +61,22 @@
     }
     public static Resource getResource(URL url, long offset)
-    throws IOException, ResourceNotAvailableException {
+    throws IOException, ResourceNotAvailableException {
+
         Resource r = null;
-        String name = url.getFile();
-        if (isArc(name)) {
-
-            ARCReader reader = ARCReaderFactory.get(url, offset);
-            r = ARCArchiveRecordToResource(reader.get(),reader);
-
-        } else if (isWarc(name)) {
-
-            WARCReader reader = WARCReaderFactory.get(url, offset);
-            r = WARCArchiveRecordToResource(reader.get(),reader);
-
+        // TODO: allow configuration of timeouts -- now using defaults..
+        TimeoutArchiveReaderFactory tarf = new TimeoutArchiveReaderFactory();
+        ArchiveReader reader = tarf.getArchiveReader(url,offset);
+        if(reader instanceof ARCReader) {
+            ARCReader areader = (ARCReader) reader;
+            r = ARCArchiveRecordToResource(areader.get(),areader);
+
+        } else if(reader instanceof WARCReader) {
+            WARCReader wreader = (WARCReader) reader;
+            r = WARCArchiveRecordToResource(wreader.get(),wreader);
+
         } else {
-            throw new ResourceNotAvailableException("Unknown extension");
+            throw new ResourceNotAvailableException("Unknown ArchiveReader");
         }
         return r;
     }
@@ -91,7 +93,7 @@
                 || name.endsWith(ArcWarcFilenameFilter.WARC_GZ_SUFFIX));
     }
-    private static Resource ARCArchiveRecordToResource(ArchiveRecord rec,
+    public static Resource ARCArchiveRecordToResource(ArchiveRecord rec,
             ARCReader reader) throws ResourceNotAvailableException, IOException {
         if (!(rec instanceof ARCRecord)) {
@@ -102,7 +104,7 @@
         return ar;
     }
-    private static Resource WARCArchiveRecordToResource(ArchiveRecord rec,
+    public static Resource WARCArchiveRecordToResource(ArchiveRecord rec,
             WARCReader reader) throws ResourceNotAvailableException, IOException {
         if (!(rec instanceof WARCRecord)) {

Added: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/TimeoutArchiveReaderFactory.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/TimeoutArchiveReaderFactory.java (rev 0)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/resourcefile/TimeoutArchiveReaderFactory.java 2009-07-17 23:28:06 UTC (rev 2763)
@@ -0,0 +1,58 @@
+package org.archive.wayback.resourcestore.resourcefile;
+
+import java.io.IOException;
+import java.net.HttpURLConnection;
+import java.net.URL;
+import java.net.URLConnection;
+
+import org.archive.io.ArchiveReader;
+import org.archive.io.ArchiveReaderFactory;
+
+/**
+ * Sad but needed subclass of the ArchiveReaderFactory, allows config of
+ * timeouts for connect and reads on underlying HTTP connections, and overrides
+ * the one getArchiveReader(URL,long) method to enable setting the timeouts.
+ *
+ * This functionality should be moved into the ArchiveReaderFactory.
+ *
+ * @author brad
+ *
+ */
+public class TimeoutArchiveReaderFactory extends ArchiveReaderFactory {
+
+    private final static int STREAM_ALL = -1;
+    private int connectTimeout = 10000;
+    private int readTimeout = 10000;
+    public TimeoutArchiveReaderFactory(int connectTimeout, int readTimeout) {
+        this.connectTimeout = connectTimeout;
+        this.readTimeout = readTimeout;
+    }
+
+    public TimeoutArchiveReaderFactory(int timeout) {
+        this.connectTimeout = timeout;
+        this.readTimeout = timeout;
+    }
+    public TimeoutArchiveReaderFactory() {
+    }
+    protected ArchiveReader getArchiveReader(final URL f, final long offset)
+    throws IOException {
+
+        // Get URL connection.
+        URLConnection connection = f.openConnection();
+        if (connection instanceof HttpURLConnection) {
+            addUserAgent((HttpURLConnection)connection);
+        }
+        if (offset != STREAM_ALL) {
+            // Use a Range request (Assumes HTTP 1.1 on other end). If
+            // length >= 0, add open-ended range header to the request. Else,
+            // because end-byte is inclusive, subtract 1.
+            connection.addRequestProperty("Range", "bytes=" + offset + "-");
+        }
+
+        connection.setConnectTimeout(connectTimeout);
+        connection.setReadTimeout(readTimeout);
+
+        return getArchiveReader(f.toString(), connection.getInputStream(),
+                (offset == 0));
+    }
+}
|
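The fetch path added in r2763 combines an open-ended HTTP `Range` header with the same timeout setters. The following is a minimal sketch of only the connection setup, with no Heritrix classes involved: `RangeRequestSketch` and `openAtOffset` are hypothetical names, and a single timeout value is used for both connect and read to keep the example short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical illustration of an open-ended Range request with timeouts.
class RangeRequestSketch {
    static final long STREAM_ALL = -1; // sentinel: fetch the whole file

    static URLConnection openAtOffset(String url, long offset, int timeoutMs) {
        try {
            URLConnection conn = URI.create(url).toURL().openConnection();
            if (offset != STREAM_ALL) {
                // "bytes=N-" asks for everything from byte N to the end.
                conn.addRequestProperty("Range", "bytes=" + offset + "-");
            }
            conn.setConnectTimeout(timeoutMs);
            conn.setReadTimeout(timeoutMs);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

As the original comment notes, this assumes an HTTP 1.1 server on the other end; a server that ignores `Range` would presumably send the whole file from byte 0, so the caller cannot treat the first bytes read as the record at `offset` without checking the response.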
From: <bra...@us...> - 2009-07-17 23:53:31
|
Revision: 2761
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2761&view=rev
Author:   bradtofel
Date:     2009-07-17 23:19:41 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
BUGFIX(ACC-62): Now sets correct offset on search Result prior to inserting into live web ResourceIndex

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/liveweb/LiveWebCache.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/liveweb/LiveWebCache.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/liveweb/LiveWebCache.java 2009-07-17 23:17:32 UTC (rev 2760)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/liveweb/LiveWebCache.java 2009-07-17 23:19:41 UTC (rev 2761)
@@ -207,6 +207,8 @@
             ARCRecord record = (ARCRecord) aResource.getArcRecord();
             CaptureSearchResult result = adapter.adapt(record);
+            // HACKHACK: we're getting the wrong offset from the ARCReader:
+            result.setOffset(offset);
             index.addSearchResult(result);
             LOGGER.info("Added URL(" + url.toString() + ") in " +
                     "ARC(" + name + ") at (" + offset + ") to LiveIndex");
|
From: <bra...@us...> - 2009-07-17 23:53:25
|
Revision: 2756
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2756&view=rev
Author:   bradtofel
Date:     2009-07-17 23:08:44 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------

Removed Paths:
-------------
    branches/wayback-1_4_2/
|
From: <bra...@us...> - 2009-07-17 23:53:23
|
Revision: 2755
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2755&view=rev
Author:   bradtofel
Date:     2009-07-17 23:02:39 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
Release 1.4.2

Added Paths:
-----------
    branches/wayback-1_4_2/
From: <bra...@us...> - 2009-07-17 23:52:36
|
Revision: 2767
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2767&view=rev
Author:   bradtofel
Date:     2009-07-17 23:52:26 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
TWEAK: comment & whitespace.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/SimpleResourceStore.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/SimpleResourceStore.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/SimpleResourceStore.java	2009-07-17 23:51:41 UTC (rev 2766)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/SimpleResourceStore.java	2009-07-17 23:52:26 UTC (rev 2767)
@@ -35,10 +35,10 @@
 
 
 /**
- * Implements ResourceStore where ARC/WARCs are accessed via HTTP 1.1 range
- * requests. All files are assumed to be "rooted" at a particular HTTP URL,
- * within a single directory, implying a file reverse-proxy to connect through
- * to actual HTTP ARC/WARC locations.
+ * Implements ResourceStore where ARC/WARCs are accessed via a local file or an
+ * HTTP 1.1 range request. All files are assumed to be "rooted" at a particular
+ * HTTP URL, or within a single local directory. The HTTP version may imply a
+ * file reverse-proxy to connect through to actual HTTP ARC/WARC locations.
  *
  * @author brad
  * @version $Date$, $Revision$
@@ -47,7 +47,6 @@
 
     private String prefix = null;
 
-
     public Resource retrieveResource(CaptureSearchResult result)
         throws ResourceNotAvailableException {
 
From: <bra...@us...> - 2009-07-17 23:51:47
|
Revision: 2766
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2766&view=rev
Author:   bradtofel
Date:     2009-07-17 23:51:41 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
REFACTOR: ResourceFactory is now responsible for deciding if a String is a path to a local file, or an HTTP URL to a remote one.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java	2009-07-17 23:50:08 UTC (rev 2765)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java	2009-07-17 23:51:41 UTC (rev 2766)
@@ -24,9 +24,7 @@
  */
 package org.archive.wayback.resourcestore;
 
-import java.io.File;
 import java.io.IOException;
-import java.net.URL;
 import java.util.logging.Logger;
 
 import org.archive.wayback.ResourceStore;
@@ -80,12 +78,7 @@
 
         try {
 
-            if(url.startsWith("http://")) {
-                r = ResourceFactory.getResource(new URL(url), offset);
-            } else {
-                // assume local path:
-                r = ResourceFactory.getResource(new File(url), offset);
-            }
+            r = ResourceFactory.getResource(url, offset);
             // TODO: attempt to grab the first few KB? The underlying
             // InputStreams support mark(), so we could reset() after.
             // wait for now, currently this will parse HTTP headers,
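Editorial sketch: the refactor above moves the "remote URL or local path?" decision out of the caller. The body of ResourceFactory.getResource(String, long) is not shown in this message, so the block below only restates the rule from the removed lines in isolation. The class LocationKindSketch and the enum Kind are hypothetical names.

```java
final class LocationKindSketch {
    enum Kind { REMOTE_HTTP, LOCAL_FILE }

    /**
     * Applies the rule from the removed code: a location starting with
     * "http://" is remote, and anything else is assumed to be a local path.
     */
    static Kind classify(String location) {
        if (location.startsWith("http://")) {
            return Kind.REMOTE_HTTP;
        }
        // Note: under this rule an "https://" location also lands here.
        return Kind.LOCAL_FILE;
    }
}
```

Keeping this test in one place means that callers such as LocationDBResourceStore no longer need the java.io.File and java.net.URL imports, which is what the first hunk removes.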
From: <bra...@us...> - 2009-07-17 23:50:12
|
Revision: 2765 http://archive-access.svn.sourceforge.net/archive-access/?rev=2765&view=rev Author: bradtofel Date: 2009-07-17 23:50:08 +0000 (Fri, 17 Jul 2009) Log Message: ----------- BUGFIX(ACC-45): now all mime-types are escaped to prevent spaces from getting into the CDX files. FEATURE(ACC-75): warc-indexer now has -all option to produce a CDX line for ALL records, not just captures and revisits. FEATURE(ACC-76): now includes file+offset for all records, keying off mime-type of warc/revisit to determine revisits at query time. Modified Paths: -------------- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WARCRecordToSearchResultAdapter.java branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WARCRecordToSearchResultAdapter.java =================================================================== --- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WARCRecordToSearchResultAdapter.java 2009-07-17 23:38:22 UTC (rev 2764) +++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WARCRecordToSearchResultAdapter.java 2009-07-17 23:50:08 UTC (rev 2765) @@ -2,7 +2,7 @@ import java.io.File; import java.io.IOException; -//import java.util.logging.Logger; +import java.util.logging.Logger; import org.apache.commons.httpclient.Header; import org.apache.commons.httpclient.HttpParser; @@ -33,14 +33,23 @@ */ public class WARCRecordToSearchResultAdapter implements Adapter<WARCRecord,CaptureSearchResult>{ + private static final Logger LOGGER = + Logger.getLogger(WARCRecordToSearchResultAdapter.class.getName()); private final static String DEFAULT_VALUE = "-"; -// private static final Logger LOGGER = Logger.getLogger( -// WARCRecordToSearchResultAdapter.class.getName()); - private UrlCanonicalizer
canonicalizer = null; + + private boolean processAll = false; + public boolean isProcessAll() { + return processAll; + } + + public void setProcessAll(boolean processAll) { + this.processAll = processAll; + } + public WARCRecordToSearchResultAdapter() { canonicalizer = new AggressiveUrlCanonicalizer(); } @@ -75,12 +84,19 @@ return output.toString(); } - private static String transformHTTPMime(final String input) { + private static String escapeSpaces(final String input) { + if(input.contains(" ")) { + return input.replace(" ", "%20"); + } + return input; + } + + private static String transformHTTPMime(String input) { int semiIdx = input.indexOf(";"); if(semiIdx > 0) { - return input.substring(0,semiIdx).trim(); + return escapeSpaces(input.substring(0,semiIdx).trim()); } - return input.trim(); + return escapeSpaces(input.trim()); } private String transformWarcFilename(String readerIdentifier) { @@ -148,16 +164,21 @@ return result; } - private CaptureSearchResult adaptRevisit(ArchiveRecordHeader header, WARCRecord rec) + private CaptureSearchResult adaptGeneric(ArchiveRecordHeader header, + WARCRecord rec, String mime) throws IOException { CaptureSearchResult result = getBlankSearchResult(); result.setCaptureTimestamp(transformDate(header.getDate())); + result.setFile(transformWarcFilename(header.getReaderIdentifier())); + result.setOffset(header.getOffset()); result.setDigest(transformDigest(header.getHeaderValue( - WARCRecord.HEADER_KEY_PAYLOAD_DIGEST))); + WARCRecord.HEADER_KEY_PAYLOAD_DIGEST))); addUrlDataToSearchResult(result,header.getUrl()); + + result.setMimeType(mime); return result; } @@ -257,7 +278,17 @@ result = adaptResponse(header,rec); } } else if(type.equals(WARCConstants.REVISIT)) { - result = adaptRevisit(header,rec); + result = adaptGeneric(header,rec,"warc/revisit"); + } else if(type.equals(WARCConstants.REQUEST)) { + if(processAll) { + result = adaptGeneric(header,rec,"warc/request"); + } + } else if(type.equals(WARCConstants.METADATA)) { + 
if(processAll) { + result = adaptGeneric(header,rec,"warc/metadata"); + } + } else { + LOGGER.info("Skipping record type : " + type); } return result; Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java =================================================================== --- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java 2009-07-17 23:38:22 UTC (rev 2764) +++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java 2009-07-17 23:50:08 UTC (rev 2765) @@ -26,9 +26,19 @@ public final static String CDX_HEADER_MAGIC = " CDX N b h m s k r V g"; private UrlCanonicalizer canonicalizer = null; + private boolean processAll = false; public WarcIndexer() { canonicalizer = new AggressiveUrlCanonicalizer(); } + + public boolean isProcessAll() { + return processAll; + } + + public void setProcessAll(boolean processAll) { + this.processAll = processAll; + } + /** * @param warc @@ -61,6 +71,7 @@ WARCRecordToSearchResultAdapter adapter2 = new WARCRecordToSearchResultAdapter(); adapter2.setCanonicalizer(canonicalizer); + adapter2.setProcessAll(processAll); ArchiveReaderCloseableIterator itr1 = new ArchiveReaderCloseableIterator(reader,reader.iterator()); @@ -82,11 +93,12 @@ private static void USAGE() { System.err.println("USAGE:"); System.err.println(""); - System.err.println("warc-indexer [-identity] WARCFILE"); - System.err.println("warc-indexer [-identity] WARCFILE CDXFILE"); + System.err.println("warc-indexer [-identity] [-all] WARCFILE"); + System.err.println("warc-indexer [-identity] [-all] WARCFILE CDXFILE"); System.err.println(""); System.err.println("Create a CDX format index at CDXFILE or to STDOUT"); System.err.println("With -identity, perform no url canonicalization."); + System.err.println("With -all, output request and metadata records."); System.exit(1); } @@ -96,8 +108,14 @@ public static 
void main(String[] args) { WarcIndexer indexer = new WarcIndexer(); int idx = 0; - if(args[0] != null && args[0].equals("-identity")) { - indexer.setCanonicalizer(new IdentityUrlCanonicalizer()); + while(args[idx] != null) { + if(args[idx].equals("-identity")) { + indexer.setCanonicalizer(new IdentityUrlCanonicalizer()); + } else if(args[idx].equals("-all")) { + indexer.setProcessAll(true); + } else { + break; + } idx++; } File arc = new File(args[idx]); This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
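Editorial sketch: two pieces of revision 2765 are easy to try in isolation, the MIME normalization for ACC-45 and the option loop for -identity and -all. The block below is an illustration under assumptions, not the committed code. The names WarcIndexerSketch, normalizeMime, Options and parseOptions are hypothetical. One deliberate difference: the loop in the diff tests args[idx] != null, but a Java array throws ArrayIndexOutOfBoundsException past its end instead of yielding null, so this sketch bounds the loop by args.length.

```java
final class WarcIndexerSketch {
    /** Drops MIME parameters after ';', trims, and escapes spaces as %20. */
    static String normalizeMime(String raw) {
        int semi = raw.indexOf(';');
        // Mirrors the diff: only cut when ';' is not the first character.
        String base = (semi > 0) ? raw.substring(0, semi) : raw;
        return base.trim().replace(" ", "%20");
    }

    /** Flags seen on the command line, plus the index of the first operand. */
    static final class Options {
        boolean identity;
        boolean all;
        int firstOperand;
    }

    static Options parseOptions(String[] args) {
        Options o = new Options();
        int idx = 0;
        // Stop at the first non-option argument or at the end of the array.
        while (idx < args.length) {
            if (args[idx].equals("-identity")) {
                o.identity = true;
            } else if (args[idx].equals("-all")) {
                o.all = true;
            } else {
                break;
            }
            idx++;
        }
        o.firstOperand = idx;
        return o;
    }
}
```

A caller would still need to check that firstOperand is less than args.length before reading the WARC file name, and print the usage text otherwise.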
Revision: 2764 http://archive-access.svn.sourceforge.net/archive-access/?rev=2764&view=rev Author: bradtofel Date: 2009-07-17 23:38:22 +0000 (Fri, 17 Jul 2009) Log Message: ----------- FEATURE(ACC-74): now accepts /OFFSET trailing path in addition to Content-Range HTTP header. Modified Paths: -------------- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FileProxyServlet.java Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FileProxyServlet.java =================================================================== --- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FileProxyServlet.java 2009-07-17 23:28:06 UTC (rev 2763) +++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FileProxyServlet.java 2009-07-17 23:38:22 UTC (rev 2764) @@ -24,11 +24,16 @@ */ package org.archive.wayback.resourcestore.locationdb; +import java.io.File; import java.io.IOException; import java.io.InputStream; import java.io.OutputStream; +import java.io.RandomAccessFile; import java.net.URL; import java.net.URLConnection; +import java.util.logging.Logger; +import java.util.regex.Matcher; +import java.util.regex.Pattern; import javax.servlet.ServletException; import javax.servlet.http.HttpServletRequest; @@ -46,62 +51,143 @@ * @version $Date$, $Revision$ */ public class FileProxyServlet extends ServletRequestContext { + private static final Logger LOGGER = Logger.getLogger(FileProxyServlet.class + .getName()); + private static final int BUF_SIZE = 4096; private static final String RANGE_HTTP_HEADER = "Range"; - private static final String CONTENT_TYPE_HEADER = "Content-Type"; - private static final String CONTENT_TYPE = "application/x-gzip"; - /** - * - */ + private static final String DEFAULT_CONTENT_TYPE = "application/x-gzip"; + private static final String HEADER_BYTES_PREFIX = "bytes="; + private 
static final String HEADER_BYTES_SUFFIX= "-"; + + private static final String FILE_REGEX = "/([^/]*)$"; + private static final String FILE_OFFSET_REGEX = "/([^/]*)/(\\d*)$"; + + private static final Pattern FILE_PATTERN = + Pattern.compile(FILE_REGEX); + private static final Pattern FILE_OFFSET_PATTERN = + Pattern.compile(FILE_OFFSET_REGEX); + private static final long serialVersionUID = 1L; + private ResourceFileLocationDB locationDB = null; public boolean handleRequest(HttpServletRequest httpRequest, HttpServletResponse httpResponse) throws IOException, ServletException { - String name = httpRequest.getRequestURI(); - name = name.substring(name.lastIndexOf('/')+1); - if(name.length() == 0) { + ResourceLocation location = parseRequest(httpRequest); + if(location == null) { httpResponse.sendError(HttpServletResponse.SC_BAD_REQUEST, - "no/invalid name"); + "no/invalid name"); } else { - - String urls[] = locationDB.nameToUrls(name); + String urls[] = locationDB.nameToUrls(location.getName()); + if(urls == null || urls.length == 0) { + LOGGER.warning("No locations for " + location.getName()); httpResponse.sendError(HttpServletResponse.SC_NOT_FOUND, - "Unable to locate("+name+")"); + "Unable to locate("+ location.getName() +")"); } else { + + DataSource ds = null; + for(String urlString : urls) { + try { + ds = locationToDataSource(urlString, + location.getOffset()); + if(ds != null) { + break; + } + } catch(IOException e) { + LOGGER.warning("failed proxy of " + urlString + " " + + e.getLocalizedMessage()); + } + } + if(ds == null) { + LOGGER.warning("No successful locations for " + + location.getName()); + httpResponse.sendError(HttpServletResponse.SC_BAD_GATEWAY, + "failed proxy of ("+ location.getName() +")"); + + } else { + httpResponse.setStatus(HttpServletResponse.SC_OK); + // BUGBUG: this will be broken for non compressed data... 
+ httpResponse.setContentType(ds.getContentType()); + ds.copyTo(httpResponse.getOutputStream()); + } + } + } + return true; + } - String urlString = urls[0]; - String rangeHeader = httpRequest.getHeader(RANGE_HTTP_HEADER); - URL url = new URL(urlString); - URLConnection conn = url.openConnection(); + private DataSource locationToDataSource(String location, long offset) + throws IOException { + DataSource ds = null; + if(location.startsWith("http://")) { + URL url = new URL(location); + URLConnection conn = url.openConnection(); + if(offset != 0) { + conn.addRequestProperty(RANGE_HTTP_HEADER, + HEADER_BYTES_PREFIX + String.valueOf(offset) + + HEADER_BYTES_SUFFIX); + } + + ds = new URLDataSource(conn.getInputStream(),conn.getContentType()); + + } else { + // assume a local file path: + File f = new File(location); + if(f.isFile() && f.canRead()) { + long size = f.length(); + if(size < offset) { + throw new IOException("short file " + location + " cannot" + + " seek to offset " + offset); + } + RandomAccessFile raf = new RandomAccessFile(f,"r"); + raf.seek(offset); + // BUGBUG: is it compressed? 
+ ds = new FileDataSource(raf,DEFAULT_CONTENT_TYPE); + + } else { + throw new IOException("No readable file at " + location); + } + + } + + return ds; + } + + private ResourceLocation parseRequest(HttpServletRequest request) { + ResourceLocation location = null; + + String path = request.getRequestURI(); + Matcher fo = FILE_OFFSET_PATTERN.matcher(path); + if(fo.find()) { + location = new ResourceLocation(fo.group(1), + Long.parseLong(fo.group(2))); + } else { + fo = FILE_PATTERN.matcher(path); + if(fo.find()) { + String rangeHeader = request.getHeader(RANGE_HTTP_HEADER); if(rangeHeader != null) { - conn.addRequestProperty(RANGE_HTTP_HEADER,rangeHeader); - } - InputStream is = conn.getInputStream(); - httpResponse.setStatus(HttpServletResponse.SC_OK); - String typeHeader = conn.getHeaderField(CONTENT_TYPE_HEADER); - if(typeHeader == null) { - typeHeader = CONTENT_TYPE; - } - httpResponse.setContentType(typeHeader); - OutputStream os = httpResponse.getOutputStream(); - int BUF_SIZE = 4096; - byte[] buffer = new byte[BUF_SIZE]; - try { - for(int r = -1; (r = is.read(buffer, 0, BUF_SIZE)) != -1;) { - os.write(buffer, 0, r); + if(rangeHeader.startsWith(HEADER_BYTES_PREFIX)) { + rangeHeader = rangeHeader.substring( + HEADER_BYTES_PREFIX.length()); + if(rangeHeader.endsWith(HEADER_BYTES_SUFFIX)) { + rangeHeader = rangeHeader.substring(0, + rangeHeader.length() - + HEADER_BYTES_SUFFIX.length()); + } } - } finally { - is.close(); + location = new ResourceLocation(fo.group(1), + Long.parseLong(rangeHeader)); + } else { + location = new ResourceLocation(fo.group(1)); } } } - return true; + return location; } /** @@ -117,4 +203,71 @@ public void setLocationDB(ResourceFileLocationDB locationDB) { this.locationDB = locationDB; } + + private class ResourceLocation { + private String name = null; + private long offset = 0; + public ResourceLocation(String name, long offset) { + this.name = name; + this.offset = offset; + } + public ResourceLocation(String name) { + this(name,0); + 
} + public String getName() { + return name; + } + public long getOffset() { + return offset; + } + } + + private interface DataSource { + public void copyTo(OutputStream os) throws IOException; + public String getContentType(); + } + private class FileDataSource implements DataSource { + private RandomAccessFile raf = null; + private String contentType = null; + public FileDataSource(RandomAccessFile raf, String contentType) { + this.raf = raf; + this.contentType = contentType; + } + public String getContentType() { + return contentType; + } + public void copyTo(OutputStream os) throws IOException { + byte[] buffer = new byte[BUF_SIZE]; + try { + int r = -1; + while((r = raf.read(buffer, 0, BUF_SIZE)) != -1) { + os.write(buffer, 0, r); + } + } finally { + raf.close(); + } + } + } + private class URLDataSource implements DataSource { + private InputStream is = null; + private String contentType = null; + public URLDataSource(InputStream is,String contentType) { + this.is = is; + this.contentType = contentType; + } + public String getContentType() { + return contentType; + } + public void copyTo(OutputStream os) throws IOException { + byte[] buffer = new byte[BUF_SIZE]; + try { + int r = -1; + while((r = is.read(buffer, 0, BUF_SIZE)) != -1) { + os.write(buffer, 0, r); + } + } finally { + is.close(); + } + } + } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
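Editorial sketch: parseRequest in the diff above accepts the offset either as a trailing /OFFSET path segment or from the Range request header (the code reads the header named by RANGE_HTTP_HEADER, "Range"). The block below is a standalone illustration of that precedence, assuming stricter patterns than the committed ones. The names ProxyRequestSketch, Location and parse are hypothetical. It uses \d+ where the diff uses \d*, because an empty digit group would reach Long.parseLong("") and throw NumberFormatException, and it reads only the start of a range such as bytes=100-200, which the suffix-stripping in the diff would pass to Long.parseLong as "100-200".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ProxyRequestSketch {
    private static final Pattern FILE_OFFSET = Pattern.compile("/([^/]+)/(\\d+)$");
    private static final Pattern FILE = Pattern.compile("/([^/]+)$");
    private static final Pattern RANGE_START = Pattern.compile("^bytes=(\\d+)-");

    static final class Location {
        final String name;
        final long offset;
        Location(String name, long offset) {
            this.name = name;
            this.offset = offset;
        }
    }

    /** Returns null when no name can be extracted or the offset is unusable. */
    static Location parse(String requestUri, String rangeHeader) {
        try {
            // 1. A trailing /NAME/OFFSET path wins over any header.
            Matcher m = FILE_OFFSET.matcher(requestUri);
            if (m.find()) {
                return new Location(m.group(1), Long.parseLong(m.group(2)));
            }
            // 2. Otherwise take /NAME and look at the Range header.
            m = FILE.matcher(requestUri);
            if (!m.find()) {
                return null;
            }
            if (rangeHeader == null) {
                return new Location(m.group(1), 0L);
            }
            Matcher r = RANGE_START.matcher(rangeHeader);
            if (!r.find()) {
                return null;
            }
            return new Location(m.group(1), Long.parseLong(r.group(1)));
        } catch (NumberFormatException e) {
            // Digits that do not fit in a long are treated as a bad request.
            return null;
        }
    }
}
```

One ambiguity remains in any scheme of this shape: a file whose name consists only of digits is read as an offset when it follows another path segment.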
Revision: 2760
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2760&view=rev
Author:   bradtofel
Date:     2009-07-17 23:17:32 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
BUGFIX(ACC-73): now does case-insensitive compare of hostPort of URLs to ensure we haven't already rewritten.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/domainprefix/DomainPrefixResultURIConverter.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/domainprefix/DomainPrefixResultURIConverter.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/domainprefix/DomainPrefixResultURIConverter.java	2009-07-17 23:16:13 UTC (rev 2759)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/domainprefix/DomainPrefixResultURIConverter.java	2009-07-17 23:17:32 UTC (rev 2760)
@@ -45,6 +45,9 @@
     public String makeReplayURI(String datespec, String url) {
         String replayURI = "";
         try {
+            if(url.toUpperCase().contains(hostPort.toUpperCase())) {
+                return url;
+            }
             URI uri = new URI(url);
             StringBuilder sb = new StringBuilder(90);
             sb.append("http://");
From: <bra...@us...> - 2009-07-17 23:16:22
|
Revision: 2759
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2759&view=rev
Author:   bradtofel
Date:     2009-07-17 23:16:13 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
BUGFIX(ACC-73): now does case-insensitive compare of hostPort of incoming request.

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/domainprefix/DomainPrefixRequestParser.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/domainprefix/DomainPrefixRequestParser.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/domainprefix/DomainPrefixRequestParser.java	2009-07-17 23:13:31 UTC (rev 2758)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/domainprefix/DomainPrefixRequestParser.java	2009-07-17 23:16:13 UTC (rev 2759)
@@ -76,7 +76,7 @@
         WaybackRequest wbRequest = null;
         String server = httpRequest.getServerName() + ":" +
             httpRequest.getServerPort();
-        if(server.endsWith(hostPort)) {
+        if(server.toLowerCase().endsWith(hostPort.toLowerCase())) {
             int length = server.length() - hostPort.length();
             if(server.length() > hostPort.length()) {
                 String prefix = server.substring(0,length - 1);
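Editorial sketch: both ACC-73 fixes (revisions 2760 and 2759) compare host:port strings without regard to case by calling toUpperCase() or toLowerCase() with no arguments, which uses the JVM's default locale. The block below shows a locale-independent variant of the same two checks. It is an alternative under stated assumptions, not the committed code, and the names HostPortMatchSketch, endsWithIgnoreCase and containsIgnoreCase are hypothetical.

```java
import java.util.Locale;

final class HostPortMatchSketch {
    /** Case-insensitive endsWith that allocates no intermediate strings. */
    static boolean endsWithIgnoreCase(String value, String suffix) {
        int start = value.length() - suffix.length();
        return start >= 0
                && value.regionMatches(true, start, suffix, 0, suffix.length());
    }

    /** Case-insensitive contains, using a fixed locale for case mapping. */
    static boolean containsIgnoreCase(String value, String part) {
        return value.toLowerCase(Locale.ROOT)
                .contains(part.toLowerCase(Locale.ROOT));
    }
}
```

The fixed locale matters on systems whose default locale maps letters differently: in a Turkish locale, for example, "i".toUpperCase() yields a dotted capital İ, so a host name containing "i" may fail to match its upper-cased form.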
Revision: 2758
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2758&view=rev
Author:   bradtofel
Date:     2009-07-17 23:13:31 +0000 (Fri, 17 Jul 2009)

Log Message:
-----------
FEATURE: added getter for replayURIPrefix

Modified Paths:
--------------
    branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlResultURIConverter.java

Modified: branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlResultURIConverter.java
===================================================================
--- branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlResultURIConverter.java	2009-07-17 23:09:28 UTC (rev 2757)
+++ branches/wayback-1_4_2/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlResultURIConverter.java	2009-07-17 23:13:31 UTC (rev 2758)
@@ -42,10 +42,14 @@
      * @see org.archive.wayback.ResultURIConverter#makeReplayURI(java.lang.String, java.lang.String)
      */
     public String makeReplayURI(String datespec, String url) {
+        String suffix = datespec + "/" + url;
         if(replayURIPrefix == null) {
-            return datespec + "/" + url;
+            return suffix;
         } else {
-            return replayURIPrefix + datespec + "/" + url;
+            if(url.startsWith(replayURIPrefix)) {
+                return url;
+            }
+            return replayURIPrefix + suffix;
         }
     }
 
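Editorial sketch: the makeReplayURI change above makes the conversion idempotent, so a URL that already starts with the replay prefix is returned unchanged instead of being prefixed a second time. The block below restates that logic as a static method so it can be run on its own. The class ReplayUriSketch, the extra prefix parameter, and the example host in the usage note are hypothetical.

```java
final class ReplayUriSketch {
    /**
     * Builds PREFIX + DATESPEC + "/" + URL, unless the URL already begins
     * with the prefix, in which case it is assumed to be rewritten already.
     */
    static String makeReplayURI(String replayURIPrefix, String datespec,
            String url) {
        String suffix = datespec + "/" + url;
        if (replayURIPrefix == null) {
            return suffix;
        }
        if (url.startsWith(replayURIPrefix)) {
            return url;
        }
        return replayURIPrefix + suffix;
    }
}
```

With the prefix "http://wayback.example.org/wayback/", the datespec "20090717231331" and the URL "http://example.com/", the first call yields "http://wayback.example.org/wayback/20090717231331/http://example.com/", and feeding that result back in returns it unchanged.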