Revision: 2605
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2605&view=rev
Author:   bradtofel
Date:     2008-10-11 01:50:43 +0000 (Sat, 11 Oct 2008)

Log Message:
-----------
TWEAK: now outputs log message when failing to access a Resource, instead of dumping a stack trace.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java	2008-10-11 01:45:56 UTC (rev 2604)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/LocationDBResourceStore.java	2008-10-11 01:50:43 UTC (rev 2605)
@@ -27,6 +27,7 @@
 import java.io.File;
 import java.io.IOException;
 import java.net.URL;
+import java.util.logging.Logger;

 import org.archive.wayback.ResourceStore;
 import org.archive.wayback.core.Resource;
@@ -43,6 +44,8 @@
  * @version $Date$, $Revision$
  */
 public class LocationDBResourceStore implements ResourceStore {
+	private static final Logger LOGGER =
+		Logger.getLogger(LocationDBResourceStore.class.getName());

 	private ResourceFileLocationDB db = null;
@@ -89,7 +92,7 @@
 				// which means we've already read some
 			} catch (IOException e) {
-				e.printStackTrace();
+				LOGGER.warning("Unable to retrieve resource from " + url);
 			}
 			if(r != null) {
 				break;

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
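The change in r2605 swaps a stack-trace dump for a one-line warning. As a rough, self-contained sketch of that pattern (not the Wayback code itself), the following tries a list of locations in order and logs a warning on each failure. The `Fetcher` interface and the `firstAvailable` method are hypothetical stand-ins invented for illustration. Unlike the committed line, this sketch also passes the exception to the logger so the cause is not lost; that is an optional variation, not what the commit does.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

class ResourceFetchLoggingSketch {
    private static final Logger LOGGER =
            Logger.getLogger(ResourceFetchLoggingSketch.class.getName());

    /** Hypothetical stand-in for "open the resource at this location". */
    interface Fetcher {
        String fetch(String url) throws IOException;
    }

    /** Tries each location in order; logs a warning and moves on when one fails. */
    static String firstAvailable(Iterable<String> urls, Fetcher fetcher) {
        for (String url : urls) {
            try {
                String r = fetcher.fetch(url);
                if (r != null) {
                    return r;
                }
            } catch (IOException e) {
                // One log record instead of e.printStackTrace(); the exception
                // is attached so a handler can still print the cause if wanted.
                LOGGER.log(Level.WARNING, "Unable to retrieve resource from " + url, e);
            }
        }
        return null; // no location could supply the resource
    }
}
```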
From: <bi...@us...> - 2008-10-11 01:46:06
Revision: 2604
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2604&view=rev
Author:   binzino
Date:     2008-10-11 01:45:56 +0000 (Sat, 11 Oct 2008)

Log Message:
-----------
Moved NutchWAX-specific part of "package" rule to "onlypack" so it can be run w/o triggering all the dependencies.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/build.xml

Modified: trunk/archive-access/projects/nutchwax/archive/build.xml
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/build.xml	2008-10-11 01:45:00 UTC (rev 2603)
+++ trunk/archive-access/projects/nutchwax/archive/build.xml	2008-10-11 01:45:56 UTC (rev 2604)
@@ -107,9 +107,12 @@
        and let the individual user decide whether or not to incorporate our modifications.
     -->
-  <target name="package" depends="jar, job, war, javadoc">
+  <target name="package" depends="jar, job, war, javadoc" >
     <ant dir="${nutch.dir}" target="package" inheritAll="false" />
+    <ant target="onlypack" />
+  </target>
+  <target name="onlypack">
     <copy todir="${dist.dir}/lib" includeEmptyDirs="false">
       <fileset dir="lib"/>
     </copy>
@@ -133,6 +136,11 @@
       </fileset>
     </copy>

+    <mkdir dir="${dist.dir}/contrib/archive/web"/>
+    <copy todir="${dist.dir}/contrib/archive/web">
+      <fileset dir="src/web" />
+    </copy>
+
   </target>

 </project>
From: <bi...@us...> - 2008-10-11 01:45:12
Revision: 2603 http://archive-access.svn.sourceforge.net/archive-access/?rev=2603&view=rev Author: binzino Date: 2008-10-11 01:45:00 +0000 (Sat, 11 Oct 2008) Log Message: ----------- Added page links template, some comments and other various tweaks. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl Modified: trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl 2008-10-10 03:00:53 UTC (rev 2602) +++ trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl 2008-10-11 01:45:00 UTC (rev 2603) @@ -122,13 +122,37 @@ <div style="font-size: 8pt; margin:0; padding:0 0 0.5em 0;">Results <xsl:value-of select="opensearch:startIndex + 1" />-<xsl:value-of select="opensearch:startIndex + opensearch:itemsPerPage" /> of about <xsl:value-of select="opensearch:totalResults" /> <span style="margin-left: 1em;"><a href="{nutch:nextPage}">Next</a></span></div> <!-- Search results --> <ol start="{opensearch:startIndex + 1}"> - <xsl:apply-templates select="item" /> + <xsl:apply-templates select="item" /> </ol> - <a href="{nutch:nextPage}">Next</a> + <!-- Generate list of page links --> + <center> + <xsl:if test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) != 1"> + <a href="search?query={nutch:query}&start={(floor(opensearch:startIndex div opensearch:itemsPerPage) - 1) * opensearch:itemsPerPage}">«</a><xsl:text> </xsl:text> + </xsl:if> + <xsl:choose> + <xsl:when test="(floor(opensearch:startIndex div opensearch:itemsPerPage) + 1) < 11"> + <xsl:call-template name="pageLinks" > + <xsl:with-param name="begin" select="1" /> + <xsl:with-param name="end" select="21" /> + <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> + </xsl:call-template> + </xsl:when> + <xsl:otherwise> + <xsl:call-template 
name="pageLinks" > + <xsl:with-param name="begin" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 - 10" /> + <xsl:with-param name="end" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1 + 11" /> + <xsl:with-param name="current" select="floor(opensearch:startIndex div opensearch:itemsPerPage) + 1" /> + </xsl:call-template> + </xsl:otherwise> + </xsl:choose> + <a href="{nutch:nextPage}">»</a> + </center> </body> </html> </xsl:template> +<!-- Template to emit a search result as an HTML list item (<li/>). + --> <xsl:template match="item"> <li> <div class="searchResult"> @@ -146,8 +170,38 @@ </li> </xsl:template> +<!-- Template to emit a date in YYYY/MM/DD format + --> <xsl:template match="nutch:date" > <xsl:value-of select="substring(.,1,4)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,5,2)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,7,2)" /><xsl:text> </xsl:text> </xsl:template> +<!-- Template to generate a list of numbered links to results pages. 
+ Parameters: + begin starting # inclusive + end ending # exclusive + current the current page, don't emit a link + --> +<xsl:template name="pageLinks"> + <xsl:param name="begin" /> + <xsl:param name="end" /> + <xsl:param name="current" /> + <xsl:if test="$begin < $end"> + <xsl:choose> + <xsl:when test="$begin = $current" > + <xsl:value-of select="$current" /> + </xsl:when> + <xsl:otherwise> + <a href="?query={nutch:query}&start={($begin -1) * opensearch:itemsPerPage}&hitsPerPage={opensearch:itemsPerPage}"><xsl:value-of select="$begin" /></a> + </xsl:otherwise> + </xsl:choose> + <xsl:text> </xsl:text> + <xsl:call-template name="pageLinks"> + <xsl:with-param name="begin" select="$begin + 1" /> + <xsl:with-param name="end" select="$end" /> + <xsl:with-param name="current" select="$current" /> + </xsl:call-template> + </xsl:if> +</xsl:template> + </xsl:stylesheet> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
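The page-link logic added in r2603 can be easier to follow outside XSLT. The sketch below restates it in Java as read from the diff: the current page is `floor(startIndex / itemsPerPage) + 1`, pages 1 to 20 are listed while the current page is below 11, and otherwise the window runs from `current - 10` up to (but not including) `current + 11`. The class and method names are invented for illustration, and the `page@start=offset` text format is only a stand-in for the generated `<a>` elements. As in the template, nothing here caps the window at the total number of results.

```java
class PageLinkWindow {
    /** Returns {begin, end, current}: 1-based pages, begin inclusive, end exclusive. */
    static int[] window(int startIndex, int itemsPerPage) {
        // Integer division equals floor() for the non-negative values used here.
        int current = startIndex / itemsPerPage + 1;
        if (current < 11) {
            return new int[] {1, 21, current};
        }
        return new int[] {current - 10, current + 11, current};
    }

    /** Renders the window as text; the current page is emitted without a link target. */
    static String render(int startIndex, int itemsPerPage) {
        int[] w = window(startIndex, itemsPerPage);
        StringBuilder sb = new StringBuilder();
        for (int page = w[0]; page < w[1]; page++) {
            if (page == w[2]) {
                sb.append(page);
            } else {
                // Same offset the template puts in the "start" query parameter.
                sb.append(page).append("@start=").append((page - 1) * itemsPerPage);
            }
            sb.append(' ');
        }
        return sb.toString();
    }
}
```

The XSLT version reaches the same result by recursion (`pageLinks` calling itself with `begin + 1`), since XSLT 1.0 has no loop construct with a mutable counter.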
Revision: 2602 http://archive-access.svn.sourceforge.net/archive-access/?rev=2602&view=rev Author: bradtofel Date: 2008-10-10 03:00:53 +0000 (Fri, 10 Oct 2008) Log Message: ----------- INITIAL REV (ARI-619): allows filtering of records based on either inclusion or exclusion list of HTTP response codes. Added Paths: ----------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/HttpCodeFilter.java Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/HttpCodeFilter.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/HttpCodeFilter.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/HttpCodeFilter.java 2008-10-10 03:00:53 UTC (rev 2602) @@ -0,0 +1,74 @@ +package org.archive.wayback.resourceindex.filters; + +import java.util.ArrayList; +import java.util.HashMap; +import java.util.List; +import java.util.Map; + +import org.archive.wayback.core.CaptureSearchResult; +import org.archive.wayback.util.ObjectFilter; + +/** + * ObjectFilter which allows including or excluding results based on the + * Http response code. 
+ * + * @author brad + * @version $Date$, $Rev$ + */ +public class HttpCodeFilter implements ObjectFilter<CaptureSearchResult> { + + private Map<String,Object> includes = null; + private Map<String,Object> excludes = null; + + private static Map<String,Object> listToMap(List<String> list) { + if(list == null) { + return null; + } + HashMap<String, Object> map = new HashMap<String, Object>(); + for(String s : list) { + map.put(s, null); + } + return map; + } + private static List<String> mapToList(Map<String,Object> map) { + if(map == null) { + return null; + } + List<String> list = new ArrayList<String>(); + list.addAll(map.keySet()); + return list; + } + + public List<String> getIncludes() { + return mapToList(includes); + } + + public void setIncludes(List<String> includes) { + this.includes = listToMap(includes); + } + + + public List<String> getExcludes() { + return mapToList(excludes); + } + + + public void setExcludes(List<String> excludes) { + this.excludes = listToMap(excludes); + } + + public int filterObject(CaptureSearchResult o) { + String code = o.getHttpCode(); + if(excludes != null) { + if(excludes.containsKey(code)) { + return FILTER_EXCLUDE; + } + } + if(includes != null) { + if(!includes.containsKey(code)) { + return FILTER_EXCLUDE; + } + } + return FILTER_INCLUDE; + } +} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
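The filter in r2602 stores its codes as keys of a `Map` whose values are always `null`, which is effectively a set. A minimal sketch of the same exclude-then-include decision using `java.util.Set` follows. It is an illustration only: the class name, the `filterCode` method, and the numeric values of the two constants are assumptions, and the real class takes a `CaptureSearchResult` and uses the constants defined by Wayback's `ObjectFilter`.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class HttpCodeFilterSketch {
    // Hypothetical values; the real constants come from Wayback's ObjectFilter.
    static final int FILTER_INCLUDE = 0;
    static final int FILTER_EXCLUDE = 1;

    private Set<String> includes = null; // null means "no include list configured"
    private Set<String> excludes = null; // null means "no exclude list configured"

    void setIncludes(List<String> codes) {
        includes = (codes == null) ? null : new HashSet<String>(codes);
    }

    void setExcludes(List<String> codes) {
        excludes = (codes == null) ? null : new HashSet<String>(codes);
    }

    /** Same order as the committed filter: the exclude list wins, then the include list. */
    int filterCode(String code) {
        if (excludes != null && excludes.contains(code)) {
            return FILTER_EXCLUDE;
        }
        if (includes != null && !includes.contains(code)) {
            return FILTER_EXCLUDE;
        }
        return FILTER_INCLUDE;
    }
}
```

With an include list of `200` and `302` and an exclude list of `404`, a `500` record is dropped because it is not on the include list, even though it was never explicitly excluded.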
From: <bra...@us...> - 2008-10-10 00:30:39
Revision: 2601
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2601&view=rev
Author:   bradtofel
Date:     2008-10-10 00:30:29 +0000 (Fri, 10 Oct 2008)

Log Message:
-----------
BUGFIX (unreported): do not forward WaybackRequest fields with 'null' values.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java	2008-10-10 00:26:15 UTC (rev 2600)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java	2008-10-10 00:30:29 UTC (rev 2601)
@@ -822,6 +822,7 @@
 			}
 			if(isStandard) continue;
 			String val = filters.get(key);
+			if(val == null) continue;
 			if (queryString.length() > 0) {
 				queryString.append(" ");
 			}
Revision: 2600
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2600&view=rev
Author:   bradtofel
Date:     2008-10-10 00:26:15 +0000 (Fri, 10 Oct 2008)

Log Message:
-----------
INITIAL-REV: (ACC-35) prefix original HTTP headers with X-Archive-Orig-

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/XArchiveHttpHeaderProcessor.java

Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/XArchiveHttpHeaderProcessor.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/XArchiveHttpHeaderProcessor.java	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/XArchiveHttpHeaderProcessor.java	2008-10-10 00:26:15 UTC (rev 2600)
@@ -0,0 +1,34 @@
+package org.archive.wayback.replay;
+
+import java.util.Map;
+
+import org.archive.wayback.ResultURIConverter;
+import org.archive.wayback.core.CaptureSearchResult;
+
+public class XArchiveHttpHeaderProcessor implements HttpHeaderProcessor {
+
+	private static String DEFAULT_PREFIX = "X-Wayback-Orig-";
+	private String prefix = DEFAULT_PREFIX;
+
+	public String getPrefix() {
+		return prefix;
+	}
+
+	public void setPrefix(String prefix) {
+		this.prefix = prefix;
+	}
+
+	public void filter(Map<String, String> output, String key, String value,
+			ResultURIConverter uriConverter, CaptureSearchResult result) {
+		String keyUp = key.toUpperCase();
+
+		// rewrite Location header URLs
+		if (keyUp.startsWith(HTTP_CONTENT_TYPE_HEADER_UP)) {
+			// let's leave this one alone... seems important.
+			output.put(key, value);
+		} else {
+			// others go out with prefix:
+			output.put(prefix + key,value);
+		}
+	}
+}
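Two details in r2600 are worth noting when reading the diff: the log message names the prefix `X-Archive-Orig-` while the committed default is `X-Wayback-Orig-`, and the comment "rewrite Location header URLs" sits above code that only special-cases the Content-Type header. The sketch below illustrates the behaviour the code itself implements, applied to a whole header map at once. It is a sketch under assumptions: the class and method names are invented, and `CONTENT_TYPE_UP` is assumed to mirror what `HTTP_CONTENT_TYPE_HEADER_UP` holds in Wayback, which the diff does not show.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

class HeaderPrefixSketch {
    // Assumption: stands in for HTTP_CONTENT_TYPE_HEADER_UP, whose value is not in the diff.
    static final String CONTENT_TYPE_UP = "CONTENT-TYPE";

    /** Passes Content-Type through unchanged and prefixes every other header name. */
    static Map<String, String> prefixHeaders(Map<String, String> original, String prefix) {
        Map<String, String> out = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> e : original.entrySet()) {
            // Locale.ROOT avoids locale-specific upper-casing surprises.
            String keyUp = e.getKey().toUpperCase(Locale.ROOT);
            if (keyUp.startsWith(CONTENT_TYPE_UP)) {
                out.put(e.getKey(), e.getValue());
            } else {
                out.put(prefix + e.getKey(), e.getValue());
            }
        }
        return out;
    }
}
```

Because the processor exposes `setPrefix`, a deployment that wants the prefix named in the log message could presumably configure it rather than rely on the default.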
Revision: 2599
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2599&view=rev
Author:   bradtofel
Date:     2008-10-09 23:39:15 +0000 (Thu, 09 Oct 2008)

Log Message:
-----------
INITIAL REV: experimental archival URL query specification for proxy replay mode.

Added Paths:
-----------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/proxy/ProxyArchivalRequestParser.java

Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/proxy/ProxyArchivalRequestParser.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/proxy/ProxyArchivalRequestParser.java	                        (rev 0)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/proxy/ProxyArchivalRequestParser.java	2008-10-09 23:39:15 UTC (rev 2599)
@@ -0,0 +1,34 @@
+package org.archive.wayback.proxy;
+
+import java.util.List;
+
+import org.archive.wayback.RequestParser;
+import org.archive.wayback.archivalurl.requestparser.PathDatePrefixQueryRequestParser;
+import org.archive.wayback.archivalurl.requestparser.PathDateRangeQueryRequestParser;
+import org.archive.wayback.archivalurl.requestparser.PathPrefixDatePrefixQueryRequestParser;
+import org.archive.wayback.archivalurl.requestparser.PathPrefixDateRangeQueryRequestParser;
+import org.archive.wayback.requestparser.FormRequestParser;
+import org.archive.wayback.requestparser.OpenSearchRequestParser;
+
+public class ProxyArchivalRequestParser extends ProxyRequestParser {
+	private ProxyReplayRequestParser prrp = new ProxyReplayRequestParser();
+	protected RequestParser[] getRequestParsers() {
+		prrp.init();
+		RequestParser[] theParsers = {
+				prrp,
+				new PathDatePrefixQueryRequestParser(),
+				new PathDateRangeQueryRequestParser(),
+				new PathPrefixDatePrefixQueryRequestParser(),
+				new PathPrefixDateRangeQueryRequestParser(),
+				new OpenSearchRequestParser(),
+				new FormRequestParser()
+				};
+		return theParsers;
+	}
+	public List<String> getLocalhostNames() {
+		return prrp.getLocalhostNames();
+	}
+	public void setLocalhostNames(List<String> localhostNames) {
+		prrp.setLocalhostNames(localhostNames);
+	}
+}
From: <bi...@us...> - 2008-10-03 23:18:46
Revision: 2598
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2598&view=rev
Author:   binzino
Date:     2008-10-03 23:18:28 +0000 (Fri, 03 Oct 2008)

Log Message:
-----------
Updated info on latest SVN revision of Nutch.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/INSTALL.txt

Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-10-03 23:08:29 UTC (rev 2597)
+++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt	2008-10-03 23:18:28 UTC (rev 2598)
@@ -46,11 +46,11 @@
 Nutch SVN trunk.  The specific SVN revision that NutchWAX 0.12.2 is
 built against is:

-  697964
+  701524

 To checkout this revision of Nutch, use:

-  $ svn checkout -r 697964 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
+  $ svn checkout -r 701524 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
   $ cd nutch
From: <bi...@us...> - 2008-10-03 23:08:35
Revision: 2597 http://archive-access.svn.sourceforge.net/archive-access/?rev=2597&view=rev Author: binzino Date: 2008-10-03 23:08:29 +0000 (Fri, 03 Oct 2008) Log Message: ----------- Initial revision. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/web/ trunk/archive-access/projects/nutchwax/archive/src/web/style/ trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl trunk/archive-access/projects/nutchwax/archive/src/web/web.xml Added: trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/web/style/search.xsl 2008-10-03 23:08:29 UTC (rev 2597) @@ -0,0 +1,153 @@ +<?xml version="1.0" encoding="utf-8" ?> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> +<xsl:stylesheet + version="1.0" + xmlns:xsl="http://www.w3.org/1999/XSL/Transform" + xmlns:nutch="http://www.nutch.org/opensearchrss/1.0/" + xmlns:opensearch="http://a9.com/-/spec/opensearchrss/1.0/" +> +<xsl:output method="xml" /> + +<xsl:template match="rss/channel"> + <html xmlns="http://www.w3.org/1999/xhtml"> + <head> + <title><xsl:value-of select="title" /></title> + <style media="all" lang="en" type="text/css"> + body + { + padding : 20px; + margin : 0; + font-family : Verdana; sans-serif; + font-size : 9pt; + color : #000000; + background-color: #ffffff; + } + .pageTitle + { + font-size : 125% ; + font-weight : bold ; + text-align : center ; + padding-bottom : 2em ; + } + .searchForm + { + margin : 20px 0 5px 0; + padding-bottom : 0px; + border-bottom : 1px solid black; + } + .searchResult + { + margin : 0; + padding : 0; + } + .searchResult h1 + { + margin : 0 0 5px 0 ; + padding : 0 ; + font-size : 120%; + } + .searchResult .details + { + font-size: 80%; + color: green; + } + .searchResult .dates + { + font-size: 80%; + } + .searchResult .dates a + { + color: #3366cc; + } + form#searchForm + { + margin : 0; padding: 0 0 10px 0; + } + .searchFields + { + padding : 3px 0; + } + .searchFields input + { + margin : 0 0 0 15px; + padding : 0; + } + input#query + { + margin : 0; + } + ol + { + margin : 5px 0 0 0; + padding : 0 0 0 2em; + } + ol li + { + margin : 0 0 15px 0; + } + </style> + </head> + <body> + <!-- Page header: title and search form --> + <div class="pageTitle" > + NutchWAX Sample XSLT + </div> + <div> + This simple XSLT demonstrates the transformation of OpenSearch XML results into a fully-functional, human-friendly HTML search page. No JSP needed. 
+ </div> + <div class="searchForm"> + <form id="searchForm" name="searchForm" method="get" action="search" > + <span class="searchFields"> + Search for + <input id="query" name="query" type="text" size="40" value="{nutch:query}" /> + <input type="submit" value="Search"/> + </span> + </form> + </div> + <div style="font-size: 8pt; margin:0; padding:0 0 0.5em 0;">Results <xsl:value-of select="opensearch:startIndex + 1" />-<xsl:value-of select="opensearch:startIndex + opensearch:itemsPerPage" /> of about <xsl:value-of select="opensearch:totalResults" /> <span style="margin-left: 1em;"><a href="{nutch:nextPage}">Next</a></span></div> + <!-- Search results --> + <ol start="{opensearch:startIndex + 1}"> + <xsl:apply-templates select="item" /> + </ol> + <a href="{nutch:nextPage}">Next</a> + </body> +</html> +</xsl:template> + +<xsl:template match="item"> + <li> + <div class="searchResult"> + <h1><a href="{concat('http://wayback.archive-it.org/',nutch:collection,'/',nutch:date,'/',link)}"><xsl:value-of select="title" /></a></h1> + <div> + <xsl:value-of select="description" /> + </div> + <div class="details"> + <xsl:value-of select="link" /> - <xsl:value-of select="round( nutch:length div 1024 )"/>k - <xsl:value-of select="nutch:type" /> + </div> + <div class="dates"> + <a href="{concat('http://wayback.archive-it.org/',nutch:collection,'/*/',link)}">All versions</a> - <a href="?query={../nutch:query} site:{nutch:site}&hitsPerSite=0">More from <xsl:value-of select="nutch:site" /></a> + </div> + </div> + </li> +</xsl:template> + +<xsl:template match="nutch:date" > + <xsl:value-of select="substring(.,1,4)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,5,2)" /><xsl:text>-</xsl:text><xsl:value-of select="substring(.,7,2)" /><xsl:text> </xsl:text> +</xsl:template> + +</xsl:stylesheet> Added: trunk/archive-access/projects/nutchwax/archive/src/web/web.xml =================================================================== --- 
trunk/archive-access/projects/nutchwax/archive/src/web/web.xml (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/web/web.xml 2008-10-03 23:08:29 UTC (rev 2597) @@ -0,0 +1,80 @@ +<?xml version="1.0" encoding="ISO-8859-1"?> +<!DOCTYPE web-app + PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN" + "http://java.sun.com/dtd/web-app_2_3.dtd"> +<!-- + Licensed to the Apache Software Foundation (ASF) under one or more + contributor license agreements. See the NOTICE file distributed with + this work for additional information regarding copyright ownership. + The ASF licenses this file to You under the Apache License, Version 2.0 + (the "License"); you may not use this file except in compliance with + the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
+--> +<web-app> + +<!-- order is very important here --> + +<listener> + <listener-class>org.apache.nutch.searcher.NutchBean$NutchBeanConstructor</listener-class> + <listener-class>org.archive.nutchwax.NutchWaxBean$NutchWaxBeanConstructor</listener-class> +</listener> + +<servlet> + <servlet-name>Cached</servlet-name> + <servlet-class>org.apache.nutch.servlet.Cached</servlet-class> +</servlet> + +<servlet> + <servlet-name>OpenSearch</servlet-name> + <servlet-class>org.apache.nutch.searcher.OpenSearchServlet</servlet-class> +</servlet> + +<servlet-mapping> + <servlet-name>Cached</servlet-name> + <url-pattern>/servlet/cached</url-pattern> +</servlet-mapping> + +<servlet-mapping> + <servlet-name>OpenSearch</servlet-name> + <url-pattern>/opensearch</url-pattern> +</servlet-mapping> + +<servlet-mapping> + <servlet-name>OpenSearch</servlet-name> + <url-pattern>/search</url-pattern> +</servlet-mapping> + +<filter> + <filter-name>XSLT Filter</filter-name> + <filter-class>org.archive.nutchwax.XSLTFilter</filter-class> + <init-param> + <param-name>xsltUrl</param-name> + <param-value>style/search.xsl</param-value> + </init-param> +</filter> + +<filter-mapping> + <filter-name>XSLT Filter</filter-name> + <url-pattern>/search</url-pattern> +</filter-mapping> + +<welcome-file-list> + <welcome-file>search.html</welcome-file> + <welcome-file>index.html</welcome-file> + <welcome-file>index.jsp</welcome-file> +</welcome-file-list> + +<taglib> + <taglib-uri>http://jakarta.apache.org/taglibs/i18n</taglib-uri> + <taglib-location>/WEB-INF/taglibs-i18n.tld</taglib-location> + </taglib> + +</web-app> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-09-26 20:38:37
Revision: 2596 http://archive-access.svn.sourceforge.net/archive-access/?rev=2596&view=rev Author: binzino Date: 2008-09-26 20:38:26 +0000 (Fri, 26 Sep 2008) Log Message: ----------- Fix WAX-25: Add new utility to dump the unique values of a field in an index. Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/GetUniqFieldValues.java Added: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/GetUniqFieldValues.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/GetUniqFieldValues.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/GetUniqFieldValues.java 2008-09-26 20:38:26 UTC (rev 2596) @@ -0,0 +1,88 @@ +/* + * Copyright (C) 2008 Internet Archive. + * + * This file is part of the archive-access tools project + * (http://sourceforge.net/projects/archive-access). + * + * The archive-access tools are free software; you can redistribute them and/or + * modify them under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or any + * later version. + * + * The archive-access tools are distributed in the hope that they will be + * useful, but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser + * Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser Public License along with + * the archive-access tools; if not, write to the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ +package org.archive.nutchwax.tools; + +import java.io.File; +import java.util.Iterator; +import java.util.Set; +import java.util.HashSet; +import java.util.Collection; + +import org.apache.lucene.index.IndexReader; + +/** + * A quick-n-dirty command-line utility to get the unique values for a + * field in an index and print them to stdout. + */ +public class GetUniqFieldValues +{ + public static void main(String[] args) throws Exception + { + String fieldName = ""; + String indexDir = ""; + + if ( args.length == 2 ) + { + fieldName = args[0]; + indexDir = args[1]; + } + + if (! (new File(indexDir)).exists()) + { + usageAndExit(); + } + + dumpUniqValues( fieldName, indexDir ); + } + + private static void dumpUniqValues( String fieldName, String indexDir ) throws Exception + { + IndexReader reader = IndexReader.open(indexDir); + + Collection fieldNames = reader.getFieldNames( IndexReader.FieldOption.ALL ); + + if ( ! fieldNames.contains( fieldName ) ) + { + System.out.println( "Field not in index: " + fieldName ); + System.exit( 2 ); + } + + int numDocs = reader.numDocs(); + Set<String> values = new HashSet<String>( ); + + for ( int i = 0; i < numDocs; i++ ) + { + values.add( reader.document(i).get( fieldName ) ); + } + + for ( String v : values ) + { + System.out.println( v ); + } + + } + + private static void usageAndExit() + { + System.out.println("Usage: GetUniqFieldValues field index"); + System.exit(1); + } +} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
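The utility in r2596 loops from 0 to `numDocs()` and adds every value to a `HashSet`. In Lucene of that era, `numDocs()` counts only live documents while document numbers run up to `maxDoc()`, so on an index with deletions the loop may stop short or hit a deleted slot; a document lacking the field also contributes a `null` entry. The sketch below shows one way to guard against both, with sorted output. It is deliberately written against a hypothetical `DocSource` interface rather than the Lucene API, so every name in it is an assumption made for illustration.

```java
import java.util.Set;
import java.util.TreeSet;

class UniqueFieldValuesSketch {
    /** Hypothetical stand-in for an index reader; not a Lucene type. */
    interface DocSource {
        int maxDoc();                // one past the largest document number
        boolean isDeleted(int doc);  // true for slots left behind by deletions
        String fieldValue(int doc);  // null when the document lacks the field
    }

    /** Collects the distinct, non-null field values of all live documents, sorted. */
    static Set<String> uniqueValues(DocSource source) {
        Set<String> values = new TreeSet<String>();
        for (int i = 0; i < source.maxDoc(); i++) {
            if (source.isDeleted(i)) {
                continue; // skip deleted slots instead of reading them
            }
            String v = source.fieldValue(i);
            if (v != null) {
                values.add(v); // documents without the field contribute nothing
            }
        }
        return values;
    }
}
```

On an optimized index with no deletions and the field present on every document, this and the committed loop would collect the same values.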
From: <bi...@us...> - 2008-09-25 21:13:31
Revision: 2595
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2595&view=rev
Author:   binzino
Date:     2008-09-25 21:13:25 +0000 (Thu, 25 Sep 2008)

Log Message:
-----------
Added try/catch around use of UrlCanonicalizer so that we ignore URIs that are malformed.  A warning is emitted to stderr.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java	2008-09-22 19:55:50 UTC (rev 2594)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java	2008-09-25 21:13:25 UTC (rev 2595)
@@ -137,14 +137,27 @@
             newDoc.add( new Field( NutchWax.DATE_KEY, date, Field.Store.YES, Field.Index.UN_TOKENIZED ) );
           }

-        // First, apply URL canonicalization from Wayback
-        String canonicalizedUrl = canonicalizer.urlStringToKey( oldDoc.get( NutchWax.URL_KEY ) );
+        // Obtain the new dates for the document.
+        String newDates = null;
+        try
+          {
+            // First, apply URL canonicalization from Wayback
+            String canonicalizedUrl = canonicalizer.urlStringToKey( oldDoc.get( NutchWax.URL_KEY ) );

-        // Now, get the digest+ URL of the document, look for it in
-        // the updateRecords and if found, add the date.
-        String key = canonicalizedUrl + oldDoc.get( NutchWax.DIGEST_KEY );
+            // Now, get the digest+URL of the document, look for it in
+            // the updateRecords and if found, add the date.
+            String key = canonicalizedUrl + oldDoc.get( NutchWax.DIGEST_KEY );
+
+            newDates = dateRecords.get( key );
+          }
+        catch ( Exception e )
+          {
+            // The canonicalizer can throw various types of exceptions
+            // due to malformed URIs.
+            System.err.println( "WARN: Not adding dates on malformed URI: " + oldDoc.get( NutchWax.URL_KEY ) );
+          }

-        String newDates = dateRecords.get( key );
+        // If there are any new dates, add them to the new document.
         if ( newDates != null )
           {
             for ( String date : newDates.split("\\s+") )
@@ -153,6 +166,7 @@
               }
           }

+        // Finally, add the new document to the new index.
         writer.addDocument( newDoc );
       }
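The pattern introduced in r2595 is "attempt the lookup, and on any canonicalization failure warn and carry on with no new dates". A small standalone sketch of that control flow follows. It does not use Wayback's `UrlCanonicalizer`; the `canonicalize` helper built on `java.net.URI` is a simplified stand-in assumed purely for illustration, as are the class and method names.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Map;

class DateLookupSketch {
    /** Simplified stand-in for Wayback's canonicalizer: parse, normalize, lower-case. */
    static String canonicalize(String url) throws URISyntaxException {
        return new URI(url).normalize().toString().toLowerCase(Locale.ROOT);
    }

    /**
     * Looks up the whitespace-separated dates recorded under (canonical URL + digest).
     * Returns null when nothing is recorded or when the URL cannot be canonicalized.
     */
    static String lookupDates(Map<String, String> dateRecords, String url, String digest) {
        try {
            String key = canonicalize(url) + digest;
            return dateRecords.get(key);
        } catch (Exception e) {
            // Mirrors the commit: warn on stderr and keep processing the document.
            System.err.println("WARN: Not adding dates on malformed URI: " + url);
            return null;
        }
    }
}
```

Returning `null` on failure lets the caller keep a single `if (newDates != null)` check, which is how the committed code still writes the document to the new index with its existing dates.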
From: <bi...@us...> - 2008-09-22 19:56:03
Revision: 2594 http://archive-access.svn.sourceforge.net/archive-access/?rev=2594&view=rev Author: binzino Date: 2008-09-22 19:55:50 +0000 (Mon, 22 Sep 2008) Log Message: ----------- Updates in anticipation for 0.12.2 release. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt trunk/archive-access/projects/nutchwax/archive/README.txt trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt Modified: trunk/archive-access/projects/nutchwax/archive/INSTALL.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-09-22 19:08:40 UTC (rev 2593) +++ trunk/archive-access/projects/nutchwax/archive/INSTALL.txt 2008-09-22 19:55:50 UTC (rev 2594) @@ -1,6 +1,6 @@ INSTALL.txt -2008-07-28 +2008-10-01 Aaron Binns This installation guide assumes the reader is already familiar with @@ -43,14 +43,14 @@ ------------- As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Nutch doesn't have a 1.0 release package yet, so we have to use the -Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12 is +Nutch SVN trunk. The specific SVN revision that NutchWAX 0.12.2 is built against is: - 676736 + 697964 To checkout this revision of Nutch, use: - $ svn checkout -r 676736 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch + $ svn checkout -r 697964 http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch $ cd nutch Modified: trunk/archive-access/projects/nutchwax/archive/README.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/README.txt 2008-09-22 19:08:40 UTC (rev 2593) +++ trunk/archive-access/projects/nutchwax/archive/README.txt 2008-09-22 19:55:50 UTC (rev 2594) @@ -1,9 +1,9 @@ README.txt -2008-07-25 +2008-10-01 Aaron Binns -Welcome to NutchWAX 0.12.1! +Welcome to NutchWAX 0.12.2! NutchWAX is a set of add-ons to Nutch in order to index and search archived web data. 
Modified: trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt =================================================================== --- trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-09-22 19:08:40 UTC (rev 2593) +++ trunk/archive-access/projects/nutchwax/archive/RELEASE-NOTES.txt 2008-09-22 19:55:50 UTC (rev 2594) @@ -1,9 +1,9 @@ RELEASE-NOTES.TXT -2007-07-25 +2008-10-01 Aaron Binns -Release notes for NutchWAX 0.12.1 +Release notes for NutchWAX 0.12.2 For the most recent updates and information on NutchWAX, please visit the project wiki at: @@ -15,9 +15,8 @@ Overview ====================================================================== -NutchWAX 0.12.1 contains some minor enhancements and fixes to NutchWAX -0.12. One of the driving forces behind some of the enhancements was -integration with the Wayback machine. +NutchWAX 0.12.2 contains some minor enhancements and fixes to NutchWAX +0.12.1. ====================================================================== Issues @@ -29,24 +28,16 @@ Issues resolved in this release: -WAX-16 - Option to skip ARC record import based on HTTP status code of - content +WAX-23 + Add a "field setter" filter to set a field to a static value in the + Lucene document during indexing. -WAX-12 - Add metadata field "fileoffset" +WAX-22 + Various code clean-ups based on code review using PMD tool. -WAX-11 - Change metadata field name in search results from "arcname" to - "filename" +WAX-21 + Allow for blank lines and comment lines in manifest file. -WAX-10 - Add "exacturl" metadata field to indexing so it can be searched - as-is, not parsed/tokenized like the "url" field. - -WAX-6 - Change DateAdder to allow for implementation of URLCanonicalizer to - be defined in property. - -WAX-4 - Implementor/user-provided XSLT for OpenSearch results +WAX-19 + Add strict/loose option to DateAdder for revisit lines with extra + data on end. 
From: <bi...@us...> - 2008-09-22 19:08:51
Revision: 2593 http://archive-access.svn.sourceforge.net/archive-access/?rev=2593&view=rev Author: binzino Date: 2008-09-22 19:08:40 +0000 (Mon, 22 Sep 2008) Log Message: ----------- Initial revision of FieldSetter filter. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/plugin.xml Added Paths: ----------- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/plugin.xml =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/plugin.xml 2008-09-22 18:40:08 UTC (rev 2592) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/plugin.xml 2008-09-22 19:08:40 UTC (rev 2593) @@ -38,8 +38,15 @@ <extension id="org.archive.nutchwax.index" name="Configurable Indexing Filter" point="org.apache.nutch.indexer.IndexingFilter"> - <implementation id="ConfigurableIndexingFilter" - class="org.archive.nutchwax.index.ConfigurableIndexingFilter" /> + <implementation id="ConfigurableIndexingFilter" + class="org.archive.nutchwax.index.ConfigurableIndexingFilter" /> </extension> + <extension id="org.archive.nutchwax.index" + name="Field Setter" + point="org.apache.nutch.indexer.IndexingFilter"> + <implementation id="FieldSetter" + class="org.archive.nutchwax.index.FieldSetter" /> + </extension> + </plugin> Added: trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java (rev 0) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/FieldSetter.java 
2008-09-22 19:08:40 UTC (rev 2593) @@ -0,0 +1,178 @@ +/* + * Copyright (C) 2008 Internet Archive. + * + * This file is part of the archive-access tools project + * (http://sourceforge.net/projects/archive-access). + * + * The archive-access tools are free software; you can redistribute them and/or + * modify them under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or any + * later version. + * + * The archive-access tools are distributed in the hope that they will be + * useful, but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser + * Public License for more details. + * + * You should have received a copy of the GNU Lesser Public License along with + * the archive-access tools; if not, write to the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ +package org.archive.nutchwax.index; + +import java.util.List; +import java.util.ArrayList; + +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; +import org.apache.lucene.document.Document; +import org.apache.lucene.document.Field; +import org.apache.hadoop.conf.Configuration; +import org.apache.hadoop.io.Text; +import org.apache.nutch.crawl.CrawlDatum; +import org.apache.nutch.crawl.Inlinks; +import org.apache.nutch.indexer.IndexingException; +import org.apache.nutch.indexer.IndexingFilter; +import org.apache.nutch.metadata.Metadata; +import org.apache.nutch.parse.Parse; + +/** + * <p>Indexing filter that assigns a static value to a field in the Lucene document. 
+ * It can also be used to remove a field from the Lucene document by specifying + * an empty value for a field.</p> + * <p>This filter is pretty simple and is configured via the + * <code>nutchwax.filter.index.setField</code> property.</p> + * <p> + * The configuration syntax is a list of white-space delimited key=value + * pairs of the form: + * </p><pre> key=value + * key:stored=value + * key:stored:tokenized=value</pre> + * <p>where <code>stored</code> and <code>tokenized</code> are boolean + * values. This syntax is similar to that of the + * <code>ConfigurableIndexingFilter</code> and + * <code>ConfigurableQueryFilter</code>.</p> + * <p>The default values of these properties are + * <code>stored=true</code> and <code>tokenized=false</code>. The + * field value <b>is</b> indexed, whether or not it is stored or + * tokenized.</p> + * <p>If no value, or an empty value is given in the configuration, then + * the field is removed from the document all together.</p> + * <p>This filter is primarily used in situations where the NutchWAX operator + * makes a mistake and needs to set a field value on all the indexed + * documents. 
For example, if the operator forgets to include the collection + * name during import, then this filter can be used to set the <code>collection</code> + * field during indexing.</p> + */ +public class FieldSetter implements IndexingFilter +{ + public static final Log LOG = LogFactory.getLog( FieldSetter.class ); + + private Configuration conf; + private List<FieldSetting> settings; + + public void setConf( Configuration conf ) + { + this.conf = conf; + + String s = conf.get( "nutchwax.filter.index.setField" ); + + if ( null == s ) + { + return ; + } + + s = s.trim( ); + + if ( s.length( ) == 0 ) + { + return ; + } + + this.settings = new ArrayList<FieldSetting>( ); + + // Get field setting, by splitting: "foo=bar frotz=baz" into ["foo=bar","frotz=baz"] + for ( String field : s.split( "\\s+" ) ) + { + // Split: "foo:true:false=bar" into ["foo:true:false","bar"] + String[] fieldParts = field.split("="); + + // Split: "foo:true:false" into ["foo","true","false"] + String[] keyParts = fieldParts[0].split( "[:]" ); + + String key = keyParts[0]; + boolean store = true; + boolean tokenize = false; + switch ( keyParts.length ) + { + default: + LOG.warn( "Extra fields in field setting ignored: " + fieldParts[0] ); + case 3: + tokenize = Boolean.parseBoolean( keyParts[2] ); + case 2: + store = Boolean.parseBoolean( keyParts[1] ); + case 1: + // nothing to do + ; + } + + String value = fieldParts.length > 1 ? 
fieldParts[1] : null; + + LOG.info( "Add field setting: " + key + "[" + store + ":" + tokenize + "] = " + value ); + + this.settings.add( new FieldSetting( key, store, tokenize, value ) ); + } + } + + private static class FieldSetting + { + String key; + boolean store; + boolean tokenize; + String value; + + public FieldSetting( String key, boolean store, boolean tokenize, String value ) + { + this.key = key; + this.store = store; + this.tokenize = tokenize; + this.value = value; + } + } + + public Configuration getConf() + { + return this.conf; + } + + /** + * <p>Set Lucene document field to fixed value.</p> + * <p> + * Remove field if specified value is <code>null</code>. + * </p> + */ + public Document filter( Document doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks ) + throws IndexingException + { + Metadata meta = parse.getData().getContentMeta(); + + for ( FieldSetting setting : this.settings ) + { + // First, remove the existing field. + doc.removeFields( setting.key ); + + // Add the value if it is given. + if ( setting.value != null ) + { + doc.add( new Field( setting.key, + setting.value, + setting.store ? Field.Store.YES : Field.Store.NO, + setting.tokenize ? Field.Index.TOKENIZED : Field.Index.UN_TOKENIZED ) ); + } + } + + return doc; + } + + +} This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
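Editor's note: the key[:stored[:tokenized]]=value syntax described in the FieldSetter Javadoc can be illustrated without Nutch or Lucene on the classpath. The sketch below is a standalone approximation, not the committed class: FieldSettingSketch is a hypothetical name, and it departs from the commit in two labeled ways (it splits on the first '=' only, and it returns an empty list instead of leaving the settings unset when the property is missing).

```java
import java.util.ArrayList;
import java.util.List;

final class FieldSettingSketch {
    final String key;
    final boolean store;
    final boolean tokenize;
    final String value; // null means "remove the field"

    FieldSettingSketch(String key, boolean store, boolean tokenize, String value) {
        this.key = key;
        this.store = store;
        this.tokenize = tokenize;
        this.value = value;
    }

    // Parses a whitespace-delimited list such as "foo=bar frotz:false:true=baz".
    static List<FieldSettingSketch> parse(String config) {
        List<FieldSettingSketch> settings = new ArrayList<FieldSettingSketch>();
        if (config == null || config.trim().isEmpty()) {
            return settings; // assumption: an empty list rather than null
        }
        for (String field : config.trim().split("\\s+")) {
            // Assumption: split on the first '=' only, so values may contain '='.
            String[] fieldParts = field.split("=", 2);
            String[] keyParts = fieldParts[0].split(":");

            // Defaults follow the Javadoc: stored=true, tokenized=false.
            boolean store = keyParts.length > 1 ? Boolean.parseBoolean(keyParts[1]) : true;
            boolean tokenize = keyParts.length > 2 && Boolean.parseBoolean(keyParts[2]);

            // A missing or empty value marks the field for removal.
            String value = (fieldParts.length > 1 && !fieldParts[1].isEmpty())
                    ? fieldParts[1] : null;

            settings.add(new FieldSettingSketch(keyParts[0], store, tokenize, value));
        }
        return settings;
    }
}
```

For example, parse("collection=test title:false:true=x frotz=") yields three settings: collection is stored and untokenized, title is unstored and tokenized, and frotz has a null value, meaning the field would be removed.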
From: <bi...@us...> - 2008-09-22 18:40:19
Revision: 2592 http://archive-access.svn.sourceforge.net/archive-access/?rev=2592&view=rev Author: binzino Date: 2008-09-22 18:40:08 +0000 (Mon, 22 Sep 2008) Log Message: ----------- WAX-21: Allow for blank lines and comment lines in manifest file. Comment lines start with '#'. Extra whitespace at the start/end of all lines is also eliminated. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2008-09-22 18:07:59 UTC (rev 2591) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2008-09-22 18:40:08 UTC (rev 2592) @@ -19,7 +19,6 @@ import java.io.IOException; import java.net.MalformedURLException; import java.util.Map.Entry; -import java.util.Iterator; import java.util.List; import java.util.ArrayList; @@ -37,8 +36,6 @@ import org.apache.hadoop.mapred.OutputCollector; import org.apache.hadoop.mapred.Reporter; import org.apache.hadoop.mapred.TextInputFormat; -import org.apache.hadoop.mapred.TextOutputFormat; -import org.apache.hadoop.util.StringUtils; import org.apache.hadoop.util.Tool; import org.apache.hadoop.util.ToolRunner; import org.apache.nutch.crawl.CrawlDatum; @@ -59,17 +56,14 @@ import org.apache.nutch.protocol.Content; import org.apache.nutch.protocol.ProtocolStatus; import org.apache.nutch.scoring.ScoringFilters; -import org.apache.nutch.util.LogUtil; import org.apache.nutch.util.NutchConfiguration; import org.apache.nutch.util.NutchJob; import org.apache.nutch.util.StringUtil; import org.archive.io.ArchiveReader; import org.archive.io.ArchiveReaderFactory; -import org.archive.io.ArchiveRecordHeader; import org.archive.io.arc.ARCRecord; import org.archive.io.arc.ARCRecordMetaData; 
-import org.archive.io.warc.WARCConstants; /** @@ -175,14 +169,22 @@ String arcUrl = ""; String collection = ""; String segmentName = getConf().get( Nutch.SEGMENT_NAME_KEY ); - + + // First, ignore blank manifest lines, and those that are comments. + String line = value.toString().trim( ); + if ( line.length() == 0 || line.charAt( 0 ) == '#' ) + { + // Ignore it. + return ; + } + // Each line of the manifest is "<url> <collection>" where <collection> is optional - String[] line = value.toString().split( "\\s+" ); - arcUrl = line[0]; + String[] parts = line.split( "\\s+" ); + arcUrl = parts[0]; - if ( line.length > 1 ) + if ( parts.length > 1 ) { - collection = line[1]; + collection = parts[1]; } if ( LOG.isInfoEnabled() ) LOG.info( "Importing ARC: " + arcUrl ); This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
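Editor's note: the manifest handling introduced in rev 2592 can be tried in isolation. The sketch below mirrors the logic in the diff (trim, skip blank and '#' lines, split on whitespace, optional collection); the class and method names are hypothetical and the Hadoop map() plumbing is left out.

```java
final class ManifestLineSketch {
    // Returns {url, collection} for a manifest line, or null when the line
    // is blank or a comment. The collection is "" when the line omits it.
    static String[] parse(String rawLine) {
        // Leading and trailing whitespace is ignored on every line.
        String line = rawLine.trim();
        if (line.length() == 0 || line.charAt(0) == '#') {
            return null; // blank line or comment: nothing to import
        }

        // Each line is "<url> <collection>", where <collection> is optional.
        String[] parts = line.split("\\s+");
        String collection = parts.length > 1 ? parts[1] : "";
        return new String[] { parts[0], collection };
    }
}
```

Trimming before the split matters: without it, a line with leading whitespace would split into an empty first element, which would then be taken as the ARC URL.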
From: <bi...@us...> - 2008-09-22 18:08:11
Revision: 2591 http://archive-access.svn.sourceforge.net/archive-access/?rev=2591&view=rev Author: binzino Date: 2008-09-22 18:07:59 +0000 (Mon, 22 Sep 2008) Log Message: ----------- WAX-22: Various code clean-ups based on code review using PMD tool. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/NutchWaxBean.java trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/XSLTFilter.java trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java trunk/archive-access/projects/nutchwax/archive/src/plugin/urlfilter-nutchwax/src/java/org/archive/nutchwax/urlfilter/WaybackURLFilter.java Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/NutchWaxBean.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2008-09-04 21:36:09 UTC (rev 2590) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2008-09-22 18:07:59 UTC (rev 2591) @@ -16,10 +16,12 @@ */ package org.archive.nutchwax; -import java.io.*; +//import java.io.*; import java.util.*; import java.lang.reflect.Field; -import javax.servlet.*; +import javax.servlet.ServletContext; +import javax.servlet.ServletContextEvent; +import javax.servlet.ServletContextListener; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; @@ -34,7 +36,6 @@ import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; -import org.apache.hadoop.io.Closeable; import org.apache.hadoop.conf.Configuration; 
import org.apache.lucene.index.IndexReader; Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/XSLTFilter.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/XSLTFilter.java 2008-09-04 21:36:09 UTC (rev 2590) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/XSLTFilter.java 2008-09-22 18:07:59 UTC (rev 2591) @@ -25,7 +25,6 @@ import java.io.PrintWriter; import java.io.ByteArrayInputStream; import java.io.ByteArrayOutputStream; -import java.io.CharArrayWriter; import javax.servlet.Filter; import javax.servlet.FilterChain; @@ -35,15 +34,14 @@ import javax.servlet.ServletOutputStream; import javax.servlet.ServletRequest; import javax.servlet.ServletResponse; -import javax.servlet.ServletResponseWrapper; -import javax.servlet.http.*; +import javax.servlet.http.HttpServletResponse; +import javax.servlet.http.HttpServletResponseWrapper; import javax.xml.transform.Source; import javax.xml.transform.stream.StreamSource; import javax.xml.transform.Templates; import javax.xml.transform.TransformerFactory; import javax.xml.transform.Transformer; -import javax.xml.transform.dom.DOMSource; import javax.xml.transform.stream.StreamResult; @@ -55,8 +53,6 @@ public void init( FilterConfig config ) throws ServletException { - ServletContext app = config.getServletContext( ); - this.xsltUrl = config.getInitParameter( "xsltUrl" ); if ( this.xsltUrl != null ) @@ -116,9 +112,11 @@ } catch ( javax.xml.transform.TransformerConfigurationException tce ) { + // TODO: Re-throw, or log it and eat it? } catch( javax.xml.transform.TransformerException te ) { + // TODO: Re-throw, or log it and eat it? 
} } else Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java 2008-09-04 21:36:09 UTC (rev 2590) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java 2008-09-22 18:07:59 UTC (rev 2591) @@ -20,17 +20,15 @@ */ package org.archive.nutchwax.tools; -import java.io.File; import java.io.BufferedReader; -import java.io.File; import java.io.FileInputStream; import java.io.InputStream; import java.io.InputStreamReader; -import java.util.Map; +import java.util.Collections; import java.util.HashMap; import java.util.HashSet; +import java.util.Map; import java.util.Set; -import java.util.Collections; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; @@ -132,7 +130,7 @@ String dates[] = sourceDoc.getValues( NutchWax.DATE_KEY ); - java.util.Collections.addAll( uniqueDates, dates ); + Collections.addAll( uniqueDates, dates ); } for ( String date : uniqueDates ) { Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java 2008-09-04 21:36:09 UTC (rev 2590) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DumpParallelIndex.java 2008-09-22 18:07:59 UTC (rev 2591) @@ -31,9 +31,6 @@ { public static void main( String[] args ) throws Exception { - String option = ""; - String indexDir = ""; - if ( args.length < 1 ) { usageAndExit( ); Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 
=================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 2008-09-04 21:36:09 UTC (rev 2590) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/index-nutchwax/src/java/org/archive/nutchwax/index/ConfigurableIndexingFilter.java 2008-09-22 18:07:59 UTC (rev 2591) @@ -73,6 +73,7 @@ String destKey = srcKey; switch ( spec.length ) { + default: case 6: destKey = spec[5]; case 5: @@ -83,6 +84,9 @@ store = Boolean.parseBoolean( spec[2] ); case 2: lowerCase = Boolean.parseBoolean( spec[1] ); + case 1: + // Nothing to do + ; } LOG.info( "Add field specification: " + srcKey + ":" + lowerCase + ":" + store + ":" + tokenize + ":" + exclusive + ":" + destKey ); Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/urlfilter-nutchwax/src/java/org/archive/nutchwax/urlfilter/WaybackURLFilter.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/plugin/urlfilter-nutchwax/src/java/org/archive/nutchwax/urlfilter/WaybackURLFilter.java 2008-09-04 21:36:09 UTC (rev 2590) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/urlfilter-nutchwax/src/java/org/archive/nutchwax/urlfilter/WaybackURLFilter.java 2008-09-22 18:07:59 UTC (rev 2591) @@ -24,14 +24,10 @@ import java.io.IOException; import java.io.InputStream; import java.io.InputStreamReader; -import java.net.MalformedURLException; -import java.util.Arrays; import java.util.Collections; import java.util.HashSet; import java.util.Set; -import org.apache.commons.logging.Log; -import org.apache.commons.logging.LogFactory; import org.apache.commons.httpclient.URIException; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2008-09-04 21:35:59
Revision: 2590 http://archive-access.svn.sourceforge.net/archive-access/?rev=2590&view=rev Author: bradtofel Date: 2008-09-04 21:36:09 +0000 (Thu, 04 Sep 2008) Log Message: ----------- BUGFIX(ACC-31): now escapes URLs as they are resolved in UrlOperations. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java 2008-09-02 23:26:08 UTC (rev 2589) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/url/UrlOperations.java 2008-09-04 21:36:09 UTC (rev 2590) @@ -83,7 +83,14 @@ public static String resolveUrl(String baseUrl, String url) { // TODO: this only works for http:// if(url.startsWith("http://")) { - return url; + try { + return UURIFactory.getInstance(url).getEscapedURI(); + } catch (URIException e) { + e.printStackTrace(); + // can't let a space exist... send back close to whatever came + // in... + return url.replace(" ", "%20"); + } } UURI absBaseURI; UURI resolvedURI = null; This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
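Editor's note: the "escape, and fall back to replacing spaces" shape of the ACC-31 fix can be sketched with the JDK alone. This assumes java.net.URI in place of Heritrix's UURIFactory, so the behavior is not identical: java.net.URI rejects illegal characters rather than escaping them, which makes the fallback branch fire more often than it would in Wayback. EscapeSketch and escapeAbsolute are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class EscapeSketch {
    // Returns an escaped form of an absolute URL. If the URL cannot be
    // parsed, returns something close to the input with spaces encoded.
    static String escapeAbsolute(String url) {
        try {
            // Stand-in for UURIFactory.getInstance(url).getEscapedURI().
            return new URI(url).toASCIIString();
        } catch (URISyntaxException e) {
            // Can't let a space through; send back close to whatever came in.
            return url.replace(" ", "%20");
        }
    }
}
```

With this sketch, "http://example.org/a b" comes back as "http://example.org/a%20b", while an already valid URL is returned unchanged.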
From: <bi...@us...> - 2008-09-02 23:26:04
Revision: 2589 http://archive-access.svn.sourceforge.net/archive-access/?rev=2589&view=rev Author: binzino Date: 2008-09-02 23:26:08 +0000 (Tue, 02 Sep 2008) Log Message: ----------- Changed parsing of dup/date file to allow for extra, unused fields. Also updated dedup-cdx script to add archive filename to output. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java trunk/archive-access/projects/nutchwax/archive/src/plugin/urlfilter-nutchwax/src/java/org/archive/nutchwax/urlfilter/WaybackURLFilter.java Modified: trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx =================================================================== --- trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx 2008-08-29 22:18:41 UTC (rev 2588) +++ trunk/archive-access/projects/nutchwax/archive/bin/dedup-cdx 2008-09-02 23:26:08 UTC (rev 2589) @@ -11,16 +11,26 @@ echo "Duplicate records are found by sorting all the CDX records, then" echo "comparing subsequent records by URL+digest." echo - echo "Output is in abbreviated form of \"URL digest date\", ex:" + echo "Output is in abbreviated form of \"URL digest date arcname\", ex:" echo - echo " example.org sha1:H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ 20070208173443" - echo " example.org sha1:H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ 20080626121505" + echo " example.org sha1:H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ 20070208173443 foo.arc.gz" + echo " example.org sha1:H4NTDLP5DNH6KON63ZALKEV5ELVUDGXJ 20080626121505 bar.arc.gz" echo echo "The output of this script can be used as an exclusions file for" echo "importing ARC files with NutchWAX, and also for adding dates" echo "to a parallel index." echo + echo "NOTE: This script uses Unix 'sort' binary. 
If you wish to use a different" + echo "implementation, specify it via the SORT shell variable, e.g.:" + echo + echo " SORT=my_cool_sort dedup-cdx file1.cdx" + echo exit 1; fi -cat $@ | awk '{ print $1 " sha1:" $6 " " $2 }' | sort -u | awk '{ if ( url == $1 && digest == $2 ) print $1 " " $2 " " $3 ; url = $1 ; digest = $2 }' +# Use Unix 'sort', unless over-ridden by caller. +if [ -z "$SORT" ]; then + SORT=sort +fi + +cat $@ | awk '{ print $1, "sha1:" $6, $2, $9 }' | $SORT -u | awk '{ if ( url == $1 && digest == $2 ) print $1, $2, $3, $4 ; url = $1 ; digest = $2 }' Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java 2008-08-29 22:18:41 UTC (rev 2588) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/tools/DateAdder.java 2008-09-02 23:26:08 UTC (rev 2589) @@ -84,24 +84,24 @@ String line; while ( (line = br.readLine( )) != null ) { - String parts[] = line.split("\\s+"); - if ( parts.length != 3 ) + String fields[] = line.split("\\s+"); + if ( fields.length < 3 ) { - System.out.println( "Malformed line: " + line ); + System.out.println( "Malformed line, not enough fields (" + fields.length +"): " + line ); continue; } // Key is hash+url, value is String which is a " "-separated list of dates - String key = parts[0] + parts[1]; + String key = fields[0] + fields[1]; String dates = dateRecords.get( key ); if ( dates != null ) { - dates += " " + parts[2]; + dates += " " + fields[2]; dateRecords.put( key, dates ); } else { - dateRecords.put( key , parts[2] ); + dateRecords.put( key , fields[2] ); } } Modified: trunk/archive-access/projects/nutchwax/archive/src/plugin/urlfilter-nutchwax/src/java/org/archive/nutchwax/urlfilter/WaybackURLFilter.java =================================================================== --- 
trunk/archive-access/projects/nutchwax/archive/src/plugin/urlfilter-nutchwax/src/java/org/archive/nutchwax/urlfilter/WaybackURLFilter.java 2008-08-29 22:18:41 UTC (rev 2588) +++ trunk/archive-access/projects/nutchwax/archive/src/plugin/urlfilter-nutchwax/src/java/org/archive/nutchwax/urlfilter/WaybackURLFilter.java 2008-09-02 23:26:08 UTC (rev 2589) @@ -48,7 +48,6 @@ * same logic as the Wayback. By making Wayback canonicalization * available, we can use exclusion rules generated from CDX files. */ -// TODO: Add logging public class WaybackURLFilter implements URLFilter { public static final Log LOG = LogFactory.getLog( WaybackURLFilter.class ); @@ -75,7 +74,7 @@ if ( s.length != 3 ) { // Don't filter. - LOG.info( "Allowing: " + urlString ); + LOG.info( "Allowing : " + urlString ); return urlString; } @@ -94,7 +93,7 @@ // Then, build a key to be compared against the exclusion // list. - String key = url + " " + digest + " " + date; + String key = url + digest + date; exclude = this.exclusions.contains( key ); } @@ -192,6 +191,20 @@ String line; while ( (line = reader.readLine()) != null ) { + String fields[] = line.split( "\\s+" ); + + if ( fields.length < 3 ) + { + LOG.warn( "Malformed exclusion, not enough fields ("+fields.length+"): " + line ); + continue ; + } + + // We only want the first three fields. Chop-off anything extra. + if ( fields.length >= 3 ) + { + line = fields[0] + fields[1] + fields[2]; + } + exclusions.add( line ); } } @@ -222,5 +235,5 @@ return exclusions; } - + } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
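Editor's note: the relaxed parsing in rev 2589 (at least three fields, extras ignored, dates accumulated per URL+digest key) can be sketched as below. It follows the DateAdder hunk above but reads from an in-memory list instead of a file; DateRecordsSketch and load are hypothetical names, and the warning goes to stderr here where the commit uses System.out.

```java
import java.util.HashMap;
import java.util.Map;

final class DateRecordsSketch {
    // Builds a map from url+digest to a space-separated list of dates.
    // Lines look like "URL digest date [arcname ...]"; fields beyond the
    // third are ignored, and lines with fewer than three are skipped.
    static Map<String, String> load(Iterable<String> lines) {
        Map<String, String> dateRecords = new HashMap<String, String>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 3) {
                System.err.println("Malformed line, not enough fields ("
                        + fields.length + "): " + line);
                continue;
            }

            // The key is the URL and digest concatenated with no separator.
            String key = fields[0] + fields[1];
            String dates = dateRecords.get(key);
            dateRecords.put(key, dates == null ? fields[2] : dates + " " + fields[2]);
        }
        return dateRecords;
    }
}
```

Feeding it the two example lines from the dedup-cdx usage text produces a single entry whose value is "20070208173443 20080626121505"; the trailing archive filenames are dropped.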
Revision: 2588 http://archive-access.svn.sourceforge.net/archive-access/?rev=2588&view=rev Author: bradtofel Date: 2008-08-29 22:18:41 +0000 (Fri, 29 Aug 2008) Log Message: ----------- FEATURE: now supports second offset specification form: ".../fileproxy/foo.arc.gz/12345" to return data starting at offset 12345. FEATURE: now attempts to load from multiple locations if they are present in the locationDB FEATURE: now proxies from local files as well as remote HTTP urls, including combinations of some local files, and some URLs for the same name. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FileProxyServlet.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FileProxyServlet.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FileProxyServlet.java 2008-08-28 22:04:08 UTC (rev 2587) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/locationdb/FileProxyServlet.java 2008-08-29 22:18:41 UTC (rev 2588) @@ -24,11 +24,16 @@ */ package org.archive.wayback.resourcestore.locationdb; +import java.io.File; import java.io.IOException; import java.io.InputStream; import java.io.OutputStream; +import java.io.RandomAccessFile; import java.net.URL; import java.net.URLConnection; +import java.util.logging.Logger; +import java.util.regex.Matcher; +import java.util.regex.Pattern; import javax.servlet.ServletException; import javax.servlet.http.HttpServletRequest; @@ -46,64 +51,145 @@ * @version $Date$, $Revision$ */ public class FileProxyServlet extends ServletRequestContext { + private static final Logger LOGGER = Logger.getLogger(FileProxyServlet.class + .getName()); + private static final int BUF_SIZE = 4096; private static final 
String RANGE_HTTP_HEADER = "Range"; - private static final String CONTENT_TYPE_HEADER = "Content-Type"; - private static final String CONTENT_TYPE = "application/x-gzip"; - /** - * - */ + private static final String DEFAULT_CONTENT_TYPE = "application/x-gzip"; + private static final String HEADER_BYTES_PREFIX = "bytes="; + private static final String HEADER_BYTES_SUFFIX= "-"; + + private static final String FILE_REGEX = "/([^/]*)$"; + private static final String FILE_OFFSET_REGEX = "/([^/]*)/(\\d*)$"; + + private static final Pattern FILE_PATTERN = + Pattern.compile(FILE_REGEX); + private static final Pattern FILE_OFFSET_PATTERN = + Pattern.compile(FILE_OFFSET_REGEX); + private static final long serialVersionUID = 1L; + private ResourceFileLocationDB locationDB = null; public boolean handleRequest(HttpServletRequest httpRequest, HttpServletResponse httpResponse) throws IOException, ServletException { - String name = httpRequest.getRequestURI(); - name = name.substring(name.lastIndexOf('/')+1); - if(name.length() == 0) { + ResourceLocation location = parseRequest(httpRequest); + if(location == null) { httpResponse.sendError(HttpServletResponse.SC_BAD_REQUEST, - "no/invalid name"); + "no/invalid name"); } else { - - String urls[] = locationDB.nameToUrls(name); + String urls[] = locationDB.nameToUrls(location.getName()); + if(urls == null || urls.length == 0) { + LOGGER.warning("No locations for " + location.getName()); httpResponse.sendError(HttpServletResponse.SC_NOT_FOUND, - "Unable to locate("+name+")"); + "Unable to locate("+ location.getName() +")"); } else { + + DataSource ds = null; + for(String urlString : urls) { + try { + ds = locationToDataSource(urlString, + location.getOffset()); + if(ds != null) { + break; + } + } catch(IOException e) { + LOGGER.warning("failed proxy of " + urlString + " " + + e.getLocalizedMessage()); + } + } + if(ds == null) { + LOGGER.warning("No successful locations for " + + location.getName()); + 
httpResponse.sendError(HttpServletResponse.SC_BAD_GATEWAY, + "failed proxy of ("+ location.getName() +")"); + + } else { + httpResponse.setStatus(HttpServletResponse.SC_OK); + // BUGBUG: this will be broken for non compressed data... + httpResponse.setContentType(ds.getContentType()); + ds.copyTo(httpResponse.getOutputStream()); + } + } + } + return true; + } - String urlString = urls[0]; - String rangeHeader = httpRequest.getHeader(RANGE_HTTP_HEADER); - URL url = new URL(urlString); - URLConnection conn = url.openConnection(); + private DataSource locationToDataSource(String location, long offset) + throws IOException { + DataSource ds = null; + if(location.startsWith("http://")) { + URL url = new URL(location); + URLConnection conn = url.openConnection(); + if(offset != 0) { + conn.addRequestProperty(RANGE_HTTP_HEADER, + HEADER_BYTES_PREFIX + String.valueOf(offset) + + HEADER_BYTES_SUFFIX); + } + + ds = new URLDataSource(conn.getInputStream(),conn.getContentType()); + + } else { + // assume a local file path: + File f = new File(location); + if(f.isFile() && f.canRead()) { + long size = f.length(); + if(size < offset) { + throw new IOException("short file " + location + " cannot" + + " seek to offset " + offset); + } + RandomAccessFile raf = new RandomAccessFile(f,"r"); + raf.seek(offset); + // BUGBUG: is it compressed? 
+ ds = new FileDataSource(raf,DEFAULT_CONTENT_TYPE); + + } else { + throw new IOException("No readable file at " + location); + } + + } + + return ds; + } + + private ResourceLocation parseRequest(HttpServletRequest request) { + ResourceLocation location = null; + + String path = request.getRequestURI(); + Matcher fo = FILE_OFFSET_PATTERN.matcher(path); + if(fo.find()) { + location = new ResourceLocation(fo.group(1), + Long.parseLong(fo.group(2))); + } else { + fo = FILE_PATTERN.matcher(path); + if(fo.find()) { + String rangeHeader = request.getHeader(RANGE_HTTP_HEADER); if(rangeHeader != null) { - conn.addRequestProperty(RANGE_HTTP_HEADER,rangeHeader); - } - InputStream is = conn.getInputStream(); - httpResponse.setStatus(HttpServletResponse.SC_OK); - String typeHeader = conn.getHeaderField(CONTENT_TYPE_HEADER); - if(typeHeader == null) { - typeHeader = CONTENT_TYPE; - } - httpResponse.setContentType(typeHeader); - OutputStream os = httpResponse.getOutputStream(); - int BUF_SIZE = 4096; - byte[] buffer = new byte[BUF_SIZE]; - try { - for(int r = -1; (r = is.read(buffer, 0, BUF_SIZE)) != -1;) { - os.write(buffer, 0, r); + if(rangeHeader.startsWith(HEADER_BYTES_PREFIX)) { + rangeHeader = rangeHeader.substring( + HEADER_BYTES_PREFIX.length()); + if(rangeHeader.endsWith(HEADER_BYTES_SUFFIX)) { + rangeHeader = rangeHeader.substring(0, + rangeHeader.length() - + HEADER_BYTES_SUFFIX.length()); + } } - } finally { - is.close(); + location = new ResourceLocation(fo.group(1), + Long.parseLong(rangeHeader)); + } else { + location = new ResourceLocation(fo.group(1)); } } } - return true; + return location; } - + /** * @return the locationDB */ @@ -117,4 +203,71 @@ public void setLocationDB(ResourceFileLocationDB locationDB) { this.locationDB = locationDB; } + + private class ResourceLocation { + private String name = null; + private long offset = 0; + public ResourceLocation(String name, long offset) { + this.name = name; + this.offset = offset; + } + public 
ResourceLocation(String name) { + this(name,0); + } + public String getName() { + return name; + } + public long getOffset() { + return offset; + } + } + + private interface DataSource { + public void copyTo(OutputStream os) throws IOException; + public String getContentType(); + } + private class FileDataSource implements DataSource { + private RandomAccessFile raf = null; + private String contentType = null; + public FileDataSource(RandomAccessFile raf, String contentType) { + this.raf = raf; + this.contentType = contentType; + } + public String getContentType() { + return contentType; + } + public void copyTo(OutputStream os) throws IOException { + byte[] buffer = new byte[BUF_SIZE]; + try { + int r = -1; + while((r = raf.read(buffer, 0, BUF_SIZE)) != -1) { + os.write(buffer, 0, r); + } + } finally { + raf.close(); + } + } + } + private class URLDataSource implements DataSource { + private InputStream is = null; + private String contentType = null; + public URLDataSource(InputStream is,String contentType) { + this.is = is; + this.contentType = contentType; + } + public String getContentType() { + return contentType; + } + public void copyTo(OutputStream os) throws IOException { + byte[] buffer = new byte[BUF_SIZE]; + try { + int r = -1; + while((r = is.read(buffer, 0, BUF_SIZE)) != -1) { + os.write(buffer, 0, r); + } + } finally { + is.close(); + } + } + } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
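The two offset forms handled by r2588 (a trailing ".../name/12345" path segment, or a "Range: bytes=12345-" header) can be sketched in isolation as below. This is a minimal illustration, not the committed code: the class name FileProxyPathSketch, its Location holder, and the OPEN_RANGE pattern are hypothetical. The sketch also deviates from the diff by requiring at least one digit (\d+ rather than \d*) and by rejecting Range values that are not a single open-ended byte range, since Long.parseLong would otherwise throw on an empty or closed range.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of the name/offset parsing described in r2588.
final class FileProxyPathSketch {
    // ".../foo.arc.gz/12345": file name followed by a numeric offset segment.
    private static final Pattern FILE_OFFSET = Pattern.compile("/([^/]+)/(\\d+)$");
    // ".../foo.arc.gz": file name only; the offset may come from a Range header.
    private static final Pattern FILE_ONLY = Pattern.compile("/([^/]+)$");
    // Only the open-ended form "bytes=N-" is accepted here (an assumption).
    private static final Pattern OPEN_RANGE = Pattern.compile("^bytes=(\\d+)-$");

    // Simple holder for the parsed result (hypothetical; mirrors ResourceLocation).
    static final class Location {
        final String name;
        final long offset;
        Location(String name, long offset) {
            this.name = name;
            this.offset = offset;
        }
    }

    // Returns the parsed location, or null when the request cannot be understood.
    static Location parse(String path, String rangeHeader) {
        Matcher m = FILE_OFFSET.matcher(path);
        if (m.find()) {
            return new Location(m.group(1), Long.parseLong(m.group(2)));
        }
        m = FILE_ONLY.matcher(path);
        if (!m.find()) {
            return null;
        }
        long offset = 0;
        if (rangeHeader != null) {
            Matcher r = OPEN_RANGE.matcher(rangeHeader.trim());
            if (!r.matches()) {
                return null;
            }
            offset = Long.parseLong(r.group(1));
        }
        return new Location(m.group(1), offset);
    }

    public static void main(String[] args) {
        Location loc = parse("/fileproxy/foo.arc.gz/12345", null);
        System.out.println(loc.name + " @ " + loc.offset);
    }
}
```

As in the committed servlet, the path form is tried first, so a Range header is only consulted when the path carries no offset segment.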
From: <bi...@us...> - 2008-08-28 22:03:58
Revision: 2587 http://archive-access.svn.sourceforge.net/archive-access/?rev=2587&view=rev Author: binzino Date: 2008-08-28 22:04:08 +0000 (Thu, 28 Aug 2008) Log Message: ----------- Updated use of Hadoop FS utilities to match changed interfaces in Nutch and Hadoop when Nutch upgraded to Hadoop 0.17. Also added a gratuitous log messages when looking for Lucene indexes to open in NutchWaxBean. These messages should help operators avoid common errors where the Nutch index directories are not being found yet NutchBean emits no helpful log messages. Modified Paths: -------------- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/NutchWaxBean.java Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/NutchWaxBean.java =================================================================== --- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2008-08-28 21:54:54 UTC (rev 2586) +++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/NutchWaxBean.java 2008-08-28 22:04:08 UTC (rev 2587) @@ -21,6 +21,9 @@ import java.lang.reflect.Field; import javax.servlet.*; +import org.apache.commons.logging.Log; +import org.apache.commons.logging.LogFactory; + import org.apache.nutch.searcher.NutchBean; import org.apache.nutch.searcher.IndexSearcher; import org.apache.nutch.searcher.Query; @@ -28,6 +31,7 @@ import org.apache.nutch.searcher.Hit; import org.apache.nutch.searcher.Hits; import org.apache.nutch.searcher.Summary; +import org.apache.hadoop.fs.FileStatus; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.Closeable; @@ -68,6 +72,7 @@ */ public class NutchWaxBean { + public static final Log LOG = LogFactory.getLog( NutchWaxBean.class ); /** * Static utility class for modifying a NutchBean instance. 
@@ -79,10 +84,12 @@ * IndexSearcher with one we create that uses * ArchiveParallelReader for searching across parallel indices. */ - public static void modify( NutchBean bean ) + public static boolean modify( NutchBean bean ) { try { + LOG.info( "Modifying NutchBean with NutchWAX extensions..." ); + // First, get the configuration from the bean. Gosh it would be // nice if NutchBean had a getConf() public method, wouldn't it? Field fConf = NutchBean.class.getDeclaredField( "conf" ); @@ -95,20 +102,57 @@ FileSystem fs = FileSystem.get( conf ); - Path dir = new Path( conf.get( "searcher.dir", "crawl") ); + Path dir = new Path( conf.get( "searcher.dir", "crawl") ).makeQualified( fs ); + LOG.info( "Looking for Nutch indexes in: " + dir ); + if ( ! fs.exists( dir ) ) + { + LOG.warn( "Directory does not exist: " + dir ); + LOG.warn( "NutchBean not modified." ); + LOG.warn( "No Nutch indexes will be found and all queries will return no results." ); + + return false; + } + + Path indexesDir = new Path( dir, "pindexes" ).makeQualified(fs); + LOG.info( "Looking for NutchWax parallel indexes in: " + indexesDir ); + if ( ! fs.exists( indexesDir ) ) + { + LOG.warn( "Parallel indexes directory does not exist: " + indexesDir ); + LOG.warn( "NutchBean not modified." ); + + return false; + } + + if ( ! fs.getFileStatus( indexesDir ).isDir( ) ) + { + LOG.warn( "Parallel indexes directory is not a directory: " + indexesDir ); + LOG.warn( "NutchBean not modified." ); + + return false; + } + + FileStatus[] fstats = fs.listStatus(indexesDir, HadoopFSUtil.getPassDirectoriesFilter(fs)); + Path[] indexDirs = HadoopFSUtil.getPaths( fstats ); + + if ( indexDirs.length < 1 ) + { + LOG.info( "No sub-dirs found in parallel indexes directory: " + indexesDir ); + LOG.warn( "NutchBean not modified." 
); + + return false; + } - Path indexesDir = new Path( dir, "pindexes" ); - - Path indexDirs[] = fs.listPaths(indexesDir, HadoopFSUtil.getPassDirectoriesFilter(fs)); - List<IndexReader> readers = new ArrayList<IndexReader>( indexDirs.length ); for ( Path indexDir : indexDirs ) { - Path parallelDirs[] = fs.listPaths( indexDir, HadoopFSUtil.getPassDirectoriesFilter(fs) ); + fstats = fs.listStatus( indexDir, HadoopFSUtil.getPassDirectoriesFilter(fs) ); + Path parallelDirs[] = HadoopFSUtil.getPaths( fstats ); if ( parallelDirs.length < 1 ) { + LOG.info( "No sub-directories, skipping: " + indexDir ); + continue; } @@ -120,11 +164,20 @@ for ( Path p : parallelDirs ) { + LOG.info( "Adding reader for: " + p ); reader.add( IndexReader.open( new FsDirectory( fs, p, false, conf ) ) ); } readers.add( reader ); } + + if ( readers.size( ) == 0 ) + { + LOG.warn( "No parallel indexes in: " + indexesDir ); + LOG.warn( "NutchBean not modified." ); + + return false; + } MultiReader reader = new MultiReader( readers.toArray( new IndexReader[0] ) ); @@ -142,6 +195,8 @@ IndexSearcher searcher = (IndexSearcher) fSearcher.get( bean ); fLuceneSearcher.set( searcher, newLuceneSearcher ); fReader .set( searcher, reader ); + + return true; } catch ( Exception e ) { @@ -177,7 +232,7 @@ if ( bean == null ) { - NutchBean.LOG.fatal( "No value for \"" + NutchBean.KEY + "\" in servlet context" ); + LOG.fatal( "No value for \"" + NutchBean.KEY + "\" in servlet context" ); return ; } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bi...@us...> - 2008-08-28 21:54:45
Revision: 2586
    http://archive-access.svn.sourceforge.net/archive-access/?rev=2586&view=rev
Author: binzino
Date: 2008-08-28 21:54:54 +0000 (Thu, 28 Aug 2008)

Log Message:
-----------
Nutch updated to Hadoop 0.17 and the Mapper interface added generics. So, this class was updated accordingly.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2008-08-28 21:44:41 UTC (rev 2585)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2008-08-28 21:54:54 UTC (rev 2586)
@@ -97,7 +97,7 @@
  * to the importing of ARC files. I've noted those details with
  * comments prefaced with "?:".
  */
-public class Importer extends Configured implements Tool, Mapper
+public class Importer extends Configured implements Tool, Mapper<WritableComparable, Writable, Text, NutchWritable>
 {
   public static final Log LOG = LogFactory.getLog( Importer.class );
Revision: 2585
    http://archive-access.svn.sourceforge.net/archive-access/?rev=2585&view=rev
Author: bradtofel
Date: 2008-08-28 21:44:41 +0000 (Thu, 28 Aug 2008)

Log Message:
-----------
COMMENT+WHITESPACE

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/SimpleResourceStore.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/SimpleResourceStore.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/SimpleResourceStore.java 2008-08-25 23:30:34 UTC (rev 2584)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/SimpleResourceStore.java 2008-08-28 21:44:41 UTC (rev 2585)
@@ -35,10 +35,10 @@
 /**
- * Implements ResourceStore where ARC/WARCs are accessed via HTTP 1.1 range
- * requests. All files are assumed to be "rooted" at a particular HTTP URL,
- * within a single directory, implying a file reverse-proxy to connect through
- * to actual HTTP ARC/WARC locations.
+ * Implements ResourceStore where ARC/WARCs are accessed via a local file or an
+ * HTTP 1.1 range request. All files are assumed to be "rooted" at a particular
+ * HTTP URL, or within a single local directory. The HTTP version may imply a
+ * file reverse-proxy to connect through to actual HTTP ARC/WARC locations.
  *
  * @author brad
  * @version $Date$, $Revision$
@@ -47,7 +47,6 @@
     private String prefix = null;
-
     public Resource retrieveResource(CaptureSearchResult result)
         throws ResourceNotAvailableException {
Revision: 2584
    http://archive-access.svn.sourceforge.net/archive-access/?rev=2584&view=rev
Author: bradtofel
Date: 2008-08-25 23:30:34 +0000 (Mon, 25 Aug 2008)

Log Message:
-----------
BUGFIX(unreported): If self-redirect filters cause all documents to be filtered from results, now throws ResourceNotInArchiveException.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java 2008-08-25 23:22:00 UTC (rev 2583)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/RemoteResourceIndex.java 2008-08-25 23:30:34 UTC (rev 2584)
@@ -207,7 +207,8 @@
     }
     protected SearchResults documentToSearchResults(Document document,
-            ObjectFilter<CaptureSearchResult> filter) {
+            ObjectFilter<CaptureSearchResult> filter)
+        throws ResourceNotInArchiveException {
         SearchResults results = null;
         NodeList filters = getRequestFilters(document);
         String resultsType = getResultsType(document);
@@ -237,9 +238,11 @@
         return results;
     }
     private CaptureSearchResults documentToCaptureSearchResults(
-            Document document, ObjectFilter<CaptureSearchResult> filter) {
+            Document document, ObjectFilter<CaptureSearchResult> filter)
+        throws ResourceNotInArchiveException {
         CaptureSearchResults results = new CaptureSearchResults();
         NodeList xresults = getSearchResults(document);
+        int numAdded = 0;
         for(int i = 0; i < xresults.getLength(); i++) {
             Node xresult = xresults.item(i);
             CaptureSearchResult result = searchElementToCaptureSearchResult(xresult);
@@ -252,9 +255,14 @@
             if (ruling == ObjectFilter.FILTER_ABORT) {
                 break;
             } else if (ruling == ObjectFilter.FILTER_INCLUDE) {
+                numAdded++;
                 results.addSearchResult(result, true);
             }
         }
+        if(numAdded == 0) {
+            throw new ResourceNotInArchiveException("No documents matching" +
+                    " filter");
+        }
         return results;
     }
     private UrlSearchResult searchElementToUrlSearchResult(Node e) {
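The behavior introduced in r2584 (count the results a filter actually includes and fail when that count is zero) can be shown with a small self-contained sketch. Everything here is a hypothetical stand-in rather than the Wayback API: the Filter interface, the INCLUDE/EXCLUDE/ABORT constants, and NotInArchiveException only mimic ObjectFilter and ResourceNotInArchiveException so the example runs without the wayback-core classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of "throw if the filter removed every result" (r2584).
final class FilteredResultsSketch {
    // Stand-ins for the ObjectFilter rulings; the values are arbitrary.
    static final int INCLUDE = 0;
    static final int EXCLUDE = 1;
    static final int ABORT = 2;

    // Stand-in for ObjectFilter<T>.
    interface Filter<T> {
        int filterObject(T o);
    }

    // Stand-in for ResourceNotInArchiveException.
    static final class NotInArchiveException extends Exception {
        private static final long serialVersionUID = 1L;
        NotInArchiveException(String message) {
            super(message);
        }
    }

    // Collects included candidates; an empty outcome is reported as an error
    // instead of being returned as an empty list.
    static <T> List<T> collect(List<T> candidates, Filter<T> filter)
            throws NotInArchiveException {
        List<T> results = new ArrayList<T>();
        for (T candidate : candidates) {
            int ruling = filter.filterObject(candidate);
            if (ruling == ABORT) {
                break;
            } else if (ruling == INCLUDE) {
                results.add(candidate);
            }
        }
        if (results.isEmpty()) {
            throw new NotInArchiveException("No documents matching filter");
        }
        return results;
    }

    public static void main(String[] args) throws NotInArchiveException {
        List<String> kept = collect(Arrays.asList("a", "bb", "ccc"),
                s -> s.length() > 1 ? INCLUDE : EXCLUDE);
        System.out.println(kept);
    }
}
```

Raising an exception here lets a caller treat "every capture was filtered out" the same way as "nothing was found", which appears to be the intent of the log message.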
Revision: 2583
    http://archive-access.svn.sourceforge.net/archive-access/?rev=2583&view=rev
Author: bradtofel
Date: 2008-08-25 23:22:00 +0000 (Mon, 25 Aug 2008)

Log Message:
-----------
BUGFIX (ACC-30): now use original URL + timestamp as uniqueness key. This could still cause problems, in which case, we'll add digest perhaps.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/DuplicateRecordFilter.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/DuplicateRecordFilter.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/DuplicateRecordFilter.java 2008-08-25 23:20:02 UTC (rev 2582)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/filters/DuplicateRecordFilter.java 2008-08-25 23:22:00 UTC (rev 2583)
@@ -18,7 +18,7 @@
      * @see org.archive.wayback.util.ObjectFilter#filterObject(java.lang.Object)
      */
     public int filterObject(CaptureSearchResult o) {
-        String thisUrl = o.getUrlKey();
+        String thisUrl = o.getOriginalUrl();
         String thisDate = o.getCaptureTimestamp();
         int result = ObjectFilter.FILTER_INCLUDE;
         if(lastUrl != null) {
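The uniqueness key described in r2583 (original URL plus capture timestamp, compared against the previously seen record) might look roughly like the sketch below. The diff shows only the first lines of filterObject, so the comparison logic here is an assumption, and the class DuplicateRecordSketch with its isDuplicate method is hypothetical. It also assumes records arrive sorted, so duplicates are adjacent.

```java
// Hypothetical sketch of adjacent-duplicate detection keyed on
// (original URL, capture timestamp), as described in r2583.
final class DuplicateRecordSketch {
    private String lastUrl = null;
    private String lastDate = null;

    // Returns true when this record repeats the key of the previous record.
    boolean isDuplicate(String originalUrl, String captureTimestamp) {
        boolean duplicate = originalUrl.equals(lastUrl)
                && captureTimestamp.equals(lastDate);
        // Remember the current key for the next comparison.
        lastUrl = originalUrl;
        lastDate = captureTimestamp;
        return duplicate;
    }

    public static void main(String[] args) {
        DuplicateRecordSketch filter = new DuplicateRecordSketch();
        System.out.println(filter.isDuplicate("http://example.org/", "20080825232200"));
        System.out.println(filter.isDuplicate("http://example.org/", "20080825232200"));
    }
}
```

Keying on the original URL rather than a canonicalized key means two distinct URLs that canonicalize to the same key are no longer collapsed. As the log message notes, two captures of one URL in the same second would still collide unless a digest were added to the key.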
From: <bra...@us...> - 2008-08-25 23:20:01
Revision: 2582 http://archive-access.svn.sourceforge.net/archive-access/?rev=2582&view=rev Author: bradtofel Date: 2008-08-25 23:20:02 +0000 (Mon, 25 Aug 2008) Log Message: ----------- POST-RELEASE version update. Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/pom.xml trunk/archive-access/projects/wayback/pom.xml trunk/archive-access/projects/wayback/wayback-core/pom.xml trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml trunk/archive-access/projects/wayback/wayback-webapp/pom.xml Modified: trunk/archive-access/projects/wayback/dist/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/pom.xml 2008-08-21 20:50:48 UTC (rev 2581) +++ trunk/archive-access/projects/wayback/dist/pom.xml 2008-08-25 23:20:02 UTC (rev 2582) @@ -3,7 +3,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.0</version> + <version>1.5.0-SNAPSHOT</version> </parent> <modelVersion>4.0.0</modelVersion> @@ -54,13 +54,13 @@ <dependency> <groupId>org.archive.wayback</groupId> <artifactId>wayback-webapp</artifactId> - <version>1.4.0</version> + <version>1.5.0-SNAPSHOT</version> <type>war</type> </dependency> <dependency> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce</artifactId> - <version>1.4.0</version> + <version>1.5.0-SNAPSHOT</version> </dependency> </dependencies> Modified: trunk/archive-access/projects/wayback/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/pom.xml 2008-08-21 20:50:48 UTC (rev 2581) +++ trunk/archive-access/projects/wayback/pom.xml 2008-08-25 23:20:02 UTC (rev 2582) @@ -17,9 +17,9 @@ <groupId>org.archive</groupId> <artifactId>wayback</artifactId> <properties> - <globalVersion>1.4.0</globalVersion> + <globalVersion>1.5.0-SNAPSHOT</globalVersion> </properties> - 
<version>1.4.0</version> + <version>1.5.0-SNAPSHOT</version> <packaging>pom</packaging> <name>Wayback</name> Modified: trunk/archive-access/projects/wayback/wayback-core/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/pom.xml 2008-08-21 20:50:48 UTC (rev 2581) +++ trunk/archive-access/projects/wayback/wayback-core/pom.xml 2008-08-25 23:20:02 UTC (rev 2582) @@ -17,7 +17,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.0</version> + <version>1.5.0-SNAPSHOT</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-core</artifactId> Modified: trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml 2008-08-21 20:50:48 UTC (rev 2581) +++ trunk/archive-access/projects/wayback/wayback-mapreduce/pom.xml 2008-08-25 23:20:02 UTC (rev 2582) @@ -12,7 +12,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.0</version> + <version>1.5.0-SNAPSHOT</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce</artifactId> Modified: trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml 2008-08-21 20:50:48 UTC (rev 2581) +++ trunk/archive-access/projects/wayback/wayback-mapreduce-prereq/pom.xml 2008-08-25 23:20:02 UTC (rev 2582) @@ -10,7 +10,7 @@ <parent> <groupId>org.archive</groupId> <artifactId>wayback</artifactId> - <version>1.4.0</version> + <version>1.5.0-SNAPSHOT</version> </parent> <groupId>org.archive.wayback</groupId> <artifactId>wayback-mapreduce-prereq</artifactId> Modified: trunk/archive-access/projects/wayback/wayback-webapp/pom.xml 
=================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/pom.xml 2008-08-21 20:50:48 UTC (rev 2581) +++ trunk/archive-access/projects/wayback/wayback-webapp/pom.xml 2008-08-25 23:20:02 UTC (rev 2582) @@ -3,7 +3,7 @@ <parent> <artifactId>wayback</artifactId> <groupId>org.archive</groupId> - <version>1.4.0</version> + <version>1.5.0-SNAPSHOT</version> </parent> <modelVersion>4.0.0</modelVersion> <groupId>org.archive.wayback</groupId> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2008-08-21 20:50:42
Revision: 2581
    http://archive-access.svn.sourceforge.net/archive-access/?rev=2581&view=rev
Author: bradtofel
Date: 2008-08-21 20:50:48 +0000 (Thu, 21 Aug 2008)

Log Message:
-----------
RELEASE BRANCH.

Added Paths:
-----------
    branches/wayback-1_4_0/