From: <bi...@us...> - 2010-02-12 20:39:58
Revision: 2955
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2955&view=rev
Author:   binzino
Date:     2010-02-12 20:39:44 +0000 (Fri, 12 Feb 2010)

Log Message:
-----------
Fix counting of hits and add use of hitsPerSite argument.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java 2010-01-29 00:22:22 UTC (rev 2954)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java 2010-02-12 20:39:44 UTC (rev 2955)
@@ -431,10 +431,10 @@
     try {
       final Query query = Query.parse( queryString, conf);
-      final Hits hits = bean.search(query, numHits);
+      final Hits hits = bean.search(query, numHits, hitsPerSite);
       System.out.println( "Total hits : " + hits.getTotal () );
       System.out.println( "Hits length: " + hits.getLength() );
-      final int length = (int)Math.min(hits.getTotal(), numHits);
+      final int length = (int)Math.min(hits.getLength(), numHits);
       final Hit[] show = hits.getHits(0, length);
       final HitDetails[] details = bean.getDetails(show);
       final Summary[] summaries = bean.getSummary(details, query);

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
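The point of the second change in revision 2955 is that the number of hits to display should be bounded by the hits actually returned, not by the total the index reports. The sketch below illustrates that with a hypothetical `HitPage` class; it is a stand-in written for this example and is not the Nutch `Hits` class, and `displayCount` is an invented helper name.

```java
import java.util.Arrays;

// Hypothetical stand-in for one page of search results; NOT the Nutch Hits class.
final class HitPage {
    private final long total;     // total matches reported for the query
    private final String[] hits;  // hits actually returned in this page

    HitPage(long total, String[] hits) {
        this.total = total;
        this.hits = hits;
    }

    long getTotal() { return total; }

    int getLength() { return hits.length; }

    String[] getHits(int start, int length) {
        return Arrays.copyOfRange(hits, start, start + length);
    }

    // Number of hits that can safely be displayed: never more than were
    // returned, and never more than the caller asked for.
    static int displayCount(HitPage page, int numHits) {
        return Math.min(page.getLength(), numHits);
    }
}
```

With a reported total of 500 but only 3 hits returned (for example after per-site limiting), `displayCount(page, 10)` gives 3, whereas bounding by the total would have given 10.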
From: <bi...@us...> - 2010-01-29 00:50:29
Revision: 2953
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2953&view=rev
Author:   binzino
Date:     2010-01-29 00:20:42 +0000 (Fri, 29 Jan 2010)

Log Message:
-----------
Updated to use NutchBean since NutchWaxBean is deprecated. Also fixed bug in NutchBean not observing the -n parameter.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/bin/nutchwax
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java

Modified: trunk/archive-access/projects/nutchwax/archive/bin/nutchwax
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2010-01-19 22:11:50 UTC (rev 2952)
+++ trunk/archive-access/projects/nutchwax/archive/bin/nutchwax 2010-01-29 00:20:42 UTC (rev 2953)
@@ -80,7 +80,7 @@
     ;;
   search)
     shift
-    ${NUTCH_HOME}/bin/nutch org.archive.nutchwax.NutchWaxBean "$@"
+    ${NUTCH_HOME}/bin/nutch org.apache.nutch.searcher.NutchBean "$@"
     ;;
   *)
     echo ""

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java 2010-01-19 22:11:50 UTC (rev 2952)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/NutchBean.java 2010-01-29 00:20:42 UTC (rev 2953)
@@ -431,10 +431,10 @@
     try {
       final Query query = Query.parse( queryString, conf);
-      final Hits hits = bean.search(query, 10);
+      final Hits hits = bean.search(query, numHits);
       System.out.println( "Total hits : " + hits.getTotal () );
       System.out.println( "Hits length: " + hits.getLength() );
-      final int length = (int)Math.min(hits.getTotal(), 10);
+      final int length = (int)Math.min(hits.getTotal(), numHits);
       final Hit[] show = hits.getHits(0, length);
       final HitDetails[] details = bean.getDetails(show);
       final Summary[] summaries = bean.getSummary(details, query);
From: <bi...@us...> - 2010-01-29 00:43:10
Revision: 2954
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2954&view=rev
Author:   binzino
Date:     2010-01-29 00:22:22 +0000 (Fri, 29 Jan 2010)

Log Message:
-----------
Added code (from NutchWAX 0.12.x) to handle cases where segment is missing or record is not found in the segment.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2010-01-29 00:20:42 UTC (rev 2953)
+++ trunk/archive-access/projects/nutchwax/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2010-01-29 00:22:22 UTC (rev 2954)
@@ -340,10 +340,26 @@
     if (this.summarizer == null) {
       return new Summary();
     }
-    final Segment segment = getSegment(details);
-    final ParseText parseText = segment.getParseText(getKey(details));
-    final String text = (parseText != null) ? parseText.getText() : "";
+    String text = "";
+    Segment segment = getSegment(details);
+    if ( segment != null )
+      {
+        try
+          {
+            ParseText parseText = segment.getParseText(getKey(details));
+            text = (parseText != null ) ? parseText.getText() : "";
+          }
+        catch ( Exception e )
+          {
+            LOG.error( "segment = " + segment.segmentDir, e );
+          }
+      }
+    else
+      {
+        LOG.warn( "No segment for: " + details );
+      }
+
     return this.summarizer.getSummary(text, query);
   }
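The pattern in revision 2954 is a defensive lookup: if the segment is missing or the record is not in it, the summary is built from an empty string instead of failing. A minimal sketch of the same idea follows; `SegmentStore`, `put` and `textFor` are hypothetical names invented for this example and use in-memory maps rather than the NutchWAX segment readers.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.logging.Logger;

// Hypothetical illustration; not a NutchWAX class.
final class SegmentStore {
    private static final Logger LOG = Logger.getLogger(SegmentStore.class.getName());

    // segment name -> (record key -> parsed text)
    private final Map<String, Map<String, String>> segments = new HashMap<>();

    void put(String segment, String key, String text) {
        segments.computeIfAbsent(segment, s -> new HashMap<>()).put(key, text);
    }

    // Returns the parsed text, or "" when the segment or the record is missing,
    // so that callers can always hand a non-null string to the summarizer.
    String textFor(String segment, String key) {
        Map<String, String> records = segments.get(segment);
        if (records == null) {
            LOG.warning("No segment for: " + segment);
            return "";
        }
        String text = records.get(key);
        return (text != null) ? text : "";
    }
}
```

The commit additionally catches exceptions thrown while reading the segment and logs them; the sketch omits that because an in-memory map lookup cannot fail that way.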
From: <bi...@us...> - 2010-01-19 22:11:58
Revision: 2952
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2952&view=rev
Author:   binzino
Date:     2010-01-19 22:11:50 +0000 (Tue, 19 Jan 2010)

Log Message:
-----------
Added property check to disable writing of crawl datum.

Modified Paths:
--------------
    tags/nutchwax-0_12_9/archive/src/java/org/archive/nutchwax/Importer.java

Modified: tags/nutchwax-0_12_9/archive/src/java/org/archive/nutchwax/Importer.java
===================================================================
--- tags/nutchwax-0_12_9/archive/src/java/org/archive/nutchwax/Importer.java 2010-01-13 01:31:20 UTC (rev 2951)
+++ tags/nutchwax-0_12_9/archive/src/java/org/archive/nutchwax/Importer.java 2010-01-19 22:11:50 UTC (rev 2952)
@@ -465,7 +465,11 @@
 
     try
       {
-        output.collect( key, new NutchWritable( datum ) );
+        // Back-port this little change from NutchWAX 0.13. We don't need the crawl datum.
+        if ( jobConf.getBoolean( "nutchwax.import.store.crawl", false ) )
+          {
+            output.collect( key, new NutchWritable( datum ) );
+          }
 
         if ( jobConf.getBoolean( "nutchwax.import.store.content", false ) )
           {
Revision: 2951
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2951&view=rev
Author:   bradtofel
Date:     2010-01-13 01:31:20 +0000 (Wed, 13 Jan 2010)

Log Message:
-----------
BUGFIX(unreported): ID String was incorrect due to very old copy-paste problem

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/exception/LiveDocumentNotAvailableException.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/exception/LiveDocumentNotAvailableException.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/exception/LiveDocumentNotAvailableException.java 2010-01-13 01:30:27 UTC (rev 2950)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/exception/LiveDocumentNotAvailableException.java 2010-01-13 01:31:20 UTC (rev 2951)
@@ -39,7 +39,7 @@
     *
     */
    private static final long serialVersionUID = 1L;
-   protected static final String ID = "resourceIndexNotAvailable";
+   protected static final String ID = "liveDocumentNotAvailable";
    protected static final String defaultMessage = "Live document unavailable";
    /**
     * Constructor
Revision: 2950 http://archive-access.svn.sourceforge.net/archive-access/?rev=2950&view=rev Author: bradtofel Date: 2010-01-13 01:30:27 +0000 (Wed, 13 Jan 2010) Log Message: ----------- INITIAL REV: Exception to indicate unexpected and uncorrectable problems with the the LiveWebCache Added Paths: ----------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/exception/LiveWebCacheUnavailableException.java Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/exception/LiveWebCacheUnavailableException.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/exception/LiveWebCacheUnavailableException.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/exception/LiveWebCacheUnavailableException.java 2010-01-13 01:30:27 UTC (rev 2950) @@ -0,0 +1,79 @@ +/* LiveWebCacheUnavailableException + * + * $Id$: + * + * Created on Dec 18, 2009. + * + * Copyright (C) 2006 Internet Archive. + * + * This file is part of Wayback. + * + * Wayback is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or + * any later version. + * + * Wayback is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser Public License + * along with Wayback; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +package org.archive.wayback.exception; + +import java.net.URL; + +import javax.servlet.http.HttpServletResponse; + +/** + * @author brad + * + */ +public class LiveWebCacheUnavailableException extends WaybackException { + /** + * + */ + private static final long serialVersionUID = 1L; + protected static final String ID = "liveWebCacheNotAvailable"; + protected static final String defaultMessage = "LiveWebCache unavailable"; + /** + * Constructor + * @param url + * @param code + */ + public LiveWebCacheUnavailableException(URL url, int code) { + super("The URL " + url.toString() + " is not available(HTTP " + code + + " returned)",defaultMessage); + id = ID; + } + /** + * Constructor with message and details + * @param url + * @param code + * @param details + */ + public LiveWebCacheUnavailableException(URL url, int code, String details){ + super("The URL " + url.toString() + " is not available(HTTP " + code + + " returned)",defaultMessage,details); + id = ID; + } + /** + * @param url + */ + public LiveWebCacheUnavailableException(String url){ + super("The URL " + url + " is not available",defaultMessage); + id = ID; + } + + /** + * @return the HTTP status code appropriate to this exception class. + */ + public int getStatus() { + return HttpServletResponse.SC_BAD_GATEWAY; + } + +} Property changes on: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/exception/LiveWebCacheUnavailableException.java ___________________________________________________________________ Added: svn:keywords + Author Date Revision Id This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2949 http://archive-access.svn.sourceforge.net/archive-access/?rev=2949&view=rev Author: bradtofel Date: 2010-01-13 01:28:52 +0000 (Wed, 13 Jan 2010) Log Message: ----------- FEATURE: added logging Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/robotstxt/RobotExclusionFilter.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/robotstxt/RobotExclusionFilter.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/robotstxt/RobotExclusionFilter.java 2010-01-13 01:26:57 UTC (rev 2948) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/robotstxt/RobotExclusionFilter.java 2010-01-13 01:28:52 UTC (rev 2949) @@ -31,6 +31,7 @@ import java.util.HashMap; import java.util.Iterator; import java.util.List; +import java.util.logging.Logger; import java.util.regex.Matcher; import java.util.regex.Pattern; @@ -38,6 +39,7 @@ import org.archive.wayback.core.Resource; import org.archive.wayback.core.CaptureSearchResult; import org.archive.wayback.exception.LiveDocumentNotAvailableException; +import org.archive.wayback.exception.LiveWebCacheUnavailableException; import org.archive.wayback.liveweb.LiveWebCache; import org.archive.wayback.util.ObjectFilter; @@ -58,6 +60,8 @@ */ public class RobotExclusionFilter implements ObjectFilter<CaptureSearchResult> { + private final static Logger LOGGER = Logger.getLogger(RobotExclusionFilter.class.getName()); + private final static String HTTP_PREFIX = "http://"; private final static String ROBOT_SUFFIX = "/robots.txt"; @@ -142,18 +146,28 @@ firstUrlString = urlString; } if(rulesCache.containsKey(urlString)) { + LOGGER.fine("ROBOT: Cached("+urlString+")"); rules = rulesCache.get(urlString); } else { try { - + 
LOGGER.fine("ROBOT: NotCached("+urlString+")"); + tmpRules = new RobotRules(); Resource resource = webCache.getCachedResource(new URL(urlString), maxCacheMS,true); + if(resource.getStatusCode() != 200) { + LOGGER.info("ROBOT: NotAvailable("+urlString+")"); + throw new LiveDocumentNotAvailableException(urlString); + } tmpRules.parse(resource); rulesCache.put(firstUrlString,tmpRules); rules = tmpRules; + LOGGER.info("ROBOT: Downloaded("+urlString+")"); } catch (LiveDocumentNotAvailableException e) { + // cache an empty rule: all OK +// rulesCache.put(firstUrlString, emptyRules); +// rules = emptyRules; continue; } catch (MalformedURLException e) { e.printStackTrace(); @@ -161,6 +175,9 @@ } catch (IOException e) { e.printStackTrace(); return null; + } catch (LiveWebCacheUnavailableException e) { + e.printStackTrace(); + return null; } } } @@ -186,6 +203,8 @@ url = new URL(ArchiveUtils.addImpliedHttpIfNecessary(resultURL)); if(!rules.blocksPathForUA(url.getPath(), userAgent)) { filterResult = ObjectFilter.FILTER_INCLUDE; + } else { + LOGGER.info("ROBOT: BLOCKED("+resultURL+")"); } } catch (MalformedURLException e) { e.printStackTrace(); This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2010-01-13 01:27:04
Revision: 2948
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2948&view=rev
Author:   bradtofel
Date:     2010-01-13 01:26:57 +0000 (Wed, 13 Jan 2010)

Log Message:
-----------
FEATURE: added proxy host & port configuration methods

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/oracleclient/OracleExclusionFilter.java
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/oracleclient/OracleExclusionFilterFactory.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/oracleclient/OracleExclusionFilter.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/oracleclient/OracleExclusionFilter.java 2010-01-13 00:26:44 UTC (rev 2947)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/oracleclient/OracleExclusionFilter.java 2010-01-13 01:26:57 UTC (rev 2948)
@@ -51,7 +51,24 @@
     * @param accessGroup String group to use with requests to the Oracle
     */
    public OracleExclusionFilter(String oracleUrl, String accessGroup) {
+       this(oracleUrl,accessGroup,null);
+   }
+   /**
+    * @param oracleUrl String URL prefix for the Oracle HTTP server
+    * @param accessGroup String group to use with requests to the Oracle
+    * @param proxyHostPort String proxyHost:proxyPort to use for robots.txt
+    */
+   public OracleExclusionFilter(String oracleUrl, String accessGroup,
+           String proxyHostPort) {
        client = new AccessControlClient(oracleUrl);
+       if(proxyHostPort != null) {
+           int colonIdx = proxyHostPort.indexOf(':');
+           if(colonIdx > 0) {
+               String host = proxyHostPort.substring(0,colonIdx);
+               int port = Integer.valueOf(proxyHostPort.substring(colonIdx+1));
+               client.setRobotProxy(host, port);
+           }
+       }
        this.accessGroup = accessGroup;
    }
 
Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/oracleclient/OracleExclusionFilterFactory.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/oracleclient/OracleExclusionFilterFactory.java 2010-01-13 00:26:44 UTC (rev 2947)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/accesscontrol/oracleclient/OracleExclusionFilterFactory.java 2010-01-13 01:26:57 UTC (rev 2948)
@@ -38,10 +38,11 @@
 
    private String oracleUrl = null;
    private String accessGroup = null;
+   private String proxyHostPort = null;
 
    public ObjectFilter<CaptureSearchResult> get() {
        OracleExclusionFilter filter = new OracleExclusionFilter(oracleUrl,
-               accessGroup);
+               accessGroup, proxyHostPort);
        return filter;
    }
 
@@ -77,4 +78,18 @@
        this.accessGroup = accessGroup;
    }
 
+   /**
+    * @return the proxyHostPort
+    */
+   public String getProxyHostPort() {
+       return proxyHostPort;
+   }
+
+   /**
+    * @param proxyHostPort the proxyHostPort to set, ex. "localhost:3128"
+    */
+   public void setProxyHostPort(String proxyHostPort) {
+       this.proxyHostPort = proxyHostPort;
+   }
+
 }
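Revision 2948 splits a "host:port" string inline in the constructor: a value without a colon is silently ignored, and a non-numeric port makes `Integer.valueOf` throw `NumberFormatException`. The sketch below shows one way the same split could be isolated and validated. `ProxyAddress` and `parse` are hypothetical names introduced for illustration, and the stricter checks (port range, returning null on bad input) are assumptions that go beyond what the commit does.

```java
// Hypothetical helper for illustration; the commit does this inline.
final class ProxyAddress {
    final String host;
    final int port;

    private ProxyAddress(String host, int port) {
        this.host = host;
        this.port = port;
    }

    // Parses "host:port", e.g. "localhost:3128".
    // Returns null when the input is null or not a usable host:port pair.
    static ProxyAddress parse(String hostPort) {
        if (hostPort == null) {
            return null;
        }
        int colonIdx = hostPort.indexOf(':');
        // Require a non-empty host before the colon and something after it.
        if (colonIdx <= 0 || colonIdx == hostPort.length() - 1) {
            return null;
        }
        try {
            int port = Integer.parseInt(hostPort.substring(colonIdx + 1).trim());
            if (port < 1 || port > 65535) {
                return null;
            }
            return new ProxyAddress(hostPort.substring(0, colonIdx).trim(), port);
        } catch (NumberFormatException e) {
            // Non-numeric port: treat as "no proxy configured" in this sketch.
            return null;
        }
    }
}
```

Whether a malformed value should be ignored or rejected loudly is a design choice; failing at configuration time is often easier to diagnose than a proxy that is quietly not used.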
From: <bi...@us...> - 2010-01-13 01:16:00
Revision: 2946
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2946&view=rev
Author:   binzino
Date:     2010-01-13 00:14:06 +0000 (Wed, 13 Jan 2010)

Log Message:
-----------
Completed fix for WAX-68.

Modified Paths:
--------------
    tags/nutchwax-0_12_9/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java

Modified: tags/nutchwax-0_12_9/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java
===================================================================
--- tags/nutchwax-0_12_9/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2010-01-12 22:27:18 UTC (rev 2945)
+++ tags/nutchwax-0_12_9/archive/src/nutch/src/java/org/apache/nutch/searcher/FetchedSegments.java 2010-01-13 00:14:06 UTC (rev 2946)
@@ -233,20 +233,25 @@
             }
 
           String version = fields[1];
-          if ( ! ( "10".equals( version ) || "12".equals( version ) ) )
+          if ( "10".equals( version ) )
             {
-              LOG.warn( "Malformed versions line, invalid version ("+version+"): " + version );
-              continue;
+              LOG.info( "Version: " + fields[0] + " : " + fields[1] );
+              if ( this.oldFormatSegments == null )
+                {
+                  this.oldFormatSegments = new HashSet( );
+                }
+
+              this.oldFormatSegments.add( segment );
             }
-
-          LOG.info( "Version: " + fields[0] + " : " + fields[1] );
-
-          if ( this.oldFormatSegments == null )
+          else if ( "12".equals( version ) )
             {
-              this.oldFormatSegments = new HashSet( );
+              LOG.info( "Version: " + fields[0] + " : " + fields[1] );
+              // For version 12, nothing to do.
             }
-
-          this.oldFormatSegments.add( segment );
+          else
+            {
+              LOG.warn( "Malformed versions line, invalid version ("+version+"): " + line );
+            }
         }
     }
 
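The fix in revision 2946 changes the control flow so that only segments marked as version 10 are recorded as old-format; before it, every segment with a valid version, including version 12, ended up in `oldFormatSegments`. A minimal sketch of the corrected classification is below. `SegmentVersions`, `accept` and `isOldFormat` are hypothetical names, and splitting the line on whitespace is an assumption, since the diff does not show how `fields` is produced.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the version dispatch; not the NutchWAX code.
final class SegmentVersions {
    private final Set<String> oldFormatSegments = new HashSet<>();

    // Processes one "segment-name version" line, e.g. "foo-segment 10".
    // Returns false for lines that are malformed or carry an unknown version.
    boolean accept(String line) {
        String[] fields = line.trim().split("\\s+"); // assumed separator
        if (fields.length != 2) {
            return false;
        }
        switch (fields[1]) {
            case "10":
                // Only version-10 segments are remembered as old-format.
                oldFormatSegments.add(fields[0]);
                return true;
            case "12":
                // Current format: nothing to record.
                return true;
            default:
                return false;
        }
    }

    boolean isOldFormat(String segment) {
        return oldFormatSegments.contains(segment);
    }
}
```

A segment that never appears in the file is simply absent from the set, so it is handled the same way as a version 12 segment.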
From: <bi...@us...> - 2010-01-13 00:26:52
Revision: 2947 http://archive-access.svn.sourceforge.net/archive-access/?rev=2947&view=rev Author: binzino Date: 2010-01-13 00:26:44 +0000 (Wed, 13 Jan 2010) Log Message: ----------- Updated for 0.12.9 release. Modified Paths: -------------- tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt tags/nutchwax-0_12_9/archive/HOWTO.txt tags/nutchwax-0_12_9/archive/INSTALL.txt tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt Modified: tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt =================================================================== --- tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt 2010-01-13 00:14:06 UTC (rev 2946) +++ tags/nutchwax-0_12_9/archive/BUILD-NOTES.txt 2010-01-13 00:26:44 UTC (rev 2947) @@ -1,6 +1,6 @@ BUILD-NOTES.txt -2009-09-18 +2010-01-13 Aaron Binns ====================================================================== @@ -79,7 +79,7 @@ ---------------------------------------------------------------------- The file - /opt/nutchwax-0.12.8/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.9/conf/tika-mimetypes.xml contains two errors: one where a mimetype is referenced before it is defined; and a second where a definition has an illegal character. 
@@ -110,11 +110,11 @@ You can either apply these patches yourself, or copy an already-patched copy from: - /opt/nutchwax-0.12.8/contrib/archive/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.9/contrib/archive/conf/tika-mimetypes.xml to - /opt/nutchwax-0.12.8/conf/tika-mimetypes.xml + /opt/nutchwax-0.12.9/conf/tika-mimetypes.xml ---------------------------------------------------------------------- Modified: tags/nutchwax-0_12_9/archive/HOWTO.txt =================================================================== --- tags/nutchwax-0_12_9/archive/HOWTO.txt 2010-01-13 00:14:06 UTC (rev 2946) +++ tags/nutchwax-0_12_9/archive/HOWTO.txt 2010-01-13 00:26:44 UTC (rev 2947) @@ -1,6 +1,6 @@ HOWTO.txt -2009-09-18 +2010-01-13 Aaron Binns Table of Contents @@ -26,7 +26,7 @@ This HOWTO assumes it is installed in - /opt/nutchwax-0.12.8 + /opt/nutchwax-0.12.9 2. ARC/WARC files. @@ -68,14 +68,10 @@ $ mkdir crawl $ cd crawl - $ /opt/nutchwax-0.12.8/bin/nutchwax import ../manifest - $ /opt/nutchwax-0.12.8/bin/nutch updatedb crawldb -dir segments - $ /opt/nutchwax-0.12.8/bin/nutch invertlinks linkdb -dir segments - $ /opt/nutchwax-0.12.8/bin/nutch index indexes crawldb linkdb segments/* + $ /opt/nutchwax-0.12.9/bin/nutchwax import ../manifest + $ /opt/nutchwax-0.12.9/bin/nutchwax index indexes segments/* $ ls -F1 - crawldb/ indexes/ - linkdb/ segments/ To those already familiar with Nutch, these steps should be quite @@ -96,7 +92,7 @@ $ cd ../ $ ls -F1 crawl/ - $ /opt/nutchwax-0.12.8/bin/nutchwax search computer + $ /opt/nutchwax-0.12.9/bin/nutchwax search computer This calls the NutchWaxBean to execute a simple keyword search for "computer". Use whatever query term you think appears in the @@ -109,7 +105,7 @@ The Nutch(WAX) web application is bundled with NutchWAX as - /opt/nutchwax-0.12.8/nutch-1.0-dev.war + /opt/nutchwax-0.12.9/nutch-1.0-dev.war Simply deploy that web application in the same fashion as with Nutch. 
Modified: tags/nutchwax-0_12_9/archive/INSTALL.txt =================================================================== --- tags/nutchwax-0_12_9/archive/INSTALL.txt 2010-01-13 00:14:06 UTC (rev 2946) +++ tags/nutchwax-0_12_9/archive/INSTALL.txt 2010-01-13 00:26:44 UTC (rev 2947) @@ -1,6 +1,6 @@ INSTALL.txt -2009-09-18 +2010-01-13 Aaron Binns Table of Contents @@ -63,10 +63,10 @@ ------------------ As mentioned above, NutchWAX 0.12 is built against Nutch-1.0-dev. Although the Nutch project released 1.0 in early 2009, there were so -many changes that NutchWAX 0.12.8 is still built against pre-1.0 +many changes that NutchWAX 0.12.9 is still built against pre-1.0 codebase. -The specific SVN revision that NutchWAX 0.12.8 is built against is: +The specific SVN revision that NutchWAX 0.12.9 is built against is: 701524 @@ -81,14 +81,14 @@ SVN: NutchWAX ------------- -Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.8 +Once you have Nutch-1.0-dev checked-out, check-out NutchWAX 0.12.9 source into Nutch's "contrib" directory. $ cd contrib $ svn checkout http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_8/archive This will create a sub-directory named "archive" containing the -NutchWAX 0.12.8 sources. +NutchWAX 0.12.9 sources. 
Build and install ----------------- @@ -115,7 +115,7 @@ $ cd /opt $ tar xvfz nutch-1.0-dev.tar.gz - $ mv nutch-1.0-dev nutchwax-0.12.8 + $ mv nutch-1.0-dev nutchwax-0.12.9 ====================================================================== @@ -128,24 +128,24 @@ Install it simply by untarring it, for example: $ cd /opt - $ tar xvfz nutchwax-0.12.8.tar.gz + $ tar xvfz nutchwax-0.12.9.tar.gz ====================================================================== Install start-up scripts ====================================================================== -NutchWAX 0.12.8 comes with a Unix init.d script which can be used to +NutchWAX 0.12.9 comes with a Unix init.d script which can be used to automatically start the searcher slaves for a multi-node search configuration. Assuming you installed NutchWAX as - /opt/nutchwax-0.12.8 + /opt/nutchwax-0.12.9 the script is found at - /opt/nutchwax-0.12.8/contrib/archive/etc/init.d/searcher-slave + /opt/nutchwax-0.12.9/contrib/archive/etc/init.d/searcher-slave This script can be placed in /etc/init.d then added to the list of startup scripts to run at bootup by using commands appropriate to your Modified: tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt =================================================================== --- tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt 2010-01-13 00:14:06 UTC (rev 2946) +++ tags/nutchwax-0_12_9/archive/RELEASE-NOTES.txt 2010-01-13 00:26:44 UTC (rev 2947) @@ -1,70 +1,50 @@ RELEASE-NOTES.TXT -2009-09-18 +2010-01-13 Aaron Binns -Release notes for NutchWAX 0.12.8 +Release notes for NutchWAX 0.12.9 For the most recent updates and information on NutchWAX, please visit the project wiki at: - http://webteam.archive.org/confluence/display/search/NutchWAX + http://webarchive.jira.com/wiki/display/search/NutchWAX - ====================================================================== Overview ====================================================================== -The main enhancement in NutchWAX 0.12.8 
is the ability to configure -HTTP headers to support caching. +The main enhancement in NutchWAX 0.12.9 is the ability to search +indexes created with NutchWAX 0.10. -The Archive is starting to use Squid to cache the HTTP responses from -NutchWAX and some explicit HTTP response headers were needed to enable -this. Rather than relying on the servlet container (Tomcat/Jetty) to -add the response headers, we added a servlet filter to NutchWAX. +In the segments directory, create a "versions" file and in it +list the names of the segments and their version, e.g. -Right now the filter is very basic, in the web.xml file we now have + foo-segment 10 + bar-segment 12 - <filter> - <filter-name>Cache Settings</filter-name> - <filter-class>org.archive.nutchwax.CacheSettingsFilter</filter-class> - <init-param> - <param-name>max-age</param-name> - <param-value>259200</param-value> <!-- 72 hours (in seconds) --> - </init-param> - </filter> +where the version number is either 10 or 12. If a segment is not +listed in the "versions" file, it is assumed to be version 12. - <filter-mapping> - <filter-name>Cache Settings</filter-name> - <servlet-name>OpenSearch</servlet-name> - </filter-mapping> +Also, a minor, but convenient enhancement is to no longer require the +crawldb and linkdb to be present at index time. Neither one of these +are actually used for indexing and the fact that they were required to +be given to the index step was a legecy of Nutch. Now, there is a +NutchWAX 'index' command which only requires the segment(s) to be +present. -which configures the filter to add a 'max-age' header with a 72 hour -limit. This filter is then applied to all instances of the OpenSearch -servlet. - -This allows browsers to cache the OpenSearch response for up to 72 -hours. It also enables any proxies between the browser and server to -cache the response as well. With the addition of Squid into our -deployment, we let Squid serve cached responses to repeat queries. 
- -Since our deployment updates every 4 days, a 72-hour expiration works -well. - ====================================================================== Issues ====================================================================== For an up-to-date list of NutchWAX issues: - http://webteam.archive.org/jira/browse/WAX + http://webarchive.jira.com/browse/WAX Issues resolved in this release: -WAX-61 Change mime-type of OpenSearch XML response from text/xml to - application/xml. +WAX-66 Index documents without crawldb nor linkdb. -WAX-62 Add ability to configure HTTP headers to support caching. +WAX-67 Nutch OpenOffice parser does not pass along metadata. -WAX-63 LengthNormUpdater returning error code if no fields in index - have norms is inconvenient. +WAX-68 Compatibility with {index+segment}s created by NutchWAX 0.10. This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2010-01-12 22:27:24
Revision: 2945 http://archive-access.svn.sourceforge.net/archive-access/?rev=2945&view=rev Author: bradtofel Date: 2010-01-12 22:27:18 +0000 (Tue, 12 Jan 2010) Log Message: ----------- FEATURE: Added identity flag to incoming requests - the intention being to allow clients to explicitly request a raw copy of archived docs. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlRequestParser.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/requestparser/ReplayRequestParser.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java Added Paths: ----------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/selector/IdentityRequestSelector.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlRequestParser.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlRequestParser.java 2010-01-12 22:24:04 UTC (rev 2944) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlRequestParser.java 2010-01-12 22:27:18 UTC (rev 2945) @@ -59,6 +59,10 @@ */ public final static String IMG_CONTEXT = "im"; /** + * raw/identity context + */ + public final static String IDENTITY_CONTEXT = "id"; + /** * Charset detection strategy context - should be followed by an integer * indicating which strategy to use */ Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/requestparser/ReplayRequestParser.java =================================================================== --- 
trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/requestparser/ReplayRequestParser.java 2010-01-12 22:24:04 UTC (rev 2944) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/requestparser/ReplayRequestParser.java 2010-01-12 22:27:18 UTC (rev 2945) @@ -124,6 +124,8 @@ wbRequest.setJSContext(true); } else if(flag.equals(ArchivalUrlRequestParser.IMG_CONTEXT)) { wbRequest.setIMGContext(true); + } else if(flag.equals(ArchivalUrlRequestParser.IDENTITY_CONTEXT)) { + wbRequest.setIdentityContext(true); } else if(flag.startsWith(ArchivalUrlRequestParser.CHARSET_MODE)) { String modeString = flag.substring( ArchivalUrlRequestParser.CHARSET_MODE.length()); Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java 2010-01-12 22:24:04 UTC (rev 2944) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/core/WaybackRequest.java 2010-01-12 22:27:18 UTC (rev 2945) @@ -262,6 +262,12 @@ public static final String REQUEST_IMAGE_CONTEXT = "imagecontext"; /** + * Request: Identity context requested (totally transparent) + */ + public static final String REQUEST_IDENTITY_CONTEXT = "identitycontext"; + + + /** * Request: Charset detection mode */ public static final String REQUEST_CHARSET_MODE = "charsetmode"; @@ -488,6 +494,7 @@ this.exclusionFilter = exclusionFilter; } + @Deprecated public ObjectFilter<CaptureSearchResult> getResultFilters() { ObjectFilterChain<CaptureSearchResult> tmpFilters = new ObjectFilterChain<CaptureSearchResult>(); @@ -772,6 +779,13 @@ return getBoolean(REQUEST_IMAGE_CONTEXT); } + public void setIdentityContext(boolean isIdentityContext) { + 
setBoolean(REQUEST_IDENTITY_CONTEXT,isIdentityContext); + } + public boolean isIdentityContext() { + return getBoolean(REQUEST_IDENTITY_CONTEXT); + } + public void setCharsetMode(int mode) { setInt(REQUEST_CHARSET_MODE,mode); } Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/selector/IdentityRequestSelector.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/selector/IdentityRequestSelector.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/selector/IdentityRequestSelector.java 2010-01-12 22:27:18 UTC (rev 2945) @@ -0,0 +1,48 @@ +/* IdentityRequestSelector + * + * $Id$: + * + * Created on Dec 17, 2009. + * + * Copyright (C) 2006 Internet Archive. + * + * This file is part of Wayback. + * + * Wayback is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or + * any later version. + * + * Wayback is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser Public License + * along with Wayback; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +package org.archive.wayback.replay.selector; + +import org.archive.wayback.core.CaptureSearchResult; +import org.archive.wayback.core.Resource; +import org.archive.wayback.core.WaybackRequest; + +/** + * @author brad + * + */ +public class IdentityRequestSelector extends BaseReplayRendererSelector { + + /* (non-Javadoc) + * @see org.archive.wayback.replay.selector.BaseReplayRendererSelector#canHandle(org.archive.wayback.core.WaybackRequest, org.archive.wayback.core.CaptureSearchResult, org.archive.wayback.core.Resource) + */ + @Override + public boolean canHandle(WaybackRequest wbRequest, + CaptureSearchResult result, Resource resource) { + return wbRequest.isIdentityContext(); + } + + +} Property changes on: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/selector/IdentityRequestSelector.java ___________________________________________________________________ Added: svn:keywords + Author Date Revision Id This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
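The path of the new flag through r2945 can be sketched in isolation: the parser sees the flag string "id", sets a boolean on the request, and the selector later reads that boolean back. The class below is a hypothetical stand-in, not Wayback code. Only the constant values and the setter/getter/canHandle shapes come from the diff above; storing the booleans as strings in a map is an assumption made so the sketch is self-contained.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, stripped-down stand-in for WaybackRequest (assumption).
final class IdentityFlagSketch {
    // Flag strings as declared in ArchivalUrlRequestParser in the diff.
    static final String IMG_CONTEXT = "im";
    static final String IDENTITY_CONTEXT = "id";
    // Request keys as declared in WaybackRequest in the diff.
    static final String REQUEST_IMAGE_CONTEXT = "imagecontext";
    static final String REQUEST_IDENTITY_CONTEXT = "identitycontext";

    // Assumption: booleans are kept as strings in a simple map.
    private final Map<String, String> filters = new HashMap<String, String>();

    void setBoolean(String key, boolean value) {
        filters.put(key, Boolean.toString(value));
    }

    boolean getBoolean(String key) {
        // parseBoolean(null) is false, so an unset key reads as false.
        return Boolean.parseBoolean(filters.get(key));
    }

    // Mirrors two branches of the else-if chain in ReplayRequestParser.
    void applyFlag(String flag) {
        if (flag.equals(IMG_CONTEXT)) {
            setBoolean(REQUEST_IMAGE_CONTEXT, true);
        } else if (flag.equals(IDENTITY_CONTEXT)) {
            setBoolean(REQUEST_IDENTITY_CONTEXT, true);
        }
    }

    boolean isIdentityContext() {
        return getBoolean(REQUEST_IDENTITY_CONTEXT);
    }

    // Same decision as IdentityRequestSelector.canHandle(), without the
    // CaptureSearchResult and Resource arguments it ignores.
    static boolean canHandle(IdentityFlagSketch request) {
        return request.isIdentityContext();
    }
}
```

A request that never saw the "id" flag is not selected; one that did is selected regardless of the capture or resource, which matches the canHandle() body in the diff.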
From: <bra...@us...> - 2010-01-12 22:24:11
Revision: 2944 http://archive-access.svn.sourceforge.net/archive-access/?rev=2944&view=rev Author: bradtofel Date: 2010-01-12 22:24:04 +0000 (Tue, 12 Jan 2010) Log Message: ----------- FEATURE: added copyStream() methods to drain bytes from an InputStream to an OutputStream Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/ByteOp.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/ByteOp.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/ByteOp.java 2010-01-12 22:17:44 UTC (rev 2943) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/ByteOp.java 2010-01-12 22:24:04 UTC (rev 2944) @@ -24,7 +24,13 @@ */ package org.archive.wayback.util; +import java.io.IOException; +import java.io.InputStream; +import java.io.OutputStream; + public class ByteOp { + public final static int BUFFER_SIZE = 4096; + public static byte[] copy(byte[] src, int offset, int length) { byte[] copy = new byte[length]; System.arraycopy(src, offset, copy, 0, length); @@ -41,4 +47,28 @@ } return true; } + /** + * Write all bytes from is to os. Does not close either stream. + * @param is to copy bytes from + * @param os to copy bytes to + * @throws IOException for usual reasons + */ + public static void copyStream(InputStream is, OutputStream os) + throws IOException { + copyStream(is,os,BUFFER_SIZE); + } + /** + * Write all bytes from is to os. Does not close either stream. 
+ * @param is to copy bytes from + * @param os to copy bytes to + * @param size number of bytes to buffer between read and write operations + * @throws IOException for usual reasons + */ + public static void copyStream(InputStream is, OutputStream os, int size) + throws IOException { + byte[] buffer = new byte[size]; + for (int r = -1; (r = is.read(buffer, 0, size)) != -1;) { + os.write(buffer, 0, r); + } + } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
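For readers who want to try the copy loop from r2944 outside Wayback, here is a minimal standalone sketch. The class name CopyStreamSketch and the roundTrip helper are hypothetical; the two copyStream methods follow the diff above, with the for-loop written as an equivalent while-loop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for ByteOp so the sketch compiles on its own.
final class CopyStreamSketch {
    static final int BUFFER_SIZE = 4096;

    // Drains is into os through a buffer of the given size.
    // Closes neither stream. size must be positive.
    static void copyStream(InputStream is, OutputStream os, int size)
            throws IOException {
        byte[] buffer = new byte[size];
        int r;
        while ((r = is.read(buffer, 0, size)) != -1) {
            os.write(buffer, 0, r);
        }
    }

    // Same copy with the default buffer size.
    static void copyStream(InputStream is, OutputStream os)
            throws IOException {
        copyStream(is, os, BUFFER_SIZE);
    }

    // Helper (not in the diff): copies data through in-memory streams
    // and returns what arrived on the output side.
    static byte[] roundTrip(byte[] data, int bufferSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copyStream(new ByteArrayInputStream(data), out, bufferSize);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

Because read() may return fewer bytes than requested, the loop writes only the r bytes actually read on each pass. Callers remain responsible for closing both streams, as the Javadoc in the diff states.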
From: <bi...@us...> - 2010-01-12 22:17:50
Revision: 2943
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2943&view=rev
Author:   binzino
Date:     2010-01-12 22:17:44 +0000 (Tue, 12 Jan 2010)

Log Message:
-----------
WAX-69. Comment out code that writes crawl_data.

Modified Paths:
--------------
    trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java

Modified: trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java
===================================================================
--- trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2010-01-11 21:46:57 UTC (rev 2942)
+++ trunk/archive-access/projects/nutchwax/archive/src/java/org/archive/nutchwax/Importer.java 2010-01-12 22:17:44 UTC (rev 2943)
@@ -467,7 +467,14 @@
       try
         {
-          output.collect( key, new NutchWritable( datum ) );
+          // Some weird problem with Hadoop 0.19.x - when the crawl_data
+          // is merged during the reduce step, the classloader cannot
+          // find the org.apache.nutch.protocol.ProtocolStatus class.
+          //
+          // We avoid the whole issue by omitting the crawl_data all
+          // together, which we don't use anyways.
+          //
+          // output.collect( key, new NutchWritable( datum ) );
           if ( jobConf.getBoolean( "nutchwax.import.store.content", false ) )
             {

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <bra...@us...> - 2010-01-11 21:47:07
Revision: 2942
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2942&view=rev
Author:   bradtofel
Date:     2010-01-11 21:46:57 +0000 (Mon, 11 Jan 2010)

Log Message:
-----------
Adding explicit error handler so stack traces aren't exposed to users.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/web.xml

Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/web.xml
===================================================================
--- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/web.xml 2009-12-22 05:15:56 UTC (rev 2941)
+++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/web.xml 2010-01-11 21:46:57 UTC (rev 2942)
@@ -52,4 +52,11 @@
     <realm-name>Secured-Wayback</realm-name>
   </login-config>
 -->
+
+
+  <error-page>
+    <exception-type>java.lang.Exception</exception-type>
+    <location>/WEB-INF/exception/HTMLError.jsp</location>
+  </error-page>
+
 </web-app>

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
From: <bra...@us...> - 2009-12-22 05:16:06
Revision: 2941 http://archive-access.svn.sourceforge.net/archive-access/?rev=2941&view=rev Author: bradtofel Date: 2009-12-22 05:15:56 +0000 (Tue, 22 Dec 2009) Log Message: ----------- Sending File not String to ArchiveReaderFactory.get() methods Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/ArcIndexer.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/ArcIndexer.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/ArcIndexer.java 2009-12-18 00:34:47 UTC (rev 2940) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/ArcIndexer.java 2009-12-22 05:15:56 UTC (rev 2941) @@ -69,7 +69,12 @@ */ public CloseableIterator<CaptureSearchResult> iterator(String pathOrUrl) throws IOException { - return iterator(ARCReaderFactory.get(pathOrUrl)); + File f = new File(pathOrUrl); + if(f.isFile()) { + return iterator(ARCReaderFactory.get(f)); + } else { + return iterator(ARCReaderFactory.get(pathOrUrl)); + } } /** Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java 2009-12-18 00:34:47 UTC (rev 2940) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourcestore/indexer/WarcIndexer.java 2009-12-22 05:15:56 UTC (rev 2941) @@ -71,7 +71,12 @@ */ public CloseableIterator<CaptureSearchResult> iterator(String 
pathOrUrl) throws IOException { - return iterator(WARCReaderFactory.get(pathOrUrl)); + File f = new File(pathOrUrl); + if(f.isFile()) { + return iterator(WARCReaderFactory.get(f)); + } else { + return iterator(WARCReaderFactory.get(pathOrUrl)); + } } /** * @param arc This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2940
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2940&view=rev
Author:   bradtofel
Date:     2009-12-18 00:34:47 +0000 (Fri, 18 Dec 2009)

Log Message:
-----------
BUGFIX(ACC-89): now explicitly declare valid hostname characters, to try to ensure a legitimate match..

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/transformer/JSStringTransformer.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/transformer/JSStringTransformer.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/transformer/JSStringTransformer.java 2009-12-18 00:33:13 UTC (rev 2939)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/transformer/JSStringTransformer.java 2009-12-18 00:34:47 UTC (rev 2940)
@@ -38,7 +38,7 @@
  */
 public class JSStringTransformer implements StringTransformer {
     private final static Pattern httpPattern = Pattern
-            .compile("(http://[^/]*/)");
+            .compile("(http://[A-Za-z0-9:_@.-]+)");
     public String transform(ReplayParseContext context, String input) {

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
Revision: 2939
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2939&view=rev
Author:   bradtofel
Date:     2009-12-18 00:33:13 +0000 (Fri, 18 Dec 2009)

Log Message:
-----------
BUGFIX(ACC-89): now explicitly declare valid hostname characters, to try to ensure a legitimate match..

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlJSReplayRenderer.java

Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlJSReplayRenderer.java
===================================================================
--- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlJSReplayRenderer.java 2009-12-09 06:50:07 UTC (rev 2938)
+++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlJSReplayRenderer.java 2009-12-18 00:33:13 UTC (rev 2939)
@@ -63,7 +63,7 @@
     }
     private final static Pattern httpPattern = Pattern
-            .compile("(http://[^/]*/)");
+            .compile("(http://[A-Za-z0-9:_@.-]+)");
     protected void updatePage(TextDocument page,
             HttpServletRequest httpRequest, HttpServletResponse httpResponse,

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
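The pattern change in r2939 and r2940 is easiest to see on a concrete input. The sketch below compiles both expressions exactly as they appear in the diffs and returns the first match of each. The class name, the helper, and the sample JavaScript string are assumptions chosen for illustration; the actual input reported in ACC-89 is not shown in these messages.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; only the two pattern strings come from the diffs.
final class HostPatternSketch {
    // Before: "http://" followed by anything up to and including the next slash.
    static final Pattern OLD = Pattern.compile("(http://[^/]*/)");
    // After: "http://" followed only by characters from an explicit host class.
    static final Pattern NEW = Pattern.compile("(http://[A-Za-z0-9:_@.-]+)");

    // Returns group 1 of the first match of p in input, or null if none.
    static String firstMatch(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group(1) : null;
    }
}
```

Given a string such as `var a = "http://example.com"; var b = "img/x.png";`, the old pattern keeps consuming past the closing quote until it reaches the slash in `img/`, while the new pattern stops at the first character outside the host class and yields `http://example.com`. Note that the new pattern no longer includes the trailing slash in the match.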
From: <bra...@us...> - 2009-12-09 06:50:16
Revision: 2938 http://archive-access.svn.sourceforge.net/archive-access/?rev=2938&view=rev Author: bradtofel Date: 2009-12-09 06:50:07 +0000 (Wed, 09 Dec 2009) Log Message: ----------- INITIAL REV: SearchResultSource composed of a series of alphabetically partitioned ziplined CDX files. Added Paths: ----------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/StringPrefixIterator.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinedBlock.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesChunkIterator.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesSearchResultSource.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesSearchResultSourceTest.java Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/StringPrefixIterator.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/StringPrefixIterator.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/StringPrefixIterator.java 2009-12-09 06:50:07 UTC (rev 2938) @@ -0,0 +1,90 @@ +/* StringPrefixIterator + * + * $Id$: + * + * Created on Nov 23, 2009. + * + * Copyright (C) 2006 Internet Archive. + * + * This file is part of Wayback. 
+ * + * Wayback is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or + * any later version. + * + * Wayback is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser Public License for more details. + * + * You should have received a copy of the GNU Lesser Public License + * along with Wayback; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +package org.archive.wayback.resourceindex.ziplines; + +import java.io.IOException; +import java.util.Iterator; + +import org.archive.wayback.util.CloseableIterator; + +/** + * @author brad + * + */ +public class StringPrefixIterator implements CloseableIterator<String> { + private String prefix = null; + Iterator<String> inner = null; + private String cachedNext = null; + private boolean done = false; + public StringPrefixIterator(Iterator<String> inner, String prefix) { + this.prefix = prefix; + this.inner = inner; + } + /* (non-Javadoc) + * @see java.util.Iterator#hasNext() + */ + public boolean hasNext() { + if(done) return false; + if(cachedNext != null) { + return true; + } + while(inner.hasNext()) { + String tmp = inner.next(); + if(tmp.startsWith(prefix)) { + cachedNext = tmp; + return true; + } else if(tmp.compareTo(prefix) > 0) { + done = true; + return false; + } + } + return false; + } + /* (non-Javadoc) + * @see java.util.Iterator#next() + */ + public String next() { + String tmp = cachedNext; + cachedNext = null; + return tmp; + } + /* (non-Javadoc) + * @see java.util.Iterator#remove() + */ + public void remove() { + // TODO Auto-generated method stub + + } + /* (non-Javadoc) + * @see java.io.Closeable#close() + */ + public void close() throws IOException { + 
if(inner instanceof CloseableIterator) { + CloseableIterator<String> toBeClosed = (CloseableIterator<String>) inner; + toBeClosed.close(); + } + } +} Property changes on: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/StringPrefixIterator.java ___________________________________________________________________ Added: svn:keywords + Author Date Revision Id Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinedBlock.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinedBlock.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinedBlock.java 2009-12-09 06:50:07 UTC (rev 2938) @@ -0,0 +1,68 @@ +/* ZiplinedBlock + * + * $Id$: + * + * Created on Nov 23, 2009. + * + * Copyright (C) 2006 Internet Archive. + * + * This file is part of Wayback. + * + * Wayback is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or + * any later version. + * + * Wayback is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser Public License + * along with Wayback; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +package org.archive.wayback.resourceindex.ziplines; + +import java.io.BufferedReader; +import java.io.IOException; +import java.io.InputStreamReader; +import java.net.URL; +import java.net.URLConnection; +import java.util.zip.GZIPInputStream; + +/** + * @author brad + * + */ +public class ZiplinedBlock { + String urlOrPath = null; + long offset = -1; + public final static int BLOCK_SIZE = 128 * 1024; + private final static String RANGE_HEADER = "Range"; + private final static String BYTES_HEADER = "bytes="; + private final static String BYTES_MINUS = "-"; + /** + * @param urlOrPath URL where this file can be downloaded + * @param offset start of 128K block boundary. + */ + public ZiplinedBlock(String urlOrPath, long offset) { + this.urlOrPath = urlOrPath; + this.offset = offset; + } + /** + * @return a BufferedReader of the underlying compressed data in this block + * @throws IOException for usual reasons + */ + public BufferedReader readBlock() throws IOException { + URL u = new URL(urlOrPath); + URLConnection uc = u.openConnection(); + StringBuilder sb = new StringBuilder(16); + sb.append(BYTES_HEADER).append(offset).append(BYTES_MINUS); + sb.append((offset + BLOCK_SIZE)-1); + uc.setRequestProperty(RANGE_HEADER, sb.toString()); + return new BufferedReader(new InputStreamReader( + new GZIPInputStream(uc.getInputStream()))); + } +} Property changes on: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinedBlock.java ___________________________________________________________________ Added: svn:keywords + Author Date Revision Id Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesChunkIterator.java 
=================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesChunkIterator.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesChunkIterator.java 2009-12-09 06:50:07 UTC (rev 2938) @@ -0,0 +1,151 @@ +/* ZiplinesChunkIterator + * + * $Id$: + * + * Created on Nov 23, 2009. + * + * Copyright (C) 2006 Internet Archive. + * + * This file is part of Wayback. + * + * Wayback is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or + * any later version. + * + * Wayback is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser Public License + * along with Wayback; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +package org.archive.wayback.resourceindex.ziplines; + +import java.io.BufferedReader; +import java.io.File; +import java.io.FileInputStream; +import java.io.IOException; +import java.io.InputStream; +import java.io.InputStreamReader; +import java.io.RandomAccessFile; +import java.util.Iterator; +import java.util.List; +import java.util.RandomAccess; +import java.util.zip.GZIPInputStream; + +import org.archive.wayback.util.CloseableIterator; + +/** + * @author brad + * + */ +public class ZiplinesChunkIterator implements CloseableIterator<String> { + private BufferedReader br = null; + private Iterator<ZiplinedBlock> blockItr = null; + private String cachedNext = null; + /** + * @param blocks which should be fetched and unzipped, one after another + */ + public ZiplinesChunkIterator(List<ZiplinedBlock> blocks) { + blockItr = blocks.iterator(); + } + /* (non-Javadoc) + * @see java.util.Iterator#hasNext() + */ + public boolean hasNext() { + if(cachedNext != null) { + return true; + } + while(cachedNext == null) { + if(br != null) { + // attempt to read the next line from this: + try { + cachedNext = br.readLine(); + if(cachedNext == null) { + br = null; + // next loop: + } else { + return true; + } + } catch (IOException e) { + e.printStackTrace(); + br = null; + } + } else { + // do we have more blocks to use? 
+ if(blockItr.hasNext()) { + try { + br = blockItr.next().readBlock(); + } catch (IOException e) { + e.printStackTrace(); + } + } else { + return false; + } + } + } + + return false; + } + + /* (non-Javadoc) + * @see java.util.Iterator#next() + */ + public String next() { + String tmp = cachedNext; + cachedNext = null; + return tmp; + } + + /* (non-Javadoc) + * @see java.util.Iterator#remove() + */ + public void remove() { + throw new UnsupportedOperationException(); + } + + /* (non-Javadoc) + * @see java.io.Closeable#close() + */ + public void close() throws IOException { + if(br != null) { + br.close(); + } + } + public static void main(String[] args) { + if(args.length != 1) { + System.err.println("Usage: ZIPLINES_PATH"); + System.exit(1); + } + File f = new File(args[0]); + long size = f.length(); + long numBlocks = (long) (size / ZiplinedBlock.BLOCK_SIZE); + long size2 = numBlocks * ZiplinedBlock.BLOCK_SIZE; + if(size != size2) { + System.err.println("File size of " + args[0] + " is not a mulitple" + + " of " + ZiplinedBlock.BLOCK_SIZE); + } + try { + RandomAccessFile raf = new RandomAccessFile(f, "r"); + for(int i = 0; i < numBlocks; i++) { + long offset = i * ZiplinedBlock.BLOCK_SIZE; + raf.seek(offset); + BufferedReader br = new BufferedReader(new InputStreamReader( + new GZIPInputStream(new FileInputStream(raf.getFD())))); + String line = br.readLine(); + if(line == null) { + System.err.println("Bad block at " + offset + " in " + args[0]); + System.exit(1); + } + System.out.println(args[0] + " " + offset + " " + line); + } + } catch (IOException e) { + e.printStackTrace(); + System.exit(1); + } + } +} Property changes on: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesChunkIterator.java ___________________________________________________________________ Added: svn:keywords + Author Date Revision Id Added: 
trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesSearchResultSource.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesSearchResultSource.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesSearchResultSource.java 2009-12-09 06:50:07 UTC (rev 2938) @@ -0,0 +1,218 @@ +/* ZiplinesSearchResultSource + * + * $Id$: + * + * Created on Nov 23, 2009. + * + * Copyright (C) 2006 Internet Archive. + * + * This file is part of Wayback. + * + * Wayback is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or + * any later version. + * + * Wayback is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser Public License + * along with Wayback; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +package org.archive.wayback.resourceindex.ziplines; + +import it.unimi.dsi.mg4j.util.FrontCodedStringList; + +import java.io.BufferedReader; +import java.io.FileReader; +import java.io.IOException; +import java.util.ArrayList; +import java.util.HashMap; +import java.util.Iterator; + +import org.archive.wayback.core.CaptureSearchResult; +import org.archive.wayback.exception.ResourceIndexNotAvailableException; +import org.archive.wayback.resourceindex.SearchResultSource; +import org.archive.wayback.resourceindex.cdx.CDXFormatToSearchResultAdapter; +import org.archive.wayback.resourceindex.cdx.format.CDXFormat; +import org.archive.wayback.resourceindex.cdx.format.CDXFormatException; +import org.archive.wayback.util.AdaptedIterator; +import org.archive.wayback.util.CloseableIterator; +import org.archive.wayback.util.flatfile.FlatFile; + +/** + * A set of Ziplines files, which are CDX files specially compressed into a + * series of GZipMembers such that: + * + * 1) each member is exactly 128K, padded using a GZip comment header + * 2) each member contains complete lines: no line spans two GZip members + * + * If the data put into these files is sorted, then the data within the files + * can be uncompressed when needed, minimizing the total data to be uncompressed + * + * This SearchResultSource assumes a set of alphabetically partitioned Ziplined + * CDX files, so that each file is sorted, and no regions overlap. + * + * This class takes 2 files as input: + * 1) a specially constructed map of the first N bytes of data from each GZip + * member, and the filename and offset of that GZip member. 
+ * 2) a mapping of filenames to URLs + * + * Data from #1 is actually stored in a serialized + * + * + * + * @author brad + * + */ +public class ZiplinesSearchResultSource implements SearchResultSource { + + /** + * Local path containing map of URL,TIMESTAMP,CHUNK,OFFSET for each 128K chunk + */ + private String chunkIndexPath = null; + private FlatFile chunkIndex = null; + /** + * Local path containing URL for each CHUNK + */ + private String chunkMapPath = null; + private HashMap<String,String> chunkMap = null; + private CDXFormat format = null; + + public ZiplinesSearchResultSource() { + } + public ZiplinesSearchResultSource(CDXFormat format) { + this.format = format; + } + public void init() throws IOException { + chunkMap = new HashMap<String, String>(); + FlatFile ff = new FlatFile(chunkMapPath); + Iterator<String> lines = ff.getSequentialIterator(); + while(lines.hasNext()) { + String line = lines.next(); + String[] parts = line.split("\\s"); + if(parts.length != 2) { + throw new IOException("Bad line(" + line +") in (" + + chunkMapPath + ")"); + } + chunkMap.put(parts[0],parts[1]); + } + chunkIndex = new FlatFile(chunkIndexPath); + } + protected CloseableIterator<CaptureSearchResult> adaptIterator(Iterator<String> itr) + throws IOException { + return new AdaptedIterator<String,CaptureSearchResult>(itr, + new CDXFormatToSearchResultAdapter(format)); + } + + /* (non-Javadoc) + * @see org.archive.wayback.resourceindex.SearchResultSource#cleanup(org.archive.wayback.util.CloseableIterator) + */ + public void cleanup(CloseableIterator<CaptureSearchResult> c) + throws IOException { + c.close(); + } + + /* (non-Javadoc) + * @see org.archive.wayback.resourceindex.SearchResultSource#getPrefixIterator(java.lang.String) + */ + public CloseableIterator<CaptureSearchResult> getPrefixIterator( + String prefix) throws ResourceIndexNotAvailableException { + try { + return adaptIterator(getStringPrefixIterator(prefix)); + } catch (IOException e) { + throw new 
ResourceIndexNotAvailableException(e.getMessage()); + } + } + + public Iterator<String> getStringPrefixIterator(String prefix) throws ResourceIndexNotAvailableException, IOException { + Iterator<String> itr = chunkIndex.getRecordIteratorLT(prefix); + ArrayList<ZiplinedBlock> blocks = new ArrayList<ZiplinedBlock>(); + boolean first = true; + while(itr.hasNext()) { + String blockDescriptor = itr.next(); + String parts[] = blockDescriptor.split("\t"); + if(parts.length != 3) { + throw new ResourceIndexNotAvailableException("Bad line(" + + blockDescriptor + ")"); + } + // only compare the correct length: + String prefCmp = prefix; + String blockCmp = parts[0]; +// if(prefCmp.length() < blockCmp.length()) { +// blockCmp = blockCmp.substring(0,prefCmp.length()); +// } else { +// prefCmp = prefCmp.substring(0,blockCmp.length()); +// } + if(first) { + // always add first: + first = false; +// } else if(blockCmp.compareTo(prefCmp) > 0) { + } else if(!blockCmp.startsWith(prefCmp)) { + // all done; + break; + } + // add this and keep lookin... + String url = chunkMap.get(parts[1]); + long offset = Long.parseLong(parts[2]); + blocks.add(new ZiplinedBlock(url, offset)); + } + return new StringPrefixIterator(new ZiplinesChunkIterator(blocks),prefix); + } + + /* (non-Javadoc) + * @see org.archive.wayback.resourceindex.SearchResultSource#getPrefixReverseIterator(java.lang.String) + */ + public CloseableIterator<CaptureSearchResult> getPrefixReverseIterator( + String prefix) throws ResourceIndexNotAvailableException { + throw new ResourceIndexNotAvailableException("unsupported op"); + } + + /* (non-Javadoc) + * @see org.archive.wayback.resourceindex.SearchResultSource#shutdown() + */ + public void shutdown() throws IOException { + // no-op.. 
+ } + /** + * @return the format + */ + public CDXFormat getFormat() { + return format; + } + /** + * @param format the format to set + */ + public void setFormat(CDXFormat format) { + this.format = format; + } + /** + * @return the chunkIndexPath + */ + public String getChunkIndexPath() { + return chunkIndexPath; + } + /** + * @param chunkIndexPath the chunkIndexPath to set + */ + public void setChunkIndexPath(String chunkIndexPath) { + this.chunkIndexPath = chunkIndexPath; + } + /** + * @return the chunkMapPath + */ + public String getChunkMapPath() { + return chunkMapPath; + } + /** + * @param chunkMapPath the chunkMapPath to set + */ + public void setChunkMapPath(String chunkMapPath) { + this.chunkMapPath = chunkMapPath; + } + +} Property changes on: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesSearchResultSource.java ___________________________________________________________________ Added: svn:keywords + Author Date Revision Id Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesSearchResultSourceTest.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesSearchResultSourceTest.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesSearchResultSourceTest.java 2009-12-09 06:50:07 UTC (rev 2938) @@ -0,0 +1,64 @@ +/* ZiplinesSearchResultSourceTest + * + * $Id$: + * + * Created on Nov 23, 2009. + * + * Copyright (C) 2006 Internet Archive. + * + * This file is part of Wayback. + * + * Wayback is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or + * any later version. 
+ * + * Wayback is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser Public License for more details. + * + * You should have received a copy of the GNU Lesser Public License + * along with Wayback; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +package org.archive.wayback.resourceindex.ziplines; + +import java.util.Iterator; + +import org.archive.wayback.resourceindex.cdx.format.CDXFormat; +import org.archive.wayback.resourceindex.cdx.format.CDXFormatException; + +import junit.framework.TestCase; + +/** + * @author brad + * + */ +public class ZiplinesSearchResultSourceTest extends TestCase { + + /** + * Test method for {@link org.archive.wayback.resourceindex.ziplines.ZiplinesSearchResultSource#getPrefixIterator(java.lang.String)}. + * @throws CDXFormatException + */ + public void testGetPrefixIterator() throws Exception { + CDXFormat format = new CDXFormat(" CDX N b a m s k r M V g"); + ZiplinesSearchResultSource zsrs = new ZiplinesSearchResultSource(format); +// zsrs.setChunkIndexPath("/home/brad/zipline-test/part-00005-frag.cdx.zlm"); +// zsrs.setChunkMapPath("/home/brad/zipline-test/manifest.txt"); + zsrs.setChunkIndexPath("/home/brad/ALL.summary"); + zsrs.setChunkMapPath("/home/brad/ALL.loc"); + zsrs.init(); + Iterator<String> i = zsrs.getStringPrefixIterator("krunch.com/ "); + int max = 100; + int done = 0; + while(i.hasNext()) { + System.out.println(i.next()); + if(done++ > max) { + break; + } + } + } + +} Property changes on: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/ziplines/ZiplinesSearchResultSourceTest.java ___________________________________________________________________ Added: svn:keywords + Author Date Revision Id This was sent by the SourceForge.net collaborative development 
platform, the world's largest Open Source development site. |
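A note on the block selection in getStringPrefixIterator() above: the first block returned by the chunk index is always kept, because records matching the prefix may begin partway through it, and later blocks are kept only while their start key still begins with the prefix. The following is a minimal in-memory sketch of that rule, not the committed code. The class and method names (BlockSelectionSketch, selectBlocks) are hypothetical, and plain strings stand in for the ZiplinedBlock objects.

```java
import java.util.ArrayList;
import java.util.List;

class BlockSelectionSketch {
    /**
     * keysFromFloor: block start keys in sorted order, beginning at the
     * block found by the "less than" lookup (an assumption of this sketch).
     */
    static List<String> selectBlocks(List<String> keysFromFloor, String prefix) {
        List<String> selected = new ArrayList<String>();
        boolean first = true;
        for (String key : keysFromFloor) {
            if (first) {
                // Always keep the first block: matches may start inside it.
                first = false;
            } else if (!key.startsWith(prefix)) {
                // A later block that starts beyond the prefix cannot hold matches.
                break;
            }
            selected.add(key);
        }
        return selected;
    }
}
```

With keys "a.com/x", "b.com/", "b.com/m", "c.com/" and prefix "b.com/", the sketch keeps the first three and stops at "c.com/".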
From: <bra...@us...> - 2009-12-09 06:48:35
|
Revision: 2937 http://archive-access.svn.sourceforge.net/archive-access/?rev=2937&view=rev Author: bradtofel Date: 2009-12-09 06:48:28 +0000 (Wed, 09 Dec 2009) Log Message: ----------- FEATURE: added method to allow searching for greatest line <= search term Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java 2009-12-09 06:47:35 UTC (rev 2936) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/util/flatfile/FlatFile.java 2009-12-09 06:48:28 UTC (rev 2937) @@ -126,6 +126,39 @@ fh.seek(min); return min; } + public long findKeyOffsetLT(RandomAccessFile fh, String key) throws IOException { + int blockSize = 8192; + long fileSize = fh.length(); + long min = 0; + long max = (long) fileSize / blockSize; + long mid; + String line; + while (max - min > 1) { + mid = min + (long)((max - min) / 2); + fh.seek(mid * blockSize); + if(mid > 0) line = fh.readLine(); // probably a partial line + line = fh.readLine(); + if (key.compareTo(line) > 0) { + min = mid; + } else { + max = mid; + } + } + // find the right line + min = min * blockSize; + fh.seek(min); + if(min > 0) line = fh.readLine(); + long last = min; + while(true) { + min = fh.getFilePointer(); + line = fh.readLine(); + if(line == null) break; + if(line.compareTo(key) >= 0) break; + last = min; + } + fh.seek(last); + return last; + } /** * @return Returns the lastMatchOffset. 
*/ @@ -157,6 +190,16 @@ return itr; } + public Iterator<String> getRecordIteratorLT(final String prefix) throws IOException { + RecordIterator itr = null; + RandomAccessFile raf = new RandomAccessFile(file,"r"); + long offset = findKeyOffsetLT(raf,prefix); + lastMatchOffset = offset; + BufferedReader br = new BufferedReader(new FileReader(raf.getFD())); + itr = new RecordIterator(br); + return itr; + } + /** * * @param prefix This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
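A note on the semantics of findKeyOffsetLT() in r2937: the final scan stops at the first line that compares >= the key, so the offset returned is that of the greatest line strictly less than the key, falling back to the start of the scanned block when no such line exists. The sketch below shows the same idea over an in-memory sorted list instead of a RandomAccessFile. It is an illustration under that simplification, and the names FloorLineSearch and indexOfGreatestLessThan are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class FloorLineSearch {
    /**
     * Returns the index of the greatest line strictly less than key,
     * or 0 when no line is less than key (mirroring the fallback to
     * the first position).
     */
    static int indexOfGreatestLessThan(List<String> sortedLines, String key) {
        int lo = 0;
        int hi = sortedLines.size(); // search window is [lo, hi)
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (sortedLines.get(mid).compareTo(key) < 0) {
                lo = mid + 1; // mid is below key, answer is at mid or later
            } else {
                hi = mid;     // mid is >= key, answer is before mid
            }
        }
        // lo now counts the lines below key; step back to the last of them.
        return Math.max(0, lo - 1);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a.com/", "b.com/", "d.com/", "e.com/");
        System.out.println(indexOfGreatestLessThan(lines, "c.com/"));
    }
}
```

The caller is expected to pass a non-empty list; for an empty list the sketch returns 0, which is not a valid index.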
From: <bra...@us...> - 2009-12-09 06:47:47
|
Revision: 2936 http://archive-access.svn.sourceforge.net/archive-access/?rev=2936&view=rev Author: bradtofel Date: 2009-12-09 06:47:35 +0000 (Wed, 09 Dec 2009) Log Message: ----------- Hackery to get live web caching Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/URLCacher.java Added Paths: ----------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/ARCCachingProxy.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/FileRegion.java Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/ARCCachingProxy.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/ARCCachingProxy.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/ARCCachingProxy.java 2009-12-09 06:47:35 UTC (rev 2936) @@ -0,0 +1,157 @@ +/* ARCCachingProxy + * + * $Id$: + * + * Created on Dec 8, 2009. + * + * Copyright (C) 2006 Internet Archive. + * + * This file is part of Wayback. + * + * Wayback is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or + * any later version. + * + * Wayback is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser Public License + * along with Wayback; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +package org.archive.wayback.liveweb; + +import java.io.FileNotFoundException; +import java.io.IOException; +import java.io.OutputStream; +import java.io.PrintWriter; +import java.io.RandomAccessFile; +import java.net.URL; + +import javax.servlet.ServletException; +import javax.servlet.http.HttpServletRequest; +import javax.servlet.http.HttpServletResponse; + +import org.apache.log4j.Logger; +import org.archive.io.arc.ARCLocation; +import org.archive.io.arc.ARCRecord; +import org.archive.wayback.core.CaptureSearchResult; +import org.archive.wayback.core.Resource; +import org.archive.wayback.exception.LiveDocumentNotAvailableException; +import org.archive.wayback.resourcestore.resourcefile.ArcResource; +import org.archive.wayback.webapp.ServletRequestContext; + +/** + * @author brad + * + */ +public class ARCCachingProxy extends ServletRequestContext { + + private final static String EXPIRES_HEADER = "Expires"; + + private final static String ARC_RECORD_CONTENT_TYPE = "application/x-arc-record"; + private static final Logger LOGGER = Logger.getLogger( + ARCCachingProxy.class.getName()); + private ARCCacheDirectory arcCacheDir = null; + private URLCacher cacher = null; + private long expiresMS = 60 * 60 * 1000; + /* (non-Javadoc) + * @see org.archive.wayback.webapp.ServletRequestContext#handleRequest(javax.servlet.http.HttpServletRequest, javax.servlet.http.HttpServletResponse) + */ + @Override + public boolean handleRequest(HttpServletRequest httpRequest, + HttpServletResponse httpResponse) throws ServletException, + IOException { + + StringBuffer sb = httpRequest.getRequestURL(); + String query = httpRequest.getQueryString(); + if(query != null) { + sb.append("?").append(query); + } + URL url = new URL(sb.toString()); + FileRegion r = null; + try { + r = 
getLiveResource(url); + httpResponse.setStatus(httpResponse.SC_OK); + httpResponse.setContentLength((int)r.getLength()); + httpResponse.setContentType(ARC_RECORD_CONTENT_TYPE); + httpResponse.setDateHeader("Expires", System.currentTimeMillis() + expiresMS); + r.copyToOutputStream(httpResponse.getOutputStream()); + + } catch (LiveDocumentNotAvailableException e) { + + e.printStackTrace(); + httpResponse.sendError(httpResponse.SC_NOT_FOUND); + } +// httpResponse.setContentType("text/plain"); +// PrintWriter pw = httpResponse.getWriter(); +// pw.println("PathInfo:" + httpRequest.getPathInfo()); +// pw.println("RequestURI:" + httpRequest.getRequestURI()); +// pw.println("RequestURL:" + httpRequest.getRequestURL()); +// pw.println("QueryString:" + httpRequest.getQueryString()); +// pw.println("PathTranslated:" + httpRequest.getPathTranslated()); +// pw.println("ServletPath:" + httpRequest.getServletPath()); +// pw.println("ContextPath:" + httpRequest.getContextPath()); +// if(r != null) { +// pw.println("CachePath:" + r.file.getAbsolutePath()); +// pw.println("CacheStart:" + r.start); +// pw.println("CacheEnd:" + r.end); +// } else { +// pw.println("FAILED CACHE!"); +// } + + return true; + } + + + private FileRegion getLiveResource(URL url) + throws LiveDocumentNotAvailableException, IOException { + + Resource resource = null; + + LOGGER.info("Caching URL(" + url.toString() + ")"); + FileRegion region = cacher.cache2(arcCacheDir, url.toString()); + if(region != null) { + LOGGER.info("Cached URL(" + url.toString() + ") in " + + "ARC(" + region.file.getAbsolutePath() + ") at (" + + region.start + " - " + region.end + ")"); + + } else { + throw new IOException("No location!"); + } + + return region; +} + + /** + * @return the arcCacheDir + */ + public ARCCacheDirectory getArcCacheDir() { + return arcCacheDir; + } + + /** + * @param arcCacheDir the arcCacheDir to set + */ + public void setArcCacheDir(ARCCacheDirectory arcCacheDir) { + this.arcCacheDir = arcCacheDir; + } + 
+ /** + * @return the cacher + */ + public URLCacher getCacher() { + return cacher; + } + + /** + * @param cacher the cacher to set + */ + public void setCacher(URLCacher cacher) { + this.cacher = cacher; + } +} Property changes on: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/ARCCachingProxy.java ___________________________________________________________________ Added: svn:keywords + Author Date Revision Id Added: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/FileRegion.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/FileRegion.java (rev 0) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/FileRegion.java 2009-12-09 06:47:35 UTC (rev 2936) @@ -0,0 +1,62 @@ +/* FileRegion + * + * $Id$: + * + * Created on Dec 8, 2009. + * + * Copyright (C) 2006 Internet Archive. + * + * This file is part of Wayback. + * + * Wayback is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or + * any later version. + * + * Wayback is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU Lesser Public License for more details. 
+ * + * You should have received a copy of the GNU Lesser Public License + * along with Wayback; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +package org.archive.wayback.liveweb; + +import java.io.File; +import java.io.IOException; +import java.io.OutputStream; +import java.io.RandomAccessFile; + +/** + * @author brad + * + */ +public class FileRegion { + File file = null; + long start = -1; + long end = -1; + public long getLength() { + return end - start; + } + public void copyToOutputStream(OutputStream o) throws IOException { + long left = end - start; + int BUFF_SIZE = 4096; + byte buf[] = new byte[BUFF_SIZE]; + RandomAccessFile raf = new RandomAccessFile(file, "r"); + raf.seek(start); + while(left > 0) { + int amtToRead = (int) Math.min(left, BUFF_SIZE); + int amtRead = raf.read(buf, 0, amtToRead); + if(amtRead < 0) { + throw new IOException("Not enough to read! EOF before expected region end"); + } + o.write(buf,0,amtRead); + left -= amtRead; + } + raf.close(); + } + +} Property changes on: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/FileRegion.java ___________________________________________________________________ Added: svn:keywords + Author Date Revision Id Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/URLCacher.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/URLCacher.java 2009-12-01 23:21:59 UTC (rev 2935) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/liveweb/URLCacher.java 2009-12-09 06:47:35 UTC (rev 2936) @@ -156,21 +156,52 @@ writer.write(url,mime,ip,captureDate.getTime(),len,fis); writer.checkSize(); -// long newSize = writer.getPosition(); -// long oSize = writer.getFile().length(); + long newSize = 
writer.getPosition(); + long oSize = writer.getFile().length(); + final long arcEndOffset = oSize; LOGGER.info("Wrote " + url + " at " + arcPath + ":" + arcOffset); + LOGGER.info("NewSize:" + newSize + " oSize: " + oSize); fis.close(); return new ARCLocation() { private String filename = arcPath; private long offset = arcOffset; + private long endOffset = arcEndOffset; public String getName() { return this.filename; } - public long getOffset() { return this.offset; } + public long getEndOffset() { return this.endOffset; } + }; } + private FileRegion storeFile2(File file, ARCWriter writer, String url, + ExtendedGetMethod method) throws IOException { + + FileInputStream fis = new FileInputStream(file); + int len = (int) file.length(); + String mime = method.getMime(); + String ip = method.getRemoteIP(); + Date captureDate = method.getCaptureDate(); + + writer.checkSize(); + final long arcOffset = writer.getPosition(); + final String arcPath = writer.getFile().getAbsolutePath(); + writer.write(url,mime,ip,captureDate.getTime(),len,fis); + writer.checkSize(); + long newSize = writer.getPosition(); + long oSize = writer.getFile().length(); + final long arcEndOffset = oSize; + LOGGER.info("Wrote " + url + " at " + arcPath + ":" + arcOffset); + LOGGER.info("NewSize:" + newSize + " oSize: " + oSize); + fis.close(); + FileRegion fr = new FileRegion(); + fr.file = writer.getFile(); + fr.start = arcOffset; + fr.end = oSize; + return fr; + } + /** * Retrieve urlString, and store using ARCWriter, returning * ARCLocation where the document was stored. 
@@ -219,7 +250,44 @@ } return location; } + public FileRegion cache2(ARCCacheDirectory cache, String urlString) + throws LiveDocumentNotAvailableException, IOException, URIException { + // localize URL + File tmpFile = getTmpFile(); + ExtendedGetMethod method; + try { + method = urlToFile(urlString,tmpFile); + } catch (LiveDocumentNotAvailableException e) { + LOGGER.info("Attempted to get " + urlString + " failed..."); + tmpFile.delete(); + throw e; + } catch (URIException e) { + tmpFile.delete(); + throw e; + } catch (IOException e) { + tmpFile.delete(); + throw e; + } + + // store URL + FileRegion region = null; + ARCWriter writer = null; + try { + writer = cache.getWriter(); + region = storeFile2(tmpFile, writer, urlString, method); + } catch(IOException e) { + e.printStackTrace(); + throw e; + } finally { + if(writer != null) { + cache.returnWriter(writer); + } + tmpFile.delete(); + } + return region; +} + /** * @param args */ This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
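A note on FileRegion.copyToOutputStream() in r2936: as committed, the RandomAccessFile is closed only on the success path, so an IOException during the copy leaves the handle open. A possible variant is sketched below. It assumes Java 7 or later for try-with-resources (the committed code targets Java 5, where a try/finally would do the same job), and the names RegionCopySketch and copyRegion are hypothetical.

```java
import java.io.EOFException;
import java.io.File;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;

class RegionCopySketch {
    /** Copies the bytes in [start, end) of file to out. */
    static void copyRegion(File file, long start, long end, OutputStream out)
            throws IOException {
        byte[] buf = new byte[4096];
        long left = end - start;
        // try-with-resources closes the file even if read or write throws.
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(start);
            while (left > 0) {
                int n = raf.read(buf, 0, (int) Math.min(left, buf.length));
                if (n < 0) {
                    throw new EOFException("EOF before expected region end");
                }
                out.write(buf, 0, n);
                left -= n;
            }
        }
    }
}
```

The output stream is left open on purpose, since in the servlet case the container owns it.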
Revision: 2935 http://archive-access.svn.sourceforge.net/archive-access/?rev=2935&view=rev Author: bradtofel Date: 2009-12-01 23:21:59 +0000 (Tue, 01 Dec 2009) Log Message: ----------- IOException(Exception) constructor not available in java 5. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/CDXFormatIndex.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/CDXFormatIndex.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/CDXFormatIndex.java 2009-12-01 23:19:20 UTC (rev 2934) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/resourceindex/cdx/CDXFormatIndex.java 2009-12-01 23:21:59 UTC (rev 2935) @@ -58,7 +58,7 @@ try { cdx = new CDXFormat(CDX_HEADER_MAGIC); } catch (CDXFormatException e1) { - throw new IOException(e1); + throw new IOException(e1.getMessage()); } } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2009-12-01 23:19:29
|
Revision: 2934 http://archive-access.svn.sourceforge.net/archive-access/?rev=2934&view=rev Author: bradtofel Date: 2009-12-01 23:19:20 +0000 (Tue, 01 Dec 2009) Log Message: ----------- IOException(Exception) constructor not available in java 5. Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlSAXRewriteReplayRenderer.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/rules/AfterBodyStartTagJSPExecRule.java trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/rules/BeforeBodyEndTagJSPExecRule.java Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlSAXRewriteReplayRenderer.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlSAXRewriteReplayRenderer.java 2009-12-01 20:54:41 UTC (rev 2933) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/archivalurl/ArchivalUrlSAXRewriteReplayRenderer.java 2009-12-01 23:19:20 UTC (rev 2934) @@ -100,7 +100,8 @@ url = new URL(result.getOriginalUrl()); } catch (MalformedURLException e1) { // TODO: this shouldn't happen... - throw new IOException(e1); + e1.printStackTrace(); + throw new IOException(e1.getMessage()); } // To make sure we get the length, we have to buffer it all up... 
@@ -132,7 +133,7 @@ delegator.handleParseComplete(context); } catch (ParserException e) { e.printStackTrace(); - throw new IOException(e); + throw new IOException(e.getMessage()); } // At this point, baos contains the utf-8 encoded bytes of our result: Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/rules/AfterBodyStartTagJSPExecRule.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/rules/AfterBodyStartTagJSPExecRule.java 2009-12-01 20:54:41 UTC (rev 2933) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/rules/AfterBodyStartTagJSPExecRule.java 2009-12-01 23:19:20 UTC (rev 2934) @@ -79,7 +79,8 @@ try { super.emit(context, node); } catch (ServletException e) { - throw new IOException(e); + e.printStackTrace(); + throw new IOException(e.getMessage()); } } } Modified: trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/rules/BeforeBodyEndTagJSPExecRule.java =================================================================== --- trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/rules/BeforeBodyEndTagJSPExecRule.java 2009-12-01 20:54:41 UTC (rev 2933) +++ trunk/archive-access/projects/wayback/wayback-core/src/main/java/org/archive/wayback/replay/html/rules/BeforeBodyEndTagJSPExecRule.java 2009-12-01 23:19:20 UTC (rev 2934) @@ -54,7 +54,8 @@ try { super.emit(context, node); } catch (ServletException e) { - throw new IOException(e); + e.printStackTrace(); + throw new IOException(e.getMessage()); } } } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
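A note on the Java 5 workaround in r2934 and r2935: the IOException(Throwable) constructor arrived in Java 6, so passing only getMessage() compiles on Java 5 but drops the original stack trace from the rethrown exception. Throwable.initCause() has been available since Java 1.4 and can keep the cause attached. The sketch below shows that alternative. It is not what was committed, and the names WrapCause and asIOException are hypothetical.

```java
import java.io.IOException;

class WrapCause {
    /** Builds an IOException that keeps the original exception as its cause. */
    static IOException asIOException(Exception cause) {
        IOException ioe = new IOException(cause.getMessage());
        // initCause works on Java 5; it may be called once per exception.
        ioe.initCause(cause);
        return ioe;
    }
}
```

A catch block would then read "throw WrapCause.asIOException(e);", which makes the separate e.printStackTrace() call unnecessary for preserving the trace.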
From: <bra...@us...> - 2009-12-01 20:54:53
|
Revision: 2933 http://archive-access.svn.sourceforge.net/archive-access/?rev=2933&view=rev Author: bradtofel Date: 2009-12-01 20:54:41 +0000 (Tue, 01 Dec 2009) Log Message: ----------- BUGFIX: was mis-referencing jsBlockHandler Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/ArchivalUrlSaxReplay.xml Modified: trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/ArchivalUrlSaxReplay.xml =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/ArchivalUrlSaxReplay.xml 2009-12-01 20:51:38 UTC (rev 2932) +++ trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/ArchivalUrlSaxReplay.xml 2009-12-01 20:54:41 UTC (rev 2933) @@ -160,7 +160,7 @@ </bean> <bean class="org.archive.wayback.replay.html.rules.AttributeModifyingRule"> <property name="modifyAttributeName" value="ONCLICK" /> - <property name="transformer" ref="jsBlockRewriter" /> + <property name="transformer" ref="jsBlockHandler" /> </bean> <bean class="org.archive.wayback.replay.html.rules.AttributeModifyingRule"> <property name="modifyAttributeName" value="style" /> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
From: <bra...@us...> - 2009-12-01 20:51:48
|
Revision: 2932 http://archive-access.svn.sourceforge.net/archive-access/?rev=2932&view=rev Author: bradtofel Date: 2009-12-01 20:51:38 +0000 (Tue, 01 Dec 2009) Log Message: ----------- Patch to allow dev testing of Wayback under jetty: Modified Paths: -------------- trunk/archive-access/projects/wayback/wayback-webapp/pom.xml Modified: trunk/archive-access/projects/wayback/wayback-webapp/pom.xml =================================================================== --- trunk/archive-access/projects/wayback/wayback-webapp/pom.xml 2009-11-20 05:50:23 UTC (rev 2931) +++ trunk/archive-access/projects/wayback/wayback-webapp/pom.xml 2009-12-01 20:51:38 UTC (rev 2932) @@ -22,6 +22,11 @@ </archive> </configuration> </plugin> + <plugin> + <groupId>org.mortbay.jetty</groupId> + <artifactId>maven-jetty-plugin</artifactId> + <version>6.1.22</version> + </plugin> </plugins> </build> <dependencies> This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |
Revision: 2931 http://archive-access.svn.sourceforge.net/archive-access/?rev=2931&view=rev Author: alexoz Date: 2009-11-20 05:50:23 +0000 (Fri, 20 Nov 2009) Log Message: ----------- BUGFIX: Don't NPE when there's no default rule, instead explain the problem. Modified Paths: -------------- trunk/archive-access/projects/access-control/access-control/src/main/java/org/archive/accesscontrol/AccessControlClient.java Modified: trunk/archive-access/projects/access-control/access-control/src/main/java/org/archive/accesscontrol/AccessControlClient.java =================================================================== --- trunk/archive-access/projects/access-control/access-control/src/main/java/org/archive/accesscontrol/AccessControlClient.java 2009-11-12 22:22:32 UTC (rev 2930) +++ trunk/archive-access/projects/access-control/access-control/src/main/java/org/archive/accesscontrol/AccessControlClient.java 2009-11-20 05:50:23 UTC (rev 2931) @@ -59,6 +59,11 @@ throw new RobotsUnavailableException(e); } } + if (rule == null) { + throw new RuntimeException("No applicable rule found." + + " Please make sure you have a default rule set" + + " on the root SURT '(' in the oracle."); + } return rule.getPolicy(); } This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |