From: <st...@ar...> - 2006-04-06 21:59:08
Excellent Oskar!
Do you want us to host your Firefox extension at archive-access? If so,
we can set up a subproject for it and give you access.
St.Ack
Oskar Grenholm wrote:
> Hi everyone!
>
> Let me first introduce myself to those of you who don't already know me.
> My name is Oskar Grenholm and I work as a programmer at The National Library
> of Sweden. I mainly work with things related to our web archive here.
>
> Lately I have made some minor improvements to the way the proxy-mode works in
> the Open Wayback Machine. Those changes make it possible to surf not only
> the most recent copy of a page in the web archive, but any available copy.
> This can be done with just the Wayback Machine, but to aid (and perhaps
> simplify) the surfing I have also started working on a Firefox extension that
> helps the user with common tasks encountered when surfing a web archive.
> Among other things, this WAX Toolbar provides a search field for searching
> the Wayback Machine for URLs or for doing a full-text search of a NutchWAX
> index (if one is available, of course). You can also use the toolbar to
> switch between proxy-mode and the regular Internet, and when in proxy-mode
> easily go back and forth in time.
>
> The changes made to the Wayback are not many. The main idea is that a BDB
> index holds mappings between IDs (a unique ID if the toolbar was used,
> otherwise the IP address the request was made from) and a preferred time to
> surf at. This timestamp is set either when you choose a page to visit from
> the search interface in the WB or by the WAX Toolbar.
> Then, for each request made to the proxy, the WB looks up this timestamp and
> returns the capture that is closest in time.
>
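The lookup described above can be sketched without BDB. This is only an illustration, not the code from the patches: the class name PreferredTimeMap and its method names are hypothetical, an in-memory map stands in for the BDB index, and the 14-digit yyyyMMddHHmmss timestamp format is an assumption.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Map;
import java.util.TimeZone;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for the BDB-backed id -> timestamp mapping. */
final class PreferredTimeMap {
    private final Map<String, String> idToTimestamp = new ConcurrentHashMap<>();

    /** Uses the toolbar's id when the header is present, otherwise the client's IP address. */
    static String resolveId(String proxyIdHeader, String remoteAddr) {
        return (proxyIdHeader != null) ? proxyIdHeader : remoteAddr;
    }

    /** Records the time this id wants to surf at (assumed format: yyyyMMddHHmmss). */
    void setPreferredTime(String id, String timestamp14) {
        idToTimestamp.put(id, timestamp14);
    }

    /** Returns the stored timestamp, or the current time if none has been set for this id. */
    String preferredTimeFor(String id) {
        String stored = idToTimestamp.get(id);
        return (stored != null) ? stored : now14();
    }

    /** Current time as a 14-digit UTC timestamp. */
    static String now14() {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt.format(new Date());
    }
}
```

A request without a stored entry falls back to the current time, which matches the old "most recent copy" behaviour of the proxy.
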
> Patches for these changes are attached to this e-mail. Four of the files are
> existing files that have been modified somewhat and two of them are
> new (BDBMap.java and Redirect.jsp).
>
> Attached is also a tar file containing the source for the Firefox extension.
> If you untar it and enter the directory, you can just run 'ant' and a file
> named WaxToolbar.xpi will be built. That is the actual Firefox extension and
> it can be installed like any other extension (i.e., by double-clicking it from
> within Firefox).
> When the extension is installed (and after a restart of Firefox) a new
> toolbar will appear. In the Tools menu there will also be a WAX Toolbar
> Configuration option. Using this you can set the proxy to use (the WB) and a
> server running NutchWAX.
>
> Finally I have attached an example of a web.xml that can be used when running
> the WB with these changes and the WAX Toolbar. It adds two things: a
> parameter specifying the redirect path (the Redirect.jsp mentioned above) and
> a servlet called xmlquery that runs in parallel with the normal query
> interface and is used by the extension to find the times at which a page has
> been archived.
>
> So, let the feedback begin!
>
> Regards, Oskar.
> ------------------------------------------------------------------------
>
> Index: BDBMap.java
> ===================================================================
> RCS file: BDBMap.java
> diff -N BDBMap.java
> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++ BDBMap.java 1 Jan 1970 00:00:00 -0000
> @@ -0,0 +1,94 @@
> +/*
> + * Created on 2006-apr-05
> + *
> + * Copyright (C) 2006 Royal Library of Sweden.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + */
> +package org.archive.wayback.core;
> +
> +import java.io.File;
> +import java.io.UnsupportedEncodingException;
> +
> +import com.sleepycat.je.Database;
> +import com.sleepycat.je.DatabaseConfig;
> +import com.sleepycat.je.DatabaseEntry;
> +import com.sleepycat.je.DatabaseException;
> +import com.sleepycat.je.Environment;
> +import com.sleepycat.je.EnvironmentConfig;
> +import com.sleepycat.je.LockMode;
> +import com.sleepycat.je.OperationStatus;
> +
> +public class BDBMap {
> +
> + protected Environment env = null;
> + protected Database db = null;
> + protected String name;
> + protected String dir;
> +
> + public BDBMap(String name, String dir) {
> + this.name = name;
> + this.dir = dir;
> + init();
> + }
> +
> + protected void init() {
> + try {
> + EnvironmentConfig envConf = new EnvironmentConfig();
> + envConf.setAllowCreate(true);
> + File envDir = new File(dir);
> + if (!envDir.exists())
> + envDir.mkdirs();
> + env = new Environment(envDir, envConf);
> +
> + DatabaseConfig dbConf = new DatabaseConfig();
> + dbConf.setAllowCreate(true);
> + dbConf.setSortedDuplicates(false);
> + db = env.openDatabase(null, name, dbConf);
> + } catch (DatabaseException e) {
> + e.printStackTrace();
> + }
> + }
> +
> + public void put(String keyStr, String valueStr) {
> + try {
> + DatabaseEntry key = new DatabaseEntry(keyStr.getBytes("UTF-8"));
> + DatabaseEntry data = new DatabaseEntry(valueStr.getBytes("UTF-8"));
> + db.put(null, key, data);
> + } catch (DatabaseException e) {
> + e.printStackTrace();
> + } catch (UnsupportedEncodingException e) {
> + e.printStackTrace();
> + }
> + }
> +
> + public String get(String keyStr) {
> + String result = null;
> + try {
> + DatabaseEntry key = new DatabaseEntry(keyStr.getBytes("UTF-8"));
> + DatabaseEntry data = new DatabaseEntry();
> + if (db.get(null, key, data, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
> + byte[] bytes = data.getData();
> + result = new String(bytes, "UTF-8");
> + }
> + } catch (DatabaseException e) {
> + e.printStackTrace();
> + } catch (UnsupportedEncodingException e) {
> + e.printStackTrace();
> + }
> + return result;
> + }
> +
> +}
> ------------------------------------------------------------------------
>
> Index: ResultURIConverter.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/proxy/ResultURIConverter.java,v
> retrieving revision 1.3
> diff -u -r1.3 ResultURIConverter.java
> --- ResultURIConverter.java 1 Dec 2005 02:08:34 -0000 1.3
> +++ ResultURIConverter.java 6 Apr 2006 11:36:25 -0000
> @@ -41,10 +41,19 @@
> * @version $Date: 2005/12/01 02:08:34 $, $Revision: 1.3 $
> */
> public class ResultURIConverter implements ReplayResultURIConverter {
> - /* (non-Javadoc)
> +
> + private static final String REDIRECT_PATH_PROPERTY = "proxy.redirectpath";
> +
> + private String redirectPath;
> +
> + /* (non-Javadoc)
> * @see org.archive.wayback.ReplayResultURIConverter#init(java.util.Properties)
> */
> public void init(Properties p) throws ConfigurationException {
> + redirectPath = (String) p.get(REDIRECT_PATH_PROPERTY);
> + if (redirectPath == null || redirectPath.length() <= 0) {
> + throw new ConfigurationException("Failed to find " + REDIRECT_PATH_PROPERTY);
> + }
> }
>
> /* (non-Javadoc)
> @@ -52,10 +61,12 @@
> */
> public String makeReplayURI(SearchResult result) {
> String finalUrl = result.get(WaybackConstants.RESULT_URL);
> + String finalTime = result.get(WaybackConstants.RESULT_CAPTURE_DATE);
> if(!finalUrl.startsWith(WaybackConstants.HTTP_URL_PREFIX)) {
> finalUrl = WaybackConstants.HTTP_URL_PREFIX + finalUrl;
> }
> - return finalUrl;
> + //return finalUrl;
> + return redirectPath + "?url=" + finalUrl + "&time=" + finalTime;
> }
>
> /**
> @@ -70,6 +81,7 @@
> */
> public String makeRedirectReplayURI(SearchResult result, String url) {
> String finalUrl = url;
> + String finalTime = result.get(WaybackConstants.RESULT_CAPTURE_DATE);
> try {
>
> UURI origURI = UURIFactory.getInstance(url);
> @@ -86,6 +98,7 @@
> if(!finalUrl.startsWith(WaybackConstants.HTTP_URL_PREFIX)) {
> finalUrl = WaybackConstants.HTTP_URL_PREFIX + finalUrl;
> }
> - return finalUrl;
> + //return finalUrl;
> + return redirectPath + "?url=" + finalUrl + "&time=" + finalTime;
> }
> }
> ------------------------------------------------------------------------
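One thing that may be worth considering for the patch above: the target URL is appended to the redirect path as-is, so a URL that itself contains '?' or '&' could be split into separate query parameters. A minimal sketch of one way to avoid that, assuming the servlet container decodes the parameter again before Redirect.jsp reads it (request.getParameter normally does). The class and method names here are hypothetical and not part of the patch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

/** Hypothetical helper showing the redirect URI built with encoded parameters. */
final class RedirectUriSketch {
    static String makeRedirectUri(String redirectPath, String url, String time) {
        try {
            // Percent-encode both values so '?', '&' and '=' inside the
            // archived URL cannot be mistaken for query-string delimiters.
            return redirectPath
                    + "?url=" + URLEncoder.encode(url, "UTF-8")
                    + "&time=" + URLEncoder.encode(time, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }
}
```

For example, makeRedirectUri("/jsp/QueryUI/Redirect.jsp", "http://example.com/a?b=1&c=2", "20060406113625") keeps the whole original URL inside the single url parameter.
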
>
> Index: Timestamp.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/core/Timestamp.java,v
> retrieving revision 1.7
> diff -u -r1.7 Timestamp.java
> --- Timestamp.java 16 Feb 2006 03:14:42 -0000 1.7
> +++ Timestamp.java 6 Apr 2006 11:34:06 -0000
> @@ -56,6 +56,11 @@
>
> private final static String[] months = { "Jan", "Feb", "Mar", "Apr", "May",
> "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" };
> +
> + // Acts as a mapping between an ID and a timestamp to surf at.
> + // The dir should probably be configurable somehow.
> + private static String BDB_DIR = System.getProperty("java.io.tmpdir") + "/wayback/bdb";
> + private static BDBMap idToTimestamp = new BDBMap("IdToTimestamp", BDB_DIR);
>
> private String dateStr = null;
> private Date date = null;
> @@ -430,6 +435,7 @@
> public static Timestamp currentTimestamp() {
> return new Timestamp(new Date());
> }
> +
> /**
> * @return Timestamp object representing the latest possible date.
> */
> @@ -437,12 +443,20 @@
> return currentTimestamp();
> }
>
> -
> /**
> * @return Timestamp object representing the earliest possible date.
> */
> public static Timestamp earliestTimestamp() {
> return new Timestamp(SSE_1996);
> }
> +
> + public static String getTimestampForId(String ip) {
> + String dateStr = idToTimestamp.get(ip);
> + return (dateStr != null) ? dateStr : currentTimestamp().getDateStr();
> + }
> +
> + public static void addTimestampForId(String ip, String time) {
> + idToTimestamp.put(ip, time);
> + }
>
> }
> ------------------------------------------------------------------------
>
> Index: Redirect.jsp
> ===================================================================
> RCS file: Redirect.jsp
> diff -N Redirect.jsp
> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++ Redirect.jsp 1 Jan 1970 00:00:00 -0000
> @@ -0,0 +1,14 @@
> +<%@ page import="org.archive.wayback.core.Timestamp" %>
> +
> +<%
> + String url = request.getParameter("url");
> + String time = request.getParameter("time");
> +
> + // Put time-mapping for this id, or if no id, the ip-addr.
> + String id = request.getHeader("Proxy-Id");
> + if(id == null) id = request.getRemoteAddr();
> + Timestamp.addTimestampForId(id, time);
> +
> + // Now redirect to the page the user wanted.
> + response.sendRedirect(url);
> +%>
> ------------------------------------------------------------------------
>
> Index: ReplayFilter.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/proxy/ReplayFilter.java,v
> retrieving revision 1.4
> diff -u -r1.4 ReplayFilter.java
> --- ReplayFilter.java 18 Jan 2006 02:04:12 -0000 1.4
> +++ ReplayFilter.java 6 Apr 2006 11:36:02 -0000
> @@ -84,10 +84,15 @@
> referer = "";
> }
> wbRequest.put(WaybackConstants.REQUEST_REFERER_URL,referer);
> -
> - wbRequest.put(WaybackConstants.REQUEST_EXACT_DATE,
> - Timestamp.currentTimestamp().getDateStr());
> -
> +
> + // Original
> + //wbRequest.put(WaybackConstants.REQUEST_EXACT_DATE, Timestamp.currentTimestamp().getDateStr());
> +
> + // Get the id from the request. If no id, use the ip-address instead.
> + // Then get the timestamp (or rather datestr) matching this id.
> + String id = httpRequest.getHeader("Proxy-Id");
> + if(id == null) id = httpRequest.getRemoteAddr();
> + wbRequest.put(WaybackConstants.REQUEST_EXACT_DATE, Timestamp.getTimestampForId(id));
>
> return wbRequest;
> }
> ------------------------------------------------------------------------
>
> Index: QueryServlet.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/query/QueryServlet.java,v
> retrieving revision 1.5
> diff -u -r1.5 QueryServlet.java
> --- QueryServlet.java 7 Mar 2006 23:22:20 -0000 1.5
> +++ QueryServlet.java 6 Apr 2006 11:38:30 -0000
> @@ -25,7 +25,9 @@
> package org.archive.wayback.query;
>
> import java.io.IOException;
> +import java.text.ParseException;
> import java.util.Enumeration;
> +import java.util.Iterator;
> import java.util.Properties;
>
> import javax.servlet.ServletConfig;
> @@ -39,7 +41,9 @@
> import org.archive.wayback.QueryRenderer;
> import org.archive.wayback.ReplayResultURIConverter;
> import org.archive.wayback.ResourceIndex;
> +import org.archive.wayback.core.SearchResult;
> import org.archive.wayback.core.SearchResults;
> +import org.archive.wayback.core.Timestamp;
> import org.archive.wayback.core.WaybackLogic;
> import org.archive.wayback.core.WaybackRequest;
> import org.archive.wayback.exception.BadQueryException;
> @@ -119,6 +123,14 @@
> if (wbRequest.get(WaybackConstants.REQUEST_TYPE).equals(
> WaybackConstants.REQUEST_URL_QUERY)) {
>
> + // Annotate the closest matching hit so that it can
> + // be retrieved later from the xml.
> + try {
> + annotateClosest(results, wbRequest, httpRequest);
> + } catch (ParseException e) {
> + e.printStackTrace();
> + }
> +
> renderer.renderUrlResults(httpRequest, httpResponse,
> wbRequest, results, uriConverter);
>
> @@ -144,4 +156,34 @@
>
> }
> }
> +
> + // Method annotating the searchresult closest in time to the timestamp
> + // belonging to this request.
> + private void annotateClosest(SearchResults results,
> + WaybackRequest wbRequest, HttpServletRequest request) throws ParseException {
> +
> + SearchResult closest = null;
> + long closestDistance = 0;
> + SearchResult cur = null;
> + String id = request.getHeader("Proxy-Id");
> + if(id == null) id = request.getRemoteAddr();
> + String requestsDate = Timestamp.getTimestampForId(id);
> + Timestamp wantTimestamp;
> + wantTimestamp = Timestamp.parseBefore(requestsDate);
> +
> + Iterator itr = results.iterator();
> + while (itr.hasNext()) {
> + cur = (SearchResult) itr.next();
> + long curDistance;
> + Timestamp curTimestamp = Timestamp.parseBefore(cur
> + .get(WaybackConstants.RESULT_CAPTURE_DATE));
> + curDistance = curTimestamp.absDistanceFromTimestamp(wantTimestamp);
> +
> + if ((closest == null) || (curDistance < closestDistance)) {
> + closest = cur;
> + closestDistance = curDistance;
> + }
> + }
> + if(closest != null) closest.put("closest", "true");
> + }
> }
> ------------------------------------------------------------------------
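The closest-in-time selection done by annotateClosest can be illustrated in isolation. This is a sketch under assumptions, not the Wayback code: captures are represented as plain 14-digit yyyyMMddHHmmss strings rather than SearchResult objects, the class and method names are hypothetical, and an empty result list yields null instead of an annotation.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.List;
import java.util.TimeZone;

/** Hypothetical stand-alone version of the closest-capture search. */
final class ClosestCaptureSketch {

    /** Parses an assumed 14-digit UTC timestamp into milliseconds since the epoch. */
    static long toMillis(String ts14) {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmss");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        fmt.setLenient(false);
        try {
            return fmt.parse(ts14).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad timestamp: " + ts14, e);
        }
    }

    /** Returns the capture nearest in time to wanted, or null if there are no captures. */
    static String closest(List<String> captures, String wanted) {
        long want = toMillis(wanted);
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (String capture : captures) {
            long distance = Math.abs(toMillis(capture) - want);
            // Strict comparison: on a tie the earlier list entry is kept.
            if (distance < bestDistance) {
                best = capture;
                bestDistance = distance;
            }
        }
        return best;
    }
}
```

Comparing parsed times rather than the raw digit strings matters around year and month boundaries, where numerically distant strings can be only seconds apart.
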
>
> <?xml version="1.0"?>
> <!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
> "http://java.sun.com/dtd/web-app_2_3.dtd">
> <web-app>
>
> <!-- General Installation information
> -->
>
> <context-param>
> <param-name>installationname</param-name>
> <param-value>Local Proxy Installation</param-value>
> <description>
> This text will appear on the Wayback Configuration and Status page
> and may assist in determining which installation users are viewing
> via their web browser in environments with multiple Wayback
> installations.
> </description>
> </context-param>
>
>
> <!-- Local Arc Path Configuration:
> used by both indexpipeline and LocalARCResourceStore
> -->
>
> <context-param>
> <param-name>arcpath</param-name>
> <param-value>/tmp/wayback/arcs</param-value>
> <description>
> Directory where ARC files are found (possibly where Heritrix writes them.)
> This directory must exist.
> </description>
> </context-param>
>
>
>
> <!-- ResourceStore Configuration -->
>
> <context-param>
> <param-name>resourcestore.classname</param-name>
> <param-value>org.archive.wayback.localresourcestore.LocalARCResourceStore</param-value>
> <description>Class that implements ResourceStore for this Wayback</description>
> </context-param>
>
>
>
> <!-- ResourceIndex Configuration -->
>
> <context-param>
> <param-name>resourceindex.classname</param-name>
> <param-value>org.archive.wayback.cdx.LocalBDBResourceIndex</param-value>
> <description>Class that implements ResourceIndex for this Wayback</description>
> </context-param>
>
> <context-param>
> <param-name>resourceindex.indexpath</param-name>
> <param-value>/tmp/wayback/index</param-value>
> <description>
> LocalBDBResourceIndex specific directory to store the BDB files.
> This directory must exist.
> </description>
> </context-param>
>
> <context-param>
> <param-name>resourceindex.dbname</param-name>
> <param-value>DB1</param-value>
> <description>
> LocalBDBResourceIndex specific name for BDB database
> </description>
> </context-param>
>
>
> <!-- ResourceIndex Pipeline Configuration -->
>
> <context-param>
> <param-name>indexpipeline.workpath</param-name>
> <param-value>/tmp/wayback/pipeline</param-value>
> <description>
> LocalBDBResourceIndex specific directory to store flag files and
> temporary index data. This directory must exist.
> </description>
> </context-param>
>
> <context-param>
> <param-name>indexpipeline.runpipeline</param-name>
> <param-value>1</param-value>
> <description>
> if set to '1' then a background indexing thread will automatically
> update the BDB index when new ARC files are noticed in the 'arcpath'
> directory.
> </description>
> </context-param>
>
> <!-- Pipeline Filter Configuration
> this enables a trivial (and very in-progress) UI for viewing the
> pipeline status.
> -->
>
> <filter>
> <filter-name>PipelineFilter</filter-name>
> <filter-class>org.archive.wayback.cdx.indexer.PipelineFilter</filter-class>
> <init-param>
> <param-name>pipeline.statusjsp</param-name>
> <param-value>jsp/PipelineUI/PipelineStatus.jsp</param-value>
> </init-param>
> </filter>
> <filter-mapping>
> <filter-name>PipelineFilter</filter-name>
> <url-pattern>/pipeline</url-pattern>
> </filter-mapping>
>
>
>
>
> <!-- Query Servlet Configuration -->
>
> <servlet>
> <servlet-name>QueryServlet</servlet-name>
> <servlet-class>org.archive.wayback.query.QueryServlet</servlet-class>
> <init-param>
> <param-name>queryui.jsppath</param-name>
> <param-value>jsp/QueryUI</param-value>
> </init-param>
> </servlet>
> <servlet-mapping>
> <servlet-name>QueryServlet</servlet-name>
> <url-pattern>/query</url-pattern>
> </servlet-mapping>
>
> <!-- XMLQuery Servlet Configuration -->
>
> <servlet>
> <servlet-name>XMLQueryServlet</servlet-name>
> <servlet-class>org.archive.wayback.query.QueryServlet</servlet-class>
> <init-param>
> <param-name>queryui.jsppath</param-name>
> <param-value>jsp/QueryXMLUI</param-value>
> </init-param>
> </servlet>
> <servlet-mapping>
> <servlet-name>XMLQueryServlet</servlet-name>
> <url-pattern>/xmlquery</url-pattern>
> </servlet-mapping>
>
> <!-- QueryUI Configuration -->
>
> <context-param>
> <param-name>queryrenderer.classname</param-name>
> <param-value>org.archive.wayback.query.Renderer</param-value>
> <description>Implementation responsible for drawing Index Query results</description>
> </context-param>
>
> <context-param>
> <param-name>proxy.redirectpath</param-name>
> <param-value>/jsp/QueryUI/Redirect.jsp</param-value>
> </context-param>
>
>
>
> <!-- Replay Servlet Configuration -->
>
> <servlet>
> <servlet-name>ReplayServlet</servlet-name>
> <servlet-class>org.archive.wayback.replay.ReplayServlet</servlet-class>
> </servlet>
> <servlet-mapping>
> <servlet-name>ReplayServlet</servlet-name>
> <url-pattern>/replay</url-pattern>
> </servlet-mapping>
>
>
>
> <!-- Proxy RawReplayUI Configuration -->
>
> <context-param>
> <param-name>replayrenderer.classname</param-name>
> <param-value>org.archive.wayback.proxy.RawReplayRenderer</param-value>
> <description>Implementation responsible for drawing replayed resources and replay error messages</description>
> </context-param>
>
> <context-param>
> <param-name>replayui.jsppath</param-name>
> <param-value>jsp/ReplayUI</param-value>
> <description>
> RawReplayUI specific path to jsp pages. relative to webapp/
> </description>
> </context-param>
>
> <!-- Proxy URI Conversion Configuration -->
>
> <context-param>
> <param-name>replayuriconverter.classname</param-name>
> <param-value>org.archive.wayback.proxy.ResultURIConverter</param-value>
> <description>Class that implements translation of index results to Replayable URIs for this Wayback</description>
> </context-param>
>
> <!-- Proxy ReplayFilter Configuration -->
>
> <filter>
> <filter-name>ReplayFilter</filter-name>
> <filter-class>org.archive.wayback.proxy.ReplayFilter</filter-class>
>
> <init-param>
> <param-name>handler.url</param-name>
> <param-value>/replay</param-value>
> </init-param>
> </filter>
> <filter-mapping>
> <filter-name>ReplayFilter</filter-name>
> <url-pattern>/*</url-pattern>
> </filter-mapping>
>
> </web-app>