|
From: Oskar G. <osk...@kb...> - 2006-04-06 14:19:26
|
Hi everyone!

Let me first introduce myself to those of you who don't know me already. My name is Oskar Grenholm and I work as a programmer at The National Library of Sweden. I mainly work with things related to our web archive here.

Lately I have made some minor improvements to the way proxy mode works in the Open Wayback Machine. With those changes you can surf not only the most recent copy of a page in the web archive, but any available copy.

This can be done with just the Wayback Machine, but to aid (and perhaps simplify) the surfing I have also started working on a Firefox extension that helps the user with common tasks encountered when surfing a web archive. Among other things, this WAX Toolbar provides a search field for searching the Wayback Machine for URLs or doing a full-text search against a NutchWAX index (if one is available, of course). You can also use the toolbar to switch between proxy mode and the regular Internet and, when in proxy mode, easily go back and forth in time.

The changes made to the Wayback are not many. The main idea is that a BDB index holds mappings between IDs (a unique ID if the toolbar was used, otherwise the IP address the request was made from) and a preferred time to surf at. This timestamp is set either when you choose a page to visit from the search interface in the WB or by the WAX Toolbar. Then, for each request made to the proxy, the WB looks up this timestamp and returns the page that is closest in time.

Patches for these changes are attached to this e-mail. Four of the files are existing files that have been modified somewhat and two are new (BDBMap.java and Redirect.jsp).

Attached is also a tar file containing the source for the Firefox extension. If you untar it and enter the directory you can just run 'ant', and a file named WaxToolbar.xpi will be built. That is the actual Firefox extension, and it can be installed like any other extension (i.e., by double-clicking it from within Firefox). Once the extension is installed (and Firefox has been restarted) a new toolbar will appear. The Tools menu will also have a WAX Toolbar Configuration option. Using this you can set the proxy to use (the WB) and a server running NutchWAX.

Finally, I have attached an example web.xml that can be used when running the WB with these changes and the WAX Toolbar. It adds a parameter specifying the redirect path (the Redirect.jsp mentioned above) and a servlet called xmlquery that runs in parallel with the normal query interface and is used by the extension to find the times at which a page has been archived.

So, let the feedback begin!

Regards,
Oskar.
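
To make the lookup described above a bit more concrete, here is a minimal sketch of the id-to-timestamp idea. It is not the attached code: it assumes an in-memory map in place of the BDB index, and the class and method names (SurfTimeSketch, resolveId, setSurfTime, surfTimeFor) are hypothetical. Timestamps are assumed to be 14-digit date strings.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the BDB-backed mapping: client id -> preferred
// timestamp to surf at. An in-memory map is used here only for illustration.
class SurfTimeSketch {
    private final Map<String, String> idToTimestamp = new ConcurrentHashMap<>();

    // Prefer the id sent by the toolbar; fall back to the client's IP address.
    static String resolveId(String proxyIdHeader, String remoteAddr) {
        return (proxyIdHeader != null) ? proxyIdHeader : remoteAddr;
    }

    // Remember the time this client wants to surf at.
    void setSurfTime(String id, String timestamp) {
        idToTimestamp.put(id, timestamp);
    }

    // Return the stored timestamp, or the supplied default when nothing
    // has been stored for this id yet.
    String surfTimeFor(String id, String defaultTimestamp) {
        String stored = idToTimestamp.get(id);
        return (stored != null) ? stored : defaultTimestamp;
    }

    public static void main(String[] args) {
        SurfTimeSketch sketch = new SurfTimeSketch();
        String id = resolveId(null, "10.0.0.7"); // no toolbar id, so the IP is used
        sketch.setSurfTime(id, "20040101120000");
        System.out.println(sketch.surfTimeFor(id, "20060406000000"));
    }
}
```

Falling back to the IP address means that users behind the same NAT or proxy would share one surf time, which is presumably why the toolbar sends its own unique id.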
|
From: <st...@ar...> - 2006-04-06 21:59:08
|
Excellent Oskar!
Do you want us to host your firefox extension at archive-access? If so,
we can set up a subproject for it and give you access.
St.Ack
Oskar Grenholm wrote:
> Hi everyone!
>
> Let me first introduce myself to those of you who don't know me already.
> My name is Oskar Grenholm and I work as a programmer at The National Library
> of Sweden. I mainly work with things related to our web archive here.
>
> Lately I have made some minor improvements to the way the proxy-mode works in
> the Open Wayback Machine. Those changes have made it possible to surf not
> only the most recent copy of a page in the web archive, but instead any copy
> available.
> This can be done with just the Wayback Machine, but to aid (and perhaps
> simplify) the surfing I have also started working on a Firefox extension that
> will help the user with common tasks often encountered when surfing a web
> archive. Among the things this WAX Toolbar does is providing a search field
> for searching the Wayback Machine for different URL:s OR do a full-text
> search from a NutchWAX index (if one is available of course). You can also
> use the toolbar to switch between proxy-mode and the regular Internet, and
> when in proxy-mode easily go back and forth in time.
>
> The changes made to the Wayback are not many. The main idea is that you have a
> BDB index that holds mappings between id:s (a unique id if the toolbar was
> used, otherwise the ip-address the request was made from) and a preferred
> time to surf at. This timestamp is set either when you choose a page to visit
> from the search interface in the WB or by the WAX Toolbar.
> Then for each request made to the proxy the WB will look up this timestamp and
> return the page that is the closest in time.
>
> Patches for these changes are attached to this e-mail. Four of the files are
> earlier existing files that have been modified somewhat and two of them are
> new (BDBMap.java and Redirect.jsp).
>
> Attached is also a tar-file containing the source for the Firefox extension.
> If you untar this and enter the directory you can just run 'ant' and a file
> named WaxToolbar.xpi will be built. That is the actual Firefox extension and
> it can be installed as any other extension (i.e., double-clicking it from
> within Firefox).
> When the extension is installed (and after a re-start of Firefox) a new
> toolbar will be there. In the Tools menu there will also be a WAX Toolbar
> Configuration option. Using this you can set the proxy to use (the WB) and a
> server running NutchWAX.
>
> Finally I have attached an example of a web.xml that can be used when running
> the WB with these new changes and the WAX Toolbar. In it some new stuff has
> been added, namely a parameter specifying the redirect path (the Redirect.jsp
> mentioned above) and a servlet called xmlquery that runs in parallel with
> the normal query interface and is used by the extension to find the times a
> page has been archived.
>
> So, let the feedback begin!
>
> Regards, Oskar.
> ------------------------------------------------------------------------
>
> Index: BDBMap.java
> ===================================================================
> RCS file: BDBMap.java
> diff -N BDBMap.java
> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++ BDBMap.java 1 Jan 1970 00:00:00 -0000
> @@ -0,0 +1,94 @@
> +/*
> + * Created on 2006-apr-05
> + *
> + * Copyright (C) 2006 Royal Library of Sweden.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public License
> + * as published by the Free Software Foundation; either version 2
> + * of the License, or (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + */
> +package org.archive.wayback.core;
> +
> +import java.io.File;
> +import java.io.UnsupportedEncodingException;
> +
> +import com.sleepycat.je.Database;
> +import com.sleepycat.je.DatabaseConfig;
> +import com.sleepycat.je.DatabaseEntry;
> +import com.sleepycat.je.DatabaseException;
> +import com.sleepycat.je.Environment;
> +import com.sleepycat.je.EnvironmentConfig;
> +import com.sleepycat.je.LockMode;
> +import com.sleepycat.je.OperationStatus;
> +
> +public class BDBMap {
> +
> + protected Environment env = null;
> + protected Database db = null;
> + protected String name;
> + protected String dir;
> +
> + public BDBMap(String name, String dir) {
> + this.name = name;
> + this.dir = dir;
> + init();
> + }
> +
> + protected void init() {
> + try {
> + EnvironmentConfig envConf = new EnvironmentConfig();
> + envConf.setAllowCreate(true);
> + File envDir = new File(dir);
> + if (!envDir.exists())
> + envDir.mkdirs();
> + env = new Environment(envDir, envConf);
> +
> + DatabaseConfig dbConf = new DatabaseConfig();
> + dbConf.setAllowCreate(true);
> + dbConf.setSortedDuplicates(false);
> + db = env.openDatabase(null, name, dbConf);
> + } catch (DatabaseException e) {
> + e.printStackTrace();
> + }
> + }
> +
> + public void put(String keyStr, String valueStr) {
> + try {
> + DatabaseEntry key = new DatabaseEntry(keyStr.getBytes("UTF-8"));
> + DatabaseEntry data = new DatabaseEntry(valueStr.getBytes("UTF-8"));
> + db.put(null, key, data);
> + } catch (DatabaseException e) {
> + e.printStackTrace();
> + } catch (UnsupportedEncodingException e) {
> + e.printStackTrace();
> + }
> + }
> +
> + public String get(String keyStr) {
> + String result = null;
> + try {
> + DatabaseEntry key = new DatabaseEntry(keyStr.getBytes("UTF-8"));
> + DatabaseEntry data = new DatabaseEntry();
> + if (db.get(null, key, data, LockMode.DEFAULT) == OperationStatus.SUCCESS) {
> + byte[] bytes = data.getData();
> + result = new String(bytes, "UTF-8");
> + }
> + } catch (DatabaseException e) {
> + e.printStackTrace();
> + } catch (UnsupportedEncodingException e) {
> + e.printStackTrace();
> + }
> + return result;
> + }
> +
> +}
> ------------------------------------------------------------------------
>
> Index: ResultURIConverter.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/proxy/ResultURIConverter.java,v
> retrieving revision 1.3
> diff -u -r1.3 ResultURIConverter.java
> --- ResultURIConverter.java 1 Dec 2005 02:08:34 -0000 1.3
> +++ ResultURIConverter.java 6 Apr 2006 11:36:25 -0000
> @@ -41,10 +41,19 @@
> * @version $Date: 2005/12/01 02:08:34 $, $Revision: 1.3 $
> */
> public class ResultURIConverter implements ReplayResultURIConverter {
> - /* (non-Javadoc)
> +
> + private static final String REDIRECT_PATH_PROPERTY = "proxy.redirectpath";
> +
> + private String redirectPath;
> +
> + /* (non-Javadoc)
> * @see org.archive.wayback.ReplayResultURIConverter#init(java.util.Properties)
> */
> public void init(Properties p) throws ConfigurationException {
> + redirectPath = (String) p.get(REDIRECT_PATH_PROPERTY);
> + if (redirectPath == null || redirectPath.length() <= 0) {
> + throw new ConfigurationException("Failed to find " + REDIRECT_PATH_PROPERTY);
> + }
> }
>
> /* (non-Javadoc)
> @@ -52,10 +61,12 @@
> */
> public String makeReplayURI(SearchResult result) {
> String finalUrl = result.get(WaybackConstants.RESULT_URL);
> + String finalTime = result.get(WaybackConstants.RESULT_CAPTURE_DATE);
> if(!finalUrl.startsWith(WaybackConstants.HTTP_URL_PREFIX)) {
> finalUrl = WaybackConstants.HTTP_URL_PREFIX + finalUrl;
> }
> - return finalUrl;
> + //return finalUrl;
> + return redirectPath + "?url=" + finalUrl + "&time=" + finalTime;
> }
>
> /**
> @@ -70,6 +81,7 @@
> */
> public String makeRedirectReplayURI(SearchResult result, String url) {
> String finalUrl = url;
> + String finalTime = result.get(WaybackConstants.RESULT_CAPTURE_DATE);
> try {
>
> UURI origURI = UURIFactory.getInstance(url);
> @@ -86,6 +98,7 @@
> if(!finalUrl.startsWith(WaybackConstants.HTTP_URL_PREFIX)) {
> finalUrl = WaybackConstants.HTTP_URL_PREFIX + finalUrl;
> }
> - return finalUrl;
> + //return finalUrl;
> + return redirectPath + "?url=" + finalUrl + "&time=" + finalTime;
> }
> }
> ------------------------------------------------------------------------
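
One thing worth noting about the patch above: finalUrl is concatenated into the query string as-is. If the archived URL carries its own query string containing '&', request.getParameter("url") in Redirect.jsp would return only the part before the first '&'. A possible refinement, not part of the attached patch, is to URL-encode both values. The sketch below assumes a hypothetical helper named makeRedirectUri and uses only java.net.URLEncoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not from the patch: builds the redirect URI with
// both query parameters percent-encoded.
class RedirectUriSketch {
    static String makeRedirectUri(String redirectPath, String url, String time) {
        try {
            return redirectPath
                    + "?url=" + URLEncoder.encode(url, "UTF-8")
                    + "&time=" + URLEncoder.encode(time, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(makeRedirectUri("/jsp/QueryUI/Redirect.jsp",
                "http://example.org/a?b=1&c=2", "20060406113625"));
    }
}
```

The servlet container decodes the parameter again when Redirect.jsp calls request.getParameter("url"), so the JSP itself should not need to change.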
>
> Index: Timestamp.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/core/Timestamp.java,v
> retrieving revision 1.7
> diff -u -r1.7 Timestamp.java
> --- Timestamp.java 16 Feb 2006 03:14:42 -0000 1.7
> +++ Timestamp.java 6 Apr 2006 11:34:06 -0000
> @@ -56,6 +56,11 @@
>
> private final static String[] months = { "Jan", "Feb", "Mar", "Apr", "May",
> "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec" };
> +
> + // Acts as a mapping between an ID and a timestamp to surf at.
> + // The dir should probably be configurable somehow.
> + private static String BDB_DIR = System.getProperty("java.io.tmpdir") + "/wayback/bdb";
> + private static BDBMap idToTimestamp = new BDBMap("IdToTimestamp", BDB_DIR);
>
> private String dateStr = null;
> private Date date = null;
> @@ -430,6 +435,7 @@
> public static Timestamp currentTimestamp() {
> return new Timestamp(new Date());
> }
> +
> /**
> * @return Timestamp object representing the latest possible date.
> */
> @@ -437,12 +443,20 @@
> return currentTimestamp();
> }
>
> -
> /**
> * @return Timestamp object representing the earliest possible date.
> */
> public static Timestamp earliestTimestamp() {
> return new Timestamp(SSE_1996);
> }
> +
> + public static String getTimestampForId(String ip) {
> + String dateStr = idToTimestamp.get(ip);
> + return (dateStr != null) ? dateStr : currentTimestamp().getDateStr();
> + }
> +
> + public static void addTimestampForId(String ip, String time) {
> + idToTimestamp.put(ip, time);
> + }
>
> }
> ------------------------------------------------------------------------
>
> Index: Redirect.jsp
> ===================================================================
> RCS file: Redirect.jsp
> diff -N Redirect.jsp
> --- /dev/null 1 Jan 1970 00:00:00 -0000
> +++ Redirect.jsp 1 Jan 1970 00:00:00 -0000
> @@ -0,0 +1,14 @@
> +<%@ page import="org.archive.wayback.core.Timestamp" %>
> +
> +<%
> + String url = request.getParameter("url");
> + String time = request.getParameter("time");
> +
> + // Put time-mapping for this id, or if no id, the ip-addr.
> + String id = request.getHeader("Proxy-Id");
> + if(id == null) id = request.getRemoteAddr();
> + Timestamp.addTimestampForId(id, time);
> +
> + // Now redirect to the page the user wanted.
> + response.sendRedirect(url);
> +%>
> ------------------------------------------------------------------------
>
> Index: ReplayFilter.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/proxy/ReplayFilter.java,v
> retrieving revision 1.4
> diff -u -r1.4 ReplayFilter.java
> --- ReplayFilter.java 18 Jan 2006 02:04:12 -0000 1.4
> +++ ReplayFilter.java 6 Apr 2006 11:36:02 -0000
> @@ -84,10 +84,15 @@
> referer = "";
> }
> wbRequest.put(WaybackConstants.REQUEST_REFERER_URL,referer);
> -
> - wbRequest.put(WaybackConstants.REQUEST_EXACT_DATE,
> - Timestamp.currentTimestamp().getDateStr());
> -
> +
> + // Original
> + //wbRequest.put(WaybackConstants.REQUEST_EXACT_DATE, Timestamp.currentTimestamp().getDateStr());
> +
> + // Get the id from the request. If no id, use the ip-address instead.
> + // Then get the timestamp (or rather datestr) matching this id.
> + String id = httpRequest.getHeader("Proxy-Id");
> + if(id == null) id = httpRequest.getRemoteAddr();
> + wbRequest.put(WaybackConstants.REQUEST_EXACT_DATE, Timestamp.getTimestampForId(id));
>
> return wbRequest;
> }
> ------------------------------------------------------------------------
>
> Index: QueryServlet.java
> ===================================================================
> RCS file: /cvsroot/archive-access/archive-access/projects/wayback/src/java/org/archive/wayback/query/QueryServlet.java,v
> retrieving revision 1.5
> diff -u -r1.5 QueryServlet.java
> --- QueryServlet.java 7 Mar 2006 23:22:20 -0000 1.5
> +++ QueryServlet.java 6 Apr 2006 11:38:30 -0000
> @@ -25,7 +25,9 @@
> package org.archive.wayback.query;
>
> import java.io.IOException;
> +import java.text.ParseException;
> import java.util.Enumeration;
> +import java.util.Iterator;
> import java.util.Properties;
>
> import javax.servlet.ServletConfig;
> @@ -39,7 +41,9 @@
> import org.archive.wayback.QueryRenderer;
> import org.archive.wayback.ReplayResultURIConverter;
> import org.archive.wayback.ResourceIndex;
> +import org.archive.wayback.core.SearchResult;
> import org.archive.wayback.core.SearchResults;
> +import org.archive.wayback.core.Timestamp;
> import org.archive.wayback.core.WaybackLogic;
> import org.archive.wayback.core.WaybackRequest;
> import org.archive.wayback.exception.BadQueryException;
> @@ -119,6 +123,14 @@
> if (wbRequest.get(WaybackConstants.REQUEST_TYPE).equals(
> WaybackConstants.REQUEST_URL_QUERY)) {
>
> + // Annotate the closest matching hit so that it can
> + // be retrieved later from the xml.
> + try {
> + annotateClosest(results, wbRequest, httpRequest);
> + } catch (ParseException e) {
> + e.printStackTrace();
> + }
> +
> renderer.renderUrlResults(httpRequest, httpResponse,
> wbRequest, results, uriConverter);
>
> @@ -144,4 +156,34 @@
>
> }
> }
> +
> + // Method annotating the searchresult closest in time to the timestamp
> + // belonging to this request.
> + private void annotateClosest(SearchResults results,
> + WaybackRequest wbRequest, HttpServletRequest request) throws ParseException {
> +
> + SearchResult closest = null;
> + long closestDistance = 0;
> + SearchResult cur = null;
> + String id = request.getHeader("Proxy-Id");
> + if(id == null) id = request.getRemoteAddr();
> + String requestsDate = Timestamp.getTimestampForId(id);
> + Timestamp wantTimestamp;
> + wantTimestamp = Timestamp.parseBefore(requestsDate);
> +
> + Iterator itr = results.iterator();
> + while (itr.hasNext()) {
> + cur = (SearchResult) itr.next();
> + long curDistance;
> + Timestamp curTimestamp = Timestamp.parseBefore(cur
> + .get(WaybackConstants.RESULT_CAPTURE_DATE));
> + curDistance = curTimestamp.absDistanceFromTimestamp(wantTimestamp);
> +
> + if ((closest == null) || (curDistance < closestDistance)) {
> + closest = cur;
> + closestDistance = curDistance;
> + }
> + }
> + if (closest != null) closest.put("closest", "true");
> + }
> }
> ------------------------------------------------------------------------
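
The closest-in-time selection in annotateClosest can be tried out in isolation. The following is a minimal sketch under a few assumptions: captures are plain 14-digit timestamp strings rather than SearchResult objects, java.time stands in for the Wayback Timestamp class, and the name ClosestCaptureSketch is hypothetical. It returns null for an empty result list, a case in which the loop never assigns a closest hit.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone version of the closest-capture search.
class ClosestCaptureSketch {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmss");

    // Returns the capture timestamp nearest in time to 'wanted',
    // or null when there are no captures at all.
    static String closest(List<String> captures, String wanted) {
        LocalDateTime want = LocalDateTime.parse(wanted, FMT);
        String best = null;
        long bestDistance = Long.MAX_VALUE;
        for (String capture : captures) {
            LocalDateTime cur = LocalDateTime.parse(capture, FMT);
            long distance = Math.abs(Duration.between(want, cur).getSeconds());
            // On a tie the earlier entry in the list wins.
            if (best == null || distance < bestDistance) {
                best = capture;
                bestDistance = distance;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> captures = Arrays.asList(
                "20030101000000", "20040615000000", "20060101000000");
        System.out.println(closest(captures, "20040101120000"));
    }
}
```

A single linear pass is enough here because the result set for one URL is usually small; a sorted index would only pay off for very large result lists.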
>
> <?xml version="1.0"?>
> <!DOCTYPE web-app PUBLIC "-//Sun Microsystems, Inc.//DTD Web Application 2.3//EN"
> "http://java.sun.com/dtd/web-app_2_3.dtd">
> <web-app>
>
> <!-- General Installation information
> -->
>
> <context-param>
> <param-name>installationname</param-name>
> <param-value>Local Proxy Installation</param-value>
> <description>
> This text will appear on the Wayback Configuration and Status page
> and may assist in determining which installation users are viewing
> via their web browser in environments with multiple Wayback
> installations.
> </description>
> </context-param>
>
>
> <!-- Local Arc Path Configuration:
> used by both indexpipeline and LocalARCResourceStore
> -->
>
> <context-param>
> <param-name>arcpath</param-name>
> <param-value>/tmp/wayback/arcs</param-value>
> <description>
> Directory where ARC files are found (possibly where Heritrix writes them.)
> This directory must exist.
> </description>
> </context-param>
>
>
>
> <!-- ResourceStore Configuration -->
>
> <context-param>
> <param-name>resourcestore.classname</param-name>
> <param-value>org.archive.wayback.localresourcestore.LocalARCResourceStore</param-value>
> <description>Class that implements ResourceStore for this Wayback</description>
> </context-param>
>
>
>
> <!-- ResourceIndex Configuration -->
>
> <context-param>
> <param-name>resourceindex.classname</param-name>
> <param-value>org.archive.wayback.cdx.LocalBDBResourceIndex</param-value>
> <description>Class that implements ResourceIndex for this Wayback</description>
> </context-param>
>
> <context-param>
> <param-name>resourceindex.indexpath</param-name>
> <param-value>/tmp/wayback/index</param-value>
> <description>
> LocalBDBResourceIndex specific directory to store the BDB files.
> This directory must exist.
> </description>
> </context-param>
>
> <context-param>
> <param-name>resourceindex.dbname</param-name>
> <param-value>DB1</param-value>
> <description>
> LocalBDBResourceIndex specific name for BDB database
> </description>
> </context-param>
>
>
> <!-- ResourceIndex Pipeline Configuration -->
>
> <context-param>
> <param-name>indexpipeline.workpath</param-name>
> <param-value>/tmp/wayback/pipeline</param-value>
> <description>
> LocalBDBResourceIndex specific directory to store flag files and
> temporary index data. This directory must exist.
> </description>
> </context-param>
>
> <context-param>
> <param-name>indexpipeline.runpipeline</param-name>
> <param-value>1</param-value>
> <description>
> if set to '1' then a background indexing thread will automatically
> update the BDB index when new ARC files are noticed in the 'arcpath'
> directory.
> </description>
> </context-param>
>
> <!-- Pipeline Filter Configuration
> this enables a trivial (and very in-progress) UI for viewing the
> pipeline status.
> -->
>
> <filter>
> <filter-name>PipelineFilter</filter-name>
> <filter-class>org.archive.wayback.cdx.indexer.PipelineFilter</filter-class>
> <init-param>
> <param-name>pipeline.statusjsp</param-name>
> <param-value>jsp/PipelineUI/PipelineStatus.jsp</param-value>
> </init-param>
> </filter>
> <filter-mapping>
> <filter-name>PipelineFilter</filter-name>
> <url-pattern>/pipeline</url-pattern>
> </filter-mapping>
>
>
>
>
> <!-- Query Servlet Configuration -->
>
> <servlet>
> <servlet-name>QueryServlet</servlet-name>
> <servlet-class>org.archive.wayback.query.QueryServlet</servlet-class>
> <init-param>
> <param-name>queryui.jsppath</param-name>
> <param-value>jsp/QueryUI</param-value>
> </init-param>
> </servlet>
> <servlet-mapping>
> <servlet-name>QueryServlet</servlet-name>
> <url-pattern>/query</url-pattern>
> </servlet-mapping>
>
> <!-- XMLQuery Servlet Configuration -->
>
> <servlet>
> <servlet-name>XMLQueryServlet</servlet-name>
> <servlet-class>org.archive.wayback.query.QueryServlet</servlet-class>
> <init-param>
> <param-name>queryui.jsppath</param-name>
> <param-value>jsp/QueryXMLUI</param-value>
> </init-param>
> </servlet>
> <servlet-mapping>
> <servlet-name>XMLQueryServlet</servlet-name>
> <url-pattern>/xmlquery</url-pattern>
> </servlet-mapping>
>
> <!-- QueryUI Configuration -->
>
> <context-param>
> <param-name>queryrenderer.classname</param-name>
> <param-value>org.archive.wayback.query.Renderer</param-value>
> <description>Implementation responsible for drawing Index Query results</description>
> </context-param>
>
> <context-param>
> <param-name>proxy.redirectpath</param-name>
> <param-value>/jsp/QueryUI/Redirect.jsp</param-value>
> </context-param>
>
>
>
> <!-- Replay Servlet Configuration -->
>
> <servlet>
> <servlet-name>ReplayServlet</servlet-name>
> <servlet-class>org.archive.wayback.replay.ReplayServlet</servlet-class>
> </servlet>
> <servlet-mapping>
> <servlet-name>ReplayServlet</servlet-name>
> <url-pattern>/replay</url-pattern>
> </servlet-mapping>
>
>
>
> <!-- Proxy RawReplayUI Configuration -->
>
> <context-param>
> <param-name>replayrenderer.classname</param-name>
> <param-value>org.archive.wayback.proxy.RawReplayRenderer</param-value>
> <description>Implementation responsible for drawing replayed resources and replay error messages</description>
> </context-param>
>
> <context-param>
> <param-name>replayui.jsppath</param-name>
> <param-value>jsp/ReplayUI</param-value>
> <description>
> RawReplayUI specific path to jsp pages. relative to webapp/
> </description>
> </context-param>
>
> <!-- Proxy URI Conversion Configuration -->
>
> <context-param>
> <param-name>replayuriconverter.classname</param-name>
> <param-value>org.archive.wayback.proxy.ResultURIConverter</param-value>
> <description>Class that implements translation of index results to Replayable URIs for this Wayback</description>
> </context-param>
>
> <!-- Proxy ReplayFilter Configuration -->
>
> <filter>
> <filter-name>ReplayFilter</filter-name>
> <filter-class>org.archive.wayback.proxy.ReplayFilter</filter-class>
>
> <init-param>
> <param-name>handler.url</param-name>
> <param-value>/replay</param-value>
> </init-param>
> </filter>
> <filter-mapping>
> <filter-name>ReplayFilter</filter-name>
> <url-pattern>/*</url-pattern>
> </filter-mapping>
>
> </web-app>
|