From: <bra...@us...> - 2008-08-14 03:26:26
Revision: 2552 http://archive-access.svn.sourceforge.net/archive-access/?rev=2552&view=rev Author: bradtofel Date: 2008-08-14 03:26:36 +0000 (Thu, 14 Aug 2008) Log Message: ----------- DOC: updated for 1.4 release.. Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-08-14 03:25:42 UTC (rev 2551) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-08-14 03:26:36 UTC (rev 2552) @@ -79,435 +79,164 @@ <section name="Wayback Configuration Overview"> - <p> - The wayback software provides Search and Replay access to documents - contained in a WaybackCollection. Search access allows users to - query a collection to locate documents, and is presently limited - to URL based queries. Replay access allows users to view archived - content in collections within a web browser. A WaybackCollection is - a combination of a ResourceStore, which contains the actual archived - documents, and a ResourceIndex, which provides URL based search of the - documents in the ResourceStore. - </p> - <p> - The Wayback machine is configured using Spring IOC, to specify and - configure concrete implementations of several basic modules. For - information about using Spring, please see - <a href="http://www.springframework.org/docs/reference/beans.html"> - this page - </a>. - </p> - </section> - - - - <section name="Defining WaybackCollections"> - <p> - The XML configuration template for a Wayback collection follows: - <pre> - -<bean id="localbdbcollection" - class="org.archive.wayback.webapp.WaybackCollection"> - <property name="resourceStore" ... /> - <property name="resourceIndex" ... /> - <property name="shutdownables" ... 
/> -</bean> - - </pre> - </p> <p> - The resourceStore property refers to a bean implementing - <a href="resource_store.html">org.archive.wayback.ResourceStore</a>. + The wayback software provides Query and Replay access to archived + documents. Query access allows users to locate particular documents + within the collection by URL and date. Replay access allows users to + view archived pages within their web browsers. Some Replay modes + require altering the original pages so embedded content is also loaded + from the wayback service, and not from the live web. </p> <p> - The resourceIndex property refers to a bean implementing - <a href="resource_index.html">org.archive.wayback.ResourceIndex</a>. + A WaybackCollection defines a set of archived documents and an index + which allows documents to be located within the collection. A + WaybackCollection may be exposed to end users through one or more + AccessPoints, which define: + <ul> + <li>the WaybackCollection itself</li> + <li>the URL where users can access the collection</li> + <li>how users can query the collection (the Query UI)</li> + <li>how documents are returned to users so they appear correctly in + their web browsers (the Replay UI)</li> + <li>the look and feel of the wayback user interface</li> + <li>who can access the documents in the collection</li> + <li>which documents from the collection are available</li> + </ul> </p> <p> - The shutdownables property refers to a list of beans implementing org.archive.wayback.Shutdownable, typically worker Threads performing automatic updates of the Collection. + Wayback is configured using Spring IOC, to specify and configure + concrete implementations of several basic modules. For information + about using Spring, please see + <a href="http://www.springframework.org/docs/reference/beans.html"> + this page + </a>. 
</p> - </section> - - <section name="org.archive.wayback.ResourceIndex implementations"> - - - <subsection name="LocalResourceIndex"> + <subsection name="AccessPoint configuration options"> <p> - This ResourceIndex implementation allows wayback to search one of - several index formats hosted on the same machine as the wayback - application. See below for details on which specific index formats - are available. + An AccessPoint's configuration must specify the following + implementations: + <ul> + <li><a href="WaybackCollection_Configuration"><b>collection</b></a> + the specific WaybackCollection being exposed via this + AccessPoint. + </li> + <li><a href="Query_UI"><b>query</b></a> responsible for generating + user visible content in response to user Queries, HTML, XML, + etc.</li> + <li><a href="Replay_Modes"><b>replay</b></a> responsible for + determining the appropriate ReplayRenderer implementation based + on the users request and the particular document to be + Replayed.</li> + <li><b>uriConverter</b> responsible for constructing Replay URLs + from records matching users queries. See Replay Modes below. + </li> + <li><b>parser</b> - responsible for translating incoming requests + into WaybackRequests. See Replay Modes below.</li> + </ul> </p> <p> - The XML configuration template for a LocalResourceIndex follows: - <pre> - -<property name="resourceIndex"> - <bean class="org.archive.wayback.resourceindex.LocalResourceIndex"> - <property name="source" ... /> - <property name="maxRecords" value="10000" /> - <property name="dedupeRecords" value="false" /> - </bean> -</property> - - </pre> - </p> - <p> - <b> - maxRecords - </b> - specifies the maximum number of records to process, and thus that can - be returned, during a single query. - </p> - <p> - <b> - dedupeRecords - </b> - set to true if you are using WARC files created by Heritrix 1.12 or - higher and configured the duplicate reduction features. 
See the - section Duplicate Reduction below for more information. - </p> - <p> - <b> - source - </b> - defines the format to be used for storing and searching records in - the ResourceIndex. There are several possible implementations - available: + An AccessPoint's configuration may optionally specify the following: <ul> + <li><a href="Exception_Rendering"><b>exception</b></a> - an + implementation responsible for generating error pages to users + </li> <li> - <b> - BDBIndex - </b> - This implementation is good for smaller scale installations, up - to 10's of millions of documents, and allows for fast incremental - updates to the index. It also allows for automated index updating. - <pre> - -<bean class="org.archive.wayback.resourceindex.bdb.BDBIndex" - init-method="init"> - <property name="bdbName" value="DB1" /> - <property name="bdbPath" value="/tmp/wayback/index/" /> - <property name="updater"> - <bean class="org.archive.wayback.resourceindex.bdb.BDBIndexUpdater"> - <property name="incoming" value="/tmp/wayback/index-data/incoming/" /> - <property name="failed" value="/tmp/wayback/index-data/failed/" /> - <property name="merged" value="/tmp/wayback/index-data/merged/" /> - <property name="runInterval" value="10000" /> - </bean> - </property> -</bean> - - </pre> - The <b>updater</b> property is optional. If used, a background - index merging thread will be started. Every <b>runInterval</b> - milliseconds, the thread will look for new files in the - <b>incoming</b> directory. Any files present are assumed to be - in CDX file format, and will be merged into the index and - immediately available for access. Files that are not successfully - merged with the index are left in place (or moved to the - <b>failed</b> directory, if it is specified.) Files that are - successfully merged are deleted (or moved to the <b>merged</b> - directory, if it is specified.) 
- <br></br> + <a href="Adding_Additional_Configurations_to_an_AccessPoint"> + <b>configs</b> + </a> - a Properties associating arbitrary key-value pairs which + are accessible to .jsp files responsible for generating the UI </li> <li> - <b> - CDXIndex - </b> - This implementation is good for larger scale installations, - bounded mostly by the size of the index you can (first create, - and later) store on a single machine. Using the command line tool - <b>arc-indexer</b> or <b>warc-indexer</b>, and the standard UNIX - <b>sort</b> tool (see note below on LC_ALL), you create a sorted - flat text file that is searched on each request. Building these - sorted files, and updating the index are manual operations - presently. - <pre> - -<bean id="cdxsearchresultsource" class="org.archive.wayback.resourceindex.cdx.CDXIndex"> - <property name="path" value="/tmp/wayback/cdx-index/index.cdx" /> -</bean> - - </pre> + <a href="Excluding_Documents_within_an_AccessPoint"> + <b>exclusionFactory</b> + </a> - an implementation specifying what documents should be + accessible within this AccessPoint </li> <li> - <b> - CompositeSearchResultSource - </b> - This implementation allows for searching multiple CDXIndex text - files for each request. For optimal search efficiency, multiple - index files should be merged (sort -mu) prior to production use, - but this implementation allows a trade-off in simplified index - management for a decrease in search performance. 
- <pre> - -<bean id="compositecdxresultsource" class="org.archive.wayback.resourceindex.CompositeSearchResultSource"> - <property name="CDXSources"> - <list> - <value>/tmp/wayback/cdx-index/index.cdx.1</value> - <value>/tmp/wayback/cdx-index/index.cdx.2</value> - </list> - </property> -</bean> - - </pre> + <a href="Restricting_who_can_interact_with_an_AccessPoint"> + <b>authentication</b> + </a> - an implementation specifying who is allowed to connect to + this AccessPoint </li> + <li><b>urlRoot</b> - a String URL prefix under which all UI + elements should be referenced. + </li> + <li><b>locale</b> - A specific Locale to use for all requests + within this AccessPoint, overriding the users preferred Locale + as specified by their web browser. + </li> </ul> </p> - - </subsection> - - - <subsection name="RemoteResourceIndex configuration"> <p> - This ResourceIndex option allows hosting of a ResourceIndex on a - machine other than the machine hosting the Wayback webapp. + AccessPoints can be used to provide different levels and types of + access to the same collection for different users. For example, you + can provide both Proxy and Archival URL mode access to a single + collection by defining 2 AccessPoints with different Replay User + Interfaces but the same WaybackCollection. Using AccessPoints, you can + also provide different levels of access to a collection. For example, + users within a particular subnet may be able to access all documents + within a collection via one AccessPoint, but users outside that subnet + may be restricted to viewing documents allowed by a web sites current + robots.txt file. 
</p> <p> - The XML configuration template for a RemoteResourceIndex follows: - <pre> - -<bean id="remoteindex" class="org.archive.wayback.resourceindex.RemoteResourceIndex" init-method="init"> - <property name="searchUrlBase" value="http://wayback-index.archive.org:8080/wayback/xmlquery" /> -</bean> - - </pre> - <b>searchUrlBase</b> indicates the URL prefix to which OpenSearchQuery - parameters are appended to access a Wayback AccessPoint running a - LocalResourceIndex on a remote host to the Wayback application. + Please refer to + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/wayback.xml">wayback.xml</a> + within the wayback .war file for detailed example AccessPoint + configurations. </p> - </subsection> - - - <subsection name="NutchResourceIndex configuration"> + <subsection name="WaybackCollection Configuration"> <p> - This ResourceIndex option allows the wayback to query a Nutch - full-text search engine. This ResourceIndex option is highly - experimental. For help setting up a NutchResourceIndex, please see - <a href="http://archive-access.sourceforge.net/projects/nutch/wayback.html"> - this page. 
- </a> + A WaybackCollection's configuration must specify the following + implementations: + <ul> + <li><a href="resource_store.html">resourceStore</a> the specific + implementation used to specify the set of documents within this + collection, and how to access them for Replay requests.</li> + <li><a href="resource_index.html">resourceIndex</a> the specific + implementation responsible for locating documents within the + collection.</li> + </ul> </p> <p> - The XML configuration template for a NutchResourceIndex follows: - <pre> - - <property name="remotenutchindex"> - <bean class="org.archive.wayback.resourceindex.NutchResourceIndex" init-method="init"> - <property name="searchUrlBase" value="http://webteam-ws.us.archive.org:8080/katrina/opensearch" /> - <property name="maxRecords" value="100" /> - </bean> - </property> - - </pre> - <b>searchUrlBase</b> indicates the URL prefix to which OpenSearchQuery - parameters are appended to access a Nutch servers XML query interface. - + A WaybackCollection's configuration may optionally specify the + following: + <ul> + <li>shutdownables - a List of one or more beans implementing + org.archive.wayback.Shutdownable needed to maintain this + WaybackCollection, typically Daemon Threads which perform + automatic indexing operations on the resourceStore and the + resourceIndex.</li> + </ul> </p> + <p> + For more information on WaybackCollection configuration options and + automatic indexing, please refer to the following documentation pages + and to the example Spring .xml configuration files within the wayback + .war: + <ul> + <li><a href="resource_store.html">ResourceStore configuration and + automatic indexing</a></li> + <li><a href="resource_index.html">ResourceIndex configuration</a></li> + <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/BDBCollection.xml">BDBCollection.xml</a></li> + <li><a 
href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/CDXCollection.xml">CDXCollection.xml</a></li> + <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/RemoteCollection.xml">RemoteCollection.xml</a></li> + <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/NutchCollection.xml">NutchCollection.xml</a></li> + </ul> + </p> </subsection> </section> - - - <section name="Defining AccessPoints for WaybackCollections"> + <section name="Replay Modes"> <p> - Once you have defined one or more WaybackCollections, you need to - specify how those collections are exposed to end users. Collections are - exposed by defining an AccessPoint for that collection. + There are presently 3 Replay modes supported by the Wayback software, + Archival URL mode, Proxy mode, and an experimental DomainPrefix mode. </p> - <p> - An AccessPoint is a combination of a WaybackCollection, a Query User - Interface, a Replay User Interface, and a URL by which users interact - with that AccessPoint. AccessPoints can also describe mechanisms for - excluding documents, and for limiting what users are allowed to - interact with the AccessPoint. - </p> - <p> - AccessPoints can be used to provide different levels and types of - access to the same collection for different users. For example, you - can provide both Proxy and Archival URL mode access to a single - collection by defining 2 AccessPoints with different Replay User - Interfaces but the same WaybackCollection. Using AccessPoints, you can - also provide different levels of access to a collection. 
For example, - users within a particular subnet may be able to access all documents - within a collection via one AccessPoint, but users outside that subnet - may be restricted to viewing documents allowed by a web sites current - robots.txt file. - </p> - <p> - The XML configuration template for an AccessPoint follows: - <pre> - -<bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint"> - <property name="collection" ... /> - <property name="query" ... /> - <property name="replay" ... /> - <property name="parser" ... /> - <property name="uriConverter" ... /> - <property name="exclusionFactory" ... /> - <property name="authentication" ... /> - <property name="configs" ... /> -</bean> - - </pre> - </p> - <p> - Required property configurations: - <ul> - <li> - <b> - collection - </b> - is a reference to the WaybackCollection for this AccessPoint. - </li> - <li> - <b> - query - </b> - defines what .jsp files to use to render results for queries to - this AccessPoint. See the section "Query .jsp configuration" for - more information. - </li> - <li> - <b> - replay - </b> - defines what Replay User Interface to use for this AccessPoint. See - the section "Setting up the Replay User Interface within an - AccessPoint" for more information. - </li> - <li> - <b> - parser - </b> - defines how incoming requests are parsed and subsequently processed, - and is usually dependent on the Replay User Interface being used - with this AccessPoint.See the section "Setting up the Replay User - Interface within an AccessPoint" for more information. - </li> - <li> - <b> - uriConverter - </b> - defines how public URLs are constructed to provide Replay access - to this AccessPoint. This is usually dependant on the Replay User - Interface used with this AccessPoint. See the section "Setting up - the Replay User Interface within an AccessPoint" for more - information. 
- </li> - </ul> - </p> - <p> - Optional property configurations: - <ul> - <li> - <b> - exclusionFactory - </b> - defines how documents are excluded within this AccessPoint. See the - section "Excluding Documents within an AccessPoint" for more - information. - </li> - <li> - <b> - authentication - </b> - defines who is allowed to interact with this AccessPoint. See the - section "Limiting Access to an AccessPoint" for more information. - </li> - <li> - <b> - configs - </b> - Allows additional customizations within this AccessPoint. See the - section "Adding Additional Configurations to an AccessPoint" for - more information. - </li> - </ul> - </p> - </section> - - - <section name="Query .jsp configuration"> - <p> - Wayback provides query results to a .jsp handler page, which is - responsible for rendering final output to users. The actual .jsp file - invoked for the various response types can be configured as described - below. Included with the Wayback package are several reference .jsp - implementations, including one which outputs XML. This XML interface is - used by the Wayback software in distributed index configurations, but - can also be used as an extension point for further user interface - customizations. - </p> - <br></br> - <p> - The XML configuration template for the query Renderer follows below, - including the default configuration for each value. The values indicate - the path to the .jsp file that will be executed to generate the output - for each class of query. 
- <pre> - -<bean class="org.archive.wayback.query.Renderer"> - <property name="errorJsp" value="/jsp/HTMLError.jsp" /> - <property name="xmlErrorJsp" value="/jsp/XMLError.jsp" /> - <property name="captureJsp" value="/jsp/HTMLResults.jsp" /> - <property name="urlJsp" value="/jsp/HTMLResults.jsp" /> - <property name="xmlJsp" value="/jsp/XMLResults.jsp" /> -</bean> - - </pre> - The following list indicates when each .jsp is executed: - <ul> - <li> - <b> - errorJsp - </b> - will be executed when any type of expected error condition occurs - during handling of a request. - </li> - <li> - <b> - xmlErrorJsp - </b> - will be executed when any type of expected error condition occurs - during handling of a request indicating that xml response data is - desired. - </li> - <li> - <b> - captureJsp - </b> - will be executed when results listing captures for a specific, - single URL are requested in HTML format. - </li> - <li> - <b> - urlJsp - </b> - will be executed when results listing captures for multiple URLs, - each URL having one or more captures, are requested in HTML format. - </li> - <li> - <b> - xmlJsp - </b> - will be executed when results are requested in XML format. - </li> - </ul> - </p> - </section> - - <section name="Setting up the Replay User Interface within an AccessPoint"> - <p> - There are presently 2 Replay modes supported by the Wayback software, - Archival URL mode, and Proxy mode. - </p> - <subsection name="Archival URL"> + <subsection name="Archival URL Replay Mode"> <p> Archival URL Replay mode uses a modified URL to designate - documents stored in ARC files. The general form of an + documents stored in ARC/WARC files. The general form of an Archival URL is: <br></br> <div> @@ -519,7 +248,7 @@ where <ul> <li> - <b>HOSTNAME</b> is the host where the Wayback Machine is + <b>HOSTNAME</b> is the host where the Wayback software is running. </li> <li> @@ -528,9 +257,9 @@ the Access Point. See below for example CONTEXT mappings. 
</li> <li> - <b>CONTEXT</b> is the context where the Wayback Machine - webapp has been deployed, plus the name of the Access Point. See - below for example CONTEXT mappings. + <b>CONTEXT</b> is the context where the Wayback webapp has been + deployed, plus the name of the Access Point. See below for + example CONTEXT mappings. </li> <li> <b>TIMESTAMP</b> is 0 to 14 digits of a date, possibly @@ -724,24 +453,11 @@ </table> </p> <p> - The properties <b>replay</b>, <b>parser</b>, and <b>uriConverter</b> + The properties <b>parser</b> and <b>uriConverter</b> for Archival URL Access Points must be set to the following implementations: <pre> - <property name="replay"> - <bean class="org.archive.wayback.archivalurl.ArchivalUrlReplayDispatcher"> - <property name="serverSideRendering" value="false" /> - <property name="jspInserts"> - <list> - <value>/replay/ArchiveComment.jsp</value> - <value>/replay/ClientSideJSInsert.jsp</value> - <value>/replay/Timeline.jsp</value> - </list> - </property> - </bean> - </property> - <property name="parser"> <bean class="org.archive.wayback.archivalurl.ArchivalUrlRequestParser" init-method="init"> @@ -772,55 +488,6 @@ </tr> <tr> <td> - serverSideRendering - </td> - <td> - required - </td> - <td> - When set to true, all URL rewriting occurs on the server, - eliminating the need for client side Javascript rewriting. If this - option is set to false, then the <i>ClientSideJSInsert.jsp</i> - <b>jspInsert</b> should be used. If this option is true, and - you're attempting to set up an entirely JavaScript free - installation which includes an embedded Timeline in replayed - HTML documents, you can use the <i>JSLessTimeline.jsp</i> - <b>jspInsert</b>. - </td> - </tr> - <tr> - <td> - jspInserts - </td> - <td> - optional - </td> - <td> - If any values are included here, then those .jsp files will be - invoked for every replayed document, and the resulting output - will be included in replayed HTML pages. 
The example included - here will result in: - <ul> - <li> - An HTML comment embedded inside replayed web pages indicating - the dates the document was captured and the date it was served - by wayback. - </li> - <li> - A reference to a javascript file, client-rewrite.js, which - will attempt to modify URLs within the users browser to make - them direct back into wayback. - </li> - <li> - A timeline banner embedded in the top of HTML pages that - allows navigation between other versions of the currently - viewed document. - </li> - </ul> - </td> - </tr> - <tr> - <td> maxRecords </td> <td> @@ -857,36 +524,39 @@ </tr> </table> <p> - Note that the old <b>jsInserts</b> configuration has been deprecated, - in favor of including references to JavaScript files using jspInserts. - Also note that the use of the ClientSideJSInsert.jsp is required when - serverSideRendering is set to false. + For additional configuration examples and information about + ArchivalUrl Replay mode, please see the file + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/ArchivalUrlReplay.xml">ArchivalUrlReplay.xml</a> </p> </subsection> - <subsection name="Proxy"> + <subsection name="Proxy Replay Mode"> <p> Wayback can be configured to act as an HTTP proxy server. To utilize this mode, the wayback webapp must be deployed as the ROOT context, and client browser must be configured to proxy all HTTP requests through the Wayback Machine application. Instead of retrieving documents from the live web, the Wayback Machine will retrieve - documents from the local repository of ARC files. + documents from the configured WaybackCollection. </p> <p> Proxy Replay mode does not suffer from the shortcomings of - the inserted Javascript that the Archival URL mode uses, - but it has one major drawback: there is no way to - specify which version of a captured document should - be replayed. 
Only the URL to be replayed is sent from the - client browser to the Wayback Machine - no date information - is sent with the request. + the inserted Javascript that the Archival URL mode uses, all URLs + function as they did originally, but there can be another drawback + to using this feature: no date information is sent with each request. + Wayback attempts to address this problem by associating the date + clicked on query pages when a Replay session is begun, with the + user's IP address. This can fail to work properly in situations where + multiple users are behind a NAT system which causes them to appear to + have the same IP address. </p> <p> - In Proxy Replay mode, the Wayback Machine will return the - most recent version captured of any requested page. This - behavior can be changed by using the experimental Firefox-specific - plugin developed by Oskar Grenholm. You can find out more about + Additionally, there is an experimental Firefox-specific plugin + developed by Oskar Grenholm, which provides a novel interface + to navigate between different captured versions of a page within + Proxy mode, and also sends a special HTTP header which allows Wayback + to uniquely associate the correct date with browsers, even those + behind a NAT system. 
You can find out more about this plugin and download it <a href="http://archive-access.sourceforge.net/projects/waxtoolbar/"> here @@ -905,17 +575,15 @@ <pre> <bean name="8090" parent="8080:wayback"> - <property name="useServerName" value="true" /> - <property name="replay"> - <bean class="org.archive.wayback.proxy.ProxyReplayDispatcher" /> - </property> + <property name="urlRoot" value="http://wayback.somehost.org/" /> + <property name="replay" ref="proxyreplay" /> <property name="uriconverter"> <bean class="org.archive.wayback.proxy.RedirectResultURIConverter"> - <property name="redirectURI" value="http://wayback.somehost.org:8090/jsp/Redirect.jsp" /> + <property name="redirectURI" value="http://wayback.somehost.org/jsp/Redirect.jsp" /> </bean> </property> <property name="parser"> - <bean class="org.archive.wayback.proxy.ProxyRequestParser" init-method="init"> + <bean class="org.archive.wayback.proxy.ProxyRequestParser" > <property name="localhostNames"> <list> <value>wayback.somehost.org</value> @@ -934,13 +602,256 @@ primary name of the machine running the Wayback application, then you may need to also specify the hostname used for the Wayback application in the <b>localhostNames</b> configuration list. - </p> + </p> + <p> + For additional configuration examples and information about + Proxy Replay mode, please see the file + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/ProxyReplay.xml">ProxyReplay.xml</a> + </p> </subsection> + <subsection name="DomainPrefix Replay Mode"> + <p> + Wayback includes an additional, experimental Replay mode which is + similar to Archival URL mode, in that any document can be referenced + as a global URL, without any browser configuration requirements. 
This + mode requires deploying the Wayback webapp in ROOT context, and + special DNS wildcard aliasing, so that all hostnames with a common + suffix will be directed to your host running Wayback. + </p> + <p> + The general form of a DomainPrefix URL is: + <br></br> + <div> + <code> + http://TIMESTAMP.ARCHIVE-HOSTNAME.WAYBACK-HOSTNAME:PORT/ARCHIVE-PATH + </code> + </div> + </p> + <p> + Here is an example DomainPrefix URL, on an assumed host + <b>wayback.somehost.org</b>, with a wayback webapp deployed as + <b>ROOT</b>, via the Access Point named <b>8081</b> (which indicates the + port Wayback requests will be received on) for the + page <b>http://www.yahoo.com/foo.gif</b> on Dec 31, 1999 at 12:00:00 UTC. + <br></br> + <div> + <code> + http://19991231120000.www.yahoo.com.wayback.somehost.org:8081/foo.gif + </code> + </div> + </p> + <p> + This mode performs all URL rewriting on the server side, so it needs no + client-side Javascript to execute, and also does not suffer from some + of the request leakage problems present in Archival URL mode. It + presently is somewhat naive about rewriting links within returned + documents, and will also rewrite URLs in the text of pages + (not desired), as well as URLs referenced within the page (desired). + </p> + <p> + For additional configuration examples and information about + Domain Prefix Replay mode, please see the files + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/wayback.xml">wayback.xml</a> + and + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/DomainPrefixReplay.xml">DomainPrefixReplay.xml</a> + . 
+ </p> + </subsection> </section> + <section name="Wayback UI customization options"> + <p> + Wayback provides several opportunities for customizing the user + interface presented to users, which can be grouped into 4 categories: + <ul> + <li>Query UI rendering .jsp files.</li> + <li>Replay insert .jsp files.</li> + <li>Exception rendering .jsp files.</li> + <li>Localization .properties files.</li> + </ul> + </p> + <subsection name="Query UI"> + <p> + All content returned by Wayback in response to Query requests is + generated by .jsp files, which are executed and provided access to + the results found within the ResourceIndex. Wayback is distributed + with several sample implementations. + </p> + <p> + To alter the default behavior, you may either provide your own .jsp + files, and configure the Renderer to use them instead of the + default .jsp files, or the default .jsp files may be modified + directly. + <ul> + <li> + <b>captureJsp</b> - used when the request indicates that + a listing of all dates available for a single URL should be + returned. Default is + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/HTMLCaptureResults.jsp">/WEB-INF/query/HTMLCaptureResults.jsp</a>. + An alternate implementation, /WEB-INF/query/CalendarResults.jsp + will generate HTML output similar to the global Wayback Machine + service. + </li> + <li> + <b>urlJsp</b> - used when the request indicates that a summary + of captures available for a number of URLs should be returned. + Default is + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/HTMLUrlResults.jsp">/WEB-INF/query/HTMLUrlResults.jsp</a> + </li> + <li> + <b>xmlCaptureJsp</b> - used when the request indicates that + a listing of all dates available for a single URL should be + returned in XML format. 
Default is + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/XMLCaptureResults.jsp">/WEB-INF/query/XMLCaptureResults.jsp</a>. + </li> + <li> + <b>xmlUrlJsp</b> - used when the request indicates that a + summary of captures available for a number of URLs should be + returned in XML format. + Default is + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/query/XMLUrlResults.jsp">/WEB-INF/query/XMLUrlResults.jsp</a> + </li> + </ul> + </p> + </subsection> + <subsection name="Replay Inserts"> + <p> + Wayback allows for embedding additional content within replayed HTML + pages in all Replay modes. This is accomplished by executing one or + more .jsp files with access to context information about the request, + the results, and the actual Resource being returned. The output of + each .jsp file is included within the returned page. + </p> + <p> + Wayback is distributed with several example .jsp insert files that + can be used as is, modified to suit installation requirements, or + used as examples for more elaborate customizations: + <ul> + <li> + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/ArchiveComment.jsp">/WEB-INF/replay/ArchiveComment.jsp</a> + inserts an HTML comment indicating when the document was + captured and retrieved. 
+ </li> + <li> + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/ClientSideJSInsert.jsp">/WEB-INF/replay/ClientSideJSInsert.jsp</a> + inserts some Javascript into the returned HTML page that updates + links, images, and other embedded content, attempting to make + all URL references within the page point back into the Wayback + service. + </li> + <li> + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/DebugBanner.jsp">/WEB-INF/replay/DebugBanner.jsp</a> + Not intended for production use, but a slightly more complex + jsp insert example that demonstrates how to access various + request context data, and is sometimes useful for debugging. + </li> + <li> + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Disclaimer.jsp">/WEB-INF/replay/Disclaimer.jsp</a> + Inserts a small banner at the top of replayed HTML pages, + alerting users that they are viewing an archived page, and + providing some information about the particular capture. + </li> + <li> + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/JSLessTimeline.jsp">/WEB-INF/replay/JSLessTimeline.jsp</a> + Inserts a banner in the top of replayed documents which allows + users to navigate directly between other captures of the current + page they are viewing. This version does not use Javascript to + place the banner, so it will appear in all HTML pages within a + frameset. 
+ </li> + <li> + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Timeline.jsp">/WEB-INF/replay/Timeline.jsp</a> + Inserts a banner in the top of replayed documents which allows + users to navigate directly between other captures of the current + page they are viewing. This version uses Javascript to + place the banner, attempting to only place the banner in the + largest frame within a frameset. + </li> + </ul> + </p> + </subsection> + <subsection name="Exception Rendering"> + <p> + Wayback is distributed with a default ExceptionRenderer that allows + customization of several types of anticipated exceptions that can + occur through normal operations. The BaseExceptionRenderer allows + installations to provide alternate .jsp files which are executed, and + the output of these .jsp files is returned to end users. To alter + the default behavior, you may either provide your own .jsp files, and + configure the BaseExceptionRenderer to use them instead of the + default .jsp files, or the default .jsp files may be modified + directly. + <ul> + <li> + <b>xmlErrorJsp</b> - used when the request indicates that XML + data should be returned. Default is + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/exception/XMLError.jsp">/WEB-INF/exception/XMLError.jsp</a> + </li> + <li> + <b>errorJsp</b> - used for HTML Replay exceptions, and for all + Query exceptions. Default is + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/exception/HTMLError.jsp">/WEB-INF/exception/HTMLError.jsp</a> + </li> + <li> + <b>imageErrorJsp</b> - used when the request appears to be an + embedded Replay request that expects an image to be returned.
+ Default is + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/exception/HTMLError.jsp">/WEB-INF/exception/HTMLError.jsp</a> + which produces HTML output. This may be preferable to + returning an actual image, since web browsers will usually show + any HTML alternate text associated with the image in place of + the image when image data is not returned. Wayback also + includes a 1x1 pixel gif, error_image.gif, which can be used to + display a gray box in place of image requests that result in + an exception. + </li> + <li> + <b>javascriptErrorJsp</b> - used when the request appears to be an + embedded Replay request that expects Javascript content to be + returned. Default is + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/exception/JavaScriptError.jsp">/WEB-INF/exception/JavaScriptError.jsp</a> + </li> + <li> + <b>cssErrorJsp</b> - used when the request appears to be an + embedded Replay request that expects CSS content to be returned. + Default is + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/exception/CSSError.jsp">/WEB-INF/exception/CSSError.jsp</a> + </li> + </ul> + </p> + </subsection> + <subsection name="Localization .properties files."> + <p> + Wayback is packaged with a set of reference implementation .jsp files + for generating Query, Replay, and Exception user interface pages. + References to user-visible text are abstracted within these + .jsp files so the specific text to display in various pages is read + from a .properties file. Wayback will automatically search for a + Locale-specific .properties file from which these text values should + be loaded, allowing the language presented to users to be changed.
+ </p> + <p> + By default, Wayback will use the language preference indicated by the + user's web browser to find an appropriate .properties file, + defaulting to the standard English text if the user's preferred + language is not available. Particular AccessPoints can be forced to a + particular Locale using the AccessPoint.locale property. + </p> + <p> + Several language customization .properties files have already been + contributed by users in the community and are now included with the + standard Wayback distribution. We plan for a completely new and + improved UI implementation for version 1.6, and plan a more active + outreach program to create customizations in as many languages as + possible once this new UI is completed, and the required text + elements are determined. + </p> + </subsection> + </section> + <section name="Excluding Documents within an AccessPoint"> <subsection name="Excluding Documents with live Robots.txt"> Documents may be excluded from access within an Access Point by @@ -1193,7 +1104,9 @@ <p> The <b>-identity</b> option causes the tools to skip canonicalization of URLs. See the documentation for the <b>url-client</b> tool, and - the URL Canonicalization section below for more information. + the <a href="resource_index.html#URL_Canonicalization"> + URL Canonicalization + </a> section for more information. </p> </subsection> @@ -1224,7 +1137,7 @@ <i> LOCATION_URL </i> - is the absolute URL where the ArcProxy can be + is the absolute URL where the FileProxy can be accessed. ex. <b> http://wayback-webapp.your-archive.org:8080/locationdb/locationDB @@ -1271,7 +1184,10 @@ canonicalization function, but can also be used, if the canonicalization function is altered, to update an existing CDX index, without recreating CDX files from original ARCs. See the - seciond URL Canonicalization for more information. + section + <a href="resource_index.html#URL_Canonicalization"> + URL Canonicalization + </a> for more information.
</p> <p> <code> @@ -1297,30 +1213,31 @@ </section> - <section name="ArcProxy and LocationDB application"> + <section name="FileProxy and LocationDB application"> <p> - The Wayback software includes an additional application, the ArcProxy, + The Wayback software includes an additional application, the FileProxy, which can simplify some distributed ResourceStore implementations. The - ArcProxy application exposes two external services, one used to - configure the underlying database mapping ARC filenames to the actual, - fully qualified HTTP 1.1 URL, and a second service which reverse proxies - incoming HTTP 1.1 range requests to appropriate back-end storage nodes. + FileProxy application exposes two external services, one used to + configure the underlying database mapping ARC/WARC filenames to the + actual, fully qualified HTTP 1.1 URL or local path, and a second + service which reverse proxies incoming HTTP 1.1 range requests to + appropriate back-end storage nodes. </p> <p> - The <b>arcproxy</b> reverse proxy service allows one or more HttpARCResourceStore - instances to configure a single URL prefix where all ARC files are - assumed to be located. This reverse proxy then uses a BDB JE to find the - actual current location of the ARC file, and forward the request to the - actual host holding the ARC file. + The <b>fileproxy</b> reverse proxy service allows one or more + SimpleResourceStore instances to configure a single URL prefix where + all ARC/WARC files are assumed to be located. This reverse proxy then + uses a BDB JE database to find the actual current location of the ARC/WARC file, + and forwards the request to the actual host holding the ARC/WARC file. </p> <p> The <b>locationdb</b> service allows population and management of the - BDB JE database(the <i>locationDB</i>) used by the <b>arcproxy</b> + BDB JE database (the <i>locationDB</i>) used by the <b>fileproxy</b> service.
There is also a command line tool, <b>location-client</b> described elsewhere in this document which provides command line access to the management of the locationDB. @@ -1328,210 +1245,27 @@ <p> Adding the following configuration to wayback.xml will expose the - arcproxy and locationdb services: + fileproxy and locationdb services: </p> <pre> -<bean id="filelocationdb" class="org.archive.wayback.resourcestore.http.FileLocationDB" +<bean id="filelocationdb" class="org.archive.wayback.resourcestore.locationdb.BDBResourceFileLocationDB" init-method="init"> - <property name="bdbPath" value="/tmp/wayback/arc-db" /> + <property name="bdbPath" value="/tmp/wayback/file-db/db/" /> <property name="bdbName" value="DB1" /> - <property name="logPath" value="/tmp/wayback/arc-db.log" /> + <property name="logPath" value="/tmp/wayback/file-db/db.log" /> </bean> -<bean name="8080:arcproxy" class="org.archive.wayback.resourcestore.http.ArcProxyServlet"> +<bean name="8080:fileproxy" class="org.archive.wayback.resourcestore.locationdb.FileProxyServlet"> <property name="locationDB" ref="filelocationdb" /> </bean> -<bean name="8080:locationdb" class="org.archive.wayback.resourcestore.http.FileLocationDBServlet"> +<bean name="8080:locationdb" class="org.archive.wayback.resourcestore.locationdb.ResourceFileLocationDBServlet"> <property name="locationDB" ref="filelocationdb" /> </bean> </pre> </section> - <section name="URL Canonicalization"> - <subsection name="Introduction and Concepts"> - <p> - Sometimes URLs found in the field can have multiple forms, for - example: - <pre> - http://www.example.com/img/foo.gif - http://www.example.com/docs/../img/foo.gif - </pre> - are both valid representations of the exact same URL. Another, less - certain example would be: - <pre> - http://www.example.com/Interview.html - http://www.example.com/interview.html - </pre> - which differ only in the capitalization of the letter "i". 
On some - operating systems, these two URLs legitimately specify two distinct - documents. On Windows platforms, they refer to the same document. If - the document on a web server is actually named "Interview.html", but - a web designer creates a web page that refers to this document using - the lowercase "interview.html", then the link will work, and they and - the web site visitors may never notice the difference. The same - situation on a different operating system would probably not work - (although some web server plugins and modules will also correct this - problem transparently) and the web designer would probably notice and - correct the problem. In practice, we have found that it is very rare - for the two URLs above with different capitalization to refer to - different documents, and they can be treated as equivalent in most - situations. - </p> - <p> - Another example, which occurs far more often in the real world, - involves web servers injecting a session ID inside paths to documents - hosted on that web server. These session IDs allow the web server to - track individual user's states. Here are some example URLs - demonstrating path session ID injection: - <pre> - http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page1.aspx - http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page2.aspx - http://www.example.com/(S(a63098d96360a63098d96360))/page3.aspx - </pre> - In these examples, the first two URLs are using one session ID, and - the third uses a different session ID. 
If <b>page3.aspx</b> refers to - <b>page1.aspx</b> using an anchor like this: - <pre> - <a href="page1.aspx">page1</a> - </pre> - and a user visiting <b>page3.aspx</b> clicks the link to page1, then - the wayback will recieve a request for the URL: - <pre> - http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx - </pre> - If page1.aspx was captured using the different session ID, then the - wayback will be unable to locate this document in the index, even - though it was captured. - </p> - <p> - This session ID problem can be mitigated by <i>canonicalizing</i> the - URLs as they are placed in the index, so the index would contain the - following URLs, instead of the original form, which the crawler - captured: - <pre> - http://www.example.com/page1.aspx - http://www.example.com/page2.aspx - http://www.example.com/page3.aspx - </pre> - If the same canonicalization scheme is used to transform incoming - requests, before attempting to lookup URLs in the index, then the - software is able to locate and return the documents correctly. - </p> - </subsection> - <subsection name="Current Status within Wayback"> - <p> - Currently the Wayback includes only a single reference implementation - of a canonicalization scheme, which is currently called - <b>AggressiveUrlCanonicalizer</b>. This implementation provides the - following canonicalization: - <ul> - <li> - <b>www# removal</b> - http://www.example.com => example.com, - http://www13.example.com => example.com - </li> - <li> - <b>user info removal</b> - http://us...@ex... => example.com, - http://user:pas...@ex... => example.com, - </li> - <li> - <b>session ID removal</b> - http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx - ... [truncated message content] |