[Archive-access-cvs] SF.net SVN: archive-access:[3298] trunk/archive-access/projects/wayback/dist/

Revision: 3298
          http://archive-access.svn.sourceforge.net/archive-access/?rev=3298&view=rev
Author:   bradtofel
Date:     2010-10-22 22:35:14 +0000 (Fri, 22 Oct 2010)

Log Message:
-----------
PRE 1.6.0 doc update

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/dist/src/site/site.xml
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/navigation.xml

Added Paths:
-----------
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/access_point_naming.xml
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml

Modified: trunk/archive-access/projects/wayback/dist/src/site/site.xml
===================================================================

--- trunk/archive-access/projects/wayback/dist/src/site/site.xml	2010-10-22 22:34:24 UTC (rev 3297)
+++ trunk/archive-access/projects/wayback/dist/src/site/site.xml	2010-10-22 22:35:14 UTC (rev 3298)
@@ -28,9 +28,8 @@
     <menu name="Overview">
       <item name="Requirements" href="requirements.html"/>
       <item name="Downloads" href="downloads.html"/>
-      <item name="User Manual" href="user_manual.html"/>
       <item name="Administrator Manual" href="administrator_manual.html"/>
-      <item name="Developer Manual" href="developer_manual.html"/>
+      <item name="Hadoop CDX Generation" href="hadoop.html"/>
       <item name="Release Notes" href="release_notes.html"/>
       <item name="FAQ" href="/faq.html"/>
       <item name="API" href="./apidocs"/>

Added: trunk/archive-access/projects/wayback/dist/src/site/xdoc/access_point_naming.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/access_point_naming.xml	                        (rev 0)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/access_point_naming.xml	2010-10-22 22:35:14 UTC (rev 3298)
@@ -0,0 +1,287 @@
+<?xml version="1.0" encoding="utf-8"?>
+<document>
+  <properties>
+    <title>Access Point Naming</title>
+    <author email="brad at archive dot org">Brad Tofel</author>
+    <revision>$$Id$$</revision>
+  </properties>
+  
+  <body>
+
+
+
+    <section name="Overview">
+      <p>
+      Tomcat (like other servlet containers) is configured to listen on one or
+      more ports. Each request received on one of those ports is routed
+      to a particular webapp based on the name of the .war file deployed under
+      the <b>webapps/</b> directory: the target webapp is chosen by matching
+      the first directory in the request path against the deployed .war names.
+      </p>
+      <p>
+        If there are two webapps deployed under the <b>webapps/</b> directory,
+        called <b>webappA.war</b> and <b>webappB.war</b>, then an incoming
+        request <b>/webappA/file1</b> will be received by the webapp inside
+        <b>webappA.war</b> as the request <b>/file1</b>. An incoming request
+        for <b>/webappB/images/foo.gif</b> will be received by the webapp inside
+        <b>webappB.war</b> as <b>/images/foo.gif</b>.
+      </p>
+      <p>
+        Tomcat (and other servlet containers) allow a special .war file to be
+        deployed under the <b>webapps/</b> directory called <b>ROOT.war</b>
+        which will receive requests not matching another webapp. If the above 
+        example also included a webapp deployed under the <b>webapps/</b> 
+        directory named <b>ROOT.war</b>, then requests starting with <b>webappA/</b>
+        will be received by <b>webappA.war</b>, requests starting with <b>webappB/</b>
+        will be received by <b>webappB.war</b>, and all other requests will be
+        received by the <b>ROOT.war</b> webapp.
+      </p>
+      <p>
+        If possible, deploying your webapp as <b>ROOT.war</b> will result in
+        somewhat cleaner public URLs, but this is not a requirement. The
+        examples below all include alternate URL configuration prefixes depending
+        on whether you deploy the Wayback .war file as either <b>ROOT.war</b> or
+        <b>wayback.war</b>.
+      </p>
+      <subsection name="AccessPoint Names">
+        <p>
+          Each AccessPoint Spring XML bean definition must include a <b>name</b>
+          property:
+          <br></br>
+          <code>
+
+&lt;bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint"&gt;
+   ...
+&lt;/bean&gt;
+
+          </code> 
+          <br></br>
+          The <b>name</b> property indicates how requests <b>that are received
+          by the Wayback webapp</b> are routed to the appropriate AccessPoint.
+          Wayback allows targeting AccessPoints based on:
+          <ul>
+            <li>hostname</li>
+            <li>port</li>
+            <li>first path <b>after</b> the optional webapp deployment name
+            (which is empty if you deploy your Wayback webapp as
+            <b>ROOT.war</b>)</li>
+          </ul>
+          using the AccessPoint bean <b>name</b> field composed of <b>hostname</b>:<b>port</b>:<b>first_path</b>.
+        </p>
+        <p>
+          If you have configured DNS to resolve multiple hostnames to the same
+          computer, you can use the <b>hostname:</b> component to select an
+          AccessPoint by virtual host.
+        </p>
+        <p>
+          Port is the only required component of the AccessPoint <b>name</b>.
+          If you have multiple Tomcat <b>Connector</b>s, you can vary the port
+          in the AccessPoint name to target specific AccessPoints. Otherwise,
+          all your AccessPoint names will have the same port, most likely
+          8080 or 80.
+        </p>
+        <p>
+          A more commonly useful AccessPoint name resolving component is the 
+          <b>first-path</b>, which allows you to easily expose multiple
+          collections within a single Wayback webapp deployment, without varying
+          hostnames, or ports (which often require network or system 
+          administrator assistance).
+        </p>
+      </subsection>
+      <subsection name="Example AccessPoint names and URLs">
+        <p>
+          The following table shows how URLs will map to particular AccessPoints
+          assuming you have deployed the Wayback webapp as <b>ROOT.war</b>, on
+          a host with the name "access.example.org", using port 8080.
+          <table>
+            <tr>
+              <th>Access Point bean name</th>
+              <th>Archival URL prefix</th>
+              <th>Archival URL query example for <b>http://archive.org</b></th>
+            </tr>
+            <tr>
+              <td>8080:collectionA</td>
+              <td>http://access.example.org:8080/collectionA/</td>
+              <td>http://access.example.org:8080/collectionA/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>8080:collectionB</td>
+              <td>http://access.example.org:8080/collectionB/</td>
+              <td>http://access.example.org:8080/collectionB/*/http://archive.org/</td>
+            </tr>
+          </table>        
+        </p>
+        <p>
+          If you deployed your Wayback webapp with the name <b>wayback.war</b>,
+          the following table shows how URLs will map to particular
+          AccessPoints, on a host with the name "access.example.org", using port
+          8080.
+          <table>
+            <tr>
+              <th>Access Point bean name</th>
+              <th>Archival URL prefix</th>
+              <th>Archival URL query example for <b>http://archive.org</b></th>
+            </tr>
+            <tr>
+              <td>8080:collectionA</td>
+              <td>http://access.example.org:8080/wayback/collectionA/</td>
+              <td>http://access.example.org:8080/wayback/collectionA/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>8080:collectionB</td>
+              <td>http://access.example.org:8080/wayback/collectionB/</td>
+              <td>http://access.example.org:8080/wayback/collectionB/*/http://archive.org/</td>
+            </tr>
+          </table>        
+        </p>
+        <p>
+          If you have configured multiple <b>Connector</b>s for your Tomcat
+          server, listening on both port <b>80</b> and port <b>8080</b>, and
+          you deploy <b>ROOT.war</b>, you can target different AccessPoints by
+          port, as shown below. These examples assume your server's hostname is
+          still "access.example.org".
+          <table>
+            <tr>
+              <th>Access Point bean name</th>
+              <th>Archival URL prefix</th>
+              <th>Archival URL query example for <b>http://archive.org</b></th>
+            </tr>
+            <tr>
+              <td>80:collectionA</td>
+              <td>http://access.example.org/collectionA/</td>
+              <td>http://access.example.org/collectionA/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>8080:collectionB</td>
+              <td>http://access.example.org:8080/collectionB/</td>
+              <td>http://access.example.org:8080/collectionB/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>80:collectionC</td>
+              <td>http://access.example.org/collectionC/</td>
+              <td>http://access.example.org/collectionC/*/http://archive.org/</td>
+            </tr>
+          </table>        
+        </p>
+        <p>
+          If you have a very limited number of AccessPoints to expose, you can
+          do away with the <b>first-path</b> component to achieve potentially
+          very uncluttered Archival URLs. If you have multiple <b>Connector</b>s
+          for your Tomcat server, listening on both port <b>80</b> and port
+          <b>8080</b>, and you deploy <b>ROOT.war</b>, you can target different
+          AccessPoints by port alone, as shown below. These examples still
+          assume your server's hostname is "access.example.org".
+          <table>
+            <tr>
+              <th>Access Point bean name</th>
+              <th>Archival URL prefix</th>
+              <th>Archival URL query example for <b>http://archive.org</b></th>
+            </tr>
+            <tr>
+              <td>80</td>
+              <td>http://access.example.org/</td>
+              <td>http://access.example.org/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>8080</td>
+              <td>http://access.example.org:8080/</td>
+              <td>http://access.example.org:8080/*/http://archive.org/</td>
+            </tr>
+          </table>        
+        </p>
+        <p>
+          Getting somewhat fancy, you can use virtual hosts, doing away with 
+          non-standard ports, and use hostnames alone to specify AccessPoints.
+          This means getting your Tomcat to listen on port <b>80</b>, and
+          deploying the webapp as <b>ROOT.war</b>. You'd have to configure your
+          DNS so both "collection1.example.org" and "collection2.example.org"
+          point to the host running Wayback:
+          <table>
+            <tr>
+              <th>Access Point bean name</th>
+              <th>Archival URL prefix</th>
+              <th>Archival URL query example for <b>http://archive.org</b></th>
+            </tr>
+            <tr>
+              <td>collection1.example.org:80</td>
+              <td>http://collection1.example.org/</td>
+              <td>http://collection1.example.org/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>collection2.example.org:80</td>
+              <td>http://collection2.example.org/</td>
+              <td>http://collection2.example.org/*/http://archive.org/</td>
+            </tr>
+          </table>        
+        </p>
+      </subsection>
+      <subsection name="Getting really fancy">
+
+        <p>
+          Assuming you've deployed your webapp as <b>ROOT.war</b> and have Tomcat
+          listening on both port 80 and 8080, with the hostnames
+          "collection1.example.org" and "collection2.example.org" both pointing
+          to the host running Wayback, you can combine all three name components:
+          <table>
+            <tr>
+              <th>Access Point bean name</th>
+              <th>Archival URL prefix</th>
+              <th>Archival URL query example for <b>http://archive.org</b></th>
+            </tr>
+            <tr>
+              <td>collection1.example.org:80</td>
+              <td>http://collection1.example.org/</td>
+              <td>http://collection1.example.org/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>collection1.example.org:8080:subset1</td>
+              <td>http://collection1.example.org:8080/subset1/</td>
+              <td>http://collection1.example.org:8080/subset1/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>collection1.example.org:8080:subset2</td>
+              <td>http://collection1.example.org:8080/subset2/</td>
+              <td>http://collection1.example.org:8080/subset2/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>collection2.example.org:8080</td>
+              <td>http://collection2.example.org:8080/</td>
+              <td>http://collection2.example.org:8080/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>collection2.example.org:80:internal</td>
+              <td>http://collection2.example.org/internal/</td>
+              <td>http://collection2.example.org/internal/*/http://archive.org/</td>
+            </tr>
+            <tr>
+              <td>collection2.example.org:80:public</td>
+              <td>http://collection2.example.org/public/</td>
+              <td>http://collection2.example.org/public/*/http://archive.org/</td>
+            </tr>
+          </table>        
+        </p>
+      </subsection>
+<!--
+      <subsection name="ArchivalURL Server-Relative URL rewriting">
+        <p>
+          As hard as we've tried to make Server-side rewrite "do the right
+          thing" in ArchivalURL Replay mode, sometimes things don't work out 
+          right. For example, if a page, <b>http://example.net/news/a.html</b>
+          contains some Javascript, that generates the following HTML with a
+          <b>document.write()</b> call:
+          <br></br>
+          <code>
+          
+&lt;img src="/foo.gif"&gt;&lt;/img&gt;
+          </code>
+          <br></br>
+          And you were running an AccessPoint at <b>http://archive.org/web/</b>,
+          then the page would be expecting that URL to resolve as 
+          <b>http://example.net/foo.gif</b>, but in fact, the page being
+          displayed as 
+        </p>
+      <subsection>
+-->
+    </section>
+  </body>
+</document>
\ No newline at end of file
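
The naming rules documented in access_point_naming.xml above can be sketched in a few lines of Python. This is only an illustration of the documented hostname:port:first_path convention, not Wayback's actual request-routing code. The function names archival_url_prefix and archival_query_url and their parameters are hypothetical, and the sketch assumes plain HTTP with port 80 omitted from URLs, as in the tables.

```python
def archival_url_prefix(bean_name, default_host, webapp_context=""):
    """Build the Archival URL prefix for an AccessPoint bean name.

    bean_name follows the documented form [hostname:]port[:first_path].
    default_host is used when the bean name has no hostname component.
    webapp_context is "" for ROOT.war, or e.g. "wayback" for wayback.war.
    (All three parameter names are assumptions made for this sketch.)
    """
    parts = bean_name.split(":")
    if parts[0].isdigit():
        # No hostname component: the name starts with the port.
        host, port, rest = default_host, parts[0], parts[1:]
    elif len(parts) >= 2 and parts[1].isdigit():
        # hostname:port, optionally followed by first_path.
        host, port, rest = parts[0], parts[1], parts[2:]
    else:
        raise ValueError("port is required in an AccessPoint name: %r" % bean_name)
    if len(rest) > 1:
        raise ValueError("too many components in AccessPoint name: %r" % bean_name)
    first_path = rest[0] if rest else ""

    # Port 80 is the HTTP default, so it is left out of the URL.
    authority = host if port == "80" else "%s:%s" % (host, port)
    # The webapp deployment name (if any) precedes the first path.
    segments = [s for s in (webapp_context, first_path) if s]
    return "http://%s/%s" % (authority, "".join(s + "/" for s in segments))


def archival_query_url(prefix, url):
    """Query for all captures of url, using the '*' timestamp form."""
    return prefix + "*/" + url


prefix = archival_url_prefix("8080:collectionA", "access.example.org", "wayback")
print(prefix)
print(archival_query_url(prefix, "http://archive.org/"))
```

Passing an empty webapp_context reproduces the ROOT.war rows of the tables, and a bean name with a hostname component ignores default_host.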

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml	2010-10-22 22:34:24 UTC (rev 3297)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml	2010-10-22 22:35:14 UTC (rev 3298)
@@ -12,7 +12,6 @@
 
     <section name="Requirements">
 
-
       <subsection name="Third Party Packages">
         <p>
           Please see the
@@ -53,7 +52,7 @@
 	      <p>
 	        Once you have downloaded the .tar.gz file from 
 	        sourceforge, you will need to unpack the file to access the
-	        webapp file, <b>wayback-webapp-1.4.0.war</b>.
+	        webapp file, <b>wayback-webapp-1.6.0.war</b>.
 	      </p>
 		    <p>
 	        Installation and configuration of this software involves the
@@ -66,7 +65,7 @@
 	            Waiting for Tomcat to unpack the .war file.
 	          </li>
 	          <li>
-	            Customizing base wayback.xml file.
+	            Customizing base wayback.xml and possibly other XML configuration files.
 	          </li>
 	          <li>
 	            Restarting tomcat.
@@ -84,18 +83,19 @@
         documents. Query access allows users to locate particular documents
         within the collection by URL and date. Replay access allows users to
         view archived pages within their web browsers. Some Replay modes 
-        require altering the original pages so embedded content is also loaded
-        from the wayback service, and not from the live web.
+        require altering the original pages and resources, so embedded and 
+        referenced content is also loaded from the Wayback service, and not
+        from the live web.
       </p>
       <p>
         A WaybackCollection defines a set of archived documents and an index
-        which allows documents to be located within the collection. A
+        which allows documents to be quickly located within the collection. A
         WaybackCollection may be exposed to end users through one or more
         AccessPoints, which define:
         <ul>
           <li>the WaybackCollection itself</li>
           <li>the URL where users can access the collection</li>
-          <li>how users can query the collection (the Query UI)</li>
+          <li>how query results are presented to users (the Query UI)</li>
           <li>how documents are returned to users so they appear correctly in
             their web browsers (the Replay UI)</li>
           <li>the look and feel of the wayback user interface</li>
@@ -104,12 +104,12 @@
         </ul>
       </p>
       <p>
-        Wayback is configured using Spring IOC, to specify and configure
-        concrete implementations of several basic modules. For information
-        about using Spring, please see
-        <a href="http://www.springframework.org/docs/reference/beans.html">
-          this page
-        </a>.
+        Wayback is configured using 
+        <a href="http://static.springsource.org/spring/docs/2.5.x/reference/beans.html#beans-basics">Spring IOC</a>,
+        to specify and configure concrete implementations of several basic
+        modules. Please see the
+        <a href="http://static.springsource.org/spring/docs/2.5.x/reference/beans.html#beans-basics">Spring website</a> for more information on 
+        configuring beans using Spring XML.
       </p>
       <subsection name="AccessPoint configuration options">
         <p>
@@ -121,8 +121,8 @@
                 AccessPoint.
             </li>
             <li><a href="Query_UI"><b>query</b></a> responsible for generating
-                user visible content in response to user Queries, HTML, XML,
-                etc.</li>
+                user-visible content (HTML, XML, etc.) in response to user
+                Queries.</li>
             <li><a href="Replay_Modes"><b>replay</b></a> responsible for 
                 determining the appropriate ReplayRenderer implementation based
                 on the users request and the particular document to be 
@@ -135,7 +135,9 @@
           </ul>
         </p>
         <p>
-          An AccessPoint's configuration may optionally specify the following:
+          An AccessPoint's configuration may optionally specify the following, 
+          but must specify at least one of replayPrefix, queryPrefix, or 
+          staticPrefix:
           <ul>
             <li><a href="Exception_Rendering"><b>exception</b></a> - an
                 implementation responsible for generating error pages to users
@@ -158,13 +160,38 @@
               </a> - an implementation specifying who is allowed to connect to
               this AccessPoint
             </li>
-            <li><b>urlRoot</b> - a String URL prefix under which all UI
-                elements should be referenced.
+            <li>
+              <b>replayPrefix</b> - a String URL prefix indicating the host,
+              port, and path to the correct Replay AccessPoint. If unspecified,
+              defaults to queryPrefix, then staticPrefix.
             </li>
+            <li>
+              <b>queryPrefix</b> - a String URL prefix indicating the host,
+              port, and path to the correct Query AccessPoint. If unspecified,
+              defaults to staticPrefix, then replayPrefix.
+            </li>
+            <li>
+              <b>staticPrefix</b> - a String URL prefix indicating the host,
+              port, and path to static content used within the UI. If
+              unspecified, defaults to queryPrefix, then replayPrefix.
+            </li>
+            <li>
+              <b>livewebPrefix</b> - a String URL prefix indicating the host,
+              port, and path to the correct Replay AccessPoint.
+            </li>
             <li><b>locale</b> - A specific Locale to use for all requests
                 within this AccessPoint, overriding the users preferred Locale
                 as specified by their web browser.
             </li>
+            <li>
+              <b>exactHostMatch</b> - true or false, if true, only returns 
+              results exactly matching a given request hostname (case insensitive). 
+              Default is false. 
+            </li>
+            <li>
+              <b>exactSchemeMatch</b> - true or false, if true, only returns 
+              results exactly matching a given request scheme. Default is true.
+            </li>
           </ul>
         </p>
         <p>
@@ -222,7 +249,9 @@
             <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/BDBCollection.xml">BDBCollection.xml</a></li>
             <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/CDXCollection.xml">CDXCollection.xml</a></li>
             <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/RemoteCollection.xml">RemoteCollection.xml</a></li>
+<!--
             <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/NutchCollection.xml">NutchCollection.xml</a></li>
+-->
           </ul>
         </p>
       </subsection>
@@ -257,13 +286,14 @@
               the Access Point. See below for example CONTEXT mappings.
             </li>
             <li>
-              <b>CONTEXT</b> is the context where the Wayback webapp has been
-              deployed, plus the name of the Access Point. See below for
-              example CONTEXT mappings.
+              <b>CONTEXT</b> is an optional context where the Wayback webapp
+              has been deployed, plus an optional name of the Access Point 
+              within the webapp. See below for example CONTEXT mappings.
             </li>
             <li>
               <b>TIMESTAMP</b> is 0 to 14 digits of a date, possibly
-              followed by an asterisk ('*'). The format of a 
+              followed by an asterisk ('*'), or one or more tags providing 
+              further specifics for the request. The format of a 
               TIMESTAMP is:
               <div>
                 <code>
@@ -304,6 +334,25 @@
                 Dec 31, 2004 23:01:00 (pm UTC) - 20041231230100
               </div>
               <br></br>
+              <p>
+                Following the date portion of a timestamp, the following flags
+                can be appended:
+                <ul>
+                  <li>
+                    <b>id_</b> Identity - perform no alterations of the original
+                    resource, return it as it was archived.
+                  </li>
+                  <li>
+                    <b>js_</b> Javascript - return document marked up as javascript. 
+                  </li>
+                  <li>
+                    <b>cs_</b> CSS - return document marked up as CSS.
+                  </li>
+                  <li>
+                    <b>im_</b> Image - return document as an image.
+                  </li>
+                </ul>
+              </p>
             </li>
             <li>
               <b>URL</b> represents the actual URL that should be 
@@ -312,17 +361,9 @@
           </ul>
           <br></br>
           <div>
-            Here is an example Archival URL, on an assumed host 
-            <b>wayback.somehost.org</b>, with a wayback webapp deployed as
-            <b>ROOT</b>, via the Access Point named <b>80:archive</b> for the 
-            page <b>http://www.yahoo.com/</b> on Dec 31, 1999 at 12:00:00 UTC.
-            <br></br>
-            <div>
-              <code>
-                http://wayback.somehost.org/archive/19991231120000/http://www.yahoo.com/
-              </code>
-            </div>
-            <br></br>
+            For some simple and more elaborate examples of how AccessPoint bean
+            names interact with Archival URLs, please refer to 
+            <a href="access_point_naming.html">Access Point Naming</a>.
           </div>
           <br></br>
           <div>
@@ -350,107 +391,15 @@
           </div>
           <br></br>
           <div>
-            There is a trade-off between these two approaches. The entirely
-            server-side rewriting requires more server resources, and is less 
-            tested than the JavaScript method. The JavaScript is also imperfect:
-            sometimes requests "leak" to the live web temporarily, before the 
-            Javascript has executed. With both methods, not all URLs are
-            rewritten correctly, especially URLs that are created by JavaScript
-            that was in the original page, and specialized file types containing
-            links like Flash and PDF documents.
+            We currently recommend the entirely server-side rewriting
+            method and are deprecating the original server-side plus JavaScript
+            method, although the latter is still available in Wayback.
+            Neither method is perfect: not all URLs are rewritten correctly,
+            particularly URLs that are created by JavaScript in the original
+            pages, and links inside specialized file types such as Flash
+            and PDF documents.
           </div>
           <br></br>
-          <div>
-            The <b>name</b> of the Access Point bean in the Spring configuration
-            file determines the CONTEXT and PORT used in Archival URLs within
-            that Access Point. The Servlet context name where the Wayback 
-            application is deployed also factors into the CONTEXT used within
-            Archival URLs for each Access Point.
-          </div>
-          <br></br>
-          <div>
-            The following examples show the Archival URL prefix for the 
-            following two Access Points depending on the Wayback webapp being
-            deployed in two different contexts, "ROOT" and "wayback".
-          </div>
-          <br></br>
-          <div>
-            If the following Access Point definitions are present in the 
-            wayback.xml:
-            <pre>
-
-&lt;bean name=&quot;8080:wayback&quot; class=&quot;org.archive.wayback.webapp.AccessPoint&quot;&gt;
-  &lt;property name=&quot;collection&quot; ref=&quot;localcollection&quot; /&gt;
-  ...
-&lt;/bean&gt;
-
-&lt;bean name=&quot;8080:wayback2&quot; class=&quot;org.archive.wayback.webapp.AccessPoint&quot;&gt;
-  &lt;property name=&quot;collection&quot; ref=&quot;localcollection&quot; /&gt;
-  ...
-&lt;/bean&gt;
-
-            </pre>
-            then the following table shows the Archival URL prefixes to access
-            each collection on the host "wayback.somehost.org" assuming a
-            Tomcat Connector listening on port 8080:
-          </div>
-          <table>
-            <tr>
-              <th>
-                webapp deployed at
-              </th>
-              <th>
-                Access Point bean name
-              </th>
-              <th>
-                Archival URL prefix
-              </th>
-            </tr>
-            <tr>
-              <td>
-                ROOT
-              </td>
-              <td>
-                8080:wayback
-              </td>
-              <td>
-                http://wayback.somehost.org:8080/wayback/
-              </td>
-            </tr>
-            <tr>
-              <td>
-                ROOT
-              </td>
-              <td>
-                8080:wayback2
-              </td>
-              <td>
-                http://wayback.somehost.org:8080/wayback2/
-              </td>
-            </tr>
-            <tr>
-              <td>
-                wb-webapp
-              </td>
-              <td>
-                8080:wayback
-              </td>
-              <td>
-                http://wayback.somehost.org:8080/wb-webapp/wayback/
-              </td>
-            </tr>
-            <tr>
-              <td>
-                wb-webapp
-              </td>
-              <td>
-                8080:wayback2
-              </td>
-              <td>
-                http://wayback.somehost.org:8080/wb-webapp/wayback2/
-              </td>
-            </tr>
-          </table>
         </p>
         <p>
           The properties <b>parser</b> and <b>uriConverter</b>
@@ -468,7 +417,7 @@
 
     &lt;property name=&quot;uriConverter&quot;&gt;
       &lt;bean class=&quot;org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter&quot;&gt;
-        &lt;property name=&quot;replayURIPrefix&quot; value=&quot;http://wayback.somehost.org:8080/wb-webapp/wayback/&quot; /&gt;
+        &lt;property name=&quot;replayURIPrefix&quot; value=&quot;http://wayback.example.org:8080/collection/&quot; /&gt;
       &lt;/bean&gt;
     &lt;/property&gt;
 
@@ -519,7 +468,7 @@
             </td>
             <td>
               Points to the Archival URL prefix of the Access Point as
-              illustrated in the preceding table.
+              illustrated in the <a href="access_point_naming.html">Access Point Naming</a> document.
             </td>
           </tr>
         </table>
@@ -533,11 +482,12 @@
       <subsection name="Proxy Replay Mode">
         <p>
           Wayback can be configured to act as an HTTP proxy server. To utilize
-          this mode, the wayback webapp must be deployed as the ROOT context,
-          and client browser must be configured to proxy all HTTP requests
-          through the Wayback Machine application. Instead of retrieving
-          documents from the live web, the Wayback Machine will retrieve
-          documents from the configured WaybackCollection.
+          this mode, the wayback webapp <b>must</b> be deployed as the ROOT
+          context, no other AccessPoints can use the port dedicated to the
+          Proxy AccessPoint, and client browsers must be configured to proxy
+          all HTTP requests through the Wayback Machine application. Instead of
+          retrieving documents from the live web, the Wayback Machine will 
+          retrieve documents from the configured WaybackCollection.
         </p>
         <p>
           Proxy Replay mode does not suffer from the shortcomings of
@@ -575,7 +525,7 @@
           <pre>
 
 &lt;bean name=&quot;8090&quot; parent=&quot;8080:wayback&quot;&gt;
-  &lt;property name=&quot;urlRoot&quot; value=&quot;http://wayback.somehost.org/&quot; /&gt;
+  &lt;property name=&quot;queryPrefix&quot; value=&quot;http://wayback.somehost.org/&quot; /&gt;
   &lt;property name=&quot;replay&quot;&gt; ref=&quot;proxyreplay&quot; /&gt;
   &lt;property name=&quot;uriconverter&quot;&gt;
     &lt;bean class=&quot;org.archive.wayback.proxy.RedirectResultURIConverter&quot;&gt;
@@ -769,6 +719,15 @@
               place the banner, attempting to only place the banner in the
               largest frame within a frameset.
             </li>
+            <li>
+              <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Toolbar.jsp">/WEB-INF/replay/Toolbar.jsp</a>
+              Inserts a fancier banner in the top of replayed documents which 
+              includes a graphic representation of the number of captures over 
+              time and allows users to navigate directly between other captures
+              of the current page they are viewing. This version uses Javascript
+              to place the banner, attempting to only place the banner in the
+              largest frame within a frameset.
+            </li>
           </ul>
         </p>
       </subsection>
@@ -1092,7 +1051,7 @@
         </p>
       </subsection>
 
-      <subsection name="arc-indexer|warc-indexer">
+      <subsection name="cdx-indexer">
         <p>
           These tools create a CDX format index for the ARC/WARC file at
           PATH, either on STDOUT, or at the path specified by CDX_PATH. The
@@ -1100,8 +1059,7 @@
           files to generate CDX format ResourceIndex.
         </p>
           <pre>
-            bin/arc-indexer [-identity] PATH [CDX_PATH]
-            bin/warc-indexer [-identity] PATH [CDX_PATH]
+            bin/cdx-indexer [-identity] PATH [CDX_PATH]
           </pre>
         <p>
           Note that when manually constructing CDX files using these tools, you
@@ -1190,9 +1148,9 @@
           input URL.
         </p>
         <p>
-          This tool is required when using the <b>arc-indexer</b> or 
-          <b>warc-indexer</b> tools with the <b>-identity</b> option. Typical
-          usage involves generating an <i>identity</i> CDX index, then
+          This tool is required when using the <b>cdx-indexer</b> tool with the
+          <b>-identity</b> option. Typical usage involves generating an
+          <i>identity</i> CDX index, then
           passing the lines in that index through this tool to canonicalize the
           record URL key for queries. If the <i>identity</i> CDX files are
           kept, then canonicalization schemes can be swapped without

Added: trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml	                        (rev 0)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml	2010-10-22 22:35:14 UTC (rev 3298)
@@ -0,0 +1,209 @@
+<?xml version="1.0" encoding="utf-8"?>
+
+<document>
+  <properties>
+    <title>Wayback Hadoop CDX generation</title>
+    <author email="brad at archive dot org">Brad Tofel</author>
+    <revision>$$Id$$</revision>
+  </properties>
+
+  <body>
+    <section name="Overview">
+      <p>
+        Wayback is distributed with a .jar file that
+        simplifies creation of large-scale CDX files using hadoop. This code is
+        experimental, and will primarily be useful only if your CDX files are
+        very large - more than a few hundred GB (or more, depending on your
+        hardware). If building or updating your CDX files is the
+        largest problem with your installation, this may help. At IA, we've
+        used this framework to build and deploy CDX files of more than 700GB,
+        containing billions of records, using a 24 node cluster in about 8
+        hours from start to finish. Just writing a 700GB file to disk at
+        50MB/sec takes around 4 hours, so the final deployment step accounts
+        for about half of that total time.
+      </p>
+    </section>
+    <section name="Requirements">
+      <p>
+        <ul>
+          <li>Existing hadoop cluster running Hadoop 0.20.2.</li>
+          <li>Per-resource CDX files existing in a viable Hadoop-FS (HDFS, S3, 
+              etc).</li>
+          <li>Perl, to create a split file based on a sample CDX.</li>
+        </ul>
+      </p>
+    </section>
+    <section name="Implementing">
+      <p>
+        Using hadoop to generate your CDX files requires the following 
+        high-level process:
+        <ul>
+          <li>
+            Integrating per-WARC CDX creation into your ingestion process.
+          </li>
+          <li>
+            Building a split file, to inform hadoop on how to efficiently
+            partition your data while sorting.
+          </li>
+          <li>
+            Building a manifest listing the specific per-WARC CDX files to sort.
+          </li>
+          <li>
+            Running the hadoop job, which produces a series of alphabetically
+            contiguous, partitioned CDX in your HDFS.  
+          </li>
+          <li>
+            Deploying the partitioned CDX files to your node running Wayback.
+          </li>
+        </ul>
+      </p>   
+      <subsection name="Process integration">
+        <p>
+          It is assumed you will integrate the Wayback indexing code, 
+          <b>cdx-indexer</b>, into your standard file ingestion workflow. That 
+          is, whatever system is used to move data from your crawlers into your
+          permanent repository should be modified to also build a CDX file for
+          each W/ARC file, as it is ingested, and to store that CDX file in 
+          your HDFS. As an optimization, you can compress the per-WARC CDX files
+          before storing them in HDFS. If your per-W/ARC CDX files are named
+          with a trailing <b>.gz</b> suffix, the Wayback hadoop code will
+          infer that these input files are compressed.
+        </p>
+      </subsection>
+      <subsection name="Building the split file">
+        <p>
+          CDX files are large sorted text files. Hadoop can be used to perform
+          large distributed sort operations, but to achieve an efficient total
+          ordering across your resulting data, you need to give hadoop some 
+          explicit instructions, in the form of the split file, indicating
+          how to distribute the data in your hadoop job.
+        </p>
+        <p>
+          The split file is a text file, with each line indicating a partition
+          point URL within the total possible URL space. The number of lines 
+          determines the number of chunks that will be built within hadoop, and
+          it should be based on the number of Reduce tasks you can run
+          concurrently on your cluster.
+        </p>
+        <p>
+          If R is the number of reduce tasks you can run <i>at the same time</i>
+          on your hadoop cluster, you should use (R-5) as the second argument
+          to <b>cdx-sample</b>, which is distributed in the wayback .tar.gz 
+          distribution. Subtracting 5 leaves a few spare reduce workers in case
+          of node failure, and for speculative execution in case some of your
+          nodes are running slowly.
+        </p>
+        <p>
+          The more evenly the partition points divide your particular
+          collection's URLs, the more efficiently your hadoop distributed
+          processing will execute. It is assumed that if you are using this
+          hadoop code to generate your CDX, you will already have built a sizable
+          CDX file for your collection. The <b>cdx-sample</b> tool will sample
+          an existing sorted CDX file for your collection, and produce a list
+          of URL partitions that can be used as the split file for your hadoop
+          processing. You should use the most recent sizable CDX built using
+          other methods with the <b>cdx-sample</b> tool. If you don't have a
+          previously built sorted CDX file for your collection, create
+          a sample sorted CDX file from 20 or 30 random per-WARC CDX files, as
+          described elsewhere, and use that with the <b>cdx-sample</b> tool. 
+        </p>
+        <p>
+          You might use something similar to the following command to build
+          your split file, assuming a previously built, sorted CDX file for
+          your collection called <b>existing.cdx</b>, and a total reducer
+          capacity of <b>20</b>:
+          <div>
+          <pre>
+cdx-sample existing.cdx 15 > split.txt
+hadoop fs -put split.txt /user/brad/input-split.txt
+          </pre>
+          </div>
+        </p>
+      </subsection>
+      <subsection name="Building the manifest">
+        <p>
+          The second input file you will need is your list of per-WARC 
+          (or per-ARC) CDX files to process.
+        </p>
+        <p>
+          This file can be built using the <b>hadoop fs -ls</b> command, and
+          should contain one line for each CDX file you want to sort into your
+          final CDX file.
+        </p>
+        <p>
+          This is an example line suitable for a manifest file:
+          <div>
+          <pre>
+hdfs:///cdx/COLL-A/COLLECTION-A-20080726045700-00019-ia400028.us.archive.org.warc.os.cdx.gz
+          </pre>
+          </div>
+        </p>
+        <p>
+          You might use something similar to the following command to build
+          your manifest:
+          <div>
+          <pre>
+hadoop fs -ls /cdx/collectionA | perl -ane 'print "hdfs://$F[-1]\n";' | grep cdx.gz > manifest.txt
+hadoop fs -put manifest.txt /user/brad/input-manifest.txt
+          </pre>
+          </div>
+        </p>
+      </subsection>
+      <subsection name="Running the job">
+        <p>
+          This is actually the simplest part! You just need to run:
+          <div>
+          <pre>
+hadoop jar PATH_TO_WAYBACK_HADOOP_JAR cdxsort -m MAPS [--compress-output] SPLIT INPUT OUTPUT_DIR
+          </pre>
+          </div>
+          The --compress-output option will cause the resulting CDX files in HDFS to be compressed.
+        </p>
+        <p>
+          Here is an example usage:
+          <div>
+          <pre> 
+hadoop jar /home/brad/wayback-hadoop-jar-with-dependencies.jar cdxsort -m 470 --compress-output /user/brad/input-split.txt /user/brad/input-manifest.txt /user/brad/cdx-output
+          </pre>
+          </div>
+          indicating 470 map tasks, and that the resulting files should be
+          compressed. The number of map tasks to use should be roughly 1/3rd the
+          number of lines in your INPUT file.
+        </p>
+      </subsection>
+      <subsection name="Deploying the production Wayback CDX:">
+        <p>
+          The previous hadoop command will create alphabetically contiguous, 
+          sorted CDX files in your HDFS output directory (OUTPUT_DIR). To merge
+          them into a single CDX file which can be efficiently searched using 
+          Wayback, you need to dump them into a single, concatenated file.
+          For now, you have to use some shell code:
+          <div>
+          <pre>
+for i in `hadoop fs -ls OUTPUT_DIR | perl -ane 'print "$F[-1]\n";' | sort`; do
+   hadoop fs -cat $i
+done &gt; LOCAL_FILE
+          </pre>
+          </div>
+          where OUTPUT_DIR is the same as the one specified in your hadoop job,
+          and where LOCAL_FILE is where you want your target file to exist, on
+          the local computer.
+        </p>
+        <p>
+          If you specified the --compress-output option with your 
+          "hadoop jar ..." command, you will need to add 'zcat' as follows:
+          <div>
+          <pre>
+for i in `hadoop fs -ls OUTPUT_DIR | perl -ane 'print "$F[-1]\n";' | sort`; do
+   hadoop fs -cat $i | zcat
+done &gt; LOCAL_FILE
+          </pre>
+          </div>
+        </p>
+        <p>
+          At this point, LOCAL_FILE is ready for use as a Wayback CDX.
+        </p>
+      </subsection>
+    </section>
+  </body>
+</document>
\ No newline at end of file
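
A minimal sketch of the sizing arithmetic described in hadoop.xml above: the
split count passed to cdx-sample is R-5, and the -m value for cdxsort is
roughly one third of the manifest's line count. The function names
split_count and map_count are hypothetical helpers, not tools shipped with
Wayback, and the floor of 1 for very small clusters is an added assumption.

```shell
# Hypothetical helpers (not shipped with Wayback) for the two sizing rules
# in hadoop.xml: R-5 split points, and maps ~= manifest lines / 3.

# split_count R: partition count to pass to cdx-sample, never below 1.
split_count() {
  n=$(( $1 - 5 ))
  if [ "$n" -lt 1 ]; then n=1; fi
  echo "$n"
}

# map_count FILE: about one third of the manifest's line count, rounded up.
map_count() {
  lines=$(wc -l < "$1")
  echo $(( (lines + 2) / 3 ))
}
```

With a reducer capacity of 20, split_count 20 prints 15, matching the
cdx-sample example in the document. The result of map_count manifest.txt
could be supplied as the -m argument to the cdxsort job.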

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml	2010-10-22 22:34:24 UTC (rev 3297)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml	2010-10-22 22:35:14 UTC (rev 3298)
@@ -74,6 +74,16 @@
         </p>
     </section>
     <section name="News">
+        <subsection name="New Release - 1.6.0, 10/21/2010">
+          <p>
+            The long-awaited 1.6.0 release is now available, with improved
+            server-side rewriting of HTML, CSS, Javascript, and SWF content.
+            This version includes other new features and bug fixes, which are
+            detailed on the <a href="release_notes.html">release notes</a> page.
+            Upgrading to this version will require changes to Wayback Spring XML
+            configuration.
+          </p>
+        </subsection>
         <subsection name="Maintenance Release - 1.4.2, 7/17/2009">
           <p>
             Release 1.4.2 fixes several problems discovered in the 1.4.1 

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/navigation.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/navigation.xml	2010-10-22 22:34:24 UTC (rev 3297)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/navigation.xml	2010-10-22 22:35:14 UTC (rev 3298)
@@ -13,7 +13,7 @@
       <item name="License" href="/license.html"/>
       <item name="Requirements" href="requirements.html"/>
       <item name="Downloads" href="downloads.html"/>
-      <item name="User Manual" href="user_manual.html"/>
+      <item name="Administrator Manual" href="administrator_manual.html"/>
       <item name="Release Notes" href="release_notes.html"/>
       <item name="Test" href="test.html"/>
       <item name="FAQ" href="/faq.html"/>

