From: <bra...@us...> - 2010-10-22 22:35:21
Revision: 3298 http://archive-access.svn.sourceforge.net/archive-access/?rev=3298&view=rev Author: bradtofel Date: 2010-10-22 22:35:14 +0000 (Fri, 22 Oct 2010) Log Message: ----------- PRE 1.6.0 doc update Modified Paths: -------------- trunk/archive-access/projects/wayback/dist/src/site/site.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/navigation.xml Added Paths: ----------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/access_point_naming.xml trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml Modified: trunk/archive-access/projects/wayback/dist/src/site/site.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/site.xml 2010-10-22 22:34:24 UTC (rev 3297) +++ trunk/archive-access/projects/wayback/dist/src/site/site.xml 2010-10-22 22:35:14 UTC (rev 3298) @@ -28,9 +28,8 @@ <menu name="Overview"> <item name="Requirements" href="requirements.html"/> <item name="Downloads" href="downloads.html"/> - <item name="User Manual" href="user_manual.html"/> <item name="Administrator Manual" href="administrator_manual.html"/> - <item name="Developer Manual" href="developer_manual.html"/> + <item name="Hadoop CDX Generation" href="hadoop.html"/> <item name="Release Notes" href="release_notes.html"/> <item name="FAQ" href="/faq.html"/> <item name="API" href="./apidocs"/> Added: trunk/archive-access/projects/wayback/dist/src/site/xdoc/access_point_naming.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/access_point_naming.xml (rev 0) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/access_point_naming.xml 2010-10-22 22:35:14 UTC (rev 3298) @@ -0,0 +1,287 @@ +<?xml version="1.0" encoding="utf-8"?> +<document> + <properties> + 
<title>Access Point Naming</title> + <author email="brad at archive dot org">Brad Tofel</author> + <revision>$$Id$$</revision> + </properties> + + <body> + + + + <section name="Overview"> + <p> + Tomcat (or another servlet container) is configured to listen on one or + more ports, so each request received on one of those ports is targeted + to a particular webapp based on the name of the .war file deployed under + the <b>webapps/</b> directory. The targeted webapp is determined based on + the first directory in incoming requests. + </p> + <p> + If there are two webapps deployed under the <b>webapps/</b> directory, + called <b>webappA.war</b> and <b>webappB.war</b>, then an incoming + request <b>/webappA/file1</b> will be received by the webapp inside + <b>webappA.war</b> as the request <b>/file1</b>. An incoming request + for <b>/webappB/images/foo.gif</b> will be received by the webapp inside + <b>webappB.war</b> as <b>/images/foo.gif</b>. + </p> + <p> + Tomcat (and other servlet containers) allow a special .war file to be + deployed under the <b>webapps/</b> directory called <b>ROOT.war</b> + which will receive requests not matching another webapp. If the above + example also included a webapp deployed under the <b>webapps/</b> + directory named <b>ROOT.war</b>, then requests starting with <b>webappA/</b> + will be received by <b>webappA.war</b>, requests starting with <b>webappB/</b> + will be received by <b>webappB.war</b>, and all other requests will be + received by the <b>ROOT.war</b> webapp. + </p> + <p> + If possible, deploying your webapp as <b>ROOT.war</b> will result in + somewhat cleaner public URLs, but this is not a requirement. The + examples below all include alternate URL configuration prefixes depending + on whether you deploy the Wayback .war file as either <b>ROOT.war</b> or + <b>wayback.war</b>.
+ </p> + <subsection name="AccessPoint Names"> + <p> + Each AccessPoint Spring XML bean definition must include a <b>name</b> + property: + <br></br> + <code> + +<bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint"> + ... +</bean> + + </code> + <br></br> + The <b>name</b> property indicates how requests <b>that are received + by the Wayback webapp</b> are routed to the appropriate AccessPoint. + Wayback allows targeting AccessPoints based on: + <ul> + <li>hostname</li> + <li>port</li> + <li>first path <b>after</b> the optional webapp deployment name + (which is empty if you deploy your Wayback webapp as + <b>ROOT.war</b>)</li> + </ul> + using the AccessPoint bean <b>name</b> field composed of <b>hostname</b>:<b>port</b>:<b>first_path</b>. + </p> + <p> + If you have configured DNS to resolve multiple hostnames to the same + computer, you can use the <b>hostname:</b> to control AccessPoint + resolving based on virtual hosts. + </p> + <p> + Port is the only required configuration component within the + AccessPoint <b>name</b> configuration. If you have multiple Tomcat + <b>Connector</b>s you can alter this AccessPoint name configuration to + target specific AccessPoints, otherwise, all your AccessPoint names + will have the same port, likely one of 8080, or 80. + </p> + <p> + A more commonly useful AccessPoint name resolving component is the + <b>first-path</b>, which allows you to easily expose multiple + collections within a single Wayback webapp deployment, without varying + hostnames, or ports (which often require network or system + administrator assistance). + </p> + </subsection> + <subsection name="Example AccessPoint names and URLs"> + <p> + The following table shows how urls will map to particular AccessPoints + assuming you have deployed the Wayback webapp as <b>ROOT.war</b>, on + a host with the name "access.example.org", using port 8080. 
+ <table> + <tr> + <th>Access Point bean name</th> + <th>Archival URL prefix</th> + <th>Archival URL query example for <b>http://archive.org</b></th> + </tr> + <tr> + <td>8080:collectionA</td> + <td>http://access.example.org:8080/collectionA/</td> + <td>http://access.example.org:8080/collectionA/*/http://archive.org/</td> + </tr> + <tr> + <td>8080:collectionB</td> + <td>http://access.example.org:8080/collectionB/</td> + <td>http://access.example.org:8080/collectionB/*/http://archive.org/</td> + </tr> + </table> + </p> + <p> + If you deployed your Wayback webapp with the name <b>wayback.war</b>, + the following table shows how URLs will map to particular + AccessPoints, on a host with the name "access.example.org", using port + 8080. + <table> + <tr> + <th>Access Point bean name</th> + <th>Archival URL prefix</th> + <th>Archival URL query example for <b>http://archive.org</b></th> + </tr> + <tr> + <td>8080:collectionA</td> + <td>http://access.example.org:8080/wayback/collectionA/</td> + <td>http://access.example.org:8080/wayback/collectionA/*/http://archive.org/</td> + </tr> + <tr> + <td>8080:collectionB</td> + <td>http://access.example.org:8080/wayback/collectionB/</td> + <td>http://access.example.org:8080/wayback/collectionB/*/http://archive.org/</td> + </tr> + </table> + </p> + <p> + If you have configured multiple <b>Connector</b>s for your Tomcat + server, listening on both port <b>80</b> and port <b>8080</b>, and + you deploy <b>ROOT.war</b>, you can target different AccessPoints by + port, as shown below. These examples assume your server's hostname is + still "access.example.org".
+ <table> + <tr> + <th>Access Point bean name</th> + <th>Archival URL prefix</th> + <th>Archival URL query example for <b>http://archive.org</b></th> + </tr> + <tr> + <td>80:collectionA</td> + <td>http://access.example.org/collectionA/</td> + <td>http://access.example.org/collectionA/*/http://archive.org/</td> + </tr> + <tr> + <td>8080:collectionB</td> + <td>http://access.example.org:8080/collectionB/</td> + <td>http://access.example.org:8080/collectionB/*/http://archive.org/</td> + </tr> + <tr> + <td>80:collectionC</td> + <td>http://access.example.org/collectionC/</td> + <td>http://access.example.org/collectionC/*/http://archive.org/</td> + </tr> + </table> + </p> + <p> + If you have a very limited number of AccessPoints to expose, you can + do away with the <b>first-path</b> component, to achieve potentially + very uncluttered Archival URLs. Assuming multiple <b>Connector</b>s + for your Tomcat server, listening on both port <b>80</b> and port + <b>8080</b>, and you deploy <b>ROOT.war</b>, you can target different + AccessPoints by port alone, as shown below. These examples still + assume your server's hostname is "access.example.org". + <table> + <tr> + <th>Access Point bean name</th> + <th>Archival URL prefix</th> + <th>Archival URL query example for <b>http://archive.org</b></th> + </tr> + <tr> + <td>80</td> + <td>http://access.example.org/</td> + <td>http://access.example.org/*/http://archive.org/</td> + </tr> + <tr> + <td>8080</td> + <td>http://access.example.org:8080/</td> + <td>http://access.example.org:8080/*/http://archive.org/</td> + </tr> + </table> + </p> + <p> + Getting somewhat fancy, you can use virtual hosts, doing away with + non-standard ports, and use hostnames alone to specify AccessPoints. + This means getting your Tomcat to listen on port <b>80</b>, and + deploying the webapp as <b>ROOT.war</b>.
You'd have to configure your + DNS so both "collection1.example.org" and "collection2.example.org" + point to the host running Wayback: + <table> + <tr> + <th>Access Point bean name</th> + <th>Archival URL prefix</th> + <th>Archival URL query example for <b>http://archive.org</b></th> + </tr> + <tr> + <td>collection1.example.org:80</td> + <td>http://collection1.example.org/</td> + <td>http://collection1.example.org/*/http://archive.org/</td> + </tr> + <tr> + <td>collection2.example.org:80</td> + <td>http://collection2.example.org/</td> + <td>http://collection2.example.org/*/http://archive.org/</td> + </tr> + </table> + </p> + </subsection> + <subsection name="Getting really fancy"> + + <p> + Assuming you've deployed your webapp as <b>ROOT.war</b> and have Tomcat + listening on both port 80 and 8080, with the hostnames + "collection1.example.org" and "collection2.example.org" both + pointing to the host running Wayback: + <table> + <tr> + <th>Access Point bean name</th> + <th>Archival URL prefix</th> + <th>Archival URL query example for <b>http://archive.org</b></th> + </tr> + <tr> + <td>collection1.example.org:80</td> + <td>http://collection1.example.org/</td> + <td>http://collection1.example.org/*/http://archive.org/</td> + </tr> + <tr> + <td>collection1.example.org:8080:subset1</td> + <td>http://collection1.example.org:8080/subset1/</td> + <td>http://collection1.example.org:8080/subset1/*/http://archive.org/</td> + </tr> + <tr> + <td>collection1.example.org:8080:subset2</td> + <td>http://collection1.example.org:8080/subset2/</td> + <td>http://collection1.example.org:8080/subset2/*/http://archive.org/</td> + </tr> + <tr> + <td>collection2.example.org:8080</td> + <td>http://collection2.example.org:8080/</td> + <td>http://collection2.example.org:8080/*/http://archive.org/</td> + </tr> + <tr> + <td>collection2.example.org:80:internal</td> + <td>http://collection2.example.org/internal/</td> + <td>http://collection2.example.org/internal/*/http://archive.org/</td> +
</tr> + <tr> + <td>collection2.example.org:80:public</td> + <td>http://collection2.example.org/public/</td> + <td>http://collection2.example.org/public/*/http://archive.org/</td> + </tr> + </table> + </p> + </subsection> +<!-- + <subsection name="ArchivalURL Server-Relative URL rewriting"> + <p> + As hard as we've tried to make Server-side rewrite "do the right + thing" in ArchivalURL Replay mode, sometimes things don't work out + right. For example, if a page, <b>http://example.net/news/a.html</b> + contains some Javascript, that generates the following HTML with a + <b>document.write()</b> call: + <br></br> + <code> + +<img src="/foo.gif"></img> + </code> + <br></br> + And you were running an AccessPoint at <b>http://archive.org/web/</b>, + the then page would be expecting that URL to resolve as + <b>http://example.net/foo.gif</b>, but in fact, the page being + displayed as + </p> + <subsection> +--> + </section> + </body> +</document> \ No newline at end of file Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2010-10-22 22:34:24 UTC (rev 3297) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2010-10-22 22:35:14 UTC (rev 3298) @@ -12,7 +12,6 @@ <section name="Requirements"> - <subsection name="Third Party Packages"> <p> Please see the @@ -53,7 +52,7 @@ <p> Once you have downloaded the .tar.gz file from sourceforge, you will need to unpack the file to access the - webapp file, <b>wayback-webapp-1.4.0.war</b>. + webapp file, <b>wayback-webapp-1.6.0.war</b>. </p> <p> Installation and configuration of this software involves the @@ -66,7 +65,7 @@ Waiting for Tomcat to unpack the .war file. </li> <li> - Customizing base wayback.xml file. + Customizing base wayback.xml and possibly other XML configuration files. 
</li> <li> Restarting tomcat. @@ -84,18 +83,19 @@ documents. Query access allows users to locate particular documents within the collection by URL and date. Replay access allows users to view archived pages within their web browsers. Some Replay modes - require altering the original pages so embedded content is also loaded - from the wayback service, and not from the live web. + require altering the original pages and resources, so embedded and + referenced content is also loaded from the Wayback service, and not + from the live web. </p> <p> A WaybackCollection defines a set of archived documents and an index - which allows documents to be located within the collection. A + which allows documents to be quickly located within the collection. A WaybackCollection may be exposed to end users through one or more AccessPoints, which define: <ul> <li>the WaybackCollection itself</li> <li>the URL where users can access the collection</li> - <li>how users can query the collection (the Query UI)</li> + <li>how query results are presented to users (the Query UI)</li> <li>how documents are returned to users so they appear correctly in their web browsers (the Replay UI)</li> <li>the look and feel of the wayback user interface</li> @@ -104,12 +104,12 @@ </ul> </p> <p> - Wayback is configured using Spring IOC, to specify and configure - concrete implementations of several basic modules. For information - about using Spring, please see - <a href="http://www.springframework.org/docs/reference/beans.html"> - this page - </a>. + Wayback is configured using + <a href="http://static.springsource.org/spring/docs/2.5.x/reference/beans.html#beans-basics">Spring IOC</a>, + to specify and configure concrete implementations of several basic + modules. Please see the + <a href="http://static.springsource.org/spring/docs/2.5.x/reference/beans.html#beans-basics">Spring website</a> for more information on + configuring beans using Spring XML. 
</p> <subsection name="AccessPoint configuration options"> <p> @@ -121,8 +121,8 @@ AccessPoint. </li> <li><a href="Query_UI"><b>query</b></a> responsible for generating - user visible content in response to user Queries, HTML, XML, - etc.</li> + user visible content(HTML, XML, etc) in response to user + Queries.</li> <li><a href="Replay_Modes"><b>replay</b></a> responsible for determining the appropriate ReplayRenderer implementation based on the users request and the particular document to be @@ -135,7 +135,9 @@ </ul> </p> <p> - An AccessPoint's configuration may optionally specify the following: + An AccessPoint's configuration may optionally specify the following, + but must specify at least one of replayPrefix, queryPrefix, or + staticPrefix: <ul> <li><a href="Exception_Rendering"><b>exception</b></a> - an implementation responsible for generating error pages to users @@ -158,13 +160,38 @@ </a> - an implementation specifying who is allowed to connect to this AccessPoint </li> - <li><b>urlRoot</b> - a String URL prefix under which all UI - elements should be referenced. + <li> + <b>replayPrefix</b> - a String URL prefix indicating the host, + port, and path to the correct Replay AccessPoint. If unspecified, + defaults to queryPrefix, then staticPrefix. </li> + <li> + <b>queryPrefix</b> - a String URL prefix indicating the host, + port, and path to the correct Query AccessPoint. If unspecified, + defaults to staticPrefix, then replayPrefix. + </li> + <li> + <b>staticPrefix</b> - a String URL prefix indicating the host, + port, and path to static content used within the UI. If + unspecified, defaults to queryPrefix, then replayPrefix. + </li> + <li> + <b>livewebPrefix</b> - a String URL prefix indicating the host, + port, and path to the correct Replay AccessPoint. + </li> <li><b>locale</b> - A specific Locale to use for all requests within this AccessPoint, overriding the users preferred Locale as specified by their web browser. 
</li> + <li> + <b>exactHostMatch</b> - true or false. If true, only returns + results exactly matching a given request hostname (case insensitive). + Default is false. + </li> + <li> + <b>exactSchemeMatch</b> - true or false. If true, only returns + results exactly matching a given request scheme. Default is true. + </li> </ul> </p> <p> @@ -222,7 +249,9 @@ <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/BDBCollection.xml">BDBCollection.xml</a></li> <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/CDXCollection.xml">CDXCollection.xml</a></li> <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/RemoteCollection.xml">RemoteCollection.xml</a></li> +<!-- <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/NutchCollection.xml">NutchCollection.xml</a></li> +--> </ul> </p> </subsection> @@ -257,13 +286,14 @@ the Access Point. See below for example CONTEXT mappings. </li> <li> - <b>CONTEXT</b> is the context where the Wayback webapp has been - deployed, plus the name of the Access Point. See below for - example CONTEXT mappings. + <b>CONTEXT</b> is an optional context where the Wayback webapp + has been deployed, plus an optional name of the Access Point + within the webapp. See below for example CONTEXT mappings. </li> <li> <b>TIMESTAMP</b> is 0 to 14 digits of a date, possibly - followed by an asterisk ('*'). The format of a + followed by an asterisk ('*'), or one or more tags providing + further specifics for the request.
The format of a TIMESTAMP is: <div> <code> @@ -304,6 +334,25 @@ Dec 31, 2004 23:01:00 (pm UTC) - 20041231230100 </div> <br></br> + <p> + Following the date portion of a timestamp, the following flags + can be appended: + <ul> + <li> + <b>id_</b> Identity - perform no alterations of the original + resource, return it as it was archived. + </li> + <li> + <b>js_</b> Javascript - return document marked up as javascript. + </li> + <li> + <b>cs_</b> CSS - return document marked up as CSS. + </li> + <li> + <b>im_</b> Image - return document as an image. + </li> + </ul> + </p> </li> <li> <b>URL</b> represents the actual URL that should be @@ -312,17 +361,9 @@ </ul> <br></br> <div> - Here is an example Archival URL, on an assumed host - <b>wayback.somehost.org</b>, with a wayback webapp deployed as - <b>ROOT</b>, via the Access Point named <b>80:archive</b> for the - page <b>http://www.yahoo.com/</b> on Dec 31, 1999 at 12:00:00 UTC. - <br></br> - <div> - <code> - http://wayback.somehost.org/archive/19991231120000/http://www.yahoo.com/ - </code> - </div> - <br></br> + For some simple and more elaborate examples of how AccessPoint bean + names interact with Archival URLs, please refer to + <a href="access_point_naming.html">Access Point Naming</a>. </div> <br></br> <div> @@ -350,107 +391,15 @@ </div> <br></br> <div> - There is a trade-off between these two approaches. The entirely - server-side rewriting requires more server resources, and is less - tested than the JavaScript method. The JavaScript is also imperfect: - sometimes requests "leak" to the live web temporarily, before the - Javascript has executed. With both methods, not all URLs are - rewritten correctly, especially URLs that are created by JavaScript - that was in the original page, and specialized file types containing - links like Flash and PDF documents. 
+ Currently, we are recommending the entirely server-side rewriting + method, and are deprecating the original server-side plus Javascript + method, but this functionality is still available in Wayback. + Neither method is perfect, not all URLs are rewritten correctly, + particularly URLs that are created by JavaScript in the original + pages, and specialized file types containing links like Flash + and PDF documents. </div> <br></br> - <div> - The <b>name</b> of the Access Point bean in the Spring configuration - file determines the CONTEXT and PORT used in Archival URLs within - that Access Point. The Servlet context name where the Wayback - application is deployed also factors into the CONTEXT used within - Archival URLs for each Access Point. - </div> - <br></br> - <div> - The following examples show the Archival URL prefix for the - following two Access Points depending on the Wayback webapp being - deployed in two different contexts, "ROOT" and "wayback". - </div> - <br></br> - <div> - If the following Access Point definitions are present in the - wayback.xml: - <pre> - -<bean name="8080:wayback" class="org.archive.wayback.webapp.AccessPoint"> - <property name="collection" ref="localcollection" /> - ... -</bean> - -<bean name="8080:wayback2" class="org.archive.wayback.webapp.AccessPoint"> - <property name="collection" ref="localcollection" /> - ... 
-</bean> - - </pre> - then the following table shows the Archival URL prefixes to access - each collection on the host "wayback.somehost.org" assuming a - Tomcat Connector listening on port 8080: - </div> - <table> - <tr> - <th> - webapp deployed at - </th> - <th> - Access Point bean name - </th> - <th> - Archival URL prefix - </th> - </tr> - <tr> - <td> - ROOT - </td> - <td> - 8080:wayback - </td> - <td> - http://wayback.somehost.org:8080/wayback/ - </td> - </tr> - <tr> - <td> - ROOT - </td> - <td> - 8080:wayback2 - </td> - <td> - http://wayback.somehost.org:8080/wayback2/ - </td> - </tr> - <tr> - <td> - wb-webapp - </td> - <td> - 8080:wayback - </td> - <td> - http://wayback.somehost.org:8080/wb-webapp/wayback/ - </td> - </tr> - <tr> - <td> - wb-webapp - </td> - <td> - 8080:wayback2 - </td> - <td> - http://wayback.somehost.org:8080/wb-webapp/wayback2/ - </td> - </tr> - </table> </p> <p> The properties <b>parser</b> and <b>uriConverter</b> @@ -468,7 +417,7 @@ <property name="uriConverter"> <bean class="org.archive.wayback.archivalurl.ArchivalUrlResultURIConverter"> - <property name="replayURIPrefix" value="http://wayback.somehost.org:8080/wb-webapp/wayback/" /> + <property name="replayURIPrefix" value="http://wayback.example.org:8080/collection/" /> </bean> </property> @@ -519,7 +468,7 @@ </td> <td> Points to the Archival URL prefix of the Access Point as - illustrated in the preceding table. + illustrated in <a href="access_point_naming.html">Access Point Naming</a> document. </td> </tr> </table> @@ -533,11 +482,12 @@ <subsection name="Proxy Replay Mode"> <p> Wayback can be configured to act as an HTTP proxy server. To utilize - this mode, the wayback webapp must be deployed as the ROOT context, - and client browser must be configured to proxy all HTTP requests - through the Wayback Machine application. Instead of retrieving - documents from the live web, the Wayback Machine will retrieve - documents from the configured WaybackCollection. 
+ this mode, the wayback webapp <b>must</b> be deployed as the ROOT + context, no other AccessPoints can use the port dedicated to the + Proxy AccessPoint, and client browsers must be configured to proxy + all HTTP requests through the Wayback Machine application. Instead of + retrieving documents from the live web, the Wayback Machine will + retrieve documents from the configured WaybackCollection. </p> <p> Proxy Replay mode does not suffer from the shortcomings of @@ -575,7 +525,7 @@ <pre> <bean name="8090" parent="8080:wayback"> - <property name="urlRoot" value="http://wayback.somehost.org/" /> + <property name="queryPrefix" value="http://wayback.somehost.org/" /> <property name="replay" ref="proxyreplay" /> <property name="uriconverter"> <bean class="org.archive.wayback.proxy.RedirectResultURIConverter"> @@ -769,6 +719,15 @@ place the banner, attempting to only place the banner in the largest frame within a frameset. </li> + <li> + <a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/replay/Toolbar.jsp">/WEB-INF/replay/Toolbar.jsp</a> + Inserts a fancier banner at the top of replayed documents which + includes a graphic representation of the number of captures over + time and allows users to navigate directly between other captures + of the current page they are viewing. This version uses Javascript + to place the banner, attempting to only place the banner in the + largest frame within a frameset. + </li> </ul> </p> </subsection> @@ -1092,7 +1051,7 @@ </p> </subsection> - <subsection name="arc-indexer|warc-indexer"> + <subsection name="cdx-indexer"> <p> These tools create a CDX format index for the ARC/WARC file at PATH, either on STDOUT, or at the path specified by CDX_PATH. The @@ -1100,8 +1059,7 @@ files to generate CDX format ResourceIndex.
</p> <pre> - bin/arc-indexer [-identity] PATH [CDX_PATH] - bin/warc-indexer [-identity] PATH [CDX_PATH] + bin/cdx-indexer [-identity] PATH [CDX_PATH] </pre> <p> Note that when manually constructing CDX files using these tools, you @@ -1190,9 +1148,9 @@ input URL. </p> <p> - This tool is required when using the <b>arc-indexer</b> or - <b>warc-indexer</b> tools with the <b>-identity</b> option. Typical - usage involves generating an <i>identity</i> CDX index, then + This tool is required when using the <b>cdx-indexer</b> tool with the + <b>-identity</b> option. Typical usage involves generating an + <i>identity</i> CDX index, then passing the lines in that index through this tool to canonicalize the record URL key for queries. If the <i>identity</i> CDX files are kept, then canonicalization schemes can be swapped without Added: trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml (rev 0) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/hadoop.xml 2010-10-22 22:35:14 UTC (rev 3298) @@ -0,0 +1,209 @@ +<?xml version="1.0" encoding="utf-8"?> + +<document> + <properties> + <title>Wayback Hadoop CDX generation</title> + <author email="brad at archive dot org">Brad Tofel</author> + <revision>$$Id$$</revision> + </properties> + + <body> + <section name="Overview"> + <p> + Wayback is distributed with a .jar file that + simplifies creation of large-scale CDX files using hadoop. This code is + experimental, and will primarily be useful only if your CDX files are + very large - more than a few hundred GB (or more, depending on your + hardware). If building or updating your CDX files is the + largest problem with your installation, this may help.
At IA, we've + used this framework to build and deploy CDX files of more than 700GB, + containing billions of records, using a 24 node cluster in about 8 + hours from start to finish. Just writing a 700GB file to disk at + 50MB/sec takes around 4 hours, so the final deployment step takes + around half the time. + </p> + </section> + <section name="Requirements"> + <p> + <ul> + <li>Existing hadoop cluster running Hadoop 0.20.2.</li> + <li>Per-resource CDX files existing in a viable Hadoop-FS (HDFS, S3, + etc).</li> + <li>Perl, to create a split file based on a sample CDX.</li> + </ul> + </p> + </section> + <section name="Implementing"> + <p> + Using hadoop to generate your CDX files requires the following + high-level process: + <ul> + <li> + Integrating per-WARC CDX creation into your ingestion process. + </li> + <li> + Building a split file, to inform hadoop on how to efficiently + partition your data while sorting. + </li> + <li> + Building a manifest listing the specific per-WARC CDX files to sort. + </li> + <li> + Running the hadoop job, which produces a series of alphabetically + contiguous, partitioned CDX in your HDFS. + </li> + <li> + Deploying the partitioned CDX files to your node running Wayback. + </li> + </ul> + </p> + <subsection name="Process integration"> + <p> + It is assumed you will integrate the Wayback indexing code, + <b>cdx-indexer</b> into your standard file ingestion workflow. That + is, whatever system is used to move data from your crawlers into your + permanent repository should be modified to also build a CDX file for + each W/ARC file, as it is ingested, and to store that CDX file in + your HDFS. As an optimization, you can compress the per-WARC CDX files + before storing them in HDFS. If your per-W/ARC CDX files are named + with a trailing, <b>.gz</b> suffix, the Wayback hadoop code will + infer that these input files are compressed. 
+ </p> + </subsection> + <subsection name="Building the split file"> + <p> + CDX files are large sorted text files. Hadoop can be used to perform + large distributed sort operations, but to achieve an efficient total + ordering across your resulting data, you need to give hadoop some + explicit instructions, in the form of the split file, indicating + how to distribute the data in your hadoop job. + </p> + <p> + The split file is a text file, with each line indicating a partition + point URL within the total possible URL space. The number of lines + determines the number of chunks that will be built within hadoop, and + it should be based on the number of concurrent Reduce tasks you can + run concurrently on your cluster. + </p> + <p> + If R is the number of reduce tasks you can run <i>at the same time</i> + on your hadoop cluster, you should use (R-5) as the second argument + to <b>cdx-sample</b>, which is distributed in the wayback .tar.gz + distribution. 5 leaves a few spare reduce workers in case of node + failure, and for speculative execution in case some of your nodes + are running slowly. + </p> + <p> + The more accurately the partition points evenly divide your particular + collections URLs, the more optimally your hadoop distributed + processing will execute. It is assumed that if you are using this + hadoop to generate your CDX, you will already have built a sizable + CDX file for your collection. The <b>cdx-sample</b> tool will sample + an existing sorted CDX file for your collection, and produce a list + of URL partitions that can be used as the split file for your hadoop + processing. You should use the most recent sizable CDX built using + other methods with the <b>cdx-sample</b> tool. If you don't have a + previously built sorted CDX file for your collection, create + a sample sorted CDX file from 20 or 30 random per-WARC CDX files, as + described elsewhere, and use that with the <b>cdx-sample</b> tool. 
+ </p> + <p> + You might use something similar to the following command to build + your split file, assuming a previously built, sorted CDX file for + your collection called <b>existing.cdx</b>, and a total reducer + capacity of <b>20</b> (so R-5 is 15): + <div> + <pre> +cdx-sample existing.cdx 15 > split.txt +hadoop fs -put split.txt /user/brad/input-split.txt + </pre> + </div> + </p> + </subsection> + <subsection name="Building the manifest"> + <p> + The second input file you will need is your list of per-WARC + (or per-ARC) CDX files to process. + </p> + <p> + This file can be built using the <b>hadoop fs -ls</b> command, and + should contain one line for each CDX file you want to sort into your + final CDX file. + </p> + <p> + This is an example line suitable for a manifest file: + <div> + <pre> +hdfs:///cdx/COLL-A/COLLECTION-A-20080726045700-00019-ia400028.us.archive.org.warc.os.cdx.gz + </pre> + </div> + </p> + <p> + You might use something similar to the following command to build + your manifest: + <div> + <pre> +hadoop fs -ls /cdx/collectionA | perl -ane 'print "hdfs://$F[-1]\n";' | grep cdx.gz > manifest.txt +hadoop fs -put manifest.txt /user/brad/input-manifest.txt + </pre> + </div> + </p> + </subsection> + <subsection name="Running the job"> + <p> + This is actually the simplest part! You just need to run: + <div> + <pre> +hadoop jar PATH_TO_WAYBACK_HADOOP_JAR cdxsort -m MAPS [--compress-output] SPLIT INPUT OUTPUT_DIR + </pre> + </div> + The --compress-output option will cause the resulting CDX files in HDFS to be compressed. + </p> + <p> + Here is an example usage: + <div> + <pre> +hadoop jar /home/brad/wayback-hadoop-jar-with-dependencies.jar cdxsort -m 470 --compress-output /user/brad/input-split.txt /user/brad/input-manifest.txt /user/brad/cdx-output + </pre> + </div> + indicating 470 map tasks, and that the resulting files should be + compressed. The number of map tasks to use should be roughly 1/3rd the + number of lines in your INPUT file.
+        </p>
+      </subsection>
+      <subsection name="Deploying the production Wayback CDX:">
+        <p>
+          The previous hadoop command will create alphabetically contiguous,
+          sorted CDX files in your HDFS output directory (OUTPUT_DIR). To merge
+          them into a single CDX file which can be efficiently searched using
+          Wayback, you need to dump them into a single, concatenated file.
+          For now, you have to use some shell code:
+          <div>
+            <pre>
+for i in `hadoop fs -ls OUTPUT_DIR | perl -ane 'print "$F[-1]\n";' | sort`; do
+  hadoop fs -cat $i
+done > LOCAL_FILE
+            </pre>
+          </div>
+          where OUTPUT_DIR is the same as the one specified in your hadoop job,
+          and where LOCAL_FILE is where you want your target file to exist, on
+          the local computer.
+        </p>
+        <p>
+          If you specified the --compress-output option with your
+          "hadoop jar ..." command, you will need to add 'zcat' as follows:
+          <div>
+            <pre>
+for i in `hadoop fs -ls OUTPUT_DIR | perl -ane 'print "$F[-1]\n";' | sort`; do
+  hadoop fs -cat $i | zcat
+done > LOCAL_FILE
+            </pre>
+          </div>
+        </p>
+        <p>
+          At this point, LOCAL_FILE is ready for use as a Wayback CDX.
+        </p>
+      </subsection>
+    </section>
+  </body>
+</document>
\ No newline at end of file

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml 2010-10-22 22:34:24 UTC (rev 3297)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/index.xml 2010-10-22 22:35:14 UTC (rev 3298)
@@ -74,6 +74,16 @@
         </p>
     </section>
     <section name="News">
+      <subsection name="New Release - 1.6.0, 10/21/2010">
+        <p>
+          The long-awaited 1.6.0 release is now available, with improved
+          server-side rewriting of HTML, CSS, JavaScript, and SWF content.
+          This version includes other new features and bug fixes, which are
+          detailed on the <a href="release_notes.html">release notes</a> page.
+          Upgrading to this version will require changes to Wayback Spring XML
+          configuration.
+        </p>
+      </subsection>
       <subsection name="Maintenance Release - 1.4.2, 7/17/2009">
         <p>
           Release 1.4.2 fixes several problems discovered in the 1.4.1

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/navigation.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/navigation.xml 2010-10-22 22:34:24 UTC (rev 3297)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/navigation.xml 2010-10-22 22:35:14 UTC (rev 3298)
@@ -13,7 +13,7 @@
       <item name="License" href="/license.html"/>
       <item name="Requirements" href="requirements.html"/>
       <item name="Downloads" href="downloads.html"/>
-      <item name="User Manual" href="user_manual.html"/>
+      <item name="Administrator Manual" href="administrator_manual.html"/>
       <item name="Release Notes" href="release_notes.html"/>
       <item name="Test" href="test.html"/>
       <item name="FAQ" href="/faq.html"/>