[Archive-access-cvs] SF.net SVN: archive-access:[2551] trunk/archive-access/projects/wayback/ dist

SourceForge Headquarters 225 Broadway Suite 1600 San Diego, CA 92101 +1 (858) 422-6466

Revision: 2551
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2551&view=rev
Author:   bradtofel
Date:     2008-08-14 03:25:42 +0000 (Thu, 14 Aug 2008)

Log Message:
-----------
INITIAL REV: split out and updated documentation for ResourceIndex configuration options, including Canonicalization

Added Paths:
-----------
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml

Added: trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml
===================================================================

--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml	                        (rev 0)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml	2008-08-14 03:25:42 UTC (rev 2551)
@@ -0,0 +1,377 @@
+<?xml version="1.0" encoding="ISO-8859-1"?>
+<document>
+  <properties>
+    <title>Resource Index Configuration</title>
+    <author email="brad at archive dot org">Brad Tofel</author>
+    <revision>$$Id$$</revision>
+  </properties>
+  
+  <body>
+    <section name="ResourceIndex configuration options">
+      <subsection name="Overview">
+        <p>
+          A ResourceIndex locates documents within a WaybackCollection through
+          a single method:
+          <pre>
+  public SearchResults query(final WaybackRequest request)
+    throws ResourceIndexNotAvailableException,
+    ResourceNotInArchiveException, BadQueryException,
+    AccessControlException;
+          </pre>
+          The ResourceIndex is responsible for deciding which SearchResults
+          subclass, <b>CaptureSearchResults</b> or <b>UrlSearchResults</b>, is
+          appropriate for the WaybackRequest argument, and for populating the
+          returned SearchResults object with matching records.
+        </p>
+        <p>
+          When the request indicates the user wishes to find specific captures 
+          of a single URL, CaptureSearchResults should be returned. When the
+          request may return results for multiple URLs, for example a query
+          attempting to locate all URLs beginning with a given prefix within
+          the WaybackCollection, a URLSearchResults object should be
+          returned.
+        </p>
+      </subsection>
+      <subsection name="LocalResourceIndex configuration options">
+        <p>
+          This ResourceIndex implementation assumes a local database of all
+          documents within the WaybackCollection. The type of database is
+          specified with the <b>source</b> property.
+        </p>
+        <p>
+          The following configuration is required for a LocalResourceIndex:
+          <ul>
+            <li>source - a bean implementing SearchResultSource, which can be
+                one of the following:
+                <ul>
+                  <li>BDBIndex - a BDBJE database holding records for all
+                      documents within the WaybackCollection. This 
+                      implementation allows for fast incremental updates to the
+                      index, and is required when using automatic indexing.
+                      This implementation scales well to 10's of millions of
+                      records.</li>
+                  <li>CDXIndex - a sorted flat file containing one line per
+                      document within the WaybackCollection. This 
+                      implementation requires that the CDX file be manually
+                      maintained, but scales to very large sizes, limited
+                      primarily by the size of file you can build and store.
+                      CDX files can be built using the command line tool
+                      <b>arc-indexer</b> or <b>warc-indexer</b>, and the
+                      standard UNIX <b>sort</b> tool.
+                  </li>
+                  <li>CompositeSearchResultSource - an implementation allowing
+                      aggregation of multiple SearchResultSources into a single
+                      logical SearchResultSource. Use of BDBIndex 
+                      SearchResultSources within this class is experimental,
+                      but this implementation has been used successfully in
+                      production installations to serve results from several
+                      CDXIndex files. For optimal search efficiency, multiple
+                      index files should be merged (sort -mu) prior to
+                      production use, but this implementation allows a
+                      trade-off in simplified index management for a decrease
+                      in search performance. A useful strategy for managing
+                      large scale collections is to use several CDX files of
+                      increasing size. Updates to the set of CDX files are
+                      always performed against the smallest CDX file, and
+                      occasionally this small file is merged with one of the
+                      larger files, minimizing the amount of data that needs to
+                      be read, sorted, and written back to disk to update the
+                      set of CDX files.</li>
+                </ul>
+            </li>
+          </ul>
+        </p>
+        <p>
+          The following configurations are optional for LocalResourceIndexes:
+          <ul>
+            <li>maxRecords - integer maximum number of records to process for a
+                single request. Useful to prevent a single request from using
+                too much Disk and CPU resources.</li>
+            <li>dedupeRecords - boolean value that should be set to <i>true</i>
+                when using deduplicated WARC records. This causes Wayback to
+                modify search results as they are read from the index, so
+                records indicating a resource was inspected but not saved are
+                accessible within the Wayback. Please see the
+                <a href="#Duplicate_Reduction">Duplicate Reduction</a> section
+                below for more information.</li>
+            <li>annotater - experimental hook for modifying or omitting records
+                as they are read from the index. For example, additional
+                metadata could be associated with each record from an external
+                datasource, and this extra metadata could then be exposed to
+                end users through a .jsp customization.</li>
+            <li>canonicalizer - an implementation of UrlCanonicalizer. See the
+                section labeled URL Canonicalization below for more
+                information.</li>
+          </ul>
+        </p>
+        <p>
+          For specific Spring configuration examples of these ResourceIndex
+          options, please refer to the following files distributed within the
+          wayback .war file:
+          <ul>
+            <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/BDBCollection.xml">BDBCollection.xml</a></li>
+            <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/CDXCollection.xml">CDXCollection.xml</a></li>
+          </ul>
+        </p>
+      </subsection>
+      <subsection name="RemoteResourceIndex configuration options">
+        <p>
+          This ResourceIndex implementation requests an external Wayback
+          installation to satisfy index requests, and can be useful for
+          distributed installations, as well as for experimenting with new
+          Wayback configurations and installations using an existing
+          ResourceIndex. For example, a development system can be configured to
+          use a production index remotely, minimizing the requirements and
+          setup required to test new configurations.
+        </p>
+        <p>
+          The actual index must be stored on another Wayback installation, and
+          is requested as XML through this implementation.
+        </p>
+        <p>
+          The following configuration is required for a RemoteResourceIndex:
+          <ul>
+            <li>searchUrlBase - the URL prefix indicating the AccessPoint 
+                actually holding the ResourceIndex.
+            </li>
+          </ul>
+        </p>
+        <p>
+          The following configurations are optional for LocalResourceIndexes:
+          <ul>
+            <li>canonicalizer - an implementation of UrlCanonicalizer. See the
+                section labeled URL Canonicalization below for more
+                information.</li>
+          </ul>
+        </p>
+        <p>
+          For a Spring configuration example of this ResourceIndex option,
+          please refer to the following files distributed within the wayback
+          .war file:
+          <ul>
+            <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/RemoteCollection.xml">RemoteCollection.xml</a></li>
+          </ul>
+        </p>
+      </subsection>
+      <subsection name="NutchResourceIndex configuration options">
+        <p>
+          This implementation, similar to the RemoteResourceIndex, accesses
+          index functionality from an external NutchWAX installation. This mode
+          of operation is considered experimental, and is not used within
+          Internet Archive production installations for performance reasons.
+          IA enables search within archived collections using NutchWax, but a
+          separate CDX index is used for Wayback Query and Replay
+          functionality. NutchWax is customized in these installations to
+          generate links within it's internal UI that direct to the appropriate
+          pages within the corresponding Wayback installation.
+        </p>
+        <p>
+          The following configuration is required for a RemoteResourceIndex:
+          <ul>
+            <li>searchUrlBase - the URL prefix indicating the opensearch API to
+                be used for queries.
+            </li>
+          </ul>
+        </p>
+        <p>
+          The following configurations are optional for LocalResourceIndexes:
+          <ul>
+            <li>maxRecords - integer maximum number of records to request from
+                the external NutchWax opensearch API.</li>
+          </ul>
+        </p>
+        <p>
+          For a Spring configuration example of this ResourceIndex option,
+          please refer to the following files distributed within the wayback
+          .war file:
+          <ul>
+            <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/NutchCollection.xml">NutchCollection.xml</a></li>
+          </ul>
+        </p>
+      </subsection>
+    </section>
+    <section name="URL Canonicalization">
+      <subsection name="Introduction and Concepts">
+        <p>
+          Sometimes URLs found in the field can have multiple forms, for 
+          example:
+          <pre>
+            http://www.example.com/img/foo.gif
+            http://www.example.com/docs/../img/foo.gif
+          </pre>
+          are both valid representations of the exact same URL. Another, less
+          certain example would be:
+          <pre>
+            http://www.example.com/Interview.html
+            http://www.example.com/interview.html
+          </pre>
+          which differ only in the capitalization of the letter "i". On some
+          operating systems, these two URLs legitimately specify two distinct 
+          documents. On Windows platforms, they refer to the same document. If
+          the document on a web server is actually named "Interview.html", but
+          a web designer creates a web page that refers to this document using
+          the lowercase "interview.html", then the link will work, and they and
+          the web site visitors may never notice the difference. The same 
+          situation on a different operating system would probably not work
+          (although some web server plugins and modules will also correct this
+          problem transparently) and the web designer would probably notice and
+          correct the problem. In practice, we have found that it is very rare
+          for the two URLs above with different capitalization to refer to
+          different documents, and they can be treated as equivalent in most
+          situations.
+        </p>
+        <p>
+          Another example, which occurs far more often in the real world, 
+          involves web servers injecting a session ID inside paths to documents
+          hosted on that web server. These session IDs allow the web server to
+          track individual user's states. Here are some example URLs 
+          demonstrating path session ID injection:
+          <pre>
+            http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page1.aspx
+            http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page2.aspx
+            http://www.example.com/(S(a63098d96360a63098d96360))/page3.aspx
+          </pre>
+          In these examples, the first two URLs are using one session ID, and 
+          the third uses a different session ID. If <b>page3.aspx</b> refers to 
+          <b>page1.aspx</b> using an anchor like this:
+          <pre>
+            &lt;a href="page1.aspx"&gt;page1&lt;/a&gt;
+          </pre>
+          and a user visiting <b>page3.aspx</b> clicks the link to page1, then
+          the wayback will recieve a request for the URL:
+          <pre>
+            http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx
+          </pre>
+          If page1.aspx was captured using the different session ID, then the
+          wayback will be unable to locate this document in the index, even
+          though it was captured.
+        </p>
+        <p>
+          This session ID problem can be mitigated by <i>canonicalizing</i> the
+          URLs as they are placed in the index, so the index would contain the
+          following URLs, instead of the original form, which the crawler 
+          captured:
+          <pre>
+            http://www.example.com/page1.aspx
+            http://www.example.com/page2.aspx
+            http://www.example.com/page3.aspx
+          </pre>
+          If the same canonicalization scheme is used to transform incoming
+          requests, before attempting to lookup URLs in the index, then the
+          software is able to locate and return the documents correctly.
+        </p>
+      </subsection>
+      <subsection name="Current Status within Wayback">
+        <p>
+          Currently the Wayback includes only a single reference implementation
+          of a canonicalization scheme, which is currently called
+          <b>AggressiveUrlCanonicalizer</b>. This implementation provides the
+          following canonicalization:
+          <ul>
+            <li>
+              <b>www# removal</b>
+	              http://www.example.com =&gt; example.com, 
+	              http://www13.example.com =&gt; example.com
+            </li>
+            <li>
+              <b>user info removal</b>
+	              http://us...@ex... =&gt; example.com, 
+	              http://user:pas...@ex... =&gt; example.com, 
+            </li>
+            <li>
+              <b>session ID removal</b>
+                http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx
+                  =&gt;
+                example.com/page1.aspx
+                <br></br>
+                <i>(and other common session ID path injection schemes)</i>
+            </li>
+            <li>
+               <b>path and CGI argument lowercasing</b>
+                http://www.example.com/Interviews.cgi?Interview=Left
+                  =&gt;
+                example.com/interviews.cgi?interview=left
+            </li>
+            <li>
+               <b>extra query argument delimiter removal</b>
+                http://www.example.com/Interviews.cgi?Interview=Left&amp;
+                  =&gt;
+                example.com/interviews.cgi?interview=left
+            </li>
+            <li>
+               <b>unneeded query specifier removal</b>
+                http://www.example.com/Interviews.cgi?
+                  =&gt;
+                example.com/interviews.cgi
+            </li>
+          </ul>
+          These heuristics generally lead to correcting many common URL lookup
+          problems, but in some cases, these operation do the wrong thing,
+          typically by making content which is actually different appear to be
+          the same thing.
+        </p>
+        <p>
+          At the IA, we have recently switched to building CDX files using the
+          <b>-identity</b> option on the <b>arc-indexer</b> and 
+          <b>warc-indexer</b> tools, and have added an additional step in our
+          CDX creation processes which uses the <b>url-client</b> tool before
+          sorting and merging CDX files. By keeping the original "identity" CDX
+          files, we have been able to test various URL canonicalization
+          strategies without the overhead of re-processing all the source
+          materials.
+        </p>
+      </subsection>
+      <subsection name="Future Directions within Wayback">
+        <p>
+          In upcoming wayback releases, we intend to provide more 
+          canonicalization implementations, including a configurable 
+          implementation that will allow broad customization capabilities.
+        </p>
+        <p>
+          We also intend to alter the format of wayback indexes significantly.
+          Using this new format will be optional, but once indexes are created
+          in the new format is created, other indexes with different
+          canonicalization strategies can be built from them without requiring
+          a complete reindex of the original ARC/WARC content.
+        </p>
+        <p>
+          The new format will also allow a degree of dynamic canonicalization
+          at run-time, meaning different strategies can be tested using the
+          same indexes, and site-specific canonicalization strategies may be
+          possible. 
+        </p>
+        <p>
+          We anticipate that allowing (advanced) users to easily change between
+          canonicalization strategies within the same wayback session will
+          promote better community understanding of the impacts of different
+          strategies, and will enable the community to build a set of best
+          practices for URL canonicalization.
+        </p>
+      </subsection>
+    </section>
+    <section name="Duplicate Reduction">
+      <p>
+        Heritrix 1.12 and above have the capability to write WARC files, which 
+        omit storing documents that have not changed since a previous visit. For
+        specifics on activating these features, please refer to the Heritrix 
+        documentation. When Heritrix is using these features, and notices that
+        a document has not changed since the last time it was visited, it 
+        creates an abbreviated WARC record, indicating that the document was
+        retrieved but not stored. In this abbreviated WARC record is an
+        indicator of the SHA1 digest of the document.
+      </p>
+      <p>
+        The wayback uses these identical SHA1 digests to map the location
+        (ARC/WARC + offset) of the original record that was stored to subsequent 
+        records that were not. When a request for a subsequent capture that was
+        not stored is received by wayback, it will return the content of the
+        previous stored record.
+      </p>
+      <p>
+        The matching of these digests occurs at query time, and is configured
+        by setting the "dedupeRecords" option of the LocalResourceIndex to 
+        "true".
+      </p>
+    </section>
+  </body>
+</document>
\ No newline at end of file


This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.




[Archive-access-cvs] SF.net SVN: archive-access:[2551] trunk/archive-access/projects/wayback/ dist

[Archive-access-cvs] SF.net SVN: archive-access:[2551] trunk/archive-access/projects/wayback/ dist/src/site/xdoc/resource_index.xml