From: <bra...@us...> - 2008-08-14 03:25:33
|
Revision: 2551 http://archive-access.svn.sourceforge.net/archive-access/?rev=2551&view=rev Author: bradtofel Date: 2008-08-14 03:25:42 +0000 (Thu, 14 Aug 2008) Log Message: ----------- INITIAL REV: split out and updated documentation for ResourceIndex configuration options, including Canonicalization Added Paths: ----------- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml Added: trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml =================================================================== --- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml (rev 0) +++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml 2008-08-14 03:25:42 UTC (rev 2551) @@ -0,0 +1,377 @@ +<?xml version="1.0" encoding="ISO-8859-1"?> +<document> + <properties> + <title>Resource Index Configuration</title> + <author email="brad at archive dot org">Brad Tofel</author> + <revision>$$Id$$</revision> + </properties> + + <body> + <section name="ResourceIndex configuration options"> + <subsection name="Overview"> + <p> + A ResourceIndex locates documents within a WaybackCollection through + a single method: + <pre> + public SearchResults query(final WaybackRequest request) + throws ResourceIndexNotAvailableException, + ResourceNotInArchiveException, BadQueryException, + AccessControlException; + </pre> + The ResourceIndex is responsible for deciding which SearchResults + subclass, <b>CaptureSearchResults</b> or <b>UrlSearchResults</b>, is + appropriate for the WaybackRequest argument, and for populating the + returned SearchResults object with matching records. + </p> + <p> + When the request indicates the user wishes to find specific captures + of a single URL, CaptureSearchResults should be returned. When the + request may return results for multiple URLs, for example a query + attempting to locate all URLs beginning with a given prefix within + the WaybackCollection, a URLSearchResults object should be + returned. + </p> + </subsection> + <subsection name="LocalResourceIndex configuration options"> + <p> + This ResourceIndex implementation assumes a local database of all + documents within the WaybackCollection. The type of database is + specified with the <b>source</b> property. + </p> + <p> + The following configuration is required for a LocalResourceIndex: + <ul> + <li>source - a bean implementing SearchResultSource, which can be + one of the following: + <ul> + <li>BDBIndex - a BDBJE database holding records for all + documents within the WaybackCollection. This + implementation allows for fast incremental updates to the + index, and is required when using automatic indexing. + This implementation scales well to 10's of millions of + records.</li> + <li>CDXIndex - a sorted flat file containing one line per + document within the WaybackCollection. This + implementation requires that the CDX file be manually + maintained, but scales to very large sizes, limited + primarily by the size of file you can build and store. + CDX files can be built using the command line tool + <b>arc-indexer</b> or <b>warc-indexer</b>, and the + standard UNIX <b>sort</b> tool. + </li> + <li>CompositeSearchResultSource - an implementation allowing + aggregation of multiple SearchResultSources into a single + logical SearchResultSource. Use of BDBIndex + SearchResultSources within this class is experimental, + but this implementation has been used successfully in + production installations to serve results from several + CDXIndex files. For optimal search efficiency, multiple + index files should be merged (sort -mu) prior to + production use, but this implementation allows a + trade-off in simplified index management for a decrease + in search performance. A useful strategy for managing + large scale collections is to use several CDX files of + increasing size. Updates to the set of CDX files are + always performed against the smallest CDX file, and + occasionally this small file is merged with one of the + larger files, minimizing the amount of data that needs to + be read, sorted, and written back to disk to update the + set of CDX files.</li> + </ul> + </li> + </ul> + </p> + <p> + The following configurations are optional for LocalResourceIndexes: + <ul> + <li>maxRecords - integer maximum number of records to process for a + single request. Useful to prevent a single request from using + too much Disk and CPU resources.</li> + <li>dedupeRecords - boolean value that should be set to <i>true</i> + when using deduplicated WARC records. This causes Wayback to + modify search results as they are read from the index, so + records indicating a resource was inspected but not saved are + accessible within the Wayback. Please see the + <a href="#Duplicate_Reduction">Duplicate Reduction</a> section + below for more information.</li> + <li>annotater - experimental hook for modifying or omitting records + as they are read from the index. For example, additional + metadata could be associated with each record from an external + datasource, and this extra metadata could then be exposed to + end users through a .jsp customization.</li> + <li>canonicalizer - an implementation of UrlCanonicalizer. See the + section labeled URL Canonicalization below for more + information.</li> + </ul> + </p> + <p> + For specific Spring configuration examples of these ResourceIndex + options, please refer to the following files distributed within the + wayback .war file: + <ul> + <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/BDBCollection.xml">BDBCollection.xml</a></li> + <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/CDXCollection.xml">CDXCollection.xml</a></li> + </ul> + </p> + </subsection> + <subsection name="RemoteResourceIndex configuration options"> + <p> + This ResourceIndex implementation requests an external Wayback + installation to satisfy index requests, and can be useful for + distributed installations, as well as for experimenting with new + Wayback configurations and installations using an existing + ResourceIndex. For example, a development system can be configured to + use a production index remotely, minimizing the requirements and + setup required to test new configurations. + </p> + <p> + The actual index must be stored on another Wayback installation, and + is requested as XML through this implementation. + </p> + <p> + The following configuration is required for a RemoteResourceIndex: + <ul> + <li>searchUrlBase - the URL prefix indicating the AccessPoint + actually holding the ResourceIndex. + </li> + </ul> + </p> + <p> + The following configurations are optional for LocalResourceIndexes: + <ul> + <li>canonicalizer - an implementation of UrlCanonicalizer. See the + section labeled URL Canonicalization below for more + information.</li> + </ul> + </p> + <p> + For a Spring configuration example of this ResourceIndex option, + please refer to the following files distributed within the wayback + .war file: + <ul> + <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/RemoteCollection.xml">RemoteCollection.xml</a></li> + </ul> + </p> + </subsection> + <subsection name="NutchResourceIndex configuration options"> + <p> + This implementation, similar to the RemoteResourceIndex, accesses + index functionality from an external NutchWAX installation. This mode + of operation is considered experimental, and is not used within + Internet Archive production installations for performance reasons. + IA enables search within archived collections using NutchWax, but a + separate CDX index is used for Wayback Query and Replay + functionality. NutchWax is customized in these installations to + generate links within it's internal UI that direct to the appropriate + pages within the corresponding Wayback installation. + </p> + <p> + The following configuration is required for a RemoteResourceIndex: + <ul> + <li>searchUrlBase - the URL prefix indicating the opensearch API to + be used for queries. + </li> + </ul> + </p> + <p> + The following configurations are optional for LocalResourceIndexes: + <ul> + <li>maxRecords - integer maximum number of records to request from + the external NutchWax opensearch API.</li> + </ul> + </p> + <p> + For a Spring configuration example of this ResourceIndex option, + please refer to the following files distributed within the wayback + .war file: + <ul> + <li><a href="https://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/wayback/wayback-webapp/src/main/webapp/WEB-INF/NutchCollection.xml">NutchCollection.xml</a></li> + </ul> + </p> + </subsection> + </section> + <section name="URL Canonicalization"> + <subsection name="Introduction and Concepts"> + <p> + Sometimes URLs found in the field can have multiple forms, for + example: + <pre> + http://www.example.com/img/foo.gif + http://www.example.com/docs/../img/foo.gif + </pre> + are both valid representations of the exact same URL. Another, less + certain example would be: + <pre> + http://www.example.com/Interview.html + http://www.example.com/interview.html + </pre> + which differ only in the capitalization of the letter "i". On some + operating systems, these two URLs legitimately specify two distinct + documents. On Windows platforms, they refer to the same document. If + the document on a web server is actually named "Interview.html", but + a web designer creates a web page that refers to this document using + the lowercase "interview.html", then the link will work, and they and + the web site visitors may never notice the difference. The same + situation on a different operating system would probably not work + (although some web server plugins and modules will also correct this + problem transparently) and the web designer would probably notice and + correct the problem. In practice, we have found that it is very rare + for the two URLs above with different capitalization to refer to + different documents, and they can be treated as equivalent in most + situations. + </p> + <p> + Another example, which occurs far more often in the real world, + involves web servers injecting a session ID inside paths to documents + hosted on that web server. These session IDs allow the web server to + track individual user's states. Here are some example URLs + demonstrating path session ID injection: + <pre> + http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page1.aspx + http://www.example.com/(S(4hqa0555fwsecu455xqckv45))/page2.aspx + http://www.example.com/(S(a63098d96360a63098d96360))/page3.aspx + </pre> + In these examples, the first two URLs are using one session ID, and + the third uses a different session ID. If <b>page3.aspx</b> refers to + <b>page1.aspx</b> using an anchor like this: + <pre> + <a href="page1.aspx">page1</a> + </pre> + and a user visiting <b>page3.aspx</b> clicks the link to page1, then + the wayback will recieve a request for the URL: + <pre> + http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx + </pre> + If page1.aspx was captured using the different session ID, then the + wayback will be unable to locate this document in the index, even + though it was captured. + </p> + <p> + This session ID problem can be mitigated by <i>canonicalizing</i> the + URLs as they are placed in the index, so the index would contain the + following URLs, instead of the original form, which the crawler + captured: + <pre> + http://www.example.com/page1.aspx + http://www.example.com/page2.aspx + http://www.example.com/page3.aspx + </pre> + If the same canonicalization scheme is used to transform incoming + requests, before attempting to lookup URLs in the index, then the + software is able to locate and return the documents correctly. + </p> + </subsection> + <subsection name="Current Status within Wayback"> + <p> + Currently the Wayback includes only a single reference implementation + of a canonicalization scheme, which is currently called + <b>AggressiveUrlCanonicalizer</b>. This implementation provides the + following canonicalization: + <ul> + <li> + <b>www# removal</b> + http://www.example.com => example.com, + http://www13.example.com => example.com + </li> + <li> + <b>user info removal</b> + http://us...@ex... => example.com, + http://user:pas...@ex... => example.com, + </li> + <li> + <b>session ID removal</b> + http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx + => + example.com/page1.aspx + <br></br> + <i>(and other common session ID path injection schemes)</i> + </li> + <li> + <b>path and CGI argument lowercasing</b> + http://www.example.com/Interviews.cgi?Interview=Left + => + example.com/interviews.cgi?interview=left + </li> + <li> + <b>extra query argument delimiter removal</b> + http://www.example.com/Interviews.cgi?Interview=Left& + => + example.com/interviews.cgi?interview=left + </li> + <li> + <b>unneeded query specifier removal</b> + http://www.example.com/Interviews.cgi? + => + example.com/interviews.cgi + </li> + </ul> + These heuristics generally lead to correcting many common URL lookup + problems, but in some cases, these operation do the wrong thing, + typically by making content which is actually different appear to be + the same thing. + </p> + <p> + At the IA, we have recently switched to building CDX files using the + <b>-identity</b> option on the <b>arc-indexer</b> and + <b>warc-indexer</b> tools, and have added an additional step in our + CDX creation processes which uses the <b>url-client</b> tool before + sorting and merging CDX files. By keeping the original "identity" CDX + files, we have been able to test various URL canonicalization + strategies without the overhead of re-processing all the source + materials. + </p> + </subsection> + <subsection name="Future Directions within Wayback"> + <p> + In upcoming wayback releases, we intend to provide more + canonicalization implementations, including a configurable + implementation that will allow broad customization capabilities. + </p> + <p> + We also intend to alter the format of wayback indexes significantly. + Using this new format will be optional, but once indexes are created + in the new format is created, other indexes with different + canonicalization strategies can be built from them without requiring + a complete reindex of the original ARC/WARC content. + </p> + <p> + The new format will also allow a degree of dynamic canonicalization + at run-time, meaning different strategies can be tested using the + same indexes, and site-specific canonicalization strategies may be + possible. + </p> + <p> + We anticipate that allowing (advanced) users to easily change between + canonicalization strategies within the same wayback session will + promote better community understanding of the impacts of different + strategies, and will enable the community to build a set of best + practices for URL canonicalization. + </p> + </subsection> + </section> + <section name="Duplicate Reduction"> + <p> + Heritrix 1.12 and above have the capability to write WARC files, which + omit storing documents that have not changed since a previous visit. For + specifics on activating these features, please refer to the Heritrix + documentation. When Heritrix is using these features, and notices that + a document has not changed since the last time it was visited, it + creates an abbreviated WARC record, indicating that the document was + retrieved but not stored. In this abbreviated WARC record is an + indicator of the SHA1 digest of the document. + </p> + <p> + The wayback uses these identical SHA1 digests to map the location + (ARC/WARC + offset) of the original record that was stored to subsequent + records that were not. When a request for a subsequent capture that was + not stored is received by wayback, it will return the content of the + previous stored record. + </p> + <p> + The matching of these digests occurs at query time, and is configured + by setting the "dedupeRecords" option of the LocalResourceIndex to + "true". + </p> + </section> + </body> +</document> \ No newline at end of file This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site. |