From: <bra...@us...> - 2008-11-06 22:51:27
Revision: 2630
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2630&view=rev
Author:   bradtofel
Date:     2008-11-06 22:51:24 +0000 (Thu, 06 Nov 2008)

Log Message:
-----------
DOC: clarified dependency on using url-client with -identity option on arc/warc-indexer

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml	2008-10-29 00:01:33 UTC (rev 2629)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml	2008-11-06 22:51:24 UTC (rev 2630)
@@ -1110,8 +1110,11 @@
  </p>
  <p>
  The <b>-identity</b> option causes the tools to skip canonicalization
- of URLs. See the documentation for the <b>url-client</b> tool, and
- the <a href="resource_index.html#URL_Canonicalization">
+ of URLs. When using this option, you will need to pass the CDX
+ records through the url-client tool before sorting them into a
+ production CDX index. See the documentation for the
+ <b>url-client</b> tool, and the
+ <a href="resource_index.html#URL_Canonicalization">
  URL Canonicalization
  </a> section for more information.
  </p>
@@ -1182,15 +1185,19 @@
  canonicalization function is applied to requested URLs. This tool
  will read space(" ") delimited lines from STDIN, and output the
  same lines on STDOUT, but with one column
- altered. The column that is changed is assumed to be a URL,
+ altered. The column that is changed is assumed to be an URL,
  and the version output is the canonicalized form of the input
  URL.
  </p>
  <p>
- This tool is mostly useful for debugging the
- canonicalization function, but can also be used, if the
- canonicalization function is altered, to update an existing
- CDX index, without recreating CDX files from original ARCs. See the
+ This tool is required when using the <b>arc-indexer</b> or
+ <b>warc-indexer</b> tools with the <b>-identity</b> option. Typical
+ usage involves generating an <i>identity</i> CDX index, then
+ passing the lines in that index through this tool to canonicalize the
+ record URL key for queries. If the <i>identity</i> CDX files are
+ kept, then canonicalization schemes can be swapped without
+ reindexing the original ARC/WARC content. This tool can also be
+ useful for debugging the canonicalization function. See the
  section
  <a href="resource_index.html#URL_Canonicalization">
  URL Canonicalization

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml	2008-10-29 00:01:33 UTC (rev 2629)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/resource_index.xml	2008-11-06 22:51:24 UTC (rev 2630)
@@ -275,10 +275,14 @@
  </li>
  <li>
  <b>user info removal</b>
- http://us...@ex... => example.com,
- http://user:pas...@ex... => example.com,
+ http://us...@ex... => example.com,
+ http://user:pas...@ex... => example.com,
  </li>
  <li>
+ <b>default port removal</b>
+ http://example.com:80 => example.com,
+ </li>
+ <li>
  <b>session ID removal</b>
  http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx
  =>
@@ -313,12 +317,12 @@
  <p>
  At the IA, we have recently switched to building CDX files using
  the <b>-identity</b> option on the <b>arc-indexer</b> and
- <b>warc-indexer</b> tools, and have added an additional step in our
- CDX creation processes which uses the <b>url-client</b> tool before
- sorting and merging CDX files. By keeping the original "identity" CDX
- files, we have been able to test various URL canonicalization
- strategies without the overhead of re-processing all the source
- materials.
+ <b>warc-indexer</b> tools. The <b>-identity</b> option
+ <b>requires</b> passing records through the <b>url-client</b>
+ tool before sorting and merging into production CDX files. By keeping
+ the original "identity" CDX files, we have been able to test various
+ URL canonicalization strategies without the overhead of
+ re-processing all the ARC/WARC source materials.
  </p>
  </subsection>
  <subsection name="Future Directions within Wayback">
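
For readers unfamiliar with the workflow this commit documents (identity CDX lines, then canonicalization of one column, then sorting), the following is a rough Python sketch of the idea. It is not the url-client tool or Wayback's canonicalizer: the function names, the default column index, and the sample record are all hypothetical, and only two of the rules named in the diff (user info removal and default port removal) are imitated.

```python
from urllib.parse import urlsplit


def canonicalize(url: str) -> str:
    """Toy canonicalizer (an assumption, not Wayback's real rules).

    Drops the scheme, any user info, and a default ":80" port.
    """
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""      # hostname excludes user info and port
    if parts.port is not None and parts.port != 80:
        host = f"{host}:{parts.port}"
    key = host + parts.path
    if parts.query:
        key += "?" + parts.query
    return key


def rewrite_column(line: str, column: int = 0) -> str:
    """Canonicalize one space-delimited column, leaving the others alone.

    The column index is a hypothetical parameter chosen for illustration.
    """
    fields = line.rstrip("\n").split(" ")
    if column < len(fields):
        fields[column] = canonicalize(fields[column])
    return " ".join(fields)


def build_index(identity_lines):
    """Canonicalize every identity line, then sort the result."""
    return sorted(rewrite_column(line) for line in identity_lines)


if __name__ == "__main__":
    # Made-up records; real CDX lines carry more fields.
    sample = [
        "http://user:pass@example.com:80/b 20081106225124",
        "http://example.com/a 20081029000133",
    ]
    for record in build_index(sample):
        print(record)
```

Because the identity lines are left untouched, swapping in a different `canonicalize` function and rebuilding the sorted output does not require going back to the ARC/WARC files, which is the benefit the log message describes.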