From: <bra...@us...> - 2009-07-18 00:35:01
Revision: 2778
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2778&view=rev
Author:   bradtofel
Date:     2009-07-18 00:34:58 +0000 (Sat, 18 Jul 2009)

Log Message:
-----------
DOC: added mention of default port stripping

Modified Paths:
--------------
    branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml

Modified: branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml
===================================================================
--- branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml 2009-07-18 00:29:00 UTC (rev 2777)
+++ branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml 2009-07-18 00:34:58 UTC (rev 2778)
@@ -275,10 +275,14 @@
 </li>
 <li>
 <b>user info removal</b>
- http://us...@ex... => example.com,
- http://user:pas...@ex... => example.com,
+ http://us...@ex... => example.com,
+ http://user:pas...@ex... => example.com,
 </li>
 <li>
+ <b>default port removal</b>
+ http://example.com:80 => example.com,
+ </li>
+ <li>
 <b>session ID removal</b>
 http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx
 =>
@@ -313,12 +317,12 @@
 <p>
 At the IA, we have recently switched to building CDX files using the
 <b>-identity</b> option on the <b>arc-indexer</b> and
- <b>warc-indexer</b> tools, and have added an additional step in our
- CDX creation processes which uses the <b>url-client</b> tool before
- sorting and merging CDX files. By keeping the original "identity" CDX
- files, we have been able to test various URL canonicalization
- strategies without the overhead of re-processing all the source
- materials.
+ <b>warc-indexer</b> tools. The <b>-identity</b> option
+ <b>requires</b> passing records through the <b>url-client</b>
+ tool before sorting and merging into production CDX files. By keeping
+ the original "identity" CDX files, we have been able to test various
+ URL canonicalization strategies without the overhead of
+ re-processing all the ARC/WARC source materials.
 </p>
 </subsection>
 <subsection name="Future Directions within Wayback">
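
For readers who want to see what the "user info removal" and "default port removal" rules in the diff amount to, here is a minimal Python sketch. It is not Wayback's canonicalizer (which is written in Java) and does not cover the other rules, such as session ID removal. The function name `strip_userinfo_and_default_port`, the sample URLs, and the https/443 entry are assumptions made for illustration; the documentation change only mentions port 80.

```python
from urllib.parse import urlsplit

# Assumption: only http:80 comes from the documented example; the https:443
# entry is added here for illustration.
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_userinfo_and_default_port(url: str) -> str:
    """Hypothetical helper: drop the scheme, user info, and a default port."""
    parts = urlsplit(url)
    # .hostname excludes any "user:password@" prefix and is lowercased.
    host = parts.hostname or ""
    # Keep the port only when it is not the default for the scheme.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{parts.port}"
    query = f"?{parts.query}" if parts.query else ""
    return f"{host}{parts.path}{query}"


if __name__ == "__main__":
    # Illustrative inputs only; these are not taken from the commit.
    for sample in (
        "http://example.com:80",
        "http://user:secret@example.com",
        "http://example.com:8080/a?b=1",
    ):
        print(sample, "=>", strip_userinfo_and_default_port(sample))
```

Because such a step works on the URL string alone, it could in principle be re-run over saved "identity" CDX records to try a different rule set without touching the ARC/WARC files, which is the workflow the changed paragraph describes.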