Revision: 2778
http://archive-access.svn.sourceforge.net/archive-access/?rev=2778&view=rev
Author: bradtofel
Date: 2009-07-18 00:34:58 +0000 (Sat, 18 Jul 2009)
Log Message:
-----------
DOC: added mention of default port stripping
Modified Paths:
--------------
branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml
Modified: branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml
===================================================================
--- branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml 2009-07-18 00:29:00 UTC (rev 2777)
+++ branches/wayback-1_4_2/dist/src/site/xdoc/resource_index.xml 2009-07-18 00:34:58 UTC (rev 2778)
@@ -275,10 +275,14 @@
</li>
<li>
<b>user info removal</b>
- http://us...@ex... => example.com,
- http://user:pas...@ex... => example.com,
+ http://us...@ex... => example.com,
+ http://user:pas...@ex... => example.com,
</li>
<li>
+ <b>default port removal</b>
+ http://example.com:80 => example.com,
+ </li>
+ <li>
<b>session ID removal</b>
http://www.example.com/(S(a63098d96360a63098d96360))/page1.aspx
=>
@@ -313,12 +317,12 @@
<p>
At the IA, we have recently switched to building CDX files using the
<b>-identity</b> option on the <b>arc-indexer</b> and
- <b>warc-indexer</b> tools, and have added an additional step in our
- CDX creation processes which uses the <b>url-client</b> tool before
- sorting and merging CDX files. By keeping the original "identity" CDX
- files, we have been able to test various URL canonicalization
- strategies without the overhead of re-processing all the source
- materials.
+ <b>warc-indexer</b> tools. The <b>-identity</b> option
+ <b>requires</b> passing records through the <b>url-client</b>
+ tool before sorting and merging into production CDX files. By keeping
+ the original "identity" CDX files, we have been able to test various
+ URL canonicalization strategies without the overhead of
+ re-processing all the ARC/WARC source materials.
</p>
</subsection>
<subsection name="Future Directions within Wayback">
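
For illustration only, the "user info removal" and "default port removal" rules documented in this change can be sketched in Python. This is not Wayback's canonicalizer: the function name `strip_userinfo_and_default_port` and the `DEFAULT_PORTS` table are hypothetical, and the sketch assumes only http/https default ports matter and that the scheme is dropped from the result, as in the examples above.

```python
from urllib.parse import urlsplit

# Assumed default ports per scheme (hypothetical table for this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443}


def strip_userinfo_and_default_port(url):
    """Return host[:port] + path[?query], dropping user info and default ports."""
    parts = urlsplit(url)
    # .hostname excludes any "user:password@" prefix and is lowercased.
    host = parts.hostname or ""
    port = parts.port
    # Keep the port only when it is explicit and not the scheme's default.
    if port is not None and port != DEFAULT_PORTS.get(parts.scheme):
        host = f"{host}:{port}"
    rest = parts.path
    if parts.query:
        rest += "?" + parts.query
    return host + rest


if __name__ == "__main__":
    for u in ("http://user:password@example.com",
              "http://example.com:80",
              "http://example.com:8080/page"):
        print(u, "=>", strip_userinfo_and_default_port(u))
```

A non-default port such as 8080 is kept, since dropping it could merge distinct resources under one key.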