From: <bra...@us...> - 2008-02-07 00:09:50
Revision: 2178
          http://archive-access.svn.sourceforge.net/archive-access/?rev=2178&view=rev
Author:   bradtofel
Date:     2008-02-06 16:09:52 -0800 (Wed, 06 Feb 2008)

Log Message:
-----------
DOC: Added basic info about duplicate reduction features.

Modified Paths:
--------------
    trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml

Modified: trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml
===================================================================
--- trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-02-07 00:09:12 UTC (rev 2177)
+++ trunk/archive-access/projects/wayback/dist/src/site/xdoc/administrator_manual.xml 2008-02-07 00:09:52 UTC (rev 2178)
@@ -289,6 +289,7 @@
 <bean class="org.archive.wayback.resourceindex.LocalResourceIndex">
 <property name="source" ... />
 <property name="maxRecords" value="10000" />
+<property name="dedupeRecords" value="false" />
 </bean>
 </property>

@@ -301,9 +302,16 @@
 specifies the maximum number of records to process, and thus that
 can be returned, during a single query.
 </p>
-<br></br>
 <p>
 <b>
+dedupeRecords
+</b>
+set to true if you are using WARC files created by Heritrix 1.12 or
+higher and configured the duplicate reduction features. See the
+section Duplicate Reduction below for more information.
+</p>
+<p>
+<b>
 source
 </b>
 defines the format to be used for storing and searching records in
@@ -1644,6 +1652,29 @@
 </p>
 </subsection>
 </section>
-
+<section name="Duplicate Reduction">
+<p>
+Heritrix 1.12 and above have the capability to write WARC files, which
+omit storing documents that have not changed since a previous visit. For
+specifics on activating these features, please refer to the Heritrix
+documentation. When Heritrix is using these features, and notices that
+a document has not changed since the last time it was visited, it
+creates an abbreviated WARC record, indicating that the document was
+retrieved but not stored. In this abbreviated WARC record is an
+indicator of the SHA1 digest of the document.
+</p>
+<p>
+The wayback uses these identical SHA1 digests to map the location
+(ARC/WARC + offset) of the original record that was stored to subsequent
+records that were not. When a request for a subsequent capture that was
+not stored is received by wayback, it will return the content of the
+previous stored record.
+</p>
+<p>
+The matching of these digests occurs at query time, and is configured
+by setting the "dedupeRecords" option of the LocalResourceIndex to
+"true".
+</p>
+</section>
 </body>
 </document>

This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.
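
For readers who want a concrete picture of the query-time matching described in the new "Duplicate Reduction" section, the following is a rough sketch only, not wayback's actual code. It assumes a hypothetical Capture record with timestamp, digest, file and offset fields, where file and offset are None for a capture that Heritrix retrieved but did not store. Keying the lookup on the digest alone is also an assumption made for brevity. The file names and digest strings are placeholders.

```python
from dataclasses import dataclass, replace
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Capture:
    """Hypothetical index record; field names are illustrative, not wayback's."""
    timestamp: str          # capture time, e.g. "20080101000000"
    digest: str             # SHA1 digest of the document
    file: Optional[str]     # ARC/WARC file name, or None if content was not stored
    offset: Optional[int]   # offset within that file, or None if not stored


def dedupe_records(captures: Iterable[Capture]) -> Iterator[Capture]:
    """Give each not-stored capture the location of an earlier stored one."""
    stored = {}  # digest -> (file, offset) of the most recent stored record
    for cap in sorted(captures, key=lambda c: c.timestamp):
        if cap.file is not None:
            # Stored record: remember where its content lives.
            stored[cap.digest] = (cap.file, cap.offset)
            yield cap
        elif cap.digest in stored:
            # Not stored, but an earlier record has the same digest.
            file, offset = stored[cap.digest]
            yield replace(cap, file=file, offset=offset)
        else:
            # No earlier stored record with this digest: pass through unresolved.
            yield cap


results = list(dedupe_records([
    Capture("20080101000000", "sha1:AAA", "a.warc.gz", 100),
    Capture("20080201000000", "sha1:AAA", None, None),
    Capture("20080301000000", "sha1:BBB", "b.warc.gz", 200),
]))
```

In this sketch the second capture keeps its own timestamp but points at the file and offset of the first, which mirrors the behaviour the section describes: a request for the later capture returns the content of the previous stored record.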