From: John A. K. <joh...@us...> - 2005-08-23 17:36:04
Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv30443

Modified Files:
	warc_file_format.html warc_file_format.txt warc_file_format.xml
Log Message:
trivial changes (typos) plus test of xml2rfc-1.30 outputs

Index: warc_file_format.html
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v
retrieving revision 1.3
retrieving revision 1.4
diff -C2 -d -r1.3 -r1.4
*** warc_file_format.html 18 Aug 2005 01:57:10 -0000 1.3
--- warc_file_format.html 23 Aug 2005 17:35:41 -0000 1.4
***************
*** 3,7 ****
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta name="description" content="The WARC File Format (Version 0.8 rev B)">
! <meta name="generator" content="xml2rfc v1.29 (http://xml.resource.org/)">
  <style type='text/css'>
  <!--
--- 3,7 ----
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  <meta name="description" content="The WARC File Format (Version 0.8 rev B)">
! <meta name="generator" content="xml2rfc v1.30 (http://xml.resource.org/)">
  <style type='text/css'>
  <!--
***************
*** 28,32 ****
font-family: charcoal, monaco, geneva, "MS Sans Serif", helvetica, verdana, sans-serif; font-size: x-small ; background-color: #000000; } !
/* info code from SantaKlauss at http://www.madaboutstyle.com/tooltip2.html */ div#counter{margin-top: 100px} *************** *** 58,61 **** --- 58,63 ---- p.copyright { font-size: x-small ; } p.toc { font-size: small ; font-weight: bold ; margin-left: 3em ;} + table.toc { margin: 0 0 0 3em; padding: 0; border: 0; vertical-align: text-top; } + td.toc { font-size: small; font-weight: bold; vertical-align: text-top; } span.emph { font-style: italic; } *************** *** 95,108 **** td.author { font-weight: bold; margin-left: 4em; font-size: x-small ; } td.author-text { font-size: x-small; } ! table.data { vertical-align: top ; border-collapse: collapse ; border-style: solid solid solid solid ; border-color: black black black black ; font-size: small ; text-align: center ; } ! table.data th { font-weight: bold ; ! border-style: solid solid solid solid ; border-color: black black black black ; } ! table.data td { border-style: solid solid solid solid ; border-color: #333333 #333333 #333333 #333333 ; } hr { height: 1px } --- 97,119 ---- td.author { font-weight: bold; margin-left: 4em; font-size: x-small ; } td.author-text { font-size: x-small; } ! table.full { vertical-align: top ; border-collapse: collapse ; border-style: solid solid solid solid ; border-color: black black black black ; font-size: small ; text-align: center ; } ! table.headers, table.none { vertical-align: top ; border-collapse: collapse ; ! border-style: none; ! font-size: small ; text-align: center ; } ! table.full th { font-weight: bold ; ! border-style: solid ; border-color: black black black black ; } ! table.headers th { font-weight: bold ; ! border-style: none none solid none; ! border-color: black black black black ; } ! table.none th { font-weight: bold ; ! border-style: none; } ! 
table.full td { border-style: solid solid solid solid ; border-color: #333333 #333333 #333333 #333333 ; } + table.headers td, table.none td { border-style: none; } hr { height: 1px } *************** *** 178,202 **** <a href="#record_types">4.</a> Record Types<br /> ! <a href="#anchor4">4.1</a> 'warcinfo'<br /> ! <a href="#anchor5">4.2</a> 'response'<br /> ! <a href="#anchor6">4.3</a> 'resource'<br /> ! <a href="#anchor7">4.4</a> 'request'<br /> ! <a href="#anchor8">4.5</a> 'metadata'<br /> ! <a href="#anchor9">4.6</a> 'revisit'<br /> ! <a href="#anchor10">4.7</a> 'conversion'<br /> ! <a href="#anchor11">4.8</a> 'continuation'<br /> <a href="#anchor12">5.</a> Record Header<br /> ! <a href="#anchor13">5.1</a> Positional Parameters<br /> ! <a href="#anchor14">5.2</a> Named Parameters<br /> <a href="#anchor15">6.</a> --- 189,213 ---- <a href="#record_types">4.</a> Record Types<br /> ! <a href="#anchor4">4.1.</a> 'warcinfo'<br /> ! <a href="#anchor5">4.2.</a> 'response'<br /> ! <a href="#anchor6">4.3.</a> 'resource'<br /> ! <a href="#anchor7">4.4.</a> 'request'<br /> ! <a href="#anchor8">4.5.</a> 'metadata'<br /> ! <a href="#anchor9">4.6.</a> 'revisit'<br /> ! <a href="#anchor10">4.7.</a> 'conversion'<br /> ! <a href="#anchor11">4.8.</a> 'continuation'<br /> <a href="#anchor12">5.</a> Record Header<br /> ! <a href="#anchor13">5.1.</a> Positional Parameters<br /> ! <a href="#anchor14">5.2.</a> Named Parameters<br /> <a href="#anchor15">6.</a> *************** *** 204,226 **** <a href="#anchor16">7.</a> Truncated and Segmented Records<br /> ! <a href="#anchor17">7.1</a> Record Truncation<br /> ! <a href="#anchor18">7.2</a> Record Segmentation<br /> <a href="#anchor19">8.</a> WARC Application to Specific Protocols<br /> ! <a href="#anchor20">8.1</a> HTTP and HTTPS<br /> ! <a href="#anchor21">8.2</a> DNS<br /> ! <a href="#anchor22">8.3</a> Other Resources with URIs, and Other Protocols<br /> <a href="#anchor23">9.</a> Compression Recommendations<br /> ! 
<a href="#anchor24">9.1</a> Record-at-a-time Compression<br /> ! <a href="#anchor25">9.2</a> GZIP extra field: skip-lengths ('sl')<br /> ! <a href="#anchor26">9.3</a> GZIP WARC File Extension<br /> <a href="#anchor27">10.</a> --- 215,237 ---- <a href="#anchor16">7.</a> Truncated and Segmented Records<br /> ! <a href="#anchor17">7.1.</a> Record Truncation<br /> ! <a href="#anchor18">7.2.</a> Record Segmentation<br /> <a href="#anchor19">8.</a> WARC Application to Specific Protocols<br /> ! <a href="#anchor20">8.1.</a> HTTP and HTTPS<br /> ! <a href="#anchor21">8.2.</a> DNS<br /> ! <a href="#anchor22">8.3.</a> Other Resources with URIs, and Other Protocols<br /> <a href="#anchor23">9.</a> Compression Recommendations<br /> ! <a href="#anchor24">9.1.</a> Record-at-a-time Compression<br /> ! <a href="#anchor25">9.2.</a> GZIP extra field: skip-lengths ('sl')<br /> ! <a href="#anchor26">9.3.</a> GZIP WARC File Extension<br /> <a href="#anchor27">10.</a> *************** *** 232,254 **** <a href="#anchor30">13.</a> Acknowledgements<br /> ! <a href="#anchor31">A.</a> Consideratons in Choice of record-id<br /> ! <a href="#anchor32">B.</a> Examples of WARC Records<br /> ! <a href="#anchor33">B.1</a> Example of 'warcinfo' Record<br /> ! <a href="#anchor34">B.2</a> Example of 'request' Record<br /> ! <a href="#anchor35">B.3</a> Example of 'response' Record<br /> ! <a href="#anchor36">B.4</a> Example of 'resource' Record<br /> ! <a href="#anchor37">B.5</a> Example of 'metadata' Record<br /> ! <a href="#anchor38">B.6</a> Example of 'revisit' Record<br /> ! <a href="#anchor39">B.7</a> Example of 'conversion' Record<br /> ! <a href="#anchor40">B.8</a> Example of 'continuation' Record<br /> <a href="#rfc.references1">14.</a> --- 243,265 ---- <a href="#anchor30">13.</a> Acknowledgements<br /> ! <a href="#anchor31">Appendix A.</a> Consideratons in Choice of record-id<br /> ! <a href="#anchor32">Appendix B.</a> Examples of WARC Records<br /> ! 
<a href="#anchor33">Appendix B.1.</a> Example of 'warcinfo' Record<br /> ! <a href="#anchor34">Appendix B.2.</a> Example of 'request' Record<br /> ! <a href="#anchor35">Appendix B.3.</a> Example of 'response' Record<br /> ! <a href="#anchor36">Appendix B.4.</a> Example of 'resource' Record<br /> ! <a href="#anchor37">Appendix B.5.</a> Example of 'metadata' Record<br /> ! <a href="#anchor38">Appendix B.6.</a> Example of 'revisit' Record<br /> ! <a href="#anchor39">Appendix B.7.</a> Example of 'conversion' Record<br /> ! <a href="#anchor40">Appendix B.8.</a> Example of 'continuation' Record<br /> <a href="#rfc.references1">14.</a> *************** *** 269,273 **** simple text headers and an arbitary data block into one long file. The WARC format is a revision of the <a class="info" href="#ARC">ARC File ! Format<span> (</span><span class="info">Burner, M. and B. Kahle, “The ARC File Format,” September 1996.</span><span>)</span></a>[ARC] format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. --- 280,284 ---- simple text headers and an arbitary data block into one long file. The WARC format is a revision of the <a class="info" href="#ARC">ARC File ! Format<span> (</span><span class="info">Burner, M. and B. Kahle, “The ARC File Format,” September 1996.</span><span>)</span></a> [ARC] format that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. *************** *** 276,294 **** Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briey describes the harvested content and its length. This is directly followed by the the retrieval protocol response messages and content. The motivation to revise the format arose from the discussion and experiences of the <a class="info" href="#IIPC">International ! 
Internet Preservation Consortium (IIPC)<span> (</span><span class="info">, “International Internet Preservation Consortium (IIPC),” .</span><span>)</span></a>[IIPC], whose members include the IA and the national libraries of a dozen countries. The revised ! format is expected to become the primary output format of the ! open-source <a class="info" href="#HERITRIX">Heritrix<span> (</span><span class="info">, “Heritrix Open Source Archival Web Crawler,” .</span><span>)</span></a>[HERITRIX] web crawler, and ! the input format for a wide array of cataloguing and access tools. </p> <p>The WARC format generalizes the older format to better support the ! harvesting, display, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned ! metadata, abbrieviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools that --- 287,307 ---- Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briefly describes the harvested content and its length. This is directly followed by the the retrieval protocol response messages and content. The motivation to revise the format arose from the discussion and experiences of the <a class="info" href="#IIPC">International ! Internet Preservation Consortium (IIPC)<span> (</span><span class="info">, “International Internet Preservation Consortium (IIPC),” .</span><span>)</span></a> [IIPC], whose members include the IA and the national libraries of a dozen countries. The revised ! format is expected to be a standard way to structure, manage and ! store billions of collected web resources. For example, WARC will be ! an output format of harvesting software, such as the open-source ! 
<a class="info" href="#HERITRIX">Heritrix<span> (</span><span class="info">, “Heritrix Open Source Archival Web Crawler,” .</span><span>)</span></a> [HERITRIX] web crawler, and an input ! format for a wide array of cataloguing and access tools. </p> <p>The WARC format generalizes the older format to better support the ! harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned ! metadata, abbreviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools that *************** *** 353,357 **** block = *OCTET </pre> - <p>Elements of this grammar are further specified and explained in sections that follow (and in the case of <span class="emph">anvl-fields</span>, also a separate document). --- 366,369 ---- *************** *** 367,371 **** tsp = 1*WSP </pre> - <p>The amount of whitespace between <span class="emph">header-line</span> tokens is variable. This gives archive builders the flexibility to add padding and later adjust --- 379,382 ---- *************** *** 375,379 **** </p> <p>After the <span class="emph">header-line</span> come any number of ! named fields in a line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a>[ANVL] that is very similar to that of email headers <a class="info" href="#RFC0822">[RFC0822]<span> (</span><span class="info">Crocker, D., “Standard for the format of ARPA Internet text messages,” August 1982.</span><span>)</span></a>. Its format can be roughly summarized as the following: --- 386,390 ---- </p> <p>After the <span class="emph">header-line</span> come any number of ! 
named fields in a line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a> [ANVL] that is very similar to that of email headers <a class="info" href="#RFC0822">[RFC0822]<span> (</span><span class="info">Crocker, D., “Standard for the format of ARPA Internet text messages,” August 1982.</span><span>)</span></a>. Its format can be roughly summarized as the following: *************** *** 384,388 **** other-anvl = <see ANVL> </pre> - <p>This document defines a number of named fields which may appear in the <span class="emph">anvl-fields</span> area of the header. Note that --- 395,398 ---- *************** *** 424,428 **** appropriate and how they can be standardized is warranted.] </p> ! <a name="rfc.section.4.1"></a><h4><a name="anchor4">4.1</a> 'warcinfo'</h4> <p>A 'warcinfo' record describes the records that follow it, up through end of --- 434,440 ---- appropriate and how they can be standardized is warranted.] </p> ! <a name="anchor4"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.1"></a><h3>4.1. 'warcinfo'</h3> <p>A 'warcinfo' record describes the records that follow it, up through end of *************** *** 451,455 **** content block must be formally defined somewhere.] </p> ! <a name="rfc.section.4.2"></a><h4><a name="anchor5">4.2</a> 'response'</h4> <p>A 'response' record contains an entire protocol response, such as a full --- 463,469 ---- content block must be formally defined somewhere.] </p> ! <a name="anchor5"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.2"></a><h3>4.2. 
'response'</h3> <p>A 'response' record contains an entire protocol response, such as a full *************** *** 461,465 **** 'IP-Address' and 'Related-Record-ID'. </p> ! <a name="rfc.section.4.3"></a><h4><a name="anchor6">4.3</a> 'resource'</h4> <p>A 'resource' record contains a resource, without full protocol response --- 475,481 ---- 'IP-Address' and 'Related-Record-ID'. </p> ! <a name="anchor6"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.3"></a><h3>4.3. 'resource'</h3> <p>A 'resource' record contains a resource, without full protocol response *************** *** 469,473 **** includes the named parameter 'Related-Record-ID'. </p> ! <a name="rfc.section.4.4"></a><h4><a name="anchor7">4.4</a> 'request'</h4> <p>A 'request' record holds the manner in which a primary record's content was --- 485,491 ---- includes the named parameter 'Related-Record-ID'. </p> ! <a name="anchor7"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.4"></a><h3>4.4. 'request'</h3> <p>A 'request' record holds the manner in which a primary record's content was *************** *** 476,480 **** 'Related-Record-ID'. </p> ! <a name="rfc.section.4.5"></a><h4><a name="anchor8">4.5</a> 'metadata'</h4> <p>A 'metadata' record contains content created in order to further describe, --- 494,500 ---- 'Related-Record-ID'. </p> ! <a name="anchor8"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.5"></a><h3>4.5. 
'metadata'</h3> <p>A 'metadata' record contains content created in order to further describe, *************** *** 494,501 **** formally specified somewhere.] </p> ! <a name="rfc.section.4.6"></a><h4><a name="anchor9">4.6</a> 'revisit'</h4> <p>A 'revisit' record describes the revisitation of content already archived, ! and includes only an abbrieviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' record to --- 514,523 ---- formally specified somewhere.] </p> ! <a name="anchor9"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.6"></a><h3>4.6. 'revisit'</h3> <p>A 'revisit' record describes the revisitation of content already archived, ! and includes only an abbreviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' record to *************** *** 527,531 **** somewhere.] </p> ! <a name="rfc.section.4.7"></a><h4><a name="anchor10">4.7</a> 'conversion'</h4> <p>A 'conversion' record contains an alternative version of another record's --- 549,555 ---- somewhere.] </p> ! <a name="anchor10"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.7"></a><h3>4.7. 'conversion'</h3> <p>A 'conversion' record contains an alternative version of another record's *************** *** 549,553 **** specified somewhere.] </p> ! <a name="rfc.section.4.8"></a><h4><a name="anchor11">4.8</a> 'continuation'</h4> <p>A 'continuation' record needs to be logically appended to a prior record --- 573,579 ---- specified somewhere.] </p> ! <a name="anchor11"></a><br /><hr /> ! 
<table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.4.8"></a><h3>4.8. 'continuation'</h3> <p>A 'continuation' record needs to be logically appended to a prior record *************** *** 599,608 **** record-id = uri </pre> - <p>The warc-id string may change in future versions, but will always begin "warc/", and will always be 8 octets long. </p> <p>Named parameters after the header-line, if any, follow the ! line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a>[ANVL]. Normally, named parameters are optional and their order is insignificant, however, specific record types require that certain named parameters --- 625,633 ---- record-id = uri </pre> <p>The warc-id string may change in future versions, but will always begin "warc/", and will always be 8 octets long. </p> <p>Named parameters after the header-line, if any, follow the ! line-oriented syntax called <a class="info" href="#ANVL">ANVL<span> (</span><span class="info">Kunze, J., Kahle, B., Masanes, J., and G. Mohr, “A Name-Value Language,” .</span><span>)</span></a> [ANVL]. Normally, named parameters are optional and their order is insignificant, however, specific record types require that certain named parameters *************** *** 612,616 **** consecutive newlines). </p> ! <a name="rfc.section.5.1"></a><h4><a name="anchor13">5.1</a> Positional Parameters</h4> <p>This section describes each of the individual positional parameters --- 637,643 ---- consecutive newlines). </p> ! <a name="anchor13"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.5.1"></a><h3>5.1. 
Positional Parameters</h3> <p>This section describes each of the individual positional parameters *************** *** 638,642 **** this many octets from that first character of the record header, there should be two newlines and either the beginning of a new record or the ! end of the file. <br /> --- 665,670 ---- this many octets from that first character of the record header, there should be two newlines and either the beginning of a new record or the ! end of the file. (WARC reading implementations may choose to tolerate ! more or fewer newlines at the end of a record.) <br /> *************** *** 644,649 **** ! Defensive programming suggests the practice of tolerating fewer or ! more than two newlines at record's end. If the first next token does not match the first token of a WARC record, then the previous data-length should be considered in error; corrective action might --- 672,676 ---- ! If the first next token does not match the first token of a WARC record, then the previous data-length should be considered in error; corrective action might *************** *** 725,729 **** </dd> </dl></blockquote> ! <a name="rfc.section.5.2"></a><h4><a name="anchor14">5.2</a> Named Parameters</h4> <p>Named parameters, also referred to as named fields, are optional --- 752,758 ---- </dd> </dl></blockquote> ! <a name="anchor14"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.5.2"></a><h3>5.2. Named Parameters</h3> <p>Named parameters, also referred to as named fields, are optional *************** *** 757,761 **** </pre> - [REVIEW ISSUE: Should we recommend an algorithm? SHA1's continued viability as a secure hash is in doubt given recent crypto research --- 786,789 ---- *************** *** 863,867 **** header-line.] </p> ! 
<a name="rfc.section.7.1"></a><h4><a name="anchor17">7.1</a> Record Truncation</h4> <p>Any record may indicate that truncation has occurred and give the --- 891,897 ---- header-line.] </p> ! <a name="anchor17"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.7.1"></a><h3>7.1. Record Truncation</h3> <p>Any record may indicate that truncation has occurred and give the *************** *** 871,875 **** exceeding a length limit. </p> ! <a name="rfc.section.7.2"></a><h4><a name="anchor18">7.2</a> Record Segmentation</h4> <p>A record that will not fit into a single WARC file of desired --- 901,907 ---- exceeding a length limit. </p> ! <a name="anchor18"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.7.2"></a><h3>7.2. Record Segmentation</h3> <p>A record that will not fit into a single WARC file of desired *************** *** 906,910 **** <a name="rfc.section.8"></a><h3>8. WARC Application to Specific Protocols</h3> ! <a name="rfc.section.8.1"></a><h4><a name="anchor20">8.1</a> HTTP and HTTPS</h4> <p>A full HTTP or HTTPS response, with protocol information and --- 938,944 ---- <a name="rfc.section.8"></a><h3>8. WARC Application to Specific Protocols</h3> ! <a name="anchor20"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.8.1"></a><h3>8.1. HTTP and HTTPS</h3> <p>A full HTTP or HTTPS response, with protocol information and *************** *** 956,960 **** "message/http" type. </p> ! 
<a name="rfc.section.8.2"></a><h4><a name="anchor21">8.2</a> DNS</h4> <p>A request for DNS information can be summarized in a URI in --- 990,996 ---- "message/http" type. </p> ! <a name="anchor21"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.8.2"></a><h3>8.2. DNS</h3> <p>A request for DNS information can be summarized in a URI in *************** *** 966,970 **** type. </p> ! <a name="rfc.section.8.3"></a><h4><a name="anchor22">8.3</a> Other Resources with URIs, and Other Protocols</h4> <p>Any resource that can be identified with a URI, even if it is not --- 1002,1008 ---- type. </p> ! <a name="anchor22"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.8.3"></a><h3>8.3. Other Resources with URIs, and Other Protocols</h3> <p>Any resource that can be identified with a URI, even if it is not *************** *** 1009,1013 **** compressing WARC files with GZIP. </p> ! <a name="rfc.section.9.1"></a><h4><a name="anchor24">9.1</a> Record-at-a-time Compression</h4> <p>Per section 2.2 of the GZIP specification, a valid GZIP file --- 1047,1053 ---- compressing WARC files with GZIP. </p> ! <a name="anchor24"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.9.1"></a><h3>9.1. Record-at-a-time Compression</h3> <p>Per section 2.2 of the GZIP specification, a valid GZIP file *************** *** 1029,1033 **** record. </p> ! <a name="rfc.section.9.2"></a><h4><a name="anchor25">9.2</a> GZIP extra field: skip-lengths ('sl')</h4> <p>Customarily, GZIP members do not declare their compressed --- 1069,1075 ---- record. </p> ! 
<a name="anchor25"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.9.2"></a><h3>9.2. GZIP extra field: skip-lengths ('sl')</h3> <p>Customarily, GZIP members do not declare their compressed *************** *** 1069,1073 **** appropriate. </p> ! <a name="rfc.section.9.3"></a><h4><a name="anchor26">9.3</a> GZIP WARC File Extension</h4> <p>WARC files compressed with the above conventions remain legal GZIP --- 1111,1117 ---- appropriate. </p> ! <a name="anchor26"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.9.3"></a><h3>9.3. GZIP WARC File Extension</h3> <p>WARC files compressed with the above conventions remain legal GZIP *************** *** 1195,1199 **** there are providers to service them. This specification does not dictate what identifier scheme to use; suitable schemes include ! <a class="info" href="#RFC2141">URN<span> (</span><span class="info">Moats, R., “URN Syntax,” May 1997.</span><span>)</span></a>[RFC2141], <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. Rogers, “The ARK Persistent Identifier Scheme,” February 2005.</span><span>)</span></a>, <a class="info" href="#GUID">[GUID]<span> (</span><span class="info">, “Wikipedia: Globally Unique Identifiers,” .</span><span>)</span></a>, etc. </p> --- 1239,1243 ---- there are providers to service them. This specification does not dictate what identifier scheme to use; suitable schemes include ! <a class="info" href="#RFC2141">URN<span> (</span><span class="info">Moats, R., “URN Syntax,” May 1997.</span><span>)</span></a> [RFC2141], <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. 
Rodgers, “The ARK Persistent Identifier Scheme,” August 2005.</span><span>)</span></a>, <a class="info" href="#GUID">[GUID]<span> (</span><span class="info">, “Wikipedia: Globally Unique Identifiers,” .</span><span>)</span></a>, etc. </p> *************** *** 1208,1212 **** </p> <p>These conventions are suggested by <a class="info" href="#RFC2396">[RFC2396]<span> (</span><span class="info">Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax,” August 1998.</span><span>)</span></a>, ! formalized by the <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. Rogers, “The ARK Persistent Identifier Scheme,” February 2005.</span><span>)</span></a> scheme, and are applicable to such things as the summarizing of large search results from Internet-wide indexing engines. As an example of a convention that --- 1252,1256 ---- </p> <p>These conventions are suggested by <a class="info" href="#RFC2396">[RFC2396]<span> (</span><span class="info">Berners-Lee, T., Fielding, R., and L. Masinter, “Uniform Resource Identifiers (URI): Generic Syntax,” August 1998.</span><span>)</span></a>, ! formalized by the <a class="info" href="#ARK">[ARK]<span> (</span><span class="info">Kunze, J. and R. Rodgers, “The ARK Persistent Identifier Scheme,” August 2005.</span><span>)</span></a> scheme, and are applicable to such things as the summarizing of large search results from Internet-wide indexing engines. As an example of a convention that *************** *** 1218,1222 **** http://abc.org/12026/987654321 </pre> - <p>The convention could also reserve the extension strings "_s", "_d", and "_t" to indicate record- ids for secondary, duplicate, and --- 1262,1265 ---- *************** *** 1230,1234 **** http://abc.org/12026/987654321/_t </pre> - <p>...in which an integer count may further extend the identifier when more there is more than one relationship of the given type. 
--- 1273,1276 ---- *************** *** 1246,1255 **** and checksums shown are plausible random filler. </p> ! <a name="rfc.section.B.1"></a><h4><a name="anchor33">Appendix B.1</a> Example of 'warcinfo' Record</h4> <p>The following 'warcinfo' example includes an XML description of the enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an ! abbrieviated and speculative illustration; the referenced WARC-specific namespace "http://archive.org/warc/0.8" has not been formally defined anywhere, and may not reflect eventual practice with --- 1288,1299 ---- and checksums shown are plausible random filler. </p> ! <a name="anchor33"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.1"></a><h3>Appendix B.1. Example of 'warcinfo' Record</h3> <p>The following 'warcinfo' example includes an XML description of the enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an ! abbreviated and speculative illustration; the referenced WARC-specific namespace "http://archive.org/warc/0.8" has not been formally defined anywhere, and may not reflect eventual practice with *************** *** 1283,1287 **** </pre> - <p>The first line (spread over three lines for readability) shows the required line of positional parameters. This record has no named --- 1327,1330 ---- *************** *** 1290,1294 **** header-line. Two newlines follow the content block. </p> ! <a name="rfc.section.B.2"></a><h4><a name="anchor34">Appendix B.2</a> Example of 'request' Record</h4> <p>A 'request' record captures the protocol request used to collect a --- 1333,1339 ---- header-line. Two newlines follow the content block. </p> ! <a name="anchor34"></a><br /><hr /> ! 
<table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.2"></a><h3>Appendix B.2. Example of 'request' Record</h3> <p>A 'request' record captures the protocol request used to collect a *************** *** 1307,1312 **** </pre> ! ! <a name="rfc.section.B.3"></a><h4><a name="anchor35">Appendix B.3</a> Example of 'response' Record</h4> <p>The archived response to the above request might look like the --- 1352,1358 ---- </pre> ! <a name="anchor35"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.3"></a><h3>Appendix B.3. Example of 'response' Record</h3> <p>The archived response to the above request might look like the *************** *** 1333,1342 **** [6958 bytes of binary data here] </pre> - <p>Note the 'Related-Record-ID' named field referring back to the generating 'request' record, and the creation-date identical to the previous record. </p> ! <a name="rfc.section.B.4"></a><h4><a name="anchor36">Appendix B.4</a> Example of 'resource' Record</h4> <p>This same file, "logo.jpg", might be archived internally to an --- 1379,1389 ---- [6958 bytes of binary data here] </pre> <p>Note the 'Related-Record-ID' named field referring back to the generating 'request' record, and the creation-date identical to the previous record. </p> ! <a name="anchor36"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.4"></a><h3>Appendix B.4. Example of 'resource' Record</h3> <p>This same file, "logo.jpg", might be archived internally to an *************** *** 1351,1356 **** [6958 bytes of binary data here] </pre> ! ! 
<a name="rfc.section.B.5"></a><h4><a name="anchor37">Appendix B.5</a> Example of 'metadata' Record</h4> <p>If some crawl-time metadata should be archived near the above --- 1398,1404 ---- [6958 bytes of binary data here] </pre> ! <a name="anchor37"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.5"></a><h3>Appendix B.5. Example of 'metadata' Record</h3> <p>If some crawl-time metadata should be archived near the above *************** *** 1370,1379 **** </harvestmetadata> </pre> - <p>Note again the same creation-date as the preceding related records. A relationship is declared o the preceding 'response' record, but declaring a relationship to the 'request' would also be legal. </p> ! <a name="rfc.section.B.6"></a><h4><a name="anchor38">Appendix B.6</a> Example of 'revisit' Record</h4> <p>If the same URI is later revisited and the content is unchanged, a --- 1418,1428 ---- </harvestmetadata> </pre> <p>Note again the same creation-date as the preceding related records. A relationship is declared o the preceding 'response' record, but declaring a relationship to the 'request' would also be legal. </p> ! <a name="anchor38"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.6"></a><h3>Appendix B.6. Example of 'revisit' Record</h3> <p>If the same URI is later revisited and the content is unchanged, a *************** *** 1396,1400 **** </revisit> </pre> - <p>Again, reference is made back to the original 'response' record. A new creation-date reflects he time of revisit. This content block --- 1445,1448 ---- *************** *** 1405,1409 **** defined. </p> ! 
<a name="rfc.section.B.7"></a><h4><a name="anchor39">Appendix B.7</a> Example of 'conversion' Record</h4> <p>At some future date, the "image/jpeg" format may no longer be --- 1453,1459 ---- defined. </p> ! <a name="anchor39"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.7"></a><h3>Appendix B.7. Example of 'conversion' Record</h3> <p>At some future date, the "image/jpeg" format may no longer be *************** *** 1421,1425 **** [3098 bytes of binary data here] </pre> - <p>An accompanying 'metadata' record, referring to this 'conversion' record, could contain additional details about the --- 1471,1474 ---- *************** *** 1427,1431 **** serve this role.) </p> ! <a name="rfc.section.B.8"></a><h4><a name="anchor40">Appendix B.8</a> Example of 'continuation' Record</h4> <p>If the 'response' above had been so large that it would not fit --- 1476,1482 ---- serve this role.) </p> ! <a name="anchor40"></a><br /><hr /> ! <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.B.8"></a><h3>Appendix B.8. Example of 'continuation' Record</h3> <p>If the 'response' above had been so large that it would not fit *************** *** 1447,1451 **** [39514114 bytes of binary data here] </pre> - <p>Note that the 'Segment-Origin-ID' refers to the first segment of the set, the one with the "Segment-Number: 1" named field. --- 1498,1501 ---- *************** *** 1460,1464 **** <td class="author-text">Burner, M. and B. Kahle, “<a href="http://www.archive.org/web/researcher/ArcFileFormat.php">The ARC File Format</a>,” September 1996.</td></tr> <tr><td class="author-text" valign="top"><a name="ARK">[ARK]</a></td> ! <td class="author-text">Kunze, J. and R. 
Rogers, “<a href="http://www.cdlib.org/inside/diglib/ark/arkspec.pdf">The ARK Persistent Identifier Scheme</a>,” February 2005.</td></tr> <tr><td class="author-text" valign="top"><a name="GUID">[GUID]</a></td> <td class="author-text">“<a href="http://en.wikipedia.org/wiki/GUID">Wikipedia: Globally Unique Identifiers</a>.”</td></tr> --- 1510,1514 ---- <td class="author-text">Burner, M. and B. Kahle, “<a href="http://www.archive.org/web/researcher/ArcFileFormat.php">The ARC File Format</a>,” September 1996.</td></tr> <tr><td class="author-text" valign="top"><a name="ARK">[ARK]</a></td> ! <td class="author-text">Kunze, J. and R. Rodgers, “<a href="http://www.cdlib.org/inside/diglib/ark/arkspec.pdf">The ARK Persistent Identifier Scheme</a>,” August 2005.</td></tr> <tr><td class="author-text" valign="top"><a name="GUID">[GUID]</a></td> <td class="author-text">“<a href="http://en.wikipedia.org/wiki/GUID">Wikipedia: Globally Unique Identifiers</a>.”</td></tr> Index: warc_file_format.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v retrieving revision 1.6 retrieving revision 1.7 diff -C2 -d -r1.6 -r1.7 *** warc_file_format.xml 22 Aug 2005 17:28:24 -0000 1.6 --- warc_file_format.xml 23 Aug 2005 17:35:41 -0000 1.7 *************** *** 121,125 **** Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briey describes the harvested content and its length. This is directly followed by the the retrieval protocol response messages and content. The motivation to revise the format arose from the --- 121,125 ---- Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briefly describes the harvested content and its length. 
This is directly followed by the the retrieval protocol response messages and content. The motivation to revise the format arose from the *************** *** 137,141 **** organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned ! metadata, abbrieviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools that --- 137,141 ---- organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned ! metadata, abbreviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools that *************** *** 367,371 **** <t>A 'revisit' record describes the revisitation of content already archived, ! and includes only an abbrieviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' record to --- 367,371 ---- <t>A 'revisit' record describes the revisitation of content already archived, ! and includes only an abbreviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' record to *************** *** 1129,1133 **** enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an ! abbrieviated and speculative illustration; the referenced WARC-specific namespace "http://archive.org/warc/0.8" has not been formally defined anywhere, and may not reflect eventual practice with --- 1129,1133 ---- enclosing WARC file that is loosely modelled after the descriptions currently used in Internet Archive ARC files. However, this is an ! 
abbreviated and speculative illustration; the referenced WARC-specific namespace "http://archive.org/warc/0.8" has not been formally defined anywhere, and may not reflect eventual practice with Index: warc_file_format.txt =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v retrieving revision 1.3 retrieving revision 1.4 diff -C2 -d -r1.3 -r1.4 *** warc_file_format.txt 18 Aug 2005 01:57:10 -0000 1.3 --- warc_file_format.txt 23 Aug 2005 17:35:41 -0000 1.4 *************** *** 120,163 **** 3. The WARC Record Model . . . . . . . . . . . . . . . . . . . . 6 4. Record Types . . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.1 'warcinfo' . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.2 'response' . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.3 'resource' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.4 'request' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.5 'metadata' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.6 'revisit' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.7 'conversion' . . . . . . . . . . . . . . . . . . . . . . . 10 ! 4.8 'continuation' . . . . . . . . . . . . . . . . . . . . . . 10 5. Record Header . . . . . . . . . . . . . . . . . . . . . . . . 12 ! 5.1 Positional Parameters . . . . . . . . . . . . . . . . . . 13 ! 5.2 Named Parameters . . . . . . . . . . . . . . . . . . . . . 14 6. Record Content Block . . . . . . . . . . . . . . . . . . . . . 17 7. Truncated and Segmented Records . . . . . . . . . . . . . . . 18 ! 7.1 Record Truncation . . . . . . . . . . . . . . . . . . . . 18 ! 7.2 Record Segmentation . . . . . . . . . . . . . . . . . . . 18 8. WARC Application to Specific Protocols . . . . . . . . . . . . 20 ! 8.1 HTTP and HTTPS . . . . . . . . . . . . . . . . . . . . . . 20 ! 8.2 DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 ! 8.3 Other Resources with URIs, and Other Protocols . . 
. . . . 21 9. Compression Recommendations . . . . . . . . . . . . . . . . . 22 ! 9.1 Record-at-a-time Compression . . . . . . . . . . . . . . . 22 ! 9.2 GZIP extra field: skip-lengths ('sl') . . . . . . . . . . 22 ! 9.3 GZIP WARC File Extension . . . . . . . . . . . . . . . . . 23 ! 10. WARC File Name and Size Recommendations . . . . . . . . . . 24 ! 11. Registration of MIME Media Type application/warc . . . . . . 25 ! 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . 26 ! 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . 27 ! A. Consideratons in Choice of record-id . . . . . . . . . . . . . 28 ! B. Examples of WARC Records . . . . . . . . . . . . . . . . . . . 29 ! B.1 Example of 'warcinfo' Record . . . . . . . . . . . . . . . 29 ! B.2 Example of 'request' Record . . . . . . . . . . . . . . . 30 ! B.3 Example of 'response' Record . . . . . . . . . . . . . . . 30 ! B.4 Example of 'resource' Record . . . . . . . . . . . . . . . 31 ! B.5 Example of 'metadata' Record . . . . . . . . . . . . . . . 31 ! B.6 Example of 'revisit' Record . . . . . . . . . . . . . . . 31 ! B.7 Example of 'conversion' Record . . . . . . . . . . . . . . 32 ! B.8 Example of 'continuation' Record . . . . . . . . . . . . . 32 ! 14. References . . . . . . . . . . . . . . . . . . . . . . . . . 33 ! Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . 34 ! Intellectual Property and Copyright Statements . . . . . . . . 36 --- 120,163 ---- 3. The WARC Record Model . . . . . . . . . . . . . . . . . . . . 6 4. Record Types . . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.1. 'warcinfo' . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.2. 'response' . . . . . . . . . . . . . . . . . . . . . . . . 8 ! 4.3. 'resource' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.4. 'request' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.5. 'metadata' . . . . . . . . . . . . . . . . . . . . . . . . 9 ! 4.6. 'revisit' . . . . . . . . . . . . . . . . . . . . . 
. . . 9 ! 4.7. 'conversion' . . . . . . . . . . . . . . . . . . . . . . . 10 ! 4.8. 'continuation' . . . . . . . . . . . . . . . . . . . . . . 10 5. Record Header . . . . . . . . . . . . . . . . . . . . . . . . 12 ! 5.1. Positional Parameters . . . . . . . . . . . . . . . . . . 13 ! 5.2. Named Parameters . . . . . . . . . . . . . . . . . . . . . 14 6. Record Content Block . . . . . . . . . . . . . . . . . . . . . 17 7. Truncated and Segmented Records . . . . . . . . . . . . . . . 18 ! 7.1. Record Truncation . . . . . . . . . . . . . . . . . . . . 18 ! 7.2. Record Segmentation . . . . . . . . . . . . . . . . . . . 18 8. WARC Application to Specific Protocols . . . . . . . . . . . . 20 ! 8.1. HTTP and HTTPS . . . . . . . . . . . . . . . . . . . . . . 20 ! 8.2. DNS . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 ! 8.3. Other Resources with URIs, and Other Protocols . . . . . . 21 9. Compression Recommendations . . . . . . . . . . . . . . . . . 22 ! 9.1. Record-at-a-time Compression . . . . . . . . . . . . . . . 22 ! 9.2. GZIP extra field: skip-lengths ('sl') . . . . . . . . . . 22 ! 9.3. GZIP WARC File Extension . . . . . . . . . . . . . . . . . 23 ! 10. WARC File Name and Size Recommendations . . . . . . . . . . . 24 ! 11. Registration of MIME Media Type application/warc . . . . . . . 25 ! 12. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 26 ! 13. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 27 ! Appendix A. Consideratons in Choice of record-id . . . . . . . . 28 ! Appendix B. Examples of WARC Records . . . . . . . . . . . . . . 29 ! Appendix B.1. Example of 'warcinfo' Record . . . . . . . . . . . . 29 ! Appendix B.2. Example of 'request' Record . . . . . . . . . . . . 30 ! Appendix B.3. Example of 'response' Record . . . . . . . . . . . . 30 ! Appendix B.4. Example of 'resource' Record . . . . . . . . . . . . 31 ! Appendix B.5. Example of 'metadata' Record . . . . . . . . . . . . 31 ! Appendix B.6. 
Example of 'revisit' Record . . . . . . . . . . . . 31 ! Appendix B.7. Example of 'conversion' Record . . . . . . . . . . . 32 ! Appendix B.8. Example of 'continuation' Record . . . . . . . . . . 32 ! 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 33 ! Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 35 ! Intellectual Property and Copyright Statements . . . . . . . . . . 36 *************** *** 182,200 **** Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briey describes the harvested content and its length. This ! is directly followed by the the retrieval protocol response messages ! and content. The motivation to revise the format arose from the ! discussion and experiences of the International Internet Preservation ! Consortium (IIPC) [IIPC], whose members include the IA and the ! national libraries of a dozen countries. The revised format is ! expected to become the primary output format of the open-source ! Heritrix [HERITRIX] web crawler, and the input format for a wide ! array of cataloguing and access tools. The WARC format generalizes the older format to better support the ! harvesting, display, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, ! abbrieviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools --- 182,202 ---- Archive (IA) to record a sequence of materials captured from the web (e.g., web "pages"). Each capture is preceded by a one-line header ! that very briefly describes the harvested content and its length. ! This is directly followed by the the retrieval protocol response ! messages and content. The motivation to revise the format arose from ! 
the discussion and experiences of the International Internet ! Preservation Consortium (IIPC) [IIPC], whose members include the IA ! and the national libraries of a dozen countries. The revised format ! is expected to be a standard way to structure, manage and store ! billions of collected web resources. For example, WARC will be an ! output format of harvesting software, such as the open-source ! Heritrix [HERITRIX] web crawler, and an input format for a wide array ! of cataloguing and access tools. The WARC format generalizes the older format to better support the ! harvesting, access, and exchange needs of archiving organizations. Besides the primary content currently recorded, the revision accommodates related secondary content, such as assigned metadata, ! abbreviated duplicate detection events, and later-date transformations. The revision may also be useful for more general applications than web archiving. To aid the development of tools *************** *** 219,224 **** - - Kunze, et al. Expires January 2, 2006 [Page 4] --- 221,224 ---- *************** *** 409,413 **** appropriate and how they can be standardized is warranted.] ! 4.1 'warcinfo' A 'warcinfo' record describes the records that follow it, up through --- 409,413 ---- appropriate and how they can be standardized is warranted.] ! 4.1. 'warcinfo' A 'warcinfo' record describes the records that follow it, up through *************** *** 436,440 **** content block must be formally defined somewhere.] ! 4.2 'response' A 'response' record contains an entire protocol response, such as a --- 436,440 ---- content block must be formally defined somewhere.] ! 4.2. 'response' A 'response' record contains an entire protocol response, such as a *************** *** 454,458 **** named parameters 'IP-Address' and 'Related-Record-ID'. ! 4.3 'resource' A 'resource' record contains a resource, without full protocol --- 454,458 ---- named parameters 'IP-Address' and 'Related-Record-ID'. ! 4.3. 
'resource' A 'resource' record contains a resource, without full protocol *************** *** 462,466 **** often includes the named parameter 'Related-Record-ID'. ! 4.4 'request' A 'request' record holds the manner in which a primary record's --- 462,466 ---- often includes the named parameter 'Related-Record-ID'. ! 4.4. 'request' A 'request' record holds the manner in which a primary record's *************** *** 469,473 **** parameter 'Related-Record-ID'. ! 4.5 'metadata' A 'metadata' record contains content created in order to further --- 469,473 ---- parameter 'Related-Record-ID'. ! 4.5. 'metadata' A 'metadata' record contains content created in order to further *************** *** 487,494 **** formally specified somewhere.] ! 4.6 'revisit' A 'revisit' record describes the revisitation of content already ! archived, and includes only an abbrieviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' --- 487,494 ---- formally specified somewhere.] ! 4.6. 'revisit' A 'revisit' record describes the revisitation of content already ! archived, and includes only an abbreviated content block which must be interpreted relative to a previous record. Most typically, a 'revisit' record is be used instead of 'response' or 'resource' *************** *** 528,532 **** somewhere.] ! 4.7 'conversion' A 'conversion' record contains an alternative version of another --- 528,532 ---- somewhere.] ! 4.7. 'conversion' A 'conversion' record contains an alternative version of another *************** *** 550,554 **** specified somewhere.] ! 4.8 'continuation' A 'continuation' record needs to be logically appended to a prior --- 550,554 ---- specified somewhere.] ! 4.8. 'continuation' A 'continuation' record needs to be logically appended to a prior *************** *** 674,678 **** ! 
5.1 Positional Parameters This section describes each of the individual positional parameters --- 674,678 ---- ! 5.1. Positional Parameters This section describes each of the individual positional parameters *************** *** 695,707 **** After proceeding this many octets from that first character of the record header, there should be two newlines and either the ! beginning of a new record or the end of the file. ! Defensive programming suggests the practice of tolerating fewer or ! more than two newlines at record's end. If the first next token ! does not match the first token of a WARC record, then the previous ! data-length should be considered in error; corrective action might ! include searching for a nearby occurrence of "warc/0.8" and other ! character patterns indicative of a legal record beginning. record-type The kind of WARC record. All record types are optional, --- 695,708 ---- After proceeding this many octets from that first character of the record header, there should be two newlines and either the ! beginning of a new record or the end of the file. (WARC reading ! implementations may choose to tolerate more or fewer newlines at ! the end of a record.) ! If the first next token does not match the first token of a WARC ! record, then the previous data-length should be considered in ! error; corrective action might include searching for a nearby ! occurrence of "warc/0.8" and other character patterns indicative ! of a legal record beginning. record-type The kind of WARC... [truncated message content] |