From: John A. K. <joh...@us...> - 2005-08-26 23:19:27
Update of /cvsroot/archive-access/archive-access/src/docs/warc In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20105 Modified Files: warc_file_format.html warc_file_format.txt warc_file_format.xml Log Message: added proposed text for a Warcinfo-ID named parameter Index: warc_file_format.html =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v retrieving revision 1.5 retrieving revision 1.6 diff -C2 -d -r1.5 -r1.6 *** warc_file_format.html 24 Aug 2005 01:39:51 -0000 1.5 --- warc_file_format.html 26 Aug 2005 23:19:18 -0000 1.6 *************** *** 234,238 **** GZIP extra field: skip-lengths ('sl')<br /> <a href="#anchor26">9.3.</a> ! GZIP WARC File Extension<br /> <a href="#anchor27">10.</a> WARC File Name and Size Recommendations<br /> --- 234,238 ---- GZIP extra field: skip-lengths ('sl')<br /> <a href="#anchor26">9.3.</a> ! GZIP WARC File Name Suffix<br /> <a href="#anchor27">10.</a> WARC File Name and Size Recommendations<br /> *************** *** 406,412 **** record <span class="emph">data-length</span>. </p> ! <p>It is customary, and recommended, that the first record of a WARC ! describe the file itself, using the 'warcinfo' record-type, and a ! descriptive content block format. </p> <p>Subsequent records contain content blocks that are either the --- 406,415 ---- record <span class="emph">data-length</span>. </p> ! <p>It is often the case that the first record of a WARC to has the ! record-type 'warcinfo' and is used to describe the records that follow it. ! It is always the case that the concatenation of any two WARC files is a ! syntactically correct WARC file; care should be taken, however, when ! concatenation would inadvertently cause 'warcinfo' records to appear ! at points in the result that would create confusion. 
</p> <p>Subsequent records contain content blocks that are either the *************** *** 851,854 **** --- 854,873 ---- </dd> + <dt>Warcinfo-ID: record-id</dt> + <dd> + When present, indicates the record-id of the associated 'warcinfo' + record for this record. Typically, the Warcinfo-ID parameter is used + when the context of the applicable 'warcinfo' record is unavailable, + such as after distributing single records into separate WARC files. + WARC writing applications (such web crawlers) may choose to record + this parameter routinely (e.g., before computing checksums). + + The Warcinfo-ID parameter overrides any association with a previously + occurring (in the WARC) 'warcinfo' record, thus providing a way to protect + the true association when records are combined from different WARCs. + Use of this parameter in a record of type 'warcinfo' is undefined and + reserved for possible future extension. + + </dd> </dl></blockquote> <a name="anchor15"></a><br /><hr /> *************** *** 1113,1124 **** <a name="anchor26"></a><br /><hr /> <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.9.3"></a><h3>9.3. GZIP WARC File Extension</h3> ! <p>WARC files compressed with the above conventions remain legal GZIP ! files. Thus, to ensure they are properly recognized by GZIP tools, they ! should only get the customary additional ".gz" file extension suffix, ! making their suffix ".warc.gz". Software which works with WARC files ! compressed using these conventions will detect and exploit them; other ! GZIP software will harmlessly ignore the extensions. </p> <a name="anchor27"></a><br /><hr /> --- 1132,1143 ---- <a name="anchor26"></a><br /><hr /> <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table> ! <a name="rfc.section.9.3"></a><h3>9.3. 
GZIP WARC File Name Suffix</h3> ! <p>A WARC file compressed with the extra GZIP field conventions described ! in this document is a legal GZIP file. To ensure that it is properly ! recognized by GZIP tools, its name should have the customary ".gz" ! appended to it, making the complete suffix, ".warc.gz". ! GZIP software that does not recognize the extra GZIP fields will ! simply pass over them without benefit or harm. </p> <a name="anchor27"></a><br /><hr /> Index: warc_file_format.xml =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v retrieving revision 1.9 retrieving revision 1.10 diff -C2 -d -r1.9 -r1.10 *** warc_file_format.xml 26 Aug 2005 22:29:40 -0000 1.9 --- warc_file_format.xml 26 Aug 2005 23:19:18 -0000 1.10 *************** *** 260,266 **** record <spanx style="emph">data-length</spanx>.</t> ! <t>It is customary, and recommended, that the first record of a WARC ! describe the file itself, using the 'warcinfo' record-type, and a ! descriptive content block format.</t> <t>Subsequent records contain content blocks that are either the --- 260,269 ---- record <spanx style="emph">data-length</spanx>.</t> ! <t>It is often the case that the first record of a WARC to has the ! record-type 'warcinfo' and is used to describe the records that follow it. ! It is always the case that the concatenation of any two WARC files is a ! syntactically correct WARC file; care should be taken, however, when ! concatenation would inadvertently cause 'warcinfo' records to appear ! at points in the result that would create confusion.</t> <t>Subsequent records contain content blocks that are either the *************** *** 680,683 **** --- 683,701 ---- </t> + <t hangText="Warcinfo-ID: record-id"> + When present, indicates the record-id of the associated 'warcinfo' + record for this record. 
Typically, the Warcinfo-ID parameter is used + when the context of the applicable 'warcinfo' record is unavailable, + such as after distributing single records into separate WARC files. + WARC writing applications (such web crawlers) may choose to record + this parameter routinely (e.g., before computing checksums). + + The Warcinfo-ID parameter overrides any association with a previously + occurring (in the WARC) 'warcinfo' record, thus providing a way to protect + the true association when records are combined from different WARCs. + Use of this parameter in a record of type 'warcinfo' is undefined and + reserved for possible future extension. + </t> + </list> Index: warc_file_format.txt =================================================================== RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v retrieving revision 1.4 retrieving revision 1.5 diff -C2 -d -r1.4 -r1.5 *** warc_file_format.txt 23 Aug 2005 17:35:41 -0000 1.4 --- warc_file_format.txt 26 Aug 2005 23:19:18 -0000 1.5 *************** *** 142,146 **** 9.1. Record-at-a-time Compression . . . . . . . . . . . . . . . 22 9.2. GZIP extra field: skip-lengths ('sl') . . . . . . . . . . 22 ! 9.3. GZIP WARC File Extension . . . . . . . . . . . . . . . . . 23 10. WARC File Name and Size Recommendations . . . . . . . . . . . 24 11. Registration of MIME Media Type application/warc . . . . . . . 25 --- 142,146 ---- 9.1. Record-at-a-time Compression . . . . . . . . . . . . . . . 22 9.2. GZIP extra field: skip-lengths ('sl') . . . . . . . . . . 22 ! 9.3. GZIP WARC File Name Suffix . . . . . . . . . . . . . . . . 23 10. WARC File Name and Size Recommendations . . . . . . . . . . . 24 11. Registration of MIME Media Type application/warc . . . . . . . 25 *************** *** 342,348 **** _data-length_. ! It is customary, and recommended, that the first record of a WARC ! describe the file itself, using the 'warcinfo' record-type, and a ! descriptive content block format. 
Subsequent records contain content blocks that are either the direct --- 342,352 ---- _data-length_. ! It is often the case that the first record of a WARC to has the ! record-type 'warcinfo' and is used to describe the records that ! follow it. It is always the case that the concatenation of any two ! WARC files is a syntactically correct WARC file; care should be ! taken, however, when concatenation would inadvertently cause ! 'warcinfo' records to appear at points in the result that would ! create confusion. Subsequent records contain content blocks that are either the direct *************** *** 385,392 **** - - - - Kunze, et al. Expires January 2, 2006 [Page 7] --- 389,392 ---- *************** *** 474,480 **** describe, explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost ! always refer to another record of another type, with hat other record ! holding original harvested or transformed content. (However, it is ! allowable for a 'metadata' record to refer to any record type, including other 'metadata' records, or to refer to no other individual record at all.) Any number of metadata records may be --- 474,480 ---- describe, explain, or accompany a harvested resource, in ways not covered by other record types. A 'metadata' record will almost ! always refer to another record of another type, with that other ! record holding original harvested or transformed content. (However, ! it is allowable for a 'metadata' record to refer to any record type, including other 'metadata' records, or to refer to no other individual record at all.) Any number of metadata records may be *************** *** 506,510 **** ! preferred if the current record's is understandable standing alone. (It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.) --- 506,510 ---- ! preferred if the current record is understandable standing alone. 
(It is not required that any revisit of a previously-visited URI use 'revisit', only those which refer back to other records.) *************** *** 532,544 **** A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival ! process. Typically, this is used to hold content ransformations that ! maintain viability of content after widely available rendering ools ! for the originally stored format disappear. As needed, the original ! content may be migrated (transformed) to a more viable format in ! order to keep the information usable with current tools while ! minimizing loss of information (intellectual content, look and feel, ! etc). Any number of transformation records may be created that reference a specific source record, which may itself contain ! ransformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original record. Metadata records may be used to further describe --- 532,544 ---- A 'conversion' record contains an alternative version of another record's content that was created as the result of an archival ! process. Typically, this is used to hold content transformations ! that maintain viability of content after widely available rendering ! tools for the originally stored format disappear. As needed, the ! original content may be migrated (transformed) to a more viable ! format in order to keep the information usable with current tools ! while minimizing loss of information (intellectual content, look and ! feel, etc). Any number of transformation records may be created that reference a specific source record, which may itself contain ! transformed content. Each transformation should result in a freestanding, complete record, with no dependency on survival of the original record. 
Metadata records may be used to further describe *************** *** 711,715 **** subject-uri The original URI whose collection gave rise to the ! information content in this record. In he context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', --- 711,715 ---- subject-uri The original URI whose collection gave rise to the ! information content in this record. In the context of web harvesting, this is the URI that was the target of a crawler's retrieval request. Indirectly, such as for a 'revisit', *************** *** 717,725 **** uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a ! synthesized value for the creation name of he WARC file, as a URI. Care should be taken to ensure that the URI in this value is - properly escaped (per [RFC2396] and that it is written with no --- 717,725 ---- uri appearing in the original record to which the newer record pertains. For a 'warcinfo' record, this parameter is given a ! synthesized value for the creation name of the WARC file, as a ! URI. Care should be taken to ensure that the URI in this value is *************** *** 730,733 **** --- 730,734 ---- + properly escaped (per [RFC2396] and that it is written with no internal whitespace. *************** *** 780,784 **** - Kunze, et al. Expires January 2, 2006 [Page 14] --- 781,784 ---- *************** *** 825,829 **** A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the Appendix about ! record-id considerations. This creates satellite record- ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some --- 825,829 ---- A potential strategy, after choosing one record to be primary, is to extend its record-id as described in the Appendix about ! record-id considerations. 
This creates satellite record-ids for related records that contain the primary record-id as an initial substring, which greatly optimizes the detection (and in some *************** *** 850,871 **** Truncated: reason-token When present, indicates that the current record ends before the apparent end of the source material, but no ! continuation records are forthcoming. Possible values indicate he ! reason for the truncation: 'length' for exceeding a desired length ! limit; 'time' for exceeding a desired time limit during collection. ! ! ! ! ! ! ! ! ! ! ! ! ! --- 850,871 ---- Truncated: reason-token When present, indicates that the current record ends before the apparent end of the source material, but no ! continuation records are forthcoming. Possible values indicate ! the reason for the truncation: 'length' for exceeding a desired ! length limit; 'time' for exceeding a desired time limit during collection. ! Warcinfo-ID: record-id When present, indicates the record-id of the ! associated 'warcinfo' record for this record. Typically, the ! Warcinfo-ID parameter is used when the context of the applicable ! 'warcinfo' record is unavailable, such as after distributing ! single records into separate WARC files. WARC writing ! applications (such web crawlers) may choose to record this ! parameter routinely (e.g., before computing checksums). The ! Warcinfo-ID parameter overrides any association with a previously ! occurring (in the WARC) 'warcinfo' record, thus providing a way to ! protect the true association when records are combined from ! different WARCs. Use of this parameter in a record of type ! 'warcinfo' is undefined and reserved for possible future ! extension. *************** *** 974,978 **** records to be written without know their ultimate length, with only a small fixed-size edit to the header when the length is eventually ! know to complete the record. 
This named-field-based mechanism does not allow a later discovery that a record needs truncation or segmentation to be reflected via a small header edit; it requires --- 974,978 ---- records to be written without know their ultimate length, with only a small fixed-size edit to the header when the length is eventually ! known to complete the record. This named-field-based mechanism does not allow a later discovery that a record needs truncation or segmentation to be reflected via a small header edit; it requires *************** *** 1011,1015 **** with an incremented 'Segment-Number' field. They must also include a ! 'Segment-Origin-ID' field with a value of he Record-ID of the record containing the first segment of the set. All segments of a set must have identical subject-uri parameters. --- 1011,1015 ---- with an incremented 'Segment-Number' field. They must also include a ! 'Segment-Origin-ID' field with a value of the Record-ID of the record containing the first segment of the set. All segments of a set must have identical subject-uri parameters. *************** *** 1140,1144 **** Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file ! under a 'resource' type record. This includes files hat have meaningful URIs retrieved from a locally-accessible filesystem or other repository. --- 1140,1144 ---- Any resource that can be identified with a URI, even if it is not retrieved via an Internet operation, may be archived in a WARC file ! under a 'resource' type record. This includes files that have meaningful URIs retrieved from a locally-accessible filesystem or other repository. *************** *** 1184,1190 **** However, experience with the precursor ARC format at the Internet ! Archive has demonstrated hat applying simple standard compression can ! result in significant storage savings, while preserving random access ! to individual records. 
For this purpose, the GZIP format with customary "deflate" --- 1184,1190 ---- However, experience with the precursor ARC format at the Internet ! Archive has demonstrated that applying simple standard compression ! can result in significant storage savings, while preserving random ! access to individual records. For this purpose, the GZIP format with customary "deflate" *************** *** 1221,1229 **** Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after reading a ! small portion of a record, would like to skip to he next full record. ! In the absence of an external, precalculated index, using only the ! WARC record's uncompressed length would require the complete current ! record to be decompressed o find the start of the next record. ! --- 1221,1229 ---- Customarily, GZIP members do not declare their compressed length. This presents a problem for WARC processing which, after reading a ! small portion of a record, would like to skip to the next full ! record. In the absence of an external, precalculated index, using ! only the WARC record's uncompressed length would require the complete ! current record to be decompressed to find the start of the next ! record. *************** *** 1264,1275 **** appropriate. ! 9.3. GZIP WARC File Extension ! WARC files compressed with the above conventions remain legal GZIP ! files. Thus, to ensure hey are properly recognized by GZIP tools, ! they should only get the customary additional ".gz" file extension ! suffix, making their suffix ".warc.gz". Software which works with ! WARC files compressed using these conventions will detect and exploit ! them; other GZIP software will harmlessly ignore the extensions. --- 1264,1275 ---- appropriate. ! 9.3. GZIP WARC File Name Suffix ! A WARC file compressed with the extra GZIP field conventions ! described in this document is a legal GZIP file. To ensure that it ! 
is properly recognized by GZIP tools, its name should have the ! customary ".gz" appended to it, making the complete suffix, ! ".warc.gz". GZIP software that does not recognize the extra GZIP ! fields will simply pass over them without benefit or harm. *************** *** 1300,1304 **** Prefix is an abbreviation usually reflective of the project or crawl ! that created this file. imestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often --- 1300,1304 ---- Prefix is an abbreviation usually reflective of the project or crawl ! that created this file. Timestamp is a 14-digit GMT timestamp indicating the time the file was initially begun. Serial is an increasing serial-number within the process creating the files, often *************** *** 1314,1319 **** This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted ! within WARC-creating institutions. he file name prefix "iipc" should ! be avoided unless participating in the IIPC naming registry. [REVIEW ISSUE: Discover sense of the group for what naming and --- 1314,1319 ---- This specification does not require any particular WARC file naming practice, but recommends conventions similar to the above be adopted ! within WARC-creating institutions. The file name prefix "iipc" ! should be avoided unless participating in the IIPC naming registry. [REVIEW ISSUE: Discover sense of the group for what naming and *************** *** 1405,1409 **** After IESG approval, IANA is expected to register the WARC type ! "application/warc" using he application provided in this document. --- 1405,1409 ---- After IESG approval, IANA is expected to register the WARC type ! "application/warc" using the application provided in this document. *************** *** 1461,1465 **** This document could not have been written without major contributions ! 
from participants of he International Internet Preservation Consortium, especially Steen Christensen, and Julien Masanes. --- 1461,1465 ---- This document could not have been written without major contributions ! from participants of the International Internet Preservation Consortium, especially Steen Christensen, and Julien Masanes. *************** *** 1534,1538 **** blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to ! convey relatedness in he context of a single WARC file, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a --- 1534,1538 ---- blocks. Although the 'Related-Record-ID' parameter required of 'metadata', 'revisit', and 'conversion' records is sufficient to ! convey relatedness in the context of a single WARC file, great optimization can be had when relatedness can be inferred by third parties through identifier comparison rather than by lookup in a *************** *** 1595,1602 **** <?xml version="1.0" encoding="UTF-8" standalone="yes"?> ! <warcmetadata> ! xmlns:dc="http://purl.org/dc/elements/1.1/" ! xmlns:dcterms="http://purl.org/dc/terms/" ! xmlns:warc="http://archive.org/warc/0.8/"> <warc:software> Heritrix 1.4.0 http://crawler.archive.org --- 1595,1602 ---- <?xml version="1.0" encoding="UTF-8" standalone="yes"?> ! <warcmetadata ! xmlns:dc="http://purl.org/dc/elements/1.1/" ! xmlns:dcterms="http://purl.org/dc/terms/" ! xmlns:warc="http://archive.org/warc/0.8/"> <warc:software> Heritrix 1.4.0 http://crawler.archive.org *************** *** 1611,1615 **** </warc:http-header-user-agent> <dc:format>WARC file version 0.8</dc:format> ! <dcterms:conformsTo nxsi:type="dcterms:URI"> http://www.archive.org/documents/WarcFileFormat.php </dcterms:conformsTo> --- 1611,1615 ---- </warc:http-header-user-agent> <dc:format>WARC file version 0.8</dc:format> ! 
<dcterms:conformsTo xsi:type="dcterms:URI"> http://www.archive.org/documents/WarcFileFormat.php </dcterms:conformsTo> *************** *** 1754,1763 **** Again, reference is made back to the original 'response' record. A ! new creation-date reflects he time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) ! The actual formats for describing he result of a revisit remain to be ! defined. Appendix B.7. Example of 'conversion' Record --- 1754,1763 ---- Again, reference is made back to the original 'response' record. A ! new creation-date reflects the time of revisit. This content block hypothesizes including header excerpts from a server response to explain the results of the revisit. (In this case, the remote server indicated the resource was unchanged from the previous 'Etag' value.) ! The actual formats for describing the result of a revisit remain to ! be defined. Appendix B.7. Example of 'conversion' Record |
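
The Warcinfo-ID text added in this commit can be read as a simple lookup rule: a record is associated with the nearest preceding 'warcinfo' record unless it carries its own Warcinfo-ID, which takes precedence. The following Python sketch illustrates that reading only. The dictionary-based record representation and the function name associate_warcinfo are assumptions made for illustration and are not part of the draft; parsing of real WARC records is not shown.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical in-memory record: a dict of already-parsed header fields.
Record = Dict[str, str]


def associate_warcinfo(
    records: Iterable[Record],
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (record-id, associated warcinfo record-id) for each record.

    Assumed reading of the proposed text:
      * a 'warcinfo' record becomes the context for the records after it;
      * an explicit Warcinfo-ID on a record overrides that context;
      * Warcinfo-ID on a 'warcinfo' record is undefined in the draft, so
        this sketch ignores it and reports no association for such records.
    """
    current: Optional[str] = None  # record-id of the latest 'warcinfo' seen
    for rec in records:
        if rec["record-type"] == "warcinfo":
            current = rec["record-id"]
            yield rec["record-id"], None
            continue
        # Explicit parameter wins; otherwise fall back to positional context.
        yield rec["record-id"], rec.get("Warcinfo-ID", current)


# Made-up records, as might result from combining two WARCs.
example = [
    {"record-type": "warcinfo", "record-id": "info-A"},
    {"record-type": "response", "record-id": "r1"},
    {"record-type": "response", "record-id": "r2", "Warcinfo-ID": "info-B"},
    {"record-type": "resource", "record-id": "r3"},
]
associations = list(associate_warcinfo(example))
```

In this example, r2 keeps its association with info-B even though it now sits after info-A, which is the protection the log message describes for records combined from different WARCs.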