[Archive-access-cvs] archive-access/src/docs/warc warc_file_format.html,1.5,1.6 warc_file_format.txt,1.4,1.5 warc_file_format.xml,1.9,1.10

Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv20105

Modified Files:
	warc_file_format.html warc_file_format.txt 
	warc_file_format.xml 
Log Message:
added proposed text for a Warcinfo-ID named parameter

Index: warc_file_format.html
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v
retrieving revision 1.5
retrieving revision 1.6
diff -C2 -d -r1.5 -r1.6
*** warc_file_format.html	24 Aug 2005 01:39:51 -0000	1.5
--- warc_file_format.html	26 Aug 2005 23:19:18 -0000	1.6
***************
*** 234,238 ****
  GZIP extra field: skip-lengths ('sl')<br />
  &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor26">9.3.</a>&nbsp;
! GZIP WARC File Extension<br />
  <a href="#anchor27">10.</a>&nbsp;
  WARC File Name and Size Recommendations<br />
--- 234,238 ----
  GZIP extra field: skip-lengths ('sl')<br />
  &nbsp;&nbsp;&nbsp;&nbsp;<a href="#anchor26">9.3.</a>&nbsp;
! GZIP WARC File Name Suffix<br />
  <a href="#anchor27">10.</a>&nbsp;
  WARC File Name and Size Recommendations<br />
***************
*** 406,412 ****
  record <span class="emph">data-length</span>.
  </p>
! <p>It is customary, and recommended, that the first record of a WARC
! describe the file itself, using the 'warcinfo' record-type, and a
! descriptive content block format.
  </p>
  <p>Subsequent records contain content blocks that are either the
--- 406,415 ----
  record <span class="emph">data-length</span>.
  </p>
! <p>It is often the case that the first record of a WARC to has the
! record-type 'warcinfo' and is used to describe the records that follow it.
! It is always the case that the concatenation of any two WARC files is a
! syntactically correct WARC file; care should be taken, however, when
! concatenation would inadvertently cause 'warcinfo' records to appear
! at points in the result that would create confusion.
  </p>
  <p>Subsequent records contain content blocks that are either the
***************
*** 851,854 ****
--- 854,873 ----

  </dd>
+ <dt>Warcinfo-ID: record-id</dt>
+ <dd>
+ When present, indicates the record-id of the associated 'warcinfo'
+ record for this record.  Typically, the Warcinfo-ID parameter is used
+ when the context of the applicable 'warcinfo' record is unavailable,
+ such as after distributing single records into separate WARC files.
+ WARC writing applications (such web crawlers) may choose to record
+ this parameter routinely (e.g., before computing checksums).
+ 
+ The Warcinfo-ID parameter overrides any association with a previously
+ occurring (in the WARC) 'warcinfo' record, thus providing a way to protect
+ the true association when records are combined from different WARCs.
+ Use of this parameter in a record of type 'warcinfo' is undefined and
+ reserved for possible future extension.
+  
+ </dd>
  </dl></blockquote>
  <a name="anchor15"></a><br /><hr />
***************
*** 1113,1124 ****
  <a name="anchor26"></a><br /><hr />
  <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.9.3"></a><h3>9.3.&nbsp;GZIP WARC File Extension</h3>

! <p>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure they are properly recognized by GZIP tools, they
! should only get the customary additional ".gz" file extension suffix,
! making their suffix ".warc.gz". Software which works with WARC files
! compressed using these conventions will detect and exploit them; other
! GZIP software will harmlessly ignore the extensions.
  </p>
  <a name="anchor27"></a><br /><hr />
--- 1132,1143 ----
  <a name="anchor26"></a><br /><hr />
  <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2">&nbsp;TOC&nbsp;</a></td></tr></table>
! <a name="rfc.section.9.3"></a><h3>9.3.&nbsp;GZIP WARC File Name Suffix</h3>

! <p>A WARC file compressed with the extra GZIP field conventions described
! in this document is a legal GZIP file.  To ensure that it is properly
! recognized by GZIP tools, its name should have the customary ".gz"
! appended to it, making the complete suffix, ".warc.gz".
! GZIP software that does not recognize the extra GZIP fields will
! simply pass over them without benefit or harm.
  </p>
  <a name="anchor27"></a><br /><hr />

Index: warc_file_format.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v
retrieving revision 1.9
retrieving revision 1.10
diff -C2 -d -r1.9 -r1.10
*** warc_file_format.xml	26 Aug 2005 22:29:40 -0000	1.9
--- warc_file_format.xml	26 Aug 2005 23:19:18 -0000	1.10
***************
*** 260,266 ****
  record <spanx style="emph">data-length</spanx>.</t>

! <t>It is customary, and recommended, that the first record of a WARC
! describe the file itself, using the 'warcinfo' record-type, and a
! descriptive content block format.</t>

  <t>Subsequent records contain content blocks that are either the
--- 260,269 ----
  record <spanx style="emph">data-length</spanx>.</t>

! <t>It is often the case that the first record of a WARC to has the
! record-type 'warcinfo' and is used to describe the records that follow it.
! It is always the case that the concatenation of any two WARC files is a
! syntactically correct WARC file; care should be taken, however, when
! concatenation would inadvertently cause 'warcinfo' records to appear
! at points in the result that would create confusion.</t>

  <t>Subsequent records contain content blocks that are either the
***************
*** 680,683 ****
--- 683,701 ----
   </t>

+  <t hangText="Warcinfo-ID: record-id">
+ When present, indicates the record-id of the associated 'warcinfo'
+ record for this record.  Typically, the Warcinfo-ID parameter is used
+ when the context of the applicable 'warcinfo' record is unavailable,
+ such as after distributing single records into separate WARC files.
+ WARC writing applications (such web crawlers) may choose to record
+ this parameter routinely (e.g., before computing checksums).
+ 
+ The Warcinfo-ID parameter overrides any association with a previously
+ occurring (in the WARC) 'warcinfo' record, thus providing a way to protect
+ the true association when records are combined from different WARCs.
+ Use of this parameter in a record of type 'warcinfo' is undefined and
+ reserved for possible future extension.
+  </t>
+ 
  </list>

Index: warc_file_format.txt
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** warc_file_format.txt	23 Aug 2005 17:35:41 -0000	1.4
--- warc_file_format.txt	26 Aug 2005 23:19:18 -0000	1.5
***************
*** 142,146 ****
       9.1.  Record-at-a-time Compression . . . . . . . . . . . . . . . 22
       9.2.  GZIP extra field: skip-lengths ('sl')  . . . . . . . . . . 22
!      9.3.  GZIP WARC File Extension . . . . . . . . . . . . . . . . . 23
     10. WARC File Name and Size Recommendations  . . . . . . . . . . . 24
     11. Registration of MIME Media Type application/warc . . . . . . . 25
--- 142,146 ----
       9.1.  Record-at-a-time Compression . . . . . . . . . . . . . . . 22
       9.2.  GZIP extra field: skip-lengths ('sl')  . . . . . . . . . . 22
!      9.3.  GZIP WARC File Name Suffix . . . . . . . . . . . . . . . . 23
     10. WARC File Name and Size Recommendations  . . . . . . . . . . . 24
     11. Registration of MIME Media Type application/warc . . . . . . . 25
***************
*** 342,348 ****
     _data-length_.

!    It is customary, and recommended, that the first record of a WARC
!    describe the file itself, using the 'warcinfo' record-type, and a
!    descriptive content block format.

     Subsequent records contain content blocks that are either the direct
--- 342,352 ----
     _data-length_.

!    It is often the case that the first record of a WARC to has the
!    record-type 'warcinfo' and is used to describe the records that
!    follow it.  It is always the case that the concatenation of any two
!    WARC files is a syntactically correct WARC file; care should be
!    taken, however, when concatenation would inadvertently cause
!    'warcinfo' records to appear at points in the result that would
!    create confusion.

     Subsequent records contain content blocks that are either the direct
***************
*** 385,392 ****

- 
- 
- 
- 
  Kunze, et al.            Expires January 2, 2006                [Page 7]

--- 389,392 ----
***************
*** 474,480 ****
     describe, explain, or accompany a harvested resource, in ways not
     covered by other record types.  A 'metadata' record will almost
!    always refer to another record of another type, with hat other record
!    holding original harvested or transformed content.  (However, it is
!    allowable for a 'metadata' record to refer to any record type,
     including other 'metadata' records, or to refer to no other
     individual record at all.)  Any number of metadata records may be
--- 474,480 ----
     describe, explain, or accompany a harvested resource, in ways not
     covered by other record types.  A 'metadata' record will almost
!    always refer to another record of another type, with that other
!    record holding original harvested or transformed content.  (However,
!    it is allowable for a 'metadata' record to refer to any record type,
     including other 'metadata' records, or to refer to no other
     individual record at all.)  Any number of metadata records may be
***************
*** 506,510 ****

!    preferred if the current record's is understandable standing alone.
     (It is not required that any revisit of a previously-visited URI use
     'revisit', only those which refer back to other records.)
--- 506,510 ----

!    preferred if the current record is understandable standing alone.
     (It is not required that any revisit of a previously-visited URI use
     'revisit', only those which refer back to other records.)
***************
*** 532,544 ****
     A 'conversion' record contains an alternative version of another
     record's content that was created as the result of an archival
!    process.  Typically, this is used to hold content ransformations that
!    maintain viability of content after widely available rendering ools
!    for the originally stored format disappear.  As needed, the original
!    content may be migrated (transformed) to a more viable format in
!    order to keep the information usable with current tools while
!    minimizing loss of information (intellectual content, look and feel,
!    etc).  Any number of transformation records may be created that
     reference a specific source record, which may itself contain
!    ransformed content.  Each transformation should result in a
     freestanding, complete record, with no dependency on survival of the
     original record.  Metadata records may be used to further describe
--- 532,544 ----
     A 'conversion' record contains an alternative version of another
     record's content that was created as the result of an archival
!    process.  Typically, this is used to hold content transformations
!    that maintain viability of content after widely available rendering
!    tools for the originally stored format disappear.  As needed, the
!    original content may be migrated (transformed) to a more viable
!    format in order to keep the information usable with current tools
!    while minimizing loss of information (intellectual content, look and
!    feel, etc).  Any number of transformation records may be created that
     reference a specific source record, which may itself contain
!    transformed content.  Each transformation should result in a
     freestanding, complete record, with no dependency on survival of the
     original record.  Metadata records may be used to further describe
***************
*** 711,715 ****

     subject-uri The original URI whose collection gave rise to the
!       information content in this record.  In he context of web
        harvesting, this is the URI that was the target of a crawler's
        retrieval request.  Indirectly, such as for a 'revisit',
--- 711,715 ----

     subject-uri The original URI whose collection gave rise to the
!       information content in this record.  In the context of web
        harvesting, this is the URI that was the target of a crawler's
        retrieval request.  Indirectly, such as for a 'revisit',
***************
*** 717,725 ****
        uri appearing in the original record to which the newer record
        pertains.  For a 'warcinfo' record, this parameter is given a
!       synthesized value for the creation name of he WARC file, as a URI.

        Care should be taken to ensure that the URI in this value is
-       properly escaped (per [RFC2396] and that it is written with no

--- 717,725 ----
        uri appearing in the original record to which the newer record
        pertains.  For a 'warcinfo' record, this parameter is given a
!       synthesized value for the creation name of the WARC file, as a
!       URI.

        Care should be taken to ensure that the URI in this value is

***************
*** 730,733 ****
--- 730,734 ----

+       properly escaped (per [RFC2396] and that it is written with no
        internal whitespace.

***************
*** 780,784 ****

- 
  Kunze, et al.            Expires January 2, 2006               [Page 14]

--- 781,784 ----
***************
*** 825,829 ****
        A potential strategy, after choosing one record to be primary, is
        to extend its record-id as described in the Appendix about
!       record-id considerations.  This creates satellite record- ids for
        related records that contain the primary record-id as an initial
        substring, which greatly optimizes the detection (and in some
--- 825,829 ----
        A potential strategy, after choosing one record to be primary, is
        to extend its record-id as described in the Appendix about
!       record-id considerations.  This creates satellite record-ids for
        related records that contain the primary record-id as an initial
        substring, which greatly optimizes the detection (and in some
***************
*** 850,871 ****
     Truncated: reason-token When present, indicates that the current
        record ends before the apparent end of the source material, but no
!       continuation records are forthcoming.  Possible values indicate he
!       reason for the truncation: 'length' for exceeding a desired length
!       limit; 'time' for exceeding a desired time limit during
        collection.

! 
! 
! 
! 
! 
! 
! 
! 
! 
! 
! 
! 
! 

--- 850,871 ----
     Truncated: reason-token When present, indicates that the current
        record ends before the apparent end of the source material, but no
!       continuation records are forthcoming.  Possible values indicate
!       the reason for the truncation: 'length' for exceeding a desired
!       length limit; 'time' for exceeding a desired time limit during
        collection.

!    Warcinfo-ID: record-id When present, indicates the record-id of the
!       associated 'warcinfo' record for this record.  Typically, the
!       Warcinfo-ID parameter is used when the context of the applicable
!       'warcinfo' record is unavailable, such as after distributing
!       single records into separate WARC files.  WARC writing
!       applications (such web crawlers) may choose to record this
!       parameter routinely (e.g., before computing checksums).  The
!       Warcinfo-ID parameter overrides any association with a previously
!       occurring (in the WARC) 'warcinfo' record, thus providing a way to
!       protect the true association when records are combined from
!       different WARCs.  Use of this parameter in a record of type
!       'warcinfo' is undefined and reserved for possible future
!       extension.

***************
*** 974,978 ****
     records to be written without know their ultimate length, with only a
     small fixed-size edit to the header when the length is eventually
!    know to complete the record.  This named-field-based mechanism does
     not allow a later discovery that a record needs truncation or
     segmentation to be reflected via a small header edit; it requires
--- 974,978 ----
     records to be written without know their ultimate length, with only a
     small fixed-size edit to the header when the length is eventually
!    known to complete the record.  This named-field-based mechanism does
     not allow a later discovery that a record needs truncation or
     segmentation to be reflected via a small header edit; it requires
***************
*** 1011,1015 ****

     with an incremented 'Segment-Number' field.  They must also include a
!    'Segment-Origin-ID' field with a value of he Record-ID of the record
     containing the first segment of the set.  All segments of a set must
     have identical subject-uri parameters.
--- 1011,1015 ----

     with an incremented 'Segment-Number' field.  They must also include a
!    'Segment-Origin-ID' field with a value of the Record-ID of the record
     containing the first segment of the set.  All segments of a set must
     have identical subject-uri parameters.
***************
*** 1140,1144 ****
     Any resource that can be identified with a URI, even if it is not
     retrieved via an Internet operation, may be archived in a WARC file
!    under a 'resource' type record.  This includes files hat have
     meaningful URIs retrieved from a locally-accessible filesystem or
     other repository.
--- 1140,1144 ----
     Any resource that can be identified with a URI, even if it is not
     retrieved via an Internet operation, may be archived in a WARC file
!    under a 'resource' type record.  This includes files that have
     meaningful URIs retrieved from a locally-accessible filesystem or
     other repository.
***************
*** 1184,1190 ****

     However, experience with the precursor ARC format at the Internet
!    Archive has demonstrated hat applying simple standard compression can
!    result in significant storage savings, while preserving random access
!    to individual records.

     For this purpose, the GZIP format with customary "deflate"
--- 1184,1190 ----

     However, experience with the precursor ARC format at the Internet
!    Archive has demonstrated that applying simple standard compression
!    can result in significant storage savings, while preserving random
!    access to individual records.

     For this purpose, the GZIP format with customary "deflate"
***************
*** 1221,1229 ****
     Customarily, GZIP members do not declare their compressed length.
     This presents a problem for WARC processing which, after reading a
!    small portion of a record, would like to skip to he next full record.
!    In the absence of an external, precalculated index, using only the
!    WARC record's uncompressed length would require the complete current
!    record to be decompressed o find the start of the next record.
! 

--- 1221,1229 ----
     Customarily, GZIP members do not declare their compressed length.
     This presents a problem for WARC processing which, after reading a
!    small portion of a record, would like to skip to the next full
!    record.  In the absence of an external, precalculated index, using
!    only the WARC record's uncompressed length would require the complete
!    current record to be decompressed to find the start of the next
!    record.

***************
*** 1264,1275 ****
     appropriate.

! 9.3.  GZIP WARC File Extension

!    WARC files compressed with the above conventions remain legal GZIP
!    files.  Thus, to ensure hey are properly recognized by GZIP tools,
!    they should only get the customary additional ".gz" file extension
!    suffix, making their suffix ".warc.gz".  Software which works with
!    WARC files compressed using these conventions will detect and exploit
!    them; other GZIP software will harmlessly ignore the extensions.

--- 1264,1275 ----
     appropriate.

! 9.3.  GZIP WARC File Name Suffix

!    A WARC file compressed with the extra GZIP field conventions
!    described in this document is a legal GZIP file.  To ensure that it
!    is properly recognized by GZIP tools, its name should have the
!    customary ".gz" appended to it, making the complete suffix,
!    ".warc.gz".  GZIP software that does not recognize the extra GZIP
!    fields will simply pass over them without benefit or harm.

***************
*** 1300,1304 ****

     Prefix is an abbreviation usually reflective of the project or crawl
!    that created this file. imestamp is a 14-digit GMT timestamp
     indicating the time the file was initially begun.  Serial is an
     increasing serial-number within the process creating the files, often
--- 1300,1304 ----

     Prefix is an abbreviation usually reflective of the project or crawl
!    that created this file.  Timestamp is a 14-digit GMT timestamp
     indicating the time the file was initially begun.  Serial is an
     increasing serial-number within the process creating the files, often
***************
*** 1314,1319 ****
     This specification does not require any particular WARC file naming
     practice, but recommends conventions similar to the above be adopted
!    within WARC-creating institutions. he file name prefix "iipc" should
!    be avoided unless participating in the IIPC naming registry.

     [REVIEW ISSUE: Discover sense of the group for what naming and
--- 1314,1319 ----
     This specification does not require any particular WARC file naming
     practice, but recommends conventions similar to the above be adopted
!    within WARC-creating institutions.  The file name prefix "iipc"
!    should be avoided unless participating in the IIPC naming registry.

     [REVIEW ISSUE: Discover sense of the group for what naming and
***************
*** 1405,1409 ****

     After IESG approval, IANA is expected to register the WARC type
!    "application/warc" using he application provided in this document.

--- 1405,1409 ----

     After IESG approval, IANA is expected to register the WARC type
!    "application/warc" using the application provided in this document.

***************
*** 1461,1465 ****

     This document could not have been written without major contributions
!    from participants of he International Internet Preservation
     Consortium, especially Steen Christensen, and Julien Masanes.

--- 1461,1465 ----

     This document could not have been written without major contributions
!    from participants of the International Internet Preservation
     Consortium, especially Steen Christensen, and Julien Masanes.

***************
*** 1534,1538 ****
     blocks.  Although the 'Related-Record-ID' parameter required of
     'metadata', 'revisit', and 'conversion' records is sufficient to
!    convey relatedness in he context of a single WARC file, great
     optimization can be had when relatedness can be inferred by third
     parties through identifier comparison rather than by lookup in a
--- 1534,1538 ----
     blocks.  Although the 'Related-Record-ID' parameter required of
     'metadata', 'revisit', and 'conversion' records is sufficient to
!    convey relatedness in the context of a single WARC file, great
     optimization can be had when relatedness can be inferred by third
     parties through identifier comparison rather than by lookup in a
***************
*** 1595,1602 ****

     <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
!    <warcmetadata>
!    xmlns:dc="http://purl.org/dc/elements/1.1/"
!    xmlns:dcterms="http://purl.org/dc/terms/"
!    xmlns:warc="http://archive.org/warc/0.8/">
     <warc:software>
     Heritrix 1.4.0 http://crawler.archive.org
--- 1595,1602 ----

     <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
!    <warcmetadata
!        xmlns:dc="http://purl.org/dc/elements/1.1/"
!        xmlns:dcterms="http://purl.org/dc/terms/"
!        xmlns:warc="http://archive.org/warc/0.8/">
     <warc:software>
     Heritrix 1.4.0 http://crawler.archive.org
***************
*** 1611,1615 ****
     </warc:http-header-user-agent>
     <dc:format>WARC file version 0.8</dc:format>
!    <dcterms:conformsTo nxsi:type="dcterms:URI">
     http://www.archive.org/documents/WarcFileFormat.php
     </dcterms:conformsTo>
--- 1611,1615 ----
     </warc:http-header-user-agent>
     <dc:format>WARC file version 0.8</dc:format>
!    <dcterms:conformsTo xsi:type="dcterms:URI">
     http://www.archive.org/documents/WarcFileFormat.php
     </dcterms:conformsTo>
***************
*** 1754,1763 ****

     Again, reference is made back to the original 'response' record.  A
!    new creation-date reflects he time of revisit.  This content block
     hypothesizes including header excerpts from a server response to
     explain the results of the revisit.  (In this case, the remote server
     indicated the resource was unchanged from the previous 'Etag' value.)
!    The actual formats for describing he result of a revisit remain to be
!    defined.

  Appendix B.7.  Example of 'conversion' Record
--- 1754,1763 ----

     Again, reference is made back to the original 'response' record.  A
!    new creation-date reflects the time of revisit.  This content block
     hypothesizes including header excerpts from a server response to
     explain the results of the revisit.  (In this case, the remote server
     indicated the resource was unchanged from the previous 'Etag' value.)
!    The actual formats for describing the result of a revisit remain to
!    be defined.

  Appendix B.7.  Example of 'conversion' Record
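
Note on the proposed Warcinfo-ID text: the association rule it describes can be sketched in a few lines. This is only an illustration under stated assumptions, not code from archive-access. The Record class, its field names, and the associate_warcinfo function are hypothetical, and the sketch assumes that a record without a Warcinfo-ID is associated with the nearest preceding 'warcinfo' record in the same WARC. Because the draft leaves Warcinfo-ID on a 'warcinfo' record undefined, the sketch ignores it there.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Tuple


@dataclass
class Record:
    """Hypothetical, minimal view of a WARC record header."""
    record_id: str
    record_type: str
    warcinfo_id: Optional[str] = None  # value of a Warcinfo-ID parameter, if any


def associate_warcinfo(
    records: Iterable[Record],
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (record_id, associated warcinfo record-id) in file order."""
    current = None  # record-id of the most recent 'warcinfo' record seen
    for rec in records:
        if rec.record_type == "warcinfo":
            # A 'warcinfo' record starts a new context; any Warcinfo-ID on it
            # is ignored because the draft leaves that case undefined.
            current = rec.record_id
            yield rec.record_id, None
        elif rec.warcinfo_id is not None:
            # An explicit Warcinfo-ID overrides the positional association.
            yield rec.record_id, rec.warcinfo_id
        else:
            # Assumption: fall back to the preceding 'warcinfo' record.
            yield rec.record_id, current
```

With this rule, a record copied in from another WARC keeps its original association as long as the writer stamped it with Warcinfo-ID, while its unstamped neighbours still resolve to the local 'warcinfo' record.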
