[Archive-access-cvs] archive-access/src/docs/warc warc_file_format.xml,1.7,1.8 warc_file_format.html

Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv25656

Modified Files:
	warc_file_format.xml warc_file_format.html 
Log Message:
* warc_file_format.xml
    Added entity definition for mdash. Fixed typos. Fixed warcinfo example XML.
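
Note on the mdash entity: XML predefines only five named entities (amp, lt, gt, apos, quot), so &mdash; has to be declared in the DOCTYPE internal subset before the source can be parsed. The sketch below illustrates this with Python's standard-library parser. The two sample documents and the helper name parse_text are made up for illustration; only the ENTITY line is taken from the diff.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document using the entity declaration from this commit.
WITH_ENTITY = """<!DOCTYPE rfc [
  <!ENTITY mdash '&#8212;' >
]>
<rfc><t>retrieval attempt &mdash; web pages</t></rfc>"""

# Same content without the declaration (the situation this commit addresses).
WITHOUT_ENTITY = "<rfc><t>retrieval attempt &mdash; web pages</t></rfc>"


def parse_text(doc):
    """Parse the document and return the text of its <t> element."""
    return ET.fromstring(doc).find("t").text


if __name__ == "__main__":
    # The declared entity expands to U+2014 (em dash).
    print(repr(parse_text(WITH_ENTITY)))
    try:
        parse_text(WITHOUT_ENTITY)
    except ET.ParseError as err:
        # Without the declaration the parser rejects the reference.
        print("parse failed:", err)
```

This is consistent with the generated HTML in the diff switching from &mdash; to the numeric reference &#8212;.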

Index: warc_file_format.html
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v
retrieving revision 1.4
retrieving revision 1.5
diff -C2 -d -r1.4 -r1.5
*** warc_file_format.html	23 Aug 2005 17:35:41 -0000	1.4
--- warc_file_format.html	24 Aug 2005 01:39:51 -0000	1.5
***************
*** 411,417 ****
  </p>
  <p>Subsequent records contain content blocks that are either the
! direct result of a retrieval attempt &mdash; web pages, inline images,
  URL redirection information, DNS hostname lookup results, standalone
! files, etc. &mdash; or they are synthesized content blocks (e.g.,
  metadata, transformed content) that provide additional information
  about archived content. Any content block may contain arbitrary text
--- 411,417 ----
  </p>
  <p>Subsequent records contain content blocks that are either the
! direct result of a retrieval attempt &#8212; web pages, inline images,
  URL redirection information, DNS hostname lookup results, standalone
! files, etc. &#8212; or they are synthesized content blocks (e.g.,
  metadata, transformed content) that provide additional information
  about archived content. Any content block may contain arbitrary text
***************
*** 501,505 ****
  explain, or accompany a harvested resource, in ways not covered by
  other record types. A 'metadata' record will almost always refer to
! another record of another type, with hat other record holding original
  harvested or transformed content. (However, it is allowable for a
  'metadata' record to refer to any record type, including other
--- 501,505 ----
  explain, or accompany a harvested resource, in ways not covered by
  other record types. A 'metadata' record will almost always refer to
! another record of another type, with that other record holding original
  harvested or transformed content. (However, it is allowable for a
  'metadata' record to refer to any record type, including other
***************
*** 527,531 ****
  <p>A 'revisit' record should only be used when interpreting the record
  requires consulting a previous record; other record types should be
! preferred if the current record's is understandable standing
  alone. (It is not required that any revisit of a previously-visited
  URI use 'revisit', only those which refer back to other records.)
--- 527,531 ----
  <p>A 'revisit' record should only be used when interpreting the record
  requires consulting a previous record; other record types should be
! preferred if the current record is understandable standing
  alone. (It is not required that any revisit of a previously-visited
  URI use 'revisit', only those which refer back to other records.)
***************
*** 555,560 ****
  <p>A 'conversion' record contains an alternative version of another record's
  content that was created as the result of an archival
! process. Typically, this is used to hold content ransformations that
! maintain viability of content after widely available rendering ools
  for the originally stored format disappear. As needed, the original
  content may be migrated (transformed) to a more viable format in order
--- 555,560 ----
  <p>A 'conversion' record contains an alternative version of another record's
  content that was created as the result of an archival
! process. Typically, this is used to hold content transformations that
! maintain viability of content after widely available rendering tools
  for the originally stored format disappear. As needed, the original
  content may be migrated (transformed) to a more viable format in order
***************
*** 562,566 ****
  loss of information (intellectual content, look and feel, etc). Any
  number of transformation records may be created that reference a
! specific source record, which may itself contain ransformed
  content. Each transformation should result in a freestanding, complete
  record, with no dependency on survival of the original
--- 562,566 ----
  loss of information (intellectual content, look and feel, etc). Any
  number of transformation records may be created that reference a
! specific source record, which may itself contain transformed
  content. Each transformation should result in a freestanding, complete
  record, with no dependency on survival of the original
***************
*** 662,666 ****
  The number of octets in the record, starting with the first letter
  ("w") of the first token, through to the end of the content block 
! &mdash; not including the 2 record-ending newlines.  After proceeding 
  this many octets from that first character of the record header, there
  should be two newlines and either the beginning of a new record or the
--- 662,666 ----
  The number of octets in the record, starting with the first letter
  ("w") of the first token, through to the end of the content block 
! &#8212; not including the 2 record-ending newlines.  After proceeding 
  this many octets from that first character of the record header, there
  should be two newlines and either the beginning of a new record or the
***************
*** 688,697 ****
  <dd>
  The original URI whose collection gave rise to the information content
! in this record. In he context of web harvesting, this is the URI that
  was the target of a crawler's retrieval request. Indirectly, such as
  for a 'revisit', 'metadata', or 'conversion' record, it is a copy of
  the subject-uri appearing in the original record to which the newer
  record pertains. For a 'warcinfo' record, this parameter is given a
! synthesized value for the creation name of he WARC file, as a URI.

  <br />
--- 688,697 ----
  <dd>
  The original URI whose collection gave rise to the information content
! in this record. In the context of web harvesting, this is the URI that
  was the target of a crawler's retrieval request. Indirectly, such as
  for a 'revisit', 'metadata', or 'conversion' record, it is a copy of
  the subject-uri appearing in the original record to which the newer
  record pertains. For a 'warcinfo' record, this parameter is given a
! synthesized value for the creation name of the WARC file, as a URI.

  <br />
***************
*** 820,824 ****
  A potential strategy, after choosing one record to be primary, is to
  extend its record-id as described in the Appendix about record-id
! considerations. This creates satellite record- ids for related records
  that contain the primary record-id as an initial substring, which
  greatly optimizes the detection (and in some cases derivation) of
--- 820,824 ----
  A potential strategy, after choosing one record to be primary, is to
  extend its record-id as described in the Appendix about record-id
! considerations. This creates satellite record-ids for related records
  that contain the primary record-id as an initial substring, which
  greatly optimizes the detection (and in some cases derivation) of
***************
*** 846,850 ****
  When present, indicates that the current record ends before the
  apparent end of the source material, but no continuation records are
! forthcoming. Possible values indicate he reason for the truncation:
  'length' for exceeding a desired length limit; 'time' for exceeding a
  desired time limit during collection.
--- 846,850 ----
  When present, indicates that the current record ends before the
  apparent end of the source material, but no continuation records are
! forthcoming. Possible values indicate the reason for the truncation:
  'length' for exceeding a desired length limit; 'time' for exceeding a
  desired time limit during collection.
***************
*** 883,887 ****
  allow records to be written without know their ultimate length, with
  only a small fixed-size edit to the header when the length is
! eventually know to complete the record. This named-field-based
  mechanism does not allow a later discovery that a record needs
  truncation or segmentation to be reflected via a small header edit; it
--- 883,887 ----
  allow records to be written without know their ultimate length, with
  only a small fixed-size edit to the header when the length is
! eventually known to complete the record. This named-field-based
  mechanism does not allow a later discovery that a record needs
  truncation or segmentation to be reflected via a small header edit; it
***************
*** 917,921 ****
  <p>All subsequent segments must have a record type of 'continuation',
  with an incremented 'Segment-Number' field. They must also include a
! 'Segment-Origin-ID' field with a value of he Record-ID of the record
  containing the first segment of the set. All segments of a set must
  have identical subject-uri parameters.
--- 917,921 ----
  <p>All subsequent segments must have a record type of 'continuation',
  with an incremented 'Segment-Number' field. They must also include a
! 'Segment-Origin-ID' field with a value of the Record-ID of the record
  containing the first segment of the set. All segments of a set must
  have identical subject-uri parameters.
***************
*** 1008,1012 ****
  <p>Any resource that can be identified with a URI, even if it is not
  retrieved via an Internet operation, may be archived in a WARC file
! under a 'resource' type record. This includes files hat have
  meaningful URIs retrieved from a locally-accessible filesystem or
  other repository.
--- 1008,1012 ----
  <p>Any resource that can be identified with a URI, even if it is not
  retrieved via an Internet operation, may be archived in a WARC file
! under a 'resource' type record. This includes files that have
  meaningful URIs retrieved from a locally-accessible filesystem or
  other repository.
***************
*** 1033,1037 ****
  </p>
  <p>However, experience with the precursor ARC format at the Internet
! Archive has demonstrated hat applying simple standard compression can
  result in significant storage savings, while preserving random access
  to individual records.
--- 1033,1037 ----
  </p>
  <p>However, experience with the precursor ARC format at the Internet
! Archive has demonstrated that applying simple standard compression can
  result in significant storage savings, while preserving random access
  to individual records.
***************
*** 1075,1082 ****
  <p>Customarily, GZIP members do not declare their compressed
  length. This presents a problem for WARC processing which, after
! reading a small portion of a record, would like to skip to he next
  full record. In the absence of an external, precalculated index, using
  only the WARC record's uncompressed length would require the complete
! current record to be decompressed o find the start of the next
  record.
  </p>
--- 1075,1082 ----
  <p>Customarily, GZIP members do not declare their compressed
  length. This presents a problem for WARC processing which, after
! reading a small portion of a record, would like to skip to the next
  full record. In the absence of an external, precalculated index, using
  only the WARC record's uncompressed length would require the complete
! current record to be decompressed to find the start of the next
  record.
  </p>
***************
*** 1116,1120 ****

  <p>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure hey are properly recognized by GZIP tools, they
  should only get the customary additional ".gz" file extension suffix,
  making their suffix ".warc.gz". Software which works with WARC files
--- 1116,1120 ----

  <p>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure they are properly recognized by GZIP tools, they
  should only get the customary additional ".gz" file extension suffix,
  making their suffix ".warc.gz". Software which works with WARC files
***************
*** 1134,1138 ****
  </p>
  <p>Prefix is an abbreviation usually reflective of the project or
! crawl that created this file.  imestamp is a 14-digit GMT timestamp
  indicating the time the file was initially begun. Serial is an
  increasing serial-number within the process creating the files, often
--- 1134,1138 ----
  </p>
  <p>Prefix is an abbreviation usually reflective of the project or
! crawl that created this file.  Timestamp is a 14-digit GMT timestamp
  indicating the time the file was initially begun. Serial is an
  increasing serial-number within the process creating the files, often
***************
*** 1148,1152 ****
  <p>This specification does not require any particular WARC file naming
  practice, but recommends conventions similar to the above be adopted
! within WARC-creating institutions.  he file name prefix "iipc" should
  be avoided unless participating in the IIPC naming registry.
  </p>
--- 1148,1152 ----
  <p>This specification does not require any particular WARC file naming
  practice, but recommends conventions similar to the above be adopted
! within WARC-creating institutions.  The file name prefix "iipc" should
  be avoided unless participating in the IIPC naming registry.
  </p>
***************
*** 1212,1216 ****

  <p>After IESG approval, IANA is expected to register the WARC type
! "application/warc" using he application provided in this document.
  </p>
  <a name="anchor30"></a><br /><hr />
--- 1212,1216 ----

  <p>After IESG approval, IANA is expected to register the WARC type
! "application/warc" using the application provided in this document.
  </p>
  <a name="anchor30"></a><br /><hr />
***************
*** 1219,1223 ****

  <p>This document could not have been written without major
! contributions from participants of he International Internet
  Preservation Consortium, especially Steen Christensen, and Julien
  Masanes.
--- 1219,1223 ----

  <p>This document could not have been written without major
! contributions from participants of the International Internet
  Preservation Consortium, especially Steen Christensen, and Julien
  Masanes.
***************
*** 1246,1250 ****
  blocks. Although the 'Related-Record-ID' parameter required of
  'metadata', 'revisit', and 'conversion' records is sufficient to
! convey relatedness in he context of a single WARC file, great
  optimization can be had when relatedness can be inferred by third
  parties through identifier comparison rather than by lookup in a
--- 1246,1250 ----
  blocks. Although the 'Related-Record-ID' parameter required of
  'metadata', 'revisit', and 'conversion' records is sufficient to
! convey relatedness in the context of a single WARC file, great
  optimization can be had when relatedness can be inferred by third
  parties through identifier comparison rather than by lookup in a
***************
*** 1305,1312 ****

  &lt;?xml version="1.0" encoding="UTF-8" standalone="yes"?&gt;
! &lt;warcmetadata&gt;
! xmlns:dc="http://purl.org/dc/elements/1.1/"
! xmlns:dcterms="http://purl.org/dc/terms/"
! xmlns:warc="http://archive.org/warc/0.8/"&gt;
  &lt;warc:software&gt;
  Heritrix 1.4.0 http://crawler.archive.org
--- 1305,1312 ----

  &lt;?xml version="1.0" encoding="UTF-8" standalone="yes"?&gt;
! &lt;warcmetadata
!     xmlns:dc="http://purl.org/dc/elements/1.1/"
!     xmlns:dcterms="http://purl.org/dc/terms/"
!     xmlns:warc="http://archive.org/warc/0.8/"&gt;
  &lt;warc:software&gt;
  Heritrix 1.4.0 http://crawler.archive.org
***************
*** 1321,1325 ****
  &lt;/warc:http-header-user-agent&gt;
  &lt;dc:format&gt;WARC file version 0.8&lt;/dc:format&gt;
! &lt;dcterms:conformsTo nxsi:type="dcterms:URI"&gt;
  http://www.archive.org/documents/WarcFileFormat.php
  &lt;/dcterms:conformsTo&gt;
--- 1321,1325 ----
  &lt;/warc:http-header-user-agent&gt;
  &lt;dc:format&gt;WARC file version 0.8&lt;/dc:format&gt;
! &lt;dcterms:conformsTo xsi:type="dcterms:URI"&gt;
  http://www.archive.org/documents/WarcFileFormat.php
  &lt;/dcterms:conformsTo&gt;
***************
*** 1446,1454 ****
  </pre>
  <p>Again, reference is made back to the original 'response' record. A
! new creation-date reflects he time of revisit. This content block
  hypothesizes including header excerpts from a server response to
  explain the results of the revisit. (In this case, the remote server
  indicated the resource was unchanged from the previous 'Etag' value.)
! The actual formats for describing he result of a revisit remain to be
  defined.
  </p>
--- 1446,1454 ----
  </pre>
  <p>Again, reference is made back to the original 'response' record. A
! new creation-date reflects the time of revisit. This content block
  hypothesizes including header excerpts from a server response to
  explain the results of the revisit. (In this case, the remote server
  indicated the resource was unchanged from the previous 'Etag' value.)
! The actual formats for describing the result of a revisit remain to be
  defined.
  </p>

Index: warc_file_format.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** warc_file_format.xml	23 Aug 2005 17:35:41 -0000	1.7
--- warc_file_format.xml	24 Aug 2005 01:39:50 -0000	1.8
***************
*** 2,5 ****
--- 2,7 ----
  <!DOCTYPE rfc SYSTEM 'rfcXXXX.dtd' [

+   <!ENTITY mdash '&#8212;' >
+ 
    <!ENTITY rfc0822 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.0822.xml'>
    <!ENTITY rfc1034 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.1034.xml'>
***************
*** 349,353 ****
  explain, or accompany a harvested resource, in ways not covered by
  other record types. A 'metadata' record will almost always refer to
! another record of another type, with hat other record holding original
  harvested or transformed content. (However, it is allowable for a
  'metadata' record to refer to any record type, including other
--- 351,355 ----
  explain, or accompany a harvested resource, in ways not covered by
  other record types. A 'metadata' record will almost always refer to
! another record of another type, with that other record holding original
  harvested or transformed content. (However, it is allowable for a
  'metadata' record to refer to any record type, including other
***************
*** 375,379 ****
  <t>A 'revisit' record should only be used when interpreting the record
  requires consulting a previous record; other record types should be
! preferred if the current record's is understandable standing
  alone. (It is not required that any revisit of a previously-visited
  URI use 'revisit', only those which refer back to other records.)</t>
--- 377,381 ----
  <t>A 'revisit' record should only be used when interpreting the record
  requires consulting a previous record; other record types should be
! preferred if the current record is understandable standing
  alone. (It is not required that any revisit of a previously-visited
  URI use 'revisit', only those which refer back to other records.)</t>
***************
*** 403,408 ****
  <t>A 'conversion' record contains an alternative version of another record's
  content that was created as the result of an archival
! process. Typically, this is used to hold content ransformations that
! maintain viability of content after widely available rendering ools
  for the originally stored format disappear. As needed, the original
  content may be migrated (transformed) to a more viable format in order
--- 405,410 ----
  <t>A 'conversion' record contains an alternative version of another record's
  content that was created as the result of an archival
! process. Typically, this is used to hold content transformations that
! maintain viability of content after widely available rendering tools
  for the originally stored format disappear. As needed, the original
  content may be migrated (transformed) to a more viable format in order
***************
*** 410,414 ****
  loss of information (intellectual content, look and feel, etc). Any
  number of transformation records may be created that reference a
! specific source record, which may itself contain ransformed
  content. Each transformation should result in a freestanding, complete
  record, with no dependency on survival of the original
--- 412,416 ----
  loss of information (intellectual content, look and feel, etc). Any
  number of transformation records may be created that reference a
! specific source record, which may itself contain transformed
  content. Each transformation should result in a freestanding, complete
  record, with no dependency on survival of the original
***************
*** 535,544 ****
   <t hangText="subject-uri">
  The original URI whose collection gave rise to the information content
! in this record. In he context of web harvesting, this is the URI that
  was the target of a crawler's retrieval request. Indirectly, such as
  for a 'revisit', 'metadata', or 'conversion' record, it is a copy of
  the subject-uri appearing in the original record to which the newer
  record pertains. For a 'warcinfo' record, this parameter is given a
! synthesized value for the creation name of he WARC file, as a URI.

  <vspace blankLines="2" />
--- 537,546 ----
   <t hangText="subject-uri">
  The original URI whose collection gave rise to the information content
! in this record. In the context of web harvesting, this is the URI that
  was the target of a crawler's retrieval request. Indirectly, such as
  for a 'revisit', 'metadata', or 'conversion' record, it is a copy of
  the subject-uri appearing in the original record to which the newer
  record pertains. For a 'warcinfo' record, this parameter is given a
! synthesized value for the creation name of the WARC file, as a URI.

  <vspace blankLines="2" />
***************
*** 650,654 ****
  A potential strategy, after choosing one record to be primary, is to
  extend its record-id as described in the Appendix about record-id
! considerations. This creates satellite record- ids for related records
  that contain the primary record-id as an initial substring, which
  greatly optimizes the detection (and in some cases derivation) of
--- 652,656 ----
  A potential strategy, after choosing one record to be primary, is to
  extend its record-id as described in the Appendix about record-id
! considerations. This creates satellite record-ids for related records
  that contain the primary record-id as an initial substring, which
  greatly optimizes the detection (and in some cases derivation) of
***************
*** 673,677 ****
  When present, indicates that the current record ends before the
  apparent end of the source material, but no continuation records are
! forthcoming. Possible values indicate he reason for the truncation:
  'length' for exceeding a desired length limit; 'time' for exceeding a
  desired time limit during collection.
--- 675,679 ----
  When present, indicates that the current record ends before the
  apparent end of the source material, but no continuation records are
! forthcoming. Possible values indicate the reason for the truncation:
  'length' for exceeding a desired length limit; 'time' for exceeding a
  desired time limit during collection.
***************
*** 713,717 ****
  allow records to be written without know their ultimate length, with
  only a small fixed-size edit to the header when the length is
! eventually know to complete the record. This named-field-based
  mechanism does not allow a later discovery that a record needs
  truncation or segmentation to be reflected via a small header edit; it
--- 715,719 ----
  allow records to be written without know their ultimate length, with
  only a small fixed-size edit to the header when the length is
! eventually known to complete the record. This named-field-based
  mechanism does not allow a later discovery that a record needs
  truncation or segmentation to be reflected via a small header edit; it
***************
*** 745,749 ****
  <t>All subsequent segments must have a record type of 'continuation',
  with an incremented 'Segment-Number' field. They must also include a
! 'Segment-Origin-ID' field with a value of he Record-ID of the record
  containing the first segment of the set. All segments of a set must
  have identical subject-uri parameters.</t>
--- 747,751 ----
  <t>All subsequent segments must have a record type of 'continuation',
  with an incremented 'Segment-Number' field. They must also include a
! 'Segment-Origin-ID' field with a value of the Record-ID of the record
  containing the first segment of the set. All segments of a set must
  have identical subject-uri parameters.</t>
***************
*** 838,842 ****
  <t>Any resource that can be identified with a URI, even if it is not
  retrieved via an Internet operation, may be archived in a WARC file
! under a 'resource' type record. This includes files hat have
  meaningful URIs retrieved from a locally-accessible filesystem or
  other repository.</t>
--- 840,844 ----
  <t>Any resource that can be identified with a URI, even if it is not
  retrieved via an Internet operation, may be archived in a WARC file
! under a 'resource' type record. This includes files that have
  meaningful URIs retrieved from a locally-accessible filesystem or
  other repository.</t>
***************
*** 865,869 ****

  <t>However, experience with the precursor ARC format at the Internet
! Archive has demonstrated hat applying simple standard compression can
  result in significant storage savings, while preserving random access
  to individual records.</t>
--- 867,871 ----

  <t>However, experience with the precursor ARC format at the Internet
! Archive has demonstrated that applying simple standard compression can
  result in significant storage savings, while preserving random access
  to individual records.</t>
***************
*** 905,912 ****
  <t>Customarily, GZIP members do not declare their compressed
  length. This presents a problem for WARC processing which, after
! reading a small portion of a record, would like to skip to he next
  full record. In the absence of an external, precalculated index, using
  only the WARC record's uncompressed length would require the complete
! current record to be decompressed o find the start of the next
  record.</t>

--- 907,914 ----
  <t>Customarily, GZIP members do not declare their compressed
  length. This presents a problem for WARC processing which, after
! reading a small portion of a record, would like to skip to the next
  full record. In the absence of an external, precalculated index, using
  only the WARC record's uncompressed length would require the complete
! current record to be decompressed to find the start of the next
  record.</t>

***************
*** 946,950 ****

  <t>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure hey are properly recognized by GZIP tools, they
  should only get the customary additional ".gz" file extension suffix,
  making their suffix ".warc.gz". Software which works with WARC files
--- 948,952 ----

  <t>WARC files compressed with the above conventions remain legal GZIP
! files. Thus, to ensure they are properly recognized by GZIP tools, they
  should only get the customary additional ".gz" file extension suffix,
  making their suffix ".warc.gz". Software which works with WARC files
***************
*** 966,970 ****

  <t>Prefix is an abbreviation usually reflective of the project or
! crawl that created this file.  imestamp is a 14-digit GMT timestamp
  indicating the time the file was initially begun. Serial is an
  increasing serial-number within the process creating the files, often
--- 968,972 ----

  <t>Prefix is an abbreviation usually reflective of the project or
! crawl that created this file.  Timestamp is a 14-digit GMT timestamp
  indicating the time the file was initially begun. Serial is an
  increasing serial-number within the process creating the files, often
***************
*** 980,984 ****
  <t>This specification does not require any particular WARC file naming
  practice, but recommends conventions similar to the above be adopted
! within WARC-creating institutions.  he file name prefix "iipc" should
  be avoided unless participating in the IIPC naming registry.</t>

--- 982,986 ----
  <t>This specification does not require any particular WARC file naming
  practice, but recommends conventions similar to the above be adopted
! within WARC-creating institutions.  The file name prefix "iipc" should
  be avoided unless participating in the IIPC naming registry.</t>

***************
*** 1044,1048 ****

  <t>After IESG approval, IANA is expected to register the WARC type
! "application/warc" using he application provided in this document.</t>

    </section>
--- 1046,1050 ----

  <t>After IESG approval, IANA is expected to register the WARC type
! "application/warc" using the application provided in this document.</t>

    </section>
***************
*** 1051,1055 ****

  <t>This document could not have been written without major
! contributions from participants of he International Internet
  Preservation Consortium, especially Steen Christensen, and Julien
  Masanes.</t>
--- 1053,1057 ----

  <t>This document could not have been written without major
! contributions from participants of the International Internet
  Preservation Consortium, especially Steen Christensen, and Julien
  Masanes.</t>
***************
*** 1078,1082 ****
  blocks. Although the 'Related-Record-ID' parameter required of
  'metadata', 'revisit', and 'conversion' records is sufficient to
! convey relatedness in he context of a single WARC file, great
  optimization can be had when relatedness can be inferred by third
  parties through identifier comparison rather than by lookup in a
--- 1080,1084 ----
  blocks. Although the 'Related-Record-ID' parameter required of
  'metadata', 'revisit', and 'conversion' records is sufficient to
! convey relatedness in the context of a single WARC file, great
  optimization can be had when relatedness can be inferred by third
  parties through identifier comparison rather than by lookup in a
***************
*** 1141,1148 ****

  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
! <warcmetadata>
! xmlns:dc="http://purl.org/dc/elements/1.1/"
! xmlns:dcterms="http://purl.org/dc/terms/"
! xmlns:warc="http://archive.org/warc/0.8/">
  <warc:software>
  Heritrix 1.4.0 http://crawler.archive.org
--- 1143,1150 ----

  <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
! <warcmetadata
!     xmlns:dc="http://purl.org/dc/elements/1.1/"
!     xmlns:dcterms="http://purl.org/dc/terms/"
!     xmlns:warc="http://archive.org/warc/0.8/">
  <warc:software>
  Heritrix 1.4.0 http://crawler.archive.org
***************
*** 1157,1161 ****
  </warc:http-header-user-agent>
  <dc:format>WARC file version 0.8</dc:format>
! <dcterms:conformsTo nxsi:type="dcterms:URI">
  http://www.archive.org/documents/WarcFileFormat.php
  </dcterms:conformsTo>
--- 1159,1163 ----
  </warc:http-header-user-agent>
  <dc:format>WARC file version 0.8</dc:format>
! <dcterms:conformsTo xsi:type="dcterms:URI">
  http://www.archive.org/documents/WarcFileFormat.php
  </dcterms:conformsTo>
***************
*** 1304,1312 ****

  <t>Again, reference is made back to the original 'response' record. A
! new creation-date reflects he time of revisit. This content block
  hypothesizes including header excerpts from a server response to
  explain the results of the revisit. (In this case, the remote server
  indicated the resource was unchanged from the previous 'Etag' value.)
! The actual formats for describing he result of a revisit remain to be
  defined.</t>

--- 1306,1314 ----

  <t>Again, reference is made back to the original 'response' record. A
! new creation-date reflects the time of revisit. This content block
  hypothesizes including header excerpts from a server response to
  explain the results of the revisit. (In this case, the remote server
  indicated the resource was unchanged from the previous 'Etag' value.)
! The actual formats for describing the result of a revisit remain to be
  defined.</t>
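
Note on the warcinfo example fix: in revision 1.7 the root start tag was closed before the xmlns declarations, so they became character data and the prefixed child elements had no namespace binding. A minimal check with Python's standard-library parser, assuming a shortened version of the example (only the root element and two children are reproduced, and the helper name is_well_formed is hypothetical):

```python
import xml.etree.ElementTree as ET

# Shortened form of the example as it appeared before this commit:
# the root tag is closed early, so the xmlns lines are plain text.
OLD = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<warcmetadata>
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:warc="http://archive.org/warc/0.8/">
<warc:software>Heritrix 1.4.0 http://crawler.archive.org</warc:software>
<dc:format>WARC file version 0.8</dc:format>
</warcmetadata>"""

# Shortened form after this commit: the declarations are attributes
# of the root element.
NEW = b"""<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<warcmetadata
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:warc="http://archive.org/warc/0.8/">
<warc:software>Heritrix 1.4.0 http://crawler.archive.org</warc:software>
<dc:format>WARC file version 0.8</dc:format>
</warcmetadata>"""


def is_well_formed(doc):
    """Return True if a namespace-aware parse of the document succeeds."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print("old example parses:", is_well_formed(OLD))
    print("new example parses:", is_well_formed(NEW))
```

The full example also uses an xsi:type attribute; a namespace-aware parser would additionally need the xsi prefix to be declared somewhere in the document, which this shortened sketch does not cover.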
