From: Michael S. <sta...@us...> - 2005-09-15 22:18:15
Update of /cvsroot/archive-access/archive-access/src/docs/warc
In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv32472

Modified Files:
    warc_file_format.html warc_file_format.txt
    warc_file_format.xml
Log Message:
* warc_file_format.xml
    Added Appendix C of collection ABNF (Needs work still).
* warc_file_format.html
* warc_file_format.txt
    Generated from warc_file_format.xml

Index: warc_file_format.html
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.html,v
retrieving revision 1.7
retrieving revision 1.8
diff -C2 -d -r1.7 -r1.8
*** warc_file_format.html 28 Aug 2005 18:55:30 -0000 1.7
--- warc_file_format.html 15 Sep 2005 22:18:05 -0000 1.8
***************
*** 263,266 ****
--- 263,268 ----
  <a href="#anchor40">Appendix B.8.</a>
  Example of 'continuation' Record<br />
+ <a href="#anchor41">Appendix C.</a>
+ Collected BNF for WARC<br />
  <a href="#rfc.references1">14.</a>
  References<br />
***************
*** 1531,1534 ****
--- 1533,1579 ----
  the set, the one with the "Segment-Number: 1" named field.
  </p>
+ <a name="anchor41"></a><br /><hr />
+ <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table>
+ <a name="rfc.section.C"></a><h3>Appendix C. Collected BNF for WARC</h3>
+ <pre>
+ warc-file   = 1*warc-record
+ warc-record = header block CRLF CRLF
+ header      = header-line CRLF *anvl-field CRLF
+ block       = *OCTET
+
+ header-line = warc-id tsp data-length tsp record-type tsp
+               subject-uri tsp creation-date tsp
+               content-type tsp record-id
+ tsp         = 1*WSP
+
+ warc-id       = "warc/" DIGIT "." DIGIT
+ data-length   = 1*DIGIT
+ record-type   = "warcinfo" / "response" / "request" / "metadata" /
+                 "revisit" / "conversion" / "continuation" /
+                 future-type
+ future-type   = 1*VCHAR
+ subject-uri   = uri
+ uri           = <'URI' per RFC3986>
+ creation-date = timestamp
+ timestamp     = <date per below>
+ content-type  = type "/" subtype
+ type          = <'type' per RFC2045>
+ subtype       = <'subtype' per RFC2045>
+ record-id     = uri
+
+ anvl-field  = field-name ":" [ field-body ] CRLF
+ field-name  = 1*<any CHAR, excluding control-chars and ":">
+ field-body  = text [CRLF LWSP-char field-body]
+ text        = 1*<any UTF-8 character, including bare
+               CR and bare LF, but NOT including CRLF>
+                                               ; (Octal, Decimal.)
+ CHAR        = <any ASCII/UTF-8 character>     ; (0-177, 0.-127.)
+ CR          = <ASCII CR, carriage return>     ; ( 15, 13.)
+ LF          = <ASCII LF, linefeed>            ; ( 12, 10.)
+ SPACE       = <ASCII SP, space>               ; ( 40, 32.)
+ HTAB        = <ASCII HT, horizontal-tab>      ; ( 11, 9.)
+ CRLF        = CR LF
+ LWSP-char   = SPACE / HTAB                    ; semantics = SPACE
+ </pre>
  <a name="rfc.references1"></a><br /><hr />
  <table summary="layout" cellpadding="0" cellspacing="2" class="bug" align="right"><tr><td class="bug"><a href="#toc" class="link2"> TOC </a></td></tr></table>

Index: warc_file_format.xml
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.xml,v
retrieving revision 1.11
retrieving revision 1.12
diff -C2 -d -r1.11 -r1.12
*** warc_file_format.xml 28 Aug 2005 18:55:30 -0000 1.11
--- warc_file_format.xml 15 Sep 2005 22:18:06 -0000 1.12
***************
*** 17,21 ****
  <!ENTITY rfc2540 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.2540.xml'>
  <!ENTITY rfc4027 PUBLIC '' 'http://xml.resource.org/public/rfc/bibxml/reference.RFC.4027.xml'>
-
  ]>
  <?rfc symrefs="yes"?>
--- 17,20 ----
***************
*** 1397,1401 ****
--- 1396,1452 ----
  </appendix>
+ </appendix>
+
+ <appendix title="Collected BNF for WARC">
+ <!--
+ TODO: Bring in the definitions for OCTET, etc., from RFC2234.
+ TODO: Whats the slash mean? Others have |.
+ TODO: Timestamp, mimetype.
+ TODO: The dot after in ANVL zero?
+ TODO: Do all abnf as entity includes so not repeated.
+ -->
+ <figure>
+ <artwork>
+ warc-file   = 1*warc-record
+ warc-record = header block CRLF CRLF
+ header      = header-line CRLF *anvl-field CRLF
+ block       = *OCTET
+
+ header-line = warc-id tsp data-length tsp record-type tsp
+               subject-uri tsp creation-date tsp
+               content-type tsp record-id
+ tsp         = 1*WSP
+
+ warc-id       = "warc/" DIGIT "." DIGIT
+ data-length   = 1*DIGIT
+ record-type   = "warcinfo" / "response" / "request" / "metadata" /
+                 "revisit" / "conversion" / "continuation" /
+                 future-type
+ future-type   = 1*VCHAR
+ subject-uri   = uri
+ uri           = <'URI' per RFC3986>
+ creation-date = timestamp
+ timestamp     = <date per below>
+ content-type  = type "/" subtype
+ type          = <'type' per RFC2045>
+ subtype       = <'subtype' per RFC2045>
+ record-id     = uri
+
+ anvl-field  = field-name ":" [ field-body ] CRLF
+ field-name  = 1*<any CHAR, excluding control-chars and ":">
+ field-body  = text [CRLF LWSP-char field-body]
+ text        = 1*<any UTF-8 character, including bare
+               CR and bare LF, but NOT including CRLF>
+                                               ; (Octal, Decimal.)
+ CHAR        = <any ASCII/UTF-8 character>     ; (0-177, 0.-127.)
+ CR          = <ASCII CR, carriage return>     ; ( 15, 13.)
+ LF          = <ASCII LF, linefeed>            ; ( 12, 10.)
+ SPACE       = <ASCII SP, space>               ; ( 40, 32.)
+ HTAB        = <ASCII HT, horizontal-tab>      ; ( 11, 9.)
+ CRLF        = CR LF
+ LWSP-char   = SPACE / HTAB                    ; semantics = SPACE
+ </artwork>
+ </figure>
  </appendix>

Index: warc_file_format.txt
===================================================================
RCS file: /cvsroot/archive-access/archive-access/src/docs/warc/warc_file_format.txt,v
retrieving revision 1.6
retrieving revision 1.7
diff -C2 -d -r1.6 -r1.7
*** warc_file_format.txt 28 Aug 2005 18:55:30 -0000 1.6
--- warc_file_format.txt 15 Sep 2005 22:18:06 -0000 1.7
***************
*** 157,164 ****
  Appendix B.7. Example of 'conversion' Record . . . . . . . . . . . 32
  Appendix B.8. Example of 'continuation' Record . . . . . . . . . . 32
! 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 33
! Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 35
! Intellectual Property and Copyright Statements . . . . . . . . . . 36
!
--- 157,164 ----
  Appendix B.7. Example of 'conversion' Record . . . . . . . . . . . 32
  Appendix B.8. Example of 'continuation' Record . . . . . . . . . . 32
! Appendix C. Collected BNF for WARC . . . . . . . . . . . . . . . 34
! 14. References . . . . . . . . . . . . . . . . . . . . . . . . . . 34
! Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 37
! Intellectual Property and Copyright Statements . . . . . . . . . . 38
***************
*** 1812,1815 ****
--- 1812,1895 ----
  set, the one with the "Segment-Number: 1" named field.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+ Kunze, et al.          Expires January 2, 2006          [Page 33]
+
+ Internet-Draft         WARC File Format, 0.8revB         July 2005
+
+
+ Appendix C. Collected BNF for WARC
+
+ warc-file   = 1*warc-record
+ warc-record = header block CRLF CRLF
+ header      = header-line CRLF *anvl-field CRLF
+ block       = *OCTET
+
+ header-line = warc-id tsp data-length tsp record-type tsp
+               subject-uri tsp creation-date tsp
+               content-type tsp record-id
+ tsp         = 1*WSP
+
+ warc-id       = "warc/" DIGIT "." DIGIT
+ data-length   = 1*DIGIT
+ record-type   = "warcinfo" / "response" / "request" / "metadata" /
+                 "revisit" / "conversion" / "continuation" /
+                 future-type
+ future-type   = 1*VCHAR
+ subject-uri   = uri
+ uri           = <'URI' per RFC3986>
+ creation-date = timestamp
+ timestamp     = <date per below>
+ content-type  = type "/" subtype
+ type          = <'type' per RFC2045>
+ subtype       = <'subtype' per RFC2045>
+ record-id     = uri
+
+ anvl-field  = field-name ":" [ field-body ] CRLF
+ field-name  = 1*<any CHAR, excluding control-chars and ":">
+ field-body  = text [CRLF LWSP-char field-body]
+ text        = 1*<any UTF-8 character, including bare
+               CR and bare LF, but NOT including CRLF>
+                                               ; (Octal, Decimal.)
+ CHAR        = <any ASCII/UTF-8 character>     ; (0-177, 0.-127.)
+ CR          = <ASCII CR, carriage return>     ; ( 15, 13.)
+ LF          = <ASCII LF, linefeed>            ; ( 12, 10.)
+ SPACE       = <ASCII SP, space>               ; ( 40, 32.)
+ HTAB        = <ASCII HT, horizontal-tab>      ; ( 11, 9.)
+ CRLF        = CR LF
+ LWSP-char   = SPACE / HTAB                    ; semantics = SPACE
+
+
  14. References
***************
*** 1818,1821 ****
--- 1898,1909 ----
  [ARC] Burner, M. and B. Kahle, "The ARC File Format",
+
+
+
+ Kunze, et al.          Expires January 2, 2006          [Page 34]
+
+ Internet-Draft         WARC File Format, 0.8revB         July 2005
+
+
  September 1996.
***************
*** 1842,1853 ****
  [RFC1884] Hinden, R. and S. Deering, "IP Version 6 Addressing
-
-
-
- Kunze, et al.          Expires January 2, 2006          [Page 33]
-
- Internet-Draft         WARC File Format, 0.8revB         July 2005
-
-
  Architecture", RFC 1884, December 1995.
--- 1930,1933 ----
***************
*** 1874,1877 ****
--- 1954,1965 ----
  [RFC2540] Eastlake, D., "Detached Domain Name System (DNS)
+
+
+
+ Kunze, et al.          Expires January 2, 2006          [Page 35]
+
+ Internet-Draft         WARC File Format, 0.8revB         July 2005
+
+
  Information", RFC 2540, March 1999.
***************
*** 1901,1905 ****
! Kunze, et al.          Expires January 2, 2006          [Page 34]
  Internet-Draft         WARC File Format, 0.8revB         July 2005
--- 1989,2017 ----
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
!
! Kunze, et al.          Expires January 2, 2006          [Page 36]
  Internet-Draft         WARC File Format, 0.8revB         July 2005
***************
*** 1957,1961 ****
! Kunze, et al.          Expires January 2, 2006          [Page 35]
  Internet-Draft         WARC File Format, 0.8revB         July 2005
--- 2069,2073 ----
! Kunze, et al.          Expires January 2, 2006          [Page 37]
  Internet-Draft         WARC File Format, 0.8revB         July 2005
***************
*** 2013,2016 ****
! Kunze, et al.          Expires January 2, 2006          [Page 36]
--- 2125,2128 ----
! Kunze, et al.          Expires January 2, 2006          [Page 38]
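
As a rough illustration of the header-line rule in the Appendix C grammar, here is a minimal Python sketch that splits one header line into its seven fields and checks only the parts the grammar pins down (warc-id, data-length, and the slash in content-type). The names parse_header_line and HeaderLine are hypothetical and not part of the draft; the timestamp form is left unchecked because the grammar still marks it as "date per below", and the sample values are made up.

```python
import re
from typing import NamedTuple

# ABNF quoted strings are case-insensitive, hence IGNORECASE on "warc/".
WARC_ID = re.compile(r"warc/[0-9]\.[0-9]", re.IGNORECASE)  # "warc/" DIGIT "." DIGIT
DATA_LENGTH = re.compile(r"[0-9]+")                         # 1*DIGIT
TSP = re.compile(r"[ \t]+")                                 # tsp = 1*WSP (SP / HTAB)


class HeaderLine(NamedTuple):
    """Hypothetical container for the seven header-line fields."""
    warc_id: str
    data_length: int
    record_type: str
    subject_uri: str
    creation_date: str
    content_type: str
    record_id: str


def parse_header_line(line: str) -> HeaderLine:
    """Split a header-line on tsp and validate the simple fields."""
    fields = TSP.split(line.rstrip("\r\n"))
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    warc_id, length, rtype, subject, created, ctype, record_id = fields
    if not WARC_ID.fullmatch(warc_id):
        raise ValueError(f"bad warc-id: {warc_id!r}")
    if not DATA_LENGTH.fullmatch(length):
        raise ValueError(f"bad data-length: {length!r}")
    # content-type = type "/" subtype; only the presence of both halves is checked.
    mtype, slash, subtype = ctype.partition("/")
    if not (mtype and slash and subtype):
        raise ValueError(f"bad content-type: {ctype!r}")
    # record-type is not restricted here: future-type = 1*VCHAR admits any token.
    return HeaderLine(warc_id, int(length), rtype, subject, created, ctype, record_id)


# Made-up sample line; the date format is an assumption.
sample = ("warc/0.8 1024 response http://example.com/ "
          "20050915221805 text/html urn:example:record-1\r\n")
print(parse_header_line(sample).record_type)
```

Splitting on runs of spaces and tabs works only because none of the seven fields may contain whitespace; the anvl-field lines that follow the header line allow folded bodies and would need a separate parser.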