From: A.M. K. <aku...@us...> - 2001-09-20 21:43:06
Update of /cvsroot/py-howto/pyhowto In directory usw-pr-cvs1:/tmp/cvs-serv5046 Modified Files: xml-howto.tex Log Message: Massive rewriting and expansion of the introductory section Index: xml-howto.tex =================================================================== RCS file: /cvsroot/py-howto/pyhowto/xml-howto.tex,v retrieving revision 1.13 retrieving revision 1.14 diff -C2 -r1.13 -r1.14 *** xml-howto.tex 2001/09/20 21:19:05 1.13 --- xml-howto.tex 2001/09/20 21:43:03 1.14 *************** *** 38,159 **** reasonably simple to implement and use, and is already being used for specifying markup languages for various new standards: MathML for ! expressing mathematical equations, Synchronized Multimedia ! Integration Language for ! multimedia presentations, and so forth. SGML and XML represent a document by tagging the document's various ! components with their function, or meaning. For example, an academic ! paper contains several parts: it has a title, one or more authors, an ! abstract, the actual text of the paper, a list of references, and so ! forth. A markup language for writing such papers would therefore have ! tags for indicating what the contents of the abstract are, what the ! title is, and so forth. This should not be confused with the physical ! details of how the document is actually printed on paper. The ! abstract might be printed with narrow margins in a smaller font than ! the rest of the document, but the markup usually won't be concerned ! with details such as this; other software will translate from the ! markup language to a typesetting language such as \TeX, and will ! handle the details. A markup language specified using XML looks a lot like HTML; a document consists of a single \dfn{element}, which contains sub-elements, which can have further sub-elements inside them. Elements are indicated by \dfn{tags} in the text. Tags are always ! inside angle brackets \code{<}~\code{>}. There are two forms of ! elements. 
An element can contain content between opening and closing tags, as in \code{<name>Euryale</name>}, which is a \element{name} element containing the data \samp{Euryale}. This content may be text ! data, other XML elements, or a mixture of both. Elements can also be ! empty, in which case they contain nothing, and are represented as a ! single tag ended with a slash, as in \code{<stop/>}, which is an empty ! \element{stop} element. Unlike HTML, XML element names are ! case-sensitive; \element{stop} and \element{Stop} are two different ! element types. Opening and empty tags can also contain attributes, which specify ! values associated with an element. For example, text such as \code{<name lang='greek'>Herakles</name>}, the \element{name} element has a \attribute{lang} attribute which has a value of \samp{greek}. ! This would contrast with \code{<name lang='latin'>Hercules</name>}, ! where the attribute's value is \samp{latin}. ! A given XML language is specified with a Document Type Definition, or ! \dfn{DTD}. The DTD declares the element names that are allowed, and ! how elements can be nested inside each other. The DTD also specifies ! the attributes that can be provided for each element, their default ! values, and if they can be omitted. For example, to take an example ! from HTML, the \element{LI} element, representing an entry in a list, ! can only occur inside certain elements which represent lists, such as ! \element{OL} or \element{UL}. A \dfn{validating parser} can be given ! a DTD and a document, and verify whether a given document is legal ! according to the DTD's rules, or determine that one or more rules have ! been violated. ! ! Applications that process XML can be classed into two types. The ! simplest class is an application that only handles one particular ! markup language. For example, a chemistry program may only need to ! process Chemical Markup Language, but not MathML. This ! 
application can therefore be written specifically for a single DTD, ! and doesn't need to be capable of handling multiple markup ! languages. This type is simpler to write, and can easily be ! implemented with the available Python software. ! ! The second type of application is less common, and has to be able to ! handle any markup language you throw at it. An example might be a ! smart XML editor that helps you to write XML that conforms to a ! selected DTD; it might do so by not letting you enter an element where ! it would be illegal, or by suggesting elements that can be placed at ! the current cursor location. Such an application needs to handle any ! possible XML-defined markup, and therefore must be able to obtain a ! data structure embodying the DTD in use. XXX This type of application ! can't currently be implemented in Python without difficulty (XXX but ! wait and see if a DTD module is included...) ! For the full details of XML's syntax, the one definitive source is the ! XML 1.0 specification, available on the Web at \url{http://www.w3.org/TR/xml-spec.html}. However, like all ! specifications, it's quite formal and isn't intended to be a friendly introduction or a tutorial. The annotated version of the standard, at ! \url{http://www.xml.com/xml/pub/axml/axmlintro.html}, is quite helpful ! in clarifying the specification's intent. There are also various ! informal tutorials and books available to introduce you to XML. ! ! The rest of this HOWTO will assume that you're familiar with the ! relevant terminology. Most sections will use XML terms such as ! \emph{element} and \emph{attribute}; section~\ref{DOM} on the Document ! Object Model will assume that you've read the relevant Working Draft, ! and are familiar with things like Iterators and Nodes. ! Section~\ref{SAX} does not require that you have experience with the ! Java SAX implementations. ! \subsection{Related Links} ! \begin{seealso} ! 
\seetitle[http://www.w3.org/XML/]{Extensible Markup Language (XML)} ! {The World Wide Web Consortium's main page leading to ! documents relating to XML. Start here for background ! information and specifications.} ! \seetitle[http://www.oasis-open.org/cover/sgml-xml.html]{The XML ! Cover Pages}{Perhaps the leading index of information on ! XML and SGML.} ! \end{seealso} \section{Installing the XML Toolkit} ! Windows users should get the precompiled version at ! \url{http://sourceforge.net/projects/pyxml}; Mac users will use the ! corresponding precompiled version at \url{XXX}. Linux users may wish ! to use either the Debian package from \url{XXX}, or the RPM from ! \url{http://sourceforge.net/projects/pyxml}. To compile from source ! on a \UNIX{} platform, simply perform the following steps. \begin{enumerate} - \item If you are using Python 1.5, you need to install the - distutils first, which are available from - \url{http://www.python.org/sigs/distutils-sig}. Python 1.6 and later - already includes the distutils, so you can skip this step. ! \item Get a copy of the source distribution from \url{http://sourceforge.net/projects/pyxml}. Unpack it with the following command. --- 38,318 ---- reasonably simple to implement and use, and is already being used for specifying markup languages for various new standards: MathML for ! expressing mathematical equations, Synchronized Multimedia Integration ! Language for multimedia presentations, and so forth. SGML and XML represent a document by tagging the document's various ! components with their function or meaning. For example, a book ! contains several parts: it has a title, one or more authors, the text ! of the book, perhaps a preface or an index, and so forth. A markup ! language for writing books would therefore have elements indicating ! what the contents of the preface are, what the title is, and so forth. ! This should not be confused with the physical details of how the ! 
document is actually printed on paper. The index might be printed ! with narrow margins in a smaller font than the rest of the book, but ! markup usually isn't (or shouldn't be, anyway) concerned with details ! such as this. Instead, other software will translate from the markup ! language to a typesetting language such as \TeX, handling the ! presentation details. ! ! This section will provide a brief overview of XML and a few related ! standards, but it's far from being complete because making it complete ! would require a full-length book and not a short HOWTO. There's no ! better way to get a completely accurate description than to read the ! original W3C Recommendations; you can find links to them in ! section~\ref{xml-links}, ``Related Links''. If you already know what ! XML is, you can skip the rest of this section. ! ! Later sections of this HOWTO assume that you're familiar with XML ! terminology. Most sections will use XML terms such as \emph{element} ! and \emph{attribute}. Section~\ref{SAX} does not require that you ! have experience with the Java SAX implementations. + + \subsection{Elements, Attributes and Entities} + A markup language specified using XML looks a lot like HTML; a document consists of a single \dfn{element}, which contains sub-elements, which can have further sub-elements inside them. Elements are indicated by \dfn{tags} in the text. Tags are always ! inside angle brackets \code{<}~\code{>}. Elements can either contain content, or they can be empty: ! ! \begin{itemize} ! ! \item An element can contain content between opening and closing tags, as in \code{<name>Euryale</name>}, which is a \element{name} element containing the data \samp{Euryale}. This content may be text ! data, other XML elements, or a mixture of both. ! ! \item Elements can also be empty, containing nothing, and are ! represented as a single tag ended with a slash. For example, ! \code{<stop/>} is an empty \element{stop} element. Unlike HTML, XML ! 
element names are case-sensitive; \element{stop} and \element{Stop} ! are two different elements. ! ! \end{itemize} Opening and empty tags can also contain attributes, which specify ! values associated with an element. For example, in the XML text \code{<name lang='greek'>Herakles</name>}, the \element{name} element has a \attribute{lang} attribute which has a value of \samp{greek}. ! Contrast this with \code{<name lang='latin'>Hercules</name>}, where ! the attribute's value is \samp{latin}. ! ! XML also includes \dfn{entities} as a shorthand for including a ! particular character or a longer string. Entity references always ! begin with a \samp{\&} and end with a \samp{;}. For example, a ! particular Unicode character can be written as \code{\&\#4660;} using ! its character code in decimal, or as \code{\&\#x1234;} using ! hexadecimal. It's also possible to define your own entities, making ! \code{\&title;} expand to ``The Odyssey'', for example. If you want to ! include the \samp{\&} character in XML content, it must be written as ! \code{\&amp;}. ! ! \subsection{Well-Formed XML} ! ! A legal XML document must, as a minimum, be \dfn{well-formed}: each ! open tag must have a corresponding closing tag, and tags must nest ! properly. For example, \code{<b><i>text</b></i>} is not well-formed ! because the \element{i} element should be enclosed inside the ! \element{b} element, but instead the closing \code{</b>} tag is ! encountered first. This example can be made well-formed by swapping ! the order of the closing tags, resulting in \code{<b><i>text</i></b>}. ! ! If you've ever written HTML by hand, you may have acquired the habit ! of being a bit sloppy about this. Strictly speaking, HTML has exactly ! the same rules about nesting tags as XML, but most Web browsers are ! very forgiving of errors in HTML. This is convenient for HTML ! authors, but it makes it difficult to write programs to parse HTML ! input, because the programs have to cope with all sorts of malformed ! 
input. ! ! The authors of the XML specification didn't want XML to fall into the ! same trap, because it would make XML processing software much harder ! to write. Therefore, all XML parsers have to be strict and must ! report an error if their input isn't well-formed. The Expat parser ! includes an executable program named \program{xmlwf} that parses the ! contents of files and reports any well-formedness violations; it's ! very handy for checking XML data that's been output from a program or ! written by hand. ! ! ! \subsection{DTDs} ! ! Well-formedness just says that all tags nest properly and that every ! opening tag is matched by a closing tag. It says nothing about the ! order of elements or about which elements can be contained inside other ! elements. ! ! The following XML, apparently representing a book, is well-formed but ! it makes no logical sense: ! ! \begin{verbatim} ! <book> ! <index> ... </index> ! <chapter> ... </chapter> ! <chapter> ... </chapter> ! <abstract> ... </abstract> ! <chapter> ... </chapter> ! <preface> .... </preface> ! </book> ! \end{verbatim} ! Prefaces don't come at the end of books, the index doesn't belong at ! the front, and the abstract doesn't belong in the middle. ! Well-formedness alone doesn't provide any way of enforcing that order. ! You could write a Python program that took an XML file like this and ! checked whether all the parts are in order, but then someone wanting ! to understand what documents are legal would have to read your program. ! ! Document Type Definitions, or \dfn{DTDs} for short, are a more concise ! way of enforcing ordering and nesting rules. A DTD declares the ! element names that are allowed, and how elements can be nested inside ! each other. To take an example from HTML, the \element{LI} element, ! representing an entry in a list, can only occur inside certain ! elements which represent lists, such as \element{OL} or \element{UL}. ! ! 
The DTD also specifies the attributes that can be provided for each ! element, the default value for each attribute, and whether the ! attribute can be omitted. A \dfn{validating parser} can take a ! document and a DTD, and check whether the document is legal according ! to the DTD's rules. ! ! Note that it's quite possible to get useful work done without using a ! validating parser and writing a DTD. You might decide that just ! writing well-formed XML and checking it with a Python program is all ! you need. ! ! A DTD lists the supported elements, the order in which elements must ! occur, and the possible attributes for each element. Here's a ! fragment from an imaginary DTD for writing books: ! \begin{verbatim} ! <!ELEMENT book (abstract?, preface, chapter*, appendix?)> ! <!ELEMENT abstract ...> ! <!ELEMENT chapter ...> ! <!ATTLIST chapter id ID #REQUIRED ! title CDATA #IMPLIED> ! \end{verbatim} ! ! The first line declares the \element{book} element, and specifies the ! elements that can occur inside it and the order in which the ! subelements must be provided. DTDs borrow from regular expression ! notation in order to express how elements can be repeated; \samp{?} ! means an element must occur 0 or 1 times, \samp{*} is 0 or more times, ! and \samp{+} means the element must occur 1 or more times. For ! example, the \element{abstract} and \element{appendix} elements are ! optional inside a \element{book} element. Exactly one ! \element{preface} element has to be present, and it can be followed by ! any number of \element{chapter} elements; having no chapters at all ! would be legal. ! ! The \code{ATTLIST} declaration specifies attributes for the ! \element{chapter} element. Chapters can have two attributes, ! \attribute{id} and \attribute{title}. \attribute{title} contains ! character data (CDATA) and is optional (that's what \samp{\#IMPLIED} ! means, for obscure historical reasons). \attribute{id} must contain ! an ID value, and it's required and not optional. ! 
! A validating parser could take this DTD and a sample document, and ! report whether the document is \dfn{valid} according to the rules of ! the DTD. A document is valid if all the elements occur in the right ! order, and in the right number of repetitions. ! ! \subsection{Related Links} ! \label{xml-links} ! ! For the full details of XML's syntax, the definitive source is the XML ! 1.0 specification, available on the Web at \url{http://www.w3.org/TR/xml-spec.html}. However, like all ! specifications it's quite formal and isn't intended to be a friendly introduction or a tutorial. The annotated version of the standard, at ! \url{http://www.xml.com/xml/pub/a/axml/axmlintro.html}, is quite helpful ! in clarifying the specification's intent. There are also many more ! informal tutorials and books available to introduce you to XML at ! greater length. ! The XML Cover Pages, at \url{http://xml.coverpages.org}, are an ! extensive collection of links to XML and SGML resources, including a ! news page that's updated every few days. If you can only remember one ! XML-related URL, remember this one. Cafe con Leche, ! at \url{http://www.ibiblio.org/xml/}, is another good resource. ! ! The xml-dev mailing list is a high-traffic list for implementation and ! development; see \url{http://www.xml.org/xml/xmldev.shtml} for ! archives and subscription information. Be warned: Some people might ! find the discussion too focused on inventing new standards and tools, ! not on applying existing standards. ! ! ! \section{Related Standards} ! XML 1.0 is the basic standard, but people have built many, \emph{many} ! additional standards and tools on top of XML or to be used with XML. ! This section will quickly introduce some of these related ! technologies, paying particular attention to those that are supported ! by the Python/XML package. 
+ %\subsection{XPath and XPointer} + %XXX write section on XPath and XPointer + + \subsection{XSLT} + + XML documents are often transformed from one format to another. These + transformations can be minor, such as changing all \element{OL} + elements into \element{UL} elements, or major, such as translating a + DocBook document into HTML so it can be displayed in a Web browser. + You can write a separate Python program to do each transformation as + you need it, and at times that will be the most appropriate option, + but an alternative approach is to use XSL, the Extensible Stylesheet + Language. + + XSL is really two standards: XSLT, XSL Transformations; and XSL-FO, + XSL Formatting Objects. XSLT is used much more often than XSL-FO, + because XSL-FO is intended primarily for rendering XML for printing + onto paper. XSLT is a general tool for transforming one XML document + into another document, and therefore can be used for more diverse + tasks. + + To use XSLT, you have to write a \dfn{stylesheet}, which is itself an + XML document written in the XSLT DTD. The source document is turned + into a tree structure, and the stylesheet specifies the transformation + you want to perform by selecting some elements from the tree and + rearranging them. + + The Python/XML package includes 4XSLT, an XSLT processor written by + Fourthought, Inc., letting you write Python programs that apply an + XSLT stylesheet to a document. + + \subsubsection{Related Links} + + The W3C's XSL page is at \url{http://www.w3.org/Style/XSL/}, and links + to the XSLT specifications and to friendlier tutorials. + + + %\subsection{XML Schemas} + + % XXX write a section on XSchema + + + %\subsection{RDF} + + % XXX write a section on RDF; point users toward Redfoot. + + \section{Installing the XML Toolkit} + + Releases are available from + \url{http://sourceforge.net/projects/pyxml/}. + Windows users should download the appropriate precompiled version. 
+ Linux users can either download an RPM, or install from source. Users + on other platforms have no choice but to install from source. ! To compile from source on a \UNIX{} platform, simply perform the ! following steps. \begin{enumerate} ! \item Download the latest version of the source distribution from \url{http://sourceforge.net/projects/pyxml}. Unpack it with the following command. *************** *** 163,200 **** \end{verbatim} ! \item ! Run: ! \begin{verbatim} ! python setup.py install ! \end{verbatim} ! ! To properly execute this operation, a C compiler is required - the ! same that was used to build Python itself. On a Unix system, this ! operation may require superuser permissions. \code{setup.py} supports ! a number of different commands and options, invoke \code{setup.py} ! without any arguments to obtain help. \end{enumerate} If you have difficulty installing this software, send a problem report ! to <xm...@py...> describing the problem, or submit a bug report at \url{http://sourceforge.net/projects/pyxml}. There are various demonstration programs in the \file{demo/} directory ! of the source distribution. You may wish to look at them next to get ! an impression of what's possible with the XML tools, and as a source of example code. - % package layout \subsection{Related Links} - - \begin{seealso} - \seetitle[http://www.python.org/topics/xml/]{Python and XML - Processing}{This is the starting point for Python-related - XML topics; it is updated to refer to all software, - mailing lists, documentation, etc.} - \end{seealso} \section{SAX: The Simple API for XML} --- 322,359 ---- \end{verbatim} ! \item Run \code{python setup.py install}. In order to run this, ! you'll need to have a C compiler installed, and it should be the same ! one that was used to build your Python installation. On a Unix system, ! this operation may require superuser permissions. \code{setup.py} ! supports a number of different commands and options; invoke ! 
\code{setup.py} without any arguments to see a help message. \end{enumerate} + If you're using Python 1.5, you'll have to install the Distutils + first, available from + \url{http://www.python.org/sigs/distutils-sig}. The PyXML package + still works with Python 1.5.2, but the Unicode support is much less + powerful. Versions of Python after 1.5.2 added very complete Unicode + support, so if you're going to be doing serious XML work you should + use the latest version of Python 2.x. At this writing, the latest + released version is Python 2.1. + If you have difficulty installing this software, send a problem report ! to the XML-SIG mailing list describing the problem, or submit a bug report at \url{http://sourceforge.net/projects/pyxml}. There are various demonstration programs in the \file{demo/} directory ! of the Python/XML source distribution. You may wish to look at them ! to get an idea of what's possible with the XML tools, and as a source of example code. \subsection{Related Links} + The Python/XML Topic Guide, at + \url{http://pyxml.sourceforge.net/topics/}, is the starting point for + Python-related XML topics; it links to software, mailing lists, + documentation, etc. \section{SAX: The Simple API for XML}
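
The new ``Well-Formed XML'' subsection in this revision says that every XML parser must report an error when tags fail to nest properly, and mentions Expat's xmlwf program for checking files. The following is a minimal sketch of the same kind of check done from Python. It assumes a current Python 3 interpreter and the standard library's xml.parsers.expat module rather than the 2001-era PyXML package the HOWTO describes, and the helper name check_well_formed is hypothetical, chosen only for this illustration.

```python
import xml.parsers.expat


def check_well_formed(text):
    """Return None if text is well-formed XML, else a short error description.

    Hypothetical helper for illustration; it is not part of PyXML or the HOWTO.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument tells Expat that this is the final chunk of input.
        parser.Parse(text, True)
    except xml.parsers.expat.ExpatError as err:
        reason = xml.parsers.expat.ErrorString(err.code)
        return "line %d, column %d: %s" % (err.lineno, err.offset, reason)
    return None


if __name__ == "__main__":
    samples = [
        "<b><i>text</i></b>",    # tags nest properly
        "<b><i>text</b></i>",    # closing tags in the wrong order
        "<title>Tom & Jerry</title>",      # bare ampersand in content
        "<title>Tom &amp; Jerry</title>",  # ampersand written as an entity
    ]
    for sample in samples:
        problem = check_well_formed(sample)
        status = "well-formed" if problem is None else "error, " + problem
        print("%-32s %s" % (sample, status))
```

This only tests well-formedness. Expat is a non-validating parser, so checking a document against a DTD such as the imaginary book DTD in the diff would need a separate validating parser.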