From: A.M. K. <aku...@us...> - 2001-09-20 02:34:32
Update of /cvsroot/py-howto/pyhowto In directory usw-pr-cvs1:/tmp/cvs-serv875 Modified Files: xml-howto.tex Log Message: Rewrite pass over SAX section Index: xml-howto.tex =================================================================== RCS file: /cvsroot/py-howto/pyhowto/xml-howto.tex,v retrieving revision 1.10 retrieving revision 1.11 diff -C2 -r1.10 -r1.11 *** xml-howto.tex 2001/03/21 01:17:00 1.10 --- xml-howto.tex 2001/09/20 02:34:28 1.11 *************** *** 187,213 **** \label{SAX} ! The Simple API for XML isn't a standard in the formal sense, but an ! informal specification designed by David Megginson, with input from ! many people on the xml-dev mailing list. SAX defines an event-driven ! interface for parsing XML. To use SAX, you must create Python class ! instances which implement a specified interface, and the parser will ! then call various methods of those objects. ! ! This howto describes version 2 of SAX (also referred to as ! SAX2). Earlier versions of this text did explain SAX1, which is ! primarily of historical interest only. SAX is most suitable for purposes where you want to read through an entire XML document from beginning to end, and perform some ! computation, such as building a data structure representating a ! document, or summarizing information in a document (computing an ! average value of a certain element, for example). It's not very ! useful if you want to modify the document structure in some ! complicated way that involves changing how elements are nested, though ! it could be used if you simply wish to change element contents or ! attributes. For example, you would not want to re-order chapters in a ! book using SAX, but you might want to change the contents of any ! \element{name} elements with the attribute \attribute{lang} equal to ! 'greek' into Greek letters. One advantage of SAX is speed and simplicity. Let's say --- 187,213 ---- \label{SAX} ! The Simple API for XML isn't a standard in the formal sense that XML ! 
or ANSI C are. Rather, SAX is an informal specification originally ! designed by David Megginson with input from many people on the xml-dev ! mailing list. SAX defines an event-driven interface for parsing XML. ! To use SAX, you must create Python class instances which implement a ! specified interface, and the parser will then call various methods on ! those objects. ! ! This HOWTO describes version 2 of SAX (also referred to as ! SAX2). Earlier versions of this text explained SAX1, which is now only ! of historical interest. SAX is most suitable for purposes where you want to read through an entire XML document from beginning to end, and perform some ! computation such as building a data structure or summarizing the ! contained information (computing an average value of a certain ! element, for example). SAX is not very convenient if you want to ! modify the document structure by changing how elements are nested, ! though it would be straightforward to write a SAX program that simply ! changed element contents or attributes. For example, you wouldn't ! want to re-order chapters in a book using SAX, but you might want to ! extract the contents of all \element{name} elements with the attribute ! \attribute{lang} set to 'greek'. One advantage of SAX is speed and simplicity. Let's say *************** *** 219,227 **** instance which ignores all elements that aren't \element{writer}. ! Another advantage is that you don't have the whole document resident ! in memory at any one time, which matters if you are processing really ! huge documents. ! SAX defines 4 basic interfaces; an SAX-compliant XML parser can be passed any objects that support these interfaces, and will call various methods as data is processed. Your task, therefore, is to --- 219,227 ---- instance which ignores all elements that aren't \element{writer}. ! Another advantage of SAX is that you don't have the whole document ! resident in memory at any one time, which matters if you are ! 
processing really huge documents. ! SAX defines 4 basic interfaces. A SAX-compliant XML parser can be passed any objects that support these interfaces, and will call various methods as data is processed. Your task, therefore, is to *************** *** 245,249 **** \lineii{EntityResolver}{Called to resolve references to external entities. If your documents will have no external entity references, ! you won't need to implement this interface. } \lineii{ErrorHandler}{Called for error handling. The parser will call --- 245,249 ---- \lineii{EntityResolver}{Called to resolve references to external entities. If your documents will have no external entity references, ! you don't need to implement this interface.} \lineii{ErrorHandler}{Called for error handling. The parser will call *************** *** 255,259 **** listed above are implemented as Python classes. The default method implementations are defined to do nothing---the method body is just a ! Python \code{pass} statement--so usually you can simply ignore methods that aren't relevant to your application. --- 255,259 ---- listed above are implemented as Python classes. The default method implementations are defined to do nothing---the method body is just a ! Python \code{pass} statement---so usually you can simply ignore methods that aren't relevant to your application. *************** *** 261,265 **** \begin{verbatim} # Define your specialized handler classes ! from xml.sax import Contenthandler, ... class docHandler(ContentHandler): ... --- 261,265 ---- \begin{verbatim} # Define your specialized handler classes ! from xml.sax import ContentHandler, ... class docHandler(ContentHandler): ... *************** *** 274,287 **** parser.setContentHandler(dh) ! # Parse the file; your handler's method will get called parser.parse(sys.stdin) - \end{verbatim} \subsection{Starting Out} ! Following the earlier example, let's consider a simple XML format for ! storing information about a comic book collection. Here's a sample ! 
document for a collection consisting of a single issue: \begin{verbatim} --- 274,286 ---- parser.setContentHandler(dh) ! # Parse the file; your handler's methods will get called parser.parse(sys.stdin) \end{verbatim} \subsection{Starting Out} ! Let's follow the earlier example of a comic book collection, using a ! simple DTD-less format. Here's a sample document for a collection ! consisting of a single issue: \begin{verbatim} *************** *** 298,304 **** \samp{collection} element. It has one child \element{comic} element for each issue; the book's title and number are given as attributes of ! the \element{comic} element, which can have one or more children ! containing the issue's writer and artists. There may be several ! artists or writers for a single issue. Let's start off with something simple: a document handler named --- 297,304 ---- \samp{collection} element. It has one child \element{comic} element for each issue; the book's title and number are given as attributes of ! the \element{comic} element. The \element{comic} element can in turn ! contain several other elements such as \element{writer} and ! \element{penciller} listing the writer and artists responsible for the ! issue. There may be several artists or writers for a single issue. Let's start off with something simple: a document handler named *************** *** 317,332 **** \class{ContentHandler}, \class{DTDHandler}, \class{EntityResolver}, and \class{ErrorHandler}. This is what you should use if you want to ! use one class for everything. When you want separate classes for each ! purpose, or if you want to implement only a single interface, you can ! just subclass each interface individually. Neither of the two ! approaches is always ``better'' than the other; their suitability ! depends on what you're trying to do, and on what you prefer. ! Since this class is doing a search, an instance needs to know what to ! search for. 
The desired title and issue number are passed to the \class{FindIssue} constructor, and stored as part of the instance. ! Now let's look at the function which actually does all the work. ! This simple task only requires looking at the attributes of a given element, so only the \method{startElement} method is relevant. --- 317,331 ---- \class{ContentHandler}, \class{DTDHandler}, \class{EntityResolver}, and \class{ErrorHandler}. This is what you should use if you want to ! just write a single class that wraps up all the logic for your ! parsing. You could also subclass each interface individually and ! implement separate classes for each purpose. Neither of the two ! approaches is always ``better'' than the other; mostly it's a matter ! of taste. ! Since this class is doing a search, an instance needs to know what it's searching for. The desired title and issue number are passed to the \class{FindIssue} constructor, and stored as part of the instance. ! Now let's override some of the parsing methods. ! This simple search only requires looking at the attributes of a given element, so only the \method{startElement} method is relevant. *************** *** 339,353 **** title = attrs.get('title', None) number = attrs.get('number', None) ! if title == self.search_title and number == self.search_number: ! print title, '#'+str(number), 'found' \end{verbatim} The \method{startElement()} method is passed a string giving the name of the element, and an instance containing the element's attributes. ! The latter implements the \class{AttributeList} interface, which ! includes most of the semantics of Python dictionaries. Therefore, the ! function looks for \element{comic} elements, and compares the ! specified \attribute{title} and \attribute{number} attributes to the ! search values. If they match, a message is printed out. \method{startElement()} is called for every single element in the --- 338,356 ---- title = attrs.get('title', None) number = attrs.get('number', None) ! 
if (title == self.search_title and ! number == self.search_number): ! print title, '#' + str(number), 'found' \end{verbatim} The \method{startElement()} method is passed a string giving the name of the element, and an instance containing the element's attributes. ! Attributes are accessed using ! methods from the \class{AttributeList} interface, which ! includes most of the semantics of Python dictionaries. ! ! To summarize, \method{startElement()} function looks for ! \element{comic} elements, and compares the specified \attribute{title} ! and \attribute{number} attributes to the search values. If they ! match, a message is printed out. \method{startElement()} is called for every single element in the *************** *** 363,369 **** \end{verbatim} ! To actually use the class, we need top-level code that creates ! instances of a parser and of \class{FindIssue}, associates them, and ! then calls a parser method to process the input. \begin{verbatim} --- 366,372 ---- \end{verbatim} ! To actually use the class, we need top-level code that creates ! instances of a parser and of \class{FindIssue}, associates the parser ! and the handler, and then calls a parser method to process the input. \begin{verbatim} *************** *** 374,377 **** --- 377,381 ---- # Create a parser parser = make_parser() + # Tell the parser we are not interested in XML namespaces parser.setFeature(feature_namespaces, 0) *************** *** 389,416 **** The \function{make_parser} class can automate the job of creating parsers. There are already several XML parsers available to Python, ! and more might be added in future. \file{xmllib.py} is included with ! Python 1.5, so it's always available, but it's also not particularly ! fast. A faster version of \file{xmllib.py} is included in ! \module{xml.parsers}. The \module{xml.parsers.expat} module is faster ! still, so it's obviously a preferred choice if it's available. ! \function{make_parser} determines which parsers are available and ! 
chooses the fastest one, so you don't have to know what the different ! parsers are, or how they differ. (You can also tell \function{make_parser} to try a list of parsers, if you want to use a specific one). ! In SAX2, XML namespace are supported. Parsers will not call ! \method{startElement}, but \method{startElementNS} if namespace ! processing is active. Since our content handler does not implement the ! namespace-aware methods, we request that namespace processing is ! deactivated. The default of this setting varies from parser to parser, ! so you should always set it to a safe value -- unless your handlers ! support either method. ! ! Once you've created a parser instance, calling ! \method{setContentHandler} tells the parser what to use as the ! handler. ! If you run the above code with the sample XML document, it'll output \code{Sandman \#62 found.} --- 393,423 ---- The \function{make_parser} class can automate the job of creating parsers. There are already several XML parsers available to Python, ! and more might be added in future. \file{xmllib.py} is included as ! part of the Python standard library, so it's always available, but ! it's also not particularly fast. A faster version of \file{xmllib.py} ! is included in \module{xml.parsers}. The \module{xml.parsers.expat} ! module is faster still, so it's obviously a preferred choice if it's ! available. \function{make_parser} determines which parsers are ! available and chooses the fastest one, so you don't have to know what ! the different parsers are, or how they differ. (You can also tell \function{make_parser} to try a list of parsers, if you want to use a specific one). ! SAX2 supports XML namespaces. If namespace processing is active, ! parsers won't call \method{startElement()}, but instead will call a ! method named \method{startElementNS()}. Since our \class{FindIssue} ! content handler doesn't implement the namespace-aware methods, we ! request that namespace processing is deactivated. 
The default of this ! setting varies from parser to parser, so you should always set it to a ! safe value (unless your handler supports both namespace-aware and ! -unaware processing). ! ! Once you've created a parser instance, calling the ! \method{setContentHandler()} method tells the parser what to use as ! the content handler. There are similar methods for setting the other ! handlers: \method{setDTDHandler()}, \method{setEntityResolver()}, and ! \method{setErrorHandler()}. ! If you run the above code with the sample XML document, it'll print \code{Sandman \#62 found.} *************** *** 427,431 **** The \code{\&foo;} entity is unknown, and the \element{comic} element isn't closed (if it was empty, there would be a \samp{/} before the ! closing \samp{>}. As a result, you get a SAXParseException, e.g. \begin{verbatim} --- 434,439 ---- The \code{\&foo;} entity is unknown, and the \element{comic} element isn't closed (if it was empty, there would be a \samp{/} before the ! closing \samp{>}. As a result, you get a ! \exception{SAXParseException}, e.g. \begin{verbatim} *************** *** 434,451 **** The default code for the \class{ErrorHandler} interface automatically ! raises an exception for any error; if that is what you want in case of ! an error, you don't need to change the error handler. Otherwise, you ! should provide your own version of the \class{ErrorHandler} interface, ! and at minimum override the \method{error()} and \method{fatalError()} methods. The minimal implementation for each method can be a single line. The methods in the \class{ErrorHandler} ! interface--\method{warning}, \method{error}, and ! \method{fatalError}--are all passed a single argument, an exception instance. The exception will always be a subclass of \exception{SAXException}, and calling \code{str()} on it will produce a readable error message explaining the problem. ! So, to re-implement a variant of \class{ErrorRaiser}, simply define ! 
one of the three methods to print the exception they're passed: \begin{verbatim} --- 442,460 ---- The default code for the \class{ErrorHandler} interface automatically ! raises an exception for any error; if that is what you want, you don't ! need to implement an error handler class at all. Otherwise, you can ! provide your own version of the \class{ErrorHandler} interface, at ! minimum overriding the \method{error()} and \method{fatalError()} methods. The minimal implementation for each method can be a single line. The methods in the \class{ErrorHandler} ! interface---\method{warning()}, \method{error()}, and ! \method{fatalError()}---are all passed a single argument, an exception instance. The exception will always be a subclass of \exception{SAXException}, and calling \code{str()} on it will produce a readable error message explaining the problem. ! For example, if you just want to continue running if a recoverable ! error occurs, simply define the \method{error()} method to print the ! exception it's passed: \begin{verbatim} *************** *** 460,464 **** \subsection{Searching Element Content} ! Let's tackle a slightly more complicated task, printing out all issues written by a certain author. This now requires looking at element content, because the writer's name is inside a \element{writer} --- 469,473 ---- \subsection{Searching Element Content} ! Let's tackle a slightly more complicated task: printing out all issues written by a certain author. This now requires looking at element content, because the writer's name is inside a \element{writer} *************** *** 526,531 **** flexibly; you can include extra spaces or newlines wherever you like. This means that you must normalize the whitespace before comparing ! attribute values or element content; otherwise the comparision might ! produce a wrong result due to the content of two elements having different amounts of whitespace. 
--- 535,540 ---- flexibly; you can include extra spaces or newlines wherever you like. This means that you must normalize the whitespace before comparing ! attribute values or element content; otherwise the comparison might ! produce an incorrect result due to the content of two elements having different amounts of whitespace. *************** *** 545,556 **** single function call. In the example above, there might be only one call to \method{characters()} for the string \samp{Peter Milligan}, or ! it might call \method{characters()} once for each character. More ! realistically, if the content contains an entity reference, as in ! \samp{Wagner ! \& Seagle}, the parser might call the method three times; once for ! \samp{Wagner\ }, once for \samp{\&}, represented by the entity ! reference, and again for \samp{\ Seagle}. ! For step 2 of \class{FindWriter}, \method{characters()} only has to check \code{inWriterContent}, and if it's true, add the characters to the string being built up. --- 554,564 ---- single function call. In the example above, there might be only one call to \method{characters()} for the string \samp{Peter Milligan}, or ! it might call \method{characters()} once for each character. Another, ! more realistic example: if the content contains an entity reference, ! as in \samp{Wagner \& Seagle}, the parser might call the method ! three times; once for \samp{Wagner\ }, once for \samp{\&}, represented ! by the entity reference, and again for \samp{\ Seagle}. ! For step 2 of the algorithm, \method{characters()} only has to check \code{inWriterContent}, and if it's true, add the characters to the string being built up. *************** *** 571,580 **** \function{normalize_whitespace()} function is called. This can be done because we know that leading and trailing whitespace are ! insignificant for this element, in this DTD. End tags can't have attributes on them, so there's no \var{attrs} ! parameter. Empty elements with attributes, such as \samp{<arc ! 
name="Season of Mists"/>}, will result in a call to ! \method{startElement()}, followed immediately by a call to \method{endElement()}. XXX how are external entities handled? Anything special need to be --- 579,589 ---- \function{normalize_whitespace()} function is called. This can be done because we know that leading and trailing whitespace are ! insignificant for this application. End tags can't have attributes on them, so there's no \var{attrs} ! parameter to the \method{endElement()} method. Empty elements with ! attributes, such as \samp{<arc name="Season of Mists"/>}, will result ! in a call to \method{startElement()}, followed immediately by a call ! to \method{endElement()}. XXX how are external entities handled? Anything special need to be *************** *** 583,594 **** \subsection{Related Links} ! \begin{definitions} ! \term{\url{http://www.megginson.com/SAX/}} ! % ! The SAX home page. This has the most recent copy of the specification, and lists SAX implementations for various languages and ! platforms. At the moment it's somewhat Java-centric. - \end{definitions} \section{DOM: The Document Object Model} --- 592,600 ---- \subsection{Related Links} ! The SAX home page is \url{http://www.megginson.com/SAX/}. ! This has the most recent copy of the specification, and lists SAX implementations for various languages and ! platforms. At the moment it's somewhat Java-centric, though. \section{DOM: The Document Object Model}
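Illustrative note, not part of the committed diff: the FindIssue walkthrough in the revised SAX section can be sketched end to end roughly as follows. This is a minimal sketch in present-day Python 3 (the HOWTO text itself uses the older print statement), and several things in it are assumptions rather than content from xml-howto.tex: the SAMPLE document is a made-up stand-in for the HOWTO's sample collection, the find_issue() helper and the found list are hypothetical additions so the result can be inspected, and the issue number is compared as a string because SAX hands attribute values to the handler as strings.

```python
import io
from xml.sax import make_parser
from xml.sax.handler import ContentHandler, feature_namespaces


class FindIssue(ContentHandler):
    """Report comic elements whose title and number match the search values."""

    def __init__(self, title, number):
        super().__init__()
        self.search_title = title
        self.search_number = number
        # Hypothetical addition: remember matches so callers can inspect them.
        self.found = []

    def startElement(self, name, attrs):
        # Only comic elements carry the attributes we care about.
        if name != 'comic':
            return
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if (title == self.search_title and
                number == self.search_number):
            self.found.append((title, number))
            print(title, '#' + str(number), 'found')


# Made-up sample document (an assumption, not the HOWTO's actual sample).
SAMPLE = """<collection>
  <comic title="Sandman" number="62">
    <writer>A. Writer</writer>
    <penciller>A. Penciller</penciller>
  </comic>
</collection>"""


def find_issue(xml_text, title, number):
    """Hypothetical helper: parse xml_text and return the matching issues."""
    parser = make_parser()
    # Tell the parser we are not interested in XML namespaces, so it
    # calls startElement() rather than startElementNS().
    parser.setFeature(feature_namespaces, 0)
    handler = FindIssue(title, number)
    parser.setContentHandler(handler)
    # A byte stream is used here; parse() also accepts a filename.
    parser.parse(io.BytesIO(xml_text.encode('utf-8')))
    return handler.found
```

Calling find_issue(SAMPLE, 'Sandman', '62') prints "Sandman #62 found" and returns the single match; searching for an issue number that is not in the document returns an empty list.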