[Py-howto-checkins] CVS: pyhowto xml-howto.tex,1.10,1.11

Update of /cvsroot/py-howto/pyhowto
In directory usw-pr-cvs1:/tmp/cvs-serv875

Modified Files:
	xml-howto.tex 
Log Message:
Rewrite pass over SAX section

Index: xml-howto.tex
===================================================================
RCS file: /cvsroot/py-howto/pyhowto/xml-howto.tex,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -r1.10 -r1.11
*** xml-howto.tex	2001/03/21 01:17:00	1.10
--- xml-howto.tex	2001/09/20 02:34:28	1.11
***************
*** 187,213 ****
  \label{SAX}

! The Simple API for XML isn't a standard in the formal sense, but an
! informal specification designed by David Megginson, with input from
! many people on the xml-dev mailing list.  SAX defines an event-driven
! interface for parsing XML.  To use SAX, you must create Python class
! instances which implement a specified interface, and the parser will
! then call various methods of those objects.
! 
! This howto describes version 2 of SAX (also referred to as
! SAX2). Earlier versions of this text did explain SAX1, which is
! primarily of historical interest only.

  SAX is most suitable for purposes where you want to read through an
  entire XML document from beginning to end, and perform some
! computation, such as building a data structure representating a
! document, or summarizing information in a document (computing an
! average value of a certain element, for example).  It's not very
! useful if you want to modify the document structure in some
! complicated way that involves changing how elements are nested, though
! it could be used if you simply wish to change element contents or
! attributes.  For example, you would not want to re-order chapters in a
! book using SAX, but you might want to change the contents of any
! \element{name} elements with the attribute \attribute{lang} equal to
! 'greek' into Greek letters.

  One advantage of SAX is speed and simplicity.  Let's say
--- 187,213 ----
  \label{SAX}

! The Simple API for XML isn't a standard in the formal sense that XML
! or ANSI C are.  Rather, SAX is an informal specification originally
! designed by David Megginson with input from many people on the xml-dev
! mailing list.  SAX defines an event-driven interface for parsing XML.
! To use SAX, you must create Python class instances which implement a
! specified interface, and the parser will then call various methods on
! those objects.
! 
! This HOWTO describes version 2 of SAX (also referred to as
! SAX2). Earlier versions of this text explained SAX1, which is now only
! of historical interest.

  SAX is most suitable for purposes where you want to read through an
  entire XML document from beginning to end, and perform some
! computation such as building a data structure or summarizing the
! contained information (computing an average value of a certain
! element, for example).  SAX is not very convenient if you want to
! modify the document structure by changing how elements are nested,
! though it would be straightforward to write a SAX program that simply
! changed element contents or attributes.  For example, you wouldn't
! want to re-order chapters in a book using SAX, but you might want to
! extract the contents of all \element{name} elements with the attribute
! \attribute{lang} set to 'greek'.

  One advantage of SAX is speed and simplicity.  Let's say
***************
*** 219,227 ****
  instance which ignores all elements that aren't \element{writer}.

! Another advantage is that you don't have the whole document resident
! in memory at any one time, which matters if you are processing really
! huge documents.

! SAX defines 4 basic interfaces; an SAX-compliant XML parser can be
  passed any objects that support these interfaces, and will call
  various methods as data is processed.  Your task, therefore, is to
--- 219,227 ----
  instance which ignores all elements that aren't \element{writer}.

! Another advantage of SAX is that you don't have the whole document
! resident in memory at any one time, which matters if you are
! processing really huge documents.  

! SAX defines 4 basic interfaces. A SAX-compliant XML parser can be
  passed any objects that support these interfaces, and will call
  various methods as data is processed.  Your task, therefore, is to
***************
*** 245,249 ****
  \lineii{EntityResolver}{Called to resolve references to external
  entities.  If your documents will have no external entity references,
! you won't need to implement this interface. }

  \lineii{ErrorHandler}{Called for error handling.  The parser will call
--- 245,249 ----
  \lineii{EntityResolver}{Called to resolve references to external
  entities.  If your documents will have no external entity references,
! you don't need to implement this interface.}

  \lineii{ErrorHandler}{Called for error handling.  The parser will call
***************
*** 255,259 ****
  listed above are implemented as Python classes.  The default method
  implementations are defined to do nothing---the method body is just a
! Python \code{pass} statement--so usually you can simply ignore methods
  that aren't relevant to your application. 

--- 255,259 ----
  listed above are implemented as Python classes.  The default method
  implementations are defined to do nothing---the method body is just a
! Python \code{pass} statement---so usually you can simply ignore methods
  that aren't relevant to your application. 

***************
*** 261,265 ****
  \begin{verbatim}
  # Define your specialized handler classes
! from xml.sax import Contenthandler, ...
  class docHandler(ContentHandler):
      ...
--- 261,265 ----
  \begin{verbatim}
  # Define your specialized handler classes
! from xml.sax import ContentHandler, ...
  class docHandler(ContentHandler):
      ...
***************
*** 274,287 ****
  parser.setContentHandler(dh)

! # Parse the file; your handler's method will get called
  parser.parse(sys.stdin)
- 
  \end{verbatim}

  \subsection{Starting Out}

! Following the earlier example, let's consider a simple XML format for
! storing information about a comic book collection.  Here's a sample
! document for a collection consisting of a single issue:

  \begin{verbatim}
--- 274,286 ----
  parser.setContentHandler(dh)

! # Parse the file; your handler's methods will get called
  parser.parse(sys.stdin)
  \end{verbatim}

  \subsection{Starting Out}

! Let's follow the earlier example of a comic book collection, using a
! simple DTD-less format. Here's a sample document for a collection
! consisting of a single issue:

  \begin{verbatim}
***************
*** 298,304 ****
  \samp{collection} element.  It has one child \element{comic} element
  for each issue; the book's title and number are given as attributes of
! the \element{comic} element, which can have one or more children
! containing the issue's writer and artists.  There may be several
! artists or writers for a single issue.

  Let's start off with something simple: a document handler named
--- 297,304 ----
  \samp{collection} element.  It has one child \element{comic} element
  for each issue; the book's title and number are given as attributes of
! the \element{comic} element.  The \element{comic} element can in turn
! contain several other elements such as \element{writer} and
! \element{penciller} listing the writer and artists responsible for the
! issue.  There may be several artists or writers for a single issue.

  Let's start off with something simple: a document handler named
***************
*** 317,332 ****
  \class{ContentHandler}, \class{DTDHandler}, \class{EntityResolver},
  and \class{ErrorHandler}.  This is what you should use if you want to
! use one class for everything.  When you want separate classes for each
! purpose, or if you want to implement only a single interface, you can
! just subclass each interface individually.  Neither of the two
! approaches is always ``better'' than the other; their suitability
! depends on what you're trying to do, and on what you prefer.

! Since this class is doing a search, an instance needs to know what to
! search for.  The desired title and issue number are passed to the
  \class{FindIssue} constructor, and stored as part of the instance.

! Now let's look at the function which actually does all the work.
! This simple task only requires looking at the attributes of a given
  element, so only the \method{startElement} method is relevant.

--- 317,331 ----
  \class{ContentHandler}, \class{DTDHandler}, \class{EntityResolver},
  and \class{ErrorHandler}.  This is what you should use if you want to
! just write a single class that wraps up all the logic for your
! parsing.  You could also subclass each interface individually and
! implement separate classes for each purpose.  Neither of the two
! approaches is always ``better'' than the other; mostly it's a matter
! of taste.

! Since this class is doing a search, an instance needs to know what it's searching for.  The desired title and issue number are passed to the
  \class{FindIssue} constructor, and stored as part of the instance.

! Now let's override some of the parsing methods.
! This simple search only requires looking at the attributes of a given
  element, so only the \method{startElement} method is relevant.

***************
*** 339,353 ****
          title = attrs.get('title', None)
          number = attrs.get('number', None)
!         if title == self.search_title and number == self.search_number:
!             print title, '#'+str(number), 'found'
  \end{verbatim}

  The \method{startElement()} method is passed a string giving the name
  of the element, and an instance containing the element's attributes.
! The latter implements the \class{AttributeList} interface, which
! includes most of the semantics of Python dictionaries.  Therefore, the 
! function looks for \element{comic} elements, and compares the
! specified \attribute{title} and \attribute{number} attributes to the
! search values.  If they match, a message is printed out.

  \method{startElement()} is called for every single element in the
--- 338,356 ----
          title = attrs.get('title', None)
          number = attrs.get('number', None)
!         if (title == self.search_title and
!             number == self.search_number):
!             print title, '#' + str(number), 'found'
  \end{verbatim}

  The \method{startElement()} method is passed a string giving the name
  of the element, and an instance containing the element's attributes.
! Attributes are accessed using 
! methods from the \class{AttributeList} interface, which
! includes most of the semantics of Python dictionaries.  
! 
! To summarize, the \method{startElement()} method looks for
! \element{comic} elements, and compares the specified \attribute{title}
! and \attribute{number} attributes to the search values.  If they
! match, a message is printed out.

  \method{startElement()} is called for every single element in the
***************
*** 363,369 ****
  \end{verbatim}

! To actually use the class, we need top-level code that creates 
! instances of a parser and of \class{FindIssue}, associates them, and
! then calls a parser method to process the input.

  \begin{verbatim}
--- 366,372 ----
  \end{verbatim}

! To actually use the class, we need top-level code that creates
! instances of a parser and of \class{FindIssue}, associates the parser
! and the handler, and then calls a parser method to process the input.

  \begin{verbatim}
***************
*** 374,377 ****
--- 377,381 ----
      # Create a parser
      parser = make_parser()
+ 
      # Tell the parser we are not interested in XML namespaces
      parser.setFeature(feature_namespaces, 0)
***************
*** 389,416 ****
  The \function{make_parser} class can automate the job of creating
  parsers.  There are already several XML parsers available to Python,
! and more might be added in future.  \file{xmllib.py} is included with
! Python 1.5, so it's always available, but it's also not particularly
! fast.  A faster version of \file{xmllib.py} is included in
! \module{xml.parsers}.  The \module{xml.parsers.expat} module is faster
! still, so it's obviously a preferred choice if it's available.
! \function{make_parser} determines which parsers are available and
! chooses the fastest one, so you don't have to know what the different
! parsers are, or how they differ. (You can also tell
  \function{make_parser} to try a list of parsers, if you want to use a
  specific one).

! In SAX2, XML namespace are supported. Parsers will not call
! \method{startElement}, but \method{startElementNS} if namespace
! processing is active. Since our content handler does not implement the
! namespace-aware methods, we request that namespace processing is
! deactivated. The default of this setting varies from parser to parser,
! so you should always set it to a safe value -- unless your handlers
! support either method.
! 
! Once you've created a parser instance, calling
! \method{setContentHandler} tells the parser what to use as the
! handler.

! If you run the above code with the sample XML document, it'll output
  \code{Sandman \#62 found.}  

--- 393,423 ----
  The \function{make_parser} class can automate the job of creating
  parsers.  There are already several XML parsers available to Python,
! and more might be added in future.  \file{xmllib.py} is included as
! part of the Python standard library, so it's always available, but
! it's also not particularly fast.  A faster version of \file{xmllib.py}
! is included in \module{xml.parsers}.  The \module{xml.parsers.expat}
! module is faster still, so it's obviously a preferred choice if it's
! available.  \function{make_parser} determines which parsers are
! available and chooses the fastest one, so you don't have to know what
! the different parsers are, or how they differ. (You can also tell
  \function{make_parser} to try a list of parsers, if you want to use a
  specific one).

! SAX2 supports XML namespaces.  If namespace processing is active,
! parsers won't call \method{startElement()}, but instead will call a
! method named \method{startElementNS()}. Since our \class{FindIssue}
! content handler doesn't implement the namespace-aware methods, we
! request that namespace processing is deactivated.  The default of this
! setting varies from parser to parser, so you should always set it to a
! safe value (unless your handler supports both namespace-aware and
! -unaware processing).
! 
! Once you've created a parser instance, calling the
! \method{setContentHandler()} method tells the parser what to use as
! the content handler.  There are similar methods for setting the other
! handlers: \method{setDTDHandler()}, \method{setEntityResolver()}, and
! \method{setErrorHandler()}.

! If you run the above code with the sample XML document, it'll print
  \code{Sandman \#62 found.}  

***************
*** 427,431 ****
  The \code{\&foo;} entity is unknown, and the \element{comic} element
  isn't closed (if it was empty, there would be a \samp{/} before the
! closing \samp{>}. As a result, you get a SAXParseException, e.g.

  \begin{verbatim}
--- 434,439 ----
  The \code{\&foo;} entity is unknown, and the \element{comic} element
  isn't closed (if it was empty, there would be a \samp{/} before the
! closing \samp{>}).  As a result, you get a
! \exception{SAXParseException}, e.g.

  \begin{verbatim}
***************
*** 434,451 ****

  The default code for the \class{ErrorHandler} interface automatically
! raises an exception for any error; if that is what you want in case of
! an error, you don't need to change the error handler.  Otherwise, you
! should provide your own version of the \class{ErrorHandler} interface,
! and at minimum override the \method{error()} and \method{fatalError()}
  methods.  The minimal implementation for each method can be a single
  line.  The methods in the \class{ErrorHandler}
! interface--\method{warning}, \method{error}, and
! \method{fatalError}--are all passed a single argument, an exception
  instance.  The exception will always be a subclass of
  \exception{SAXException}, and calling \code{str()} on it will produce
  a readable error message explaining the problem.

! So, to re-implement a variant of \class{ErrorRaiser}, simply define
! one of the three methods to print the exception they're passed:

  \begin{verbatim}
--- 442,460 ----

  The default code for the \class{ErrorHandler} interface automatically
! raises an exception for any error; if that is what you want, you don't
! need to implement an error handler class at all.  Otherwise, you can
! provide your own version of the \class{ErrorHandler} interface, at
! minimum overriding the \method{error()} and \method{fatalError()}
  methods.  The minimal implementation for each method can be a single
  line.  The methods in the \class{ErrorHandler}
! interface---\method{warning()}, \method{error()}, and
! \method{fatalError()}---are all passed a single argument, an exception
  instance.  The exception will always be a subclass of
  \exception{SAXException}, and calling \code{str()} on it will produce
  a readable error message explaining the problem.

! For example, if you just want to continue running if a recoverable
! error occurs, simply define the \method{error()} method to print the
! exception it's passed:

  \begin{verbatim}
***************
*** 460,464 ****
  \subsection{Searching Element Content}

! Let's tackle a slightly more complicated task, printing out all issues
  written by a certain author.  This now requires looking at element
  content, because the writer's name is inside a \element{writer}
--- 469,473 ----
  \subsection{Searching Element Content}

! Let's tackle a slightly more complicated task: printing out all issues
  written by a certain author.  This now requires looking at element
  content, because the writer's name is inside a \element{writer}
***************
*** 526,531 ****
  flexibly; you can include extra spaces or newlines wherever you like.
  This means that you must normalize the whitespace before comparing
! attribute values or element content; otherwise the comparision might
! produce a wrong result due to the content of two elements having
  different amounts of whitespace.

--- 535,540 ----
  flexibly; you can include extra spaces or newlines wherever you like.
  This means that you must normalize the whitespace before comparing
! attribute values or element content; otherwise the comparison might
! produce an incorrect result due to the content of two elements having
  different amounts of whitespace.

***************
*** 545,556 ****
  single function call.  In the example above, there might be only one
  call to \method{characters()} for the string \samp{Peter Milligan}, or
! it might call \method{characters()} once for each character.  More
! realistically, if the content contains an entity reference, as in
! \samp{Wagner
! \&amp; Seagle}, the parser might call the method three times; once for 
! \samp{Wagner\ }, once for \samp{\&}, represented by the entity
! reference, and again for \samp{\ Seagle}.

! For step 2 of \class{FindWriter}, \method{characters()} only has to
  check \code{inWriterContent}, and if it's true, add the characters to
  the string being built up.
--- 554,564 ----
  single function call.  In the example above, there might be only one
  call to \method{characters()} for the string \samp{Peter Milligan}, or
! it might call \method{characters()} once for each character.  Another,
! more realistic example: if the content contains an entity reference,
! as in \samp{Wagner \&amp; Seagle}, the parser might call the method
! three times: once for \samp{Wagner\ }, once for \samp{\&}, represented
! by the entity reference, and again for \samp{\ Seagle}.

! For step 2 of the algorithm, \method{characters()} only has to
  check \code{inWriterContent}, and if it's true, add the characters to
  the string being built up.
***************
*** 571,580 ****
  \function{normalize_whitespace()} function is called.  This can be
  done because we know that leading and trailing whitespace are
! insignificant for this element, in this DTD.  

  End tags can't have attributes on them, so there's no \var{attrs}
! parameter.  Empty elements with attributes, such as \samp{<arc
! name="Season of Mists"/>}, will result in a call to
! \method{startElement()}, followed immediately by a call to \method{endElement()}.

  XXX how are external entities handled?  Anything special need to be
--- 579,589 ----
  \function{normalize_whitespace()} function is called.  This can be
  done because we know that leading and trailing whitespace are
! insignificant for this application.  

  End tags can't have attributes on them, so there's no \var{attrs}
! parameter to the \method{endElement()} method.  Empty elements with
! attributes, such as \samp{<arc name="Season of Mists"/>}, will result
! in a call to \method{startElement()}, followed immediately by a call
! to \method{endElement()}.

  XXX how are external entities handled?  Anything special need to be
***************
*** 583,594 ****
  \subsection{Related Links}

! \begin{definitions}
! \term{\url{http://www.megginson.com/SAX/}}
! %
! The SAX home page.  This has the most recent copy of the
  specification, and lists SAX implementations for various languages and
! platforms.  At the moment it's somewhat Java-centric.

- \end{definitions}

  \section{DOM: The Document Object Model}
--- 592,600 ----
  \subsection{Related Links}

! The SAX home page is \url{http://www.megginson.com/SAX/}.
! This has the most recent copy of the
  specification, and lists SAX implementations for various languages and
! platforms.  At the moment it's somewhat Java-centric, though.

  \section{DOM: The Document Object Model}
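
For readers following the revised SAX section, here is a minimal sketch of the FindIssue handler that the text describes, assembled as one runnable Python 3 script. It is not part of this commit: the howto itself targets the Python of its day (hence the print statement in the hunk), and the sample document, the find_issue() helper, and the matches list are assumptions added purely for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class FindIssue(ContentHandler):
    """Look for a <comic> element with a given title and issue number."""

    def __init__(self, title, number):
        super().__init__()
        self.search_title = title
        self.search_number = number
        # Assumption: collect hits in a list so callers can inspect them;
        # the howto's version only prints a message.
        self.matches = []

    def startElement(self, name, attrs):
        # Only <comic> elements carry the attributes we care about.
        if name != 'comic':
            return
        title = attrs.get('title', None)
        number = attrs.get('number', None)
        if (title == self.search_title and
                number == self.search_number):
            self.matches.append((title, number))
            print(title, '#' + str(number), 'found')


def find_issue(xml_bytes, title, number):
    """Hypothetical helper: parse xml_bytes and return matching issues."""
    parser = xml.sax.make_parser()
    # Our handler implements startElement(), not startElementNS(),
    # so namespace processing is switched off explicitly.
    parser.setFeature(feature_namespaces, 0)
    handler = FindIssue(title, number)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(xml_bytes))
    return handler.matches


# Made-up sample in the spirit of the howto's comic collection format.
SAMPLE = b"""<collection>
  <comic title="Sandman" number="62">
    <writer>Neil Gaiman</writer>
  </comic>
</collection>"""

if __name__ == '__main__':
    find_issue(SAMPLE, 'Sandman', '62')
```

Attribute values arrive as strings, so the issue number is passed as '62' rather than 62. Returning the matches from the helper, instead of relying on printed output alone, makes the handler easier to check from other code.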