From: A.M. K. <aku...@us...> - 2001-09-20 02:34:32
Update of /cvsroot/py-howto/pyhowto
In directory usw-pr-cvs1:/tmp/cvs-serv875
Modified Files:
xml-howto.tex
Log Message:
Rewrite pass over SAX section
Index: xml-howto.tex
===================================================================
RCS file: /cvsroot/py-howto/pyhowto/xml-howto.tex,v
retrieving revision 1.10
retrieving revision 1.11
diff -C2 -r1.10 -r1.11
*** xml-howto.tex 2001/03/21 01:17:00 1.10
--- xml-howto.tex 2001/09/20 02:34:28 1.11
***************
*** 187,213 ****
\label{SAX}
! The Simple API for XML isn't a standard in the formal sense, but an
! informal specification designed by David Megginson, with input from
! many people on the xml-dev mailing list. SAX defines an event-driven
! interface for parsing XML. To use SAX, you must create Python class
! instances which implement a specified interface, and the parser will
! then call various methods of those objects.
!
! This howto describes version 2 of SAX (also referred to as
! SAX2). Earlier versions of this text did explain SAX1, which is
! primarily of historical interest only.
SAX is most suitable for purposes where you want to read through an
entire XML document from beginning to end, and perform some
! computation, such as building a data structure representating a
! document, or summarizing information in a document (computing an
! average value of a certain element, for example). It's not very
! useful if you want to modify the document structure in some
! complicated way that involves changing how elements are nested, though
! it could be used if you simply wish to change element contents or
! attributes. For example, you would not want to re-order chapters in a
! book using SAX, but you might want to change the contents of any
! \element{name} elements with the attribute \attribute{lang} equal to
! 'greek' into Greek letters.
One advantage of SAX is speed and simplicity. Let's say
--- 187,213 ----
\label{SAX}
! The Simple API for XML isn't a standard in the formal sense that XML
! or ANSI C are. Rather, SAX is an informal specification originally
! designed by David Megginson with input from many people on the xml-dev
! mailing list. SAX defines an event-driven interface for parsing XML.
! To use SAX, you must create Python class instances which implement a
! specified interface, and the parser will then call various methods on
! those objects.
!
! This HOWTO describes version 2 of SAX (also referred to as
! SAX2). Earlier versions of this text explained SAX1, which is now only
! of historical interest.
SAX is most suitable for purposes where you want to read through an
entire XML document from beginning to end, and perform some
! computation such as building a data structure or summarizing the
! contained information (computing an average value of a certain
! element, for example). SAX is not very convenient if you want to
! modify the document structure by changing how elements are nested,
! though it would be straightforward to write a SAX program that simply
! changed element contents or attributes. For example, you wouldn't
! want to re-order chapters in a book using SAX, but you might want to
! extract the contents of all \element{name} elements with the attribute
! \attribute{lang} set to 'greek'.
One advantage of SAX is speed and simplicity. Let's say
***************
*** 219,227 ****
instance which ignores all elements that aren't \element{writer}.
! Another advantage is that you don't have the whole document resident
! in memory at any one time, which matters if you are processing really
! huge documents.
! SAX defines 4 basic interfaces; an SAX-compliant XML parser can be
passed any objects that support these interfaces, and will call
various methods as data is processed. Your task, therefore, is to
--- 219,227 ----
instance which ignores all elements that aren't \element{writer}.
! Another advantage of SAX is that you don't have the whole document
! resident in memory at any one time, which matters if you are
! processing really huge documents.
! SAX defines 4 basic interfaces. A SAX-compliant XML parser can be
passed any objects that support these interfaces, and will call
various methods as data is processed. Your task, therefore, is to
***************
*** 245,249 ****
\lineii{EntityResolver}{Called to resolve references to external
entities. If your documents will have no external entity references,
! you won't need to implement this interface. }
\lineii{ErrorHandler}{Called for error handling. The parser will call
--- 245,249 ----
\lineii{EntityResolver}{Called to resolve references to external
entities. If your documents will have no external entity references,
! you don't need to implement this interface.}
\lineii{ErrorHandler}{Called for error handling. The parser will call
***************
*** 255,259 ****
listed above are implemented as Python classes. The default method
implementations are defined to do nothing---the method body is just a
! Python \code{pass} statement--so usually you can simply ignore methods
that aren't relevant to your application.
--- 255,259 ----
listed above are implemented as Python classes. The default method
implementations are defined to do nothing---the method body is just a
! Python \code{pass} statement---so usually you can simply ignore methods
that aren't relevant to your application.
***************
*** 261,265 ****
\begin{verbatim}
# Define your specialized handler classes
! from xml.sax import Contenthandler, ...
class docHandler(ContentHandler):
...
--- 261,265 ----
\begin{verbatim}
# Define your specialized handler classes
! from xml.sax import ContentHandler, ...
class docHandler(ContentHandler):
...
***************
*** 274,287 ****
parser.setContentHandler(dh)
! # Parse the file; your handler's method will get called
parser.parse(sys.stdin)
-
\end{verbatim}
\subsection{Starting Out}
! Following the earlier example, let's consider a simple XML format for
! storing information about a comic book collection. Here's a sample
! document for a collection consisting of a single issue:
\begin{verbatim}
--- 274,286 ----
parser.setContentHandler(dh)
! # Parse the file; your handler's methods will get called
parser.parse(sys.stdin)
\end{verbatim}
\subsection{Starting Out}
! Let's follow the earlier example of a comic book collection, using a
! simple DTD-less format. Here's a sample document for a collection
! consisting of a single issue:
\begin{verbatim}
***************
*** 298,304 ****
\samp{collection} element. It has one child \element{comic} element
for each issue; the book's title and number are given as attributes of
! the \element{comic} element, which can have one or more children
! containing the issue's writer and artists. There may be several
! artists or writers for a single issue.
Let's start off with something simple: a document handler named
--- 297,304 ----
\samp{collection} element. It has one child \element{comic} element
for each issue; the book's title and number are given as attributes of
! the \element{comic} element. The \element{comic} element can in turn
! contain several other elements such as \element{writer} and
! \element{penciller} listing the writer and artists responsible for the
! issue. There may be several artists or writers for a single issue.
Let's start off with something simple: a document handler named
***************
*** 317,332 ****
\class{ContentHandler}, \class{DTDHandler}, \class{EntityResolver},
and \class{ErrorHandler}. This is what you should use if you want to
! use one class for everything. When you want separate classes for each
! purpose, or if you want to implement only a single interface, you can
! just subclass each interface individually. Neither of the two
! approaches is always ``better'' than the other; their suitability
! depends on what you're trying to do, and on what you prefer.
! Since this class is doing a search, an instance needs to know what to
! search for. The desired title and issue number are passed to the
\class{FindIssue} constructor, and stored as part of the instance.
! Now let's look at the function which actually does all the work.
! This simple task only requires looking at the attributes of a given
element, so only the \method{startElement} method is relevant.
--- 317,331 ----
\class{ContentHandler}, \class{DTDHandler}, \class{EntityResolver},
and \class{ErrorHandler}. This is what you should use if you want to
! just write a single class that wraps up all the logic for your
! parsing. You could also subclass each interface individually and
! implement separate classes for each purpose. Neither of the two
! approaches is always ``better'' than the other; mostly it's a matter
! of taste.
! Since this class is doing a search, an instance needs to know what it's searching for. The desired title and issue number are passed to the
\class{FindIssue} constructor, and stored as part of the instance.
! Now let's override some of the parsing methods.
! This simple search only requires looking at the attributes of a given
element, so only the \method{startElement} method is relevant.
***************
*** 339,353 ****
title = attrs.get('title', None)
number = attrs.get('number', None)
! if title == self.search_title and number == self.search_number:
! print title, '#'+str(number), 'found'
\end{verbatim}
The \method{startElement()} method is passed a string giving the name
of the element, and an instance containing the element's attributes.
! The latter implements the \class{AttributeList} interface, which
! includes most of the semantics of Python dictionaries. Therefore, the
! function looks for \element{comic} elements, and compares the
! specified \attribute{title} and \attribute{number} attributes to the
! search values. If they match, a message is printed out.
\method{startElement()} is called for every single element in the
--- 338,356 ----
title = attrs.get('title', None)
number = attrs.get('number', None)
! if (title == self.search_title and
! number == self.search_number):
! print title, '#' + str(number), 'found'
\end{verbatim}
The \method{startElement()} method is passed a string giving the name
of the element, and an instance containing the element's attributes.
! Attributes are accessed using
! methods from the \class{AttributeList} interface, which
! includes most of the semantics of Python dictionaries.
!
! To summarize, the \method{startElement()} method looks for
! \element{comic} elements, and compares the specified \attribute{title}
! and \attribute{number} attributes to the search values. If they
! match, a message is printed out.
\method{startElement()} is called for every single element in the
***************
*** 363,369 ****
\end{verbatim}
! To actually use the class, we need top-level code that creates
! instances of a parser and of \class{FindIssue}, associates them, and
! then calls a parser method to process the input.
\begin{verbatim}
--- 366,372 ----
\end{verbatim}
! To actually use the class, we need top-level code that creates
! instances of a parser and of \class{FindIssue}, associates the parser
! and the handler, and then calls a parser method to process the input.
\begin{verbatim}
***************
*** 374,377 ****
--- 377,381 ----
# Create a parser
parser = make_parser()
+
# Tell the parser we are not interested in XML namespaces
parser.setFeature(feature_namespaces, 0)
***************
*** 389,416 ****
The \function{make_parser} function can automate the job of creating
parsers. There are already several XML parsers available to Python,
! and more might be added in future. \file{xmllib.py} is included with
! Python 1.5, so it's always available, but it's also not particularly
! fast. A faster version of \file{xmllib.py} is included in
! \module{xml.parsers}. The \module{xml.parsers.expat} module is faster
! still, so it's obviously a preferred choice if it's available.
! \function{make_parser} determines which parsers are available and
! chooses the fastest one, so you don't have to know what the different
! parsers are, or how they differ. (You can also tell
\function{make_parser} to try a list of parsers, if you want to use a
specific one).
! In SAX2, XML namespace are supported. Parsers will not call
! \method{startElement}, but \method{startElementNS} if namespace
! processing is active. Since our content handler does not implement the
! namespace-aware methods, we request that namespace processing is
! deactivated. The default of this setting varies from parser to parser,
! so you should always set it to a safe value -- unless your handlers
! support either method.
!
! Once you've created a parser instance, calling
! \method{setContentHandler} tells the parser what to use as the
! handler.
! If you run the above code with the sample XML document, it'll output
\code{Sandman \#62 found.}
--- 393,423 ----
The \function{make_parser} function can automate the job of creating
parsers. There are already several XML parsers available to Python,
! and more might be added in future. \file{xmllib.py} is included as
! part of the Python standard library, so it's always available, but
! it's also not particularly fast. A faster version of \file{xmllib.py}
! is included in \module{xml.parsers}. The \module{xml.parsers.expat}
! module is faster still, so it's obviously a preferred choice if it's
! available. \function{make_parser} determines which parsers are
! available and chooses the fastest one, so you don't have to know what
! the different parsers are, or how they differ. (You can also tell
\function{make_parser} to try a list of parsers, if you want to use a
specific one).
! SAX2 supports XML namespaces. If namespace processing is active,
! parsers won't call \method{startElement()}, but instead will call a
! method named \method{startElementNS()}. Since our \class{FindIssue}
! content handler doesn't implement the namespace-aware methods, we
! request that namespace processing is deactivated. The default of this
! setting varies from parser to parser, so you should always set it to a
! safe value (unless your handler supports both namespace-aware and
! -unaware processing).
!
! Once you've created a parser instance, calling the
! \method{setContentHandler()} method tells the parser what to use as
! the content handler. There are similar methods for setting the other
! handlers: \method{setDTDHandler()}, \method{setEntityResolver()}, and
! \method{setErrorHandler()}.
! If you run the above code with the sample XML document, it'll print
\code{Sandman \#62 found.}
***************
*** 427,431 ****
The \code{\&foo;} entity is unknown, and the \element{comic} element
isn't closed (if it was empty, there would be a \samp{/} before the
! closing \samp{>}. As a result, you get a SAXParseException, e.g.
\begin{verbatim}
--- 434,439 ----
The \code{\&foo;} entity is unknown, and the \element{comic} element
isn't closed (if it was empty, there would be a \samp{/} before the
! closing \samp{>}). As a result, you get a
! \exception{SAXParseException}, e.g.
\begin{verbatim}
***************
*** 434,451 ****
The default code for the \class{ErrorHandler} interface automatically
! raises an exception for any error; if that is what you want in case of
! an error, you don't need to change the error handler. Otherwise, you
! should provide your own version of the \class{ErrorHandler} interface,
! and at minimum override the \method{error()} and \method{fatalError()}
methods. The minimal implementation for each method can be a single
line. The methods in the \class{ErrorHandler}
! interface--\method{warning}, \method{error}, and
! \method{fatalError}--are all passed a single argument, an exception
instance. The exception will always be a subclass of
\exception{SAXException}, and calling \code{str()} on it will produce
a readable error message explaining the problem.
! So, to re-implement a variant of \class{ErrorRaiser}, simply define
! one of the three methods to print the exception they're passed:
\begin{verbatim}
--- 442,460 ----
The default code for the \class{ErrorHandler} interface automatically
! raises an exception for any error; if that is what you want, you don't
! need to implement an error handler class at all. Otherwise, you can
! provide your own version of the \class{ErrorHandler} interface, at
! minimum overriding the \method{error()} and \method{fatalError()}
methods. The minimal implementation for each method can be a single
line. The methods in the \class{ErrorHandler}
! interface---\method{warning()}, \method{error()}, and
! \method{fatalError()}---are all passed a single argument, an exception
instance. The exception will always be a subclass of
\exception{SAXException}, and calling \code{str()} on it will produce
a readable error message explaining the problem.
! For example, if you just want to continue running if a recoverable
! error occurs, simply define the \method{error()} method to print the
! exception it's passed:
\begin{verbatim}
***************
*** 460,464 ****
\subsection{Searching Element Content}
! Let's tackle a slightly more complicated task, printing out all issues
written by a certain author. This now requires looking at element
content, because the writer's name is inside a \element{writer}
--- 469,473 ----
\subsection{Searching Element Content}
! Let's tackle a slightly more complicated task: printing out all issues
written by a certain author. This now requires looking at element
content, because the writer's name is inside a \element{writer}
***************
*** 526,531 ****
flexibly; you can include extra spaces or newlines wherever you like.
This means that you must normalize the whitespace before comparing
! attribute values or element content; otherwise the comparision might
! produce a wrong result due to the content of two elements having
different amounts of whitespace.
--- 535,540 ----
flexibly; you can include extra spaces or newlines wherever you like.
This means that you must normalize the whitespace before comparing
! attribute values or element content; otherwise the comparison might
! produce an incorrect result due to the content of two elements having
different amounts of whitespace.
***************
*** 545,556 ****
single function call. In the example above, there might be only one
call to \method{characters()} for the string \samp{Peter Milligan}, or
! it might call \method{characters()} once for each character. More
! realistically, if the content contains an entity reference, as in
! \samp{Wagner
! \& Seagle}, the parser might call the method three times; once for
! \samp{Wagner\ }, once for \samp{\&}, represented by the entity
! reference, and again for \samp{\ Seagle}.
! For step 2 of \class{FindWriter}, \method{characters()} only has to
check \code{inWriterContent}, and if it's true, add the characters to
the string being built up.
--- 554,564 ----
single function call. In the example above, there might be only one
call to \method{characters()} for the string \samp{Peter Milligan}, or
! it might call \method{characters()} once for each character. Another,
! more realistic example: if the content contains an entity reference,
! as in \samp{Wagner \& Seagle}, the parser might call the method
! three times; once for \samp{Wagner\ }, once for \samp{\&}, represented
! by the entity reference, and again for \samp{\ Seagle}.
! For step 2 of the algorithm, \method{characters()} only has to
check \code{inWriterContent}, and if it's true, add the characters to
the string being built up.
***************
*** 571,580 ****
\function{normalize_whitespace()} function is called. This can be
done because we know that leading and trailing whitespace are
! insignificant for this element, in this DTD.
End tags can't have attributes on them, so there's no \var{attrs}
! parameter. Empty elements with attributes, such as \samp{<arc
! name="Season of Mists"/>}, will result in a call to
! \method{startElement()}, followed immediately by a call to \method{endElement()}.
XXX how are external entities handled? Anything special need to be
--- 579,589 ----
\function{normalize_whitespace()} function is called. This can be
done because we know that leading and trailing whitespace are
! insignificant for this application.
End tags can't have attributes on them, so there's no \var{attrs}
! parameter to the \method{endElement()} method. Empty elements with
! attributes, such as \samp{<arc name="Season of Mists"/>}, will result
! in a call to \method{startElement()}, followed immediately by a call
! to \method{endElement()}.
XXX how are external entities handled? Anything special need to be
***************
*** 583,594 ****
\subsection{Related Links}
! \begin{definitions}
! \term{\url{http://www.megginson.com/SAX/}}
! %
! The SAX home page. This has the most recent copy of the
specification, and lists SAX implementations for various languages and
! platforms. At the moment it's somewhat Java-centric.
- \end{definitions}
\section{DOM: The Document Object Model}
--- 592,600 ----
\subsection{Related Links}
! The SAX home page is \url{http://www.megginson.com/SAX/}.
! This has the most recent copy of the
specification, and lists SAX implementations for various languages and
! platforms. At the moment it's somewhat Java-centric, though.
\section{DOM: The Document Object Model}
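
The revised SAX text above describes a handler that watches for
\element{writer} content, accumulates the pieces passed to
characters(), and normalizes whitespace before comparing. A minimal
sketch of that flow follows. It assumes a current Python 3 with the
standard-library xml.sax package rather than the Python version this
revision targeted, so it is not the HOWTO's own listing. The class
name FindWriter and the normalize_whitespace() helper follow the names
used in the text; PrintErrors, find_issues_by_writer(), and the SAMPLE
document are hypothetical and exist only for illustration.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, ErrorHandler, feature_namespaces


def normalize_whitespace(text):
    """Collapse runs of whitespace to single spaces and strip both ends."""
    return " ".join(text.split())


class FindWriter(ContentHandler):
    """Record (title, number) for each comic credited to one writer."""

    def __init__(self, search_name):
        super().__init__()
        self.search_name = normalize_whitespace(search_name)
        self.in_writer_content = False
        self.chunks = []
        self.title = self.number = ""
        self.found = []

    def startElement(self, name, attrs):
        if name == "comic":
            # Remember which issue we are inside.
            self.title = normalize_whitespace(attrs.get("title", ""))
            self.number = normalize_whitespace(attrs.get("number", ""))
        elif name == "writer":
            self.in_writer_content = True
            self.chunks = []

    def characters(self, content):
        # The parser may deliver one element's text in several calls,
        # so collect the pieces instead of comparing right away.
        if self.in_writer_content:
            self.chunks.append(content)

    def endElement(self, name):
        if name == "writer":
            self.in_writer_content = False
            writer = normalize_whitespace("".join(self.chunks))
            if writer == self.search_name:
                self.found.append((self.title, self.number))


class PrintErrors(ErrorHandler):
    """Hypothetical handler: report recoverable errors and keep going."""

    def error(self, exception):
        print("recoverable error:", exception)

    def fatalError(self, exception):
        raise exception


# Made-up sample data in the comic-collection format described above.
SAMPLE = """<collection>
  <comic title="Sandman" number="62">
    <writer>Neil   Gaiman</writer>
    <penciller>Glyn Dillon</penciller>
  </comic>
  <comic title="Shade, the Changing Man" number="7">
    <writer>Peter
       Milligan</writer>
  </comic>
</collection>"""


def find_issues_by_writer(xml_text, writer):
    handler = FindWriter(writer)
    parser = xml.sax.make_parser()
    # The handler only implements startElement/endElement, so ask for
    # namespace processing to be switched off.
    parser.setFeature(feature_namespaces, 0)
    parser.setContentHandler(handler)
    parser.setErrorHandler(PrintErrors())
    parser.parse(io.BytesIO(xml_text.encode("utf-8")))
    return handler.found


if __name__ == "__main__":
    for title, number in find_issues_by_writer(SAMPLE, "Peter Milligan"):
        print(title, "#" + number, "found")
```

Collecting the text in a list and joining it in endElement() keeps the
comparison correct no matter how the parser splits the character data.
Normalizing both the search string and the element content means the
extra spaces and the line break in the sample do not affect the match.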