From: Murray C. <mu...@mu...> - 2007-11-29 07:55:09
On Wed, 2007-11-28 at 19:42 +0000, Hugo Mills wrote:
> Hi,
>
> I'm trying to use the SAX parser from libxml++ to read a simple XML
> file generated from a third-party program. At the head of the file is
> an XML declaration specifying the charset encoding:
>
> <?xml version="1.0" encoding="ISO-8859-1"?>
>
> A short distance into the file is the following text:
>
> <sub-title lang="en">Highlights of the final of the Grand Slam of Darts, played over the best of 35 legs. The winner will be crowned the inaugural champion and receive a cheque for £80,000. [S]</sub-title>
>
> (Just in case that's got mangled in transit, that's the
> entity/character literal 0xa3, for the UK Pound symbol in ISO-8859-1).
>
> When I pass this to libxml++, I get a Glib::Error thrown,
> complaining about "Invalid byte sequence in conversion input". It
> seems that libxml++ is reading the &#A3; and converting it to a byte,
> then trying to interpret that as UTF-8, which it isn't. I've tried
> converting the input chunk before I pass it to the parser (using
> Glib::convert), but obviously that isn't working, as it's processing
> the entity as its component characters, rather than converting it to a
> byte sequence.

What does xmllint say?

> How do I handle this input correctly with libxml++? Do I have to
> preprocess each chunk manually to convert the character entities
> before passing it to the parser, or is there some way of persuading
> the SaxParser to do it?
>
> Thanks,
> Hugo.

-- 
mu...@mu...
www.murrayc.com
www.openismus.com