From: Hugo M. <hug...@ca...> - 2007-11-28 19:42:47
|
Hi,

I'm trying to use the SAX parser from libxml++ to read a simple XML file generated from a third-party program. At the head of the file is an XML declaration specifying the charset encoding:

<?xml version="1.0" encoding="ISO-8859-1"?>

A short distance into the file is the following text:

<sub-title lang="en">Highlights of the final of the Grand Slam of Darts, played over the best of 35 legs. The winner will be crowned the inaugural champion and receive a cheque for £80,000. [S]</sub-title>

(Just in case that's got mangled in transit, that's the entity/character literal 0xa3, for the UK Pound symbol in ISO-8859-1).

When I pass this to libxml++, I get a Glib::Error thrown, complaining about "Invalid byte sequence in conversion input". It seems that libxml++ is reading the &#A3; and converting it to a byte, then trying to interpret that as UTF-8, which it isn't. I've tried converting the input chunk before I pass it to the parser (using Glib::convert), but obviously that isn't working, as it's processing the entity as its component characters, rather than converting it to a byte sequence.

How do I handle this input correctly with libxml++? Do I have to preprocess each chunk manually to convert the character entities before passing it to the parser, or is there some way of persuading the SaxParser to do it?

Thanks,
Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- "What are we going to do tonight?" "The same thing we do --- every night, Pinky. Try to take over the world!"
|
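[Editorial sketch] The error message is consistent with a lone 0xA3 byte being handed to something that expects UTF-8. The C++ fragment below is a minimal, standard-library-only illustration of why that byte cannot be UTF-8 on its own; `looks_like_utf8` is a hypothetical helper written for this note, not part of libxml++ or glibmm, and it only checks byte structure (it does not reject overlong forms or surrogates).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: structural UTF-8 check only.
// Each lead byte must be followed by the right number of
// continuation bytes of the form 10xxxxxx.
bool looks_like_utf8(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra = 0;
        if (c < 0x80) {
            extra = 0;                      // plain ASCII
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1;                      // 110xxxxx: two-byte sequence
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2;                      // 1110xxxx: three-byte sequence
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3;                      // 11110xxx: four-byte sequence
        } else {
            return false;                   // stray continuation byte or invalid lead
        }
        if (i + extra >= s.size()) {
            return false;                   // sequence is truncated
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char t = static_cast<unsigned char>(s[i + k]);
            if ((t & 0xC0) != 0x80) {
                return false;               // not a continuation byte
            }
        }
        i += extra + 1;
    }
    return true;
}
```

Called on a string holding the single byte 0xA3 (binary 10100011, which has the shape of a continuation byte with no lead byte before it), the helper returns false; called on the two bytes 0xC2 0xA3, the UTF-8 encoding of U+00A3, it returns true.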
From: Hugo M. <hug...@ca...> - 2007-11-28 20:52:56
|
On Wed, Nov 28, 2007 at 07:42:38PM +0000, Hugo Mills wrote:
> When I pass this to libxml++, I get a Glib::Error thrown,
> complaining about "Invalid byte sequence in conversion input". It
> seems that libxml++ is reading the &#A3; and converting it to a byte,
> then trying to interpret that as UTF-8, which it isn't. I've tried
> converting the input chunk before I pass it to the parser (using
> Glib::convert), but obviously that isn't working, as it's processing
> the entity as its component characters, rather than converting it to a
> byte sequence.
>
> How do I handle this input correctly with libxml++? Do I have to
> preprocess each chunk manually to convert the character entities
> before passing it to the parser, or is there some way of persuading
> the SaxParser to do it?

As a follow-up, I have tried converting the character entities in two different ways, both failing in the same manner as above:

1) Convert entity to bytes; use Glib::convert to go from ISO-8859-1 to UTF8.

2) Convert entity to bytes; use Glib::convert to go from ISO-8859-1 to UTF8; convert new bytes back to entities.

Surely this can't be so difficult to use. The input text is well-formed, and accurately reports its character set. What am I doing wrong, that libxml++ fails to cope with it?

Hugo, getting frustrated.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Is it true that "last known good" on Windows XP --- boots into CP/M?
|
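[Editorial sketch] For reference, this is what an ISO-8859-1 to UTF-8 conversion does to the byte in question. The C++ below is a hand-rolled stand-in using only the standard library, not Glib::convert itself, and `latin1_to_utf8` is a hypothetical name. It relies on the fact that every ISO-8859-1 byte value equals its Unicode code point, so bytes below 0x80 pass through unchanged and bytes 0x80 to 0xFF become two-byte sequences. Note also that in XML a numeric character reference (hexadecimal form `&#xA3;`) names a Unicode code point regardless of the declared document encoding, so a well-formed reference should not itself need transcoding; whether that accounts for the failure described here is not established in this thread.

```cpp
#include <string>

// Hypothetical helper: convert ISO-8859-1 (Latin-1) bytes to UTF-8.
// Latin-1 byte values are identical to Unicode code points U+0000..U+00FF.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    out.reserve(in.size() * 2);  // worst case: every byte doubles
    for (unsigned char c : in) {
        if (c < 0x80) {
            // ASCII range is encoded identically in UTF-8.
            out.push_back(static_cast<char>(c));
        } else {
            // U+0080..U+00FF: 110000xx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

Applied to the single byte 0xA3, this produces the two bytes 0xC2 0xA3; ASCII text comes back unchanged.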
From: Murray C. <mu...@mu...> - 2007-11-29 07:55:09
|
On Wed, 2007-11-28 at 19:42 +0000, Hugo Mills wrote:
> Hi,
>
> I'm trying to use the SAX parser from libxml++ to read a simple XML
> file generated from a third-party program. At the head of the file is
> an XML declaration specifying the charset encoding:
>
> <?xml version="1.0" encoding="ISO-8859-1"?>
>
> A short distance into the file is the following text:
>
> <sub-title lang="en">Highlights of the final of the Grand Slam of Darts, played over the best of 35 legs. The winner will be crowned the inaugural champion and receive a cheque for £80,000. [S]</sub-title>
>
> (Just in case that's got mangled in transit, that's the
> entity/character literal 0xa3, for the UK Pound symbol in ISO-8859-1).
>
> When I pass this to libxml++, I get a Glib::Error thrown,
> complaining about "Invalid byte sequence in conversion input". It
> seems that libxml++ is reading the &#A3; and converting it to a byte,
> then trying to interpret that as UTF-8, which it isn't. I've tried
> converting the input chunk before I pass it to the parser (using
> Glib::convert), but obviously that isn't working, as it's processing
> the entity as its component characters, rather than converting it to a
> byte sequence.

What does xmllint say?

> How do I handle this input correctly with libxml++? Do I have to
> preprocess each chunk manually to convert the character entities
> before passing it to the parser, or is there some way of persuading
> the SaxParser to do it?
>
> Thanks,
> Hugo.

--
mu...@mu...
www.murrayc.com
www.openismus.com
|
From: Murray C. <mu...@mu...> - 2007-11-29 12:51:37
|
On Thu, 2007-11-29 at 12:48 +0000, Hugo Mills wrote:
> I've done a bit more poking, and it's rejecting the same character
> after running through xmllint

But does xmllint complain about the document at all?

--
mu...@mu...
www.murrayc.com
www.openismus.com
|