Re: [Xmlppm-users] bug with accented characters in xmlppm
From: James C. <jr...@co...> - 2003-01-19 14:06:10
OK, I've taken a look. The problem is that expat, the XML parser xmlppm uses, converts everything to UTF-8 before it passes it on to xmlppm. However, xmlppm was saving the old encoding declaration and restoring it in the decoded file, while leaving the character data itself in UTF-8.

Changing the encoding declaration in the decoded file (manually) to UTF-8 resulted in the output file being rendered in Netscape/Mozilla the same as the input. Any other XML processor should then be able to deal with the output of xmlppm in a reasonable way, but I'm sure that for some applications (such as editing the XML text by hand) you'd prefer to have the output use the encoding of your choice rather than mine. Unfortunately, I know next to nothing about Unicode and character encodings in general, so getting it right might take a while.

I do know that there's an alternative form of expat based on wide characters and C localization/internationalization, but the compression algorithm xmlppm uses is rather heavily dependent on characters being 8 bits, and I don't know how much work it would be to fix this. And just naively compressing UTF-32 by serializing the 4 bytes of each wide character would really hurt compression, because it would separate every "real" character with three bogus bytes that would almost always be the same. (This also implies that the less a language is like English, the worse xmlppm will compress it, since its UTF-8 representation will contain many extra continuation bytes too.)

Until I have a better idea, xmlppm will just change the encoding declaration to UTF-8 so it's at least consistent.

Ideally, there's some library out there that I can use to postprocess the decoded XML file from UTF-8 to the encoding declared in the input (or any other encoding specified by the user). Perhaps there are already tools that do just that, in which case you could use those for the time being.

Hope this helps, and let me know if it doesn't.

--James

Vincent Renardias wrote:

>Hello,
>
>I've just given xmlppm a try and ran into a little problem.
>My sample file is a DocBook/XML file (the French version of Jules
>Verne's "De la terre à la lune").
>
>After trying bzip2, gzip, and xmlppm in turn, here are the final file
>sizes:
>
>-rw-r--r-- 1 root root 414211 Jan 16 18:02 yo.xml
>-rw-r--r-- 1 root root  97412 Jan 16 18:03 yo.xml.bz2
>-rw-r--r-- 1 root root 132325 Jan 16 18:03 yo.xml.gz
>-rw-r--r-- 1 root root  91940 Jan 16 18:03 yo.xml.xmlppm
>
>So far so good: xmlppm achieved the highest compression ratio (5.6%
>better than bzip2, really not bad at all!).
>
>Now comes the bad part: when I decompress the file, all the character
>entities are messed up. For example, the French accented letters (coded
>in my XML file as '&eacute;', '&egrave;', etc.) are not decoded
>correctly. If the accents are ISO-8859-1 encoded instead, I get the
>same result.
>
>NB: I've attached a small XML sample (the first chapter of the book,
>actually) that also triggers this problem; I mixed both encodings for
>the accents on purpose.
>
>I'm somewhat frustrated, because your tool shows great promise, but the
>fact that it messes up accents makes it useless for me for now.
>
> Regards,
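
P.S. To make the UTF-8 point concrete, here is a minimal sketch of my
own (not code from xmlppm, and it assumes expat's default char-based
API): whatever encoding the document declares, the character-data
handler always receives UTF-8 bytes.

    #include <stdio.h>
    #include <expat.h>

    /* expat calls this with character data; s is always UTF-8,
       no matter what the input document's encoding declaration says. */
    static void chars(void *user, const XML_Char *s, int len)
    {
        fwrite(s, 1, len, stdout);
    }

    int main(void)
    {
        /* 0xE0 is 'à' in ISO-8859-1 */
        static const char doc[] =
            "<?xml version='1.0' encoding='ISO-8859-1'?><a>\xE0</a>";
        XML_Parser p = XML_ParserCreate(NULL);
        XML_SetCharacterDataHandler(p, chars);
        XML_Parse(p, doc, sizeof doc - 1, 1);
        /* prints the two bytes 0xC3 0xA0: the UTF-8 form of 'à' */
        XML_ParserFree(p);
        return 0;
    }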
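
And on the postprocessing idea: glibc's iconv(3) (or the iconv(1)
command, e.g. "iconv -f UTF-8 -t ISO-8859-1") does roughly that
conversion. A rough sketch, with error handling left out and the file
names and target encoding as placeholders:

    #include <stdio.h>
    #include <iconv.h>

    int main(void)
    {
        FILE *in  = fopen("decoded.xml", "rb");         /* xmlppm's UTF-8 output */
        FILE *out = fopen("decoded-latin1.xml", "wb");
        iconv_t cd = iconv_open("ISO-8859-1", "UTF-8"); /* (to, from) */
        char inbuf[4096], outbuf[4096];
        size_t n;

        while ((n = fread(inbuf, 1, sizeof inbuf, in)) > 0) {
            char *ip = inbuf, *op = outbuf;
            size_t ileft = n, oleft = sizeof outbuf;
            /* NB: a real version must carry a multibyte sequence that
               straddles a buffer boundary over to the next fread;
               this sketch ignores that. */
            iconv(cd, &ip, &ileft, &op, &oleft);
            fwrite(outbuf, 1, (size_t)(op - outbuf), out);
        }
        iconv_close(cd);
        fclose(in);
        fclose(out);
        return 0;
    }

You would still have to change the encoding declaration in the result
from "UTF-8" back to the target encoding by hand.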