OK, I've taken a look.
The problem is that expat, the XML parser xmlppm uses, converts
everything to UTF-8 before it passes it on to xmlppm. However, xmlppm
was saving the old encoding declaration and restoring it in the decoded
file, while leaving the actual character data in UTF-8.
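To illustrate (just a minimal sketch against expat's character-data
callback; the tiny document and the hex dump are made up for the
example), the callback sees UTF-8 no matter what the declaration says:

    #include <stdio.h>
    #include <expat.h>

    /* expat hands character data to the callback already converted to
       UTF-8, regardless of the encoding named in the XML declaration. */
    static void XMLCALL charData(void *userData, const XML_Char *s, int len)
    {
        (void)userData;
        for (int i = 0; i < len; i++)
            printf("%02x ", (unsigned char)s[i]);
        printf("\n");
    }

    int main(void)
    {
        /* "à" is the single byte 0xE0 in ISO-8859-1 ... */
        const char doc[] =
            "<?xml version='1.0' encoding='ISO-8859-1'?><p>\xe0</p>";
        XML_Parser p = XML_ParserCreate(NULL);
        XML_SetCharacterDataHandler(p, charData);
        XML_Parse(p, doc, sizeof doc - 1, 1);  /* ... but this prints "c3 a0",
                                                  its UTF-8 form */
        XML_ParserFree(p);
        return 0;
    }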
Changing the encoding declaration in the decoded file (manually) to
UTF-8 resulted in the output file being rendered in Netscape/Mozilla the
same as the input. Any other XML processor should then be able to deal
with the output of xmlppm in a reasonable way, but I'm sure that for
some applications (such as if you're editing the XML text by hand) you'd
prefer to have the output use the encoding of your choice rather than
mine. Unfortunately, I know next to nothing about Unicode and character
encodings in general, so getting it right might take a while.
I do know that there's an alternative form of expat based on wide
characters and C localization/internationalization, but the compression
algorithm xmlppm uses depends rather heavily on characters being 8 bits,
and I don't know how much work it would be to change that. And just
naively compressing UTF-32 by serializing the 4 bytes of each wide
character would really hurt compression, because it would separate every
"real" character with three bogus bytes (which would almost always be
the same). (This also implies that the less a language is like English,
the worse xmlppm will compress it, since its UTF-8 representation will
have many of these bogus extra bytes too.)
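To make the byte counts concrete (just an illustration; the byte values
for "à" are the standard ones in each encoding):

    #include <stdio.h>

    int main(void)
    {
        /* The same character "à" in three encodings. */
        const unsigned char latin1[] = { 0xE0 };                    /* 1 byte  */
        const unsigned char utf8[]   = { 0xC3, 0xA0 };              /* 2 bytes */
        const unsigned char utf32[]  = { 0x00, 0x00, 0x00, 0xE0 };  /* 4 bytes,
                                                    three of them always zero */

        printf("latin-1: %zu byte(s)\n", sizeof latin1);
        printf("utf-8:   %zu byte(s)\n", sizeof utf8);
        printf("utf-32:  %zu byte(s)\n", sizeof utf32);
        return 0;
    }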
Until I have a better idea, xmlppm will just change the encoding
declaration to UTF-8 so it's at least consistent.
Ideally, there's some library out there that I can use to postprocess
the decoded XML file from UTF-8 back to the encoding declared in the
input (or any other encoding the user asks for). Perhaps there are
already tools out there that do just that, in which case you could use
those for the time being.
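For instance, POSIX iconv(3) seems like it could handle the conversion
step; here is a rough sketch (minimal error handling, and a real
postprocessor would also have to stream the whole file and rewrite the
encoding declaration):

    #include <stdio.h>
    #include <string.h>
    #include <iconv.h>

    int main(void)
    {
        /* Convert decoded UTF-8 text back to ISO-8859-1 with iconv(3). */
        char in[]  = "terre \xc3\xa0 la lune";  /* "à" as the UTF-8 pair c3 a0 */
        char out[64];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof out - 1;

        iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
            perror("iconv");
            iconv_close(cd);
            return 1;
        }
        *outp = '\0';
        iconv_close(cd);

        printf("%s\n", out);   /* "à" is back to the single byte 0xE0 */
        return 0;
    }

The iconv command-line tool does the same thing from the shell
(iconv -f UTF-8 -t ISO-8859-1), so that might already cover the "tools
that do just that" part, apart from fixing up the declaration.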
Hope this helps, and let me know if it doesn't.
--James
Vincent Renardias wrote:
>Hello,
>
>I've just given xmlppm a try and ran into a little problem.
>My sample file is a docbook/XML file (French version of Jules Verne's
>"De la terre à la lune").
>
>After trying bzip2, gzip, and xmlppm in turn, here are the final file
>sizes.
>
>-rw-r--r-- 1 root root 414211 Jan 16 18:02 yo.xml
>-rw-r--r-- 1 root root 97412 Jan 16 18:03 yo.xml.bz2
>-rw-r--r-- 1 root root 132325 Jan 16 18:03 yo.xml.gz
>-rw-r--r-- 1 root root 91940 Jan 16 18:03 yo.xml.xmlppm
>
>So far so good: xmlppm achieved the highest compression ratio (5.6%
>better than bzip2, really not bad at all!).
>
>Now comes the bad part: when I uncompress the file, all the HTML
>entities are messed up. For example, the French accented letters (coded
>in my HTML file as '&eacute;', '&egrave;', etc.) are not decoded correctly.
>If the accents are iso-8859-1 encoded instead, I get the same result.
>
>NB: I've attached a small XML sample (the first chapter of the book,
>actually) that also triggers this problem; I've deliberately mixed both
>encodings for the accents.
>
>I'm somewhat frustrated, because your tool shows great promise, but
>the fact that it messes up accents makes it unusable for me right now.
>
>	Regards,