xmlppm-users Mailing List for XML Compression Tools
Status: Beta
                
Brought to you by: jcheney
| Year | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2003 | 2 | | | | | | | | | | | |
| 2004 | | | | | 1 | | | | | | | |
| 
      
      
      From: <ben...@id...> - 2004-05-21 08:13:53
      
     | 
| Dear Open Source developer,

I am doing a research project on "Fun and Software Development", in which I kindly invite you to participate. You will find the online survey at http://fasd.ethz.ch/qsf/. The questionnaire consists of 53 questions and takes about 15 minutes to complete.

With the FASD project (Fun and Software Development) we want to determine the motivational significance of fun when software developers decide to engage in Open Source projects. What is special about our research project is that a similar survey is planned with software developers in commercial firms. This procedure allows a direct comparison between the individuals involved and the conditions of production in these two development models. We thus hope to obtain substantial new insights into the phenomenon of Open Source development.

With many thanks for your participation,
Benno Luthiger

PS: The results of the survey will be published at http://www.isu.unizh.ch/fuehrung/blprojects/FASD/. We have set up the mailing list fa...@we... for this study. Please see http://fasd.ethz.ch/qsf/mailinglist_en.html to register for this mailing list.

_______________________________________________________________________
Benno Luthiger
Swiss Federal Institute of Technology Zurich
8092 Zurich
Mail: benno.luthiger(at)id.ethz.ch
_______________________________________________________________________ | 
| 
      
      
      From: James C. <jr...@co...> - 2003-01-19 14:06:10
      
     | 
| OK, I've taken a look.

The problem is that expat, the XML parser xmlppm uses, converts everything to UTF-8 before it passes it on to xmlppm. However, xmlppm was saving the old encoding declaration and restoring it in the decoded file, while leaving the characters themselves encoded in UTF-8.

Changing the encoding declaration in the decoded file (manually) to UTF-8 resulted in the output file being rendered in Netscape/Mozilla the same as the input. Any other XML processor should then be able to deal with the output of xmlppm in a reasonable way, but I'm sure that for some applications (such as if you're editing the XML text by hand) you'd prefer to have the output use the encoding of your choice rather than mine.

Unfortunately, I know next to nothing about Unicode and other character encodings in general, so getting it right might take a while. I do know that there's an alternative form of expat based on wide characters and C localization/internationalization, but the compression algorithm xmlppm uses is rather heavily dependent on characters being 8 bits, and I don't know how much work it would be to fix this. And just naively compressing UTF-32 by serializing the 4 bytes of each wide character would really mess up compression, because it would separate every "real" character by three bogus codepage ones (which would be almost always the same). (This also implies that the less a language is like English, the worse xmlppm will compress it, since its UTF-8 representation will have many useless codepage characters too.)

Until I have a better idea, xmlppm will just change the encoding declaration to UTF-8 so it's at least consistent.

Ideally, there's some library out there that I can use to postprocess the decoded XML file from UTF-8 to the encoding declared in the input (or any other encoding specified by the user). Perhaps there are already tools out there that do just that, in which case you could use those for the time being.

Hope this helps, and let me know if it doesn't.

--James

Vincent Renardias wrote:
>Hello,
>
>I've just given xmlppm a try. I just ran into a little problem.
>My sample file is a docbook/XML file (French version of Jules Verne's
>"De la terre à la lune").
>
>After trying successively bzip2, gzip & xmlppm, here are the final file
>sizes.
>
>-rw-r--r-- 1 root root 414211 Jan 16 18:02 yo.xml
>-rw-r--r-- 1 root root 97412 Jan 16 18:03 yo.xml.bz2
>-rw-r--r-- 1 root root 132325 Jan 16 18:03 yo.xml.gz
>-rw-r--r-- 1 root root 91940 Jan 16 18:03 yo.xml.xmlppm
>
>So far so good: xmlppm achieved the highest compression ratio (5.6%
>better than bzip2, really not bad at all!).
>
>Now comes the bad part: when I uncompress the file, all the HTML
>entities are messed up. For example, the French accented letters (coded
>in my HTML file by '&eacute;', '&egrave;', etc.) are not decoded correctly.
>If the accents are 'iso-8859-1' encoded, I get the same result.
>
>NB: I've attached a small XML sample (the 1st chapter of the book
>actually) that also triggers this problem; I've on purpose mixed both
>encodings for accents.
>
>I'm somewhat frustrated, because your tool shows great promise, but
>the fact that it messes up accents makes it useless for me now.
>
> Regards,
> | 
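The postprocessing step described in the message above (taking the decoded UTF-8 output and converting it to a chosen encoding, with a matching declaration) could be sketched roughly as follows. This is only an illustration under stated assumptions: `transcode_xml` is a hypothetical helper, not part of xmlppm or expat, and it assumes the file starts with an XML declaration that already carries an `encoding` attribute.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the document, e.g. <?xml version="1.0" encoding="UTF-8"?>
_DECL = re.compile(r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2')


def transcode_xml(utf8_bytes: bytes, target: str) -> bytes:
    """Hypothetical helper: re-encode UTF-8 XML bytes into `target`.

    The encoding declaration is rewritten so that it agrees with the bytes
    actually written. If there is no declaration with an encoding attribute,
    the text is re-encoded but no declaration is added.
    """
    # "utf-8-sig" also drops a leading byte order mark if one is present.
    text = utf8_bytes.decode("utf-8-sig")
    text = _DECL.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{target}{m.group(2)}",
        text,
        count=1,
    )
    # Characters the target encoding cannot represent are written as numeric
    # character references, which is fine in text content but not in names.
    return text.encode(target, errors="xmlcharrefreplace")


if __name__ == "__main__":
    decoded = '<?xml version="1.0" encoding="UTF-8"?><t>à</t>'.encode("utf-8")
    print(transcode_xml(decoded, "iso-8859-1"))
```

Keeping the declaration and the byte encoding in step is the point here: the bug in the thread came from the two disagreeing.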
| 
      
      
      From: Vincent R. <vi...@st...> - 2003-01-17 08:59:09
      
     | 
| Hello,

I've just given xmlppm a try. I just ran into a little problem. My sample file is a docbook/XML file (French version of Jules Verne's "De la terre à la lune").

After trying successively bzip2, gzip & xmlppm, here are the final file sizes.

-rw-r--r-- 1 root root 414211 Jan 16 18:02 yo.xml
-rw-r--r-- 1 root root 97412 Jan 16 18:03 yo.xml.bz2
-rw-r--r-- 1 root root 132325 Jan 16 18:03 yo.xml.gz
-rw-r--r-- 1 root root 91940 Jan 16 18:03 yo.xml.xmlppm

So far so good: xmlppm achieved the highest compression ratio (5.6% better than bzip2, really not bad at all!).

Now comes the bad part: when I uncompress the file, all the HTML entities are messed up. For example, the French accented letters (coded in my HTML file by '&eacute;', '&egrave;', etc.) are not decoded correctly. If the accents are 'iso-8859-1' encoded, I get the same result.

NB: I've attached a small XML sample (the 1st chapter of the book actually) that also triggers this problem; I've on purpose mixed both encodings for accents.

I'm somewhat frustrated, because your tool shows great promise, but the fact that it messes up accents makes it useless for me now.

Regards,

-- 
Vincent RENARDIAS
Technical Director
StrongHoldNET / http://www.strongholdnet.com |
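For reference, the "5.6% better than bzip2" figure in the message above can be checked from the listed file sizes. The short sketch below only redoes that arithmetic; the helper name `saving_vs` is made up for illustration.

```python
# File sizes in bytes, copied from the directory listing in the message.
sizes = {
    "yo.xml": 414211,
    "yo.xml.bz2": 97412,
    "yo.xml.gz": 132325,
    "yo.xml.xmlppm": 91940,
}


def saving_vs(candidate: str, baseline: str) -> float:
    """Fraction by which `candidate` is smaller than `baseline`."""
    return (sizes[baseline] - sizes[candidate]) / sizes[baseline]


if __name__ == "__main__":
    for name in ("yo.xml.gz", "yo.xml.bz2", "yo.xml.xmlppm"):
        # Compressed size as a share of the original file.
        print(f"{name}: {sizes[name] / sizes['yo.xml']:.1%} of original")
    # xmlppm output relative to the bzip2 output: about 5.6% smaller.
    print(f"xmlppm vs bzip2: {saving_vs('yo.xml.xmlppm', 'yo.xml.bz2'):.1%} smaller")
```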