From: Alan W. I. <ir...@be...> - 2010-12-21 19:03:38
On 2010-12-21 11:21+0100 Arjen Markus wrote:

> Hi Alan,
>
> I think you misunderstand the issue. The problem is that a file
> consists of bytes. In the old days, each byte corresponded to a
> single character, but with the advent of UTF-8 and the like a single
> character may be represented by one, two or more bytes. What a program
> will do with these bytes depends on the assumption about the
> character encoding.
>
> For Tcl programs the following happens:
> - Based on the system encoding, all sequences of bytes are translated
> into equivalent UTF-8 characters.
> - If the system encoding is NOT UTF-8, the internal resulting sequence
> may not be the same as in the file. For instance, on Windows "cp1252"
> is one way to connect the bytes above 127 to characters such as
> A-umlaut. So a byte that represents A-umlaut according to the cp1252
> encoding is translated to the UTF-8 sequence of bytes that represents
> that very same character. In other words: it is a completely different
> sequence of bytes.
> - Right now we pass that _internal_ sequence of bytes to the PLplot C
> library - and assume that it was the original sequence of bytes.
> But that is only true if the system encoding is UTF-8.
> The code I propose as an alternative reverses the translation.
>
> Bytes lower/equal 127 represent exactly the same characters in cp1252
> and UTF-8 (by design), so most examples are not affected by this
> distinction.
>
> (I agree this is highly confusing - but if you simply think of
> bytes separated from characters it becomes a bit easier)

I agree it is highly confusing and difficult to describe clearly. When I look at the Peace words in the actual files x24.tcl (and x24c.c) with the system tools available to me (the emacs editor in my case), it is clear the bytes in those files can only be interpreted properly with a UTF-8 encoding. Please use your own system analysis tools to confirm that conclusion so that at least our analysis is starting at the same point.
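For anyone who would rather check the bytes directly than trust an editor, the byte/character distinction can be sketched in a few lines of Python. This is only an illustration and not part of PLplot: the helper names is_valid_utf8 and mojibake are made up here, and the two-character sample string is an assumption standing in for one of the Peace words in x24.tcl.

```python
# Sketch: bytes versus characters, standard library only.
# All names below are hypothetical helpers for illustration.

A_UMLAUT = "\u00c4"  # the character A-umlaut

# The same character is a different byte sequence in each encoding.
cp1252_bytes = A_UMLAUT.encode("cp1252")  # one byte, 0xc4
utf8_bytes = A_UMLAUT.encode("utf-8")     # two bytes, 0xc3 0x84


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def mojibake(text: str) -> str:
    """Show what UTF-8 bytes look like when misread as cp1252."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


if __name__ == "__main__":
    print(cp1252_bytes.hex(), utf8_bytes.hex())
    # A lone cp1252 A-umlaut byte is not valid UTF-8; the UTF-8 form is.
    print(is_valid_utf8(cp1252_bytes), is_valid_utf8(utf8_bytes))
    # Assumed sample: a two-character CJK string misread as cp1252.
    print(mojibake("\u548c\u5e73"))
```

Reading x24.tcl in binary mode and passing the result to is_valid_utf8 would be one way to confirm the starting point, assuming Python is available on the system in question.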
In other words, if you had some system tool there that assumed the Peace words in x24.tcl were encoded as cp1252, then the result would be displayed as gibberish or blank. Only if you interpret with the UTF-8 encoding _and_ have the Mandarin fonts installed would the Mandarin Peace word be rendered properly, as happens for me with the emacs editor. Does that also happen for you with whatever file display tool is accessible to you that is capable of understanding UTF-8 encoded files?

I acknowledge that Tcl often does things in a very complex way, so I would advise forgetting Tcl for the moment and instead looking at the example 24 results from C. Does the x24c executable produce the same as http://plplot.sourceforge.net/examples-data/demo24/x24.01.png on your system when you use the pngcairo or pngqt device drivers? If so, that confirms you have the proper system fonts installed, and then there is some hope of getting the same good result with Tcl. On the other hand, if you cannot make the C example give a good example 24 result on Windows with the cairo or qt devices, then there is little hope for Tcl.

I will stop now and not comment more on the Tcl case, because I think it is essential to focus on C for now and one of the cairo or qt devices. Of course, as I have stated before, the psc device driver is not useful for diagnosing encoding issues: everything exotic, such as the Mandarin Peace word, ends up as blanks in any case, because the standard Type 1 fonts that the psc device uses have an extremely limited glyph set that does not include Mandarin glyphs or any other non-English glyphs besides Greek (for mathematical purposes).

Alan
__________________________
Alan W. Irwin

Astronomical research affiliation with Department of Physics and Astronomy, University of Victoria (astrowww.phys.uvic.ca).

Programming affiliations with the FreeEOS equation-of-state implementation for stellar interiors (freeeos.sf.net); PLplot scientific plotting software package (plplot.org); the libLASi project (unifont.org/lasi); the Loads of Linux Links project (loll.sf.net); and the Linux Brochure Project (lbproject.sf.net).
__________________________

Linux-powered Science
__________________________