From: Alan W. I. <ir...@be...> - 2010-12-24 21:20:39
On 2010-12-24 13:05+0100 Arjen Markus wrote:

> Hi Alan,
>
> On 2010-12-21 20:03, Alan W. Irwin wrote:
>
>>
>> I will stop now and not comment more on the Tcl case, because I think
>> it is essential to focus on C for now and one of the cairo or qt
>> devices.
>>
>> Of course, as I have stated before the psc device driver is not useful
>> for diagnosis of encoding issues because everything exotic such as the
>> Mandarin Peace word ends up as blanks in any case because the standard
>> Type 1 fonts that the psc device uses have an extremely limited glyph
>> set that does not include Mandarin glyphs or any other non-English
>> glyphs besides Greek (for mathematical purposes).
>>
>
> I have checked the C and F95 examples 24 with the wxWidgets device on
> Windows: they work very nicely, except that my PC does not have all the
> required fonts. With Tcl I recognise the odd sequences I also see in
> the source code (viewed from a cp1252 perspective).

Thus, it appears from your results that C and Fortran on Windows simply
accept the byte sequences in strings without modifying them, while your
hypothesis is that Tcl does not follow that simple model. Instead, it
does an implicit transformation of all strings from the (presumed)
system encoding to UTF-8, which messes up the byte sequences. Your
proposed cure is to take all strings input to PLplot from Tcl and apply
the inverse transformation by calling Tcl_UtfToExternalDString with a
NULL encoding. According to the man page, that call converts a
(presumed) UTF-8 string to the (presumed) system encoding.

If your hypothesis is correct, then your proposed cure might indeed work
on all platforms. Certainly on Linux, the implicit transformation is
UTF-8 to UTF-8, i.e., the identity transformation (which is why Tcl
works right now for example 24 on Linux). Your inverse transformation
would also be the identity transformation on Linux and should therefore
also work on that platform.
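To make the hypothesis concrete, here is a minimal sketch in Python that
models it. It is not Tcl code: Python's codecs stand in for Tcl's
converters, which is an assumption, and the sample string and the
function names implicit_transform and inverse_transform are hypothetical
names chosen for illustration only.

```python
# Model of the hypothesised Tcl behaviour, using Python codecs as a
# stand-in for Tcl's encoding converters (an assumption, not Tcl itself).

# Illustrative non-ASCII sample (two CJK characters), written with
# escapes so this file stays pure ASCII.
SAMPLE = "\u548c\u5e73"


def implicit_transform(source_bytes, system_encoding):
    """Read the source bytes as the system encoding, store them as UTF-8."""
    return source_bytes.decode(system_encoding).encode("utf-8")


def inverse_transform(utf8_bytes, system_encoding):
    """Convert UTF-8 back to the system encoding (the proposed cure)."""
    return utf8_bytes.decode("utf-8").encode(system_encoding)


if __name__ == "__main__":
    source = SAMPLE.encode("utf-8")  # bytes as saved in a UTF-8 source file

    # UTF-8 system encoding (the Linux case): nothing changes.
    print(implicit_transform(source, "utf-8") == source)

    # cp1252 system encoding (the Windows case): the bytes are altered,
    # and the inverse step restores them for this particular sample.
    garbled = implicit_transform(source, "cp1252")
    print(garbled == source)
    print(inverse_transform(garbled, "cp1252") == source)
```

The sample happens to contain only bytes that this cp1252 codec can
decode; whether that holds for every byte value is a separate question.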
However, I am concerned with the following issues.

1. All PLplot API arguments that are strings are assumed to be in UTF-8.
   Thus, the call to Tcl_UtfToExternalDString with NULL has to be made
   in the Tcl bindings for _every_ function in the PLplot API that has
   an input string.

2. Does the implicit transformation work for arbitrary UTF-8 (e.g.,
   arbitrary series of 8-bit bytes), or are there some 8-bit bytes which
   cannot be validly interpreted as cp1252 or which have special
   meanings?

3. Is Tcl_UtfToExternalDString with NULL encoding the exact inverse of
   the implicit transformation?

All of these issues can be dealt with. Obviously some care in the Tcl
bindings should take care of issue 1. To alleviate concern about issues
2 and 3 completely, it would be a good idea to put together a complete
test of all 256 possible 8-bit byte values. I am thinking along the
lines of generating a file from C with a string of all possibilities
from 255 down to a zero (to terminate the string). Then, using an
editor, copy that exact string of 256 bytes into Tcl source code, which
automatically puts that string through the implicit transformation. Then
use Tcl_UtfToExternalDString with NULL to transform that string before
calling a C programme that simply outputs the string. Finally, compare
that output file with the original 256-character file to see if you get
all 256 characters back in their original form.

However, to avoid this work it would be better to convince Tcl not to do
the implicit string transformation in the first place.
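The byte-by-byte round-trip test described above can be sketched as
follows. This is a model only: it uses Python's cp1252 codec in place of
Tcl's implicit transformation and of Tcl_UtfToExternalDString, so any
bytes it flags reflect Python's codec tables and may not match what Tcl
does. The function name round_trip_failures is hypothetical. A real test
would still have to go through Tcl and C as outlined.

```python
# Round-trip model: system encoding -> UTF-8 -> system encoding, one
# byte value at a time. Python codecs stand in for Tcl (an assumption).


def round_trip_failures(encoding="cp1252"):
    """Return the byte values that do not survive the round trip."""
    failures = []
    # 255 down to 1; the zero byte only terminates the C string.
    for value in range(255, 0, -1):
        original = bytes([value])
        try:
            # Models the implicit transformation to UTF-8.
            utf8 = original.decode(encoding).encode("utf-8")
            # Models the inverse transformation back to the system encoding.
            restored = utf8.decode("utf-8").encode(encoding)
        except UnicodeError:
            # The codec has no mapping for this byte in one direction.
            failures.append(value)
            continue
        if restored != original:
            failures.append(value)
    return failures


if __name__ == "__main__":
    bad = round_trip_failures("cp1252")
    print("bytes failing the round trip:", [hex(b) for b in bad])
```

Testing each byte separately, rather than one 255-byte string, means a
single unmappable byte does not hide the results for the others.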
The way this is handled in Python is to put the following comment in the
first or second line of the Python script, which declares that the whole
Python source file is encoded in UTF-8:

    # -*- coding: utf-8 -*-

We do that for the following Python examples:

    software@raven> grep coding: examples/python/xw??.py
    examples/python/xw18.py:# -*- coding: utf-8; -*-
    examples/python/xw24.py:# -*- coding: utf-8; -*-
    examples/python/xw26.py:# -*- coding: utf-8; -*-
    examples/python/xw33.py:# -*- coding: utf-8; -*-

I think Tcl may have something equivalent in the

    encoding system utf-8

command. According to the documentation, users are discouraged from
using that command because it affects everything, including system
calls. For example, puts would output strings in UTF-8 encoding rather
than the actual system encoding (e.g., cp1252 on your platform) on
Windows machines. But is that actually an issue for the above examples?
First, as far as I know we don't interact with the operating system
(e.g., with puts) in those examples, and UTF-8 and cp1252 coincide in
any case for ASCII strings.

Anyhow, if "encoding system utf-8" works for those examples, I think we
should use it rather than the more difficult steps outlined above. Of
course, we should inform Tcl users browsing our example code, via a
comment in those examples, that PLplot requires UTF-8 system encoding
for all non-ASCII input strings.

>
> Over the weekend I won't be able to do anything - holiday
> obligations :).
>
> I wish you all a merry Christmas and a happy New Year.

I wish everybody here a "Merry Christmas" and "Happy New Year" as well.
Enjoy your holidays, Arjen, and I hope when you come back you will find
the "encoding system utf-8" solution works without issues for the
affected examples.

Alan
__________________________
Alan W. Irwin

Astronomical research affiliation with Department of Physics and
Astronomy, University of Victoria (astrowww.phys.uvic.ca).

Programming affiliations with the FreeEOS equation-of-state
implementation for stellar interiors (freeeos.sf.net); PLplot scientific
plotting software package (plplot.org); the libLASi project
(unifont.org/lasi); the Loads of Linux Links project (loll.sf.net); and
the Linux Brochure Project (lbproject.sf.net).
__________________________

Linux-powered Science
__________________________