From: Arthur L. <ale...@xs...> - 2004-10-09 19:30:11
Lutz Euler wrote:
> Christophe Rhodes wrote:
>> In particular, what would your view be of using some space in the
>> private use area to define synthetic other-case characters which would
>> be canonized to the normal form when sent to a stream: so that
>> string-upcase "maße" would return something like "MA?E", where the ?
>> is a character which "knows" that it's an uppercase eszet, but that
>> streams know further should be turned into the character sequence "SS"
>> if it is sent to the outside world?
>
> [...]
> On first sight this solution looked ugly, wrong, unnecessary and
> unnecessarily complex to me. After a good measure of deliberation, looking
> through the Unicode data tables and reading about the intricacies of the
> sharp s in german spelling conventions, this impression was further
> tightened. I think your proposal does not solve all problems and
> additionally makes things unnecessarily complex.

I haven't followed this discussion closely and it's been a few years
since I last read the Unicode standard, so please don't hesitate to tell
me to RTFM if necessary, but:

I totally agree with Lutz that the private use area should not be used
for, please excuse me for being blunt as well, such an ugly hack. I think
the private use area is intended for, ahem, private use and should not be
occupied by a general purpose programming language implementation.

> So my impression is: It is hopeless to try to put all this in a clean
> way as a unique, standard, widely acceptable extension of the behaviour
> of upcase-char and downcase-char into Common Lisp. If the value of
> char-upcase depends not only on the input char, but also on locale and
> other things, these should by all standards of software development be
> provided as additional arguments.

I totally agree.

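
To make the sharp s problem concrete, here is a small sketch in Python
(not Lisp, and not a description of what SBCL or CLISP do). It relies only
on the fact that Python's built-in str.upper applies Unicode's full case
mappings, so a string can grow when upcased. The helper
simple_upcase_char is a made-up name for illustration; it mimics a
character-level upcase that only acts when the mapping is one character
to one character.

```python
def simple_upcase_char(ch):
    """Hypothetical one-to-one upcase: return ch unchanged unless its
    uppercase form is exactly one character long."""
    up = ch.upper()
    return up if len(up) == 1 else ch


word = "ma\u00dfe"  # "maße"

# Full mapping on the whole string: the sharp s becomes "SS",
# so the result is longer than the input.
full = word.upper()

# Character-by-character mapping restricted to one-to-one pairs:
# the sharp s has no single-character uppercase here and is left alone.
simple = "".join(simple_upcase_char(c) for c in word)

print(full, len(word), len(full))
print(simple)
```

The first result cannot be produced by any function from characters to
characters, which is why the string-level and character-level operations
have to be treated separately.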
> I think there remain two possibilities, namely to let them do case
> conversion only for the ascii character subset (the minimum that the
> standard requires) or additionally for the characters from Unicode where it
> is uniquely defined. I prefer the second option because it is more useful,
> i.e. it does the right thing for more of the natural languages of the world
> already.

You *could* also use a special variable as an extra implicit parameter
for char-upcase/char-downcase. I'm not saying that you should, because I
know next to nothing about SBCL and I haven't looked at the alternatives.

> To repeat what I wrote:
>> I believe the best solution is the one taken by CLISP, namely to
>> make only those characters "characters with case" that have such a
>> one-to-one correspondence in the Unicode tables.

What does CLISP do with #\I (i.e. "latin capital letter i") then? In
Turkish, this should be downcased to an i without dot ("latin small
letter dotless i"); in other languages the lowercase version is a
'normal' #\i ("latin small letter i").

> This can be in the form of access to the unicode data tables directly, but
> preferably by providing case conversion functions in an extension package
> that take the needed additional information as parameters (whether we want
> simple or full case folding with or without the "T" mappings, the locale,
> the syntactic position of the character in question etc.) and that convert
> strings to strings of possibly different length.

Whatever you do, please make it possible to override locale-specific
case conversions. The fact that I'm sitting in Amsterdam or am using an
English operating system does *not* necessarily mean that I want my
"latin capital letter i" to be downcased to "latin small letter i". (For
the standard CL char-upcase function that would be OK, of course.)

If my remarks are totally beside the point of this discussion, I apologize
in advance.

Arthur
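
A rough illustration of the override requested above, again in Python
rather than Lisp and purely as a sketch: the language is passed as an
explicit argument and is never read from the operating system or the
environment. The names DOWNCASE_OVERRIDES and downcase, and the "tr" key,
are assumptions made up for this example. Only the two Turkish i mappings
are listed, and context-sensitive rules such as final sigma are ignored.

```python
# Hypothetical per-language exceptions to the default lowercase mapping.
# Only the Turkish dotted/dotless i pairs are included in this sketch.
DOWNCASE_OVERRIDES = {
    "tr": {
        "I": "\u0131",   # latin capital letter i -> latin small letter dotless i
        "\u0130": "i",   # latin capital letter i with dot above -> latin small letter i
    },
}


def downcase(text, language=None):
    """Lowercase text character by character.

    language is an explicit argument; with None (the default) only the
    language-independent mapping of str.lower is applied.
    """
    table = DOWNCASE_OVERRIDES.get(language, {})
    return "".join(table.get(ch, ch.lower()) for ch in text)


print(downcase("ISTANBUL"))        # default mapping
print(downcase("ISTANBUL", "tr"))  # Turkish mapping, requested explicitly
```

Because the caller chooses the language, a user in Amsterdam on an
English system can still ask for the Turkish mapping, and a user on a
Turkish system can still get the plain one.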