Firstly, thanks for the patch for the compiler bug.
Secondly, warning, long mail ahead.
Christophe Rhodes wrote:
> In particular, what would your view be of using some space in the
> private use area to define synthetic other-case characters which would
> be canonized to the normal form when sent to a stream: so that
> string-upcase "maße" would return something like "MA?E", where the ?
> is a character which "knows" that it's an uppercase eszet, but that
> streams know further should be turned into the character sequence "SS"
> if it is sent to the outside world?
Thanks for your interest in my opinion. Please forgive me if the
following sounds blunt -- my intention is to help make SBCL better.
At first sight this solution looked ugly, wrong, unnecessary and
unnecessarily complex to me. After a good measure of deliberation, looking
through the Unicode data tables and reading about the intricacies of the
sharp s in German spelling conventions, this impression only grew
stronger. I think your proposal does not solve all problems and
additionally makes things unnecessarily complex.
Let me explain:
You wrote (about upcasing "ß"):
> The same problem exists for #\ª and #\º (I hope those characters survive
> the mail round-trip :-), and probably some other nasty ones such as the
> fl ligature and the capital-D lower-case-z-with-macron combination.
You surely read through CaseFolding.txt and SpecialCasing.txt from
I would not call that only "some" nasty ones. There are several different
special cases for ligatures, lots of precomposed characters marked as
"No corresponding uppercase precomposed character" and even some
conditional conversions (e.g. depending on whether we are at the end of
some kind of syntactic element; I don't know whether a syllable, a word
or a sentence) and finally some locale dependent conversions.
Citing from "CaseFolding.txt" one sees that there is no single definition
of case conversion when dealing with Unicode:
A. To do a simple case folding, use the mappings with status C + S.
B. To do a full case folding, use the mappings with status C + F.
   The mappings with status T can be used or omitted depending on the
   desired case-folding behavior. (The default option is to exclude them.)
As a reason to provide the programmer with as much freedom and control
as possible let me add something not found in Unicode:
Sometimes the "ß" is capitalised as "SZ", e.g. you may find in technical
drawings the word "MASZSTAB" as the all caps version of "Maßstab". It
means "scale" there, elsewhere also "ruler", and is composed from "Maß"
("dimension", "gauge", "measure") and "Stab" ("bar", "stick").
The spelling "SZ" is allowed because there sometimes is a semantic
difference between "ß" and "ss": in this case "Maße" (plural of "Maß",
pronounced with a long "a") needs to be distinguished from "Masse" (meaning
"mass", pronounced with a short "a"). (Note that there is no form of the
word "Masse" that strips the final "e" off, so strictly "MASS" would not be
ambiguous in "MASSSTAB"; nevertheless the "SZ" form can be used here.)
So my impression is: It is hopeless to try to put all this in a clean
way as a unique, standard, widely acceptable extension of the behaviour
of char-upcase and char-downcase into Common Lisp. If the value of
char-upcase depends not only on the input char, but also on locale and
other things, these should by all standards of software development be
provided as additional arguments. Also, if its value can be two characters,
it should be defined to return them as multiple values or as a pair or a
list or vector or string or whatever.
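
To make the point about signatures concrete, here is a minimal sketch.
The name full-char-upcase and the :eszett keyword are made up for
illustration and exist neither in the standard nor in SBCL, and the code
assumes an implementation whose character set includes U+00DF.

```lisp
;; Hypothetical extension function, not part of SBCL or the standard.
;; The result is always a string, so "one character in, two out" can be
;; expressed, and the German convention is chosen by an explicit argument.
(defun full-char-upcase (char &key (eszett :ss))
  (if (char= char (code-char #xDF))     ; LATIN SMALL LETTER SHARP S
      (ecase eszett
        (:ss "SS")                      ; the usual convention
        (:sz "SZ"))                     ; the "MASZSTAB" convention
      (string (char-upcase char))))

;; (full-char-upcase (code-char #xDF))             ; => "SS"
;; (full-char-upcase (code-char #xDF) :eszett :sz) ; => "SZ"
;; (full-char-upcase #\a)                          ; => "A"
```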
What you propose seems to me to try to work around the dilemma that we
cannot add these arguments to the functions and cannot change the type
of their return values since we want to adhere to the standard.
But adding this functionality to streams instead seems to me to be the
The IMHO obvious solution is that case conversion must in the general case
(sic!) be done by extension functions with different signatures. It remains
to decide what char-upcase and friends should do. The theoretically cleanest
way could be to deprecate them and, for the benefit (!) of those other
functions that insist on using them, like "read", let them not convert
anything (just kidding -- but the thought is seductive: no more ugly all
caps symbols ;-)) but that is disallowed by the standard.
I think there remain two possibilities, namely to let them do case
conversion only for the ASCII character subset (the minimum that the
standard requires) or additionally for the characters from Unicode where it
is uniquely defined. I prefer the second option because it is more useful,
i.e. it does the right thing for more of the natural languages of the world.
To repeat what I wrote:
> I believe the best solution is the one taken by CLISP, namely to
> make only those characters "characters with case" that have such a
> one-to-one correspondence in the Unicode tables.
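
One way to read "one-to-one correspondence" is as a round-trip test on
the simple (single-character) mappings. The sketch below works on code
points, with two tiny hand-copied excerpts standing in for tables that
would really be generated from UnicodeData.txt. All names are
hypothetical, and I am not claiming that this is how CLISP implements it.

```lisp
;; Excerpts of the simple mappings, as (code-point . mapped-code-point).
;; U+00DF (sharp s) has no simple uppercase mapping and so has no entry.
(defparameter *simple-upper*
  '((#x61 . #x41) (#x1C5 . #x1C4) (#x1C6 . #x1C4)))
(defparameter *simple-lower*
  '((#x41 . #x61) (#x1C4 . #x1C6) (#x1C5 . #x1C6)))

(defun simple-upper (code)
  (or (cdr (assoc code *simple-upper*)) code))

(defun simple-lower (code)
  (or (cdr (assoc code *simple-lower*)) code))

(defun one-to-one-case-p (code)
  "True if CODE is one half of a pair that the simple mappings swap both ways."
  (let ((up (simple-upper code))
        (down (simple-lower code)))
    (or (and (/= up code) (= (simple-lower up) code))
        (and (/= down code) (= (simple-upper down) code)))))

;; (one-to-one-case-p #x61)  ; "a": true, pairs with "A"
;; (one-to-one-case-p #xDF)  ; sharp s: false, no single-character partner
;; (one-to-one-case-p #x1C5) ; titlecase "Dz with caron": false, no round trip
```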
I would certainly appreciate it if SBCL contained means to access the
functionality needed to do the more complicated case conversions.
This can be in the form of access to the Unicode data tables directly, or
preferably by providing case conversion functions in an extension package
that take the needed additional information as parameters (whether we want
simple or full case folding with or without the "T" mappings, the locale,
the syntactic position of the character in question etc.) and that convert
strings to strings of possibly different length.
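
A rough sketch of such a string-to-string function follows. The name
string-upcase-full and the :special-cases keyword are invented here and
are not an existing SBCL interface. The alist stands in for data that
would come from SpecialCasing.txt combined with whatever locale or
convention the caller selects, and the code assumes an implementation
whose characters cover the code points used.

```lisp
;; Hypothetical sketch.  SPECIAL-CASES maps a character to its
;; replacement string; every other character goes through char-upcase.
;; The result may be longer than the input.
(defun string-upcase-full (string &key special-cases)
  (with-output-to-string (out)
    (loop for char across string
          for special = (cdr (assoc char special-cases))
          do (if special
                 (write-string special out)
                 (write-char (char-upcase char) out)))))

;; The caller decides the convention, here "SZ" for the sharp s:
;; (string-upcase-full (format nil "Ma~Cstab" (code-char #xDF))
;;                     :special-cases (list (cons (code-char #xDF) "SZ")))
;; => "MASZSTAB"
```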
So I would put this into the same category as normalisation and
denormalisation, e.g. I would expect having an extension function to
convert a string into NFC, but I don't want the reader to do NFC
conversion when reading the source code I wrote (certainly not inside
strings or comments but with the same certainty not inside symbols).
(Even though this may lead to identical-looking symbols being distinct if
one contains a precomposed character and the other one its decomposed
form -- but then I hope my development environment allows me to distinguish
these forms on request.)
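
A small illustration of the precomposed/decomposed distinction, assuming
an implementation with Unicode characters (char-code-limit above #x301)
and no normalization inside intern. The function name is made up for the
example.

```lisp
;; "e with acute" once as the single character U+00E9 and once as
;; "e" followed by U+0301 COMBINING ACUTE ACCENT.
(defun normalization-demo ()
  (let ((precomposed (string (code-char #xE9)))
        (decomposed  (coerce (list #\e (code-char #x301)) 'string)))
    (list (length precomposed)
          (length decomposed)
          (string= precomposed decomposed)
          (eq (intern precomposed) (intern decomposed)))))

;; (normalization-demo) ; => (1 2 NIL NIL)
```

The two strings would usually be rendered identically, yet they differ
in length and name two different symbols.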
So much for my explanation.
Keep up the good work on SBCL! Looking forward to using Unicode in SBCL!