Re: [Sbcl-devel] Character branch bugs

Lutz Euler wrote:

> Christophe Rhodes wrote:
>> In particular, what would your view be of using some space in the
>> private use area to define synthetic other-case characters which would
>> be canonized to the normal form when sent to a stream: so that
>> string-upcase "maße" would return something like "MA?E", where the ?
>> is a character which "knows" that it's an uppercase eszet, but that
>> streams know further should be turned into the character sequence "SS"
>> if it is sent to the outside world?
>
> [...]
> On first sight this solution looked ugly, wrong, unnecessary and
> unnecessarily complex to me. After a good measure of deliberation, looking
> through the Unicode data tables and reading about the intricacies of the
> sharp s in German spelling conventions, this impression was further
> tightened. I think your proposal does not solve all problems and
> additionally makes things unnecessarily complex.

I haven't followed this discussion closely and it's been a few years
since I last read the Unicode standard, so please don't hesitate to
tell me to RTFM if necessary, but:

I totally agree with Lutz that the private use area should not
be used for, please excuse me for being blunt as well, such an
ugly hack. I think the private use area is intended for, ahem,
private use and should not be occupied by a general purpose
programming language implementation.

> So my impression is: It is hopeless to try to put all this in a clean
> way as a unique, standard, widely acceptable extension of the behaviour
> of upcase-char and downcase-char into Common Lisp. If the value of
> char-upcase depends not only on the input char, but also on locale and
> other things, these should by all standards of software development be
> provided as additional arguments.

I totally agree.

> I think there remain two possibilities, namely to let them do case
> conversion only for the ASCII character subset (the minimum that the
> standard requires) or additionally for the characters from Unicode where it
> is uniquely defined. I prefer the second option because it is more useful,
> i.e. it does the right thing for more of the natural languages of the world
> already.

You *could* also use a special variable as an extra implicit parameter
for char-upcase/char-downcase.  I'm not saying that you should, because
I know next to nothing about SBCL and I haven't looked at the alternatives.
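
A minimal sketch of the special-variable idea, to make it concrete. The
names *case-locale* and locale-char-downcase are hypothetical (they are
not part of SBCL or the standard), and the sketch assumes a Lisp whose
characters cover at least code point #x131:

```lisp
;; Hypothetical names; not part of SBCL or ANSI Common Lisp.
(defvar *case-locale* nil
  "NIL means locale-independent mappings; :TR selects the Turkish rules.")

(defun locale-char-downcase (char)
  "Like CHAR-DOWNCASE, but consults *CASE-LOCALE* as an implicit parameter."
  (if (and (eq *case-locale* :tr) (char= char #\I))
      (code-char #x131)                 ; LATIN SMALL LETTER DOTLESS I
      (char-downcase char)))

;; A caller rebinds the variable dynamically instead of passing an argument:
(let ((*case-locale* :tr))
  (locale-char-downcase #\I))
```

Outside such a binding the function behaves like plain char-downcase, so
code that never touches the variable is unaffected.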

> To repeat what I wrote:
>> I believe the best solution is the one taken by CLISP, namely to
>> make only those characters "characters with case" that have such a
>> one-to-one correspondence in the Unicode tables.

What does CLISP do with #\I (i.e. "latin capital letter i") then?
In Turkish, this should be downcased to an i without a dot ("latin small
letter dotless i"); in other languages the lowercase version is a 'normal'
#\i ("latin small letter i").

> This can be in the form of access to the unicode data tables directly, but
> preferably by providing case conversion functions in an extension package
> that take the needed additional information as parameters (whether we want
> simple or full case folding with or without the "T" mappings, the locale,
> the syntactic position of the character in question etc.) and that convert
> strings to strings of possibly different length.
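
For illustration, such an extension function might look roughly like the
sketch below. The name string-upcase-full and its :locale keyword are
hypothetical, only two special cases are hard-coded (a real version would
presumably be driven by the Unicode data tables), and a Unicode-capable
Lisp is assumed:

```lisp
;; Hypothetical function; not part of SBCL, CLISP or ANSI Common Lisp.
(defun string-upcase-full (string &key locale)
  "Return a fresh upcased copy of STRING; the result may be longer than STRING."
  (with-output-to-string (out)
    (loop for char across string
          do (cond ((char= char (code-char #xDF))
                    ;; LATIN SMALL LETTER SHARP S expands to two characters.
                    (write-string "SS" out))
                   ((and (eq locale :tr) (char= char #\i))
                    ;; Turkish: i upcases to LATIN CAPITAL LETTER I WITH DOT ABOVE.
                    (write-char (code-char #x130) out))
                   (t
                    (write-char (char-upcase char) out))))))
```

With no :locale argument, the four-character string "maße" comes back as
the five-character "MASSE". Because the locale is an explicit argument,
the caller decides which rules apply, whatever the operating system says.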

Whatever you do, please make it possible to override locale-specific
case conversions.  The fact that I'm sitting in Amsterdam or am using
an English operating system does *not* necessarily mean that I want my
"latin capital letter i" to be downcased to "latin small letter i".
(For the standard CL char-upcase function that would be OK, of course.)

If my remarks are totally beside the point of this discussion, I apologize
in advance.

Arthur
