William Harold Newman wrote:
>I'd really like the patch to be bootstrappable. (As per the comment at
>the head of make.sh.:-)
Then I think you'll need to restructure a fair amount of the compiler's
I don't think I have the knowledge required to restructure the
compiler's build process properly to achieve this. The hack I used to
smash the host's assumptions in 0.6.13 had some limitations, which I've
since overcome by using a system with the correct assumptions to build
the 0.7.x code, so I doubt that approach would work easily.
Alternatively, it might be possible to build via CLISP at some later
date; I think it has the right type definitions, although I'm not certain.
>As below, I'd like to deal with the patch in smaller pieces if
>possible. That's mostly for other reasons but maybe, as another
>benefit, the deep system-level wide-character bootstrap issues would
>be a little easier to figure out if we don't have so many megabytes of
>application-level Unicode stuff to confuse the issue.
Very little of the patch actually does Unicode stuff. Most of it is in
untangling character vs. base-char, base-strings vs. character-strings,
etc., and then making the rest of the code not violate these differences
anymore, so there's an extremely limited amount of code which can be
removed if you want it to bootstrap.
utf8.lisp does utf8 encoding, unicode.lisp just manages a simple
There isn't much in either, but without the utf8 code it becomes hard to
make the unix/filesystem/symbol/syscall interfaces work, since we use it
to drop character-strings down to base-char strings. The utf8 code
should be refactored when the stream external interface code is built,
and should be both thread-safe and able to support multiple encodings
(which is mostly an issue for the pathnames and the FFI, since different
systems tend to use different encodings for filenames).
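
To illustrate what "dropping a character-string down to a base-char
string" amounts to, here is a minimal sketch of UTF-8 encoding. It is
written in Python for brevity and is not the code in utf8.lisp; the
function names are made up for this example.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as 1 to 4 UTF-8 octets."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:
        # 7-bit range: one octet, identical to ASCII.
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def to_base_string(text: str) -> bytes:
    """Hypothetical helper: flatten a wide string into octets, e.g. for a syscall."""
    return b"".join(utf8_encode_code_point(ord(ch)) for ch in text)
```

Supporting multiple encodings would mean passing the encoder in as a
parameter rather than hard-wiring UTF-8, which is the kind of
refactoring meant above.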
>My inclination is to break up the patch along these lines: "part 1"
>and "everything else". Parts 2 and 3, and other Unicode-level stuff
>like locales, may be "relatively trivial", but they're also
> * enormous
> * difficult to get unambiguously correct (since IIRC some ANSI
> operations like upcasing a single character are a poor fit to
> * more-or-less application-level code, as opposed to the kind of code
> which needs to be tightly integrated into low-level SBCL internals
> like GC, dumping/loading, typing, and type-related compiler
Presently there is very little application-level code. The only thing
which might be considered at this level would be the code to load in
the character database. While you can easily not load any character
database, you probably want to keep the code to avoid duplicating the
#\uXXXX handling code, which is not optional: things will be unhappy if
they cannot map a character to a name of some kind.
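
A rough sketch of that fallback, in Python rather than Lisp: use the
character database when it is loaded and knows the character, otherwise
fall back to a uXXXX-style name so that every character maps to some
name and back. The exact name format and the function names here are
assumptions for illustration, not what the patch does.

```python
import unicodedata

_HEX_DIGITS = set("0123456789abcdefABCDEF")


def char_name(ch: str, database_loaded: bool = True) -> str:
    """Return a name for ch; never fails, even without a character database."""
    if database_loaded:
        name = unicodedata.name(ch, None)  # None when the database has no name
        if name is not None:
            return name.replace(" ", "_")
    # Fallback: hexadecimal code point, at least four digits (assumed format).
    return "u%04X" % ord(ch)


def name_char(name: str) -> str:
    """Inverse of char_name: accept either a uXXXX name or a database name."""
    digits = name[1:]
    if name[:1] in ("u", "U") and digits and all(d in _HEX_DIGITS for d in digits):
        return chr(int(digits, 16))
    return unicodedata.lookup(name.replace("_", " "))
```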
Up/down/title-casing a single character isn't a big problem in Unicode:
there are mappings for it. It is, however, a different operation from
up/down/title-casing a string in Unicode.
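
A small Python sketch of the difference. Python's str.upper() and
str.lower() do string-level case mapping, so a character-at-a-time
mapping has to be approximated here by refusing any result that is not
exactly one character long; that approximation, and the function names,
are mine.

```python
def upcase_char(ch: str) -> str:
    """One character in, one character out (approximates a per-character mapping)."""
    up = ch.upper()
    # If the string-level mapping expands to several characters,
    # leave the character unchanged.
    return up if len(up) == 1 else ch


def upcase_per_char(text: str) -> str:
    """Upcase by mapping each character independently."""
    return "".join(upcase_char(ch) for ch in text)


def downcase_per_char(text: str) -> str:
    """Downcase by mapping each character independently, with no context."""
    return "".join(ch.lower() for ch in text)


word = "stra\u00dfe"                 # "strasse" spelled with a sharp s
per_char = upcase_per_char(word)     # same length as the input
whole = word.upper()                 # sharp s expands, so the result is longer

greek = "\u039f\u0394\u039f\u03a3"   # capital omicron, delta, omicron, sigma
per_char_lower = downcase_per_char(greek)
whole_lower = greek.lower()          # the final sigma is chosen by context
```

The per-character results and the whole-string results differ in both
cases: one in length, the other in which lowercase sigma ends the word.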
>You wrote elsewhere that you'd like to have the code in the SBCL CVS
>codebase. That's entirely understandable after you had to drag the
>patch through all the 0.pre7.* rearrangement (though hopefully that's
>a one-time event, or at least a once-in-a-decade event, not a regular
>event). Anyway, it seems to me that most of the benefit there would
>follow from part 1, for more or less the loose-coupling reasons
What I'd really prefer is to have a side-branch in the SBCL CVS, which
can be used to bring this code into line with the main branch.
I don't think this is really ready for integration with the main line of
code, at least not in an immediate time-frame. I'd like that to happen
over the next 6 months, but this requires co-ordination with the patches
applied to the main line of code, as well as co-ordination for review
and refactoring of various pieces, and for identifying what people need
to have better documented.
[On the other hand, I've been using nothing but this code-base for
development since I got it to boot, but I'd rather not impose code which
I consider to be somewhat half-arsed in places on people in general.]
Also I think there are several months more work involved, particularly
in rewriting the horrible readtable code, and the current sbcl character
database code, which I've patched in the most superficial way so far -
they really need to be rewritten and abstracted correctly.
Stream external encoding support is also needed.
Instead of trying to have 500,000 conditions throughout, I'd like to
look at moving across to having #[!]+unicode be the default case,
removing the #[!]-unicode code, and then adding new cases to optimise
for people who want the same representation for character strings as for
I'd also like to look at expanding out the character type to have
repertoire subtypes, and adding 8-bit and 16-bit specialised arrays
which can be fitted to various char-sets with transformations on the
accessors, which would allow Europeans almost the same efficiency as
ASCII users in most cases, and allow CJK users to use a 16-bit set, etc,
It might also be possible to defer the transformation indefinitely in
some cases, which would make them just as efficient (i.e. printing a
string in a given encoding to a stream in the same encoding).
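
To make the idea concrete, here is a minimal Python sketch of an 8-bit
specialised string whose accessor transforms octets into characters
through a per-charset table, and which skips the transformation
entirely when written to a stream in the same encoding. The class and
method names are invented for the example, and it only handles
single-byte charsets.

```python
import codecs


class OctetString:
    """One octet per character; characters are recovered on access."""

    def __init__(self, text: str, charset: str = "iso-8859-1"):
        self.charset = codecs.lookup(charset).name  # normalised codec name
        # Raises UnicodeEncodeError if text is outside the charset's repertoire.
        self.octets = text.encode(self.charset)
        if len(self.octets) != len(text):
            raise ValueError("only single-byte charsets are supported here")
        # Accessor transformation table: octet value -> character.
        self._table = [bytes([i]).decode(self.charset, errors="replace")
                       for i in range(256)]

    def __len__(self) -> int:
        return len(self.octets)

    def __getitem__(self, index: int) -> str:
        # The transformation happens here, one table lookup per access.
        return self._table[self.octets[index]]

    def write_to(self, stream, stream_charset: str) -> None:
        """Write to a binary stream, transcoding only when the encodings differ."""
        if codecs.lookup(stream_charset).name == self.charset:
            stream.write(self.octets)  # transformation deferred indefinitely
        else:
            text = "".join(self._table[o] for o in self.octets)
            stream.write(text.encode(stream_charset))
```

A 16-bit variant would look the same with a wider element type and a
larger table.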
The only other significant issue then is the character-database, which
can easily use reduced data-sets, and the implementation of the
readtable, which should be abstracted properly, as opposed to the
current ugliness which SBCL inherited.
>Would you be willing to restructure the patch along the lines
>discussed above? Or failing that, if I or someone else extracted a
>part-1-like part of your patch and merged it as
>#!-SB-CHARACTER-IS-BASE-CHAR into the main codebase, would you be
Well, the only thing that you can really do there is to cut off the
character database code, although I'd keep the #\uXXXX character names
to avoid crashing in exciting ways if you do use such a character, but
that's easy enough to do, although I'd rather follow the above approach.
Anyhow, to re-iterate.
I don't think it would be wise to try to crowbar a change this big into
the main sbcl release at the moment - and there are two or three minor
releases worth of patches missing for a start, and I'd like a chance to
refactor a whole bunch of stuff.
I think it would be wise to make a temporary side-branch, bring the
patch-level up to sync, and then work on the side-branch to clean things
up while adding patches in parallel. When it can do the job of the main
branch nicely (i.e. turning off extended character support) and people
are confident about what it does and how it does it, we can switch
across, or whatever, but that might be 6 months down the track.
It would also facilitate people writing 'What the hell is this for?' in
some standard fashion, and then I could go and either fix it or expand
the comments as necessary, as well as testing, and so on.