[Sbcl-devel] dealing with sbcl-0.7.0-unicode prerelease


On Fri, Apr 12, 2002 at 03:51:33PM +1000, Brian Spilsbury wrote:
> William Harold Newman wrote:
> >7. I still don't understand why the +Unicode system can't bootstrap
> >normally under CMU CL or SBCL. Do you have any hints, or should I just
> >try it myself and see how it fails?
> >
> I wouldn't do the latter again :)
> 
> That took about 2 months to figure out.
> 
> In the earlier patch to 0.6.13 in src/compiler/fndb.lisp at the top 
> you'll see a really horrible piece of code which looks like this
> 
> #!+(and sb-xc-host (or sbcl cmu) unicode)
> (progn
> 
>  ; be naughty and allow these types to be redefined
>  ; so that the right type definitions leak in...
>  ; *sigh* [BTS]
> 
>  ; once for loaded things
>  #.(setf (sb-c::info :type :kind 'sb-c::simple-string) :defined)
>  #.(setf (sb-c::info :type :kind 'sb-c::string) :defined)
>  #.(setf (sb-c::info :type :kind 'sb-c::character) :defined)
>  #.(setf (sb-c::info :type :kind 'sb-c::sequence) :defined)
>  #.(setf (sb-c::info :type :kind 'sb-c::vector) :defined)
> 
>  ; once again for compiled things - probably a better way FIXME [BTS]
>  (setf (sb-c::info :type :kind 'sb-c::simple-string) :defined)
>  (setf (sb-c::info :type :kind 'sb-c::string) :defined)
>  (setf (sb-c::info :type :kind 'sb-c::character) :defined)
>  (setf (sb-c::info :type :kind 'sb-c::vector) :defined)
> 
>  (setf (sb-c::info :type :builtin 'sb-c::character) nil)
>  (sb-c::values-specifier-type-cache-clear)
>  (sb!c::values-specifier-type-cache-clear)
> 
>  ...)
> 
> The problem is that the host's type assumptions are used to build things 
> like the defknowns, etc, and in a couple of other places.
> 
> These assumptions then propagate down into the cross-compiler, and then 
> it breaks when the cross-compiler's own code uses the cross-compiler's 
> own type-definitions to try to compile itself into the target.
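A rough way to picture the leak described above, as a sketch only: the names below (*TARGET-CHARACTER-IS-BASE-CHAR*, HOST-CHARACTER-IS-BASE-CHAR-P, TYPE-ASSUMPTION-LEAKS-P) are hypothetical and are not SBCL internals. It asks the host Lisp a type question with plain SUBTYPEP and compares the answer with what a wide-character target is assumed to want. Any table built from the host's answer would carry the host's character model into the cross-compiler.

```lisp
;; Hypothetical model: the target is assumed to have characters that
;; are NOT all BASE-CHARs (i.e. a +unicode build).
(defparameter *target-character-is-base-char* nil)

(defun host-character-is-base-char-p ()
  "Ask the HOST Lisp whether every CHARACTER is a BASE-CHAR.
The answer reflects the host's own character model, not the target's."
  (and (subtypep 'character 'base-char) t))

(defun type-assumption-leaks-p
    (&optional (target *target-character-is-base-char*))
  "True when the host's answer disagrees with the assumed target answer,
so anything computed from the host's answer would be wrong for the target."
  (not (eq (host-character-is-base-char-p)
           (and target t))))
```

On a host whose characters are all base-chars, (TYPE-ASSUMPTION-LEAKS-P) returns true for the wide-character target assumed here; on a host that already has wide characters it returns false.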

I'd really like the patch to be bootstrappable. (As per the comment at
the head of make.sh.:-)

As below, I'd like to deal with the patch in smaller pieces if
possible. That's mostly for other reasons but maybe, as another
benefit, the deep system-level wide-character bootstrap issues would
be a little easier to figure out if we don't have so many megabytes of
application-level Unicode stuff to confuse the issue.

> >8. Might it be possible to do the patch in smaller pieces? E.g.
> >in three phases, each adding some testable functionality:
> >	1. Make the system manipulate Unicode data (reading and
> >		writing it, representing it in characters and 
> >		strings) but not know anything about its properties
> >		other than what's a BASE-CHAR and what's not.
> >		(I.e. no upcasing, symbolic names for Unicode chars,
> >		or other messy stuff.)
> >
> Two and three are certainly separable, but they're also relatively 
> trivial.
> 
> They're included because I wrote them to use the 0.6.13 version to 
> actually do some stuff, as well as for testing.
> 
> >
> >	2. Add Unicode-capable implementation of upcasing.
> >	3. Add Unicode-capable implementation of symbolic char names.

My inclination is to break up the patch along these lines: "part 1"
and "everything else". Parts 2 and 3, and other Unicode-level stuff
like locales, may be "relatively trivial", but they're also
  * enormous
  * difficult to get unambiguously correct (since IIRC some ANSI
    operations like upcasing a single character are a poor fit to
    Unicode)
  * more-or-less application-level code, as opposed to the kind of code
    which needs to be tightly integrated into low-level SBCL internals
    like GC, dumping/loading, typing, and type-related compiler
    optimizations

You wrote elsewhere that you'd like to have the code in the SBCL CVS
codebase. That's entirely understandable after you had to drag the
patch through all the 0.pre7.* rearrangement (though hopefully that's
a one-time event, or at least a once-in-a-decade event, not a regular
event). Anyway, it seems to me that most of the benefit there would
follow from part 1, for more or less the loose-coupling reasons
discussed below.

After merging part 1, I'm not sure what would be the best way to
develop and integrate the other stuff. If it goes exceedingly smoothly
(with obvious solutions to the upcasing problems I worried about
above, with no design issues for locales, etc.), we can just merge it
into the main codebase. At the other extreme, if there's considerable
uncertainty and the design ends up requiring a lot of experiment and
revision, it could probably be developed in a way which is only
loosely coupled to the main SBCL codebase through a few extensions,
e.g.
  * "User" code (meaning in this case the wide-chars-are-Unicode 
    customization layer) is explicitly allowed to redefine 
    the char-set-related functions (e.g. UPPER-CASE-P). (ANSI
    says conforming code can't do this, but we'd relax that in 
    this case.)
  * The reader logic which processes #\FOO, and perhaps other internal
    character-set-related things also, go through redefinable hooks of
    some kind.
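
To make the "redefinable hooks" idea concrete, here is a minimal sketch in portable Common Lisp. Every name in it (*CHAR-NAME-HOOK*, *UPPER-CASE-P-HOOK*, LOOKUP-CHAR-NAME, EXTENDED-UPPER-CASE-P) is hypothetical; nothing like this exists in SBCL as far as this message is concerned. It wraps the standard functions instead of redefining UPPER-CASE-P itself, which is one possible way to get the same loose coupling without touching the ANSI-defined names.

```lisp
;; Hypothetical hook variables: a customization layer would bind or set
;; these; when they are NIL the standard behaviour is used unchanged.
(defvar *char-name-hook* nil
  "NIL, or a function taking a name string and returning a character or NIL.")

(defvar *upper-case-p-hook* nil
  "NIL, or a function taking a character and returning a generalized boolean.")

(defun lookup-char-name (name)
  "Resolve NAME as the reader might for #\\FOO: try the hook first,
then fall back to the standard NAME-CHAR."
  (or (and *char-name-hook* (funcall *char-name-hook* name))
      (name-char name)))

(defun extended-upper-case-p (char)
  "Use the hook when one is installed, otherwise the standard UPPER-CASE-P."
  (if *upper-case-p-hook*
      (funcall *upper-case-p-hook* char)
      (upper-case-p char)))

;; Example: a layer supplying one extra symbolic name (made up for
;; illustration) without changing anything else.
(defun demo-lookup ()
  (let ((*char-name-hook*
          (lambda (name)
            (when (string-equal name "MY-BANG") #\!))))
    (list (lookup-char-name "MY-BANG")
          (lookup-char-name "Space"))))
```

The point of the design is that the base system only needs to know where the hooks are, not what Unicode properties the layer behind them implements.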

> I'm quite happy to go through and clean things up, and would like to 
> incrementally do that, since it isn't quite up to the standard that I 
> would normally like. A lot of hacks went in while working with partially 
> understood code, and not all of them have been sanitised.

Would you be willing to restructure the patch along the lines
discussed above? Or failing that, if I or someone else extracted a
part-1-like part of your patch and merged it as
#!-SB-CHARACTER-IS-BASE-CHAR into the main codebase, would you be
happy?

(Any comments or suggestions from anyone else?)

-- 
William Harold Newman <wil...@ai...>
Users like this are like a mongoose backed into a corner: with its back to
the wall and seeing certain death staring it in the face, it attacks
frantically, because doing something has to be better than doing nothing.
This is not well adapted to the type of problems computers produce.
  -- <http://www.chiark.greenend.org.uk/~sgtatham/bugs.html>
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C  B9 25 FB EE E0 C3 E5 7C
