From: William H. N. <wil...@ai...> - 2002-04-19 21:08:36
On Fri, Apr 12, 2002 at 03:51:33PM +1000, Brian Spilsbury wrote:
> William Harold Newman wrote:
> >7. I still don't understand why the +Unicode system can't bootstrap
> >normally under CMU CL or SBCL. Do you have any hints, or should I just
> >try it myself and see how it fails?
> >
> I wouldn't do the latter again :)
>
> That took about 2 months to figure out.
>
> In the earlier patch to 0.6.13 in src/compiler/fndb.lisp at the top
> you'll see a really horrible piece of code which looks like this
>
> #!+(and sb-xc-host (or sbcl cmu) unicode)
> (progn
>
>   ; be naughty and allow these types to be redefined
>   ; so that the right type definitions leak in...
>   ; *sigh* [BTS]
>
>   ; once for loaded things
>   #.(setf (sb-c::info :type :kind 'sb-c::simple-string) :defined)
>   #.(setf (sb-c::info :type :kind 'sb-c::string) :defined)
>   #.(setf (sb-c::info :type :kind 'sb-c::character) :defined)
>   #.(setf (sb-c::info :type :kind 'sb-c::sequence) :defined)
>   #.(setf (sb-c::info :type :kind 'sb-c::vector) :defined)
>
>   ; once again for compiled things - probably a better way FIXME [BTS]
>   (setf (sb-c::info :type :kind 'sb-c::simple-string) :defined)
>   (setf (sb-c::info :type :kind 'sb-c::string) :defined)
>   (setf (sb-c::info :type :kind 'sb-c::character) :defined)
>   (setf (sb-c::info :type :kind 'sb-c::vector) :defined)
>
>   (setf (sb-c::info :type :builtin 'sb-c::character) nil)
>   (sb-c::values-specifier-type-cache-clear)
>   (sb!c::values-specifier-type-cache-clear)
>
>   ...)
>
> The problem is that the host's type assumptions are used to build things
> like the defknowns, etc, and in a couple of other places.
>
> These assumptions then propagate down into the cross-compiler, and then
> it breaks when the cross-compiler's own code uses the cross-compiler's
> own type-definitions to try to compile itself into the target.

I'd really like the patch to be bootstrappable.
(As per the comment at the head of make.sh. :-) As below, I'd like to
deal with the patch in smaller pieces if possible. That's mostly for
other reasons but maybe, as another benefit, the deep system-level
wide-character bootstrap issues would be a little easier to figure out
if we don't have so many megabytes of application-level Unicode stuff
to confuse the issue.

> >8. Might it be possible to do the patch in smaller pieces? E.g.
> >in three phases, each adding some testable functionality:
> >  1. Make the system manipulate Unicode data (reading and
> >     writing it, representing it in characters and
> >     strings) but not know anything about its properties
> >     other than what's a BASE-CHAR and what's not.
> >     (I.e. no upcasing, symbolic names for Unicode chars,
> >     or other messy stuff.)
> >
> Two and three are certainly separatable, but they're also relatively
> trivial.
>
> They're included because I wrote them to use the 0.6.13 version to
> actually do some stuff, as well as for testing.
>
> >
> >  2. Add Unicode-capable implementation of upcasing.
> >  3. Add Unicode-capable implementation of symbolic char names.

My inclination is to break up the patch along these lines: "part 1"
and "everything else". Parts 2 and 3, and other Unicode-level stuff
like locales, may be "relatively trivial", but they're also
  * enormous
  * difficult to get unambiguously correct (since IIRC some ANSI
    operations like upcasing a single character are a poor fit
    to Unicode)
  * more-or-less application-level code, as opposed to the kind
    of code which needs to be tightly integrated into low-level
    SBCL internals like GC, dumping/loading, typing, and type-related
    compiler optimizations

You wrote elsewhere that you'd like to have the code in the SBCL CVS
codebase. That's entirely understandable after you had to drag the
patch through all the 0.pre7.* rearrangement (though hopefully that's
a one-time event, or at least a once-in-a-decade event, not a regular
event).
Anyway, it seems to me that most of the benefit there would follow
from part 1, for more or less the loose-coupling reasons discussed
below.

After merging part 1, I'm not sure what would be the best way to
develop and integrate the other stuff. If it goes exceedingly smoothly
(with obvious solutions to the upcasing problems I worried about
above, with no design issues for locales, etc.), we can just merge it
into the main codebase. At the other extreme, if there's considerable
uncertainty and the design ends up requiring a lot of experiment and
revision, it could probably be developed in a way which is only
loosely coupled to the main SBCL codebase through a few extensions,
e.g.
  * "User" code (meaning in this case the wide-chars-are-Unicode
    customization layer) is explicitly allowed to redefine the
    char-set-related functions (e.g. UPPER-CASE-P). (ANSI says
    conforming code can't do this, but we'd relax that in this case.)
  * The reader logic which processes #\FOO, and perhaps other
    internal character-set-related things also, go through
    redefinable hooks of some kind.

> I'm quite happy to go though and clean things up, and would like to
> incrementally do that, since it isn't quite up to the standard that I
> would normally like. A lot of hacks went in while working with partially
> understood code, and not all of them have been sanitised.

Would you be willing to restructure the patch along the lines
discussed above? Or failing that, if I or someone else extracted a
part-1-like part of your patch and merged it as
#!-SB-CHARACTER-IS-BASE-CHAR into the main codebase, would you be
happy?

(Any comments or suggestions from anyone else?)

--
William Harold Newman <wil...@ai...>
Users like this are like a mongoose backed into a corner: with its
back to the wall and seeing certain death staring it in the face, it
attacks frantically, because doing something has to be better than
doing nothing. This is not well adapted to the type of problems
computers produce.
  -- <http://www.chiark.greenend.org.uk/~sgtatham/bugs.html>
PGP key fingerprint 85 CE 1C BA 79 8D 51 8C B9 25 FB EE E0 C3 E5 7C