William Harold Newman wrote:
>On Tue, Apr 09, 2002 at 12:41:22AM +1000, Brian Spilsbury wrote:
>OK, thank you.
>I've spent some time looking at it (and at my newly-acquired _Unicode:
>First, let me say that it's pretty impressive that you've been able to
>get everything to work all the way up to high-level operations like
>UPCASE and STREAMs and all the way down to GC.
Thanks, although it wasn't really a matter of choice, so many
>Then let me go on to asking various questions, because a 40k-line
>patch is a lot to get my mind around...
>1. Do you have any test cases to exercise the code?
No formal test cases, since I've only just gotten it to build. I do
have some test code which translates from euc-kr to utf-8 and back, as
well as outputting it correctly on a utf-8 enabled xterm, so I've been
able to verify that it basically works.
When character normalisation starts to be implemented, that will provide
much better test cases, but I've more or less just gotten this to the
stage that it builds completely.
These certainly need to be added.
>2. You write "Characters are created as BASE-CHARs where possible
>where there is ambiguity." That sounds scary. Do you mean there is a
>distinction between #\C-the-BASE-CHAR and #\C-the-CHARACTER? If so,
>that doesn't seem ANSI-compliant. (I sort of doubt that's what you
>mean, because there seems to be only one CHARACTER-WIDETAG value, with
>no BASE-CHAR-WIDETAG. But if you don't mean that I don't understand
>what ambiguity you mean.)
I mean that code-char will always create a base-char object if it can,
even if you expect a character object.
Since all base-chars are characters, I don't think that it is an ANSI
problem, but it is annoying to have that dispatch.
The problem is that the vops don't seem to consider return type when
being selected for substitution.
There is both a CHARACTER-WIDETAG and a BASE-CHAR-WIDETAG in the 0.7.0
In the 0.6.13 I managed to combine them together, but this turned out to
be not such a great approach in the rewrite.
Python really wanted them separated out, and now there are
character-widetags and base-char-widetags, as well as character-regs and
A base-char-widetag'd object is characterp though, since it is a
subtype, and a base-char can use a character-reg as storage, although
not the reverse. I think this is reasonably sane, although I'm a bit
hazy on some of the move ops in src/compiler/x86/char.lisp.
>3. Are the *.txt files just the files distributed by the
Yes, they are unaltered.
>4. Could you give (or point me to a summary you've already given) of
>why stuff is in the patch? E.g. I was confused looking at
>EastAsianWidth.txt, which seems to be there only so that
>WIDE-CHARACTER-P can be defined, which seems to be there for no reason
>at all. Also, I couldn't figure out how dump.unicode is being used.
Yes, it's just there for wide-character-p, which I needed for the
dual-width character support in the buffer code I was developing/testing
That could be removed from the patch.
dump.unicode isn't used, I missed it on my clean-up sweep, it should be
>5. Could you be less telegraphic in comments like
> +#include "monitor.h" /* bogus */
> +#ifdef LISP_FEATURE_UNICODE
> + "character", /* yes, this is dubious */
>I couldn't figure out what's going on in either case. Also
> +#!+unicode ; TODO - think about making this nicer
>I don't mind to-do notes, but it bugs me when I can't tell
>what they mean.
They mean that I should go back and fix them up :)
Mostly those are things that I need to clean up, which I missed.
That was dubious because I had base-char and character both called
'character' in ldb, which I've just fixed.
Thanks for pointing that out.
>And although I can guess that here
> +#!-unicode ; bloody optimists
> (defun string-to-simple-string* (object)
> (if (simple-string-p object)
>it's still more effort than I want to spend when reading code. (As
>Whitehead said, "Civilization advances by extending the number of
>important operations which we can perform without thinking of them."
>And one of the important operations is reading code.:-)
>6. Could you include explanations (in comments if the changes
>aren't self-explanatory, or in email if the changes are self-explanatory
>and I'm just not understanding them because I'm having a bad day)
>when you make changes like the ones for HEXBUF?
Mostly this is a sign of a quick hack which didn't get cleaned up.
I'll try to clean those up, but I expect that will take a few months to
notice them all, and I will not be a particularly good person for
noticing them, since they will tend to make intuitive sense to me at
If you notice something which doesn't make sense, then marking it or
telling me about it is probably going to be the most effective approach.
I'll try to catch them where I can, though.
>7. I still don't understand why the +Unicode system can't bootstrap
>normally under CMU CL or SBCL. Do you have any hints, or should I just
>try it myself and see how it fails?
I wouldn't do the latter again :)
That took about 2 months to figure out.
In the earlier patch to 0.6.13 in src/compiler/fndb.lisp at the top
you'll see a really horrible piece of code which looks like this
#!+(and sb-xc-host (or sbcl cmu) unicode)
; be naughty and allow these types to be redefined
; so that the right type definitions leak in...
; *sigh* [BTS]
; once for loaded things
#.(setf (sb-c::info :type :kind 'sb-c::simple-string) :defined)
#.(setf (sb-c::info :type :kind 'sb-c::string) :defined)
#.(setf (sb-c::info :type :kind 'sb-c::character) :defined)
#.(setf (sb-c::info :type :kind 'sb-c::sequence) :defined)
#.(setf (sb-c::info :type :kind 'sb-c::vector) :defined)
; once again for compiled things - probably a better way FIXME [BTS]
(setf (sb-c::info :type :kind 'sb-c::simple-string) :defined)
(setf (sb-c::info :type :kind 'sb-c::string) :defined)
(setf (sb-c::info :type :kind 'sb-c::character) :defined)
(setf (sb-c::info :type :kind 'sb-c::vector) :defined)
(setf (sb-c::info :type :builtin 'sb-c::character) nil)
The problem is that the host's type assumptions are used to build things
like the defknowns, etc, and in a couple of other places.
These assumptions then propagate down into the cross-compiler, and then
it breaks when the cross-compiler's own code uses the cross-compiler's
own type-definitions to try to compile itself into the target.
>8. Might it be possible to do the patch in smaller pieces? E.g.
>in three phases, each adding some testable functionality:
> 1. Make the system manipulate Unicode data (reading and
> writing it, representing it in characters and
> strings) but not know anything about its properties
> other than what's a BASE-CHAR and what's not.
> (I.e. no upcasing, symbolic names for Unicode chars,
> or other messy stuff.)
Two and three are certainly separable, but they're also relatively
They're included because I wrote them to use the 0.6.13 version to
actually do some stuff, as well as for testing.
> 2. Add Unicode-capable implementation of upcasing.
> 3. Add Unicode-capable implementation of symbolic char names.
>Also, a few style quibbles:
> * It's tidier to modify customize-target-features.lisp rather than
> * Please use *FOO* style for names of special variables, not just FOO.
> Since the semantics of things declared special are globally and
> silently changed, I really don't like specialness attached to
> ordinary names.
Hmm, I don't recall doing that much. Some things which I exposed for
debugging purposes might be like that, but those should be refactored.
> * It might be better to do #ifdef and #!+ stuff on a smaller scale
> sometimes, e.g. in search_for_symbol(). You duplicated the entire
> function, then mutated one copy, but lots of code remains the same
> between the two versions. Less code might be duplicated if you
> could either (1) wrap the #ifdef around a smaller section of code,
> or (2) factor out the operation which needs to be defined
> differently (e.g. a new smaller function called is_same_symbol()),
> called from inside the loop in search_for_symbol()) and
> then put the #ifdef's around the definition of is_same_symbol().
Oh well, there was more code in there at some point, iirc, which got
taken out later.
I agree that that should be refactored in.
There is a lot of cleaning up to go though, I've been taking a break
from working with it now that it compiles.
I'm quite happy to go through and clean things up, and would like to
incrementally do that, since it isn't quite up to the standard that I
would normally like. A lot of hacks went in while working with partially
understood code, and not all of them have been sanitised.
Mostly at the moment, I want to work out the direction from this point,
and I don't want to slide further against the main release, since I
cannot afford to take another three months to rework all of this. :)
As a side-note I've gotten McCLIM to run under SBCL with a modified
net-sbcl-sockets quite nicely :).