Hi,
*** Your SBCL Development Team Needs You! ***
Over the past few days I have been engaged in performing some of the
infrastructural work required for supporting Unicode natively in
SBCL. This evening I have reached a milestone where it makes sense to
publicize the work a little, as I need assistance to progress.
Specifically, the code that you can acquire by checking out sbcl with
the command
  cvs -d:pserver:anonymous@... \
      co -r character_branch sbcl
* builds, including contribs;
* builds itself (make sure you get version 0.8.13.77.character.9);
* passes self-tests;
* passes as many gcl/ansi-tests as the regular branch.
However, it has one essential difference: it has two SIMPLE-STRING
types. One is type-equivalent to (SIMPLE-ARRAY CHARACTER (*)), and the
other is equivalent to SIMPLE-BASE-STRING.[*]
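To make the distinction concrete, here is a minimal sketch using only
standard Common Lisp operators. STRING-REPRESENTATION is a hypothetical
helper written for illustration, not something on the branch; and whether
the two strings below really end up with different representations
depends on the build you run it in.

```lisp
;; Hypothetical helper (not part of SBCL): report which simple-string
;; representation a given object uses.
(defun string-representation (s)
  (cond ((typep s 'simple-base-string) :base)
        ((typep s '(simple-array character (*))) :character)
        (t :other)))

;; Two strings with identical contents but (possibly) different
;; underlying array element types.
(let ((narrow (make-string 3 :element-type 'base-char :initial-element #\a))
      (wide   (make-string 3 :element-type 'character :initial-element #\a)))
  (format t "~&narrow: ~S~%wide:   ~S~%string= on the two: ~S~%"
          (string-representation narrow)
          (string-representation wide)
          (string= narrow wide)))
```

Code that declares or dispatches on SIMPLE-BASE-STRING where it really
means SIMPLE-STRING (or the reverse) is the sort of thing most likely to
behave differently on the branch.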
If this all seems a little abstruse, here's what you can usefully do,
assuming that you have access to an x86 machine: build the system
using your favourite host compiler, and then attempt to run your
favourite code. If you maintain a string-processing library, a regexp
compiler, a webserver, a benchmark suite, or indeed anything which
uses strings as a data structure: I want to know if it breaks. I
_also_ want to know if it's substantially slower than CVS HEAD: if it
turns out that an application is noticeably more sluggish, it would be
absolutely wonderful if the reason for such a slowdown could be
isolated: use of both the supplied profilers (sb-profile:profile and
sb-sprof:start-profiling) may well prove invaluable.
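As a rough sketch of how the two profilers might be driven: COUNT-VOWELS
and RUN-WORKLOAD below are hypothetical stand-ins for your own code, and
I am assuming that the SB-SPROF contrib loads with REQUIRE and that the
reporting functions behave as they do on a regular build.

```lisp
(require :sb-sprof)

;; Hypothetical workload standing in for your own string-heavy code.
(defun count-vowels (string)
  (count-if (lambda (c) (find c "aeiou")) string))

(defun run-workload (&optional (iterations 1000))
  (let ((text (make-string 1000 :initial-element #\e)))
    (loop repeat iterations sum (count-vowels text))))

;; Deterministic profiler: wraps the named functions and records
;; call counts and time spent in them.
(sb-profile:profile count-vowels run-workload)
(run-workload)
(sb-profile:report)
(sb-profile:unprofile count-vowels run-workload)

;; Statistical profiler: periodically samples the running program.
(sb-sprof:start-profiling)
(run-workload)
(sb-sprof:stop-profiling)
(sb-sprof:report :type :flat)
```

Running the same file on CVS HEAD and on the branch and comparing the
two sets of reports should help narrow down where any slowdown comes
from. A workload this small may collect very few samples under the
statistical profiler, so increase the iteration count for real
measurements.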
To be clear: the preliminary nature of the Unicode support that I
mentioned in the subject of this e-mail is extreme, as there is in
fact no support for Unicode at all. However, it is my hope (though
unfortunately not my expectation :-) that the version on the branch is
no more broken than CVS HEAD; I would like that confirmed or refuted
before proceeding.
My sketch of how implementation proceeds can be seen in the
TODO.character file on the branch, a copy of which I've put at
<http://www-jcsu.jesus.cam.ac.uk/~csr21/TODO.character>. The first
three items on that list are complete on the x86: my belief is that
the third item is technically the most challenging of all, and that
much of the rest of the work is straightforward. However, before
continuing, I think we need to establish whether this is a viable
implementation strategy, whether new bugs or new bottlenecks have been
introduced, and so on. I've exposed my reasoning in the
TODO.character file for each step; however, I am far from an expert in
Unicode implementation, and it is entirely possible that I have made
errors of judgment or fact: please provide corrections if so.
I shall be away until Tuesday or Wednesday, though I may be able to
respond to e-mail. If anyone would like to experiment with supporting
characters 128-255, or an :EBCDIC external format, or something else,
that would be excellent. If anyone thinks that Unicode support is
entirely a waste of time, that's worth knowing too.
If I have got this far, it is by having learnt from the experience
(and the experiences) of Brian Spilsbury and Teemu Kalvas; any errors
or misinterpretations are mine, but discussions with them have been
invaluable over the past four years(!) or so.
Useful links:
<http://www.unicode.org/>
<http://www.cl.cam.ac.uk/~mgk25/unicode.html>
<http://anubis.dkuug.dk/CEN/TC304/guide/gucsch00.htm>
<http://groups.google.com/groups?selm=100umgl91u6uc67%40corp.supernews.com>
<http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&c2coff=1&safe=off&threadm=3FA61CC7.1A99354B%40sonic.net&rnum=1&prev=/groups%3Fq%3Dg:thl932339533d%26dq%3D%26hl%3Den%26lr%3D%26ie%3DUTF-8%26c2coff%3D1%26safe%3Doff%26selm%3D3FA61CC7.1A99354B%2540sonic.net>
(and probably many others. If anyone knows how other languages do
stuff, that would probably be helpful too).
So, anyway, the _most_ important thing is that I get a sense of
whether the branch as it stands basically works or not, and whether it
is drastically slower or not. To this end, too much data is way
better than too little, so don't be shy. Please.
Many thanks,
Christophe
[*] To my knowledge, no other Lisp environment currently makes such a
distinction between its string representations. Was this also the case
historically?
--
http://www-jcsu.jesus.cam.ac.uk/~csr21/ +44 1223 510 299/+44 7729 383 757
(set-pprint-dispatch 'number (lambda (s o) (declare (special b)) (format s b)))
(defvar b "~&Just another Lisp hacker~%") (pprint #36rJesusCollegeCambridge)