** turn the VM definition of BASE-CHAR-REG, BASE-CHAR-SC-NUMBER,
etc. into CHARACTER-REG, CHARACTER-REG-SC-NUMBER. (Rationale: we're
never going to want to distinguish the CHARACTERness vs BASE-CHARness
of characters by their widetags, because we can do it based on their
CHAR-CODE; thus, calling the primitive type and storage classes
BASE-CHAR is unnecessarily confusing.)
-- done for x86, ppc (Julian Squires);
-- TODO: sparc, mips, hppa, alpha.
** implement a CHARACTER-SET-TYPE representation for sets of
characters in the CL type system. (Rationale: we are going to need to
describe possibly-large sets of not-necessarily-contiguous characters,
for use in external formats and describing the BASE-CHAR type.)
-- done, implementing the representation of the range as a list of
(low . high) pairs. Note: two alternative representations were
considered and found wanting: a CHARACTER-RANGE-TYPE which could
then be placed in TYPE-UNION for non-contiguous sets has the
disadvantage that (MEMBER #\a #\c #\e) unparses as
(OR (MEMBER #\a) (MEMBER #\c) (MEMBER #\e)); a BIT-VECTOR
representation works well for arbitrarily discontinuous sets, but
is extremely space-inefficient for typical character sets over a
character space of 2^21 characters.
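The (low . high) pair representation described above can be sketched as follows. This is a minimal illustration, not the branch's actual code: CHARS-TO-PAIRS and CODE-IN-PAIRS-P are hypothetical names, and the merging of adjacent codes assumes ASCII-compatible character codes, as in SBCL.

```lisp
;; Hypothetical helpers for illustration; not SBCL's internal API.
(defun chars-to-pairs (chars)
  "Collapse a list of characters into sorted, merged (LOW . HIGH) code ranges."
  (let ((codes (sort (remove-duplicates (mapcar #'char-code chars)) #'<))
        (pairs '()))
    (dolist (code codes (nreverse pairs))
      (if (and pairs (= code (1+ (cdr (first pairs)))))
          ;; CODE extends the most recent range by one.
          (setf (cdr (first pairs)) code)
          ;; Otherwise start a new single-code range.
          (push (cons code code) pairs)))))

(defun code-in-pairs-p (code pairs)
  "True if CODE lies in one of the inclusive (LOW . HIGH) ranges in PAIRS."
  (loop for (low . high) in pairs
        thereis (<= low code high)))
```

With this layout, (chars-to-pairs '(#\a #\c #\e)) yields three single-code ranges in one list, so the set needs three conses of range data rather than a 2^21-bit vector.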
** set BASE-CHAR to be (CHARACTER-SET 0 127), implementing a new
low-level representation of CHARACTER-STRING for (SIMPLE-ARRAY
CHARACTER (*)) (which is now distinct from SIMPLE-BASE-STRING).
(Rationale: exposes issues in the SBCL codebase where BASE-CHAR =
CHARACTER assumptions have been made.) (Rationale: having BASE-STRING
be pure ASCII (and not latin1 or similar) means that, for a wide
variety of external formats (including utf-8, latin-X and POSIX),
writing the string to a stream is as simple as blitting the bits of
the string to the buffer.) (Rationale: while some implementations do
retain BASE-CHAR = CHARACTER, they either (a) bloat all strings by a
factor of four; (b) bloat all strings by a factor of two and don't
support the whole of Unicode; or (c) add an extra level of indirection
for strings so that the storage compaction can be mutated on demand,
with garbage collector support. None of these strategies is
-- done for x86, ppc (Julian Squires):
>> cold init runs;
>> warm load runs to completion;
>> all contribs build and pass self-tests;
>> all sbcl tests pass;
>> checked against Paul Dietz' gcl/ansi-tests;
>> resulting sbcl self-builds.
(Note: this last step was more difficult than anticipated: the
difference between a self-build and a build from another lisp
is that literal strings are dumped as (SIMPLE-ARRAY CHARACTER
(*))s rather than as SIMPLE-BASE-STRINGs, revealing yet
another set of portability problems, and prompting some KLUDGEs
of the #.(coerce "foo" 'base-string) form that might
eventually be reversed.)
-- TODO: sparc, mips, hppa, alpha.
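The "blitting" rationale above rests on the fact that a code below 128 is encoded as the same single octet in ASCII, utf-8 and the latin-X encodings. A minimal sketch of that idea, using hypothetical helper names (ASCII-ONLY-P and ASCII-OCTETS are not part of SBCL):

```lisp
;; Hypothetical helpers for illustration; not part of SBCL.
(defun ascii-only-p (string)
  "True if every character of STRING has a code below 128."
  (every (lambda (ch) (< (char-code ch) 128)) string))

(defun ascii-octets (string)
  "Return the character codes of the ASCII-only STRING as an octet vector."
  ;; For ASCII-only data no per-character encoding step is needed:
  ;; each code is already the octet to be written.
  (assert (ascii-only-p string))
  (map '(vector (unsigned-byte 8)) #'char-code string))
```

A string that fails ASCII-ONLY-P is the case in which an external format has real encoding work to do.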
** fix FOREIGN-SYMBOL-ADDRESS to work with general constant strings,
not just (:CONSTANT SIMPLE-BASE-STRING), coercing to base-string
within the VOP implementation. (Rationale: less use of #.(coerce
"foo" 'base-string) throughout the code.)
-- done for x86, ppc (Eric Marsden);
-- TODO: sparc, mips, hppa, alpha.
** fix GENESIS to use SB!XC:CHAR-CODE always. (Rationale: we should
only use STANDARD-CHARs in our source; this may be hard to achieve,
but we should definitely only use STANDARD-CHARs in our strings. This
partially fixes a theoretical portability bug.)
-- done, including adjusting many documentation strings and
condition format controls.
** fix the regular dumper to compute similarity properly for strings,
rather than simply through an EQUAL hash table. (Rationale: it's just
completely broken at present.)
-- done, including test cases.
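To illustrate why an EQUAL hash table is not enough here: EQUAL compares strings by their characters only, whereas similarity for arrays (CLHS 3.2.4.2.2) also requires the actual array element types to match. A minimal sketch, with SIMILAR-STRINGS-P as a hypothetical name rather than the dumper's real function:

```lisp
;; Hypothetical predicate for illustration; not the dumper's actual code.
(defun similar-strings-p (a b)
  "True if strings A and B have the same element type and the same characters."
  (and (equal (array-element-type a) (array-element-type b))
       (string= a b)))
```

On a lisp where BASE-CHAR and CHARACTER differ, a SIMPLE-BASE-STRING and a (SIMPLE-ARRAY CHARACTER (*)) holding the same characters are EQUAL but fail this predicate.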
** define (CHARACTER-SET 128 255) to be the corresponding Latin1 (and
Unicode) characters at those codepoints. (Rationale: attempting to
support locale-dependent character points will generate extreme
confusion, probably. If there is long-term demand for a purely 8-bit
character SBCL, this decision might be revised, but this simplifying
decision allows for infrastructural progress). This requires
modification of the various CHAR-UPCASE/STRING-DOWNCASE/GRAPHIC-CHAR-P
etc. functions, and will probably address the failing test FORMAT.C.4A.
-- done (Teemu Kalvas), including a test case for ANSI consistency
wrt GRAPHIC-CHAR-P and CHAR-NAME. The names of the characters
between #x80 and #xa0 might want to be revised. (the inclusion of
the binary data generated from the Unicode data files might also
be suboptimal in the long run. Many other possibilities exist.)
** implement :UTF-8, :ISO-8859-1 and :POSIX external formats, and make
:DEFAULT an alias for the appropriate one based on nl_langinfo(CODESET)
information. (Rationale: this is the absolute minimum needed to get
e-acute printed to my terminal, which would be a major milestone.)
Eventually other :ISO-8859-<N> external formats should be supported,
even in 8-bit lisps, but attempts to print characters which are not
representable in those formats should probably error, so it might not
be terribly useful.
-- :UTF-8 external format partially done (Teemu Kalvas) (easier than
it seemed due to a bug rendering FAST-READ-CHAR more-or-less
exactly the same as READ-CHAR).
-- nl_langinfo(CODESET) :DEFAULT processing done.
-- :ISO-8859-1 done (Teemu Kalvas).
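For reference, the encoding step that a :UTF-8 external format has to perform on output can be sketched as below. This is a standalone illustration of the UTF-8 bit layout, not the branch's implementation; UTF8-ENCODE-CODE is a hypothetical name, and the sketch does no validation (surrogates and codes above #x10FFFF are not rejected).

```lisp
;; Hypothetical function for illustration; not SBCL's external-format code.
(defun utf8-encode-code (code)
  "Return a list of octets encoding the code point CODE in UTF-8."
  (cond ((< code #x80)                  ; 1 octet: 0xxxxxxx
         (list code))
        ((< code #x800)                 ; 2 octets: 110xxxxx 10xxxxxx
         (list (logior #xC0 (ash code -6))
               (logior #x80 (logand code #x3F))))
        ((< code #x10000)               ; 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
         (list (logior #xE0 (ash code -12))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))
        (t                              ; 4 octets: 11110xxx and three 10xxxxxx
         (list (logior #xF0 (ash code -18))
               (logior #x80 (logand (ash code -12) #x3F))
               (logior #x80 (logand (ash code -6) #x3F))
               (logior #x80 (logand code #x3F))))))
```

For e-acute (code #xE9) this gives the two octets #xC3 #xA9, which is what a utf-8 terminal expects to receive.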
** alter the reader and any similar data structures such that they do
not scale linearly in size with the number of characters in the
system. (Rationale: having a readtable with 2^21 entries would make
even current bloated sbcl.core look tiny.)
-- done reader, symbol printer (Teemu Kalvas), format subsystem.
(Note that the symbol printer has multiple bugs in its logic
which have not been fixed by this branch.)
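One way to keep a per-character table from scaling with the size of the character space is a dense vector for the low codes plus a hash table for everything else. The sketch below shows that general technique only; it is not claimed to be the layout used in this branch, and SPARSE-CHAR-TABLE and SPARSE-REF are hypothetical names.

```lisp
;; Hypothetical data structure for illustration; not SBCL's readtable.
(defstruct sparse-char-table
  ;; Dense storage for the 128 low codes, which are looked up most often.
  (low (make-array 128 :initial-element nil) :type simple-vector)
  ;; Sparse storage, keyed by character code, for all other characters.
  (high (make-hash-table) :type hash-table))

(defun sparse-ref (table char)
  "Return the entry stored for CHAR in TABLE, or NIL if there is none."
  (let ((code (char-code char)))
    (if (< code 128)
        (svref (sparse-char-table-low table) code)
        (values (gethash code (sparse-char-table-high table))))))

(defun (setf sparse-ref) (value table char)
  "Store VALUE as the entry for CHAR in TABLE."
  (let ((code (char-code char)))
    (if (< code 128)
        (setf (svref (sparse-char-table-low table) code) value)
        (setf (gethash code (sparse-char-table-high table)) value))))
```

Storage then grows with the number of characters that actually have entries, not with CHAR-CODE-LIMIT.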
** fix GENESIS (and the cross-compiler in general) to dump
BASE-STRINGs always. (Rationale: SBCL aspires to portability, so
should not use any non-STANDARD-CHAR in its source code. By
definition, therefore, all strings and stringlike objects are dumpable
as BASE-STRING, which allows for identical cold fasls and cores to be
generated from lisps with different BASE-CHAR/CHARACTER distinctions.)
-- done: the cross-compiler type system, the cross-compiler dumper
and genesis cooperate to make every host string look like a
target base-string. Including fixing for self-building (Eric
-- (Note that dubious uses of CL:TYPE-OF in portions of the
compiler such as CONVERT-MEMBER-TYPE, TWO-ARG-DERIVE-TYPE
** increase CHAR-CODE-LIMIT to something larger than 256. (Rationale:
support people other than simply those living in non-Eurozone Western
Europe or the United States of America.) This requires at minimum
adjusting the dumper/fop code and the low-level memory accessors.
-- in progress for x86, ppc (Eric Marsden):
>> rewrote SYMBOLICATE to avoid needing the type system (or a
transformed to low-level bit-bashing CONCATENATE) early in
>> adjusted vm definition (*BYTE-REGS*, MOVE-FROM-CHAR) and array
>> fop, load and genesis adjustments.
>> TODO: restore speed in CONCATENATE, REPLACE, etc.
** implement an SB-ALIEN:UTF8-STRING parallel to SB-ALIEN:C-STRING.
(Rationale: for calling out to Pango or similar. Actually, a valid
use might be in Unix libc/kernel functions: at least under Linux, I
believe that the kernel understands utf-8 for directory entries and
the like. Someone Who Knows might want to check this.) This might
make the #.(coerce "foo" 'base-string) in the filesystem / SB-UNIX
layer go away.
-- done, apart from checking the kernel and checking for extra
-- TODO: UTF16-STRING, LATIN1-STRING, LATINX-STRING.
-- TODO: OAOOization with stream external formats.
** possibly retain a CHAR-CODE-LIMIT = 256 build option, with
character point encoding dependent on locale. This requires
implementing, at a minimum, latin-X external formats, so that files
with one or two high-bit characters (e.g. e-acute) can be read in
applicable locales (e.g. latin1 and latin15). (Rationale: possible
reduction of size bloat.)
** implement many useful external formats. (Rationale: now that we
have these characters, we want to be able to use them when talking to
the rest of the world.)
-- done: latin-9 aka iso-8859-15 (Teemu Kalvas).