** turn the VM definition of BASE-CHAR-REG, BASE-CHAR-SC-NUMBER,
etc. into CHARACTER-REG, CHARACTER-REG-SC-NUMBER.  (Rationale: we're
never going to want to distinguish the CHARACTERness vs BASE-CHARness
of characters by their widetags, because we can do it based on their
CHAR-CODE; thus, calling the primitive type and storage classes
BASE-CHAR is unnecessarily confusing.)
  -- done for x86, ppc (Julian Squires);
  -- TODO: sparc, mips, hppa, alpha.

** implement a CHARACTER-SET-TYPE representation for sets of
characters in the CL type system.  (Rationale: we are going to need to
describe possibly-large sets of not-necessarily-contiguous characters,
for use in external formats and describing the BASE-CHAR type.)
  -- done, implementing the representation of the range as a list of
     (low . high) pairs.  Note: two alternative representations were
     considered and found wanting: a CHARACTER-RANGE-TYPE which could
     then be placed in TYPE-UNION for non-contiguous sets has the
     disadvantage that (MEMBER #\a #\c #\e) unparses as
     (OR (MEMBER #\a) (MEMBER #\c) (MEMBER #\e)); a BIT-VECTOR
     representation works well for arbitrarily discontinuous sets, but
     is extremely space-inefficient for typical character sets over a
     character space of 2^21 characters.
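
     The trade-off between the two representations can be illustrated
     with a small sketch.  It is written in Python purely for
     illustration and is not the SBCL implementation; the helper names
     (normalize, contains, from_chars) are hypothetical.

```python
# A character set modelled as a sorted list of inclusive (low, high)
# code-point pairs.  All names here are illustrative assumptions.

def normalize(pairs):
    """Sort (low, high) pairs and merge any that overlap or are adjacent."""
    merged = []
    for low, high in sorted(pairs):
        if low > high:
            raise ValueError(f"empty range: ({low}, {high})")
        if merged and low <= merged[-1][1] + 1:
            prev_low, prev_high = merged[-1]
            merged[-1] = (prev_low, max(prev_high, high))
        else:
            merged.append((low, high))
    return merged


def contains(pairs, code):
    """True if the code point falls inside any (low, high) pair."""
    return any(low <= code <= high for low, high in pairs)


def from_chars(chars):
    """Build a set from individual characters, as for a MEMBER type."""
    return normalize((ord(c), ord(c)) for c in chars)


# Cost of the rejected bit-vector representation: one bit per character
# over a space of 2^21 characters, regardless of how small the set is.
CHAR_SPACE = 2 ** 21
BIT_VECTOR_BYTES = CHAR_SPACE // 8

base_char = normalize([(0, 127)])   # one pair describes all of ASCII
ace = from_chars("ace")             # three singleton pairs, one object
```

     Here a non-contiguous set such as the one for #\a #\c #\e stays a
     single object holding three pairs, while a bit vector would occupy
     256 kB for every set, however small.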

** set BASE-CHAR to be (CHARACTER-SET 0 127), implementing a new
low-level representation of CHARACTER-STRING for (SIMPLE-ARRAY
CHARACTER (*)) (which is now distinct from SIMPLE-BASE-STRING).
(Rationale: exposes issues in the SBCL codebase where BASE-CHAR =
CHARACTER assumptions have been made.)  (Rationale: having BASE-STRING
be pure ASCII (and not latin1 or similar) means that, for a wide
variety of external formats (including utf-8, latin-X and POSIX),
writing the string to a stream is as simple as blitting the bits of
the string to the buffer.)  (Rationale: while some implementations do
retain BASE-CHAR = CHARACTER, they either (a) bloat all strings by a
factor of four; (b) bloat all strings by a factor of two and don't
support the whole of Unicode; or (c) add an extra level of indirection
for strings so that the storage compaction can be mutated on demand,
with garbage collector support.  None of these strategies is
particularly appealing.)
  -- done for x86, ppc (Julian Squires):
     >> cold init runs;
     >> warm load runs to completion;
     >> all contribs build and pass self-tests;
     >> all sbcl tests pass;
     >> checked against Paul Dietz' gcl/ansi-tests;
     >> resulting sbcl self-builds.
        (Note: this last step was more difficult than anticipated: the
        difference between a self-build and a build from another lisp
        is that literal strings are dumped as (SIMPLE-ARRAY CHARACTER
        (*))s rather than as SIMPLE-BASE-STRINGs, revealing yet
        another set of portability problems, and prompting some KLUDGEs
        of the #.(coerce "foo" 'base-string) form that might
        eventually be reversed.)
  -- TODO: sparc, mips, hppa, alpha.
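
     The "blitting" rationale above rests on the fact that ASCII
     characters have the same single-byte encoding in many external
     formats.  A minimal sketch in Python (illustration only; the
     function names are hypothetical and none of this is SBCL code):

```python
# Pure-ASCII strings encode to the same bytes under many encodings,
# so writing one out can be a plain copy.  Names are illustrative.

def is_base_string(s):
    """True if every character has a code below 128 (pure ASCII)."""
    return all(ord(c) <= 127 for c in s)


def blit(s):
    """Copy character codes straight into a byte buffer, no conversion."""
    if not is_base_string(s):
        raise ValueError("not pure ASCII; a real encoder is required")
    return bytes(ord(c) for c in s)


text = "hello, world"
raw = blit(text)

# The copied bytes match what each of these encodings would produce.
same_everywhere = all(
    text.encode(enc) == raw
    for enc in ("ascii", "utf-8", "latin-1", "iso-8859-15")
)

# Storage cost of the alternatives mentioned above for the same text:
# four bytes per character, or two bytes per character.
bytes_at_factor_four = len(text.encode("utf-32-le"))
bytes_at_factor_two = len(text.encode("utf-16-le"))

# Once a non-ASCII character appears, the encodings disagree.
e_acute = "\u00e9"
encodings_differ = e_acute.encode("latin-1") != e_acute.encode("utf-8")
```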

** fix FOREIGN-SYMBOL-ADDRESS to work with general constant strings,
not just (:CONSTANT SIMPLE-BASE-STRING), coercing to base-string
within the VOP implementation.  (Rationale: less use of #.(coerce
"foo" 'base-string) throughout the code.)
  -- done for x86, ppc (Eric Marsden);
  -- TODO: sparc, mips, hppa, alpha.

** fix GENESIS to use SB!XC:CHAR-CODE always.  (Rationale: we should
only use STANDARD-CHARs in our source; this may be hard to achieve,
but we should definitely only use STANDARD-CHARs in our strings.  This
partially fixes a theoretical portability bug.)
  -- done, including adjusting many documentation strings and
     condition format controls.

** fix the regular dumper to compute similarity properly for strings,
rather than simply through an EQUAL hash table.  (Rationale: it's just
completely broken at present.)
  -- done, including test cases.

** define (CHARACTER-SET 128 255) to be the corresponding Latin1 (and
Unicode) characters at those codepoints.  (Rationale: attempting to
support locale-dependent character points will generate extreme
confusion, probably.  If there is long-term demand for a purely 8-bit
character SBCL, this decision might be revised, but this simplifying
decision allows for infrastructural progress.)  This requires
modification of the various CHAR-UPCASE/STRING-DOWNCASE/GRAPHIC-CHAR-P
etc. functions, and will probably address the failing test FORMAT.C.4A
from gcl/ansi-tests.
  -- done (Teemu Kalvas), including a test case for ANSI consistency
     wrt GRAPHIC-CHAR-P and CHAR-NAME.  The names of the characters
     between #x80 and #xa0 might want to be revised.  (The inclusion of
     the binary data generated from the Unicode data files might also
     be suboptimal in the long run.  Many other possibilities exist.)

** implement :UTF-8, :ISO-8859-1 and :POSIX external formats, and make
:DEFAULT an alias for the appropriate one based on nl_langinfo(CODESET)
information.  (Rationale: this is the absolute minimum needed to get
e-acute printed to my terminal, which would be a major milestone.)
Eventually other :ISO-8859-<N> external formats should be supported,
even in 8-bit lisps, but attempts to print characters which are not
representable in those formats should probably error, so it might not
be terribly useful.
  -- :UTF-8 external format partially done (Teemu Kalvas) (easier than
     it seemed due to a bug rendering FAST-READ-CHAR more-or-less
     exactly the same as READ-CHAR).
  -- nl_langinfo(CODESET) :DEFAULT processing done.
  -- :ISO-8859-1 done (Teemu Kalvas).
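
     What the two encoders have to do per character can be sketched as
     follows.  This is a Python illustration of the standard UTF-8 and
     ISO-8859-1 byte layouts, not the SBCL stream code; the function
     names are hypothetical, and rejecting surrogate code points is a
     choice made here to match strict UTF-8.

```python
# Per-character encoders for two external formats.  Illustrative only.

def utf8_encode_char(code):
    """Encode one code point as 1 to 4 UTF-8 bytes."""
    if code < 0 or code > 0x10FFFF:
        raise ValueError(f"code point out of range: {code:#x}")
    if 0xD800 <= code <= 0xDFFF:
        raise ValueError(f"surrogate code point: {code:#x}")
    if code < 0x80:
        return bytes([code])
    if code < 0x800:
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    if code < 0x10000:
        return bytes([0xE0 | (code >> 12),
                      0x80 | ((code >> 6) & 0x3F),
                      0x80 | (code & 0x3F)])
    return bytes([0xF0 | (code >> 18),
                  0x80 | ((code >> 12) & 0x3F),
                  0x80 | ((code >> 6) & 0x3F),
                  0x80 | (code & 0x3F)])


def latin1_encode_char(code):
    """Encode one code point as a single byte, or signal an error."""
    if not 0 <= code <= 0xFF:
        raise ValueError(f"not representable in ISO-8859-1: {code:#x}")
    return bytes([code])


E_ACUTE = 0xE9
utf8_e_acute = utf8_encode_char(E_ACUTE)      # two bytes
latin1_e_acute = latin1_encode_char(E_ACUTE)  # one byte
```

     The error branch in latin1_encode_char corresponds to the remark
     above that printing unrepresentable characters in an 8-bit format
     should probably signal an error.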

** alter the reader and any similar data structures such that they do
not scale linearly in size with the number of characters in the
system.  (Rationale: having a readtable with 2^21 entries would make
even current bloated sbcl.core look tiny.)
  -- done: reader, symbol printer (Teemu Kalvas), format subsystem.
     (Note that the symbol printer has multiple bugs in its logic
     which have not been fixed by this branch.)
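
     One common way to keep such a table from scaling with the size of
     the character space is a small dense array for low codes plus a
     hash table for everything else.  The Python sketch below shows
     that general technique only; whether it matches the structure
     used on this branch is an assumption, and the class name and the
     limit of 128 are hypothetical.

```python
# A per-character attribute table whose size depends on the number of
# non-default entries, not on the size of the character space.

class SparseCharTable:
    DENSE_LIMIT = 128  # assumed cutoff: codes below this use the array

    def __init__(self, default):
        self.default = default
        self.dense = [default] * self.DENSE_LIMIT
        self.sparse = {}  # code -> value, only for non-default entries

    def get(self, char):
        code = ord(char)
        if code < self.DENSE_LIMIT:
            return self.dense[code]
        return self.sparse.get(code, self.default)

    def set(self, char, value):
        code = ord(char)
        if code < self.DENSE_LIMIT:
            self.dense[code] = value
        elif value == self.default:
            # Storing the default again frees the slot.
            self.sparse.pop(code, None)
        else:
            self.sparse[code] = value

    def entry_count(self):
        """Number of slots actually allocated."""
        return len(self.dense) + len(self.sparse)


table = SparseCharTable(default="constituent")
table.set("(", "terminating-macro")
table.set("\u00a0", "whitespace")   # one entry outside the dense part
```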

** fix GENESIS (and the cross-compiler in general) to dump
BASE-STRINGs always.  (Rationale: SBCL aspires to portability, so
should not use any non-STANDARD-CHAR in its source code.  By
definition, therefore, all strings and stringlike objects are dumpable
as BASE-STRING, which allows for identical cold fasls and cores to be
generated from lisps with different BASE-CHAR/CHARACTER distinctions.)
  -- done: the cross-compiler type system, the cross-compiler dumper
     and genesis cooperate to make every host string look like a
     target base-string.  This includes fixes for self-building (Eric
     Marsden).
  -- (Note that dubious uses of CL:TYPE-OF in portions of the
     compiler such as CONVERT-MEMBER-TYPE, TWO-ARG-DERIVE-TYPE
     remain.)

** increase CHAR-CODE-LIMIT to something larger than 256.  (Rationale:
support people other than simply those living in non-Eurozone Western
Europe or the United States of America.)  This requires at minimum
adjusting the dumper/fop code and the low-level memory accessors.
  -- in progress for x86, ppc (Eric Marsden):
     >> rewrote SYMBOLICATE to avoid needing the type system (or a
        CONCATENATE transformed to low-level bit-bashing) early in
        cold-init.
     >> adjusted vm definition (*BYTE-REGS*, MOVE-FROM-CHAR) and array
        accessors.
     >> fop, load and genesis adjustments.
     >> TODO: restore speed in CONCATENATE, REPLACE, etc.

** implement an SB-ALIEN:UTF8-STRING parallel to SB-ALIEN:C-STRING.
(Rationale: for calling out to Pango or similar.  Actually, a valid
use might be in Unix libc/kernel functions: at least under Linux, I
believe that the kernel understands utf-8 for directory entries and
the like.  Someone Who Knows might want to check this.)  This might
make the #.(coerce "foo" 'base-string) in the filesystem / SB-UNIX
layer go away.
  -- done, apart from checking the kernel and checking for extra
     coercions.
  -- TODO: UTF16-STRING, LATIN1-STRING, LATINX-STRING.
  -- TODO: OAOOization with stream external formats.
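
     The conversion such a foreign string type has to perform can be
     sketched as below: encode to UTF-8 with a terminating NUL on the
     way out, and decode up to the first NUL on the way in.  This is a
     Python illustration with hypothetical function names, not the
     SB-ALIEN implementation.

```python
# Converting between a string and a NUL-terminated UTF-8 byte buffer,
# as a C function taking `char *` would expect.  Illustrative only.

def to_utf8_c_string(s):
    """Return the UTF-8 bytes of s followed by a terminating NUL."""
    if "\0" in s:
        raise ValueError("embedded NUL would truncate the C string")
    return s.encode("utf-8") + b"\0"


def from_utf8_c_string(buf):
    """Decode the bytes up to (not including) the first NUL."""
    end = buf.index(b"\0")  # raises ValueError if unterminated
    return buf[:end].decode("utf-8")


outgoing = to_utf8_c_string("caf\u00e9")
incoming = from_utf8_c_string(outgoing + b"trailing garbage")
```

     Note that the byte length (6 here, including the NUL) differs from
     the character count (4), which is why a separate type from a
     one-byte-per-character C string is useful.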

** possibly retain a CHAR-CODE-LIMIT = 256 build option, with
character point encoding dependent on locale.  This requires
implementing, at a minimum, latin-X external formats, so that files
with one or two high-bit characters (e.g. e-acute) can be read in
applicable locales (e.g. latin1 and latin15).  (Rationale: possible
reduction of size bloat.)

** implement many useful external formats.  (Rationale: now that we
have these characters, we want to be able to use them when talking to
the rest of the world.)
  -- done: latin-9 aka iso-8859-15 (Teemu Kalvas).