Thread: [Sbcl-devel] patch and a new file to character_branch for :EXTERNAL-FORMAT :UTF-8


-- 
Teemu

Teemu Kalvas <ch...@s2...> writes:

> This is very preliminary support for :EXTERNAL-FORMAT :UTF-8.  It
> lacks important consistency tests for arguments (like EXTERNAL-FORMAT
> = :UTF-8 entailing ELEMENT-TYPE = CHARACTER).  It is also not
> thread-safe at all.  But it does the job.
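
The missing consistency test mentioned above could look roughly like the following. This is only a sketch of the idea in Python, not code from the patch: the function name and the string spellings of the format and element type are assumptions made for illustration, and it says nothing about how SBCL itself validates the arguments to OPEN.

```python
def check_stream_arguments(external_format, element_type):
    """Reject argument combinations a UTF-8 character stream cannot honor.

    Hypothetical helper: "utf-8" and "character" are illustrative stand-ins
    for :UTF-8 and CHARACTER.
    """
    if external_format == "utf-8" and element_type != "character":
        raise ValueError(
            f"external format {external_format!r} requires element type "
            f"'character', got {element_type!r}"
        )
    return external_format, element_type
```

Calling it with ("utf-8", "character") returns the pair unchanged; any other element type combined with "utf-8" raises ValueError before a stream would be created.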

I continued working on it and now have a binary representation of the
character database compiled into sbcl.core.  The patch and the entirely
new files have grown somewhat large, so I've just put it all up at

 http://www.s2.org/~chery/projects/sbcl-unicode/

for the moment.  There's a patch against character_branch
(latin-1+utf-8.patch), the new file utf8-stream.lisp, the binary
representation of the character database (ucd.dat) and the generator
program for that (ucd.lisp).

What is yet to be decided is in what way the UCD gets included in the
source distribution.  Possibilities are

1. As the binary object ucd.dat.  Pro: small, 82 kB. Con: Not human
   readable, also CVS won't play nice with binaries.
2. As the source files from unicode.org.  Pro: All text. Con: Megabytes
   of it.
3. Don't include the source files, fetch them automatically in the
   build process.  Pro: Least amount of space in source distribution
   used.  Con: Would introduce a dependency on network connectivity
   to the build process.
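
To make the size trade-off between options 1 and 2 concrete, here is a rough Python sketch of turning UnicodeData.txt-style text records into fixed-width binary records. The record layout, the category table, and the function names are all assumptions chosen for illustration; the actual layout of ucd.dat and the logic of ucd.lisp are not described in this message and may differ entirely.

```python
import struct

# Three lines in the semicolon-separated UnicodeData.txt format.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
"""

# Hypothetical: only the categories used by SAMPLE, stored as a small index.
CATEGORIES = ["Lu", "Ll", "Nd"]

# Hypothetical record: code point, category index, combining class,
# simple uppercase mapping, simple lowercase mapping (0 means "none").
RECORD = struct.Struct("<IBBII")


def pack_ucd(text):
    """Pack each text line into one fixed-width binary record."""
    out = bytearray()
    for line in text.splitlines():
        fields = line.split(";")
        code = int(fields[0], 16)
        category = CATEGORIES.index(fields[2])
        combining = int(fields[3])
        upper = int(fields[12], 16) if fields[12] else 0
        lower = int(fields[13], 16) if fields[13] else 0
        out += RECORD.pack(code, category, combining, upper, lower)
    return bytes(out)


def unpack_ucd(blob):
    """Read the fixed-width records back as tuples."""
    return [RECORD.unpack_from(blob, offset)
            for offset in range(0, len(blob), RECORD.size)]
```

Dropping the character names and encoding the remaining fields as small integers is where most of the saving comes from in a scheme like this; the cost is exactly the one listed under option 1, namely that the result is no longer readable or diffable as text.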

Beware that the patch will put a hardcoded path to the file ucd.dat
into src/code/target-char.lisp.  Christophe can fix that if and when
he selects where ucd.dat shall go in the source tree.  I assume he'll
commit some of this to character_branch in the near future, along with
his own changes.

Then more testing!

-- 
Teemu

On Fri, Sep 10, 2004 at 05:24:03PM +0300, Teemu Kalvas wrote:
> What is yet to be decided is in what way the UCD gets included in the
> source distribution.  Possibilities are
> 
> 1. As the binary object ucd.dat.  Pro: small, 82 kB. Con: Not human
>    readable, also CVS won't play nice with binaries.
> 2. As the source files from unicode.org.  Pro: All text. Con: Megabytes
>    of it.
> 3. Don't include the source files, fetch them automatically in the
>    build process.  Pro: Least amount of space in source distribution
>    used.  Con: Would introduce a dependency on network connectivity
>    to the build process.

As a first cut, just putting the text source files into our source
distribution sounds reasonable. It's simple and easy, and depending on
people's bandwidth and storage capabilities, it might turn out to be
good enough.

If the simple way turns out to be annoyingly expensive for many
people, I can't guess at the moment from exactly which direction the
complaints will come. I think at least two combinations of your 1/2/3
options are fairly plausible:
  -- "I don't need Unicode, why does SBCL source need to carry all this
     around?" Or "a lot of us need Unicode, but even those of us who do
     find the bulkiness annoying, and we all agree Unicode seldom changes."
     In that case, making a separate ucd package on sourceforge, and 
     putting ucd-1.0-source.tar.bz2 and ucd-1.0-binary.tar.bz2 there, 
     might fit most needs.
  -- "I need Unicode, but all this stuff is available in standard places
     [wgetable URLs, Debian CDs, asdf magic, whatever] and I shouldn't
     need an extra copy just for SBCL." Since at the moment we don't know
     of any such generally accepted source, this is only hypothetical now,
     but if such a source comes into existence then some sort of fetching
     solution would become natural.

The one thing I would propose to do immediately to accommodate
possible future cleverness is something that might be done anyway:
try to structure the CVS project/directory tree in such a way that
a future solution which unbundles the Unicode data fits in easily.

-- 
William Harold Newman <wil...@ai...>
What part of "Gestalt" don't you understand? -- Karsten M. Self

Teemu Kalvas <ch...@s2...> writes:

> I have a very ugly utf-8 external-format implementation native to
> fd-streams included here. It'll probably not stay in this form, when
> other external formats are implemented, but this could be tested and
> benchmarked a bit. 

I've merged this into sbcl-0.8.13.77.character.17.

High on the list of things to change, I feel, is the dependence in
PICK-INPUT-ROUTINE and PICK-OUTPUT-ROUTINE on the order in which the
top-level forms in fd-streams are executed... it's not entirely your
fault, but you have just made it worse.
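
The kind of load-order dependence described above can be illustrated in general terms as follows. This is a Python sketch of the hazard, not a rendering of the fd-streams code: the registry shape, the matching rule, and all names are hypothetical.

```python
def build_registry(definitions):
    """Simulate top-level forms that each push one routine onto a list."""
    registry = []
    for entry in definitions:       # one entry per top-level form, in load order
        registry.insert(0, entry)   # newest first, like pushing onto a list
    return registry


def pick_first_match(registry, size):
    """Order-dependent: return the first routine able to handle `size` bytes."""
    for max_bytes, name in registry:
        if size <= max_bytes:
            return name
    return None


def pick_most_specific(registry, size):
    """Order-independent: return the narrowest routine able to handle `size`."""
    matches = [(max_bytes, name) for max_bytes, name in registry
               if size <= max_bytes]
    return min(matches)[1] if matches else None


# Hypothetical routines: (largest element size handled, routine name).
definitions = [(4, "word-routine"), (1, "byte-routine")]
in_order = build_registry(definitions)
reordered = build_registry(list(reversed(definitions)))
```

With these definitions, pick_first_match(in_order, 1) yields "byte-routine" while pick_first_match(reordered, 1) yields "word-routine", so merely moving a definition changes which routine is chosen. pick_most_specific gives "byte-routine" in both cases because it ranks the candidates explicitly instead of relying on list order.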

Thanks anyway. :-)

Cheers,

Christophe
