From: Teemu K. <ch...@s2...> - 2004-09-07 17:39:14
Attachments:
utf8-stream.lisp
latin-1+utf-8.patch
-- Teemu
From: Teemu K. <ch...@s2...> - 2004-09-10 14:24:53
Teemu Kalvas <ch...@s2...> writes:

> This is very preliminary support for :EXTERNAL-FORMAT :UTF-8. It
> lacks important consistency tests for arguments (like EXTERNAL-FORMAT
> = :UTF-8 entailing ELEMENT-TYPE = CHARACTER). It is also not
> thread-safe at all. But it does the job.

I continued on it and now have a binary representation of the character database compiled into the sbcl.core. The patch and the entirely new files have grown fairly large, so I've just put it all up at http://www.s2.org/~chery/projects/sbcl-unicode/ for the moment. There's a patch against character_branch (latin-1+utf-8.patch), the new file utf8-stream.lisp, the binary representation of the character database (ucd.dat) and the generator program for that (ucd.lisp).

What is yet to be decided is in what way the UCD gets included in the source distribution. Possibilities are:

1. As the binary object ucd.dat. Pro: small, 82 kB. Con: Not human readable, also CVS won't play nice with binaries.
2. As the source files from unicode.org. Pro: All text. Con: Megabytes of it.
3. Don't include the source files, fetch them automatically in the build process. Pro: Least amount of space in source distribution used. Con: Would introduce a dependency on network connectivity to the build process.

Beware that the patch will put a hardcoded path to the file ucd.dat into src/code/target-char.lisp. Christophe can fix that if and when he selects where ucd.dat shall go in the source tree.

I assume he'll commit some of this to character_branch in the near future, along with his own changes. Then more testing!

-- Teemu
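
To make the size trade-off between options 1 and 2 concrete, here is a minimal sketch of how semicolon-separated UnicodeData.txt entries might be packed into fixed-size binary records. It is written in Python purely for illustration. The record layout (4-byte code point, 1-byte category index, 1-byte combining class) and the names `pack_ucd` and `lookup` are assumptions made up for this example; the actual format of ucd.dat and the workings of ucd.lisp are not described in this thread.

```python
import struct

# Three entries in the UnicodeData.txt format: field 0 is the code point
# in hex, field 2 the general category, field 3 the canonical combining class.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
0301;COMBINING ACUTE ACCENT;Mn;230;NSM;;;;;N;NON-SPACING ACUTE;;;;
"""

RECORD = struct.Struct(">IBB")  # hypothetical layout, 6 bytes per entry


def pack_ucd(text):
    """Pack (code point, category, combining class) into binary records."""
    categories = []  # distinct general-category strings, indexed by position
    records = bytearray()
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        category = fields[2]
        combining_class = int(fields[3])
        if category not in categories:
            categories.append(category)
        records += RECORD.pack(code, categories.index(category), combining_class)
    return categories, bytes(records)


def lookup(categories, records, code):
    """Linear scan over the packed records; returns None if code is absent."""
    for offset in range(0, len(records), RECORD.size):
        cp, category_index, combining_class = RECORD.unpack_from(records, offset)
        if cp == code:
            return categories[category_index], combining_class
    return None


CATEGORIES, RECORDS = pack_ucd(SAMPLE)
```

Each entry shrinks from a line of several dozen characters to 6 bytes here, mostly because the character names are dropped. A real table would also need the case mappings and a faster lookup than a linear scan.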
From: William H. N. <wil...@ai...> - 2004-09-10 16:26:16
On Fri, Sep 10, 2004 at 05:24:03PM +0300, Teemu Kalvas wrote:

> What is yet to be decided is in what way the UCD gets included in the
> source distribution. Possibilities are
>
> 1. As the binary object ucd.dat. Pro: small, 82 kB. Con: Not human
> readable, also CVS won't play nice with binaries.
> 2. As the source files from unicode.org. Pro: All text. Con: Megabytes
> of it.
> 3. Don't include the source files, fetch them automatically in the
> build process. Pro: Least amount of space in source distribution
> used. Con: Would introduce a dependency on network connectivity
> to the build process.

As a first cut, just putting the text source files into our source distribution sounds reasonable. It's simple and easy, and depending on people's bandwidth and storage capabilities, it might turn out to be good enough.

If the simple way turns out to be annoyingly expensive for many people, I can't guess at the moment from exactly which direction the complaints will come. I think at least two combinations of your 1/2/3 options are fairly plausible:

-- "I don't need Unicode, why does SBCL source need to carry all this around?" Or "a lot of us need Unicode, but even those of us who do find the bulkiness annoying, and we all agree Unicode seldom changes." In that case, making a separate ucd package on sourceforge, and putting ucd-1.0-source.tar.bz2 and ucd-1.0-binary.tar.bz2 there, might fit most needs.

-- "I need Unicode, but all this stuff is available in standard places [wgetable URLs, Debian CDs, asdf magic, whatever] and I shouldn't need an extra copy just for SBCL." Since at the moment we don't know of any such source that would be acceptable by consensus, this is only hypothetical now, but if such a source comes into existence then some sort of fetching solution would become natural.

The one thing I would propose to do immediately to accommodate possible future cleverness is something that might be done naturally anyway: try to structure the CVS project/directory tree in such a way that possible future solutions unbundling the Unicode data can be done naturally.

--
William Harold Newman <wil...@ai...>
What part of "Gestalt" don't you understand? -- Karsten M. Self
From: Christophe R. <cs...@ca...> - 2004-09-15 21:39:08
Teemu Kalvas <ch...@s2...> writes:

> I have a very ugly utf-8 external-format implementation native to
> fd-streams included here. It'll probably not stay in this form, when
> other external formats are implemented, but this could be tested and
> benchmarked a bit.

I've merged this into sbcl-0.8.13.77.character.17.

High on the list of things to change, I feel, is the dependence in PICK-INPUT-ROUTINE and PICK-OUTPUT-ROUTINE on the order in which the top-level forms in fd-streams are executed... it's not entirely your fault, but you have just made it worse. Thanks anyway. :-)

Cheers,

Christophe
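
The message does not spell out how the ordering dependence arises, so the following is only a generic sketch of one way to make routine selection independent of registration order: key each routine by an explicit (element type, external format) pair instead of relying on its position in a list. It is written in Python for illustration, and every name in it (`register_input_routine`, `pick_input_routine`, the key strings) is hypothetical rather than taken from fd-streams.

```python
# Hypothetical registry: maps (element_type, external_format) to a decoder.
ROUTINES = {}


def register_input_routine(element_type, external_format, routine):
    """Register a routine under an explicit key; duplicates are an error."""
    key = (element_type, external_format)
    if key in ROUTINES:
        raise ValueError(f"duplicate input routine for {key}")
    ROUTINES[key] = routine


def pick_input_routine(element_type, external_format):
    """Select by key, so the result cannot depend on registration order."""
    try:
        return ROUTINES[(element_type, external_format)]
    except KeyError:
        raise LookupError(
            f"no input routine for {(element_type, external_format)}"
        ) from None


# Registration order is deliberately arbitrary here.
register_input_routine("character", "utf-8", lambda octets: octets.decode("utf-8"))
register_input_routine("character", "latin-1", lambda octets: octets.decode("latin-1"))

decode = pick_input_routine("character", "utf-8")
text = decode(b"\xc3\xa4")  # the two-octet UTF-8 encoding of U+00E4
```

Swapping the two registration calls leaves `pick_input_routine` returning the same routines, and a second registration for an existing key fails loudly instead of silently shadowing the first.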