From: Robert J. M. <sf-...@ro...> - 2005-08-23 17:47:12
|
Having been reminded on IRC that I've done this work, and that it might
be nice to have in sbcl itself, here's a piece of code that gives
unicode characters names, and makes their readable output ascii-safe.

The code-data "array" stores pretty much everything there is to know
about unicode characters. The idea was to build a full unicode string
manipulation library on top of this. Some of it's been done (not
included here) but I got a little distracted.

Currently, it builds its tables at load-time (because this was simpler)
from the files UnicodeData.txt (already used by sbcl, though this has
its own copy), PropList.txt, SpecialCasing.txt, and CaseFolding.txt. I
envision someone wanting it always-on would dump a core containing it
anyway, so I don't see the large load-time that results as a
particularly bad side-effect. The fact that large amounts of
currently-unused data are loaded is potentially one, but it's fairly
easy to cut out if that's wanted, and I'm still hoping to get
sufficiently motivated to start building unicode stuff atop what's here
again.

Rather than attaching it, I've put it at
http://www.rojoma.com/sb-char-names.tar.bz2

-- 
Robert Macomber
sf-...@ro...
From: Nikodemus S. <nik...@ra...> - 2006-02-21 11:36:44
|
Attached is a resurrected version of my unicode character name patch,
against SBCL 0.9.935, plus new file src/code/huffman.lisp.

Character-to-name and name-to-character mappings are both O(ln
size-of-character-space), which seems reasonable enough for me.

This bloats the core by 600k or so, with the character names
huffman-encoded. Without encoding the bloat is around 900k, so the win
is not huge but reasonable, and slightly better than using 6 bits to
encode the 32 symbols encountered in character names.

I did some quick experiments using extended huffman-encoding, but
didn't manage to improve the compression rate. A more sophisticated
scheme should do better, but I'm not sure it is worth the effort.

Do people consider character names worth +600k bloat (as in, should
they be enabled by default or not)?

Cheers,

 -- Nikodemus              Schemer: "Buddha is small, clean, and serious."
                   Lispnik: "Buddha is big, has hairy armpits, and laughs."
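The trade-off between a fixed-width code and a Huffman code mentioned above can be illustrated with a small sketch. This is not the code from src/code/huffman.lisp, which is not shown in the thread; all function names here are hypothetical, and the sketch only counts bits rather than producing an encoded table.

```lisp
;; Illustrative only: none of these names come from the actual patch.

(defun symbol-frequencies (strings)
  "Return an alist of (character . count) over all STRINGS."
  (let ((table (make-hash-table)))
    (dolist (s strings)
      (loop for ch across s do (incf (gethash ch table 0))))
    (loop for ch being the hash-keys of table using (hash-value n)
          collect (cons ch n))))

(defun huffman-code-lengths (frequencies)
  "Return an alist of (character . code length in bits) for FREQUENCIES."
  ;; Each node is (weight . leaves), where leaves is a list of
  ;; (character . depth). Merging two nodes pushes their leaves one
  ;; level deeper, which is all we need to know the code lengths.
  (let ((nodes (mapcar (lambda (f) (cons (cdr f) (list (cons (car f) 0))))
                       frequencies)))
    (loop while (rest nodes)
          do (setf nodes (sort nodes #'< :key #'car))
             (let ((a (pop nodes))
                   (b (pop nodes)))
               (push (cons (+ (car a) (car b))
                           (mapcar (lambda (leaf)
                                     (cons (car leaf) (1+ (cdr leaf))))
                                   (append (cdr a) (cdr b))))
                     nodes)))
    (cdr (first nodes))))

(defun huffman-bits (strings)
  "Total bits needed to Huffman-encode STRINGS (code table not counted)."
  (let* ((freqs (symbol-frequencies strings))
         (lengths (huffman-code-lengths freqs)))
    (loop for (ch . n) in freqs
          sum (* n (cdr (assoc ch lengths))))))

(defun fixed-width-bits (strings)
  "Total bits needed when every symbol gets the same number of bits."
  (let* ((freqs (symbol-frequencies strings))
         (width (max 1 (integer-length (1- (length freqs))))))
    (* width (reduce #'+ freqs :key #'cdr))))
```

For the toy input `'("AAAB" "AC")` the fixed-width encoding needs 12 bits and the Huffman encoding 8, because the frequent symbol A gets a one-bit code. On real character names the gain depends on how skewed the letter frequencies are.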
From: Yaroslav K. <kav...@je...> - 2006-02-21 12:03:48
|
Nikodemus Siivola wrote:

> Attached is a resurrected version of my unicode character name patch,
> against SBCL 0.9.935, plus new file src/code/huffman.lisp.

The attachment is missing.

> Character-to-name and name-to-character mappings are both O(ln
> size-of-character-space), which seems reasonable enough for me.

mappings from what and to what?

-- 
WBR, Yaroslav Kavenchuk.
From: Harald Hanche-O. <ha...@ma...> - 2006-02-21 12:25:54
|
+ Yaroslav Kavenchuk <kav...@je...>:

| Nikodemus Siivola wrote:
| > Character-to-name and name-to-character mappings are both O(ln
| > size-of-character-space), which seems reasonable enough for me.
|
| mappings from what and to what?

Somewhere at the Unicode web site there is a file UnicodeData.txt
which contains lines like this one:

2232;CLOCKWISE CONTOUR INTEGRAL;Sm;0;ON;;;;;Y;;;;;

So the character-to-name mapping ought to map #.(code-char #x2232) (∲)
to the string "CLOCKWISE CONTOUR INTEGRAL", and the inverse mapping
does, well, the inverse.

As to whether 600 K is bloat, I wonder how hard it would be to put the
conversion tables in a fasl that is automagically loaded on demand?
(My guess is it shouldn't be hard at all.)

- Harald
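The two mappings described above can be sketched by reading the first two semicolon-separated fields of each UnicodeData.txt line. This is only an illustration: the function names are hypothetical, and plain hash tables stand in for whatever compressed, logarithmic-time structure the patch actually uses.

```lisp
;; Illustrative only: names are made up, and hash tables replace the
;; patch's own (unseen) lookup structure.

(defun parse-unicode-data-line (line)
  "Return the code point and the name from one UnicodeData.txt LINE."
  (let* ((first-semi (position #\; line))
         (second-semi (position #\; line :start (1+ first-semi))))
    (values (parse-integer line :end first-semi :radix 16)
            (subseq line (1+ first-semi) second-semi))))

(defun build-name-tables (lines)
  "Return two hash tables: code point -> name and name -> code point."
  (let ((code->name (make-hash-table))
        (name->code (make-hash-table :test #'equal)))
    (dolist (line lines)
      (multiple-value-bind (code name) (parse-unicode-data-line line)
        (setf (gethash code code->name) name
              (gethash name name->code) code)))
    (values code->name name->code)))
```

With the sample line from the message, the first table maps #x2232 to "CLOCKWISE CONTOUR INTEGRAL" and the second maps the name back to #x2232.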
From: Yaroslav K. <kav...@je...> - 2006-02-21 12:40:38
|
Harald Hanche-Olsen wrote:

> Somewhere at the Unicode web site there is a file UnicodeData.txt
> which contains lines like this one:
>
> 2232;CLOCKWISE CONTOUR INTEGRAL;Sm;0;ON;;;;;Y;;;;;

Oops, thanks :)

> As to whether 600 K is bloat, I wonder how hard it would be to put the
> conversion tables in a fasl that is automagically loaded on demand?
> (My guess is it shouldn't be hard at all.)

Maybe not load fasl on demand, but with parser load UnicodeData.txt on
demand?

-- 
WBR, Yaroslav Kavenchuk.
From: Harald Hanche-O. <ha...@ma...> - 2006-02-21 13:23:10
|
+ Yaroslav Kavenchuk <kav...@je...>:

| Maybe not load fasl on demand, but with parser load UnicodeData.txt
| on demand?

I'm assuming that building the compressed tables is time consuming,
which would make this an undesirable solution.

- Harald
From: Yaroslav K. <kav...@je...> - 2006-02-21 12:53:04
|
Harald Hanche-Olsen wrote:

> So the character-to-name mapping ought to map #.(code-char #x2232) (∲)
> to the string "CLOCKWISE CONTOUR INTEGRAL", and the inverse mapping
> does, well, the inverse.

Maybe, show character as #\<<CHARACTER_NAME>>?

Not

* (code-char 76) -> #\L

but

* (code-char 76) -> #\LATIN_CAPITAL_LETTER_L?

Thanks.

-- 
WBR, Yaroslav Kavenchuk.
From: Nikodemus S. <nik...@ra...> - 2006-02-21 13:04:40
|
Harald Hanche-Olsen <ha...@ma...> writes:

> As to whether 600 K is bloat, I wonder how hard it would be to put the
> conversion tables in a fasl that is automagically loaded on demand?
> (My guess is it shouldn't be hard at all.)

Not hard, but adds complexity to distributing SBCL and applications
derived from it. If lazy loading is desired, I'd rather put the tables
in a segment of the core loaded on demand.

Cheers,

 -- Nikodemus              Schemer: "Buddha is small, clean, and serious."
                   Lispnik: "Buddha is big, has hairy armpits, and laughs."
From: David L. <da...@li...> - 2006-02-21 13:22:32
|
Quoting Nikodemus Siivola (nik...@ra...):

> Not hard, but adds complexity to distributing SBCL and applications
> derived from it. If lazy loading is desired, I'd rather put the tables
> in a segment of the core loaded on demand.

Without having looked at your patch, handling such a file should not be
more complex for a distributor than bundling contribs, right?

d.
From: Nikodemus S. <nik...@ra...> - 2006-02-21 13:41:13
|
David Lichteblau <da...@li...> writes:

> Quoting Nikodemus Siivola (nik...@ra...):

> Without having looked at your patch, handling such a file should not be
> more complex for a distributor than bundling contribs, right?

I think it would: typical application delivery seems to me to involve
loading the code and dumping an image, and bundling it somehow.
Contribs are a non-issue in that case. I can however imagine an
application that suddenly needed unicode character names due to
user-input...

...so to avoid the need to separately bundle the file, S-L-A-D would
have to be extended with &KEY LOAD-UNICODE-NAMES or similar. Euugh.

Making the unicode names a separate contrib seems more attractive if
the 600k seems too much: (REQUIRE :SB-UNICODE-NAMES) would be simple
enough, and less magical than autoloading.

Cheers,

 -- Nikodemus              Schemer: "Buddha is small, clean, and serious."
                   Lispnik: "Buddha is big, has hairy armpits, and laughs."
From: Robert J. M. <sf-...@ro...> - 2006-03-06 20:42:30
Attachments:
char-names.diff
|
Here's a patch that adds a few things my patch from 23 August does
which Nikodemus's does not -- it allows #\Uxxxx and #\Uxxxxxxxx as
input character names (as clisp does) and causes
non-standard-graphic-chars to be displayed by name when printed, again
as clisp and allegro do it. If a character has no name, it (that is,
char-name) resorts to the U-prefixed form. This makes all characters
safe to print (readably, and in isolation, not as part of strings) in
non-full-unicode locales.

-- 
Robert Macomber
sf-...@ro...
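The U-prefixed fallback described above can be sketched as a pair of helpers, one producing the name and one reading it back. This is not taken from char-names.diff: the function names are hypothetical, and the choice of four hex digits for code points below #x10000 and eight otherwise is an assumption based only on the two input forms mentioned.

```lisp
;; Illustrative only: names and the 4-versus-8 digit rule are assumptions.

(defun fallback-char-name (code)
  "Return a U-prefixed hexadecimal name for the code point CODE."
  (format nil "U~V,'0X" (if (< code #x10000) 4 8) code))

(defun parse-fallback-char-name (name)
  "Return the code point named by NAME, or NIL unless NAME looks like
Uxxxx or Uxxxxxxxx with hexadecimal digits."
  (let ((digits (1- (length name))))
    (when (and (member digits '(4 8))
               (char-equal (char name 0) #\U)
               (every (lambda (ch) (digit-char-p ch 16))
                      (subseq name 1)))
      (parse-integer name :start 1 :radix 16))))
```

Because such a name uses only ASCII letters and digits, a character printed this way can be read back in a locale that cannot represent the character itself.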
From: Juho S. <js...@ik...> - 2006-03-16 03:25:25
|
<sf-...@ro...> wrote:

> Here's a patch that adds a few things my patch from 23 August does
> which Nikodemus's does not -- it allows #\Uxxxx and #\Uxxxxxxxx as
> input character names (as clisp does) and causes non-standard-graphic-
> chars to be displayed by name when printed, again as clisp and allegro
> do it. If a character has no name, it (that is, char-name) resorts to
> the U-prefixed form. This makes all characters safe to print
> (readably, and in isolation, not as part of strings) in
> non-full-unicode locales.

Thanks, applied.

-- 
Juho Snellman