From: Yoshiki O. <Yos...@ac...> - 2004-04-07 03:41:08
Ned and everyone,

> On Tuesday 06 April 2004 2:50 pm, Yoshiki Ohshima wrote:
> > > * Wide characters (Unicode is defined as 21 bit; but Yoshiki really needs
> > > 24)
> > I'd like to have 30 bit for a char.
>
> What's the advantage of doing that? What can't you represent in 24
> bits?
>
> I'm assuming that you can put most of the extra meta-info into Strings. To
> uniquely index glyphs, wouldn't 24 bits be enough (possibly with some
> information from the String that holds the characters)?

The extra bits are used for the meta-info, yes, and the meta-info is primarily used as a "language tag" to label the language. In Squeak, the tag will be used to select the proper scanner and font. For this, we'd like to have more than 8 (2^(24-21)) languages. CJKV, Armenian, Mongolian, the Philippine languages; these are all languages that require "meta-info".

I see that my current design is somewhat of a hybrid of two approaches.

One approach would be that a string carries a sequence of naked 21-bit quantities (or these can be wider, but still naked code points). Glyph info, scanning rules, etc. are supplied by attributes attached to the string (probably in the form of a TextAttribute). In this case, we can't, say, inspect or print a string. That may not be too bad, given that the current inspector doesn't show the $<char> form for a slot of a string. However, I think people expect to be able to inspect a "naked" string. An extreme along this line is that we don't even need Character objects; we can make a string of length 1 behave like today's character. After all, "a character cannot print itself" is the way Unicode is designed. So, using Unicode is kind of a synonym for this approach. (I wouldn't pursue this extreme in Squeak, though.)

The other approach is to make a character a self-contained thing. It knows how to print itself, etc. To make this happen, a Character has to carry more than a naked code point, and the higher bits in the word are where the extra information goes.

I kind of like the former approach.
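To make the bit counting concrete, here is a minimal sketch of the self-contained layout, written in Python rather than Squeak purely for illustration. It assumes the low 21 bits hold the code point and everything above them holds the language tag. The names (tag_capacity, pack, unpack) and the tag numbers are made up for this example; this is not the actual implementation.

```python
# Hypothetical layout, for illustration only: the low 21 bits hold the
# Unicode code point, and the remaining high bits hold a language tag.
CODE_POINT_BITS = 21
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1


def tag_capacity(char_bits):
    """Number of distinct language tags that fit above the code point."""
    return 1 << (char_bits - CODE_POINT_BITS)


def pack(code_point, language_tag, char_bits=30):
    """Combine a code point and a language tag into one integer."""
    if not 0 <= code_point <= CODE_POINT_MASK:
        raise ValueError("code point does not fit in 21 bits")
    if not 0 <= language_tag < tag_capacity(char_bits):
        raise ValueError("language tag does not fit in the high bits")
    return (language_tag << CODE_POINT_BITS) | code_point


def unpack(value):
    """Split a packed value back into (code_point, language_tag)."""
    return value & CODE_POINT_MASK, value >> CODE_POINT_BITS


print(tag_capacity(24))  # 8
print(tag_capacity(30))  # 512
```

With 24 bits only tags 0 through 7 are available, so pack(0x76F4, 8, char_bits=24) raises ValueError, while the same call with the default 30 bits succeeds and unpack returns the original pair.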
In the latter approach, people would like to add more attributes in the bits, but are always under the fear of running out of bits. And the indirect nature of it is not nice. However, the implementation is not there yet. Also, the self-contained nature is still nice in Smalltalk, and I would imagine that most Smalltalkers expect to be able to inspect strings and characters.

So far, the hybrid approach, in which you can put things in the higher bits mostly to provide default values, and can also override the defaults by attaching TextAttributes, seems the least destructive. Implementing the former approach is my "future plan," but not for V4. (Someone could try it.)

As for whether a (wide) character should be an immediate or not, I would say that it doesn't have to be; characters aren't immediates in today's Squeak. Even if we make them so, we can't use the flyweight pattern (there will be too many). And the character objects floating around (not in a String) are not that many in typical use. I think we can live with non-immediate character objects. This leaves room for future extensions such as implementing a konjaku-mojikyo-like thing (http://www.mojikyo.org/).

If the "gray-beards' decision" is "we go with 24 bit immediate chars", well, I could do it. In this case, the 3 bits would be used to discriminate unified kanji, and I would ask people who want to use other languages to put up with a little inconvenience.

-- Yoshiki