Re: [Squeak-VMdev] Version 4 changes

  Ned and everyone,

> On Tuesday 06 April 2004 2:50 pm, Yoshiki Ohshima wrote:
> > > * Wide characters (Unicode is defined as 21 bit; but Yoshiki really needs
> > > 24)
> >  I'd like to have 30 bit for a char.
> 
> What's the advantage of doing that? What can't you represent in 24
> bits?
> 
> I'm assuming that you can put most of the extra meta-info into Strings. To 
> uniquely index glyphs, wouldn't 24 bits be enough (possibly with some 
> information from the String that holds the characters)?

  The extra bits are used for the meta-info, yes, and the meta-info is
primarily used as a "language tag" to label the language.  In Squeak,
the tag will be used to select the proper scanner and font.  In this
case, we'd like to have more than 8 (2^(24-21)) languages.  CJKV,
Armenian, Mongolian, and the Philippine languages are among the
languages that require "meta-info".
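
  To make the bit budget concrete, here is a rough sketch of the
arithmetic.  Python is used only for illustration, and the names
(CODE_POINT_BITS, tag_capacity) are hypothetical, not anything in
Squeak.  It assumes every bit above the 21-bit code point is spent on
the language tag.

```python
# Assumption: the low 21 bits hold the Unicode code point and all
# remaining bits of the character value hold the language tag.
CODE_POINT_BITS = 21


def tag_capacity(char_bits):
    """Return how many distinct language tags fit in a char of char_bits bits."""
    if char_bits < CODE_POINT_BITS:
        raise ValueError("a char needs at least 21 bits for the code point")
    return 2 ** (char_bits - CODE_POINT_BITS)


print(tag_capacity(24))  # 8 tags with a 24-bit char
print(tag_capacity(30))  # 512 tags with a 30-bit char
```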

  I see that my current design is something of a hybrid of two
approaches.

  One approach would be that a string carries a sequence of naked
21-bit quantities (these could be wider, but still naked code points).
Glyph info, scanning rules, etc. are supplied by the attributes
attached to the string (probably in the form of TextAttributes).  In
this case, we can't, say, inspect or print a string.  That may not be
too bad, given that the current inspector doesn't show the $<char>
form for a slot of a string.  However, I think people expect to be
able to inspect a "naked" string.

  An extreme along this line is that we don't even need Character
objects; we can make a string of length 1 behave like today's
character.  After all, "a character cannot print itself" is the way
Unicode is designed.  So, using Unicode is more or less synonymous
with this approach.  (I wouldn't pursue this extreme in Squeak,
though.)

  The other approach is to make a character a self-contained thing.
It knows how to print itself, etc.  To make this happen, a Character
has to carry more than a naked code point, and the higher bits in the
word are where that extra information goes.
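
  As a minimal sketch of a self-contained character value, assuming a
layout with the code point in the low 21 bits and a language tag in
the bits above it, packing and unpacking could look like the
following.  The function names and the tag number are made up for
illustration (Python here, not Squeak code).

```python
# Assumed layout: [ language tag | 21-bit code point ]
CODE_POINT_BITS = 21
CODE_POINT_MASK = (1 << CODE_POINT_BITS) - 1  # 0x1FFFFF
MAX_CODE_POINT = 0x10FFFF                     # top of the Unicode code space


def pack_char(code_point, language_tag):
    """Combine a code point and a language tag into one integer."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("code point out of Unicode range")
    if language_tag < 0:
        raise ValueError("language tag must be non-negative")
    return (language_tag << CODE_POINT_BITS) | code_point


def code_point_of(value):
    """Extract the naked code point from a packed value."""
    return value & CODE_POINT_MASK


def language_tag_of(value):
    """Extract the language tag from a packed value."""
    return value >> CODE_POINT_BITS


JAPANESE = 5  # hypothetical tag number, not a real Squeak constant
ch = pack_char(0x76F4, JAPANESE)  # U+76F4, a unified CJK ideograph
print(hex(code_point_of(ch)), language_tag_of(ch))
```

  With this layout the same code point packed with two different tags
yields two different character values, which is what lets a character
pick its own scanner and font.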

  I kind of like the former approach.  In the latter approach, people
would like to add more attributes in the bits but are always in fear
of running out of bits.  And the indirect nature of it is not nice.

  However, the implementation is not there yet.  Also, the
self-contained nature is still nice in Smalltalk, and I would imagine
that most Smalltalkers expect to be able to inspect strings and
characters.

  So far, the hybrid approach, in which you can put things in the
higher bits mostly to provide the default value, while still being
able to override the default by attaching TextAttributes, seems the
least destructive.  Implementing the former approach is my "future
plan," but not for V4.  (Someone could try it.)
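
  A rough sketch of how the hybrid lookup might resolve the language
for each character, assuming the tag sits above a 21-bit code point
and using a plain dictionary as a stand-in for TextAttributes.  All
names and tag numbers are hypothetical, and Python is used only for
illustration.

```python
# Assumed layout: [ language tag | 21-bit code point ]
CODE_POINT_BITS = 21


def default_tag(value):
    """The default language tag carried in the character's higher bits."""
    return value >> CODE_POINT_BITS


def effective_tag(value, index, overrides):
    """Use the attribute override at this index if any, else the default.

    overrides maps a string index to a tag; it stands in for
    attributes attached to the string.
    """
    return overrides.get(index, default_tag(value))


# Two characters that both carry default tag 1 in their higher bits.
chars = [(1 << CODE_POINT_BITS) | 0x76F4, (1 << CODE_POINT_BITS) | 0x5165]
overrides = {1: 2}  # the attribute overrides the second character only

tags = [effective_tag(v, i, overrides) for i, v in enumerate(chars)]
print(tags)  # [1, 2]
```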

  As for whether a (wide) character should be an immediate or not, I
would say that it doesn't have to be; they aren't in today's Squeak.
Even if we make them so, we can't use the flyweight pattern (there
will be too many).  And, the character objects floating around (not in
a String) are not that many in typical use.  I think we can live with
non-immediate character objects.  This leaves room for future
extensions such as implementing a konjaku-mojikyo-like thing
(http://www.mojikyo.org/).

  If the "gray-beards' decision" is "we go with 24 bit immediate
chars", well, I could do it.  In this case, the 3 bits would be used
to discriminate unified kanji, and I would ask people who want to use
some other languages to put up with a little inconvenience.

-- Yoshiki