[icu-design] Re: [icu-core] Call for agenda items - icu core meeting 10/19/200510AM PST


This discussion veered off the icu-core list, but I think it's really
more appropriate on the icu-design list anyway.
Comments below.
markus

> Tex Texin <te...@i1...>
> 2005-10-21 04:57
>
>         To:     Eric Mader <em...@ic...>
>         cc:     Mark Davis/Cupertino/IBM@IBMUS, Badi Kumar
> <ba...@ya...>, ICU Core <icu...@li...>, [...]
>         Subject:        Re: [icu-core] Call for agenda items - icu core
> meeting 10/19/200510AM   PST
> [...]
> ===
> I would add:
>
> ISO 8859-15, big-5, big5-hkscs.

First, for SBCS charsets, we will need not just the charset to be
detected but the charset+language combination - because we use
language-specific statistics.

So for ISO 8859-15, we might have one detector for Finnish, and
maybe(!) one for French.
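
A rough sketch of what "one detector per charset+language pair" could look like. This is a toy, not the ICU detector: the table name `FREQUENT_LETTERS`, the functions `score` and `detect`, and the tiny letter sets standing in for real language statistics are all made up for illustration.

```python
# Toy illustration only: these "statistics" are just small sets of accented
# letters chosen by hand, not the data a real detector would use.
from typing import Dict, Tuple

# Hypothetical detector table keyed by (charset, language).
FREQUENT_LETTERS: Dict[Tuple[str, str], str] = {
    ("ISO-8859-1", "fr"): "éèàçê",
    ("ISO-8859-1", "fi"): "äö",
}


def score(sample: bytes, charset: str, letters: str) -> float:
    """Fraction of non-ASCII bytes that are frequent letters for this pair."""
    expected = set(letters.encode(charset))
    high = [b for b in sample if b >= 0x80]
    if not high:
        return 0.0
    return sum(b in expected for b in high) / len(high)


def detect(sample: bytes) -> Tuple[str, str]:
    """Return the (charset, language) pair with the best score."""
    return max(
        FREQUENT_LETTERS,
        key=lambda pair: score(sample, pair[0], FREQUENT_LETTERS[pair]),
    )


print(detect("Hyvää päivää".encode("iso-8859-1")))
print(detect("été à la française".encode("iso-8859-1")))
```

The point is only that the same charset shows up once per language, each time with its own statistics.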

The general problem is that the statistics will have a hard time
distinguishing charsets(+languages) that are very, very similar. Since
ISO 8859-1 was used for French and Finnish for a long time, I assume
that the missing characters are rarely used and won't show up in the
statistics. Therefore, I doubt that we will be able to distinguish
fr/ISO-8859-1 and fi/ISO-8859-1 from fr/...-15 and fi/...-15. There is
of course no difference whatsoever between de/...-1 and de/...-15 etc.
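
To make the overlap concrete, here is a small check using Python's built-in codecs (not ICU converters) that lists the byte values where parts 1 and 15 disagree, and confirms that a German sample encodes to the same bytes in both. The sample string is just an illustration.

```python
# Find every byte in 0xA0..0xFF that ISO 8859-1 and ISO 8859-15 map to
# different characters.
differing = {}
for b in range(0xA0, 0x100):
    part1 = bytes([b]).decode("iso-8859-1")
    part15 = bytes([b]).decode("iso-8859-15")
    if part1 != part15:
        differing[b] = (part1, part15)

for b, (part1, part15) in differing.items():
    print(f"0x{b:02X}: part 1 = {part1!r}, part 15 = {part15!r}")

# German uses none of the differing positions, so the bytes are identical.
german = "Größe, Äpfel, Übung, schön"
same_bytes = german.encode("iso-8859-1") == german.encode("iso-8859-15")
print(same_bytes)
```

Only eight positions differ, so a statistical detector would have to see one of those bytes in the sample to tell the two parts apart at all.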

Luckily, ISO 8859 part 15 was published at a time when it wasn't
necessary any more because UTF-8 was already widely supported and HTML
provided character entities as workarounds. While part 1 remains
important because it's the default charset in many 80s/90s standards
(like HTML and Java) and is a perfect, algorithmic subset of Unicode,
part 15 was born into obsolescence.

Similarly, while we will certainly add Big5, I doubt that we will be
able to distinguish it from Big5-HKSCS. In this case, we can also not
treat one as a superset of the other because, to my memory, the HKSCS
version _replaces_ a lot of mappings (instead of just adding a bunch).

> I would also add the baltic ISO page iso 8859-4

For which languages? Is it commonly used for all three? (Estonian,
Latvian, Lithuanian)

> and Thai 620-2533.
>
> I would think there is a lot of IBM 850 running around. Although DOS is
> hopefully defunct, there were a lot of unix systems running it. But I
> can't
> say I have run across it recently. Many European progress users were using
> it.

I am surprised to hear that it was used much at all on Unixes. Even
Microsoft relegated it to the DOS prompt 10 years ago and replaced it
with 1252 for Windows programs. (And IBM replaced it with 858 for Euro
support.) I certainly don't expect a lot of emails and web pages to
use 850. Again, we would need to make a list of languages.

> Where we really mean a family, we should return the name of the largest
> superset.

We do, at least sometimes. All of gb2312/gbk/gb18030 result in gb18030.
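
As a sketch of the idea, reporting the largest superset amounts to a lookup table. The table and function names below are hypothetical; the only entry confirmed above is the GB family.

```python
# Hypothetical family table; only the GB family is taken from the discussion.
FAMILY_SUPERSET = {
    "gb2312": "gb18030",
    "gbk": "gb18030",
    "gb18030": "gb18030",
}


def reported_name(detected: str) -> str:
    """Map a detected charset to the largest superset of its family, if known."""
    return FAMILY_SUPERSET.get(detected.lower(), detected)


# Bytes produced by the smaller charset decode correctly with the superset.
raw = "中文".encode("gb2312")
decoded = raw.decode(reported_name("GB2312"))
print(reported_name("GBK"), decoded)
```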

> I run into a lot of code that sees iso-8859-1 and then converts (with
> iconv
> or other tools) from iso-8859-1 to unicode and then loses euro support etc.
> for the added chars in windows-1252.
>
> I would rather see windows-1252 returned so the code does the right thing
> and also uses the most appropriate label.

Possible - this would be good to discuss... I expect a lot of opinions
on this one...

Maybe we can be smart about ISO charsets vs. vendor charsets: If there
are any bytes 0x80..0x9F, it is likely a vendor extension (most likely
Windows), otherwise it may be ISO.
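
A minimal sketch of that heuristic, using Python's built-in codecs rather than ICU converters. The function name `guess_latin_label` is made up, and picking windows-1252 specifically (rather than some other vendor charset) is an assumption.

```python
def guess_latin_label(data: bytes) -> str:
    """Return 'windows-1252' if any byte is in 0x80..0x9F, else 'ISO-8859-1'."""
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1252"  # assumed vendor charset; others are possible
    return "ISO-8859-1"


sample = b"Price: \x80 100"  # 0x80 is the euro sign in windows-1252
label = guess_latin_label(sample)
text = sample.decode(label)
print(label, text)

# Decoding the same bytes as ISO 8859-1 yields a C1 control character
# instead of the euro sign, which is the loss described in the quoted text.
lossy = sample.decode("iso-8859-1")
print(repr(lossy))
```

Text without any bytes in that range decodes identically either way, so the label only matters when such bytes are present.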

Note that the ISO 8859-1 converter is much faster than any other.

> In general this means returning the windows encoding instead of the ISO.
>
> This is also important for the double byte encodings since the legal
> ranges
> for bytes are different.
> (cp932, cp932, cp949, cp950. Although I am not sure if cp950 and
> big5-hkscs
> have a relationship that is either superset/subset or vice versa. I
> thought
> that some of the cp950 chars conflicted with hkscs.)

As far as I know, Big5 is not a subset of Big5-HKSCS, see above.

What's worse, there are two very different windows-950 charsets now:
By default, windows-950 is a form of Big5. If you download and install
some package from Microsoft, then windows-950 *changes* to become a
form of Big5-HKSCS. This means that on any one Windows machine you
can't have both converters, and if you use Windows codepage 950, you
don't know what you get.

At any rate, as I said above, I doubt we can distinguish them. We
currently only use the encoding scheme for MBCS charsets, and I think
it's the same for both. Statistics would only distinguish based on the
most frequent characters, and the HKSCS ones are relatively rare.
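
To illustrate why a scheme-level check cannot separate the two, here is a sketch of a structural test plus a comparison using Python's `big5` and `big5hkscs` codecs. The lead and trail byte ranges are my assumption about the general Big5-family layout, not something taken from the detector, and `looks_like_big5_scheme` is a made-up name.

```python
def looks_like_big5_scheme(data: bytes) -> bool:
    """Check only the byte structure, not whether each pair is assigned.

    Assumed layout: lead byte 0x81..0xFE, trail byte 0x40..0x7E or 0xA1..0xFE.
    """
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:  # single-byte ASCII
            i += 1
        elif 0x81 <= b <= 0xFE and i + 1 < len(data):
            trail = data[i + 1]
            if not (0x40 <= trail <= 0x7E or 0xA1 <= trail <= 0xFE):
                return False
            i += 2
        else:
            return False  # stray byte or truncated pair
    return True


common = "中文"  # two very frequent characters
as_big5 = common.encode("big5")
as_hkscs = common.encode("big5hkscs")
print(as_big5 == as_hkscs, looks_like_big5_scheme(as_big5))
```

Frequent characters like these produce the same bytes under both codecs and pass the same structural test, so neither the scheme nor frequency statistics gives a signal.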

It's similar to ISO 8859-15 in that Big5-HKSCS was pretty much born
into obsolescence. The repertoire definition served as input for
Unicode character assignments, and Unicode will be used mostly to
support HKSCS.

> I hope that helps.
>
> > Mark Davis wrote:
> > >
> > > It is our intention to
> > > support the most commonly used charsets -- big 5 just slipped through
> > > the cracks. However, it is scheduled for this release.
> > >
> > > The list we have now is the following. If there are others that are
> > > important, please let us know. (Note: these are really families of
> > > encodings, since there is practically no data that distinguishes
> > > between, say, GBK and GB18030).
> > > *Character Set *
> > >
> > > *Languages *
> > > UTF-8
> > > UTF-16BE
> > > UTF-16LE
> > > UTF-32BE
> > > UTF-32LE
> > > Shift_JIS
> > > ISO-2022-JP
> > > ISO-2022-CN
> > > ISO-2022-KR
> > > GB18030
> > > EUC-JP
> > > EUC-KR
> > > ISO-8859-1    Danish, Dutch, English, French, German, Italian, Norwegian,
> > > Portuguese, Swedish
> > > ISO-8859-2    Czech, Hungarian, Polish, Romanian
> > > ISO-8859-5    Russian
> > > ISO-8859-6    Arabic
> > > ISO-8859-7    Greek
> > > ISO-8859-8    Hebrew
> > > windows-1251  Russian
> > > windows-1256  Arabic
> > > KOI8-R        Russian
> > > ISO-8859-9    Turkish

--
Opinions expressed here may not reflect my company's positions unless
otherwise noted.
