From: Markus S. <mar...@gm...> - 2005-10-21 16:26:12
This discussion veered off the icu-core list, but I think it's really more appropriate on the icu-design list anyway. Comments below.

markus

> Tex Texin <te...@i1...>
> 2005-10-21 04:57
>
> To: Eric Mader <em...@ic...>
> cc: Mark Davis/Cupertino/IBM@IBMUS, Badi Kumar
> <ba...@ya...>, ICU Core <icu...@li...>, [...]
> Subject: Re: [icu-core] Call for agenda items - icu core
> meeting 10/19/2005 10AM PST
> [...]
> ===
> I would add:
>
> ISO 8859-15, big-5, big5-hkscs.

First, for SBCS charsets, we will need not just the charset to be detected but the charset+language combination - because we use language-specific statistics. So for ISO 8859-15, we might have one detector for Finnish, and maybe(!) one for French.

The general problem is that the statistics will have a hard time distinguishing charsets(+languages) that are very, very similar. Since ISO 8859-1 was used for French and Finnish for a long time, I assume that the missing characters are rarely used and won't show up in the statistics. Therefore, I doubt that we will be able to distinguish fr/ISO-8859-1 and fi/ISO-8859-1 from fr/...-15 and fi/...-15. There is of course no difference whatsoever between de/...-1 and de/...-15 etc.

Luckily, ISO 8859 part 15 was published at a time when it wasn't necessary any more because UTF-8 was already widely supported and HTML provided character entities as workarounds. While part 1 remains important because it's the default charset in many 80s/90s standards (like HTML and Java) and is a perfect, algorithmic subset of Unicode, part 15 was born into obsolescence.

Similarly, while we will certainly add Big5, I doubt that we will be able to distinguish it from Big5-HKSCS. In this case, we can also not treat one as a superset of the other because, to my memory, the HKSCS version _replaces_ a lot of mappings (instead of just adding a bunch).

> I would also add the baltic ISO page iso 8859-4

For which languages? Is it commonly used for all three?
(Estonian, Latvian, Lithuanian)

> and Thai 620-2533.
>
> I would think there is a lot of IBM 850 running around. Although DOS is
> hopefully defunct, there were a lot of unix systems running it. But I
> can't say I have run across it recently. Many European progress users
> were using it.

I am surprised to hear that it was used much at all on Unixes. Even Microsoft relegated it to the DOS prompt 10 years ago and replaced it with 1252 for Windows programs. (And IBM replaced it with 858 for Euro support.) I certainly don't expect a lot of emails and web pages to use 850.

Again, we would need to make a list of languages.

> Where we really mean a family, we should return the name of the largest
> superset.

We do, at least sometimes. All of gb2312/gbk/gb18030 result in gb18030.

> I run into a lot of code that sees iso-8859-1 and then converts (with
> iconv or other tools) from iso-8859-1 to unicode and then loses euro
> support etc for the added chars in windows-1252.
>
> I would rather see windows-1252 returned so the code does the right thing
> and also uses the most appropriate label.

Possible - this would be good to discuss... I expect a lot of opinions on this one...

Maybe we can be smart about ISO charsets vs. vendor charsets: If there are any bytes 0x80..0x9F, it is likely a vendor extension (most likely Windows), otherwise it may be ISO. Note that the ISO 8859-1 converter is much faster than any other.

> In general this means returning the windows encoding instead of the ISO.
>
> This is also important for the double byte encodings since the legal
> ranges for bytes are different.
> (cp932, cp932, cp949, cp950. Although I am not sure if cp950 and
> big5-hkscs have a relationship that is either superset/subset or vice
> versa. I thought that some of the cp950 chars conflicted with hkscs.)

As far as I know, Big5 is not a subset of Big5-HKSCS, see above.
What's worse, there are two very different windows-950 charsets now: By default, windows-950 is a form of Big5. If you download and install some package from Microsoft, then windows-950 *changes* to become a form of Big5-HKSCS. This means that on any one Windows machine you can't have both converters, and if you use Windows codepage 950, you don't know what you get.

At any rate, as I said above, I doubt we can distinguish them. We currently only use the encoding scheme for MBCS charsets, and I think it's the same for both. Statistics would only distinguish based on the most frequent characters, and the HKSCS ones are relatively rare.

It's similar to ISO 8859-15 in that Big5-HKSCS was pretty much born into obsolescence. The repertoire definition served as input for Unicode character assignments, and Unicode will be used mostly to support HKSCS.

> I hope that helps.
>
>
> Mark Davis wrote:
> > >
> > > It is our intention to
> > > support the most commonly used charsets -- big 5 just slipped through
> > > the cracks. However, it is scheduled for this release.
> > >
> > > The list we have now is the following. If there are others that are
> > > important, please let us know. (Note: these are really families of
> > > encodings, since there is practically no data that distinguishes
> > > between, say, GBK and GB18030).
> > >
> > > *Character Set*   *Languages*
> > > UTF-8
> > > UTF-16BE
> > > UTF-16LE
> > > UTF-32BE
> > > UTF-32LE
> > > Shift_JIS
> > > ISO-2022-JP
> > > ISO-2022-CN
> > > ISO-2022-KR
> > > GB18030
> > > EUC-JP
> > > EUC-KR
> > > ISO-8859-1        Danish, Dutch, English, French, German, Italian, Norwegian, Portuguese, Swedish
> > > ISO-8859-2        Czech, Hungarian, Polish, Romanian
> > > ISO-8859-5        Russian
> > > ISO-8859-6        Arabic
> > > ISO-8859-7        Greek
> > > ISO-8859-8        Hebrew
> > > windows-1251      Russian
> > > windows-1256      Arabic
> > > KOI8-R            Russian
> > > ISO-8859-9        Turkish

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
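
A rough sketch of the 0x80..0x9F idea from the ISO-vs-vendor discussion above, in Python. This is only an illustration and not how the ICU detector works: the function name `guess_latin_label` is made up for this example, and it assumes the data is already believed to be Western European SBCS text, so it only chooses between the labels ISO-8859-1 and windows-1252. The last lines show the Euro loss described in the quoted text, using Python's built-in `latin-1` and `cp1252` codecs.

```python
def guess_latin_label(data: bytes) -> str:
    """Pick a label for text already believed to be Latin-1-like.

    Hypothetical helper: bytes 0x80..0x9F are C1 control codes in
    ISO 8859-1, so their presence suggests a vendor extension
    (assumed here to be windows-1252).
    """
    if any(0x80 <= b <= 0x9F for b in data):
        return "windows-1252"
    return "ISO-8859-1"


sample = b"Preis: 5 \x80"  # 0x80 is the Euro sign in windows-1252

label = guess_latin_label(sample)
as_latin1 = sample.decode("latin-1")  # 0x80 becomes the control code U+0080
as_cp1252 = sample.decode("cp1252")   # 0x80 becomes U+20AC, the Euro sign

print(label)
print(as_latin1.endswith("\u0080"), as_cp1252.endswith("\u20ac"))
```

Whether to return the windows label for every ISO-8859-1 detection, or only when such bytes are actually present, is the open question raised above; the sketch takes the second route.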