From: Mike F. <mai...@gm...> - 2012-06-12 19:53:10
Hi,

when testing the AlphabeticIndex API

    http://icu-project.org/apiref/icu4c/classAlphabeticIndex.html

with Chinese stroke count sorting, I stumbled into some problems.

When sorting the following characters

    1 a b c y z 一 乙 丁 二 卜 万 木 卡 白 石 不 井

in the zh_TW locale into index buckets, I get the following result (I set the underflow and overflow labels to something funny just for testing):

-----------
☺UnderflowLabel☺ bucketIndex=0
1 a b c y z
-----------
一 bucketIndex=1
一 乙
-----------
丁 bucketIndex=2
丁 二 卜 万 <------------------ problem here! 3 strokes but in the 2 stroke bucket!
-----------
丈 bucketIndex=3
-----------
不 bucketIndex=4
不 井 木
-----------
且 bucketIndex=5
卡 白 石
-----------
丞 bucketIndex=6
-----------
串 bucketIndex=7
-----------
並 bucketIndex=8
-----------
亭 bucketIndex=9
-----------
乘 bucketIndex=10
-----------
乾 bucketIndex=11
-----------
傀 bucketIndex=12
-----------
亂 bucketIndex=13
-----------
僎 bucketIndex=14
-----------
僵 bucketIndex=15
-----------
儐 bucketIndex=16
-----------
償 bucketIndex=17
-----------
叢 bucketIndex=18
-----------
儳 bucketIndex=19
-----------
嚴 bucketIndex=20
-----------
儷 bucketIndex=21
-----------
儻 bucketIndex=22
-----------
囌 bucketIndex=23
-----------
囑 bucketIndex=24
-----------
廳 bucketIndex=25
-----------
☺OverflowLabel☺ bucketIndex=26

So some characters end up in the wrong stroke count bucket.

The reason is this: for zh_Hant, ICU has the following index characters:

    ExemplarCharactersIndex { "[一 丁 丈 不 且 丞 並 串 乘 乾 亂 亭 傀 僎 僵 儐 償 儳 儷 儻 叢 嚴 囌 囑 廳]" }

These come from

    http://unicode.org/cldr/trac/browser/trunk/common/main/zh_Hant.xml

    1052 <exemplarCharacters type="index">[一 丁 丈 不 且 丞 並 串 乘 乾 亂 亭 傀 僎 僵 儐 償 儳 儷 儻 叢 嚴 囌 囑 廳]</exemplarCharacters>

And the collation table from

    http://unicode.org/cldr/trac/browser/trunk/common/collation/zh.xml

(where ICU gets the collation data from) contains:

    <p>⠃</p><!-- INDEX 3 -->
    <pc>万丈三上下 ... </pc><!-- 3 -->

i.e.
万 is correctly in the list of characters with 3 strokes, *but* before the 3 stroke character 丈 which is used as the index character.

With the way the AlphabeticIndex class in ICU is implemented, characters go into the correct index buckets for stroke count sorting only if the index characters used are the very first characters with that stroke count in the collation table. 万 sorts before the index character 丈 and thus sorts wrongly into the previous bucket, i.e. the 丁 bucket.

There are more such problems in the above list of index characters:

     1 一 OK
     2 丁 OK
     3 丈 has 3 strokes but is not the first char with 3 strokes
     4 不 OK
     5 且 has 5 strokes but is not the first char with 5 strokes
     6 丞 has 6 strokes but is not the first char with 6 strokes
     7 並 <- has 8 strokes! Is the first char with 8 strokes!
     8 串 <- has 7 strokes! But is not the first char with 7 strokes.
     9 乘 <- has 10 strokes!
    10 乾 <- has 11 strokes!
    ... and more such problems ...

The stroke counts used in the above list to illustrate the problem are from kTotalStrokes in the Unihan database. They also agree with the stroke counts listed on www.zdic.net and with my personal "feeling" of the right stroke count (i.e. these are not characters where people might disagree about the correct stroke count depending on whether a more traditional way of writing is used or not).

So the list of index characters seems to have obvious errors. Especially obvious are 並 as the index character for 7 strokes (although it has 8 strokes) and 串 as the index character for 8 strokes (although it has 7 strokes).

One way to fix this is to change the list of index characters so that it contains the very first character of each stroke count. But there are two stroke count collation tables, a short and a long variant, and these may have different "first characters of a certain stroke count". For example, for 9 strokes the "short" table has:

    <collation type='stroke' alt='short'>
    ...
    <p>⠉</p><!-- INDEX 9 -->
    <pc>临举乗 ...
and the "long" table:

    <collation type='stroke'>
    ...
    <p>⠉</p><!-- INDEX 9 -->
    <pc>𠀵𠀶𠀸𠀺𠀻𪜄临

I.e. the long table has some rare characters above the BMP before 临. Therefore, 临 could serve as the 9 stroke index character if the short table is used, but not if the long table is used.

So one could fix the list of index characters either for the short table or for the long table, but not for both. And, by the way, I guess neither 临 nor U+20035 (first in the long table) is a "nice" index character to show to the user in a UI.

For the Nokia N9, I used the following list of index characters for stroke count sorting:

    https://meego.gitorious.org/meegotouch/libmlocale/blobs/master/src/icu-extradata/data/zh_Hant.txt

This list was tweaked to work correctly together with the long stroke count collation table. The table might of course have changed in the meantime, but at that time the list did fit the "long" stroke count collation table from

    http://unicode.org/cldr/trac/browser/trunk/common/collation/zh.xml

Probably in a UI one would show the stroke count numbers instead of weird, unusual index characters. If the sorting into the buckets works correctly, one can calculate the numbers from the bucket indices and show these. But first of all, the sorting into the buckets has to work correctly.

What is "the right way™" to fix the bucket sorting for stroke count? Is there any better way than tweaking the list of index characters to fit the collation table used (and then translating the index characters found to numbers for display in the UI)?

-- 
Mike FABIAN <mik...@gm...>
Lack of sleep is the enemy of good work.
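
To illustrate the bucketing behaviour described in this message, here is a minimal Python sketch. It does not call ICU. It only assumes that an item lands in the bucket of the last index character that sorts at or before it, which is the behaviour observed above. The names STROKE_RUNS, assign_buckets and first_char_labels are hypothetical, and only the 3 stroke run 万丈三上下 is taken from the zh.xml excerpt; the 1 and 2 stroke runs are assumed for the sake of the example.

```python
from bisect import bisect_right

# Toy stand-in for the stroke collation order (NOT real CLDR data).
# Only the 3 stroke run "万丈三上下" comes from the zh.xml excerpt quoted
# in this message; the 1 and 2 stroke runs are illustrative assumptions.
STROKE_RUNS = {
    1: "一乙",
    2: "丁二卜",
    3: "万丈三上下",
}

# Flatten the runs into one collation order and give each character a rank.
ORDER = "".join(STROKE_RUNS[n] for n in sorted(STROKE_RUNS))
RANK = {ch: i for i, ch in enumerate(ORDER)}


def assign_buckets(labels, items):
    """Put each item into the bucket of the last label sorting <= the item.

    `labels` must be given in collation order. Items sorting before the
    first label are returned separately as underflow.
    """
    label_ranks = [RANK[label] for label in labels]
    buckets = {label: [] for label in labels}
    underflow = []
    for item in sorted(items, key=RANK.__getitem__):
        pos = bisect_right(label_ranks, RANK[item]) - 1
        if pos < 0:
            underflow.append(item)
        else:
            buckets[labels[pos]].append(item)
    return buckets, underflow


def first_char_labels(stroke_runs):
    """Pick the first character of each stroke count run as its label."""
    return [stroke_runs[n][0] for n in sorted(stroke_runs)]


items = "一乙丁二卜万"

# Labels as in the current exemplar list: 丈 is not first among 3 strokes.
cldr_like, _ = assign_buckets(["一", "丁", "丈"], items)

# Labels derived from the collation order itself.
fixed_labels = first_char_labels(STROKE_RUNS)
fixed, _ = assign_buckets(fixed_labels, items)

print("exemplar-style labels:", cldr_like)
# With derived labels, bucket position maps directly to a stroke count,
# so a UI could show the number instead of the label character.
for strokes, label in zip(sorted(STROKE_RUNS), fixed_labels):
    print(strokes, "strokes:", fixed[label])
```

With the labels 一 丁 丈, the sketch puts 万 into the 丁 bucket, as in the listing above. With labels taken from the first character of each run, 万 gets its own 3 stroke bucket. The catch mentioned above remains: the derived labels depend on which collation table (short or long) is in use.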