[icu-support] Chinese stroke count sorting and AlphabeticIndex class


Hi,

While testing the AlphabeticIndex API

http://icu-project.org/apiref/icu4c/classAlphabeticIndex.html

with Chinese stroke count sorting, I ran into some problems.
When sorting the following characters

    1 a b c y z 一 乙 丁 二 卜 万 木 卡 白 石 不 井

in the zh_TW locale into index buckets, I get the following result
(I set the underflow and overflow labels to something funny just for
testing):

    ----------- ☺UnderflowLabel☺ bucketIndex=0

    1
    a
    b
    c
    y
    z
    ----------- 一 bucketIndex=1
    一
    乙
    ----------- 丁 bucketIndex=2
    丁
    二
    卜
    万  <------------------ problem here! 3 strokes but in the 2 stroke bucket!
    ----------- 丈 bucketIndex=3
    ----------- 不 bucketIndex=4
    不
    井
    木
    ----------- 且 bucketIndex=5
    卡
    白
    石
    ----------- 丞 bucketIndex=6
    ----------- 串 bucketIndex=7
    ----------- 並 bucketIndex=8
    ----------- 亭 bucketIndex=9
    ----------- 乘 bucketIndex=10
    ----------- 乾 bucketIndex=11
    ----------- 傀 bucketIndex=12
    ----------- 亂 bucketIndex=13
    ----------- 僎 bucketIndex=14
    ----------- 僵 bucketIndex=15
    ----------- 儐 bucketIndex=16
    ----------- 償 bucketIndex=17
    ----------- 叢 bucketIndex=18
    ----------- 儳 bucketIndex=19
    ----------- 嚴 bucketIndex=20
    ----------- 儷 bucketIndex=21
    ----------- 儻 bucketIndex=22
    ----------- 囌 bucketIndex=23
    ----------- 囑 bucketIndex=24
    ----------- 廳 bucketIndex=25
    ----------- ☺OverflowLabel☺ bucketIndex=26

So some characters end up in the wrong stroke count bucket.

The reason is this:

For zh_Hant, ICU has the following index characters:

ExemplarCharactersIndex { "[一 丁 丈 不 且 丞 並 串 乘 乾 亂 亭 傀 僎 僵 儐 償 儳 儷 儻 叢 嚴 囌 囑 廳]" }

which comes from

http://unicode.org/cldr/trac/browser/trunk/common/main/zh_Hant.xml

    <exemplarCharacters type="index">[一 丁 丈 不 且 丞 並 串 乘 乾 亂 亭 傀 僎 僵 儐 償 儳 儷 儻 叢 嚴 囌 囑 廳]</exemplarCharacters>

And the collation table from

http://unicode.org/cldr/trac/browser/trunk/common/collation/zh.xml

(where ICU gets the collation data from) contains:

               <p>﷐⠃</p><!-- INDEX 3 -->
               <pc>万丈三上下 ... </pc><!-- 3 -->

i.e. 万 is correctly in the list of characters with 3 strokes,
*but* before the 3 stroke character 丈 which is used as the
index character.

With the way the AlphabeticIndex class in ICU is implemented, characters
go into the correct index buckets for stroke count sorting only if each
index character is the very first character of its stroke count group in
the collation table.

万 sorts before the index character 丈 and thus sorts wrongly into the
previous bucket, i.e. the 丁 bucket.
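
As a rough illustration of the behaviour described above, here is a toy
model in Python. It is not ICU's actual implementation; it only assumes
that a character lands in the bucket of the last index character that
sorts at or before it. The 3-stroke order is the one quoted from zh.xml;
the 1- and 2-stroke orders are taken from the bucket listing above and
are an assumption about the real table. The names COLLATION_ORDER and
bucket_label are made up for this sketch.

```python
from bisect import bisect_right

# Toy excerpt of a stroke collation order (1, 2 and 3 strokes).
# Assumption: the real table is far longer; only this excerpt is modelled.
COLLATION_ORDER = "一乙" "丁二卜" "万丈三上下"
RANK = {ch: i for i, ch in enumerate(COLLATION_ORDER)}


def bucket_label(ch, index_chars):
    """Return the last index character that sorts at or before ch.

    index_chars must already be in collation order.
    Returns None when ch sorts before every index character (underflow).
    """
    ranks = [RANK[c] for c in index_chars]
    pos = bisect_right(ranks, RANK[ch])
    return index_chars[pos - 1] if pos else None


# With 丈 as the 3-stroke index character, 万 falls into the 丁 bucket.
print(bucket_label("万", ["一", "丁", "丈"]))  # 丁

# With the first 3-stroke character as index character, it does not.
print(bucket_label("万", ["一", "丁", "万"]))  # 万
```

In this model the result depends only on where the index character sits
within its stroke count group, which matches the observation above.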

There are more such problems in the above list of index characters:

    1 一 OK
    2 丁 OK
    3 丈 has 3 strokes but is not the first char with 3 strokes
    4 不 OK
    5 且 has 5 strokes but is not the first char with 5 strokes
    6 丞 has 6 strokes but is not the first char with 6 strokes
    7 並 <- has 8 strokes! Is the first char with 8 strokes!
    8 串 <- has 7 strokes! But is not the first char with 7 strokes.
    9 乘 <- has 10 strokes!
    10 乾 <- has 11 strokes!
    ... and more such problems ...

The stroke counts used in the above list to illustrate the problem are
the kTotalStrokes values from the Unihan database. They also agree with
the stroke counts listed on www.zdic.net and with my personal "feeling"
of the right stroke count (i.e. these are not characters where people
might disagree about the correct stroke count depending on whether a more
traditional way of writing is used or not).

So the list of index characters seems to have obvious errors. The most
obvious ones are 並 as the index character for 7 strokes (although it
has 8 strokes) and 串 as the index character for 8 strokes (although it
has 7 strokes).

One way to fix this is to fix the list of index characters to contain
the characters which are the very first characters of a certain stroke
count.

But there are two stroke count collation tables, a short and a long
variant. And these may have different "first characters of a certain
stroke count". For example for 9 strokes:

        <collation type='stroke' alt='short'>
         ...
               <p>﷐⠉</p><!-- INDEX 9 -->
               <pc>临举乗 ...

and the "long" table:

        <collation type='stroke'>
        ...
               <p>﷐⠉</p><!-- INDEX 9 -->
               <pc>𠀵𠀶𠀸𠀺𠀻𪜄临

I.e. the long table has some rare characters above the BMP before 临.
Therefore, 临 could serve as the 9 stroke index character if the short
table is used but not if the long table is used.

So one could fix the list of index characters either for the short table
or for the long table but not for both.
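
To make the conflict concrete, a small Python sketch that picks the
first character of each stroke count group as the index character. The
dictionaries SHORT and LONG hold only the leading 9-stroke characters
quoted above, not the real tables, and first_index_chars is a made-up
helper name.

```python
# Leading characters of the 9-stroke group as quoted from zh.xml.
# Assumption: the real groups are much longer; only these excerpts are used.
SHORT = {9: "临举乗"}
LONG = {9: "𠀵𠀶𠀸𠀺𠀻𪜄临"}


def first_index_chars(groups):
    """Map each stroke count to the first character of its group."""
    return {strokes: chars[0] for strokes, chars in groups.items()}


short_index = first_index_chars(SHORT)
long_index = first_index_chars(LONG)

# The two tables disagree about the 9-stroke index character.
print(short_index[9] == long_index[9])  # False
```

A list derived this way has to be regenerated whenever the collation
table changes, and it is tied to exactly one of the two variants.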

And, by the way, I guess neither 临 nor U+20035 (first in the long
table) is a "nice" index character to show to the user in a UI.

For the Nokia N9, I used the following list of index characters for
stroke count sorting:

https://meego.gitorious.org/meegotouch/libmlocale/blobs/master/src/icu-extradata/data/zh_Hant.txt

which was tweaked to work correctly together with the long stroke count
collation table (which might of course have changed in the meantime, but
at least it matched the "long" stroke count collation table from
http://unicode.org/cldr/trac/browser/trunk/common/collation/zh.xml at
that time).

Probably in a UI one would show the stroke count numbers instead of
weird, unusual index characters.  If the sorting into the buckets works
correctly, one can calculate the numbers from the bucket indices and
show these. But first of all, the sorting into the buckets has to work
correctly.
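
A minimal sketch of that calculation in Python, under the assumptions
that bucket 0 is the underflow bucket, the last bucket is the overflow
bucket, and the buckets in between correspond to stroke counts 1, 2, 3,
... with none missing. The function name stroke_label and the default
label strings are made up for this example.

```python
def stroke_label(bucket_index, bucket_count, underflow="#", overflow="*"):
    """Return a display label for a bucket.

    Assumes bucket 0 is underflow, bucket (bucket_count - 1) is overflow,
    and bucket n in between holds the characters with n strokes.
    """
    if bucket_index <= 0:
        return underflow
    if bucket_index >= bucket_count - 1:
        return overflow
    return str(bucket_index)


# 27 buckets as in the listing above: underflow, 25 index buckets, overflow.
print([stroke_label(i, 27) for i in (0, 1, 3, 25, 26)])
# ['#', '1', '3', '25', '*']
```

If the index list skips a stroke count, this simple mapping breaks and an
explicit table from bucket index to stroke count would be needed instead.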

What is "the right way™" to fix the bucket sorting for stroke count? Is
there any better way than tweaking the list of index characters to match
the collation table used (and then translating the index characters
found into numbers to display in the UI)?

-- 
Mike FABIAN   <mik...@gm...>
Lack of sleep is the enemy of good work.
