From: Andy H. <an...@jt...> - 2003-03-26 21:34:31
The word break rules in ICU 2.x implement the Unicode default word boundaries as defined in TR 29. See
http://www.unicode.org/reports/tr29/tr29-4d1.html

There are a few issues, though:

1. The Unicode default rules from the TR aren't sufficient to do a good job with Asian languages. Without spaces to mark off words, some form of dictionary is really needed to do a good job, but this is beyond the level of support provided.

2. ICU's rules don't yet match the TR 29 rules 100%. ICU 2.x is closer than ICU 1.8, and 2.6 should get closer still.

3. TR 29 is also undergoing revision, in part to try to get the best results possible for Asian languages within the constraints of a simple (not dictionary-based) implementation.

The syntax for ICU's break rules, such as word.txt, is described near the bottom of
http://oss.software.ibm.com/icu/userguide/boundaryAnalysis.html

The seemingly built-in ranges like :Kana: or :Hira: are indeed built in. Within break rules, any character class is handled by UnicodeSet, which will recognize all Unicode properties and script names. See
http://oss.software.ibm.com/icu/userguide/properties.html
and
http://oss.software.ibm.com/icu/userguide/unicodeSet.html
for more information on what can be used as a character class. There's a great deal of flexibility in what can be specified.

-- Andy Heninger
hen...@us...

----- Original Message -----
From: "Stockett, Jeff" <sto...@qu...>
To: <icu...@os...>
Sent: Tuesday, March 25, 2003 1:44 PM
Subject: differences in Japanese word breaking between ICU1.8.1 and ICU2.2/2.4

> Understanding that CJK word breaking in ICU isn't ideal, it used to work
> after a fashion in ICU1.8.1. However, it looks like the newer
> versions of ICU word break Japanese text very differently from how
> version 1.8.1 did. Upon first inspection, it seems that the new
> versions don't allow kanji characters in words like 1.8.1 did. Has
> anyone else seen this problem - or know of a fix?
>
> In looking at the old 1.3.1 Java file versus the new 2.x style word.txt
> file, Kanji seems to be missing in the new one - which I think explains
> the differences we are seeing in break behavior. Is the syntax of the
> word.txt file documented anywhere? We'd experiment with adding Kanji
> back in, but don't fully understand where the seemingly built-in ranges
> like :Kana:, :Hira: and possibly :Kanji: come from.
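
To illustrate the script names discussed in the reply, here is a rough sketch that splits a string into runs of a single Unicode script. It is only a stand-in: it uses the Java standard library's Character.UnicodeScript rather than ICU's UnicodeSet (an assumption made so the example runs without ICU), and the class name ScriptClasses and the helper scriptRuns are hypothetical names chosen for this example.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

public class ScriptClasses {
    // Hypothetical helper: split text into maximal runs of one Unicode script,
    // returning each run as "SCRIPT:text".
    static List<String> scriptRuns(String text) {
        List<String> runs = new ArrayList<>();
        int start = 0;
        UnicodeScript current = null;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (current != null && script != current) {
                // Script changed: close the previous run.
                runs.add(current + ":" + text.substring(start, i));
                start = i;
            }
            current = script;
            i += Character.charCount(cp);
        }
        if (current != null) {
            runs.add(current + ":" + text.substring(start));
        }
        return runs;
    }

    public static void main(String[] args) {
        // Three kanji, one hiragana, four katakana, written as escapes
        // so the source file does not depend on its encoding.
        String sample = "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8";
        for (String run : scriptRuns(sample)) {
            System.out.println(run);
        }
    }
}
```

The sample yields three runs, tagged HAN, HIRAGANA and KATAKANA. The point is that kanji belong to the Han script, not to Hiragana or Katakana, so a character class built only from the kana scripts will not cover them.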