Re: differences in Japanese word breaking between ICU1.8.1 and ICU2.2/2.4

The word break rules in ICU 2.x implement the Unicode default
word boundaries as defined in TR 29.  See
http://www.unicode.org/reports/tr29/tr29-4d1.html

There are a few issues, though:
1.  The Unicode default rules from the TR aren't sufficient to do
    a good job with Asian languages.  Without spaces to mark off
    words, some form of dictionary is really needed to find the
    word boundaries, but this is beyond the level of support provided.

2.  ICU's rules don't yet match the TR 29 rules 100%.  ICU 2.x is
    closer than ICU 1.8, and 2.6 should get closer still.

3.  TR 29 is also undergoing revision, in part to try to get
    the best results possible for Asian languages, within the
    constraints of a simple (not dictionary-based) implementation.
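
To compare versions, it helps to print the pieces a word break
iterator actually produces.  The sketch below is only an illustration:
it uses the JDK's own java.text.BreakIterator rather than ICU, and the
class name WordBreakSketch and the helper segments() are made up for
this example.  Where the boundaries fall in Japanese text depends on
the rules shipped with the library you run, so no output is shown.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakSketch {
    // Hypothetical helper: split text at every word boundary the iterator reports.
    static List<String> segments(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        int end = it.next();
        while (end != BreakIterator.DONE) {
            out.add(text.substring(start, end));
            start = end;
            end = it.next();
        }
        return out;
    }

    public static void main(String[] args) {
        // Kanji, then hiragana, then katakana, with no spaces between words.
        String japanese = "\u65E5\u672C\u8A9E\u306E\u30C6\u30AD\u30B9\u30C8";
        // Print one segment per line, bracketed so the boundaries are visible.
        for (String piece : segments(japanese, Locale.JAPANESE)) {
            System.out.println("[" + piece + "]");
        }
    }
}
```

Running the same text through each version you care about and
comparing the printed pieces shows exactly where the rules differ.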

The syntax for ICU's break rules, such as word.txt, is described
at http://oss.software.ibm.com/icu/userguide/boundaryAnalysis.html
near the bottom of the document.

The seemingly built-in ranges like :Kana: or :Hira: really are built in.
Within break rules, any character class is handled by UnicodeSet,
which will recognize all Unicode properties and script names.  See
http://oss.software.ibm.com/icu/userguide/properties.html and
http://oss.software.ibm.com/icu/userguide/unicodeSet.html
for more information on what can be used as a character class.
There's an amazing amount of flexibility in what can be specified.
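
As a rough illustration of selecting characters by Unicode script name,
here is a sketch using the JDK's java.util.regex and
Character.UnicodeScript, not ICU's UnicodeSet; the class and field
names are made up for the example.  Note that Unicode has no "Kanji"
script: kanji characters belong to the Han script.

```java
import java.util.regex.Pattern;

class ScriptClassSketch {
    // Hypothetical character classes, selected by Unicode script name.
    static final Pattern HAN = Pattern.compile("\\p{IsHan}+");
    static final Pattern KANA =
        Pattern.compile("[\\p{IsHiragana}\\p{IsKatakana}]+");

    // Script of the first code point in s.
    static Character.UnicodeScript scriptOf(String s) {
        return Character.UnicodeScript.of(s.codePointAt(0));
    }

    public static void main(String[] args) {
        String hiragana = "\u3042";      // HIRAGANA LETTER A
        String katakana = "\u30AB";      // KATAKANA LETTER KA
        String kanji = "\u6F22\u5B57";   // the word "kanji" written in kanji

        System.out.println(scriptOf(hiragana));             // HIRAGANA
        System.out.println(scriptOf(katakana));             // KATAKANA
        System.out.println(scriptOf(kanji));                // HAN
        System.out.println(HAN.matcher(kanji).matches());   // true
        System.out.println(KANA.matcher(kanji).matches());  // false
    }
}
```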

   -- Andy Heninger
      hen...@us...

----- Original Message -----
From: "Stockett, Jeff" <sto...@qu...>
To: <icu...@os...>
Sent: Tuesday, March 25, 2003 1:44 PM
Subject: differences in Japanese word breaking between ICU1.8.1 and
ICU2.2/2.4

> Understanding that CJK word breaking in ICU isn't ideal, it used to work
> after a fashion in ICU1.8.1.  However, it looks like the newer
> versions of ICU word break Japanese text very differently from how
> version 1.8.1 did.  Upon first inspection, it seems that the new
> versions don't allow kanji characters in words like 1.8.1 did.  Has
> anyone else seen this problem - or know of a fix?
>
> In looking at the old 1.3.1 Java file versus the new 2.x style word.txt
> file, Kanji seems to be missing in the new one - which I think explains
> the differences we are seeing in break behavior.  Is the syntax of the
> word.txt file documented anywhere?  We'd experiment with adding Kanji
> back in, but don't fully understand where the seemingly built-in ranges
> like :Kana:, :Hira: and possibly :Kanji: come from.
>
>
