From: Markus S. <mar...@jt...> - 2001-10-10 21:01:45
Another interesting exchange, for people using Thai word break iteration:
(Editing out a couple of confidential or irrelevant pieces.)

markus

----------------------------------

Ralf Hauser@IBMDE
10/10/2001 07:50 AM

To: Eric Mader/Cupertino/IBM@IBMUS
cc: Andy Heninger/Cupertino/IBM@IBMUS, Helena S Chapman/Cupertino/IBM@IBMUS, Markus Scherer/Cupertino/IBM@IBMUS, Monika Matschke/Germany/IBM, Dieter Gruner/Germany/IBM
From: Ralf Hauser/Germany/IBM@IBMDE
Subject: Re: Problem with ICU BreakIterator

Eric, Andy,

thanks for your support!

I forgot to mention one important thing in the first place: the reason why we use the ICU BreakIterator at all is to be able to support the Thai language. [We have] a (crude) tokenizer which works by recognizing white space characters and taking them as word boundaries. Unfortunately, there is no possibility for us to bypass this tokenizer (please, do not ask why...), so we have to feed the tokenizer with texts interspersed with space characters.

In order to be able to process Thai text, the idea was to tokenize the Thai text using the ([dictionary] driven) ICU break iterator and insert a space whenever a Thai token was identified. These spaces would then later be recognized by the tokenizer and handled properly. This is why I had developed the code that I sent to you with all that space insertion stuff.

I have changed the logic of the code according to your suggestion and it works. (I checked with one of our colleagues here, who is a native Thai speaker.) Great!

[...]

thanks a lot!

Mit freundlichen Gruessen/Kind regards
Ralf Hauser
-------
IBM Content Management
e-mail: rh...@de...
Message of today: "There's no time to stop for gas, we're already late"


Eric Mader@IBMUS
09.10.2001 23:45

To: Ralf Hauser/Germany/IBM@IBMDE
cc: Helena S Chapman/Cupertino/IBM@IBMUS, Markus Scherer/Cupertino/IBM@IBMUS, Andy Heninger/Cupertino/IBM@IBMUS
From: Eric Mader/Cupertino/IBM@IBMUS
Subject: Re: Problem with ICU BreakIterator

Hi Ralf,

To amplify what Andy said: the problem with your code is that at the bottom of the loop you say:

    uOffset1 = pclBreakIterator->next();

This means that you're skipping every other word in the input. What you should say is:

    uOffset1 = uOffset2;

This will make the next word start where the previous word ended, so that you won't skip any words. (Remember that the break locations are actually between characters.)

Hope this helps,

Eric Mader


Andy Heninger
10/09/2001 09:16 AM

To: Ralf Hauser/Germany/IBM@IBMDE
cc: Markus Scherer/Cupertino/IBM, Eric Mader/Cupertino/IBM@IBMUS, Helena S Chapman/Cupertino/IBM@IBMUS
From: Andy Heninger/Cupertino/IBM
Subject: Re: Problem with ICU BreakIterator

Hello Ralf,

>As you can see, a lot of valid Thai characters are skipped

The problem is that in Thai there are no spaces between the words, so the assumption that every other item identified by the break iterator will be inter-word spaces is incorrect. Because there are no spaces or other separating characters between the words, the Thai break iterator uses a dictionary to locate the boundaries.

Even for English (and other European languages), any punctuation will confuse the code as shown, because the word break iterator will identify two (or more) break positions between each word: one for each punctuation character, and one for the grouped spaces.

There's a paper with examples of word breaking at
http://www.ibm.com/developerworks/unicode/library/boundaries/boundaries.html
It talks more about internals than is necessary for just using the break iterators, but the examples are good.

I hope this helps.

Best regards,

-- Andy Heninger
hen...@us...

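The difference between the two loop endings that Eric describes can be reproduced without ICU. The following is a minimal sketch, not the code from this thread: `FakeBreakIterator` is a hypothetical stand-in that only replays a fixed list of boundary offsets through `first()` and `next()`, `DONE` is a local constant rather than `BreakIterator::DONE`, and `collectSegments` is an illustrative helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Local end marker for this sketch (assumption: stands in for BreakIterator::DONE).
const int DONE = -1;

// Hypothetical stand-in for a break iterator: replays a fixed boundary list.
// It is NOT the ICU class; only first() and next() are mimicked.
class FakeBreakIterator {
public:
    explicit FakeBreakIterator(std::vector<int> boundaries)
        : boundaries_(std::move(boundaries)), pos_(0) {}

    int first() {
        pos_ = 0;
        return boundaries_.empty() ? DONE : boundaries_[0];
    }

    int next() {
        if (pos_ + 1 >= boundaries_.size()) return DONE;
        return boundaries_[++pos_];
    }

private:
    std::vector<int> boundaries_;
    std::size_t pos_;
};

// Collects the segments [start, end) that the loop visits.
// advanceWithNext == true  : second next() at the bottom of the loop (reported bug).
// advanceWithNext == false : start = end, as suggested in the reply above.
std::vector<std::string> collectSegments(const std::string& text,
                                         std::vector<int> boundaries,
                                         bool advanceWithNext) {
    FakeBreakIterator it(std::move(boundaries));
    std::vector<std::string> segments;
    int start = it.first();
    while (start != DONE) {
        int end = it.next();
        if (end == DONE) break;  // no further boundary, nothing left to copy
        assert(end > start);
        segments.push_back(text.substr(static_cast<std::size_t>(start),
                                       static_cast<std::size_t>(end - start)));
        start = advanceWithNext ? it.next() : end;
    }
    return segments;
}
```

With the made-up text "abcdef" and assumed boundaries 0, 2, 4, 6 (text without separators, as in Thai), the first variant visits only "ab" and "ef", while the second visits "ab", "cd" and "ef". Note that with `start = end` the loop has to stop when `next()` reports the end of the text; an assertion that the end offset is never the end marker would fire on the last iteration.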
Ralf Hauser@IBMDE
2001-10-08 08:48

To: Markus Scherer/Cupertino/IBM@IBMUS
cc: Monika Matschke/Germany/IBM, Thomas Hampp/Germany/IBM, Dieter Gruner/Germany/IBM
From: Ralf Hauser/Germany/IBM@IBMDE
Subject: Problem with ICU BreakIterator

Markus,

I have a problem with the ICU break iterator (v 1.8) in the context of the Thai language. I wonder whether you or any of your team could help us with that.

I have the following function:

size_t itlThaiProcessing(UChar * pw16Target, size_t uTargetLen,
                         const UChar * cpw16Source, size_t uSourceLen)
{
   UErrorCode enRc = U_ZERO_ERROR;
   Locale clLocale("th", "TH");
   BreakIterator * pclBreakIterator;
   pclBreakIterator = BreakIterator::createWordInstance(clLocale, enRc);
   UnicodeString clString;
   UTextOffset uOffset1;
   size_t uCharsProcessed = 0;

   /* assign the string to read only memory */
   clString.setTo(cpw16Source, uSourceLen);
   pclBreakIterator->setText(clString);

   /* process string */
   uOffset1 = pclBreakIterator->first();
   while(uOffset1 != BreakIterator::DONE) {
      UTextOffset uOffset2;
      size_t uLength;

      /* get the end of this token */
      DEBUG << "Boundary at position1: " << uOffset1 << endl;
      uOffset2 = pclBreakIterator->next();
      cerr << "Boundary at position2: " << uOffset2 << endl;
      assert(uOffset2 != BreakIterator::DONE);
      assert(uOffset2 > uOffset1);
      uLength = (size_t) (uOffset2 - uOffset1);
      DEBUG << "length: " << uLength << endl;

      /* if there is not enough space in the buffer for this token */
      if((uCharsProcessed + uLength) > uTargetLen) {
         break; /* we must break! */
      }
      memcpy(pw16Target, cpw16Source + uOffset1, uLength * sizeof(UChar));
      pw16Target += uLength;
      uCharsProcessed += uLength;

      /* if there is not enough space in the buffer for the token separator */
      if((uCharsProcessed + 1) > uTargetLen) {
         break; /* we must break! */
      }
      *pw16Target++ = ITL_UCHAR_SPACE;
      ++uCharsProcessed;

      /* process next token */
      uOffset1 = pclBreakIterator->next();
   }
   DEBUG << "uCharsProcessed: " << uCharsProcessed << endl;
   return(uCharsProcessed);
}

This code iterates through the words of an input buffer and copies each word followed by a SPACE U+0020 to a target buffer, thus removing additional white space characters between words.

This function applied to several input buffers (in UTF-16) yields the following output:

Input 1 "this is a test"

[itl_main2.cpp:121] ICU RC: 0
***** Thai buffer ************************************************************
00000000: 74 00 68 00 69 00 73 00 | 20 00 69 00 73 00 20 00 | t.h.i.s. .i.s. .
00000010: 61 00 20 00 74 00 65 00 | 73 00 74 00 __ __ __ __ | a. .t.e.s.t.____
******************************************************************************
[itl_ta_thai.cpp:78] Target length: 2000
[itl_ta_thai.cpp:79] Source length: 14
[itl_ta_thai.cpp:104] Boundary at position1: 0
[itl_ta_thai.cpp:106] Boundary at position2: 4
[itl_ta_thai.cpp:110] length: 4
[itl_ta_thai.cpp:104] Boundary at position1: 5
[itl_ta_thai.cpp:106] Boundary at position2: 7
[itl_ta_thai.cpp:110] length: 2
[itl_ta_thai.cpp:104] Boundary at position1: 8
[itl_ta_thai.cpp:106] Boundary at position2: 9
[itl_ta_thai.cpp:110] length: 1
[itl_ta_thai.cpp:104] Boundary at position1: 10
[itl_ta_thai.cpp:106] Boundary at position2: 14
[itl_ta_thai.cpp:110] length: 4
[itl_ta_thai.cpp:130] uCharsProcessed: 15
***** Final Thai buffer ******************************************************
00000000: 74 00 68 00 69 00 73 00 | 20 00 69 00 73 00 20 00 | t.h.i.s. .i.s. .
00000010: 61 00 20 00 74 00 65 00 | 73 00 74 00 20 00 __ __ | a. .t.e.s.t. .__
******************************************************************************

Input 2 "this    is        a test" NOTE THE SPACES

[itl_main2.cpp:121] ICU RC: 0
***** Thai buffer ************************************************************
00000000: 74 00 68 00 69 00 73 00 | 20 00 20 00 20 00 20 00 | t.h.i.s. . . . .
00000010: 69 00 73 00 20 00 20 00 | 20 00 20 00 20 00 20 00 | i.s. . . . . . .
00000020: 20 00 20 00 61 00 20 00 | 74 00 65 00 73 00 74 00 |  . .a. .t.e.s.t.
******************************************************************************
[itl_ta_thai.cpp:78] Target length: 2000
[itl_ta_thai.cpp:79] Source length: 24
[itl_ta_thai.cpp:104] Boundary at position1: 0
[itl_ta_thai.cpp:106] Boundary at position2: 4
[itl_ta_thai.cpp:110] length: 4
[itl_ta_thai.cpp:104] Boundary at position1: 8
[itl_ta_thai.cpp:106] Boundary at position2: 10
[itl_ta_thai.cpp:110] length: 2
[itl_ta_thai.cpp:104] Boundary at position1: 18
[itl_ta_thai.cpp:106] Boundary at position2: 19
[itl_ta_thai.cpp:110] length: 1
[itl_ta_thai.cpp:104] Boundary at position1: 20
[itl_ta_thai.cpp:106] Boundary at position2: 24
[itl_ta_thai.cpp:110] length: 4
[itl_ta_thai.cpp:130] uCharsProcessed: 15
***** Final Thai buffer ******************************************************
00000000: 74 00 68 00 69 00 73 00 | 20 00 69 00 73 00 20 00 | t.h.i.s. .i.s. .
00000010: 61 00 20 00 74 00 65 00 | 73 00 74 00 20 00 __ __ | a. .t.e.s.t. .__
******************************************************************************

The first and the second example yield what I expected: the break iterator identifies the words correctly and allows me to skip any unwanted white space between them.

However, if applied to an input buffer of the Thai language, I do get unexpected results:

Input 3 - A Thai sentence

***** Thai buffer ************************************************************
00000000: 08 0e 32 0e 01 0e 21 0e | 15 0e 34 0e 04 0e 13 0e | ..2...!...4.....
00000010: 30 0e 23 0e 31 0e 10 0e | 21 0e 19 0e 15 0e 23 0e | 0.#.1...!.....#.
00000020: 35 0e 40 0e 21 0e 37 0e | 48 0e 2d 0e 27 0e 31 0e | 5.@.!.7.H.-.'.1.
00000030: 19 0e 17 0e 35 0e 48 0e | 31 00 37 00 1e 0e 24 0e | ....5.H.1.7...$.
00000040: 29 0e 20 0e 32 0e 04 0e | 21 0e 32 00 35 00 33 00 | ). .2...!.2.5.3.
00000050: 30 00 17 0e 35 0e 48 0e | 43 0e 2b 0e 49 0e 1a 0e | 0...5.H.C.+.I...
00000060: 23 0e 34 0e 29 0e 31 0e | 17 0e 40 0e 14 0e 34 0e | #.4.).1...@...4.
00000070: 19 0e 2d 0e 32 0e 01 0e | 32 0e 28 0e 44 0e 17 0e | ..-.2...2.(.D...
00000080: 22 0e 23 0e 27 0e 21 0e | 01 0e 34 0e 08 0e 01 0e | ".#.'.!...4.....
00000090: 32 0e 23 0e 40 0e 1b 0e | 47 0e 19 0e 2d 0e 31 0e | 2.#.@...G...-.1.
******************************************************************************
[itl_ta_thai.cpp:78] Target length: 2000
[itl_ta_thai.cpp:79] Source length: 80
[itl_ta_thai.cpp:104] Boundary at position1: 0
[itl_ta_thai.cpp:106] Boundary at position2: 3
[itl_ta_thai.cpp:110] length: 3
[itl_ta_thai.cpp:104] Boundary at position1: 6
[itl_ta_thai.cpp:106] Boundary at position2: 9
[itl_ta_thai.cpp:110] length: 3
[itl_ta_thai.cpp:104] Boundary at position1: 17
[itl_ta_thai.cpp:106] Boundary at position2: 22
[itl_ta_thai.cpp:110] length: 5
[itl_ta_thai.cpp:104] Boundary at position1: 25
[itl_ta_thai.cpp:106] Boundary at position2: 28
[itl_ta_thai.cpp:110] length: 3
[itl_ta_thai.cpp:104] Boundary at position1: 30
[itl_ta_thai.cpp:106] Boundary at position2: 37
[itl_ta_thai.cpp:110] length: 7
[itl_ta_thai.cpp:104] Boundary at position1: 41
[itl_ta_thai.cpp:106] Boundary at position2: 44
[itl_ta_thai.cpp:110] length: 3
[itl_ta_thai.cpp:104] Boundary at position1: 47
[itl_ta_thai.cpp:106] Boundary at position2: 53
[itl_ta_thai.cpp:110] length: 6
[itl_ta_thai.cpp:104] Boundary at position1: 57
[itl_ta_thai.cpp:106] Boundary at position2: 62
[itl_ta_thai.cpp:110] length: 5
[itl_ta_thai.cpp:104] Boundary at position1: 65
[itl_ta_thai.cpp:106] Boundary at position2: 68
[itl_ta_thai.cpp:110] length: 3
[itl_ta_thai.cpp:104] Boundary at position1: 74
[itl_ta_thai.cpp:106] Boundary at position2: 80
[itl_ta_thai.cpp:110] length: 6
[itl_ta_thai.cpp:130] uCharsProcessed: 54
***** Final Thai buffer ******************************************************
00000000: 08 0e 32 0e 01 0e 20 00 | 04 0e 13 0e 30 0e 20 00 | ..2... .....0. .
00000010: 40 0e 21 0e 37 0e 48 0e | 2d 0e 20 00 17 0e 35 0e | @.!.7.H.-. ...5.
00000020: 48 0e 20 00 1e 0e 24 0e | 29 0e 20 0e 32 0e 04 0e | H. ...$.). .2...
00000030: 21 0e 20 00 17 0e 35 0e | 48 0e 20 00 1a 0e 23 0e | !. ...5.H. ...#.
00000040: 34 0e 29 0e 31 0e 17 0e | 20 00 2d 0e 32 0e 01 0e | 4.).1... .-.2...
00000050: 32 0e 28 0e 20 00 23 0e | 27 0e 21 0e 20 00 40 0e | 2.(. .#.'.!. .@.
00000060: 1b 0e 47 0e 19 0e 2d 0e | 31 0e 20 00 __ __ __ __ | ..G...-.1. .____
******************************************************************************

As you can see, a lot of valid Thai characters are skipped (even the embedded Arabic numerals are not identified correctly).

Are we using the break iterator in an invalid way? What is wrong with it?

I have provided the Thai fragment here; it is encoded in UTF-16.

Any help is greatly appreciated!

Mit freundlichen Gruessen/Kind regards
Ralf Hauser
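
The boundary pairs logged for Input 3 are enough to work out how much of the Thai text was dropped. As a rough cross-check, here is a small sketch that lists the gaps between consecutive logged pairs. The helper names `skippedRanges`, `totalLength` and `input3Copied` are hypothetical, and the pairs are transcribed from the log above; no ICU calls are involved.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Half-open range [first, second) of UTF-16 code unit offsets.
using Range = std::pair<int, int>;

// Given the (position1, position2) pairs that were actually copied, in
// ascending order, and the source length, return the ranges never copied.
std::vector<Range> skippedRanges(const std::vector<Range>& copied, int sourceLen) {
    std::vector<Range> skipped;
    int cursor = 0;  // end of the last copied range seen so far
    for (const Range& r : copied) {
        assert(r.first >= cursor && r.second > r.first);
        if (r.first > cursor) skipped.push_back({cursor, r.first});
        cursor = r.second;
    }
    if (cursor < sourceLen) skipped.push_back({cursor, sourceLen});
    return skipped;
}

// Sum of the lengths of a list of ranges.
int totalLength(const std::vector<Range>& ranges) {
    int total = 0;
    for (const Range& r : ranges) total += r.second - r.first;
    return total;
}

// The ten pairs logged for Input 3 (source length 80).
std::vector<Range> input3Copied() {
    return {{0, 3},   {6, 9},   {17, 22}, {25, 28}, {30, 37},
            {41, 44}, {47, 53}, {57, 62}, {65, 68}, {74, 80}};
}
```

For the logged pairs this gives 44 copied code units, which together with the 10 inserted spaces matches the reported `uCharsProcessed: 54`, and 36 code units in nine skipped ranges, for example [3, 6) and [9, 17). Each gap lies between two consecutive boundaries, which is consistent with every second token being dropped.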