Re: Problem with ICU BreakIterator

Another interesting exchange, for people using Thai word break iteration:
(Editing out a couple of confidential or irrelevant pieces.)
markus
----------------------------------
	Ralf Hauser@IBMDE
	10/10/2001 07:50 AM

		 To: Eric Mader/Cupertino/IBM@IBMUS
		 cc: Andy Heninger/Cupertino/IBM@IBMUS, Helena S Chapman/Cupertino/IBM@IBMUS, Markus Scherer/Cupertino/IBM@IBMUS, Monika Matschke/Germany/IBM, Dieter Gruner/Germany/IBM
		 From: Ralf Hauser/Germany/IBM@IBMDE
		 Subject: Re: Problem with ICU BreakIterator

Eric, 
Andy,

thanks for your support!
I forgot to mention one important thing in the first place: the reason why we use the ICU BreakIterator at all is to be able to support the Thai language. [We have] a (crude) tokenizer which works by recognizing white space characters and taking them as word boundaries. 

Unfortunately, we cannot bypass this tokenizer (please, do not ask why...), so we have to feed it texts interspersed with space characters. In order to process Thai text, the idea was to tokenize the Thai text using the ([dictionary] driven) ICU break iterator and insert a space whenever a Thai token was identified. These spaces would then be recognized by the tokenizer and handled properly.

This is why I had developed the code that I sent to you with all that space insertion stuff.

I have changed the logic of the code according to your suggestion and it works. (I checked with one of our colleagues here, who is a native Thai speaker.) Great!

[...]

thanks a lot!

Mit freundlichen Gruessen/Kind regards

          Ralf Hauser

-------
IBM Content Management
e-mail: rh...@de...
Message of today: "There's no time to stop for gas, we're already late"

	Eric Mader@IBMUS
	09.10.2001 23:45

		 To: Ralf Hauser/Germany/IBM@IBMDE
		 cc: Helena S Chapman/Cupertino/IBM@IBMUS, Markus Scherer/Cupertino/IBM@IBMUS, Andy Heninger/Cupertino/IBM@IBMUS
		 From: Eric Mader/Cupertino/IBM@IBMUS
		 Subject: Re: Problem with ICU BreakIterator

Hi Ralf,

To amplify what Andy said: the problem with your code is that at the bottom of the loop you say:

    uOffset1 = pclBreakIterator->next();

This means that you're skipping every other word in the input. What you should say is:

    uOffset1 = uOffset2;

This will make the next word start where the previous word ended, so that you won't skip any words. (Remember that the break locations are actually between characters.)
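
The effect of that change can be reproduced without ICU. The following is a minimal sketch, not the code from this thread: MockBreakIterator, thaiLogBoundaries, coveredSkipping and coveredContiguous are hypothetical names, and the mock only replays a fixed list of boundary offsets the way first() and next() would return them. It compares a loop that calls next() twice per pass with one in which each segment starts where the previous one ended. The contiguous loop also tests the end offset for DONE; with only the one-line change, the original loop would presumably call next() once more after the last boundary and trip its assert.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for an ICU BreakIterator: it replays a fixed list
// of boundary offsets and returns DONE once the list is exhausted.
class MockBreakIterator {
public:
    enum { DONE = -1 };

    explicit MockBreakIterator(std::vector<int32_t> boundaries)
        : boundaries_(std::move(boundaries)), pos_(0) {}

    int32_t first() {
        pos_ = 0;
        if (boundaries_.empty()) return DONE;
        return boundaries_[0];
    }

    int32_t next() {
        if (pos_ + 1 >= boundaries_.size()) return DONE;
        return boundaries_[++pos_];
    }

private:
    std::vector<int32_t> boundaries_;
    std::size_t pos_;
};

// Boundary offsets read off the "position1"/"position2" lines of the Thai
// debug log in the original question (every next() call was logged there).
std::vector<int32_t> thaiLogBoundaries() {
    return {0, 3, 6, 9, 17, 22, 25, 28, 30, 37,
            41, 44, 47, 53, 57, 62, 65, 68, 74, 80};
}

// Loop shape from the original question: the second next() at the bottom
// discards a boundary, so only every other segment is counted.
std::size_t coveredSkipping(MockBreakIterator it) {
    std::size_t covered = 0;
    int32_t start = it.first();
    while (start != MockBreakIterator::DONE) {
        int32_t end = it.next();
        if (end == MockBreakIterator::DONE) break;
        assert(end > start);
        covered += static_cast<std::size_t>(end - start);
        start = it.next();  // skips the segment that begins at `end`
    }
    return covered;
}

// Corrected loop shape: each segment starts where the previous one ended.
std::size_t coveredContiguous(MockBreakIterator it) {
    std::size_t covered = 0;
    int32_t start = it.first();
    if (start == MockBreakIterator::DONE) return 0;
    for (int32_t end = it.next(); end != MockBreakIterator::DONE;
         start = end, end = it.next()) {
        assert(end > start);
        covered += static_cast<std::size_t>(end - start);
    }
    return covered;
}
```

With the offsets from the Thai log, coveredSkipping accounts for 44 of the 80 code units, which together with the 10 inserted spaces matches the logged uCharsProcessed of 54, while coveredContiguous accounts for all 80.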

Hope this helps,
Eric Mader

	Andy Heninger
	10/09/2001 09:16 AM

		 To: Ralf Hauser/Germany/IBM@IBMDE
		 cc: Markus Scherer/Cupertino/IBM, Eric Mader/Cupertino/IBM@IBMUS, Helena S Chapman/Cupertino/IBM@IBMUS
		 From: Andy Heninger/Cupertino/IBM
		 Subject: Re: Problem with ICU BreakIterator

Hello Ralf,

>As you can see, a lot of valid Thai characters are skipped

The problem is that in Thai there are no spaces between the words, so the assumption that every other item identified by the break iterator will be inter-word spaces is incorrect.  Because there are no spaces or other separating characters between the words, the Thai break iterator uses a dictionary to locate the boundaries.

Even for English (and other European languages), any punctuation will confuse the code as shown, because the word break iterator will identify two (or more) break positions between each word, one for each punctuation character, and one for the grouped spaces.
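
One way to avoid relying on that alternation is to look at each segment and drop only those that consist of spaces. The following is a minimal sketch under stated assumptions: isSpaceRun and separateTokens are hypothetical helpers, the boundary list is passed in by the caller rather than obtained from ICU, and only U+0020 is treated as white space to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: true if every code unit in [start, end) is U+0020.
// Real code would need to recognise other white space characters as well.
bool isSpaceRun(const std::u16string& text, std::size_t start, std::size_t end) {
    for (std::size_t i = start; i < end; ++i) {
        if (text[i] != u' ') return false;
    }
    return true;
}

// Walks consecutive boundary pairs and copies every segment that is not a
// run of spaces, each followed by a single U+0020. No segment is skipped
// on the assumption that it "must" be white space.
std::u16string separateTokens(const std::u16string& text,
                              const std::vector<int32_t>& boundaries) {
    std::u16string out;
    for (std::size_t i = 1; i < boundaries.size(); ++i) {
        const std::size_t start = static_cast<std::size_t>(boundaries[i - 1]);
        const std::size_t end = static_cast<std::size_t>(boundaries[i]);
        assert(start < end && end <= text.size());
        if (isSpaceRun(text, start, end)) continue;  // drop inter-word spaces
        out.append(text, start, end - start);        // copy word or punctuation
        out.push_back(u' ');                         // one separator per token
    }
    return out;
}
```

For the text "this,  is" with the hand-written boundaries 0, 4, 5, 7, 9 (one break after the comma and one after the grouped spaces, as described above), separateTokens returns "this , is ". Later ICU releases also offer getRuleStatus() for classifying the segment that precedes a boundary; whether it is available depends on the ICU version in use.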

There's a paper with examples of word breaking at
http://www.ibm.com/developerworks/unicode/library/boundaries/boundaries.html
It talks more about internals than is necessary for just using the break iterators, but the examples are good.

I hope this helps.

   Best regards,

  -- Andy Heninger
      hen...@us...

	Ralf Hauser@IBMDE
	2001-10-08 08:48

		 To: Markus Scherer/Cupertino/IBM@IBMUS
		 cc: Monika Matschke/Germany/IBM, Thomas Hampp/Germany/IBM, Dieter Gruner/Germany/IBM
		 From: Ralf Hauser/Germany/IBM@IBMDE
		 Subject: Problem with ICU BreakIterator

Markus,

I have a problem with the ICU break iterator (v 1.8) in the context of the Thai language.
I wonder whether you or any of your team could help us with that.

I have the following function:

size_t itlThaiProcessing(UChar * pw16Target,
                         size_t uTargetLen,
                         const UChar * cpw16Source,
                         size_t uSourceLen)
{
   UErrorCode              enRc = U_ZERO_ERROR;
   Locale                  clLocale("th", "TH");
   BreakIterator *         pclBreakIterator;

   pclBreakIterator = BreakIterator::createWordInstance(clLocale, enRc);

   UnicodeString           clString;
   UTextOffset             uOffset1;
   size_t                  uCharsProcessed = 0;

   /* assign the string to read only memory */
   clString.setTo(cpw16Source, uSourceLen);
   pclBreakIterator->setText(clString);

   /* process string */
   uOffset1 = pclBreakIterator->first();
   while(uOffset1 != BreakIterator::DONE)
   {
      UTextOffset          uOffset2;
      size_t               uLength;

      /* get the end of this token */
      DEBUG << "Boundary at position1: " << uOffset1 << endl;
      uOffset2 = pclBreakIterator->next();
      cerr << "Boundary at position2: " << uOffset2 << endl;
      assert(uOffset2 != BreakIterator::DONE);
      assert(uOffset2 > uOffset1);
      uLength = (size_t) (uOffset2 - uOffset1);
      DEBUG << "length: " << uLength << endl;

      /* if there is not enough space in the buffer for this token */
      if((uCharsProcessed + uLength) > uTargetLen)
      {
         break;                                       /* we must break! */
      }
      memcpy(pw16Target, cpw16Source + uOffset1, uLength * sizeof(UChar));
      pw16Target += uLength;
      uCharsProcessed += uLength;
      /* if there is not enough space in the buffer for the token separator */
      if((uCharsProcessed + 1) > uTargetLen)
      {
         break;                                       /* we must break! */
      }
      *pw16Target++ = ITL_UCHAR_SPACE;
      ++uCharsProcessed;
      /* process next token */
      uOffset1 = pclBreakIterator->next();
   }
   DEBUG << "uCharsProcessed: " << uCharsProcessed << endl;
   return(uCharsProcessed);
}

This code iterates through the words of an input buffer and copies each word followed by a SPACE U+0020 to a target buffer, thus removing additional white space characters between words.

This function applied to several input buffers (in UTF-16) yields the following output:

Input 1 "this is a test"

[itl_main2.cpp:121] ICU RC: 0
***** Thai buffer ************************************************************
00000000: 74 00 68 00 69 00 73 00 | 20 00 69 00 73 00 20 00 | t.h.i.s. .i.s. .
00000010: 61 00 20 00 74 00 65 00 | 73 00 74 00 __ __ __ __ | a. .t.e.s.t.____
******************************************************************************
[itl_ta_thai.cpp:78] Target length: 2000
[itl_ta_thai.cpp:79] Source length: 14
[itl_ta_thai.cpp:104] Boundary at position1: 0
[itl_ta_thai.cpp:106] Boundary at position2: 4
[itl_ta_thai.cpp:110] length: 4
[itl_ta_thai.cpp:104] Boundary at position1: 5
[itl_ta_thai.cpp:106] Boundary at position2: 7
[itl_ta_thai.cpp:110] length: 2
[itl_ta_thai.cpp:104] Boundary at position1: 8
[itl_ta_thai.cpp:106] Boundary at position2: 9
[itl_ta_thai.cpp:110] length: 1
[itl_ta_thai.cpp:104] Boundary at position1: 10
[itl_ta_thai.cpp:106] Boundary at position2: 14
[itl_ta_thai.cpp:110] length: 4
[itl_ta_thai.cpp:130] uCharsProcessed: 15
***** Final Thai buffer ******************************************************
00000000: 74 00 68 00 69 00 73 00 | 20 00 69 00 73 00 20 00 | t.h.i.s. .i.s. .
00000010: 61 00 20 00 74 00 65 00 | 73 00 74 00 20 00 __ __ | a. .t.e.s.t. .__
******************************************************************************

Input 2 "this    is        a test"
           NOTE THE SPACES
[itl_main2.cpp:121] ICU RC: 0
***** Thai buffer ************************************************************
00000000: 74 00 68 00 69 00 73 00 | 20 00 20 00 20 00 20 00 | t.h.i.s. . . . .
00000010: 69 00 73 00 20 00 20 00 | 20 00 20 00 20 00 20 00 | i.s. . . . . . .
00000020: 20 00 20 00 61 00 20 00 | 74 00 65 00 73 00 74 00 |  . .a. .t.e.s.t.
******************************************************************************
[itl_ta_thai.cpp:78] Target length: 2000
[itl_ta_thai.cpp:79] Source length: 24
[itl_ta_thai.cpp:104] Boundary at position1: 0
[itl_ta_thai.cpp:106] Boundary at position2: 4
[itl_ta_thai.cpp:110] length: 4
[itl_ta_thai.cpp:104] Boundary at position1: 8
[itl_ta_thai.cpp:106] Boundary at position2: 10
[itl_ta_thai.cpp:110] length: 2
[itl_ta_thai.cpp:104] Boundary at position1: 18
[itl_ta_thai.cpp:106] Boundary at position2: 19
[itl_ta_thai.cpp:110] length: 1
[itl_ta_thai.cpp:104] Boundary at position1: 20
[itl_ta_thai.cpp:106] Boundary at position2: 24
[itl_ta_thai.cpp:110] length: 4
[itl_ta_thai.cpp:130] uCharsProcessed: 15
***** Final Thai buffer ******************************************************
00000000: 74 00 68 00 69 00 73 00 | 20 00 69 00 73 00 20 00 | t.h.i.s. .i.s. .
00000010: 61 00 20 00 74 00 65 00 | 73 00 74 00 20 00 __ __ | a. .t.e.s.t. .__
******************************************************************************

The first and the second example yield what I expected: the break iterator identifies the words correctly and allows me to skip any unwanted white space between them.

However, if applied to an input buffer of the Thai language, I do get unexpected results:

Input 3 - A Thai sentence

***** Thai buffer ************************************************************
00000000: 08 0e 32 0e 01 0e 21 0e | 15 0e 34 0e 04 0e 13 0e | ..2...!...4.....
00000010: 30 0e 23 0e 31 0e 10 0e | 21 0e 19 0e 15 0e 23 0e | 0.#.1...!.....#.
00000020: 35 0e 40 0e 21 0e 37 0e | 48 0e 2d 0e 27 0e 31 0e | 5.@.!.7.H.-.'.1.
00000030: 19 0e 17 0e 35 0e 48 0e | 31 00 37 00 1e 0e 24 0e | ....5.H.1.7...$.
00000040: 29 0e 20 0e 32 0e 04 0e | 21 0e 32 00 35 00 33 00 | ). .2...!.2.5.3.
00000050: 30 00 17 0e 35 0e 48 0e | 43 0e 2b 0e 49 0e 1a 0e | 0...5.H.C.+.I...
00000060: 23 0e 34 0e 29 0e 31 0e | 17 0e 40 0e 14 0e 34 0e | #.4.).1...@...4.
00000070: 19 0e 2d 0e 32 0e 01 0e | 32 0e 28 0e 44 0e 17 0e | ..-.2...2.(.D...
00000080: 22 0e 23 0e 27 0e 21 0e | 01 0e 34 0e 08 0e 01 0e | ".#.'.!...4.....
00000090: 32 0e 23 0e 40 0e 1b 0e | 47 0e 19 0e 2d 0e 31 0e | 2.#.@...G...-.1.
******************************************************************************
[itl_ta_thai.cpp:78] Target length: 2000
[itl_ta_thai.cpp:79] Source length: 80
[itl_ta_thai.cpp:104] Boundary at position1: 0
[itl_ta_thai.cpp:106] Boundary at position2: 3
[itl_ta_thai.cpp:110] length: 3
[itl_ta_thai.cpp:104] Boundary at position1: 6
[itl_ta_thai.cpp:106] Boundary at position2: 9
[itl_ta_thai.cpp:110] length: 3
[itl_ta_thai.cpp:104] Boundary at position1: 17
[itl_ta_thai.cpp:106] Boundary at position2: 22
[itl_ta_thai.cpp:110] length: 5
[itl_ta_thai.cpp:104] Boundary at position1: 25
[itl_ta_thai.cpp:106] Boundary at position2: 28
[itl_ta_thai.cpp:110] length: 3
[itl_ta_thai.cpp:104] Boundary at position1: 30
[itl_ta_thai.cpp:106] Boundary at position2: 37
[itl_ta_thai.cpp:110] length: 7
[itl_ta_thai.cpp:104] Boundary at position1: 41
[itl_ta_thai.cpp:106] Boundary at position2: 44
[itl_ta_thai.cpp:110] length: 3
[itl_ta_thai.cpp:104] Boundary at position1: 47
[itl_ta_thai.cpp:106] Boundary at position2: 53
[itl_ta_thai.cpp:110] length: 6
[itl_ta_thai.cpp:104] Boundary at position1: 57
[itl_ta_thai.cpp:106] Boundary at position2: 62
[itl_ta_thai.cpp:110] length: 5
[itl_ta_thai.cpp:104] Boundary at position1: 65
[itl_ta_thai.cpp:106] Boundary at position2: 68
[itl_ta_thai.cpp:110] length: 3
[itl_ta_thai.cpp:104] Boundary at position1: 74
[itl_ta_thai.cpp:106] Boundary at position2: 80
[itl_ta_thai.cpp:110] length: 6
[itl_ta_thai.cpp:130] uCharsProcessed: 54
***** Final Thai buffer ******************************************************
00000000: 08 0e 32 0e 01 0e 20 00 | 04 0e 13 0e 30 0e 20 00 | ..2... .....0. .
00000010: 40 0e 21 0e 37 0e 48 0e | 2d 0e 20 00 17 0e 35 0e | @.!.7.H.-. ...5.
00000020: 48 0e 20 00 1e 0e 24 0e | 29 0e 20 0e 32 0e 04 0e | H. ...$.). .2...
00000030: 21 0e 20 00 17 0e 35 0e | 48 0e 20 00 1a 0e 23 0e | !. ...5.H. ...#.
00000040: 34 0e 29 0e 31 0e 17 0e | 20 00 2d 0e 32 0e 01 0e | 4.).1... .-.2...
00000050: 32 0e 28 0e 20 00 23 0e | 27 0e 21 0e 20 00 40 0e | 2.(. .#.'.!. .@.
00000060: 1b 0e 47 0e 19 0e 2d 0e | 31 0e 20 00 __ __ __ __ | ..G...-.1. .____
******************************************************************************

As you can see, a lot of valid Thai characters are skipped (even the embedded Arabic digits are not identified correctly).
Are we using the break iterator in an invalid way?
What is wrong with this code?

I have provided the Thai fragment here; it is encoded in UTF-16.

Any help is greatly appreciated!

Mit freundlichen Gruessen/Kind regards

          Ralf Hauser
