Proposal: ICU4J Break Iterator API Additions

Proposal for new Java Break Iterator API.

For ICU 3.0, ICU for Java will be getting a port of the break iterator 
engine from ICU4C.  The existing Java API will remain unchanged, but 
there are a few new capabilities that come with the new engine, and 
these will require new API to be added to Java.

The new features are:
1.  The ability to determine which specific break rule
     or rules were responsible for identifying a boundary.

2.  The ability to create a break iterator from a
     pre-compiled set of rules.

For ICU 3.0, only the RBBI runtime engine will be available.  Break Rule 
compilation will require the use of ICU4C.  The new rule builder will be 
ported to Java in a subsequent ICU release.

The following members are added to class 
com.ibm.icu.text.RuleBasedBreakIterator:

    /** Tag value for "words" that do not fit into any of
      *  the other categories.
      *  Includes spaces and most punctuation.
      *  @draft ICU 3.0  */
    public static final int UBRK_WORD_NONE           = 0;

    /** Upper bound for tags for uncategorized words. */
    public static final int UBRK_WORD_NONE_LIMIT     = 100;

    /** Tag value for words that appear to be numbers, lower limit. */
    public static final int UBRK_WORD_NUMBER         = 100;

    /** Tag value for words that appear to be numbers, upper limit. */
    public static final int UBRK_WORD_NUMBER_LIMIT   = 200;

    /** Tag value for words that contain letters, excluding
     *  hiragana, katakana or ideographic characters, lower limit. */
    public static final int UBRK_WORD_LETTER         = 200;

    /** Tag value for words containing letters, upper limit  */
    public static final int UBRK_WORD_LETTER_LIMIT   = 300;

    /** Tag value for words containing kana characters, lower limit */
    public static final int UBRK_WORD_KANA           = 300;

    /** Tag value for words containing kana characters, upper limit */
    public static final int UBRK_WORD_KANA_LIMIT     = 400;

    /** Tag value for words containing ideographic characters,
      *  lower limit */
    public static final int UBRK_WORD_IDEO           = 400;

    /** Tag value for words containing ideographic characters,
      *  upper limit */
    public static final int UBRK_WORD_IDEO_LIMIT     = 500;

/**
  * Return the status tag from the break rule that determined the most
  * recently returned break position.  The values appear in the rule
  * source within brackets, {123}, for example.  For rules that do not
  * specify a status, a default value of 0 is returned.  If more than one
  * rule applies, the numerically largest of the possible status values
  * is returned.
  * <p>
  * Of the standard types of ICU break iterators, only the word break
  * iterator provides status values.  The values allow distinguishing
  * between words that contain alphabetic letters, "words" that appear to
  * be numbers, punctuation and spaces, words containing ideographic
  * characters, and more.  Call <code>getRuleStatus</code> after obtaining
  * a boundary position from <code>next()</code>, <code>previous()</code>,
  * or any other break iterator function that returns a boundary position.
  * <p>
  * @return the status from the break rule that determined the most
  *         recently returned break position.
  *
  * @draft ICU 3.0
  */
public int  getRuleStatus() {
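
As an illustration of how an application might interpret the value returned by getRuleStatus(), here is a minimal sketch that maps a tag to a word category.  It redeclares the proposed constants locally so that it stands alone, and it assumes each category covers the half-open range from its lower bound up to, but not including, its _LIMIT value.  The class name WordTagDemo and the method classify() are hypothetical and not part of the proposal.

```java
final class WordTagDemo {
    // Values copied from the proposal; real code would reference the
    // constants on RuleBasedBreakIterator instead of redeclaring them.
    static final int UBRK_WORD_NONE         = 0;
    static final int UBRK_WORD_NONE_LIMIT   = 100;
    static final int UBRK_WORD_NUMBER       = 100;
    static final int UBRK_WORD_NUMBER_LIMIT = 200;
    static final int UBRK_WORD_LETTER       = 200;
    static final int UBRK_WORD_LETTER_LIMIT = 300;
    static final int UBRK_WORD_KANA         = 300;
    static final int UBRK_WORD_KANA_LIMIT   = 400;
    static final int UBRK_WORD_IDEO         = 400;
    static final int UBRK_WORD_IDEO_LIMIT   = 500;

    /** Hypothetical helper: name the category a status tag falls into. */
    static String classify(int tag) {
        // Assumption: each range is [lower bound, limit).
        if (tag >= UBRK_WORD_NONE   && tag < UBRK_WORD_NONE_LIMIT)   return "none";
        if (tag >= UBRK_WORD_NUMBER && tag < UBRK_WORD_NUMBER_LIMIT) return "number";
        if (tag >= UBRK_WORD_LETTER && tag < UBRK_WORD_LETTER_LIMIT) return "letter";
        if (tag >= UBRK_WORD_KANA   && tag < UBRK_WORD_KANA_LIMIT)   return "kana";
        if (tag >= UBRK_WORD_IDEO   && tag < UBRK_WORD_IDEO_LIMIT)   return "ideo";
        return "unknown";
    }

    public static void main(String[] args) {
        // Sample tags standing in for values an iterator might report.
        int[] samples = {0, 100, 250, 399, 400, 500};
        for (int tag : samples) {
            System.out.println(tag + " -> " + classify(tag));
        }
    }
}
```

Testing against ranges rather than exact values leaves room for custom rules to use other tag values inside a category.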

/**
  * Get the status (tag) values from the break rule(s) that determined the
  * most recently returned break position.  The values appear in the rule
  * source within brackets, {123}, for example.  The default status value
  * for rules that do not explicitly provide one is zero.
  * <p>
  * If the size of the output array is insufficient to hold the data,
  * the output will be truncated to the available length.  No exception
  * will be thrown.
  *
  * @param fillInArray an array to be filled in with the status values.
  * @return  The number of rule status values from rules that determined
  *          the most recent boundary returned by the break iterator.
  *          In the event that the array is too small, the return value
  *          is the total number of status values that were available,
  *          not the reduced number that were actually returned.
  * @draft ICU 3.0
  */
  public int getRuleStatusVec(int[] fillInArray) {
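
To show how a caller might deal with a too-small array under this contract, here is a minimal sketch.  The StatusSource interface and fakeSource() are hypothetical stand-ins for the break iterator, included only so the example runs on its own.  The fake follows the documented behavior: it copies what fits and returns the total number of values available.

```java
import java.util.Arrays;

final class RuleStatusVecDemo {
    /** Hypothetical stand-in for the one iterator method this sketch needs. */
    interface StatusSource {
        int getRuleStatusVec(int[] fillInArray);
    }

    /** Fake source obeying the documented contract: truncate, return the total. */
    static StatusSource fakeSource(int[] available) {
        return fillInArray -> {
            int n = Math.min(available.length, fillInArray.length);
            System.arraycopy(available, 0, fillInArray, 0, n);
            return available.length;   // total available, not the number copied
        };
    }

    /** Fetch all status values, retrying once if the first array was too small. */
    static int[] allStatusValues(StatusSource source) {
        int[] buffer = new int[2];     // deliberately small first guess
        int count = source.getRuleStatusVec(buffer);
        if (count > buffer.length) {
            // The return value tells us exactly how much room is needed.
            buffer = new int[count];
            count = source.getRuleStatusVec(buffer);
        }
        return Arrays.copyOf(buffer, count);
    }

    public static void main(String[] args) {
        StatusSource source = fakeSource(new int[] {0, 100, 200});
        // prints [0, 100, 200]
        System.out.println(Arrays.toString(allStatusValues(source)));
    }
}
```

In the common case the first call succeeds and no second call or reallocation is needed.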

/**
  * Get a break iterator based on a set of pre-compiled break rules.
  *
  * @param is An input stream that supplies the compiled rule data.  The
  * format of the rule data on the stream is that of a rule data file
  * produced by the ICU4C tool "genbrk".
  * @return A RuleBasedBreakIterator based on the supplied break rules.
  * @throws IOException if the compiled rule data cannot be read from
  * the stream.
  */
public static RuleBasedBreakIterator getInstanceFromCompiledRules(
		InputStream is) throws IOException {

---

In getRuleStatusVec(), how to best handle array overflows was the 
subject of a bit of discussion.  The options considered were

1.  Throw an exception.  This seemed overly heavyweight for such
     a simple error.  An exception type that could carry the
     needed array size would need to be defined, and user code
     would need to catch and process the exception.

2.  Have the function discard the supplied array and return a
     bigger one if needed.  The problem with this is that there is
     no convenient place to return the number of status values,
     assuming that the array was big enough in the first place.
     Also, applications must be careful with multiple references
     to the array; they may go stale if the function reallocates.

3.  Option 2, but return the number of values in the first element
     of the array itself.

4.  Use some collection class.  This is inefficient.

The choice in the proposal, while not overly Java-like, seems to be the 
simplest and easiest to work with, both from the application side and 
the implementation side.

-- 
   -- Andy Heninger
      hen...@us...
