From: Andy H. <and...@gm...> - 2017-05-01 23:12:21
I would like to propose the following API for ICU 60.

Please provide feedback by Tuesday, May 17, 2017
Designated API reviewer: Markus

Ticket: 7130 <http://bugs.icu-project.org/trac/ticket/7141>
Add API and rule syntax for Break Iterator types (word, line, etc.).

In ubrk.h is this definition for the available break iterator types:

    /** The possible types of text boundaries. @stable ICU 2.0 */
    typedef enum UBreakIteratorType {
      /** Character breaks @stable ICU 2.0 */
      UBRK_CHARACTER = 0,
      /** Word breaks @stable ICU 2.0 */
      UBRK_WORD = 1,
      /** Line breaks @stable ICU 2.0 */
      UBRK_LINE = 2,
      /** Sentence breaks @stable ICU 2.0 */
      UBRK_SENTENCE = 3,
      ... // omitting deprecated values
    } UBreakIteratorType;

The type is passed to the factory methods to specify the desired type of break iterator.

There is no corresponding getter function to query the type of a break iterator.

There is no rule syntax to specify the type via the break rules. Break iterators created directly through the rule APIs therefore have no mechanism to indicate their type. They default to type WORD.

In the existing implementation, the type of a break iterator affects the dictionary behavior, although a case could be made that it shouldn't. The $dictionary set declared in the rules should be sufficient for enabling dictionary-based breaking.

The proposal:

To break iterator rule syntax, add a type declaration, one of

    !!type character;
    !!type word;
    !!type line;
    !!type sentence;

Notes:
- The !! syntax is already a convention in the RBBI rules for various kinds of declarations about the set of rules.
- "character" is actually grapheme cluster, but the "character" name is already used throughout the BreakIterator API and documentation.
- The type declaration may be omitted, in which case it will default to word. This is the current ICU behavior.
- If rules with a declared type are loaded by a factory function such as BreakIterator::createLineInstance(), the types must match or an error will be set.
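To make the declaration rules above concrete, here is a rough sketch of how a rule loader might read the `!!type` declaration and apply the default and the factory check. It is standalone C++ and does not use ICU: `BreakType`, `DeclStatus`, `TypeDecl`, `parseTypeDecl` and `factoryAccepts` are hypothetical names invented for this illustration. It also assumes that rules with no declaration are accepted by any factory, which the proposal does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the UBreakIteratorType values quoted above.
// Nothing in this sketch includes or calls ICU.
enum class BreakType { Character = 0, Word = 1, Line = 2, Sentence = 3 };

enum class DeclStatus { Declared, Omitted, Invalid };

struct TypeDecl {
    DeclStatus status;
    BreakType type;  // meaningful unless status == Invalid
};

// Look for a "!!type <name>;" declaration in the rule source.
// Simplification: comments and quoted text in the rules are not skipped.
TypeDecl parseTypeDecl(const std::string &rules) {
    const std::string key = "!!type";
    const std::string ws = " \t\r\n";
    const std::size_t pos = rules.find(key);
    if (pos == std::string::npos) {
        // Declaration omitted: the proposal defaults to word.
        return {DeclStatus::Omitted, BreakType::Word};
    }
    const std::size_t afterKey = pos + key.size();
    const std::size_t start = rules.find_first_not_of(ws, afterKey);
    const std::size_t end = rules.find(';', afterKey);
    // Require whitespace after the keyword and a terminating semicolon.
    if (start == std::string::npos || end == std::string::npos ||
        start == afterKey || start >= end) {
        return {DeclStatus::Invalid, BreakType::Word};
    }
    std::string name = rules.substr(start, end - start);
    name.erase(name.find_last_not_of(ws) + 1);  // trim trailing whitespace

    if (name == "character") return {DeclStatus::Declared, BreakType::Character};
    if (name == "word")      return {DeclStatus::Declared, BreakType::Word};
    if (name == "line")      return {DeclStatus::Declared, BreakType::Line};
    if (name == "sentence")  return {DeclStatus::Declared, BreakType::Sentence};
    return {DeclStatus::Invalid, BreakType::Word};
}

// Check a factory request against the rules: a declared type must match.
bool factoryAccepts(const std::string &rules, BreakType requested) {
    const TypeDecl decl = parseTypeDecl(rules);
    switch (decl.status) {
        case DeclStatus::Declared: return decl.type == requested;
        case DeclStatus::Omitted:  return true;   // assumption, see lead-in
        case DeclStatus::Invalid:  return false;
    }
    return false;
}

// Usage example for the two helpers.
void demoTypeDecl() {
    assert(parseTypeDecl("!!type line;").type == BreakType::Line);
    assert(parseTypeDecl("$x = [a-z];").status == DeclStatus::Omitted);
    assert(!factoryAccepts("!!type word;", BreakType::Line));
}
```

A real implementation would live in the RBBI rule scanner and report a mismatch through UErrorCode rather than a bool.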
To ubrk.h, add

    /**
     * Get the type of the break iterator. If no type was declared by the rules, default to UBRK_WORD.
     * @param bi The break iterator to use.
     * @param status A UErrorCode to receive any errors detected by this function.
     * @return The type of this break iterator.
     * @draft ICU 60
     */
    UBreakIteratorType ubrk_getType(UBreakIterator *bi, UErrorCode *status);

To brkiter.h, class BreakIterator add

    /**
     * Get the type of the break iterator. If no type was declared by the rules, default to UBRK_WORD.
     * @param status A UErrorCode to receive any errors detected by this function.
     * @return The type of this break iterator.
     * @draft ICU 60
     */
    virtual UBreakIteratorType getType(UErrorCode &status);

Notes:
- BreakIterator is an abstract class. It will need a default implementation to keep any subclasses we don't know about alive.
- Should the default type remain Word, or should we introduce a new unknown type?

To rbbi.h, class RuleBasedBreakIterator add

    /**
     * Get the type of the break iterator. If no type was declared by the rules, default to UBRK_WORD.
     * @param status A UErrorCode to receive any errors detected by this function.
     * @return The type of this break iterator.
     * @draft ICU 60
     */
    UBreakIteratorType getType(UErrorCode &status) override;

In main/classes/core/src/com/ibm/icu/text/BreakIterator.java

    /**
     * Get the type of the break iterator, one of KIND_CHARACTER, KIND_WORD, etc.
     * If no type was declared by the rules, default to KIND_WORD.
     * @return The type of this break iterator.
     * @draft ICU 60
     * @provisional This API might change or be removed in a future release.
     */
    int getType();

In main/classes/core/src/com/ibm/icu/text/RuleBasedBreakIterator.java

    /**
     * {@inheritDoc}
     * @draft ICU 60
     * @provisional This API might change or be removed in a future release.
     */
    @Override
    int getType();

--- Andy
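The note about BreakIterator being abstract can be illustrated with a small standalone sketch. The classes below are hypothetical stand-ins, not the ICU classes: `BreakIteratorSketch`, `LegacyIterator`, `RuleBasedSketch`, `IteratorType` and `Status` are invented for this example, and `Status` only mimics the role of UErrorCode. The point is that a non-pure virtual `getType` with a default body lets an existing subclass keep compiling, while a rule-based subclass overrides it.

```cpp
#include <cassert>

// Hypothetical stand-ins for UBreakIteratorType and UErrorCode.
enum class IteratorType { Character = 0, Word = 1, Line = 2, Sentence = 3 };
enum class Status { Ok, Error };

// Abstract base, loosely modeled on the shape of BreakIterator.
class BreakIteratorSketch {
public:
    virtual ~BreakIteratorSketch() = default;
    virtual int next() = 0;  // still pure virtual, so the class stays abstract

    // Default implementation: subclasses written before getType() existed
    // inherit this and report Word, the proposed default.
    virtual IteratorType getType(Status &status) {
        (void)status;  // nothing can fail in this sketch
        return IteratorType::Word;
    }
};

// A subclass "we don't know about": it never mentions getType().
class LegacyIterator : public BreakIteratorSketch {
public:
    int next() override { return -1; }
};

// A rule-based subclass that reports the type it was built with.
class RuleBasedSketch : public BreakIteratorSketch {
public:
    explicit RuleBasedSketch(IteratorType type) : type_(type) {}
    int next() override { return -1; }
    IteratorType getType(Status &status) override {
        (void)status;
        return type_;
    }
private:
    IteratorType type_;
};

// Usage example: the legacy subclass falls back to Word, the override wins
// when called through a base-class reference.
void demoGetType() {
    Status status = Status::Ok;
    LegacyIterator legacy;
    RuleBasedSketch line(IteratorType::Line);
    BreakIteratorSketch &viaBase = line;
    assert(legacy.getType(status) == IteratorType::Word);
    assert(viaBase.getType(status) == IteratorType::Line);
}
```

In this sketch the legacy subclass reports Word whether or not it is really a word iterator, which is the situation the open question about an unknown type is concerned with.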