From: Andy H. <and...@gm...> - 2012-08-30 02:50:39
On Wed, Aug 29, 2012 at 5:50 PM, Glenn Adams <gl...@sk...> wrote:

> Thanks! That should be enough to get me started tweaking the default
> rules, though I may have some further questions as I proceed. If there is a
> better forum where I should raise questions, let me know.

The best forum for these questions would be the ICU support mailing list.
See http://site.icu-project.org/contacts

Also, take a look at the CSS3 discussion for Japanese line break in Unicode
UAX-14, here http://www.unicode.org/reports/tr14/proposed.html#CJ

And for the implementation of that in ICU, see the differences between
data/brkitr/line.txt and line_ja.txt

  -- Andy

> I see also there is a U_RBBIDEBUG environment variable that could be
> useful.
>
>
> On Thu, Aug 30, 2012 at 1:44 AM, Andy Heninger <and...@gm...> wrote:
>
>>
>>
>> On Wed, Aug 29, 2012 at 1:04 AM, Glenn Adams <gl...@sk...> wrote:
>>
>>> Hi Andy and Asmus,
>>>
>>> I'm in the process of upgrading WebKit [1] to support the CSS3
>>> line-break rules [2]. My plan is to use alternate sets of RBBI rules for
>>> the different modes. However, I have not used RBBI rules before, so I'm
>>> encountering a bit of a black wall when it comes to ICU documentation of
>>> semantics. Perhaps you can answer a few basic questions:
>>>
>>> (1) in chain mode, are all the rules in all sets ([safe] forward, [safe]
>>> reverse) applied simultaneously? or are the sets processed in some order,
>>> e.g., forward first, reverse second, etc?
>>>
>>
>> The set of rules to be applied depends on which function was called.
>> Plain forward rules move from one break position to the next, and are
>> invoked by next(). Plain reverse rules move from one break position to the
>> previous, invoked by previous(). Safe forward and Safe reverse can start at
>> any arbitrary position in the text (not necessarily a break position) and
>> move to a safe point from which the plain forward or reverse rules can be
>> applied. The position reached by the safe rules may or may not be a
>> boundary position. The function following(n) finds the next boundary after
>> position n; it is implemented by applying the safe reverse rule to move
>> backwards to a point that is not within some complex multi-character
>> context, then applying the plain forward rules to reach the actual boundary
>> position.
>>
>>
>>
>>> (2) are all sets applicable for all position movement operations, e.g.,
>>> do all sets apply for next(), following(), etc.? or does only a subset of
>>> the sets apply? e.g., only forward and safe forward are used for next(),
>>> only reverse and safe reverse used for previous()?
>>>
>>> (3) can one write a rule that reverses the effect of another rule? e.g.,
>>> if an existing rule prevents a break for some pair B A, can I write another
>>> rule that overrides this rule to permit a break at B A?
>>>
>>
>> In general no, with a special case exception for the '/', see the next
>> question.
>> ICU break rules define chunks of text that stay together. Breaks occur
>> when no rule applies.
>>
>>>
>>> (4) what is the purpose of the 'break-point' token ('/')?
>>>
>>
>> It forces a break, overriding all other rules. But it comes with
>> restrictions - if there is any following context you can only have one rule
>> in a set of rules with a '/' in it. And pretty much all use of '/' ends up
>> with following context - even if the forward rules don't have it, the
>> corresponding reverse rules will.
>>
>> I've been wanting to remove this restriction for a long time, but haven't
>> had the time to do it. It would simplify some of the rules.
>>
>> The ICU line break rules are tricky to get right, largely because they
>> are applied in parallel, not sequentially. The ICU monkey test is an
>> essential tool - it compares ICU break iterator results for randomly
>> generated text against a reference implementation that is a simple, very
>> literal implementation of the UAX 14 rules.
>>
>>   -- Andy
>>
>>
>>> Regards,
>>> Glenn
>>>
>>> [1] https://bugs.webkit.org/show_bug.cgi?id=89235
>>> [2] http://dev.w3.org/csswg/css3-text/#line-break
>>>
>>
>>
>
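
Editorial sketch (not part of the original message): the thread describes following(n) as "safe reverse to a safe point, then plain forward rules to the boundary", and mentions a monkey test that checks the iterator against a literal reference on random text. The Python below is a toy model of that idea only. It does not use ICU, and every name in it (next_boundary, safe_reverse, following, reference_following, monkey_test, DONE) is hypothetical. The "rule" is an assumed stand-in, far simpler than UAX 14: a run of non-space characters stays together with its trailing spaces. In this toy the safe point always happens to be a boundary, which the message says is not guaranteed in ICU.

```python
import random

DONE = -1  # hypothetical sentinel meaning "no further boundary"


def next_boundary(text: str, pos: int) -> int:
    """Toy 'plain forward rule'; only meaningful when pos is a boundary.

    Keeps a run of non-space characters and its trailing spaces together,
    and returns the position just past that chunk.
    """
    n = len(text)
    while pos < n and text[pos] != " ":
        pos += 1
    while pos < n and text[pos] == " ":
        pos += 1
    return pos


def is_safe_point(text: str, pos: int) -> bool:
    """Start of text, or a space-to-non-space transition."""
    if pos == 0:
        return True
    return pos < len(text) and text[pos - 1] == " " and text[pos] != " "


def safe_reverse(text: str, pos: int) -> int:
    """Toy 'safe reverse rule': back up from any position to a safe point."""
    pos = min(pos, len(text))
    while not is_safe_point(text, pos):
        pos -= 1
    return pos


def following(text: str, n: int) -> int:
    """First boundary strictly after n (assumes n >= 0), or DONE."""
    if n < 0:
        raise ValueError("n must be non-negative in this sketch")
    if n >= len(text):
        return DONE
    pos = safe_reverse(text, n)          # step 1: back up to a safe point
    while pos <= n:                      # step 2: run forward rules past n
        pos = next_boundary(text, pos)
    return pos


def reference_following(text: str, n: int) -> int:
    """Slow, literal reference: enumerate every boundary, take the first > n."""
    boundaries = [p for p in range(1, len(text))
                  if text[p - 1] == " " and text[p] != " "]
    boundaries.append(len(text))
    for b in boundaries:
        if b > n:
            return b
    return DONE


def monkey_test(rounds: int = 200, seed: int = 0) -> bool:
    """Compare following() with the reference on random short strings."""
    rng = random.Random(seed)
    for _ in range(rounds):
        length = rng.randint(0, 12)
        text = "".join(rng.choice("ab ") for _ in range(length))
        for n in range(len(text) + 1):
            assert following(text, n) == reference_following(text, n), (text, n)
    return True
```

For example, with text = "hello world  foo", following(text, 3) returns 6 and following(text, 12) returns 13. Starting in the middle of a chunk still lands on the same boundary that stepping forward from the start of the text would reach.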