From: Alan L. <al...@fi...> - 2001-02-10 01:05:19
|
If we have a bunch of rules which are really of the nature "don't break after/before x", then we should really think about extending the engine to support this explicitly.

At 12:59 PM 2/9/2001 -0800, Edward J. Batutis wrote:
>--- Eric Mader <er...@us...> wrote:
>>
>> Hi Ed,
>>
>> I've looked at the rules in the Unicode book, but didn't get very far. I found the same thing that you did. The book specifies a lot of rules that say "don't break here" which are not easy to express as RBBI rules.
>
>It's clear how to implement rules like "don't break between x and y" but rules like "don't break after p" are mind-boggling. Rules like "don't break before r" aren't too bad, but you have to apply these rules over and over to subsequent rules. There might be a similar way to implement the "don't break after p" type rules - reapply them to every other subsequent rule in a cumulative fashion. It is a puzzle. I'll let you know if I come up with anything reasonable.
>
>=Ed
|
From: Eric M. <er...@us...> - 2001-02-12 21:28:56
|
Alan,

I think it may be possible to come up with an equivalent set of rules expressed as regular expressions - I just need to think about it for a while.

Eric

Alan Liu <al...@fi...> on 02/09/2001 01:13:00 PM

To: "Edward J. Batutis" <ejb...@ya...>, Eric Mader/Cupertino/IBM@IBMUS
cc: ic...@os..., icu...@os...
Subject: Re: Line breaking aaa(aaa: ICU 1.7

If we have a bunch of rules which are really of the nature "don't break after/before x", then we should really think about extending the engine to support this explicitly.
|
From: Eric M. <er...@us...> - 2001-02-12 21:43:39
|
Ed,

In general, the pair table approach doesn't work quite as well as regular expressions because there are cases where you need more context than the two surrounding characters. (cf. the last paragraph on page 125 - section 5.15 right before "Example Specifications.")

There's been some discussion here of rewriting the parser to make it easier to maintain, extend and port to C++. If we want to extend the engine, this would be a good time to think about that.

I'll be on vacation starting 2/14 through 3/2. Let's talk about this more after I get back.

Eric

"Edward J. Batutis" <ejb...@ya...> on 02/09/2001 01:45:53 PM

To: Alan Liu <al...@fi...>, Eric Mader/Cupertino/IBM@IBMUS
cc: ic...@os..., icu...@os...
Subject: Re: Line breaking aaa(aaa: ICU 1.7

--- Alan Liu <al...@fi...> wrote:
> If we have a bunch of rules which are really of the nature "don't break after/before x", then we should really think about extending the engine to support this explicitly.
>

I find the engine code to be a bit daunting personally, but this is a good idea. Another possibility is to implement a new kind of break iterator like the one suggested in UTR 14: a pair-table. It looks very simple and would probably be a good performing algorithm. One can almost implement the pair table using the regular expression syntax, but not quite - I believe - although I haven't quite given up on that idea. Perl's RE engine could do it, but it isn't clear how to do it using the ICU one.

=Ed
|
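To make the "more context than the two surrounding characters" point concrete, here is a small Python sketch of one case of that kind. It is not taken from the Unicode book or from ICU; the three classes, the `classify` helper, and the rule "don't break after an opening bracket, even across spaces" are assumptions chosen for illustration. A check that only sees the two characters around a position cannot tell that an opening bracket sits further back.

```python
def classify(ch):
    """Toy line-break classes (an assumption, not real Unicode data)."""
    if ch in "([{":
        return "OP"   # opening punctuation
    if ch == " ":
        return "SP"   # space
    return "AL"       # everything else is treated as a letter


def pair_allows_break(text, i):
    """Pair-only view: allow a break after a space unless another space follows."""
    return classify(text[i - 1]) == "SP" and classify(text[i]) != "SP"


def context_allows_break(text, i):
    """Same rule, but look back past the spaces to honor 'no break after OP'."""
    if not pair_allows_break(text, i):
        return False
    j = i - 1
    while j >= 0 and classify(text[j]) == "SP":
        j -= 1  # skip the run of spaces before position i
    return not (j >= 0 and classify(text[j]) == "OP")


text = "( aaa bbb"
pair_breaks = [i for i in range(1, len(text)) if pair_allows_break(text, i)]
context_breaks = [i for i in range(1, len(text)) if context_allows_break(text, i)]
```

For the string "( aaa bbb", the pair-only check allows breaks at offsets 2 and 6, while the context-aware check allows only offset 6, because the space before "aaa" follows an opening bracket.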
From: Edward J. B. <ejb...@ya...> - 2001-02-16 06:37:37
|
--- Eric Mader <er...@us...> wrote:
>
> Ed,
>
> In general, the pair table approach doesn't work quite as well as regular expressions because there are cases where you need more context than the two surrounding characters. (cf. the last paragraph on page 125 - section 5.15 right before "Example Specifications.")
>

I've re-implemented the line breaking rules based on the line breaking properties file on unicode.org and based partially on UTR 14. My new line??.brk files implement line breaking that is closer to UTR 14. After some additional testing I hope to contribute it to ICU/ICU4J.

In any case, after struggling with it for several days I'm not too happy with UTR 14. UTR 14 attempts to describe proper line breaking using both regular expressions and pairs, but it is clear that the author had a pair implementation in mind. He tries to break some of the regular expressions down into pairs, but admits that this is only approximate. On the other hand although the regular expressions can be implemented using a regular expression engine, the pairs cannot (at least not with the ICU engine). The result is a description of line breaking that isn't entirely satisfactory for either a pair table implementation or a regular expression implementation. I would rather see a spec that aims clearly at one target - or both targets separately - and hits it directly.

Line breaking is inherently a bit messy. Ideally it would vary based on the content of the text it was operating on and the like, but it seems that a clearer and more implementable description of line breaking for general text should be possible. I've attempted to contact the author and will try to forward my comments on to him directly.

=Ed
|
From: Mark D. <mar...@us...> - 2001-02-22 02:24:29
|
That text is messy, I agree. However, it says (somewhere) that the regular expression version is the reference; the pair implementation is only an approximation to it.

Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], mar...@us..., pre...@un...
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014
|
From: Edward J. B. <ejb...@ya...> - 2001-02-10 01:20:44
|
--- Alan Liu <al...@fi...> wrote:
> If we have a bunch of rules which are really of the nature "don't break after/before x", then we should really think about extending the engine to support this explicitly.
>

I find the engine code to be a bit daunting personally, but this is a good idea. Another possibility is to implement a new kind of break iterator like the one suggested in UTR 14: a pair-table. It looks very simple and would probably be a good performing algorithm. One can almost implement the pair table using the regular expression syntax, but not quite - I believe - although I haven't quite given up on that idea. Perl's RE engine could do it, but it isn't clear how to do it using the ICU one.

=Ed
|
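For readers following the pair-table idea, here is a minimal Python sketch of what such a break iterator might look like. It is not ICU code and not the table from UTR 14: the four classes, the `classify` helper, and the entries in `PAIRS` are all assumptions made up for illustration. It does show how a "don't break after x" rule turns into a whole row of the table and a "don't break before x" rule into a whole column.

```python
# Hypothetical, heavily simplified line-break classes (not the full UTR 14 set).
OP, CL, AL, SP = "OP", "CL", "AL", "SP"
CLASSES = (OP, CL, AL, SP)


def classify(ch):
    """Toy classifier: an assumption for illustration, not Unicode data."""
    if ch in "([{":
        return OP
    if ch in ")]}":
        return CL
    if ch == " ":
        return SP
    return AL


# PAIRS[(before, after)] is True when a break is allowed between the classes.
# Start with "no break anywhere", then open up the allowed pairs.
# The OP row stays all False: "don't break after an opening bracket".
# The CL column stays all False: "don't break before a closing bracket".
PAIRS = {(b, a): False for b in CLASSES for a in CLASSES}
PAIRS[(SP, AL)] = True   # break after a space, before a letter
PAIRS[(SP, OP)] = True   # break after a space, before an opening bracket


def break_positions(text):
    """Return the offsets i where a break is allowed between text[i-1] and text[i]."""
    return [i for i in range(1, len(text))
            if PAIRS[(classify(text[i - 1]), classify(text[i]))]]
```

With this toy table, `break_positions("aaa (bbb) ccc")` gives `[4, 10]` and `break_positions("aaa(aaa")` gives `[]`. Each decision is a single dictionary lookup on two neighbouring classes, which is where both the simplicity and the limits of the approach come from.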