Re: [icu-design] API Proposal: Multiple passes in RBT rules

Thanks.

Yes, you remember right. I don't think there is any need to have syntax
like ::[:Lu:]BEGIN; we can just use the normal syntax that is currently
used, as in your second example. I agree with you that begin/end is the
way to go, even if some of the more advanced uses for it are not
implemented in a first pass.

I'll sketch it out a bit for others; this also has relevance to your
question #1.

A. In the general case, one can currently have:

::filter1;
::translit1 (rev1);
::translit2 (rev2);
rule1a;
...
rule1z;
::translit3 (rev3);
::translit4 (rev4);
::filter2;

B. When you build a transliterator from that, you actually build 2
compounds, one in each direction. Number 1 is:

filter1; translit1; translit2; translit-temp; translit3; translit4;

and the reverse one is

filter2; rev4; rev3; translit-temp-rev; rev2; rev1;

where translit-temp is built from rule1a..rule1z, and
translit-temp-rev is built from reverse(rule1z)...reverse(rule1a).

But we'll concentrate on one direction in the following, for simplicity.
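
To make B concrete, here is a rough Python sketch of how the two compounds could be assembled. The entry encoding and the name build_compounds are made up for illustration; this is not how the ICU parser is actually written.

```python
# Hypothetical model of B. Each entry of a rule file is one of:
#   ("filter", name)            a global filter
#   ("id", forward, reverse)    an ID line such as ::translit1 (rev1);
#   ("rule", text)              an ordinary conversion rule
def build_compounds(entries):
    forward_filter = reverse_filter = None
    fwd, rev = [], []
    for pos, entry in enumerate(entries):
        kind = entry[0]
        if kind == "filter":
            # Assumption: a filter in first position is the forward global
            # filter; any later one is the reverse global filter.
            if pos == 0:
                forward_filter = entry[1]
            else:
                reverse_filter = entry[1]
        elif kind == "id":
            fwd.append(entry[1])
            rev.append(entry[2])
        elif kind == "rule":
            # A run of consecutive rules collapses into one temporary
            # rule-based transliterator (and its reverse).
            if not fwd or fwd[-1] != "translit-temp":
                fwd.append("translit-temp")
                rev.append("translit-temp-rev")
    forward = ([forward_filter] if forward_filter else []) + fwd
    # The reverse compound runs the reverse pieces in the opposite order.
    reverse = ([reverse_filter] if reverse_filter else []) + rev[::-1]
    return forward, reverse


entries = [
    ("filter", "filter1"),
    ("id", "translit1", "rev1"),
    ("id", "translit2", "rev2"),
    ("rule", "rule1a"),
    ("rule", "rule1z"),
    ("id", "translit3", "rev3"),
    ("id", "translit4", "rev4"),
    ("filter", "filter2"),
]
forward, reverse = build_compounds(entries)
print("; ".join(forward))
print("; ".join(reverse))
```

The sketch only handles a single run of rules; C below is about what happens when there is more than one.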

C. What we had envisioned is that if you put a translit in the
middle of the rules, it cuts them into two pieces. So

::filter1;
::translit1 (rev1);
::translit2 (rev2);
rule1a;
...
rule1m;
::translitA;
rule1n;
...
rule1z;
::translit3 (rev3);
::translit4 (rev4);

would produce:

filter1; translit1; translit2; translit-temp1; translitA;
translit-temp2; translit3; translit4;

Note that this is recursive: each of the translits can be itself a
compound, with filters.
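
The cutting described in C can be sketched in a few lines of Python. This is only a model under simple assumptions: entries are plain strings, anything starting with "::" is treated as an ID line, and split_passes is a hypothetical helper name.

```python
from itertools import groupby


def split_passes(entries):
    """Group consecutive conversion rules; each '::' entry is its own pass."""
    passes = []
    for is_id, group in groupby(entries, key=lambda e: e.startswith("::")):
        group = list(group)
        if is_id:
            # Each ID line becomes one element of the compound.
            passes.extend(e[2:] for e in group)
        else:
            # A run of rules becomes one temporary rule-based translit.
            passes.append(tuple(group))
    return passes


entries = ["::translit2", "rule1a", "rule1m", "::translitA",
           "rule1n", "rule1z", "::translit3"]
passes = split_passes(entries)
for p in passes:
    print(p)
```

Here the two tuples play the role of translit-temp1 and translit-temp2, with translitA sitting between them.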

D. Having begin/end would simply act exactly as if we had pulled out all
the rules between them, and made a temporary "translit-TempA". Thus

::filter1;
::translit1 (rev1);
::translit2 (rev2);
rule1a;
...
rule1m;

::begin;
::filter3;
::translit5;
::rule2a;
...
::rule2z;
::end;

rule1n;
...
rule1z;
::translit3 (rev3);
::translit4 (rev4);

would produce:

filter1; translit1; translit2; translit-temp1; translit-TempA;
translit-temp2; translit3; translit4;

where translit-TempA is exactly what we would have gotten from a separate
file:

::filter3;
::translit5;
::rule2a;
...
::rule2z;

E. Notice that this means that embedding is almost, but not quite, the
same as separating. Filter1 applies to the whole sequence of following
actions in the file. And if you embed, the same is true. That is:

rule0;
::begin
:: filter1
:: rule1;
::begin
:: filter2;
:: rule2;
::end
::rule3;
::end
rule4;

produces a compound that looks like:

(rule0) (filter1 (rule1) (filter2 rule2) (rule3)) (rule4)

where the parentheses enclose a compound. This would be different from
just separation:

(rule0) (filter1 (rule1)) (filter2 rule2) (rule3) (rule4)
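
The nested reading in E can be modelled with a small stack-based parser. This is a sketch only: parse_blocks is a hypothetical name, nested Python lists stand in for compounds, and the tokenizing is deliberately crude (it just strips "::", spaces and trailing semicolons).

```python
def parse_blocks(lines):
    """Turn ::begin / ::end markers into nested lists (one list per compound)."""
    root = []
    stack = [root]
    for line in lines:
        token = line.strip().rstrip(";").replace(" ", "")
        if token == "::begin":
            block = []
            stack[-1].append(block)   # the new compound lives inside its parent
            stack.append(block)
        elif token == "::end":
            if len(stack) == 1:
                raise ValueError("::end without matching ::begin")
            stack.pop()
        else:
            stack[-1].append(token.lstrip(":"))
    if len(stack) != 1:
        raise ValueError("missing ::end")
    return root


tree = parse_blocks([
    "rule0;",
    "::begin",
    ":: filter1",
    ":: rule1;",
    "::begin",
    ":: filter2;",
    ":: rule2;",
    "::end",
    "::rule3;",
    "::end",
    "rule4;",
])
print(tree)
```

In the resulting tree the inner block sits inside the list that starts with filter1, which is the point of E: filter1 covers everything up to its own ::end, including the embedded block.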

F. > 1) They might be an argument for loosening the restriction on
having ID calls inside ::BEGIN and ::END (can CompoundTransliterators
nest?).

Yes, that is what we envisioned, as in C above.

G. > 2) I'd probably be inclined to implement the named blocks by
registering them with the framework-- otherwise, you have to maintain a
second name registry in TransliteratorParser (and have
TransliteratorRuleParser get access to it to make the $ syntax work).
Is it worth the extra effort to have the namespace for named blocks be
local?

We tossed that back and forth. I think it would be fine to use the same
kind of hack that Java uses for anonymous inner classes, e.g.

ID: any-foo
...
::begin "internal1"
...
::end

creates a registered id called any-foo$internal1. We had wanted to put
all the named ones at the top, so that it was clear that they were not
part of the overall flow.
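
As a rough illustration of the naming scheme in G, the following Python sketch derives the registered id from the enclosing ID and the block name. The dict-based registry and the function register_named_block are hypothetical stand-ins, not ICU API.

```python
# Hypothetical stand-in for the framework's transliterator registry.
registry = {}


def register_named_block(parent_id, block_name, rules):
    """Register a named ::begin block under '<parent>$<name>'."""
    full_id = f"{parent_id}${block_name}"
    if full_id in registry:
        raise ValueError(f"duplicate named block: {full_id}")
    registry[full_id] = list(rules)
    return full_id


block_id = register_named_block("any-foo", "internal1", ["a > b;"])
print(block_id)
```

Because the parent ID is part of the name, two rule files can both use "internal1" without colliding in the shared registry.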

Mark

  ----- Original Message -----
  From: Richard T. Gillam
  To: icu...@li...
  Sent: Friday, May 27, 2005 08:34
  Subject: RE: [icu-design] API Proposal: Multiple passes in RBT rules

  Mark--

  Thanks as always for your insightful comments.

  Re filters:

  I hadn't really thought about filters.  If I remember right, you can
  have filters in two places in a normal set of rules: a global filter at
  the beginning (and/or a reverse global filter at the end) and a filter
  on an individual ID rule.  With ::BEGIN/::END, I think these would
  devolve to the same thing as far as any ::BEGIN/::END blocks are
  concerned: In other words,

  abc > xyz;
  ::[:Lu:]BEGIN;
      ABC > XYZ;
      DEF > ZYX;
  ::END
  def > zyx;

  would be equivalent to

  abc > xyz;
  ::BEGIN;
      ::[:Lu:];
      ABC > XYZ;
      DEF > ZYX;
  ::END;
  def > zyx;

  Of the two, I'd be more inclined to allow people to stick filters on
  the BEGIN, but I could go either way.  Mostly, though, I'm wondering
  whether this buys us anything.  Since the inner set of rules is
  specified inline at the call site, and since the inner set of rules
  can't (currently) include any ID rules of its own, you could just have
  the left-hand sides of the inner rules operate on the characters they
  should operate on.  Using a filter would be syntactic sugar.  Am I
  misunderstanding something here?

  Re nesting:

  Does nesting buy us anything?  What would it mean?  Consider the
  following example:

  rule1;
  ::BEGIN;
      rule2;
      rule3;
  ::END;
  rule4;

  In my proposal, this means:

  - Go through the whole string and apply rule1 wherever it applies.
  - Go back to the beginning, then go through the whole string and apply
    rule2 and rule3 wherever they apply (if they have overlapping matches,
    the normal behavior applies-- a match earlier in the string wins, and if
    both rules match at the same place, rule2 wins).
  - Go back to the beginning again, and apply rule4 wherever it applies.
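
The three steps above can be modelled with a short Python sketch. It is a deliberate simplification: rules are literal (source, target) string pairs with no context, variables or cursor markers, and apply_pass / apply_passes are hypothetical names rather than anything in ICU.

```python
def apply_pass(text, rules):
    """One left-to-right pass: at each position the first matching rule wins."""
    out = []
    i = 0
    while i < len(text):
        for source, target in rules:
            # Skip empty sources so the scan always makes progress.
            if source and text.startswith(source, i):
                out.append(target)
                i += len(source)   # continue after the matched text
                break
        else:
            out.append(text[i])    # no rule matched here; copy the character
            i += 1
    return "".join(out)


def apply_passes(text, passes):
    """Run each pass over the whole string, restarting from the beginning."""
    for rules in passes:
        text = apply_pass(text, rules)
    return text


rule1 = ("a", "b")
rule2 = ("b", "c")
rule3 = ("bc", "X")
rule4 = ("c", "d")

# rule1; ::BEGIN; rule2; rule3; ::END; rule4;
multi = apply_passes("abc", [[rule1], [rule2, rule3], [rule4]])
# The same four rules in a single pass, for comparison.
single = apply_pass("abc", [rule1, rule2, rule3, rule4])
print(multi, single)   # ddd bcd
```

In the second pass both rule2 and rule3 match at the "bc" position, and rule2 wins because it comes first, as described in the list above.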

  If we have nesting, this seems like it'd mean something like:

  - Apply rule1.
  - Go back to the beginning and apply rule2 and rule3 to the whole
    string.
  - THEN RESUME WHERE YOU LEFT OFF and apply rule4 to the remainder.

  But what does "resume where you left off" mean?  How would you know
  when/where to do the BEGIN/END block relative to rule1 and rule4?  One
  possibility might be to apply the rules from the inside out-- first do
  rule2 and rule3, then go back to the beginning and do rule1 and rule4,
  but this doesn't seem intuitive.

  So is there another meaning of nesting you had in mind?  If not,
  levels of nesting are irrelevant:

  rule1;
  ::BEGIN;
      rule2;
      ::BEGIN;
          rule3;
      ::END;
      rule4;
  ::END;
  rule5;

  is exactly the same as

  ::BEGIN;
  rule1;
  ::END;
  ::BEGIN;
  rule2;
  ::END;
  ::BEGIN;
  rule3;
  ::END;
  ::BEGIN;
  rule4;
  ::END;
  ::BEGIN;
  rule5;
  ::END;

  ...which is what my current implementation of toRules() will print
  out.  (Some of the ::BEGINs and ::ENDs are redundant: You can actually
  express the same thing without the ::BEGIN and ::END around rules 1, 3,
  and 5.)
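
Under the reading that nesting levels are irrelevant, a nested structure flattens into a plain sequence of passes. The following Python sketch shows one way to do that; nested lists stand in for ::BEGIN/::END blocks, and flatten is a hypothetical helper, not the actual toRules() code.

```python
def flatten(block):
    """Flatten nested blocks into a list of passes (each pass is a list of rules)."""
    passes, current = [], []
    for entry in block:
        if isinstance(entry, list):
            # A nested block ends the current run of rules...
            if current:
                passes.append(current)
                current = []
            # ...and contributes its own passes in order.
            passes.extend(flatten(entry))
        else:
            current.append(entry)
    if current:
        passes.append(current)
    return passes


nested = ["rule1", ["rule2", ["rule3"], "rule4"], "rule5"]
flat = flatten(nested)
for rules in flat:
    print("::BEGIN;", " ".join(r + ";" for r in rules), "::END;")
```

For the nested example this prints five one-rule blocks in order, which corresponds to the expanded form shown above.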

  I could, of course, make the syntax more regular by requiring that
  normal conversion rules always have to appear inside ::BEGIN and ::END,
  but this breaks backward compatibility and makes everything more
  verbose.

  Re named blocks:

  I was going to argue here more or less the same way I was arguing
  against the filter thing: How often would I want to reset to the
  beginning of the string and apply some set of rules more than once?  If
  you did it more than once with different filters, it might make a little
  more sense, but even then, what it mainly gives you is a decrease in
  verbosity.

  But I didn't know (or had forgotten) about the ability to call a
  transliterator as part of the right-hand side of a conversion rule ("a
  $1 > b &any-tamil($1) ;").  Then you're applying the rules to a
  different string, and you might want to use the same set of rules in
  multiple places.  This seems potentially very useful, and it seems like
  a good argument for the ::BEGIN/::END syntax (people can always use
  "::Null;" as a separator anyway).

  I'm with you that doing that at the same time I'm doing everything
  seems like biting off more than I can chew, but I think it's worth using
  ::BEGIN/::END syntax to maintain forward compatibility with this.  It
  also solves the filter problem, since you could define the rule set and
  give it a name and then apply filters to the name the same way you can
  with any other ID call.

  Two other thoughts on named blocks: 1) They might be an argument for
  loosening the restriction on having ID calls inside ::BEGIN and ::END
  (can CompoundTransliterators nest?).  2) I'd probably be inclined to
  implement the named blocks by registering them with the framework--
  otherwise, you have to maintain a second name registry in
  TransliteratorParser (and have TransliteratorRuleParser get access to it
  to make the $ syntax work).  Is it worth the extra effort to have the
  namespace for named blocks be local?

  Thoughts?

  --Rich
