Menu

#125 compile-time/bin-size explosion for long patterns in transfer rules

closed
nobody
2017-05-26
2017-05-24
No
apertium-validate-transfer apertium-sme-nob.sme-nob.t1x
apertium-preprocess-transfer apertium-sme-nob.sme-nob.t1x sme-nob.t1x.bin

with the attached file gives a 6.3 bin.

Removing any pattern-item from the rule gives a ~10x smaller file.

Duplicating a pattern-item in the rule, e.g. det_mod gives a ~10x larger file.

apertium-preprocess-transfer takes unusually long and uses lots of memory.

mailing list post: https://sourceforge.net/p/apertium/mailman/message/35857037/

1 Attachments

Discussion

  • Kevin Brubeck Unhammer

    It seems it's simply a function of cat-items ⨯ pattern-items, no special *'s or anything like that needed.

    (I'm guessing, if we used lttoolbox instead of pcre to compile the patterns, this probably wouldn't be a problem?)

     
    • Kevin Brubeck Unhammer

      Actually, it seems we are using lttoolbox for the patterns (pcre is only used for attr_items, no idea why, but that's a different issue). But we are not minimizing the transducer! Adding a transducer.minimize() in TransferData::write before writing it takes the full sme-nob t1x down from 15M to 3.5M. Still big, but a lot better. Startup time for apertium-transfer is down from 3.5s to 0.5s, which is acceptable.

      However, simply calling minimize seems to change the transducer so it doesn't match any more :-(

      The reason minimising gives wrong behaviour is line 28 and 31 in trx_reader.cc, which record that the final state corresponds to this rule count:

            td.getTransducer().setFinal(*it);
            …
              td.getFinals()[*it] = count;
      

      So transfer actually numbers the final states, interpreting the nth final as pointing to the nth rule. E.g. we've parsed 33 rules from t1x, then on seeing the next one, the pattern items get turned into a new FST matching, say, ^(prn.ind|prn.ref.gen)$ ?^(prn.ind|prn.ref.gen)$ where the state after the last $ is, say, 102, and we record that finals[102]=34. But after minimisation, state numbers change, and we end up with just one final state, so this hack no longer works.

      Wouldn't it be better to have a special symbol at the end of the pattern, e.g. ɛ:RULE-34 (this would be the only symbol with an epsilon on the left, so you know it's special). Then we could safely call transducer.minimize(). Just doing this change, and removing the finals set from TransferData, reduces the size of the full sme-nob.t1x.bin further to 74K – however, I haven't yet wrapped my head around the stuff in transfer.cc which applies the FST, so I don't know if it's a viable solution yet.

       

      Last edit: Kevin Brubeck Unhammer 2017-05-24
  • Kevin Brubeck Unhammer

    • Description has changed:

    Diff:

    --- old
    +++ new
    @@ -11,3 +11,5 @@
     Duplicating a pattern-item in the rule, e.g. det_mod gives a ~10x *larger* file.
    
     apertium-preprocess-transfer takes unusually long and uses lots of memory.
    +
    +mailing list post: https://sourceforge.net/p/apertium/mailman/message/35857037/
    
     
  • Kevin Brubeck Unhammer

    • status: open --> closed
     
  • Kevin Brubeck Unhammer

    Closed in r78825 of apertium (requires r78821 of lttoolbox).

     

Log in to post a comment.

MongoDB Logo MongoDB