#309 hfst-regexp2fst loses weights for large regexes

Milestone: future
Status: open
Owner: nobody
Labels: None
Priority: 1
Updated: 2015-09-30
Created: 2015-09-01
Creator: sjurum
Private: No

The following regex compiles fine:

~~~~~~~~
[[? - "<CORR>"]*]
.o.
["" ->  [ "" |
    [ "ö"  "<CORR>"  ]::10.0 |
    [ "æ" "<CORR>" ]::10.0 |
    [ "-" "<CORR>" ]::110.0 |
    [ "z"  "<CORR>"  ]::10.0 ] ,,
"e" ->  [ "e" |
    [ "" "<CORR>" ]::10.0 |
    [ "æ" "<CORR>" ]::10.0 |
    [ "-" "<CORR>" ]::110.0 |
    [ "z" "<CORR>" ]::10.0 ] ,,
"æ" ->  [ "æ" |
    [ "" "<CORR>" ]::10.0 |
    [ "ö" "<CORR>" ]::10.0 |
    [ "z" "<CORR>" ]::10.0 ] ,,
"n" ->  [ "n" |
    [ "" "<CORR>" ]::10.0 |
    [ "ö" "<CORR>" ]::10.0 |
    [ "z" "<CORR>" ]::10.0 ] ,,
"o" ->  [ "o" |
    [ "" "<CORR>" ]::10.0 |
    [ "ö" "<CORR>" ]::10.0 |
    [ "z" "<CORR>" ]::10.0 ] ,,
"t" ->  [ "t" |
    [ "" "<CORR>" ]::10.0 |
    [ "ö" "<CORR>" ]::10.0 |
    [ "z" "<CORR>" ]::10.0 ] ,,
"Å" ->  [ "Å" |
    [ "" "<CORR>" ]::10.0 |
    [ "ö" "<CORR>" ]::10.0 |
    [ "z" "<CORR>" ]::10.0 ] ,,
"å" ->  [ "å" |
    [ "" "<CORR>" ]::10.0 |
    [ "ö" "<CORR>" ]::10.0 |
    [ "z" "<CORR>" ]::10.0 ] ,,
"Ä" ->  [ "Ä" |
    [ "" "<CORR>" ]::10.0 |
    [ "ö" "<CORR>" ]::10.0 |
    [ "z" "<CORR>" ]::10.0 ] ,,
"ø" ->  [ "ø" |
    [ "" "<CORR>" ]::10.0 |
    [ "ö" "<CORR>" ]::0.0 |
    [ "z" "<CORR>" ]::10.0 ] ,,
"Æ" ->  [ "Æ" |
    [ "" "<CORR>" ]::10.0 |
    [ "ö" "<CORR>" ]::10.0 |
    [ "z" "<CORR>" ]::10.0 ] ,,
"-" ->  [ "-" |
    [ "" "<CORR>" ]::10.0 |
    [ "ö" "<CORR>" ]::10.0 |
    [ "z" "<CORR>" ]::10.0 ]]
.o.
[[ [? - "<CORR>"]* ( "<CORR>":0 ) [? - "<CORR>"]* ]^1];
~~~~~~~~

Compile like:

~~~~~~~bash
$ hfst-regexp2fst -v -S test.regex | hfst-fst2fst -w -v -o test.hfstol
~~~~~~~~

And run like:

~~~~~~~bash
$ echo onterligksh | hfst-lookup test.hfstol
> onterligksh   onterligksh 0,000000
onterligksh nterligksh  10,000000
onterligksh oterligksh  10,000000
onterligksh onerligksh  10,000000
onterligksh ontrligksh  10,000000
[...]
onterligksh onterligk-sh    110,000000
onterligksh onterligks-h    110,000000
onterligksh onterligksh-    110,000000
~~~~~~~~

All weights seem to be fine and as expected.

Now repeat the above with the attached regex, and all weights are 0.0! The attached regex is made by the following make target:

~~~~~~~make
editdist.%.regex: editdist.%.txt
    $(AM_V_GEN)$(GTCORE)/scripts/editdist.py \
        --verbose \
        --swap \
        --epsilon='@0@' \
        --distance=$(DEFAULT_EDIT_DIST) \
        --default-weight=$(DEFAULT_WEIGHT) \
        --no-string-initial-correction \
        --regex -i $< \
        --output-file=$@

# DEFAULT_EDIT_DIST=1
# DEFAULT_WEIGHT=10
~~~~~~~~

Output from the (compiled) large regex:

~~~~~~~bash
$ echo onterligksh | hfst-lookup -v editdist.default.hfstol
onterligksh nterligksh 0,000000
onterligksh -nterligksh 0,000000
onterligksh -onterligksh 0,000000
onterligksh .nterligksh 0,000000
onterligksh .onterligksh 0,000000
onterligksh 0nterligksh 0,000000
onterligksh 0onterligksh 0,000000
onterligksh 1nterligksh 0,000000
onterligksh 1onterligksh 0,000000
onterligksh 2nterligksh 0,000000
onterligksh 2onterligksh 0,000000
onterligksh 3nterligksh 0,000000
onterligksh 3onterligksh 0,000000
onterligksh 4nterligksh 0,000000
onterligksh 4onterligksh 0,000000
onterligksh 5nterligksh 0,000000
onterligksh 5onterligksh 0,000000
~~~~~~~~~

I can't find any explanation for the difference in weights other than the size of the regex.

1 Attachment

Discussion

  • Sam Hardwick - 2015-09-16

    Verified bug. It could possibly have something to do with weight-zeroing when the parser thinks it's parsing a context, but I'm not sure yet.

     
    • sjurum - 2015-09-16

      Thanks for verifying. I'll just add that this bug has a rather detrimental effect on the speller suggestions (as can probably be imagined). This bug is a showstopper for us.

       
  • Erik Axelson - 2015-09-21

    This could have something to do with negative weights. At least the rule

    "i" -> [ "ï" "<CORR>" ]::-10.0

    makes the expression very slow to compile. I also remember that OpenFst sometimes handles negative weights in a strange way.
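
    A minimal sketch of the underlying danger, using a made-up cycle weight rather than the actual machine (toy Python only, not HFST or OpenFst code): once a loop in the transducer carries a negative total weight, every extra pass around it lowers the path weight further, so there is no cheapest path to converge to.

~~~~~~~python
# Toy illustration only. Assume the compiled machine contains a loop
# whose total weight is negative (here -10.0, borrowed from the rule above).
cycle_weight = -10.0

best = 0.0                       # weight of the path that skips the loop
for trips in range(1, 6):
    candidate = trips * cycle_weight
    best = min(best, candidate)  # tropical semiring: path "addition" is min
    print(trips, "pass(es) around the loop -> path weight", candidate)

print("current 'cheapest' weight:", best)
# Every extra pass is cheaper than the last, so the true minimum is
# unbounded and no cheapest path exists.
~~~~~~~~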

     

    Last edit: Erik Axelson 2015-09-21

  • Erik Axelson - 2015-09-29

    Cycles with negative weights are the reason; they break determinization (and minimization, which includes determinization). With such a cycle it is impossible to calculate the cheapest path (i.e. the path with the lowest weight) from a given state to a final state: every traversal of the negative-weight cycle lowers the cost, so the cheapest path would be infinitely long. The solution:

    • Remove epsilons
    • Find the smallest weight in the transducer; call it p
    • If p < 0, add -p to all weights so that the transducer has only positive or zero weights
    • Determinize
    • (Minimize)
    • Subtract -p from all weights to restore the original weights

    Epsilons must be removed before modifying weights, otherwise we get wrong results: adding a constant to every transition changes a path's weight by the constant times the number of transitions in the path, and epsilon removal changes that number. Consider e.g. the case

    0 1 a a 0
    1 2 ε ε -1
    2 0
    

    where the algorithm would produce the result

    0 1 a a 0
    1 0
    

    if epsilons were not removed first.
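
    A minimal sketch of the whole procedure on a toy transition-list representation (the helper names and data layout are hypothetical, not the actual HFST implementation). It reproduces the example above: removing epsilons before the shift keeps the weight of "a" at -1, while shifting first loses it.

~~~~~~~python
# Toy weighted transducer from the example above: arcs are
# (src, dst, in, out, weight); `finals` maps state -> final weight.
EPS = "@0@"
arcs = [(0, 1, "a", "a", 0.0), (1, 2, EPS, EPS, -1.0)]
finals = {2: 0.0}

def shift(arcs, c):
    """Add a constant c to every transition weight (final weights untouched)."""
    return [(s, d, i, o, w + c) for s, d, i, o, w in arcs]

def remove_epsilons(arcs, finals):
    """Toy epsilon removal: assumes every epsilon arc leads to a final state."""
    finals = dict(finals)
    kept = []
    for s, d, i, o, w in arcs:
        if i == EPS and o == EPS:
            finals[s] = min(finals.get(s, float("inf")), w + finals[d])
        else:
            kept.append((s, d, i, o, w))
    return kept, finals

def best_weight(arcs, finals, state=0, acc=0.0):
    """Cheapest accepting path weight (plain DFS; fine for this acyclic toy)."""
    best = acc + finals[state] if state in finals else float("inf")
    for s, d, i, o, w in arcs:
        if s == state:
            best = min(best, best_weight(arcs, finals, d, acc + w))
    return best

p = min(w for *_, w in arcs)        # smallest weight in the transducer: -1.0
print(best_weight(arcs, finals))    # -1.0, the true weight of "a"

# Correct order: epsilons out first, then shift, (determinize,) shift back.
a1, f1 = remove_epsilons(arcs, finals)
a1 = shift(a1, -p)                  # all weights now >= 0, safe to determinize
a1 = shift(a1, p)                   # restore the original weights
print(best_weight(a1, f1))          # -1.0, as expected

# Wrong order: shift first, remove epsilons afterwards.
a2, f2 = remove_epsilons(shift(arcs, -p), finals)
a2 = shift(a2, p)
print(best_weight(a2, f2))          # 0.0 -- the -1 has been lost
~~~~~~~~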

     

    Last edit: Erik Axelson 2015-09-29

  • Erik Axelson - 2015-09-30

    In svn revision 4460, weights are modified in the minimization and determinization functions as described above. The slowdown is barely noticeable.

     
    • sjurum - 2015-09-30

      Thanks a lot. I can confirm that the bug is fixed on my computer.