The following regex compiles fine:
[[? - "<CORR>"]*]
.o.
["" -> [ "" |
[ "ö" "<CORR>" ]::10.0 |
[ "æ" "<CORR>" ]::10.0 |
[ "-" "<CORR>" ]::110.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"e" -> [ "e" |
[ "" "<CORR>" ]::10.0 |
[ "æ" "<CORR>" ]::10.0 |
[ "-" "<CORR>" ]::110.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"æ" -> [ "æ" |
[ "" "<CORR>" ]::10.0 |
[ "ö" "<CORR>" ]::10.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"n" -> [ "n" |
[ "" "<CORR>" ]::10.0 |
[ "ö" "<CORR>" ]::10.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"o" -> [ "o" |
[ "" "<CORR>" ]::10.0 |
[ "ö" "<CORR>" ]::10.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"t" -> [ "t" |
[ "" "<CORR>" ]::10.0 |
[ "ö" "<CORR>" ]::10.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"Å" -> [ "Å" |
[ "" "<CORR>" ]::10.0 |
[ "ö" "<CORR>" ]::10.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"å" -> [ "å" |
[ "" "<CORR>" ]::10.0 |
[ "ö" "<CORR>" ]::10.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"Ä" -> [ "Ä" |
[ "" "<CORR>" ]::10.0 |
[ "ö" "<CORR>" ]::10.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"ø" -> [ "ø" |
[ "" "<CORR>" ]::10.0 |
[ "ö" "<CORR>" ]::0.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"Æ" -> [ "Æ" |
[ "" "<CORR>" ]::10.0 |
[ "ö" "<CORR>" ]::10.0 |
[ "z" "<CORR>" ]::10.0 ] ,,
"-" -> [ "-" |
[ "" "<CORR>" ]::10.0 |
[ "ö" "<CORR>" ]::10.0 |
[ "z" "<CORR>" ]::10.0 ]]
.o.
[[ [? - "<CORR>"]* ( "<CORR>":0 ) [? - "<CORR>"]* ]^1];
~~~~~~~~
Compile like:
~~~~~~~bash
$ hfst-regexp2fst -v -S test.regex | hfst-fst2fst -w -v -o test.hfstol
And run like:
$ echo onterligksh | hfst-lookup test.hfstol
> onterligksh onterligksh 0,000000
onterligksh nterligksh 10,000000
onterligksh oterligksh 10,000000
onterligksh onerligksh 10,000000
onterligksh ontrligksh 10,000000
[...]
onterligksh onterligk-sh 110,000000
onterligksh onterligks-h 110,000000
onterligksh onterligksh- 110,000000
All weights seems to be fine and as expected.
Now repeat the above with the attached regex, and all weights are 0.0! The attached regex is made by the following make target:
editdist.%.regex: editdist.%.txt
$(AM_V_GEN)$(GTCORE)/scripts/editdist.py \
--verbose \
--swap \
--epsilon='@0@' \
--distance=$(DEFAULT_EDIT_DIST) \
--default-weight=$(DEFAULT_WEIGHT) \
--no-string-initial-correction \
--regex -i $< \
--output-file=$@
# DEFAULT_EDIT_DIST=1
# DEFAULT_WEIGHT=10
Output from the (compiled) large regex:
~~~~~~~bash
$ echo onterligksh | hfst-lookup -v editdist.default.hfstol
onterligksh nterligksh 0,000000
onterligksh -nterligksh 0,000000
onterligksh -onterligksh 0,000000
onterligksh .nterligksh 0,000000
onterligksh .onterligksh 0,000000
onterligksh 0nterligksh 0,000000
onterligksh 0onterligksh 0,000000
onterligksh 1nterligksh 0,000000
onterligksh 1onterligksh 0,000000
onterligksh 2nterligksh 0,000000
onterligksh 2onterligksh 0,000000
onterligksh 3nterligksh 0,000000
onterligksh 3onterligksh 0,000000
onterligksh 4nterligksh 0,000000
onterligksh 4onterligksh 0,000000
onterligksh 5nterligksh 0,000000
onterligksh 5onterligksh 0,000000
~~~~~~~~~
I can't find any other explanation for the difference in weights except the size of the regex.
Verified bug. It could possibly have something to do with weight-zeroing when the parser thinks it's parsing a context, but I'm not sure yet.
Thanks for verifying. I'll just add that this bug has a rather detrimental effect on the speller suggestions (as can probably be imagined). This bug is a showstopper for us.
This could have something to do with negative weights. At least the rule
makes the expression very slow to compile. I also remember that Openfst sometimes handles negative weights in a strange way.
Last edit: Erik Axelson 2015-09-21
Cycles with negative weights in determinization (and in minimization as it includes determinization) are the reason. It is impossible to calculate the cheapest path (i.e. path having the lowest weight) from a given state to a final state. Every traversal of the cycle with a negative weight lowers the cost, so the cheapest path would be infinitely long. Solution:
Epsilons must be removed before modifying weights, else we will get wrong results. Consider e.g. the case
where the algorithm would produce the result
if epsilons were not removed first.
Last edit: Erik Axelson 2015-09-29
In svn revision 4460, weights are modified in minimization and determinization functions as described above. The slowdown is barely noticeable.
Thanks a lot. I can confirm that the bug is fixed on my computer.