Re: [Kaldi-users] fstdeterminizestar (L*G) never completes


It turns out the problem was probably caused by the end-of-sentence
symbol </s> appearing in inappropriate places in the LM, at the start
of n-grams rather than the end.  Probably the training data was
contaminated somehow by </s>.
Dan
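
The checks suggested in this thread (no </s> except at the end of an n-gram, no <s> except at the start, no <eps> or #0 used as words, no stray carriage returns) can be scripted and run before arpa2fst. Below is a minimal sketch, assuming a standard ARPA file with "\N-grams:" section headers. The function name check_arpa_ngrams and its output format are made up for illustration; they are not part of Kaldi or SRILM.

```shell
# Hypothetical helper (not a Kaldi or SRILM tool): scan an ARPA LM for
# token patterns mentioned in this thread as likely to break G.fst.
check_arpa_ngrams() {
  awk '
    function report(msg) { printf "line %d: %s: %s\n", NR, msg, $0 }

    # A carriage return anywhere can make one word look like two.
    index($0, "\r") { report("stray carriage return") }

    # "\2-grams:" starts the 2-gram section; remember the order n.
    /^\\[0-9]+-grams:/ { n = substr($1, 2) + 0; next }

    # Any other backslash line (\data\, \end\) is outside an n-gram section.
    /^\\/ { n = 0; next }

    # Skip blank lines and anything too short to hold n words.
    n == 0 || NF < n + 1 { next }

    {
      # Field 1 is the log-probability; fields 2..n+1 are the words.
      for (i = 2; i <= n + 1; i++) {
        if ($i == "</s>" && i < n + 1) report("</s> not in final position")
        if ($i == "<s>" && i > 2)      report("<s> not in first position")
        if ($i == "<eps>" || $i == "#0") report("reserved symbol " $i)
      }
    }
  ' "$@"
}
```

Usage would be something like "check_arpa_ngrams lm.arpa" or "gunzip -c lm.arpa.gz | check_arpa_ngrams". No output only means none of these patterns were found; it does not by itself guarantee that G.fst is determinizable.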

On Tue, Jun 16, 2015 at 2:07 PM, Daniel Povey <dp...@gm...> wrote:
>> I am currently trying to get a minimal reproduction with a script. Let it run for a while. I'll send you what remains of it, and hope it might give me an idea too.
>>
>> Looks like that fstdeterminize may have completed on this grammar (how do you call the thing symbolized as $G$? "grammar" sounded confusing, as I understand, but I have no other word not exceeding 2 syllables :))
>
> I would call it an LM.
>
>>> I have left one running by mistake before going to sleep, and it was done. I am running one again with the time command to make sure this is not a fluke. So it is possible that it is not exactly non-determinizable, but instead takes enormous time (hours on one LM, < 1 sec on another). Which is the same thing from the engineering standpoint, close enough, as those engineering vs mathematics jokes go. But jokes aside, I want something more bounded for a production system, so I need to understand what throws it off so badly.
>
> I would still call it a problem.  Check if your ARPA contains <eps> or
> #0.  I may need to add checks for this into arpa2fst (which we will
> rewrite at some point anyway).  Another problem could be weird things
> like stray \r's which make one word seem like two in some
> circumstances.
> If I saw the output of arpa2fst I could probably figure out fairly
> quickly what the problem was.  The way I would debug this is to trace
> through your LM FST from the start and follow those symbols (or
> epsilons) on that trace from the determinization failure, and see how
> there are two different paths.
> It's better if you share a couple different traces, not just one, so
> we can see what's in common.
>
>> Is fstdeterminizestar more than fstrmepsilon ∘  fstdeterminize (the latter with the kaldi patch)?
>
> No, it should be faster.   fstrmepsilon ∘  fstdeterminize should fail too.
>
>> Ah, and this is a Linux machine. So everything looks very very standard (oops. Did I just create an infinite loop by repeating a word?).
>
> I am considering changing the way the LM disambig symbols are used to
> make this kind of problem less likely to happen in future, by having
> several disambig symbols for the LM, one per order, instead of just
> one.
>
> Dan
>
>
>
>>> -----Original Message-----
>>> From: Daniel Povey [mailto:dp...@gm...]
>>> Sent: 2015-06-15 2340
>>> To: Kirill Katsnelson
>>> Cc: kal...@li...
>>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>>
>>> In general SRILM language models are OK, but something weird could have
>>> happened, especially on an unusual platform like Windows.
>>> Look for duplicate lines that apparently contain the same n-gram, and also
>>> send to me (but not to kaldi-user) the arpa LM.
>>> Dan
>>>
>>>
>>> On Tue, Jun 16, 2015 at 2:03 AM, Kirill Katsnelson
>>> <kir...@sm...> wrote:
>>> > Nope. The only thing I am thinking of doing is to bisect it somehow, to get a minimal grammar that still refuses to determinize. I tried different smoothing and played with other switches to ngram-count, but it still loops. Are there any known problems with SRILM-generated models?
>>> >
>>> >  -kkm
>>> >
>>> >> -----Original Message-----
>>> >> From: Daniel Povey [mailto:dp...@gm...]
>>> >> Sent: 2015-06-15 2248
>>> >> To: Kirill Katsnelson
>>> >> Cc: kal...@li...
>>> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>> >>
>>> >> OOVs should be OK.
>>> >> Make sure there are no n-grams with things like <s> <s>
>>> >>
>>> >> e.g. see the lines
>>> >>     grep -v '<s> <s>' | \
>>> >>     grep -v '</s> <s>' | \
>>> >>     grep -v '</s> </s>' | \
>>> >>
>>> >> in the WSJ script:
>>> >>
>>> >>  gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \
>>> >>     grep -v '<s> <s>' | \
>>> >>     grep -v '</s> <s>' | \
>>> >>     grep -v '</s> </s>' | \
>>> >>     arpa2fst - | fstprint | \
>>> >>     utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \
>>> >>     utils/eps2disambig.pl | utils/s2eps.pl | \
>>> >>     fstcompile --isymbols=$test/words.txt --osymbols=$test/words.txt \
>>> >>       --keep_isymbols=false --keep_osymbols=false | \
>>> >>     fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
>>> >>
>>> >> Dan
>>> >>
>>> >>
>>> >> On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson
>>> >> <kir...@sm...> wrote:
>>> >> > Bingo. G.fst is not determinizable (the "good" G.fst takes under a
>>> >> > second to determinize). And the bad one loops at the word "zero"
>>> >> > like this
>>> >> >
>>> >> > #0
>>> >> > unsure  unsure
>>> >> > #0
>>> >> > of      of
>>> >> > #0
>>> >> > yours   yours
>>> >> > #0
>>> >> > is      is
>>> >> > #0
>>> >> > your    your
>>> >> > #0
>>> >> > zip     zip
>>> >> > #0
>>> >> > wrong   wrong
>>> >> > #0
>>> >> > with    with
>>> >> > #0
>>> >> > zero    zero
>>> >> > #0
>>> >> > zero    zero
>>> >> > ....
>>> >> >
>>> >> > I am taking the LM straight from ngram-count to the standard pipeline, nothing fancy. The only thing is it has a lot of OOVs:
>>> >> >
>>> >> > remove_oovs.pl: removed 4646 lines.
>>> >> >
>>> >> > Is this generally a problem? So does my "good" arpa LM. I grepped both for the word zero, but could not spot anything outrageous. Can you think of anything I can look for?
>>> >> >
>>> >> > My source is no more than 10 days old. Here's the pipeline, just in case.
>>> >> >
>>> >> > cat $src/$arpalm | tr -d '\r' | \
>>> >> >   utils/find_arpa_oovs.pl $lang/words.txt  > $lang/lm_oovs.txt
>>> >> >
>>> >> > cat $src/$arpalm | tr -d '\r' | \
>>> >> >   arpa2fst - | fstprint | \
>>> >> >   utils/remove_oovs.pl $lang/lm_oovs.txt | \
>>> >> >   utils/eps2disambig.pl | utils/s2eps.pl | \
>>> >> >   fstcompile --isymbols=$lang/words.txt --osymbols=$lang/words.txt \
>>> >> >     --keep_isymbols=false --keep_osymbols=false | \
>>> >> >   fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst
>>> >> >
>>> >> >  -kkm
>>> >> >
>>> >> >
>>> >> >> -----Original Message-----
>>> >> >> From: Daniel Povey [mailto:dp...@gm...]
>>> >> >> Sent: 2015-06-15 2206
>>> >> >> To: Kirill Katsnelson
>>> >> >> Cc: kal...@li...
>>> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
>>> >> >> completes
>>> >> >>
>>> >> >> I don't recommend looking at the fstdeterminizestar algorithm itself;
>>> >> >> it's very complicated.  Instead focus on the definition of
>>> >> >> "determinizable" and the twins property, and figure out what path you
>>> >> >> are taking through L.fst and G.fst.  Trying to fstdeterminizestar
>>> >> >> G.fst directly, and seeing whether it terminates or not, may tell you
>>> >> >> something; if it fails, send the signal and see what happens.
>>> >> >> fstdeterminizestar does care about the weights, but only to the
>>> >> >> extent that they are the same or different from each other; and if
>>> >> >> your G.fst is generated from arpa2fst the pipeline should work for
>>> >> >> any ARPA-format language model.  Make sure you are using an up-to-date
>>> >> >> Kaldi though; there have been fixes as recently as a few months ago.
>>> >> >> The presence of SIL is not surprising, it is the optional silence
>>> >> >> added by the lexicon.  I think that script is adding #16 if it does
>>> >> >> *not* take the optional silence, otherwise it adds the phone SIL.
>>> >> >> Since you are calling your FST a "grammar" I'm wondering whether
>>> >> >> you have done something fancy with mapping words to FSTs or
>>> >> >> something like that, which is causing the result to not be
>>> >> >> determinizable.
>>> >> >>
>>> >> >> Dan
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson
>>> >> >> <kir...@sm...> wrote:
>>> >> >> > Thank you very much for your help Dan, but I am still stuck.
>>> >> >> >
>>> >> >> > First of all, a question: does the fstdeterminizestar algorithm depend on actual backoff and n-gram probabilities, i.e. will it behave differently if the numbers in the ARPA model file are different? Or does it depend only on arc labels but not weights? I am looking at the code but certainly I am far from being able to understand it. I cheated by looking at all the if conditions in it, and this one in EpsilonClosure is seemingly the only one dealing with weights:
>>> >> >> >
>>> >> >> >             if (! ApproxEqual(weight, iter->second.weight, delta_)) { // add extra part of weight to queue.
>>> >> >> >
>>> >> >> > (In ProcessFinal it also has "if (this_final_weight != Weight::Zero())" but I do not believe it is relevant?)
>>> >> >> >
>>> >> >> > I am trying to understand how to dig into the problem: are weights in the picture at all?
>>> >> >> >
>>> >> >> > Also, just for a test, I ran the grammar through a "grep -v 'real real'", and indeed got a similar loop on the word "very", which is also often repeated. But the "real real" 2- and 3-grams are there in the "good" grammar too.
>>> >> >> >
>>> >> >> > Another thing I do not understand is the presence of the SIL ilabel in the backtrace. Here's the beginning of the trace that leads to the infinite loop, as decoded with a little script I wrote (format is ilabel [ TAB olabel ]):
>>> >> >> >
>>> >> >> > #16
>>> >> >> > #0
>>> >> >> > V_B
>>> >> >> > Y_I
>>> >> >> > UW1_I
>>> >> >> > Z_E     views
>>> >> >> > #2
>>> >> >> > SIL
>>> >> >> > #0
>>> >> >> > AH0_B
>>> >> >> > N_I
>>> >> >> > SH_I    unsure
>>> >> >> > UH1_I
>>> >> >> > R_E
>>> >> >> >
>>> >> >> > Note the presence of SIL at line 8. This is not in lexicon:
>>> >> >> >
>>> >> >> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
>>> >> >> > !SIL    1       0.20    1.00    1.00    SIL_S
>>> >> >> > $
>>> >> >> >
>>> >> >> > Is this a hint? How did it get there at all? I am using a standard script to build the L_disambig.fst:
>>> >> >> >
>>> >> >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
>>> >> >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
>>> >> >> > utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
>>> >> >> >               data/local/dict/silprob.txt $silphone '#'$ndisambig | \
>>> >> >> >      fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
>>> >> >> >      --keep_isymbols=false --keep_osymbols=false |   \
>>> >> >> >      fstaddselfloops  "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
>>> >> >> >      fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
>>> >> >> >
>>> >> >> > I checked the lexicon, and there are indeed only real phones at the beginning of each word, no empty positions and no #N symbols.
>>> >> >> >
>>> >> >> >  -kkm
>>> >> >> >
>>> >> >> >> -----Original Message-----
>>> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>>> >> >> >> Sent: 2015-06-15 1944
>>> >> >> >> To: Kirill Katsnelson
>>> >> >> >> Cc: kal...@li...
>>> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
>>> >> >> >> completes
>>> >> >> >>
>>> >> >> >> I think the confusion is probably between two loops with "real" on
>>> >> >> >> them in G.fst: one loop where you always take the bigram probability,
>>> >> >> >> and one where you always take the unigram probability.  Or maybe a
>>> >> >> >> similar confusion between a loop where you use the trigram "real real
>>> >> >> >> real" and the bigram "real real".  Those loops are expected to exist.
>>> >> >> >> Probably the issue is that something happened at the start of
>>> >> >> >> the sequence which caused the FST to be confused about which of those
>>> >> >> >> two states it was in.  If you have any empty words (words with empty
>>> >> >> >> pronunciation) in your lexicon this could possibly happen, as
>>> >> >> >> it would be confused between taking a normal word, then the backoff
>>> >> >> >> symbol, vs. taking a normal word, then the empty word, then the
>>> >> >> >> backoff symbol.
>>> >> >> >> I think the current Kaldi graph-creation scripts check for empty words
>>> >> >> >> in the lexicon, for this reason.
>>> >> >> >>
>>> >> >> >> Dan
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( ) generally almost makes sense, given that #16 is the last one in the table, the silence disambiguation symbol. (Not sure why "real" is emitted at L_E--I would rather expect it to be emitted at #1.) What I do not understand is what exactly the debug trace represents, and what I should make of it. It is a path through the FST graph, but I do not understand what this path is exactly, and what this endless walk of the loop means.
>>> >> >> >> >
>>> >> >> >> >  -kkm
>>> >> >> >> >
>>> >> >> >> >> -----Original Message-----
>>> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>>> >> >> >> >> Sent: 2015-06-15 1858
>>> >> >> >> >> To: Kirill Katsnelson
>>> >> >> >> >> Cc: kal...@li...
>>> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
>>> >> >> >> >> completes
>>> >> >> >> >>
>>> >> >> >> >> Look into the "backoff disambiguation symbol", normally called #0.
>>> >> >> >> >> The reason why it is needed should be explained in the hbka.pdf paper.
>>> >> >> >> >> Dan
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
>>> >> >> >> >> <kir...@sm...> wrote:
>>> >> >> >> >> > Thank you! The output consists of some sequences as you described, quickly falling into a short, endlessly repeated loop.
>>> >> >> >> >> >
>>> >> >> >> >> > The non-repeated section ends with the osymbols (excluding epsilons) "whatsoever on vacation up", and then the repeated part looks like "#1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The word "real" is spelled "R_B IY1_I L_E #1" in L_disambig.
>>> >> >> >> >> >
>>> >> >> >> >> > Both LMs contain a bigram for "vacation up" and a trigram "vacation up there". "up real" is a bigram in both, with 3-grams "up real quick" and "up real quickly". "up real" is also a tail of a few other 3-grams, but these are also the same in both models (up to their weights).
>>> >> >> >> >> >
>>> >> >> >> >> > It looks like I do not understand what I should make of this debug data in the end :(
>>> >> >> >> >> >
>>> >> >> >> >> >  -kkm
>>> >> >> >> >> >
>>> >> >> >> >> >> -----Original Message-----
>>> >> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>>> >> >> >> >> >> Sent: 2015-06-15 1821
>>> >> >> >> >> >> To: Kirill Katsnelson
>>> >> >> >> >> >> Cc: kal...@li...
>>> >> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G)
>>> never
>>> >> >> >> >> >> completes
>>> >> >> >> >> >>
>>> >> >> >> >> >> > I have a small set of sentences with repeat counts, and am generating an LM out of it. One is generated by a horrible local tool; I have trouble tracing exactly how. For this one L*G composition takes about 20 seconds on my CPU. Another LM I just generated out of the same files with srilm 1.7.1 ngram-count. This one has been sitting in mkgraphs.sh on the L_disambig*G composition step for about 30 minutes, and is still churning. fstdeterminizestar --use-log=true is running at 100%. L_disambig.fst is the same file in both cases. It looks like G is making it not determinizable, although I have no idea how it came to be.
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > Could anyone share advice on tracking down the problem? Thanks.
>>> >> >> >> >> >>
>>> >> >> >> >> >> You can send a signal to that program like  kill -SIGUSR1
>>> >> >> >> >> >> process-id and it will print out some info about the
>>> >> >> >> >> >> symbol sequences involved, I think it is like
>>> >> >> >> >> >>  isymbol1 (osymbol1)  isymbol2 (osymbol2) and so on.
>>> >> >> >> >> >> Usually there is a particular word sequence that is problematic.
>>> >> >> >> >> >> Dan
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >> >
>>> >> >> >> >> >> >  -kkm
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > ------------------------------------------------------------------
>>> >> >> >> >> >> > _______________________________________________
>>> >> >> >> >> >> > Kaldi-users mailing list
>>> >> >> >> >> >> > Kal...@li...
>>> >> >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users