Re: [Kaldi-users] fstdeterminizestar (L*G) never completes

I am currently trying to get a minimal reproduction with a script. Let it run for a while. I'll send you what remains of it, and hope it might give me an idea too.

It looks like fstdeterminize may have completed on this grammar (what do you call the thing symbolized as $G$? "Grammar" sounded confusing, as I understand, but I have no other word of two syllables or fewer :)). I left one running by mistake before going to sleep, and it had finished. I am running it again under the time command to make sure this was not a fluke. So it is possible that G is not strictly non-determinizable, but instead takes an enormous amount of time (hours on one LM, < 1 sec on another). From an engineering standpoint that is close enough to the same thing, as those engineering vs. mathematics jokes go. But jokes aside, I want something more bounded for a production system, so I need to understand what throws it off so badly.
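
A minimal sketch of one way to bound such a run, assuming GNU coreutils `timeout` is available. The function name `run_bounded` is made up for illustration, and the commented-out fstdeterminizestar call assumes G.fst is in the current directory and the Kaldi binaries are on PATH:

```shell
# Hypothetical helper: run a command under a wall-clock limit.
# GNU `timeout` exits with status 124 when the limit is reached.
run_bounded() {
  rb_limit=$1
  shift
  timeout "$rb_limit" "$@"
  rb_status=$?
  if [ "$rb_status" -eq 124 ]; then
    echo "timed out after $rb_limit" >&2
  fi
  return "$rb_status"
}

# Example (not run here; needs Kaldi and a G.fst):
# time run_bounded 10m fstdeterminizestar G.fst /dev/null
```

A status of 124 would then mean "did not finish within the limit", which is the practical signal a production script needs, whether or not G is determinizable in the strict sense.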

Is fstdeterminizestar more than fstrmepsilon ∘  fstdeterminize (the latter with the kaldi patch)? 

Ah, and this is a Linux machine. So everything looks very very standard (oops. Did I just create an infinite loop by repeating a word?).

 -kkm

> -----Original Message-----
> From: Daniel Povey [mailto:dp...@gm...]
> Sent: 2015-06-15 2340
> To: Kirill Katsnelson
> Cc: kal...@li...
> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> 
> In general SRILM language models are OK, but something weird could have
> happened, especially on an unusual platform like Windows.
> Look for duplicate lines that apparently contain the same n-gram, and
> also send the ARPA LM to me (but not to kaldi-users).
> Dan
> 
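
The duplicate check suggested above can be sketched as follows. This assumes the usual ARPA layout, with tab-separated fields (log-probability, n-gram words, optional backoff weight) inside the `\N-grams:` sections; the function name `find_dup_ngrams` is hypothetical:

```shell
# Print every n-gram that appears more than once in an ARPA file.
# Assumes tab-separated fields: logprob <TAB> words [<TAB> backoff].
find_dup_ngrams() {
  tr -d '\r' < "$1" | awk -F'\t' '
    /^\\[0-9]+-grams:/ { in_ngrams = 1; next }  # start of an n-gram section
    /^\\end\\/         { in_ngrams = 0 }        # end-of-model marker
    in_ngrams && NF >= 2 {
      if (seen[$2]++) print $2                  # second and later occurrences
    }'
}
```

Usage would be `find_dup_ngrams lm.arpa`, where `lm.arpa` is a placeholder file name; empty output means no n-gram is listed twice.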
> 
> On Tue, Jun 16, 2015 at 2:03 AM, Kirill Katsnelson
> <kir...@sm...> wrote:
> > Nope. The only thing I am thinking of doing is to bisect it somehow,
> > to get a minimal grammar that still refuses to determinize. I tried
> > different smoothing and played with other switches to ngram-count, but
> > it still loops. Are there any known problems with SRILM-generated
> > models?
> >
> >  -kkm
> >
> >> -----Original Message-----
> >> From: Daniel Povey [mailto:dp...@gm...]
> >> Sent: 2015-06-15 2248
> >> To: Kirill Katsnelson
> >> Cc: kal...@li...
> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> >>
> >> OOVs should be OK.
> >> Make sure there are no n-grams with things like <s> <s>
> >>
> >> e.g. see the lines
> >>     grep -v '<s> <s>' | \
> >>     grep -v '</s> <s>' | \
> >>     grep -v '</s> </s>' | \
> >>
> >> in the WSJ script:
> >>
> >>  gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \
> >>     grep -v '<s> <s>' | \
> >>     grep -v '</s> <s>' | \
> >>     grep -v '</s> </s>' | \
> >>     arpa2fst - | fstprint | \
> >>     utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \
> >>     utils/eps2disambig.pl | utils/s2eps.pl | \
> >>     fstcompile --isymbols=$test/words.txt --osymbols=$test/words.txt \
> >>       --keep_isymbols=false --keep_osymbols=false | \
> >>     fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
> >>
> >> Dan
> >>
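
Before adding the three grep filters, it may help to see whether the ARPA file contains such n-grams at all. A small sketch, with a hypothetical function name `count_boundary_ngrams`, which only counts matching lines and does not modify the file:

```shell
# Count ARPA lines containing the boundary-symbol sequences that the
# WSJ script filters out. Prints 0 when the file has none.
count_boundary_ngrams() {
  tr -d '\r' < "$1" |
    grep -c -e '<s> <s>' -e '</s> <s>' -e '</s> </s>' || true
}
```

The `|| true` is there because `grep -c` exits with a non-zero status when the count is 0, which would otherwise abort a script running under `set -e`.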
> >>
> >> On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson
> >> <kir...@sm...> wrote:
> >> > Bingo. G.fst is not determinizable (the "good" G.fst takes under a
> >> > second to determinize). And the bad one loops at the word "zero"
> >> > like this
> >> >
> >> > #0
> >> > unsure  unsure
> >> > #0
> >> > of      of
> >> > #0
> >> > yours   yours
> >> > #0
> >> > is      is
> >> > #0
> >> > your    your
> >> > #0
> >> > zip     zip
> >> > #0
> >> > wrong   wrong
> >> > #0
> >> > with    with
> >> > #0
> >> > zero    zero
> >> > #0
> >> > zero    zero
> >> > ....
> >> >
> >> > I am taking the LM straight from ngram-count to the standard
> >> > pipeline, nothing fancy. The only thing is it has a lot of OOVs:
> >> >
> >> > remove_oovs.pl: removed 4646 lines.
> >> >
> >> > Is this generally a problem? My "good" ARPA LM has them too. I grepped
> >> > both for the word "zero", but could not spot anything outrageous. Can
> >> > you think of anything I can look for?
> >> >
> >> > My source is no more than 10 days old. Here's the pipeline, just in
> >> > case.
> >> >
> >> > cat $src/$arpalm | tr -d '\r' | \
> >> >   utils/find_arpa_oovs.pl $lang/words.txt  > $lang/lm_oovs.txt
> >> >
> >> > cat $src/$arpalm | tr -d '\r' | \
> >> >   arpa2fst - | fstprint | \
> >> >   utils/remove_oovs.pl $lang/lm_oovs.txt | \
> >> >   utils/eps2disambig.pl | utils/s2eps.pl | \
> >> >   fstcompile --isymbols=$lang/words.txt --osymbols=$lang/words.txt \
> >> >     --keep_isymbols=false --keep_osymbols=false | \
> >> >   fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst
> >> >
> >> >  -kkm
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> Sent: 2015-06-15 2206
> >> >> To: Kirill Katsnelson
> >> >> Cc: kal...@li...
> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> completes
> >> >>
> >> >> I don't recommend looking at the fstdeterminizestar algorithm
> >> >> itself; it's very complicated.  Instead focus on the definition of
> >> >> "determinizable" and the twins property, and figure out what path
> >> >> you are taking through L.fst and G.fst.  Trying to fstdeterminizestar
> >> >> G.fst directly, and seeing whether it terminates or not, may tell
> >> >> you something; if it fails, send the signal and see what happens.
> >> >> fstdeterminizestar does care about the weights, but only to the
> >> >> extent that they are the same or different from each other; and if
> >> >> your G.fst is generated from arpa2fst the pipeline should work for
> >> >> any ARPA-format language model. Make sure you are using an
> >> >> up-to-date Kaldi though; there have been fixes as recently as a few
> >> >> months ago.
> >> >> The presence of SIL is not surprising: it is the optional silence
> >> >> added by the lexicon.  I think that script is adding #16 if it does
> >> >> *not* take the optional silence, otherwise it adds the phone SIL.
> >> >> Since you are calling your FST a "grammar" I'm wondering whether
> >> >> you have done something fancy with mapping words to FSTs or
> >> >> something like that, which is causing the result to not be
> >> >> determinizable.
> >> >>
> >> >> Dan
> >> >>
> >> >>
> >> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson
> >> >> <kir...@sm...> wrote:
> >> >> > Thank you very much for your help Dan, but I am still stuck.
> >> >> >
> >> >> > First of all, a question: does the fstdeterminizestar algorithm
> >> >> > depend on actual backoff and n-gram probabilities, i.e. will it
> >> >> > behave differently if the numbers in the ARPA model file are
> >> >> > different? Or does it depend only on arc labels but not weights? I
> >> >> > am looking at the code, but I am certainly far from being able to
> >> >> > understand it. I cheated by looking at all the if conditions in it,
> >> >> > and this one in EpsilonClosure is seemingly the only one dealing
> >> >> > with weights:
> >> >> >
> >> >> >             if (! ApproxEqual(weight, iter->second.weight, delta_)) { // add extra part of weight to queue.
> >> >> >
> >> >> > (In ProcessFinal it also has "if (this_final_weight !=
> >> >> > Weight::Zero())" but I do not believe it is relevant?)
> >> >> >
> >> >> > I am trying to understand how to dig into the problem: are weights
> >> >> > in the picture at all?
> >> >> >
> >> >> > Also, just for a test, I ran the grammar through a
> >> >> > "grep -v 'real real'", and indeed got a similar loop on the word
> >> >> > "very", which is also often repeated. But the "real real" 2- and
> >> >> > 3-grams are there in the "good" grammar too.
> >> >> >
> >> >> > Another thing I do not understand is the presence of the SIL ilabel
> >> >> > in the backtrace. Here's the beginning of the trace that leads to
> >> >> > the infinite loop, as decoded with a little script I wrote (format
> >> >> > is ilabel [ TAB olabel ]):
> >> >> >
> >> >> > #16
> >> >> > #0
> >> >> > V_B
> >> >> > Y_I
> >> >> > UW1_I
> >> >> > Z_E     views
> >> >> > #2
> >> >> > SIL
> >> >> > #0
> >> >> > AH0_B
> >> >> > N_I
> >> >> > SH_I    unsure
> >> >> > UH1_I
> >> >> > R_E
> >> >> >
> >> >> > Note the presence of SIL at line 8. This is not in the lexicon:
> >> >> >
> >> >> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
> >> >> > !SIL    1       0.20    1.00    1.00    SIL_S
> >> >> > $
> >> >> >
> >> >> > Is this a hint? How did it get there at all? I am using a standard
> >> >> > script to build the L_disambig.fst:
> >> >> >
> >> >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
> >> >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
> >> >> > utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
> >> >> >               data/local/dict/silprob.txt $silphone '#'$ndisambig | \
> >> >> >      fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
> >> >> >      --keep_isymbols=false --keep_osymbols=false |   \
> >> >> >      fstaddselfloops  "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
> >> >> >      fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
> >> >> >
> >> >> > I checked the lexicon, and there are indeed only real phones at
> >> >> > the beginning of each word, no empty positions and no #N symbols.
> >> >> >
> >> >> >  -kkm
> >> >> >
> >> >> >> -----Original Message-----
> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> >> Sent: 2015-06-15 1944
> >> >> >> To: Kirill Katsnelson
> >> >> >> Cc: kal...@li...
> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> >> completes
> >> >> >>
> >> >> >> I think the confusion is probably between two loops with "real"
> >> >> >> on them in G.fst: one loop where you always take the bigram
> >> >> >> probability, and one where you always take the unigram
> >> >> >> probability.  Or maybe a similar confusion between a loop where
> >> >> >> you use the trigram "real real real" and the bigram "real real".
> >> >> >> Those loops are expected to exist. Probably the issue is that
> >> >> >> something happened at the start of the sequence which caused the
> >> >> >> FST to be confused about which of those two states it was in.  If
> >> >> >> you have any empty words (words with empty pronunciation) in your
> >> >> >> lexicon this could possibly happen, as it would be confused
> >> >> >> between taking a normal word, then the backoff symbol, vs. taking
> >> >> >> a normal word, then the empty word, then the backoff symbol.
> >> >> >> I think the current Kaldi graph-creation scripts check for empty
> >> >> >> words in the lexicon, for this reason.
> >> >> >>
> >> >> >> Dan
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( )
> >> >> >> > generally almost makes sense, given that #16 is the last one in
> >> >> >> > the table, the silence disambiguation symbol. (Not sure why
> >> >> >> > "real" is emitted at L_E; I would rather expect it to be emitted
> >> >> >> > at #1.) What I do not understand is what exactly the debug trace
> >> >> >> > represents, and what I should make of it. It is a path through
> >> >> >> > the FST graph, but I do not understand what exactly this path is,
> >> >> >> > and what this endless walk around the loop means.
> >> >> >> >
> >> >> >> >  -kkm
> >> >> >> >
> >> >> >> >> -----Original Message-----
> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> >> >> Sent: 2015-06-15 1858
> >> >> >> >> To: Kirill Katsnelson
> >> >> >> >> Cc: kal...@li...
> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> >> >> completes
> >> >> >> >>
> >> >> >> >> Look into the "backoff disambiguation symbol", normally called
> >> >> >> >> #0. The reason why it is needed should be explained in the
> >> >> >> >> hbka.pdf paper.
> >> >> >> >> Dan
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
> >> >> >> >> <kir...@sm...> wrote:
> >> >> >> >> > Thank you! The output consists of some sequences as you
> >> >> >> >> > described, quickly falling into a short, endlessly repeated
> >> >> >> >> > loop.
> >> >> >> >> >
> >> >> >> >> > The non-repeated section ends with the osymbols (excluding
> >> >> >> >> > epsilons) "whatsoever on vacation up", and then the repeated
> >> >> >> >> > part looks like "#1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E
> >> >> >> >> > (real)". The word "real" is spelled "R_B IY1_I L_E #1" in
> >> >> >> >> > L_disambig.
> >> >> >> >> >
> >> >> >> >> > Both LMs contain a bigram for "vacation up" and a trigram
> >> >> >> >> > "vacation up there". "up real" is a bigram in both, with
> >> >> >> >> > 3-grams "up real quick" and "up real quickly". "up real" is
> >> >> >> >> > also a tail of a few other 3-grams, but these are also the
> >> >> >> >> > same in both models (up to their weights).
> >> >> >> >> >
> >> >> >> >> > It looks like I do not understand what I should make of this
> >> >> >> >> > debug data in the end :(
> >> >> >> >> >
> >> >> >> >> >  -kkm
> >> >> >> >> >
> >> >> >> >> >> -----Original Message-----
> >> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> >> >> >> Sent: 2015-06-15 1821
> >> >> >> >> >> To: Kirill Katsnelson
> >> >> >> >> >> Cc: kal...@li...
> >> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G)
> never
> >> >> >> >> >> completes
> >> >> >> >> >>
> >> >> >> >> >> > I have a small set of sentences with repeat counts, and I
> >> >> >> >> >> > am generating an LM out of it. One is generated by a
> >> >> >> >> >> > horrible local tool, and I have trouble tracing exactly
> >> >> >> >> >> > how. For this one, L*G composition takes about 20 seconds
> >> >> >> >> >> > on my CPU. Another LM I just generated out of the same
> >> >> >> >> >> > files with srilm 1.7.1 ngram-count. This one has been
> >> >> >> >> >> > sitting in mkgraphs.sh on the L_disambig*G composition step
> >> >> >> >> >> > for about 30 minutes, and is still churning.
> >> >> >> >> >> > fstdeterminizestar --use-log=true is running at 100%.
> >> >> >> >> >> > L_disambig.fst is the same file in both cases. It looks
> >> >> >> >> >> > like G is what makes it not determinizable, although I have
> >> >> >> >> >> > no idea how that came to be.
> >> >> >> >> >> >
> >> >> >> >> >> > Could anyone share advice on tracking down the problem?
> >> >> >> >> >> > Thanks.
> >> >> >> >> >>
> >> >> >> >> >> You can send a signal to that program like
> >> >> >> >> >>  kill -SIGUSR1 process-id
> >> >> >> >> >> and it will print out some info about the symbol sequences
> >> >> >> >> >> involved, I think it is like
> >> >> >> >> >>  isymbol1 (osymbol1)  isymbol2 (osymbol2) and so on.
> >> >> >> >> >> Usually there is a particular word sequence that is
> >> >> >> >> >> problematic.
> >> >> >> >> >> Dan
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> >
> >> >> >> >> >> >  -kkm
> >> >> >> >> >> >
> >> >> >> >> >> > ------------------------------------------------------------------------------
> >> >> >> >> >> > _______________________________________________
> >> >> >> >> >> > Kaldi-users mailing list
> >> >> >> >> >> > Kal...@li...
> >> >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users