Re: [Kaldi-users] fstdeterminizestar (L*G) never completes

Thank you very much for your help Dan, but I am still stuck.

First of all, a question: does the fstdeterminizestar algorithm depend on the actual backoff and n-gram probabilities, i.e. will it behave differently if the numbers in the ARPA model file are different? Or does it depend only on the arc labels and not on the weights? I am looking at the code, but I am certainly far from being able to understand it. I cheated by looking at all the if conditions in it, and this one in EpsilonClosure is seemingly the only one dealing with weights:

            if (! ApproxEqual(weight, iter->second.weight, delta_)) {  // add extra part of weight to queue.

(In ProcessFinal it also has "if (this_final_weight != Weight::Zero())" but I do not believe it is relevant?)

I am trying to work out how to dig into the problem: are the weights actually part of the picture?

Also, just as a test, I ran the grammar through a "grep -v 'real real'", and indeed got a similar loop on the word "very", which is also often repeated. But the "real real" 2- and 3-grams are there in the "good" grammar too.

Another thing I do not understand is the presence of the SIL ilabel in the backtrace. Here is the beginning of the trace that leads to the infinite loop, as decoded with a little script I wrote (format is ilabel [ TAB olabel ]):

#16
#0
V_B
Y_I
UW1_I
Z_E     views
#2
SIL
#0
AH0_B
N_I
SH_I    unsure
UH1_I
R_E

Note the SIL at line 8 of the trace. It does not appear in the lexicon except as the pronunciation of !SIL:

$ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
!SIL    1       0.20    1.00    1.00    SIL_S
$

Is this a hint? How did it get there at all? I am using a standard script to build the L_disambig.fst:

phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
              data/local/dict/silprob.txt $silphone '#'$ndisambig | \
     fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
     --keep_isymbols=false --keep_osymbols=false |   \
     fstaddselfloops  "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
     fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;

I checked the lexicon, and there are indeed only real phones at the beginning of each word, no empty positions and no #N symbols.
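
For reference, a check along these lines can be scripted. The sketch below is only an illustration: find_empty_prons is a hypothetical helper name, and it assumes the lexiconp_silprob column layout seen in the !SIL entry above (word, pronunciation probability, three silence-related fields, then the phones). Under that assumption, an entry with fewer than six fields, or one whose first pronunciation symbol is a #N disambiguation symbol, has no real phones.

```shell
# Hypothetical helper: print lexicon entries that have no real phones.
# Assumed column layout (from the !SIL entry): word, pron-prob, three
# silence-related fields, then the phones, so phones start at field 6.
# Reads the files given as arguments, or stdin if none are given.
find_empty_prons() {
  awk 'NF > 0 && (NF < 6 || $6 ~ /^#/) { print NR ": " $0 }' "$@"
}

# Example call (path taken from the grep command above):
#   find_empty_prons data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
```

No output means every entry has at least one phone before any disambiguation symbol, under the assumed layout.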

 -kkm

> -----Original Message-----
> From: Daniel Povey [mailto:dp...@gm...]
> Sent: 2015-06-15 1944
> To: Kirill Katsnelson
> Cc: kal...@li...
> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> 
> I think the confusion is probably between two loops with "real" on them
> in G.fst: one loop where you always take the bigram probability, and
> one where you always take the unigram probability.  Or maybe a similar
> confusion between a loop where you use the trigram "real real real" and
> the bigram "real real".  Those loops are expected to exist.
> Probably the issue is that something happened at the start of the
> sequence which caused the FST to be confused about which of those two
> states it was in.  If you have any empty words (words with empty
> pronunciation) in your lexicon this could possibly happen, as it would
> be confused between  taking a normal word, then the backoff symbol, vs.
> taking a normal word, then the empty word, then the backoff symbol.
> I think the current Kaldi graph-creation scripts check for empty words
> in the lexicon, for this reason.
> 
> Dan
> 
> 
> 
> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( )
> > generally almost makes sense, given that #16 is the last one in the
> > table, the silence disambiguation symbol. (Not sure why "real" is
> > emitted at L_E; I would rather expect it to be emitted at #1.) What I
> > do not understand is what exactly the debug trace represents, and
> > what I should make of it. It is a path through the FST graph, but I
> > do not understand what exactly this path is, or what the endless walk
> > around this loop means.
> >
> >  -kkm
> >
> >> -----Original Message-----
> >> From: Daniel Povey [mailto:dp...@gm...]
> >> Sent: 2015-06-15 1858
> >> To: Kirill Katsnelson
> >> Cc: kal...@li...
> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> >>
> >> Look into the "backoff disambiguation symbol", normally called #0.
> >> The reason why it is needed should be explained in the hbka.pdf
> >> paper.
> >> Dan
> >>
> >>
> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
> >> <kir...@sm...> wrote:
> >> > Thank you! The output consists of some sequences as you described,
> >> > quickly falling into a short, endlessly repeated loop.
> >> >
> >> > The non-repeated section ends with the osymbols (excluding
> >> > epsilons) "whatsoever on vacation up", and then the repeated part
> >> > looks like "#1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)".
> >> > The word "real" is spelled "R_B IY1_I L_E #1" in L_disambig.
> >> >
> >> > Both LMs contain a bigram for "vacation up" and a trigram
> >> > "vacation up there". "up real" is a bigram in both, with 3-grams
> >> > "up real quick" and "up real quickly". "up real" is also the tail
> >> > of a few other 3-grams, but these are also the same in both models
> >> > (up to their weights).
> >> >
> >> > It looks like I do not understand what I should make of this debug
> >> > data in the end :(
> >> >
> >> >  -kkm
> >> >
> >> >> -----Original Message-----
> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> Sent: 2015-06-15 1821
> >> >> To: Kirill Katsnelson
> >> >> Cc: kal...@li...
> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> completes
> >> >>
> >> >> > I have a small set of sentences with repeat counts, and I am
> >> >> > generating an LM out of it. One LM is generated by a horrible
> >> >> > local tool whose workings I have trouble tracing exactly. For
> >> >> > this one, L*G composition takes about 20 seconds on my CPU.
> >> >> > Another LM I just generated out of the same files with srilm
> >> >> > 1.7.1 ngram-count. This one has been sitting in mkgraphs.sh on
> >> >> > the L_disambig*G composition step for about 30 minutes, and is
> >> >> > still churning. fstdeterminizestar --use-log=true is running at
> >> >> > 100%. L_disambig.fst is the same file in both cases. It looks
> >> >> > like G is making it non-determinizable, although I have no idea
> >> >> > how that came to be.
> >> >> >
> >> >> > Could anyone share advice on tracking down the problem? Thanks.
> >> >>
> >> >> You can send a signal to that program like  kill -SIGUSR1
> >> >> process-id and it will print out some info about the symbol
> >> >> sequences involved, I think it is like
> >> >>  isymbol1 (osymbol1)  isymbol2 (osymbol2) and so on.
> >> >> Usually there is a particular word sequence that is problematic.
> >> >> Dan
> >> >>
> >> >>
> >> >>
> >> >>
> >> >> >
> >> >> >  -kkm
> >> >> >
> >> >> > _______________________________________________
> >> >> > Kaldi-users mailing list
> >> >> > Kal...@li...
> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users