Re: [Kaldi-users] fstdeterminizestar (L*G) never completes

Bingo. G.fst is not determinizable (the "good" G.fst takes under a second to determinize). The bad one loops at the word "zero", like this:

#0
unsure  unsure
#0
of      of
#0
yours   yours
#0
is      is
#0
your    your
#0
zip     zip
#0
wrong   wrong
#0
with    with
#0
zero    zero
#0
zero    zero
....
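
For anyone reading a trace like this later, here is a rough sketch of how the repeating tail could be picked out mechanically instead of by eye. It assumes the decoded trace was saved one "ilabel [TAB olabel]" entry per line, without the trailing "...." marker; the function name trace_tail_cycle and the file name trace.txt are made up for illustration and are not part of Kaldi.

```shell
# Hypothetical helper (not part of Kaldi): print the shortest block of lines
# that is repeated back-to-back at the very end of a decoded trace.
# Reads the trace on stdin; exits with status 1 if no repetition is found.
trace_tail_cycle() {
  awk '
    { line[NR] = $0 }                      # keep the whole trace in memory
    END {
      # Try periods 1, 2, 3, ... and stop at the first one for which the
      # last p lines are identical to the p lines just before them.
      for (p = 1; 2 * p <= NR; p++) {
        ok = 1
        for (i = 0; i < p; i++)
          if (line[NR - i] != line[NR - p - i]) { ok = 0; break }
        if (ok) {
          for (i = NR - p + 1; i <= NR; i++) print line[i]
          exit
        }
      }
      exit 1                               # no repeated tail found
    }
  '
}

# Usage (trace.txt is an assumed file name):
#   trace_tail_cycle < trace.txt
```

On the trace above this would print the two-line cycle, "#0" followed by "zero zero". It only compares the last two repetitions, so a longer capture gives more confidence that the loop is real.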

I am taking the LM straight from ngram-count to the standard pipeline, nothing fancy. The only thing is that it has a lot of OOVs:

remove_oovs.pl: removed 4646 lines.

Is this generally a problem? My "good" ARPA LM has OOVs too. I grepped both LMs for the word zero, but could not spot anything outrageous. Can you think of anything I can look for?
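
A plain grep for a word also matches longer words that contain it, and mixes all n-gram orders together. Below is a minimal sketch of a stricter listing, assuming a standard ARPA layout (section headers of the form \N-grams:, then "logprob, N words, optional backoff" on each line). The function name arpa_ngrams_with_word is invented for illustration; it is not an SRILM or Kaldi tool.

```shell
# Hypothetical helper: list every n-gram in an ARPA file (read from stdin)
# that contains the given word as a whole token, prefixed by its order.
arpa_ngrams_with_word() {
  awk -v w="$1" '
    # Section header such as \2-grams: sets the current n-gram order.
    /^\\[0-9]+-grams:/ { n = $0; gsub(/[^0-9]/, "", n); n += 0; next }
    # Any other backslash line (\data\, \end\) leaves n-gram sections.
    /^\\/ { n = 0; next }
    # Inside a section: field 1 is the log-probability, fields 2..n+1 are
    # the words, and an optional backoff weight may follow.
    n > 0 && NF >= n + 1 {
      for (i = 2; i <= n + 1; i++)
        if ($i == w) { print n "\t" $0; break }
    }
  '
}

# Usage, following the variable names of the pipeline in this message:
#   tr -d '\r' < $src/$arpalm | arpa_ngrams_with_word zero
```

Running it on both the "good" and the "bad" LM and diffing the two listings (ignoring the weight columns) is one way to compare which n-grams exist for a given word; whether that reveals the cause here is not known.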

My Kaldi source is no more than 10 days old. Here's the pipeline, just in case.

cat $src/$arpalm | tr -d '\r' | \
  utils/find_arpa_oovs.pl $lang/words.txt  > $lang/lm_oovs.txt

cat $src/$arpalm | tr -d '\r' | \
  arpa2fst - | fstprint | \
  utils/remove_oovs.pl $lang/lm_oovs.txt | \
  utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$lang/words.txt \
    --osymbols=$lang/words.txt  --keep_isymbols=false --keep_osymbols=false | \
   fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst

 -kkm

> -----Original Message-----
> From: Daniel Povey [mailto:dp...@gm...]
> Sent: 2015-06-15 2206
> To: Kirill Katsnelson
> Cc: kal...@li...
> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> 
> I don't recommend looking at the fstdeterminizestar algorithm itself;
> it's very complicated.  Instead focus on the definition of
> "determinizable" and the twins property, and figure out what path you
> are taking through L.fst and G.fst.  Trying to fstdeterminizestar G.fst
> directly, and seeing whether it terminates or not, may tell you
> something; if it fails to terminate, send the signal and see what happens.
> fstdeterminizestar does care about the weights, but only to the extent
> that they are the same or different from each other; and if your G.fst
> is generated from arpa2fst the pipeline should work for any ARPA-format
> language model. Make sure you are using an up-to-date Kaldi though;
> there have been fixes as recently as a few months ago.
> The presence of SIL is not surprising, it is the optional-silence added
> by the lexicon.  I think that script is adding #16 if it does
> *not* take the optional silence, otherwise it adds the phone SIL.
> Since you are calling your FST a "grammar" I'm wondering whether you
> have done something fancy with mapping words to FSTs or something like
> that, which is causing the result to not be determinizable.
> 
> Dan
> 
> 
> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson
> <kir...@sm...> wrote:
> > Thank you very much for your help Dan, but I am still stuck.
> >
> > First of all, a question: does the fstdeterminizestar algorithm
> > depend on actual backoff and n-gram probabilities, i.e. will it behave
> > differently if the numbers in arpa model file are different? Or does it
> > depend only on arc labels but not weights? I am looking at the code but
> > certainly I am far from being able to understand it. I cheated by
> > looking at all if conditions in it, and this one in EpsilonClosure is
> > seemingly the only one dealing with weights:
> >
> >             if (! ApproxEqual(weight, iter->second.weight, delta_)) { // add extra part of weight to queue.
> >
> > (In ProcessFinal it also has "if (this_final_weight !=
> > Weight::Zero())" but I do not believe it is relevant?)
> >
> > I am trying to understand how to dig into the problem--are weights
> > actually in the picture?
> >
> > Also, just for a test, I ran the grammar through a "grep -v 'real
> > real'", and indeed got a similar loop on the word "very" which is also
> > often repeated. But the "real real" 2- and 3-grams are there in the
> > "good" grammar too.
> >
> > Another thing I do not understand is the presence of the SIL ilabel
> > in the backtrace. Here's the beginning of the trace that leads to the
> > infinite loop as decoded with a little script I wrote (format is ilabel
> > [ TAB olabel ]):
> >
> > #16
> > #0
> > V_B
> > Y_I
> > UW1_I
> > Z_E     views
> > #2
> > SIL
> > #0
> > AH0_B
> > N_I
> > SH_I    unsure
> > UH1_I
> > R_E
> >
> > Note the presence of SIL at line 8. This is not in lexicon:
> >
> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
> > !SIL    1       0.20    1.00    1.00    SIL_S
> > $
> >
> > Is this a hint? How did it get there at all? I am using a standard
> > script to build the L_disambig.fst:
> >
> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
> > utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
> >               data/local/dict/silprob.txt $silphone '#'$ndisambig | \
> >      fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
> >      --keep_isymbols=false --keep_osymbols=false |   \
> >      fstaddselfloops  "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
> >      fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
> >
> > I checked the lexicon, and there are indeed only real phones at the
> > beginning of each word, no empty positions and no #N symbols.
> >
> >  -kkm
> >
> >> -----Original Message-----
> >> From: Daniel Povey [mailto:dp...@gm...]
> >> Sent: 2015-06-15 1944
> >> To: Kirill Katsnelson
> >> Cc: kal...@li...
> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> >>
> >> I think the confusion is probably between two loops with "real" on
> >> them in G.fst: one loop where you always take the bigram probability,
> >> and one where you always take the unigram probability.  Or maybe a
> >> similar confusion between a loop where you use the trigram "real real
> >> real" and the bigram "real real".  Those loops are expected to exist.
> >> Probably the issue is that something happened at the start of the
> >> sequence which caused the FST to be confused about which of those two
> >> states it was in.  If you have any empty words (words with empty
> >> pronunciation) in your lexicon this could possibly happen, as it
> >> would be confused between taking a normal word, then the backoff symbol,
> >> vs. taking a normal word, then the empty word, then the backoff symbol.
> >> I think the current Kaldi graph-creation scripts check for empty words
> >> in the lexicon, for this reason.
> >>
> >> Dan
> >>
> >>
> >>
> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( )
> >> > generally almost makes sense, given that #16 is the last one in the
> >> > table, the silence disambiguation symbol. (Not sure why "real" is
> >> > emitted at L_E--I would rather expect it to be emitted at #1.) What I
> >> > do not understand is what exactly the debug trace represents, and
> >> > what I should make out of it. It is a path through the FST graph, but
> >> > I do not understand what exactly this path is, and what this
> >> > endless walk of the loop means.
> >> >
> >> >  -kkm
> >> >
> >> >> -----Original Message-----
> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> Sent: 2015-06-15 1858
> >> >> To: Kirill Katsnelson
> >> >> Cc: kal...@li...
> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> completes
> >> >>
> >> >> Look into the "backoff disambiguation symbol", normally called #0.
> >> >> The reason why it is needed should be explained in the hbka.pdf paper.
> >> >> Dan
> >> >>
> >> >>
> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
> >> >> <kir...@sm...> wrote:
> >> >> > Thank you! The output consists of some sequences as you described,
> >> >> > quickly falling into a short, endlessly repeated loop.
> >> >> >
> >> >> > The non-repeated section ends up with osymbols (excluding epsilons)
> >> >> > "whatsoever on vacation up", and then the repeated part looks like
> >> >> > "#1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The word "real"
> >> >> > is spelled "R_B IY1_I L_E #1" in L_disambig.
> >> >> >
> >> >> > Both LMs contain a bigram for "vacation up" and a trigram "vacation
> >> >> > up there". "up real" is a bigram in both, with 3-grams "up real quick"
> >> >> > and "up real quickly". "up real" is also a tail of a few other
> >> >> > 3-grams, but these are also the same in both models (up to their
> >> >> > weights).
> >> >> >
> >> >> > It looks like I do not understand what I should make of this
> >> >> > debug data in the end :(
> >> >> >
> >> >> >  -kkm
> >> >> >
> >> >> >> -----Original Message-----
> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> >> Sent: 2015-06-15 1821
> >> >> >> To: Kirill Katsnelson
> >> >> >> Cc: kal...@li...
> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> >> completes
> >> >> >>
> >> >> >> > I have a small set of sentences with repeat counts, and I am
> >> >> >> > generating an LM out of it. One is generated by a horrible local
> >> >> >> > tool I have trouble tracing exactly how. For this one L*G
> >> >> >> > composition takes about 20 seconds on my CPU. Another LM I just
> >> >> >> > generated out of the same files with srilm 1.7.1 ngram-count.
> >> >> >> > This one has been sitting in mkgraphs.sh on the L_disambig*G
> >> >> >> > composition step for about 30 minutes, and is still churning.
> >> >> >> > fstdeterminizestar --use-log=true is running at 100%.
> >> >> >> > L_disambig.fst is the same file in both cases. Looks like the G is
> >> >> >> > making it not determinizable, although I have no idea how it came
> >> >> >> > to be.
> >> >> >> >
> >> >> >> > Could anyone share advice on tracking down the problem? Thanks.
> >> >> >>
> >> >> >> You can send a signal to that program like  kill -SIGUSR1
> >> >> >> process-id and it will print out some info about the symbol
> >> >> >> sequences involved, I think it is like
> >> >> >>  isymbol1 (osymbol1)  isymbol2 (osymbol2) and so on.
> >> >> >> Usually there is a particular word sequence that is problematic.
> >> >> >> Dan
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> >  -kkm
> >> >> >> >
> >> >> >> > ------------------------------------------------------------------------------
> >> >> >> > _______________________________________________
> >> >> >> > Kaldi-users mailing list
> >> >> >> > Kal...@li...
> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users