Re: [Kaldi-users] fstdeterminizestar (L*G) never completes

Holy guacamole! That was it. Thank you very very much.

Perhaps arpa2fst v2.0 would detect such bloopers.
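
The checks suggested in this thread (</s> or <s> in the wrong position, <eps> or #0 used as words, stray \r characters, duplicate n-grams) can be sketched as a small pre-flight script to run on the ARPA file before arpa2fst. This is only an illustration under stated assumptions, not part of Kaldi or arpa2fst: the function name check_arpa, the RESERVED set, and the message strings are made up for this example, and the parser assumes the plain ARPA layout of "\N-grams:" sections with whitespace-separated fields.

```python
import re

# Symbols the thread says should not appear as words in the LM (assumption:
# these are the only two worth flagging).
RESERVED = {"<eps>", "#0"}


def check_arpa(lines):
    """Return (line_number, message) pairs for suspicious ARPA n-gram lines."""
    problems = []
    seen = set()
    order = 0  # stays 0 until a "\N-grams:" section header is read
    for lineno, raw in enumerate(lines, start=1):
        if "\r" in raw:
            problems.append((lineno, "stray carriage return"))
        line = raw.strip()
        header = re.fullmatch(r"\\(\d+)-grams:", line)
        if header:
            order = int(header.group(1))
            continue
        if line == "\\end\\":
            order = 0
            continue
        if order == 0 or not line:
            continue  # "\data\" header, "ngram N=..." counts, blank lines
        # An n-gram line is: log-probability, N words, optional backoff weight.
        words = tuple(line.split()[1:1 + order])
        if "</s>" in words[:-1]:
            problems.append((lineno, "</s> before the end of an n-gram"))
        if "<s>" in words[1:]:
            problems.append((lineno, "<s> after the start of an n-gram"))
        for word in words:
            if word in RESERVED:
                problems.append((lineno, "reserved symbol " + word))
        if words in seen:
            problems.append((lineno, "duplicate n-gram"))
        seen.add(words)
    return problems
```

To try it on a file, something like check_arpa(open(path, newline="")) keeps any \r characters visible to the check. An empty result does not prove that G.fst is determinizable; it only rules out the specific problems named above.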

> -----Original Message-----
> From: Daniel Povey [mailto:dp...@gm...]
> Sent: 2015-06-16 1526
> To: Kirill Katsnelson
> Cc: kal...@li...
> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> 
> It turns out the problem was probably caused by the end-of-sentence
> symbol </s> appearing in inappropriate places in the LM, at the start
> of n-grams rather than the end.  Probably the training data was
> contaminated somehow by </s>.
> Dan
> 
> 
> On Tue, Jun 16, 2015 at 2:07 PM, Daniel Povey <dp...@gm...> wrote:
> >> I am currently trying to get a minimal reproduction with a script.
> >> Let it run for a while. I'll send you what remains of it, and hope it
> >> might give me an idea too.
> >>
> >> Looks like that fstdeterminize may have completed on this grammar
> >> (how do you call the thing symbolized as $G$? "grammar" sounded
> >> confusing, as I understand, but I have no other word not exceeding 2
> >> syllables :))
> >
> > I would call it an LM.
> >
> >>> I have left one running by mistake before going to sleep, and it
> >>> was done. I am running one again with the time command to make sure
> >>> this is not a fluke. So it is possible that it is not exactly
> >>> non-determinizable, but instead takes enormous time (hours on one
> >>> LM, < 1 sec on another). Which is the same thing from the
> >>> engineering standpoint, close enough, as those engineering vs
> >>> mathematics jokes go. But jokes aside, I want something more bounded
> >>> for a production system, so I need to understand what throws it off
> >>> so badly.
> >
> > I would still call it a problem.  Check if your ARPA contains <eps>
> > or #0.  I may need to add checks for this into arpa2fst (which we will
> > rewrite at some point anyway).  Another problem could be weird things
> > like stray \r's which make one word seem like two in some
> > circumstances.
> > If I saw the output of arpa2fst I could probably figure out fairly
> > quickly what the problem was.  The way I would debug this is to trace
> > through your LM FST from the start and follow those symbols (or
> > epsilons) on that trace from the determinization failure, and see how
> > there are two different paths.
> > It's better if you share a couple different traces, not just one, so
> > we can see what's in common.
> >
> >> Is fstdeterminizestar more than fstrmepsilon ∘ fstdeterminize (the
> >> latter with the kaldi patch)?
> >
> > No, it should be faster.  fstrmepsilon ∘ fstdeterminize should fail
> > too.
> >
> >> Ah, and this is a Linux machine. So everything looks very very
> >> standard (oops. Did I just create an infinite loop by repeating a
> >> word?).
> >
> > I am considering changing the way the LM disambig symbols are used to
> > make this kind of problem less likely to happen in future, by having
> > several disambig symbols for the LM, one per order, instead of just
> > one.
> >
> > Dan
> >
> >
> >
> >>> -----Original Message-----
> >>> From: Daniel Povey [mailto:dp...@gm...]
> >>> Sent: 2015-06-15 2340
> >>> To: Kirill Katsnelson
> >>> Cc: kal...@li...
> >>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> >>>
> >>> In general SRILM language models are OK, but something weird could
> >>> have happened, especially on an unusual platform like Windows.
> >>> Look for duplicate lines that apparently contain the same n-gram,
> >>> and also send the ARPA LM to me (but not to kaldi-users).
> >>> Dan
> >>>
> >>>
> >>> On Tue, Jun 16, 2015 at 2:03 AM, Kirill Katsnelson
> >>> <kir...@sm...> wrote:
> >>> > Nope. The only thing I am thinking of doing is to bisect it
> >>> > somehow, to get a minimal grammar that still refuses to
> >>> > determinize. I tried different smoothing and played with other
> >>> > switches to ngram-count, but it still loops. Are there any known
> >>> > problems with SRILM-generated models?
> >>> >
> >>> >  -kkm
> >>> >
> >>> >> -----Original Message-----
> >>> >> From: Daniel Povey [mailto:dp...@gm...]
> >>> >> Sent: 2015-06-15 2248
> >>> >> To: Kirill Katsnelson
> >>> >> Cc: kal...@li...
> >>> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >>> >> completes
> >>> >>
> >>> >> OOVs should be OK.
> >>> >> Make sure there are no n-grams with things like <s> <s>
> >>> >>
> >>> >> e.g. see the lines
> >>> >>     grep -v '<s> <s>' | \
> >>> >>     grep -v '</s> <s>' | \
> >>> >>     grep -v '</s> </s>' | \
> >>> >>
> >>> >> in the WSJ script:
> >>> >>
> >>> >>  gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \
> >>> >>     grep -v '<s> <s>' | \
> >>> >>     grep -v '</s> <s>' | \
> >>> >>     grep -v '</s> </s>' | \
> >>> >>     arpa2fst - | fstprint | \
> >>> >>     utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \
> >>> >>     utils/eps2disambig.pl | utils/s2eps.pl | \
> >>> >>     fstcompile --isymbols=$test/words.txt \
> >>> >>       --osymbols=$test/words.txt --keep_isymbols=false \
> >>> >>       --keep_osymbols=false | \
> >>> >>     fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
> >>> >>
> >>> >> Dan
> >>> >>
> >>> >>
> >>> >> On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson
> >>> >> <kir...@sm...> wrote:
> >>> >> > Bingo. G.fst is not determinizable (the "good" G.fst takes
> >>> >> > under a second to determinize). And the bad one loops at the
> >>> >> > word "zero" like this
> >>> >> >
> >>> >> > #0
> >>> >> > unsure  unsure
> >>> >> > #0
> >>> >> > of      of
> >>> >> > #0
> >>> >> > yours   yours
> >>> >> > #0
> >>> >> > is      is
> >>> >> > #0
> >>> >> > your    your
> >>> >> > #0
> >>> >> > zip     zip
> >>> >> > #0
> >>> >> > wrong   wrong
> >>> >> > #0
> >>> >> > with    with
> >>> >> > #0
> >>> >> > zero    zero
> >>> >> > #0
> >>> >> > zero    zero
> >>> >> > ....
> >>> >> >
> >>> >> > I am taking the LM straight from ngram-count to the standard
> >>> >> > pipeline, nothing fancy. The only thing is it has a lot of OOVs:
> >>> >> >
> >>> >> > remove_oovs.pl: removed 4646 lines.
> >>> >> >
> >>> >> > Is this generally a problem? So does my "good" arpa LM. I
> >>> >> > grepped both for the word zero, but could not spot anything
> >>> >> > outrageous. Can you think of anything I can look for?
> >>> >> >
> >>> >> > My source is no more than 10 days old. Here's the pipeline,
> >>> >> > just in case.
> >>> >> >
> >>> >> > cat $src/$arpalm | tr -d '\r' | \
> >>> >> >   utils/find_arpa_oovs.pl $lang/words.txt  > $lang/lm_oovs.txt
> >>> >> >
> >>> >> > cat $src/$arpalm | tr -d '\r' | \
> >>> >> >   arpa2fst - | fstprint | \
> >>> >> >   utils/remove_oovs.pl $lang/lm_oovs.txt | \
> >>> >> >   utils/eps2disambig.pl | utils/s2eps.pl | \
> >>> >> >   fstcompile --isymbols=$lang/words.txt \
> >>> >> >     --osymbols=$lang/words.txt --keep_isymbols=false \
> >>> >> >     --keep_osymbols=false | \
> >>> >> >   fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst
> >>> >> >
> >>> >> >  -kkm
> >>> >> >
> >>> >> >
> >>> >> >> -----Original Message-----
> >>> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >>> >> >> Sent: 2015-06-15 2206
> >>> >> >> To: Kirill Katsnelson
> >>> >> >> Cc: kal...@li...
> >>> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >>> >> >> completes
> >>> >> >>
> >>> >> >> I don't recommend looking at the fstdeterminizestar algorithm
> >>> >> >> itself; it's very complicated.  Instead focus on the definition
> >>> >> >> of "determinizable" and the twins property, and figure out what
> >>> >> >> path you are taking through L.fst and G.fst.  Trying to
> >>> >> >> fstdeterminizestar G.fst directly, and seeing whether it
> >>> >> >> terminates or not, may tell you something; if it fails, send
> >>> >> >> the signal and see what happens.
> >>> >> >> fstdeterminizestar does care about the weights, but only to
> >>> >> >> the extent that they are the same or different from each
> >>> >> >> other; and if your G.fst is generated from arpa2fst the
> >>> >> >> pipeline should work for any ARPA-format language model. Make
> >>> >> >> sure you are using an up-to-date Kaldi though, there have been
> >>> >> >> fixes as recently as a few months ago.
> >>> >> >> The presence of SIL is not surprising, it is the
> >>> >> >> optional-silence added by the lexicon.  I think that script is
> >>> >> >> adding #16 if it does *not* take the optional silence,
> >>> >> >> otherwise it adds the phone SIL.
> >>> >> >> Since you are calling your FST a "grammar" I'm wondering
> >>> >> >> whether you have done something fancy with mapping words to
> >>> >> >> FSTs or something like that, which is causing the result to
> >>> >> >> not be determinizable.
> >>> >> >>
> >>> >> >> Dan
> >>> >> >>
> >>> >> >>
> >>> >> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson
> >>> >> >> <kir...@sm...> wrote:
> >>> >> >> > Thank you very much for your help Dan, but I am still stuck.
> >>> >> >> >
> >>> >> >> > First of all, a question: does the fstdeterminizestar
> >>> >> >> > algorithm depend on actual backoff and n-gram probabilities,
> >>> >> >> > i.e. will it behave differently if the numbers in the arpa
> >>> >> >> > model file are different? Or does it depend only on arc
> >>> >> >> > labels but not weights? I am looking at the code but
> >>> >> >> > certainly I am far from being able to understand it. I
> >>> >> >> > cheated by looking at all if conditions in it, and this one
> >>> >> >> > in EpsilonClosure is seemingly the only one dealing with
> >>> >> >> > weights:
> >>> >> >> >
> >>> >> >> >             if (! ApproxEqual(weight, iter->second.weight, delta_)) {
> >>> >> >> >               // add extra part of weight to queue.
> >>> >> >> >
> >>> >> >> > (In ProcessFinal it also has "if (this_final_weight !=
> >>> >> >> > Weight::Zero())" but I do not believe it is relevant?)
> >>> >> >> >
> >>> >> >> > I am trying to understand how to dig into the problem: are
> >>> >> >> > weights in the picture actually?
> >>> >> >> >
> >>> >> >> > Also, just for a test, I ran the grammar through a "grep -v
> >>> >> >> > 'real real'", and indeed got a similar loop on the word
> >>> >> >> > "very" which is also often repeated. But the "real real" 2-
> >>> >> >> > and 3-grams are there in the "good" grammar too.
> >>> >> >> >
> >>> >> >> > Another thing I do not understand is the presence of the SIL
> >>> >> >> > ilabel in the backtrace. Here's the beginning of the trace
> >>> >> >> > that leads to the infinite loop as decoded with a little
> >>> >> >> > script I wrote (format is ilabel [ TAB olabel ]):
> >>> >> >> >
> >>> >> >> > #16
> >>> >> >> > #0
> >>> >> >> > V_B
> >>> >> >> > Y_I
> >>> >> >> > UW1_I
> >>> >> >> > Z_E     views
> >>> >> >> > #2
> >>> >> >> > SIL
> >>> >> >> > #0
> >>> >> >> > AH0_B
> >>> >> >> > N_I
> >>> >> >> > SH_I    unsure
> >>> >> >> > UH1_I
> >>> >> >> > R_E
> >>> >> >> >
> >>> >> >> > Note the presence of SIL at line 8. This is not in lexicon:
> >>> >> >> >
> >>> >> >> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
> >>> >> >> > !SIL    1       0.20    1.00    1.00    SIL_S
> >>> >> >> > $
> >>> >> >> >
> >>> >> >> > Is this a hint? How did it get there at all? I am using a
> >>> >> >> > standard script to build the L_disambig.fst:
> >>> >> >> >
> >>> >> >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
> >>> >> >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
> >>> >> >> > utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
> >>> >> >> >               data/local/dict/silprob.txt $silphone '#'$ndisambig | \
> >>> >> >> >      fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
> >>> >> >> >      --keep_isymbols=false --keep_osymbols=false |   \
> >>> >> >> >      fstaddselfloops  "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
> >>> >> >> >      fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
> >>> >> >> >
> >>> >> >> > I checked the lexicon, and there are indeed only real phones
> >>> >> >> > at the beginning of each word, no empty positions and no #N
> >>> >> >> > symbols.
> >>> >> >> >
> >>> >> >> >  -kkm
> >>> >> >> >
> >>> >> >> >> -----Original Message-----
> >>> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >>> >> >> >> Sent: 2015-06-15 1944
> >>> >> >> >> To: Kirill Katsnelson
> >>> >> >> >> Cc: kal...@li...
> >>> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >>> >> >> >> completes
> >>> >> >> >>
> >>> >> >> >> I think the confusion is probably between two loops with
> >>> >> >> >> "real" on them in G.fst: one loop where you always take the
> >>> >> >> >> bigram probability, and one where you always take the
> >>> >> >> >> unigram probability.  Or maybe a similar confusion between a
> >>> >> >> >> loop where you use the trigram "real real real" and the
> >>> >> >> >> bigram "real real".  Those loops are expected to exist.
> >>> >> >> >> Probably the issue is that something happened at the start
> >>> >> >> >> of the sequence which caused the FST to be confused about
> >>> >> >> >> which of those two states it was in.  If you have any empty
> >>> >> >> >> words (words with empty pronunciation) in your lexicon this
> >>> >> >> >> could possibly happen, as it would be confused between
> >>> >> >> >> taking a normal word, then the backoff symbol, vs. taking a
> >>> >> >> >> normal word, then the empty word, then the backoff symbol.
> >>> >> >> >> I think the current Kaldi graph-creation scripts check for
> >>> >> >> >> empty words in the lexicon, for this reason.
> >>> >> >> >>
> >>> >> >> >> Dan
> >>> >> >> >>
> >>> >> >> >>
> >>> >> >> >>
> >>> >> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( )
> >>> >> >> >> > #0 ( ) generally almost makes sense, given that #16 is the
> >>> >> >> >> > last one in the table, the silence disambiguation symbol.
> >>> >> >> >> > (Not sure why "real" is emitted at L_E; I would rather
> >>> >> >> >> > expect it to be emitted at #1.) What I do not understand
> >>> >> >> >> > is what exactly the debug trace represents, and what I
> >>> >> >> >> > should make out of it. It is a path through the FST graph,
> >>> >> >> >> > but I do not understand what this path is exactly, and
> >>> >> >> >> > what this endless walk of this loop means.
> >>> >> >> >> >
> >>> >> >> >> >  -kkm
> >>> >> >> >> >
> >>> >> >> >> >> -----Original Message-----
> >>> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >>> >> >> >> >> Sent: 2015-06-15 1858
> >>> >> >> >> >> To: Kirill Katsnelson
> >>> >> >> >> >> Cc: kal...@li...
> >>> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G)
> >>> >> >> >> >> never completes
> >>> >> >> >> >>
> >>> >> >> >> >> Look into the "backoff disambiguation symbol", normally
> >>> >> >> >> >> called #0.  The reason why it is needed should be
> >>> >> >> >> >> explained in the hbka.pdf paper.
> >>> >> >> >> >> Dan
> >>> >> >> >> >>
> >>> >> >> >> >>
> >>> >> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
> >>> >> >> >> >> <kir...@sm...> wrote:
> >>> >> >> >> >> > Thank you! The output consists of some sequences as
> >>> >> >> >> >> > you described, quickly falling into a short, endlessly
> >>> >> >> >> >> > repeated loop.
> >>> >> >> >> >> >
> >>> >> >> >> >> > The non-repeated section ends with the osymbols
> >>> >> >> >> >> > (excluding epsilons) "whatsoever on vacation up", and
> >>> >> >> >> >> > then the repeated part looks like "#1 ( ) #16 ( ) #0
> >>> >> >> >> >> > ( ) R_B ( ) IY1_I ( ) L_E (real)". The word "real" is
> >>> >> >> >> >> > spelled "R_B IY1_I L_E #1" in L_disambig.
> >>> >> >> >> >> >
> >>> >> >> >> >> > Both LMs contain a bigram for "vacation up" and a
> >>> >> >> >> >> > trigram "vacation up there". "up real" is a bigram in
> >>> >> >> >> >> > both, with 3-grams "up real quick" and "up real
> >>> >> >> >> >> > quickly". "up real" is also a tail of a few other
> >>> >> >> >> >> > 3-grams, but these are also the same in both models (up
> >>> >> >> >> >> > to their weights).
> >>> >> >> >> >> >
> >>> >> >> >> >> > It looks like I do not understand what I should make
> >>> >> >> >> >> > out of this debug data in the end :(
> >>> >> >> >> >> >
> >>> >> >> >> >> >  -kkm
> >>> >> >> >> >> >
> >>> >> >> >> >> >> -----Original Message-----
> >>> >> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >>> >> >> >> >> >> Sent: 2015-06-15 1821
> >>> >> >> >> >> >> To: Kirill Katsnelson
> >>> >> >> >> >> >> Cc: kal...@li...
> >>> >> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G)
> >>> never
> >>> >> >> >> >> >> completes
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> > I have a small set of sentences with repeat counts,
> >>> >> >> >> >> >> > and am generating an LM out of it. One is generated
> >>> >> >> >> >> >> > by a horrible local tool I have trouble tracing
> >>> >> >> >> >> >> > exactly how. For this one L*G composition takes
> >>> >> >> >> >> >> > about 20 seconds on my CPU. Another LM I just
> >>> >> >> >> >> >> > generated out of the same files with srilm 1.7.1
> >>> >> >> >> >> >> > ngram-count. This one has been sitting in
> >>> >> >> >> >> >> > mkgraphs.sh on the L_disambig*G composition step for
> >>> >> >> >> >> >> > about 30 minutes, and still churning.
> >>> >> >> >> >> >> > fstdeterminizestar --use-log=true is running at
> >>> >> >> >> >> >> > 100%. L_disambig.fst is the same file in both cases.
> >>> >> >> >> >> >> > Looks like the G is making it not determinizable,
> >>> >> >> >> >> >> > although I have no idea how it came to be.
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> > Could anyone share advice on tracking down the
> >>> >> >> >> >> >> > problem? Thanks.
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> You can send a signal to that program like
> >>> >> >> >> >> >>  kill -SIGUSR1 process-id
> >>> >> >> >> >> >> and it will print out some info about the symbol
> >>> >> >> >> >> >> sequences involved, I think it is like
> >>> >> >> >> >> >>  isymbol1 (osymbol1)  isymbol2 (osymbol2) and so on.
> >>> >> >> >> >> >> Usually there is a particular word sequence that is
> >>> >> >> >> >> >> problematic.
> >>> >> >> >> >> >> Dan
> >>> >> >> >> >> >>
> >>> >> >> >> >> >>
> >>> >> >> >> >> >>
> >>> >> >> >> >> >>
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> >  -kkm
> >>> >> >> >> >> >> >
> >>> >> >> >> >> >> > ------------------------------------------------------------
> >>> >> >> >> >> >> > _______________________________________________
> >>> >> >> >> >> >> > Kaldi-users mailing list
> >>> >> >> >> >> >> > Kal...@li...
> >>> >> >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users