Re: [Kaldi-users] fstdeterminizestar (L*G) never completes

In general SRILM language models are OK, but something weird could
have happened, especially on an unusual platform like Windows.
Look for duplicate lines that apparently carry the same n-gram, and
also send me (but not kaldi-users) the ARPA LM.
Dan
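
One way to look for duplicate n-grams is sketched below. It assumes a standard ARPA layout (`\N-grams:` section headers, then one line per n-gram holding the log-probability, the N words, and an optional backoff weight). `find_dup_ngrams` is a hypothetical helper name, not a Kaldi or SRILM tool.

```shell
# Hypothetical helper (not part of Kaldi or SRILM): print every n-gram that
# occurs more than once in an ARPA LM read from stdin.
find_dup_ngrams() {
  tr -d '\r' | awk '
    /^\\[0-9]+-grams:/ { n = substr($1, 2) + 0; next }  # section header sets n
    /^\\end\\/         { n = 0; next }                  # end of the model
    n > 0 && NF > n {
      # Field 1 is the log-probability; fields 2..n+1 are the words.
      key = $2
      for (i = 3; i <= n + 1; i++) key = key " " $i
      if (seen[key]++ == 1) print n "-gram: " key       # report each duplicate once
    }'
}

# Example (the file name is a placeholder):
#   find_dup_ngrams < lm.arpa
```

The `tr -d '\r'` step mirrors the pipeline quoted below, since the model here came from Windows.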

On Tue, Jun 16, 2015 at 2:03 AM, Kirill Katsnelson
<kir...@sm...> wrote:
> Nope. The only thing I am thinking of doing is to bisect it somehow, to get a minimal grammar that still refuses to determinize. I tried different smoothing and played with other switches to ngram-count, but it still loops. Are there any known problems with SRILM-generated models?
>
>  -kkm
>
>> -----Original Message-----
>> From: Daniel Povey [mailto:dp...@gm...]
>> Sent: 2015-06-15 2248
>> To: Kirill Katsnelson
>> Cc: kal...@li...
>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>
>> OOVs should be OK.
>> Make sure there are no n-grams with things like <s> <s>
>>
>> e.g. see the lines
>>     grep -v '<s> <s>' | \
>>     grep -v '</s> <s>' | \
>>     grep -v '</s> </s>' | \
>>
>> in the WSJ script:
>>
>>  gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \
>>     grep -v '<s> <s>' | \
>>     grep -v '</s> <s>' | \
>>     grep -v '</s> </s>' | \
>>     arpa2fst - | fstprint | \
>>     utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \
>>     utils/eps2disambig.pl | utils/s2eps.pl | \
>>     fstcompile --isymbols=$test/words.txt --osymbols=$test/words.txt \
>>       --keep_isymbols=false --keep_osymbols=false | \
>>     fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
>>
>> Dan
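
Before adding the filters, it can help to check whether such n-grams are present at all. A minimal sketch, assuming the ARPA LM is plain text on stdin; `count_boundary_ngrams` is a hypothetical helper name, not part of Kaldi.

```shell
# Hypothetical helper: count the ARPA lines that the three grep -v filters
# from the WSJ recipe would drop.
count_boundary_ngrams() {
  # grep -c prints the number of matching lines; it exits non-zero when the
  # count is 0, so "|| true" keeps the function usable under "set -e".
  grep -c -e '<s> <s>' -e '</s> <s>' -e '</s> </s>' || true
}

# Example (the file name is a placeholder):
#   tr -d '\r' < lm.arpa | count_boundary_ngrams
```

A non-zero count suggests the filters are worth adding to the pipeline; replacing `-c` with `-n` would list the offending lines instead.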
>>
>>
>> On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson
>> <kir...@sm...> wrote:
>> > Bingo. G.fst is not determinizable (the "good" G.fst takes under a
>> > second to determinize). And the bad one loops at the word "zero" like
>> > this
>> >
>> > #0
>> > unsure  unsure
>> > #0
>> > of      of
>> > #0
>> > yours   yours
>> > #0
>> > is      is
>> > #0
>> > your    your
>> > #0
>> > zip     zip
>> > #0
>> > wrong   wrong
>> > #0
>> > with    with
>> > #0
>> > zero    zero
>> > #0
>> > zero    zero
>> > ....
>> >
>> > I am taking the LM straight from ngram-count to the standard pipeline,
>> > nothing fancy. The only thing is it has a lot of OOVs:
>> >
>> > remove_oovs.pl: removed 4646 lines.
>> >
>> > Is this generally a problem? My "good" ARPA LM has OOVs too. I grepped
>> > both for the word "zero", but could not spot anything outrageous. Can
>> > you think of anything I can look for?
>> >
>> > My source is no more than 10 days old. Here's the pipeline, just in
>> > case.
>> >
>> > cat $src/$arpalm | tr -d '\r' | \
>> >   utils/find_arpa_oovs.pl $lang/words.txt  > $lang/lm_oovs.txt
>> >
>> > cat $src/$arpalm | tr -d '\r' | \
>> >   arpa2fst - | fstprint | \
>> >   utils/remove_oovs.pl $lang/lm_oovs.txt | \
>> >   utils/eps2disambig.pl | utils/s2eps.pl | \
>> >   fstcompile --isymbols=$lang/words.txt --osymbols=$lang/words.txt \
>> >     --keep_isymbols=false --keep_osymbols=false | \
>> >   fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst
>> >
>> >  -kkm
>> >
>> >
>> >> -----Original Message-----
>> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> Sent: 2015-06-15 2206
>> >> To: Kirill Katsnelson
>> >> Cc: kal...@li...
>> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >>
>> >> I don't recommend looking at the fstdeterminizestar algorithm itself;
>> >> it's very complicated.  Instead focus on the definition of
>> >> "determinizable" and the twins property, and figure out what path you
>> >> are taking through L.fst and G.fst.  Trying to fstdeterminizestar
>> >> G.fst directly, and seeing whether it terminates or not, may tell you
>> >> something; if it does not terminate, send the signal and see what
>> >> happens.
>> >> fstdeterminizestar does care about the weights, but only to the
>> >> extent that they are the same or different from each other; and if
>> >> your G.fst is generated from arpa2fst the pipeline should work for
>> >> any ARPA-format language model.  Make sure you are using an up-to-date
>> >> Kaldi though; there have been fixes as recently as a few months ago.
>> >> The presence of SIL is not surprising: it is the optional silence
>> >> added by the lexicon.  I think that script adds #16 if it does
>> >> *not* take the optional silence; otherwise it adds the phone SIL.
>> >> Since you are calling your FST a "grammar" I'm wondering whether you
>> >> have done something fancy with mapping words to FSTs or something
>> >> like that, which is causing the result to not be determinizable.
>> >>
>> >> Dan
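
The "does it terminate" test on G.fst can be run with a time limit so that a non-determinizable FST does not spin forever. A minimal sketch, assuming GNU `timeout` is available and Kaldi's `fstdeterminizestar` is on the PATH; `run_with_limit` is a hypothetical helper name.

```shell
# Hypothetical helper: run a command under a time limit and report the outcome
# as "finished", "timed out", or "failed (status N)".
run_with_limit() {
  local limit=$1; shift
  local status=0
  timeout "$limit" "$@" > /dev/null 2>&1 || status=$?
  case $status in
    0)   echo "finished" ;;
    124) echo "timed out" ;;   # GNU timeout exits with 124 when the limit is hit
    *)   echo "failed (status $status)" ;;
  esac
}

# Example (assumes G.fst exists; the 60-second limit is an arbitrary choice):
#   run_with_limit 60 fstdeterminizestar --use-log=true G.fst
```

The thread reports that a "good" G.fst determinizes in under a second, so a limit of about a minute leaves a wide margin for a grammar of this size.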
>> >>
>> >>
>> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson
>> >> <kir...@sm...> wrote:
>> >> > Thank you very much for your help Dan, but I am still stuck.
>> >> >
>> >> > First of all, a question: does the fstdeterminizestar algorithm
>> >> > depend on actual backoff and n-gram probabilities, i.e. will it
>> >> > behave differently if the numbers in the ARPA model file are
>> >> > different? Or does it depend only on arc labels but not weights?
>> >> > I am looking at the code, but I am certainly far from being able to
>> >> > understand it. I cheated by looking at all the if conditions in it,
>> >> > and this one in EpsilonClosure is seemingly the only one dealing
>> >> > with weights:
>> >> >
>> >> >             if (! ApproxEqual(weight, iter->second.weight, delta_)) {
>> >> >               // add extra part of weight to queue.
>> >> >
>> >> > (In ProcessFinal it also has "if (this_final_weight !=
>> >> > Weight::Zero())" but I do not believe it is relevant?)
>> >> >
>> >> > I am trying to understand how to dig into the problem: are weights
>> >> > actually in the picture?
>> >> >
>> >> > Also, just for a test, I ran the grammar through
>> >> > "grep -v 'real real'", and indeed got a similar loop on the word
>> >> > "very", which is also often repeated. But the "real real" 2- and
>> >> > 3-grams are there in the "good" grammar too.
>> >> >
>> >> > Another thing I do not understand is the presence of the SIL ilabel
>> >> > in the backtrace. Here's the beginning of the trace that leads to the
>> >> > infinite loop, as decoded with a little script I wrote (format is
>> >> > ilabel [ TAB olabel ]):
>> >> >
>> >> > #16
>> >> > #0
>> >> > V_B
>> >> > Y_I
>> >> > UW1_I
>> >> > Z_E     views
>> >> > #2
>> >> > SIL
>> >> > #0
>> >> > AH0_B
>> >> > N_I
>> >> > SH_I    unsure
>> >> > UH1_I
>> >> > R_E
>> >> >
>> >> > Note the presence of SIL at line 8. This is not in lexicon:
>> >> >
>> >> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
>> >> > !SIL    1       0.20    1.00    1.00    SIL_S
>> >> > $
>> >> >
>> >> > Is this a hint? How did it get there at all? I am using a standard
>> >> > script to build the L_disambig.fst:
>> >> >
>> >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
>> >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
>> >> > utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
>> >> >               data/local/dict/silprob.txt $silphone '#'$ndisambig | \
>> >> >      fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
>> >> >      --keep_isymbols=false --keep_osymbols=false |   \
>> >> >      fstaddselfloops  "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
>> >> >      fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
>> >> >
>> >> > I checked the lexicon, and there are indeed only real phones at the
>> >> > beginning of each word, no empty positions and no #N symbols.
>> >> >
>> >> >  -kkm
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> >> Sent: 2015-06-15 1944
>> >> >> To: Kirill Katsnelson
>> >> >> Cc: kal...@li...
>> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
>> >> >> completes
>> >> >>
>> >> >> I think the confusion is probably between two loops with "real" on
>> >> >> them in G.fst: one loop where you always take the bigram
>> >> >> probability, and one where you always take the unigram probability.
>> >> >> Or maybe a similar confusion between a loop where you use the
>> >> >> trigram "real real real" and the bigram "real real".  Those loops
>> >> >> are expected to exist.
>> >> >> Probably the issue is that something happened at the start of the
>> >> >> sequence which caused the FST to be confused about which of those
>> >> >> two states it was in.  If you have any empty words (words with empty
>> >> >> pronunciation) in your lexicon this could possibly happen, as it
>> >> >> would be confused between taking a normal word, then the backoff
>> >> >> symbol, vs. taking a normal word, then the empty word, then the
>> >> >> backoff symbol.
>> >> >> I think the current Kaldi graph-creation scripts check for empty
>> >> >> words in the lexicon, for this reason.
>> >> >>
>> >> >> Dan
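
A quick manual check for empty pronunciations is sketched below. It assumes a whitespace-separated lexicon with one entry per line, where the phones follow a fixed number of leading columns (1 for a plain lexicon.txt; more when probability columns are present). `find_empty_prons` is a hypothetical helper name, not a Kaldi script.

```shell
# Hypothetical helper: print the words whose pronunciation is empty.
# $1 = number of leading non-phone columns (default 1: just the word).
find_empty_prons() {
  awk -v skip="${1:-1}" 'NF > 0 && NF <= skip { print $1 }'
}

# Examples (file names are placeholders):
#   find_empty_prons   < lexicon.txt     # columns: word, phones...
#   find_empty_prons 2 < lexiconp.txt    # columns: word, prob, phones...
```

Any word it prints has no phones after the leading columns, which is the empty-word case described above.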
>> >> >>
>> >> >>
>> >> >>
>> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( )
>> >> >> > generally almost makes sense, given that #16 is the last one in
>> >> >> > the table, the silence disambiguation symbol. (Not sure why "real"
>> >> >> > is emitted at L_E; I would rather expect it to be emitted at #1.)
>> >> >> > What I do not understand is what exactly the debug trace
>> >> >> > represents, and what I should make of it. It is a path through
>> >> >> > the FST graph, but I do not understand what this path is exactly,
>> >> >> > and what this endless walk of the loop means.
>> >> >> >
>> >> >> >  -kkm
>> >> >> >
>> >> >> >> -----Original Message-----
>> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> >> >> Sent: 2015-06-15 1858
>> >> >> >> To: Kirill Katsnelson
>> >> >> >> Cc: kal...@li...
>> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
>> >> >> >> completes
>> >> >> >>
>> >> >> >> Look into the "backoff disambiguation symbol", normally called #0.
>> >> >> >> The reason why it is needed should be explained in the hbka.pdf
>> >> >> >> paper.
>> >> >> >> Dan
>> >> >> >>
>> >> >> >>
>> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
>> >> >> >> <kir...@sm...> wrote:
>> >> >> >> > Thank you! The output consists of some sequences as you
>> >> >> >> > described, quickly falling into a short, endlessly repeated
>> >> >> >> > loop.
>> >> >> >> >
>> >> >> >> > The non-repeated section ends with the osymbols (excluding
>> >> >> >> > epsilons) "whatsoever on vacation up", and then the repeated
>> >> >> >> > part looks like "#1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E
>> >> >> >> > (real)". The word "real" is spelled "R_B IY1_I L_E #1" in
>> >> >> >> > L_disambig.
>> >> >> >> >
>> >> >> >> > Both LMs contain a bigram for "vacation up" and a trigram
>> >> >> >> > "vacation up there". "up real" is a bigram in both, with
>> >> >> >> > 3-grams "up real quick" and "up real quickly". "up real" is
>> >> >> >> > also a tail of a few other 3-grams, but these are also the
>> >> >> >> > same in both models (up to their weights).
>> >> >> >> >
>> >> >> >> > It looks like I do not understand what I should make of this
>> >> >> >> > debug data in the end :(
>> >> >> >> >
>> >> >> >> >  -kkm
>> >> >> >> >
>> >> >> >> >> -----Original Message-----
>> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> >> >> >> Sent: 2015-06-15 1821
>> >> >> >> >> To: Kirill Katsnelson
>> >> >> >> >> Cc: kal...@li...
>> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
>> >> >> >> >> completes
>> >> >> >> >>
>> >> >> >> >> > I have a small set of sentences with repeat counts, and I am
>> >> >> >> >> > generating an LM out of it. One is generated by a horrible
>> >> >> >> >> > local tool that I have trouble tracing exactly. For this one
>> >> >> >> >> > L*G composition takes about 20 seconds on my CPU. Another LM
>> >> >> >> >> > I just generated out of the same files with srilm 1.7.1
>> >> >> >> >> > ngram-count. This one has been sitting in mkgraphs.sh on the
>> >> >> >> >> > L_disambig*G composition step for about 30 minutes, and is
>> >> >> >> >> > still churning. fstdeterminizestar --use-log=true is running
>> >> >> >> >> > at 100%. L_disambig.fst is the same file in both cases. It
>> >> >> >> >> > looks like G is making it not determinizable, although I
>> >> >> >> >> > have no idea how that came to be.
>> >> >> >> >> >
>> >> >> >> >> > Could anyone share advice on tracking down the problem?
>> >> >> >> >> > Thanks.
>> >> >> >> >>
>> >> >> >> >> You can send a signal to that program like  kill -SIGUSR1
>> >> >> >> >> process-id and it will print out some info about the symbol
>> >> >> >> >> sequences involved, I think it is like
>> >> >> >> >>  isymbol1 (osymbol1)  isymbol2 (osymbol2) and so on.
>> >> >> >> >> Usually there is a particular word sequence that is
>> >> >> >> >> problematic.
>> >> >> >> >> Dan
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> >  -kkm
>> >> >> >> >> >
>> >> >> >> >> > ------------------------------------------------------------------------------
>> >> >> >> >> > _______________________________________________
>> >> >> >> >> > Kaldi-users mailing list
>> >> >> >> >> > Kal...@li...
>> >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users