From: Kirill K. <kir...@sm...> - 2015-06-16 05:42:30
Bingo. G.fst is not determinizable (the "good" G.fst takes under a second
to determinize). And the bad one loops at the word "zero" like this:

  #0 unsure unsure #0 of of #0 yours yours #0 is is #0 your your #0 zip zip
  #0 wrong wrong #0 with with #0 zero zero #0 zero zero ....

I am taking the LM straight from ngram-count to the standard pipeline,
nothing fancy. The only thing is it has a lot of OOVs:

  remove_oovs.pl: removed 4646 lines.

Is this generally a problem? My "good" ARPA LM has a lot of OOVs too. I
grepped both for the word zero, but could not spot anything outrageous.
Can you think of anything I can look for?

My Kaldi source is no more than 10 days old. Here's the pipeline, just in
case.

  cat $src/$arpalm | tr -d '\r' | \
    utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt

  cat $src/$arpalm | tr -d '\r' | \
    arpa2fst - | fstprint | \
    utils/remove_oovs.pl $lang/lm_oovs.txt | \
    utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$lang/words.txt \
      --osymbols=$lang/words.txt --keep_isymbols=false --keep_osymbols=false | \
    fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst

 -kkm

> -----Original Message-----
> From: Daniel Povey [mailto:dp...@gm...]
> Sent: 2015-06-15 2206
> To: Kirill Katsnelson
> Cc: kal...@li...
> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>
> I don't recommend looking at the fstdeterminizestar algorithm itself;
> it's very complicated. Instead focus on the definition of
> "determinizable" and the twins property, and figure out what path you
> are taking through L.fst and G.fst. Trying to fstdeterminizestar G.fst
> directly, and seeing whether it terminates or not, may tell you
> something; if it fails, send the signal and see what happens.
> fstdeterminizestar does care about the weights, but only to the extent
> that they are the same or different from each other; and if your G.fst
> is generated from arpa2fst the pipeline should work for any ARPA-format
> language model. Make sure you are using an up-to-date Kaldi though;
> there have been fixes as recently as a few months ago.
> The presence of SIL is not surprising, it is the optional-silence added
> by the lexicon. I think that script is adding #16 if it does
> *not* take the optional silence, otherwise it adds the phone SIL.
> Since you are calling your FST a "grammar" I'm wondering whether you
> have done something fancy with mapping words to FSTs or something like
> that, which is causing the result to not be determinizable.
>
> Dan
>
>
> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson
> <kir...@sm...> wrote:
> > Thank you very much for your help Dan, but I am still stuck.
> >
> > First of all, a question: does the fstdeterminizestar algorithm
> > depend on actual backoff and n-gram probabilities, i.e. will it
> > behave differently if the numbers in the arpa model file are
> > different? Or does it depend only on arc labels but not weights? I
> > am looking at the code but certainly I am far from being able to
> > understand it. I cheated by looking at all the if conditions in it,
> > and this one in EpsilonClosure is seemingly the only one dealing
> > with weights:
> >
> >   if (! ApproxEqual(weight, iter->second.weight, delta_)) {
> >     // add extra part of weight to queue.
> >
> > (In ProcessFinal it also has "if (this_final_weight !=
> > Weight::Zero())" but I do not believe it is relevant?)
> >
> > I am trying to understand how to dig into the problem: are weights
> > actually in the picture?
> >
> > Also, just for a test, I ran the grammar through a "grep -v 'real
> > real'", and indeed got a similar loop on the word "very" which is
> > also often repeated. But the "real real" 2- and 3-grams are there in
> > the "good" grammar too.
> >
> > Another thing I do not understand is the presence of the SIL ilabel
> > in the backtrace. Here's the beginning of the trace that leads to
> > the infinite loop as decoded with a little script I wrote (format is
> > ilabel [ TAB olabel ]):
> >
> > #16
> > #0
> > V_B
> > Y_I
> > UW1_I
> > Z_E	views
> > #2
> > SIL
> > #0
> > AH0_B
> > N_I
> > SH_I	unsure
> > UH1_I
> > R_E
> >
> > Note the presence of SIL at line 8. This is not in the lexicon:
> >
> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
> > !SIL 1 0.20 1.00 1.00 SIL_S
> > $
> >
> > Is this a hint? How did it get there at all? I am using a standard
> > script to build the L_disambig.fst:
> >
> >   phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
> >   word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
> >   utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
> >     data/local/dict/silprob.txt $silphone '#'$ndisambig | \
> >     fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
> >     --keep_isymbols=false --keep_osymbols=false | \
> >     fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
> >     fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
> >
> > I checked the lexicon, and there are indeed only real phones at the
> > beginning of each word, no empty positions and no #N symbols.
> >
> >  -kkm
> >
> >> -----Original Message-----
> >> From: Daniel Povey [mailto:dp...@gm...]
> >> Sent: 2015-06-15 1944
> >> To: Kirill Katsnelson
> >> Cc: kal...@li...
> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> >>
> >> I think the confusion is probably between two loops with "real" on
> >> them in G.fst: one loop where you always take the bigram
> >> probability, and one where you always take the unigram probability.
> >> Or maybe a similar confusion between a loop where you use the
> >> trigram "real real real" and the bigram "real real".
> >> Those loops are expected to exist.
> >> Probably the issue is that something happened at the start of the
> >> sequence which caused the FST to be confused about which of those
> >> two states it was in. If you have any empty words (words with empty
> >> pronunciation) in your lexicon this could possibly happen, as it
> >> would be confused between taking a normal word, then the backoff
> >> symbol, vs. taking a normal word, then the empty word, then the
> >> backoff symbol.
> >> I think the current Kaldi graph-creation scripts check for empty
> >> words in the lexicon, for this reason.
> >>
> >> Dan
> >>
> >>
> >>
> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( )
> >> > generally almost makes sense, given that #16 is the last one in
> >> > the table, the silence disambiguation symbol. (Not sure why
> >> > "real" is emitted at L_E; I would rather expect it to be emitted
> >> > at #1.) What I do not understand is what exactly the debug trace
> >> > represents, and what I should make out of it. It is a path
> >> > through the FST graph, but I do not understand what this path is
> >> > exactly, and what this endless walk of this loop means.
> >> >
> >> >  -kkm
> >> >
> >> >> -----Original Message-----
> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> Sent: 2015-06-15 1858
> >> >> To: Kirill Katsnelson
> >> >> Cc: kal...@li...
> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> completes
> >> >>
> >> >> Look into the "backoff disambiguation symbol", normally called
> >> >> #0. The reason why it is needed should be explained in the
> >> >> hbka.pdf paper.
> >> >> Dan
> >> >>
> >> >>
> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
> >> >> <kir...@sm...> wrote:
> >> >> > Thank you! The output consists of some sequences as you
> >> >> > described, quickly falling into a short, endlessly repeated
> >> >> > loop.
> >> >> >
> >> >> > The non-repeated section ends up with osymbols (excluding
> >> >> > epsilons) "whatsoever on vacation up", and then the repeated
> >> >> > part looks like "#1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( )
> >> >> > L_E (real)". The word "real" is spelled "R_B IY1_I L_E #1"
> >> >> > in L_disambig.
> >> >> >
> >> >> > Both LMs contain a bigram for "vacation up" and a trigram
> >> >> > "vacation up there". "up real" is a bigram in both, with
> >> >> > 3-grams "up real quick" and "up real quickly". "up real" is
> >> >> > also a tail of a few other 3-grams, but these are also the
> >> >> > same in both models (up to their weights).
> >> >> >
> >> >> > It looks like I do not understand what I should make out of
> >> >> > this debug data in the end :(
> >> >> >
> >> >> >  -kkm
> >> >> >
> >> >> >> -----Original Message-----
> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> >> Sent: 2015-06-15 1821
> >> >> >> To: Kirill Katsnelson
> >> >> >> Cc: kal...@li...
> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> >> completes
> >> >> >>
> >> >> >> > I have a small set of sentences with repeat counts, and I
> >> >> >> > am generating an LM out of it. One is generated by a
> >> >> >> > horrible local tool I have trouble tracing exactly how.
> >> >> >> > For this one L*G composition takes about 20 seconds on my
> >> >> >> > CPU. Another LM I just generated out of the same files
> >> >> >> > with srilm 1.7.1 ngram-count. This one has been sitting in
> >> >> >> > mkgraphs.sh on the L_disambig*G composition step for about
> >> >> >> > 30 minutes, and still churning. fstdeterminizestar
> >> >> >> > --use-log=true is running at 100%. L_disambig.fst is the
> >> >> >> > same file in both cases. Looks like the G is making it not
> >> >> >> > determinizable, although I have no idea how it came to be.
> >> >> >> >
> >> >> >> > Could anyone share advice on tracking down the problem?
> >> >> >> > Thanks.
> >> >> >>
> >> >> >> You can send a signal to that program like kill -SIGUSR1
> >> >> >> process-id and it will print out some info about the symbol
> >> >> >> sequences involved, I think it is like
> >> >> >> isymbol1 (osymbol1) isymbol2 (osymbol2) and so on.
> >> >> >> Usually there is a particular word sequence that is
> >> >> >> problematic.
> >> >> >> Dan
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> >  -kkm
> >> >> >> >
> >> >> >> > ------------------------------------------------------------
> >> >> >> > _______________________________________________
> >> >> >> > Kaldi-users mailing list
> >> >> >> > Kal...@li...
> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users
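
Editorial sketch, not part of the original thread: the earliest reply says that after kill -SIGUSR1 the program prints pairs shaped like "isymbol1 (osymbol1) isymbol2 (osymbol2)". Assuming the trace really has that shape and has been saved as text, the words on the looping path can be pulled out with plain awk. The function name trace_words is hypothetical, and the sample string is the repeated part quoted in the thread.

```shell
#!/bin/sh
# Hypothetical helper (not a Kaldi tool): read a trace made of
# "isymbol (osymbol)" pairs on stdin and print each non-empty osymbol
# on its own line. Empty parentheses "( )" stand for epsilon outputs
# and are skipped.
trace_words() {
  awk '{
    line = $0
    # Walk through every parenthesised group on the line.
    while (match(line, /\([^()]*\)/)) {
      sym = substr(line, RSTART + 1, RLENGTH - 2)
      gsub(/^[ \t]+|[ \t]+$/, "", sym)   # trim blanks inside the parens
      if (sym != "") print sym
      line = substr(line, RSTART + RLENGTH)
    }
  }'
}

# Sample: the repeated part of the trace as quoted in the thread.
trace='#1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)'
printf '%s\n' "$trace" | trace_words   # prints: real
```

Feeding a longer captured trace through the same function gives the word sequence leading into the loop, which can then be compared against the n-grams in the ARPA file.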