From: Kirill K. <kir...@sm...> - 2015-06-16 04:55:25
Thank you very much for your help, Dan, but I am still stuck.

First of all, a question: does the fstdeterminizestar algorithm depend on the actual backoff and n-gram probabilities, i.e. will it behave differently if the numbers in the ARPA model file are different? Or does it depend only on arc labels and not on weights? I am looking at the code, but I am certainly far from being able to understand it. I cheated by looking at all the if conditions in it, and this one in EpsilonClosure is seemingly the only one dealing with weights:

    if (! ApproxEqual(weight, iter->second.weight, delta_)) { // add extra part of weight to queue.

(In ProcessFinal it also has "if (this_final_weight != Weight::Zero())", but I do not believe that is relevant?) I am trying to understand how to dig into the problem: are weights actually in the picture?

Also, just as a test, I ran the grammar through "grep -v 'real real'", and indeed got a similar loop on the word "very", which is also often repeated. But the "real real" 2- and 3-grams are there in the "good" grammar too.

Another thing I do not understand is the presence of the SIL ilabel in the backtrace. Here is the beginning of the trace that leads to the infinite loop, as decoded with a little script I wrote (format is ilabel [ TAB olabel ]):

    #16
    #0
    V_B
    Y_I
    UW1_I
    Z_E	views
    #2
    SIL
    #0
    AH0_B
    N_I
    SH_I	unsure
    UH1_I
    R_E

Note the presence of SIL at line 8. This is not in the lexicon:

    $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
    !SIL 1 0.20 1.00 1.00 SIL_S
    $

Is this a hint? How did it get there at all?
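For anyone who wants to reproduce the decoding step, here is a minimal sketch of the kind of script described above; it is not the actual script. It assumes the trace has already been parsed into (ilabel, olabel) integer pairs, that the symbol tables are in the usual text form of one "symbol id" pair per line (as in phones.txt and words.txt), and that id 0 is epsilon. The function names and the tiny tables in the demo are made up for illustration.

```python
def load_symbols(lines):
    """Parse a text symbol table ('<symbol> <id>' per line) into {id: symbol}."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            table[int(parts[1])] = parts[0]
    return table


def decode_trace(pairs, isyms, osyms, eps_id=0):
    """Render (ilabel, olabel) pairs as 'ilabel[TAB olabel]' lines.

    The olabel is omitted when it is epsilon (assumed to be id 0).
    """
    decoded = []
    for ilabel, olabel in pairs:
        isym = isyms.get(ilabel, f"<unknown:{ilabel}>")
        if olabel == eps_id:
            decoded.append(isym)
        else:
            osym = osyms.get(olabel, f"<unknown:{olabel}>")
            decoded.append(f"{isym}\t{osym}")
    return decoded


# Hypothetical miniature tables; real ones would be read from
# $lang/phones.txt and $lang/words.txt.
phones = load_symbols(["<eps> 0", "SIL 1", "V_B 2", "Z_E 3", "#0 4", "#2 5"])
words = load_symbols(["<eps> 0", "views 1", "#0 2"])

# Hypothetical trace fragment as (ilabel, olabel) id pairs.
trace = [(4, 0), (2, 0), (3, 1), (5, 0), (1, 0)]
decoded = decode_trace(trace, phones, words)
print("\n".join(decoded))
```

With real data, the lines passed to load_symbols would come from open("phones.txt") and open("words.txt"); how the id pairs are extracted from the debug output depends on its exact format, which is not assumed here.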
I am using a standard script to build the L_disambig.fst:

    phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
    word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
    utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
      data/local/dict/silprob.txt $silphone '#'$ndisambig | \
      fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
      --keep_isymbols=false --keep_osymbols=false | \
      fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
      fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;

I checked the lexicon, and there are indeed only real phones at the beginning of each word, no empty positions and no #N symbols.

 -kkm

> -----Original Message-----
> From: Daniel Povey [mailto:dp...@gm...]
> Sent: 2015-06-15 1944
> To: Kirill Katsnelson
> Cc: kal...@li...
> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>
> I think the confusion is probably between two loops with "real" on
> them in G.fst: one loop where you always take the bigram probability,
> and one where you always take the unigram probability. Or maybe a
> similar confusion between a loop where you use the trigram "real real
> real" and the bigram "real real". Those loops are expected to exist.
> Probably the issue is that something happened at the start of the
> sequence which caused the FST to be confused about which of those two
> states it was in. If you have any empty words (words with empty
> pronunciation) in your lexicon this could possibly happen, as it would
> be confused between taking a normal word, then the backoff symbol, vs.
> taking a normal word, then the empty word, then the backoff symbol.
> I think the current Kaldi graph-creation scripts check for empty words
> in the lexicon, for this reason.
>
> Dan
>
>
> >
> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( )
> > generally almost makes sense, given that #16 is the last one in
> > the table, the silence disambiguation symbol. (Not sure why "real"
> > is emitted at L_E; I would rather expect it to be emitted at #1.)
> > What I do not understand is what exactly the debug trace
> > represents, and what I should make out of it. It is a path through
> > the FST graph, but I do not understand what this path is exactly,
> > and what this endless walk of this loop means.
> >
> > -kkm
> >
> >> -----Original Message-----
> >> From: Daniel Povey [mailto:dp...@gm...]
> >> Sent: 2015-06-15 1858
> >> To: Kirill Katsnelson
> >> Cc: kal...@li...
> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> >>
> >> Look into the "backoff disambiguation symbol", normally called #0.
> >> The reason why it is needed should be explained in the hbka.pdf
> >> paper.
> >> Dan
> >>
> >>
> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
> >> <kir...@sm...> wrote:
> >> > Thank you! The output consists of some sequences as you
> >> > described, quickly falling into a short, ever-repeated loop.
> >> >
> >> > The non-repeated section ends up with osymbols (excluding
> >> > epsilons) "whatsoever on vacation up", and then the repeated
> >> > part looks like " #1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E
> >> > (real)". The word "real" is spelled "R_B IY1_I L_E #1" in
> >> > L_disambig.
> >> >
> >> > Both LMs contain a bigram for "vacation up" and a trigram
> >> > "vacation up there". "up real" is a bigram in both, with
> >> > 3-grams "up real quick" and "up real quickly". "up real" is
> >> > also a tail of a few other 3-grams, but these are also the same
> >> > in both models (up to their weights).
> >> >
> >> > It looks like I do not understand what I should make in the end
> >> > out of this debug data :(
> >> >
> >> > -kkm
> >> >
> >> >> -----Original Message-----
> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> Sent: 2015-06-15 1821
> >> >> To: Kirill Katsnelson
> >> >> Cc: kal...@li...
> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> completes
> >> >>
> >> >> > I have a small set of sentences with repeat counts, and am
> >> >> > generating an LM out of it. One is generated by a horrible
> >> >> > local tool I have trouble tracing exactly how. For this one
> >> >> > L*G composition takes about 20 seconds on my CPU. Another LM
> >> >> > I just generated out of the same files with srilm 1.7.1
> >> >> > ngram-count. This one has been sitting in mkgraphs.sh on the
> >> >> > L_disambig*G composition step for about 30 minutes, and is
> >> >> > still churning. fstdeterminizestar --use-log=true is running
> >> >> > at 100%. L_disambig.fst is the same file in both cases. It
> >> >> > looks like G is making it not determinizable, although I
> >> >> > have no idea how that came to be.
> >> >> >
> >> >> > Could anyone share advice on tracking down the problem?
> >> >> > Thanks.
> >> >>
> >> >> You can send a signal to that program like
> >> >>   kill -SIGUSR1 process-id
> >> >> and it will print out some info about the symbol sequences
> >> >> involved, I think it is like
> >> >>   isymbol1 (osymbol1) isymbol2 (osymbol2)
> >> >> and so on.
> >> >> Usually there is a particular word sequence that is problematic.
> >> >> Dan
> >> >>
> >> >> >
> >> >> > -kkm
> >> >> >
> >> >> > _______________________________________________
> >> >> > Kaldi-users mailing list
> >> >> > Kal...@li...
> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users
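
The lexicon check discussed in this thread (no words with an empty pronunciation, and no #N symbol at the start of a pronunciation) can be sketched roughly as below. This is an illustration only, not the check the Kaldi scripts perform. It assumes each line of lexiconp_silprob_disambig.txt is a word followed by four numeric fields and then the phones, as in the "!SIL 1 0.20 1.00 1.00 SIL_S" line shown by grep; the function name and the sample entries other than !SIL are hypothetical.

```python
import re

# Disambiguation symbols look like #0, #1, ..., #16.
DISAMBIG = re.compile(r"^#\d+$")


def find_suspect_entries(lines, num_prob_fields=4):
    """Return (line number, word, reason) for suspicious lexicon entries.

    Assumed line layout: word, num_prob_fields numeric columns, then phones.
    """
    suspects = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        word = fields[0]
        phones = fields[1 + num_prob_fields:]
        real_phones = [p for p in phones if not DISAMBIG.match(p)]
        if not real_phones:
            suspects.append((lineno, word, "no real phones"))
        elif DISAMBIG.match(phones[0]):
            suspects.append((lineno, word, "starts with a disambiguation symbol"))
    return suspects


# Sample entries; only the !SIL line is taken from the thread, the
# probabilities and the last two words are made up.
sample = [
    "!SIL 1 0.20 1.00 1.00 SIL_S",
    "real 1 0.50 1.00 1.00 R_B IY1_I L_E #1",
    "<empty> 1 0.50 1.00 1.00",
    "odd 1 0.50 1.00 1.00 #3 AH0_S",
]
suspects = find_suspect_entries(sample)
for lineno, word, reason in suspects:
    print(f"line {lineno}: {word}: {reason}")
```

On a real lexicon the lines would come from open() on the lexicon file; an empty result would only rule out these two specific patterns, not other causes of non-determinizability.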