From: Daniel P. <dp...@gm...> - 2015-06-16 05:06:16
I don't recommend looking at the fstdeterminizestar algorithm itself; it's very complicated. Instead, focus on the definition of "determinizable" and the twins property, and figure out what path you are taking through L.fst and G.fst. Trying to fstdeterminizestar G.fst directly, and seeing whether it terminates or not, may tell you something; if it fails to terminate, send it the signal and see what happens.

fstdeterminizestar does care about the weights, but only to the extent that they are the same as or different from each other. If your G.fst is generated by arpa2fst, the pipeline should work for any ARPA-format language model. Make sure you are using an up-to-date Kaldi, though; there have been fixes as recently as a few months ago.

The presence of SIL is not surprising: it is the optional silence added by the lexicon. I think that script adds #16 if it does *not* take the optional silence; otherwise it adds the phone SIL.

Since you are calling your FST a "grammar", I'm wondering whether you have done something fancy with mapping words to FSTs or something like that, which is causing the result to not be determinizable.

Dan

On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson <kir...@sm...> wrote:
> Thank you very much for your help Dan, but I am still stuck.
>
> First of all, a question: does the fstdeterminizestar algorithm depend on the actual backoff and n-gram probabilities, i.e. will it behave differently if the numbers in the ARPA model file are different? Or does it depend only on arc labels but not weights? I am looking at the code, but I am certainly far from being able to understand it. I cheated by looking at all the if conditions in it, and this one in EpsilonClosure is seemingly the only one dealing with weights:
>
>   if (! ApproxEqual(weight, iter->second.weight, delta_)) { // add extra part of weight to queue.
>
> (In ProcessFinal it also has "if (this_final_weight != Weight::Zero())" but I do not believe it is relevant?)
>
> I am trying to understand how to dig into the problem: are weights actually in the picture?
>
> Also, just as a test, I ran the grammar through "grep -v 'real real'", and indeed got a similar loop on the word "very", which is also often repeated. But the "real real" 2- and 3-grams are there in the "good" grammar too.
>
> Another thing I do not understand is the presence of the SIL ilabel in the backtrace. Here's the beginning of the trace that leads to the infinite loop, as decoded with a little script I wrote (format is ilabel [ TAB olabel ]):
>
> #16
> #0
> V_B
> Y_I
> UW1_I
> Z_E views
> #2
> SIL
> #0
> AH0_B
> N_I
> SH_I unsure
> UH1_I
> R_E
>
> Note the presence of SIL at line 8. This is not in the lexicon:
>
> $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
> !SIL 1 0.20 1.00 1.00 SIL_S
> $
>
> Is this a hint? How did it get there at all? I am using a standard script to build the L_disambig.fst:
>
> phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
> word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
> utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
>   data/local/dict/silprob.txt $silphone '#'$ndisambig | \
>   fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
>   --keep_isymbols=false --keep_osymbols=false | \
>   fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
>   fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
>
> I checked the lexicon, and there are indeed only real phones at the beginning of each word, no empty positions and no #N symbols.
>
> -kkm
>
>> -----Original Message-----
>> From: Daniel Povey [mailto:dp...@gm...]
>> Sent: 2015-06-15 1944
>> To: Kirill Katsnelson
>> Cc: kal...@li...
>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>
>> I think the confusion is probably between two loops with "real" on them in G.fst: one loop where you always take the bigram probability, and one where you always take the unigram probability. Or maybe a similar confusion between a loop where you use the trigram "real real real" and the bigram "real real". Those loops are expected to exist.
>> Probably the issue is that something happened at the start of the sequence which caused the FST to be confused about which of those two states it was in. If you have any empty words (words with empty pronunciation) in your lexicon, this could possibly happen, as it would be confused between taking a normal word, then the backoff symbol, vs. taking a normal word, then the empty word, then the backoff symbol.
>> I think the current Kaldi graph-creation script checks for empty words in the lexicon, for this reason.
>>
>> Dan
>>
>> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( ) generally almost makes sense, given that #16 is the last one in the table, the silence disambiguation symbol. (Not sure why "real" is emitted at L_E; I would rather expect it to be emitted at #1.) What I do not understand is what exactly the debug trace represents, and what I should make of it. It is a path through the FST graph, but I do not understand what exactly this path is, and what this endless walk around the loop means.
>> >
>> > -kkm
>> >
>> >> -----Original Message-----
>> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> Sent: 2015-06-15 1858
>> >> To: Kirill Katsnelson
>> >> Cc: kal...@li...
>> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >>
>> >> Look into the "backoff disambiguation symbol", normally called #0.
>> >> The reason why it is needed should be explained in the hbka.pdf paper.
>> >> Dan
>> >>
>> >>
>> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson <kir...@sm...> wrote:
>> >> > Thank you! The output consists of some sequences as you described, quickly falling into a short, endlessly repeated loop.
>> >> >
>> >> > The non-repeated section ends with the osymbols (excluding epsilons) "whatsoever on vacation up", and then the repeated part looks like "#1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The word "real" is spelled "R_B IY1_I L_E #1" in L_disambig.
>> >> >
>> >> > Both LMs contain a bigram for "vacation up" and a trigram "vacation up there". "up real" is a bigram in both, with 3-grams "up real quick" and "up real quickly". "up real" is also a tail of a few other 3-grams, but these are also the same in both models (up to their weights).
>> >> >
>> >> > It looks like I do not understand what I should make of this debug data in the end :(
>> >> >
>> >> > -kkm
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> >> Sent: 2015-06-15 1821
>> >> >> To: Kirill Katsnelson
>> >> >> Cc: kal...@li...
>> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >> >>
>> >> >> > I have a small set of sentences with repeat counts, and am generating an LM out of it. One is generated by a horrible local tool whose workings I have trouble tracing. For this one, L*G composition takes about 20 seconds on my CPU. Another LM I just generated out of the same files with srilm 1.7.1 ngram-count. This one has been sitting in mkgraphs.sh on the L_disambig*G composition step for about 30 minutes, and is still churning. fstdeterminizestar --use-log=true is running at 100%. L_disambig.fst is the same file in both cases. It looks like the G is making it not determinizable, although I have no idea how it came to be.
>> >> >> >
>> >> >> > Could anyone share advice on tracking down the problem? Thanks.
>> >> >>
>> >> >> You can send a signal to that program, like
>> >> >>   kill -SIGUSR1 process-id
>> >> >> and it will print out some info about the symbol sequences involved. I think it is like
>> >> >>   isymbol1 (osymbol1) isymbol2 (osymbol2)
>> >> >> and so on. Usually there is a particular word sequence that is problematic.
>> >> >> Dan
>> >> >>
>> >> >> >
>> >> >> > -kkm
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > Kaldi-users mailing list
>> >> >> > Kal...@li...
>> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users
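
One of the replies in this thread suggests that words with an empty pronunciation in the lexicon could cause this kind of confusion around the backoff symbol. A minimal sketch of such a check follows. Note that find_empty_prons is a hypothetical helper, not a Kaldi tool, and it assumes the lexiconp_silprob layout seen in the grep output in the thread (word, four numeric fields, then phones), counting #N disambiguation symbols as non-phones.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kaldi): print every word whose
# pronunciation contains no real phones.
# Assumed line layout, taken from the grep output in the thread:
#   word  num  num  num  num  phone1 phone2 ...
# Fields 6..NF are phones; symbols like #1, #16 are disambiguation
# symbols and are not counted as phones.
find_empty_prons() {
  awk '
    NF == 0 { next }                      # skip blank lines
    {
      n = 0
      for (i = 6; i <= NF; i++)
        if ($i !~ /^#[0-9]+$/) n++        # count real phones only
      if (n == 0) print $1                # no phones: report the word
    }
  ' "$1"
}
```

It could be run as `find_empty_prons data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt`; empty output means no entry was flagged under the assumed layout.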