From: Kirill K. <kir...@sm...> - 2015-06-16 02:34:40
The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( ) mostly makes sense, given that #16 is the last one in the table, the silence disambiguation symbol. (Not sure why "real" is emitted at L_E; I would rather expect it to be emitted at #1.)

What I do not understand is what exactly the debug trace represents, and what I should make of it. It is a path through the FST graph, but I do not understand what this path is exactly, or what the endless walk around this loop means.

 -kkm

> -----Original Message-----
> From: Daniel Povey [mailto:dp...@gm...]
> Sent: 2015-06-15 1858
> To: Kirill Katsnelson
> Cc: kal...@li...
> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>
> Look into the "backoff disambiguation symbol", normally called #0.
> The reason why it is needed should be explained in the hbka.pdf paper.
> Dan
>
>
> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson <kir...@sm...> wrote:
> > Thank you! The output consists of some sequences as you described, quickly falling into a short, endlessly repeated loop.
> >
> > The non-repeated section ends up with osymbols (excluding epsilons) "whatsoever on vacation up", and then the repeated part looks like " #1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The word "real" is spelled "R_B IY1_I L_E #1" in L_disambig.
> >
> > Both LMs contain a bigram for "vacation up" and a trigram "vacation up there". "up real" is a bigram in both, with 3-grams "up real quick" and "up real quickly". "up real" is also a tail of a few other 3-grams, but these are also the same in both models (up to their weights).
> >
> > It looks like I do not understand what I should make in the end out of this debug data :(
> >
> > -kkm
> >
> >> -----Original Message-----
> >> From: Daniel Povey [mailto:dp...@gm...]
> >> Sent: 2015-06-15 1821
> >> To: Kirill Katsnelson
> >> Cc: kal...@li...
> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> >>
> >> > I have a small set of sentences with repeat counts, and am generating an LM out of it. One is generated by a horrible local tool I have trouble tracing exactly how. For this one L*G composition takes about 20 seconds on my CPU. Another LM I just generated out of the same files with srilm 1.7.1 ngram-count. This one has been sitting in mkgraphs.sh on the L_disambig*G composition step for about 30 minutes, and is still churning. fstdeterminizestar --use-log=true is running at 100%. L_disambig.fst is the same file in both cases. Looks like the G is making it not determinizable, although I have no idea how it came to be.
> >> >
> >> > Could anyone share advice on tracking down the problem? Thanks.
> >>
> >> You can send a signal to that program like kill -SIGUSR1 process-id and it will print out some info about the symbol sequences involved, I think it is like
> >> isymbol1 (osymbol1) isymbol2 (osymbol2) and so on.
> >> Usually there is a particular word sequence that is problematic.
> >> Dan
> >>
> >> > -kkm
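
The trace format described in the thread, a run of "isymbol (osymbol)" pairs that eventually settles into a repeating loop, can be picked apart mechanically. The following is only a minimal Python sketch, not part of Kaldi: the function names are hypothetical, the trace is assumed to have been saved as plain text, and an empty "( )" or "(<eps>)" is assumed to mean an epsilon output. It splits the trace into pairs, finds the shortest block that repeats at the end, and lists the words emitted inside that block.

```python
import re

# One "isymbol (osymbol)" pair. The output symbol may be empty, which is
# assumed here to stand for epsilon (an assumption about the trace format).
PAIR = re.compile(r"([^\s()]+)\s*\(\s*([^\s()]*)\s*\)")


def parse_trace(text):
    """Return a list of (isymbol, osymbol) pairs; osymbol is '' for epsilon."""
    return [(i, "" if o == "<eps>" else o) for i, o in PAIR.findall(text)]


def shortest_repeating_tail(pairs, min_repeats=2):
    """Return the shortest block repeated at least min_repeats times in a row
    at the end of pairs, or None if the tail does not repeat."""
    n = len(pairs)
    for size in range(1, n // min_repeats + 1):
        block = pairs[n - size:]
        if all(pairs[n - (k + 1) * size:n - k * size] == block
               for k in range(min_repeats)):
            return block
    return None


def output_words(pairs):
    """Return the non-epsilon output symbols, in order."""
    return [o for _, o in pairs if o]


if __name__ == "__main__":
    # The loop quoted in this thread, repeated three times for illustration.
    loop_text = "R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( )"
    trace = " ".join([loop_text] * 3)
    loop = shortest_repeating_tail(parse_trace(trace))
    print("loop isymbols:", [i for i, _ in loop])
    print("words in loop:", output_words(loop))
```

Which rotation of the loop is reported depends on where the captured trace happens to end, which would explain why the same cycle appears starting at #1 in one message and at R_B in the other. The words found inside the loop are the ones to look up in the LM when hunting for the problematic word sequence.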