From: Kirill K. <kir...@sm...> - 2015-06-16 06:03:59
Nope. The only thing I am thinking of doing is to bisect it somehow, to
get a minimal grammar that still refuses to determinize. I tried
different smoothing and played with other switches to ngram-count, but
it still does loop. Are there any known problems with srilm-generated
models?

 -kkm

> -----Original Message-----
> From: Daniel Povey [mailto:dp...@gm...]
> Sent: 2015-06-15 2248
> To: Kirill Katsnelson
> Cc: kal...@li...
> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>
> OOVs should be OK.
> Make sure there are no n-grams with things like <s> <s>
>
> e.g. see the lines
>   grep -v '<s> <s>' | \
>   grep -v '</s> <s>' | \
>   grep -v '</s> </s>' | \
>
> in the WSJ script:
>
> gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \
>   grep -v '<s> <s>' | \
>   grep -v '</s> <s>' | \
>   grep -v '</s> </s>' | \
>   arpa2fst - | fstprint | \
>   utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \
>   utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$test/words.txt \
>     --osymbols=$test/words.txt --keep_isymbols=false --keep_osymbols=false | \
>   fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
>
> Dan
>
>
> On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson
> <kir...@sm...> wrote:
> > Bingo. G.fst is not determinizable (the "good" G.fst takes under a
> > second to determinize). And the bad one loops at the word "zero"
> > like this
> >
> > #0
> > unsure unsure
> > #0
> > of of
> > #0
> > yours yours
> > #0
> > is is
> > #0
> > your your
> > #0
> > zip zip
> > #0
> > wrong wrong
> > #0
> > with with
> > #0
> > zero zero
> > #0
> > zero zero
> > ....
> >
> > I am taking the LM straight from ngram-count to the standard
> > pipeline, nothing fancy. The only thing is it has a lot of OOVs:
> >
> > remove_oovs.pl: removed 4646 lines.
> >
> > Is this generally a problem? So does my "good" arpa LM. I grepped
> > both for the word zero, but could not spot anything outrageous. Can
> > you think of anything I can look for?
> >
> > My source is no longer than 10 days old. Here's the pipeline, just
> > in case.
> >
> > cat $src/$arpalm | tr -d '\r' | \
> >   utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt
> >
> > cat $src/$arpalm | tr -d '\r' | \
> >   arpa2fst - | fstprint | \
> >   utils/remove_oovs.pl $lang/lm_oovs.txt | \
> >   utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$lang/words.txt \
> >     --osymbols=$lang/words.txt --keep_isymbols=false --keep_osymbols=false | \
> >   fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst
> >
> > -kkm
> >
> >
> >> -----Original Message-----
> >> From: Daniel Povey [mailto:dp...@gm...]
> >> Sent: 2015-06-15 2206
> >> To: Kirill Katsnelson
> >> Cc: kal...@li...
> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
> >>
> >> I don't recommend to look at the fstdeterminizestar algorithm
> >> itself- it's very complicated. Instead focus on the definition of
> >> "determinizable" and the twins property, and figure out what path
> >> you are taking through L.fst and G.fst. Trying to
> >> fstdeterminizestar G.fst directly, and seeing whether it terminates
> >> or not, may tell you something; if it fails, send the signal and
> >> see what happens. fstdeterminizestar does care about the weights,
> >> but only to the extent that they are the same or different from
> >> each other; and if your G.fst is generated from arpa2fst the
> >> pipeline should work for any ARPA-format language model- make sure
> >> you are using an up-to-date Kaldi though, there have been fixes as
> >> recently as a few months ago.
> >> The presence of SIL is not surprising, it is the optional-silence
> >> added by the lexicon. I think that script is adding #16 if it does
> >> *not* take the optional silence, otherwise it adds the phone SIL.
> >> Since you are calling your FST a "grammar" I'm wondering whether
> >> you have done something fancy with mapping words to FSTs or
> >> something like that, which is causing the result to not be
> >> determinizable.
> >>
> >> Dan
> >>
> >>
> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson
> >> <kir...@sm...> wrote:
> >> > Thank you very much for your help Dan, but I am still stuck.
> >> >
> >> > First of all, a question: does the fstdeterminizestar algorithm
> >> > depend on actual backoff and n-gram probabilities, i.e. will it
> >> > behave differently if the numbers in arpa model file are
> >> > different? Or does it depend only on arc labels but not weights?
> >> > I am looking at the code but certainly I am far from being able
> >> > to understand it. I cheated by looking at all if conditions in
> >> > it, and this one in EpsilonClosure is seemingly the only one
> >> > dealing with weights:
> >> >
> >> >   if (! ApproxEqual(weight, iter->second.weight, delta_)) {
> >> >     // add extra part of weight to queue.
> >> >
> >> > (In ProcessFinal it also has "if (this_final_weight !=
> >> > Weight::Zero())" but I do not believe it is relevant?)
> >> >
> >> > I am trying to understand how to dig into the problem--are
> >> > weights in the picture actually.
> >> >
> >> > Also, just for a test, I ran the grammar through a "grep -v 'real
> >> > real'", and indeed got a similar loop on the word "very" which is
> >> > also often repeated. But the "real real" 2- and 3-grams are there
> >> > in the "good" grammar too.
> >> >
> >> > Another thing I do not understand is the presence of the SIL
> >> > ilabel in the backtrace.
> >> > Here's the beginning of the trace that leads to the infinite loop
> >> > as decoded with a little script I wrote (format is ilabel [ TAB
> >> > olabel ]):
> >> >
> >> > #16
> >> > #0
> >> > V_B
> >> > Y_I
> >> > UW1_I
> >> > Z_E views
> >> > #2
> >> > SIL
> >> > #0
> >> > AH0_B
> >> > N_I
> >> > SH_I unsure
> >> > UH1_I
> >> > R_E
> >> >
> >> > Note the presence of SIL at line 8. This is not in lexicon:
> >> >
> >> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
> >> > !SIL 1 0.20 1.00 1.00 SIL_S
> >> > $
> >> >
> >> > Is this a hint? How did it get there at all? I am using a
> >> > standard script to build the L_disambig.fst:
> >> >
> >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
> >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
> >> > utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
> >> >     data/local/dict/silprob.txt $silphone '#'$ndisambig | \
> >> >   fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
> >> >     --keep_isymbols=false --keep_osymbols=false | \
> >> >   fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
> >> >   fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
> >> >
> >> > I checked the lexicon, and there are indeed only real phones at
> >> > the beginning of each word, no empty positions and no #N symbols.
> >> >
> >> > -kkm
> >> >
> >> >> -----Original Message-----
> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> Sent: 2015-06-15 1944
> >> >> To: Kirill Katsnelson
> >> >> Cc: kal...@li...
> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> completes
> >> >>
> >> >> I think the confusion is probably between two loops with "real"
> >> >> on them in G.fst: one loop where you always take the bigram
> >> >> probability, and one where you always take the unigram
> >> >> probability.
> >> >> Or maybe a similar confusion between a loop where you use the
> >> >> trigram "real real real" and the bigram "real real". Those loops
> >> >> are expected to exist. Probably the issue is that something
> >> >> happened at the start of the sequence which caused the FST to be
> >> >> confused about which of those two states it was in. If you have
> >> >> any empty words (words with empty pronunciation) in your lexicon
> >> >> this could possibly happen, as it would be confused between
> >> >> taking a normal word, then the backoff symbol, vs. taking a
> >> >> normal word, then the empty word, then the backoff symbol.
> >> >> I think the current Kaldi graph-creation scripts check for empty
> >> >> words in the lexicon, for this reason.
> >> >>
> >> >> Dan
> >> >>
> >> >>
> >> >>
> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( )
> >> >> > generally almost makes sense, given that #16 is the last one
> >> >> > in table, the silence disambiguation symbol. (Not sure why
> >> >> > "real" is emitted at L_E--I would rather expect it to be
> >> >> > emitted at #1.) What I do not understand is what exactly the
> >> >> > debug trace represents, and what should I make out of it. It
> >> >> > is a path through the FST graph, but I do not understand what
> >> >> > is this path exactly, and what does this endless walk of this
> >> >> > loop mean.
> >> >> >
> >> >> > -kkm
> >> >> >
> >> >> >> -----Original Message-----
> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> >> Sent: 2015-06-15 1858
> >> >> >> To: Kirill Katsnelson
> >> >> >> Cc: kal...@li...
> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> >> completes
> >> >> >>
> >> >> >> Look into the "backoff disambiguation symbol", normally
> >> >> >> called #0. The reason why it is needed should be explained in
> >> >> >> the hbka.pdf paper.
> >> >> >> Dan
> >> >> >>
> >> >> >>
> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
> >> >> >> <kir...@sm...> wrote:
> >> >> >> > Thank you! The output consists of some sequences as you
> >> >> >> > described, quickly falling into a short ever repeated loop.
> >> >> >> >
> >> >> >> > The non-repeated section ends up with osymbols (excluding
> >> >> >> > epsilons) "whatsoever on vacation up", and then the
> >> >> >> > repeated part looks like " #1 ( ) #16 ( ) #0 ( ) R_B ( )
> >> >> >> > IY1_I ( ) L_E (real)". The word "real" is spelled "R_B
> >> >> >> > IY1_I L_E #1" in L_disambig.
> >> >> >> >
> >> >> >> > Both LMs contain a bigram for "vacation up" and a trigram
> >> >> >> > "vacation up there". "up real" is a bigram in both, with
> >> >> >> > 3-grams "up real quick" and "up real quickly". "up real" is
> >> >> >> > also a tail of a few other 3-grams, but these are also the
> >> >> >> > same in both models (up to their weights).
> >> >> >> >
> >> >> >> > It looks like I do not understand what I should make in the
> >> >> >> > end out of this debug data :(
> >> >> >> >
> >> >> >> > -kkm
> >> >> >> >
> >> >> >> >> -----Original Message-----
> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
> >> >> >> >> Sent: 2015-06-15 1821
> >> >> >> >> To: Kirill Katsnelson
> >> >> >> >> Cc: kal...@li...
> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
> >> >> >> >> completes
> >> >> >> >>
> >> >> >> >> > I have a small set of sentences with repeat counts, and
> >> >> >> >> > generating an LM out of it. One is generated by a
> >> >> >> >> > horrible local tool I have trouble tracing exactly how.
> >> >> >> >> > For this one L*G composition takes about 20 seconds on
> >> >> >> >> > my CPU. Another LM I just generated out of the same
> >> >> >> >> > files with srilm 1.7.1 ngram-count.
> >> >> >> >> > This one has been sitting in mkgraph.sh on the
> >> >> >> >> > L_disambig*G composition step for about 30 minutes, and
> >> >> >> >> > still churning. fstdeterminizestar --use-log=true is
> >> >> >> >> > running at 100%. L_disambig.fst is the same file in both
> >> >> >> >> > cases. Looks like the G is making it not determinizable,
> >> >> >> >> > although I have no idea how it came to be.
> >> >> >> >> >
> >> >> >> >> > Could anyone share advice on tracking down the problem?
> >> >> >> >> > Thanks.
> >> >> >> >>
> >> >> >> >> You can send a signal to that program like
> >> >> >> >> kill -SIGUSR1 process-id
> >> >> >> >> and it will print out some info about the symbol sequences
> >> >> >> >> involved, I think it is like
> >> >> >> >> isymbol1 (osymbol1) isymbol2 (osymbol2) and so on.
> >> >> >> >> Usually there is a particular word sequence that is
> >> >> >> >> problematic.
> >> >> >> >> Dan
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > -kkm
> >> >> >> >> >
> >> >> >> >> > ------------------------------------------------------------
> >> >> >> >> > _______________________________________________
> >> >> >> >> > Kaldi-users mailing list
> >> >> >> >> > Kal...@li...
> >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users
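
Two of the checks suggested in this thread (no n-grams such as `<s> <s>` in the ARPA file, and no words with an empty pronunciation in the lexicon) can be run before building G.fst and L_disambig.fst. The following is only a rough sketch: both function names are hypothetical and are not part of Kaldi, and the count of five leading non-phone fields for the silprob lexicon is an assumption read off the `!SIL 1 0.20 1.00 1.00 SIL_S` line quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; neither function is part of Kaldi.

# Count ARPA lines (read from stdin) that contain two adjacent
# sentence-boundary tokens, i.e. the same three patterns that the WSJ
# recipe quoted in the thread removes with "grep -v".
count_bad_boundary_ngrams() {
  # grep -c exits with status 1 when the count is 0; mask that with "|| true".
  grep -c -E '<s> <s>|</s> <s>|</s> </s>' || true
}

# Print lexicon entries (read from stdin) that carry no phones at all.
# $1 = number of leading non-phone fields (default 1):
#   1 for a plain "word phone1 phone2 ..." lexicon,
#   5 for entries shaped like "!SIL 1 0.20 1.00 1.00 SIL_S" (assumed layout).
list_empty_prons() {
  awk -v lead="${1:-1}" 'NF > 0 && NF <= lead'
}
```

Possible usage, with `lm.arpa.gz` standing in for whatever LM file is being debugged: `gunzip -c lm.arpa.gz | count_bad_boundary_ngrams` should print 0 for a clean model, and `list_empty_prons 5 < $lang/dict/lexiconp_silprob_disambig.txt` should print nothing. Neither check proves that G.fst is determinizable; they only rule out two causes mentioned in the replies.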