From: Daniel P. <dp...@gm...> - 2015-06-16 22:59:09
Guoguo is going to fix arpa2fst tonight so that it will detect that.
Later when we rewrite it we'll include that feature.
Dan

On Tue, Jun 16, 2015 at 6:58 PM, Kirill Katsnelson <kir...@sm...> wrote:
> Holy guacamole! That was it. Thank you very very much.
>
> Perhaps arpa2fst v2.0 would detect such bloopers.
>
>> -----Original Message-----
>> From: Daniel Povey [mailto:dp...@gm...]
>> Sent: 2015-06-16 1526
>> To: Kirill Katsnelson
>> Cc: kal...@li...
>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>
>> It turns out the problem was probably caused by the end-of-sentence
>> symbol </s> appearing in inappropriate places in the LM, at the start
>> of n-grams rather than the end. Probably the training data was
>> contaminated somehow by </s>.
>> Dan
>>
>>
>> On Tue, Jun 16, 2015 at 2:07 PM, Daniel Povey <dp...@gm...> wrote:
>> >> I am currently trying to get a minimal reproduction with a script.
>> >> Let it run for a while. I'll send you what remains of it, and hope
>> >> it might give me an idea too.
>> >>
>> >> Looks like that fstdeterminize may have completed on this grammar
>> >> (how do you call the thing symbolized as $G$? "grammar" sounded
>> >> confusing, as I understand, but I have no other word not exceeding 2
>> >> syllables :))
>> >
>> > I would call it an LM.
>> >
>> >>> I have left one running by mistake before going to sleep, and it
>> >>> was done. I am running one again with the time command to make sure
>> >>> this is not a fluke. So it is possible that it is not exactly
>> >>> non-determinizable, but instead takes enormous time (hours on one
>> >>> LM, < 1 sec on another). Which is the same thing from the
>> >>> engineering standpoint, close enough, as those engineering vs
>> >>> mathematics jokes go. But jokes aside, I want something more bounded
>> >>> for a production system, so I need to understand what throws it off
>> >>> so badly.
>> >
>> > I would still call it a problem. Check if your ARPA contains <eps> or
>> > #0.
>> > I may need to add checks for this into arpa2fst (which we will
>> > rewrite at some point anyway). Another problem could be weird things
>> > like stray \r's which make one word seem like two in some
>> > circumstances.
>> > If I saw the output of arpa2fst I could probably figure out fairly
>> > quickly what the problem was. The way I would debug this is to trace
>> > through your LM FST from the start and follow those symbols (or
>> > epsilons) on that trace from the determinization failure, and see how
>> > there are two different paths.
>> > It's better if you share a couple different traces, not just one, so
>> > we can see what's in common.
>> >
>> >> Is fstdeterminizestar more than fstrmepsilon ∘ fstdeterminize (the
>> >> latter with the kaldi patch)?
>> >
>> > No, it should be faster. fstrmepsilon ∘ fstdeterminize should fail
>> > too.
>> >
>> >> Ah, and this is a Linux machine. So everything looks very very
>> >> standard (oops. Did I just create an infinite loop by repeating a
>> >> word?).
>> >
>> > I am considering changing the way the LM disambig symbols are used to
>> > make this kind of problem less likely to happen in future, by having
>> > several disambig symbols for the LM, one per order, instead of just
>> > one.
>> >
>> > Dan
>> >
>> >
>> >
>> >>> -----Original Message-----
>> >>> From: Daniel Povey [mailto:dp...@gm...]
>> >>> Sent: 2015-06-15 2340
>> >>> To: Kirill Katsnelson
>> >>> Cc: kal...@li...
>> >>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >>>
>> >>> In general SRILM language models are OK, but something weird could
>> >>> have happened, especially on an unusual platform like Windows.
>> >>> Look for duplicate lines with apparently the same n-gram on, and
>> >>> also send to me (but not to kaldi-user) the arpa LM.
>> >>> Dan
>> >>>
>> >>>
>> >>> On Tue, Jun 16, 2015 at 2:03 AM, Kirill Katsnelson
>> >>> <kir...@sm...> wrote:
>> >>> > Nope.
>> >>> > The only thing I am thinking of doing is to bisect it somehow,
>> >>> > to get a minimal grammar that still refuses to determinize. I
>> >>> > tried different smoothing and played with other switches to
>> >>> > ngram_count, but it still does loop. Are there any known problems
>> >>> > with srilm-generated models?
>> >>> >
>> >>> > -kkm
>> >>> >
>> >>> >> -----Original Message-----
>> >>> >> From: Daniel Povey [mailto:dp...@gm...]
>> >>> >> Sent: 2015-06-15 2248
>> >>> >> To: Kirill Katsnelson
>> >>> >> Cc: kal...@li...
>> >>> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
>> >>> >> completes
>> >>> >>
>> >>> >> OOVs should be OK.
>> >>> >> Make sure there are no n-grams with things like <s> <s>
>> >>> >>
>> >>> >> e.g. see the lines
>> >>> >>   grep -v '<s> <s>' | \
>> >>> >>   grep -v '</s> <s>' | \
>> >>> >>   grep -v '</s> </s>' | \
>> >>> >>
>> >>> >> in the WSJ script:
>> >>> >>
>> >>> >>   gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \
>> >>> >>     grep -v '<s> <s>' | \
>> >>> >>     grep -v '</s> <s>' | \
>> >>> >>     grep -v '</s> </s>' | \
>> >>> >>     arpa2fst - | fstprint | \
>> >>> >>     utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \
>> >>> >>     utils/eps2disambig.pl | utils/s2eps.pl | \
>> >>> >>     fstcompile --isymbols=$test/words.txt \
>> >>> >>       --osymbols=$test/words.txt --keep_isymbols=false \
>> >>> >>       --keep_osymbols=false | \
>> >>> >>     fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
>> >>> >>
>> >>> >> Dan
>> >>> >>
>> >>> >>
>> >>> >> On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson
>> >>> >> <kir...@sm...> wrote:
>> >>> >> > Bingo. G.fst is not determinizable (the "good" G.fst takes
>> >>> >> > under a second to determinize).
>> >>> >> > And the bad one loops at the word "zero" like this
>> >>> >> >
>> >>> >> > #0
>> >>> >> > unsure  unsure
>> >>> >> > #0
>> >>> >> > of  of
>> >>> >> > #0
>> >>> >> > yours  yours
>> >>> >> > #0
>> >>> >> > is  is
>> >>> >> > #0
>> >>> >> > your  your
>> >>> >> > #0
>> >>> >> > zip  zip
>> >>> >> > #0
>> >>> >> > wrong  wrong
>> >>> >> > #0
>> >>> >> > with  with
>> >>> >> > #0
>> >>> >> > zero  zero
>> >>> >> > #0
>> >>> >> > zero  zero
>> >>> >> > ....
>> >>> >> >
>> >>> >> > I am taking the LM straight from ngram_counts to the standard
>> >>> >> > pipeline, nothing fancy. The only thing is it has a lot of OOVs:
>> >>> >> >
>> >>> >> > remove_oovs.pl: removed 4646 lines.
>> >>> >> >
>> >>> >> > Is this generally a problem? So does my "good" arpa LM. I
>> >>> >> > grepped both for the word zero, but could not spot anything
>> >>> >> > outrageous. Can you think of anything I can look for?
>> >>> >> >
>> >>> >> > My source is no longer than 10 days old. Here's the pipeline,
>> >>> >> > just in case.
>> >>> >> >
>> >>> >> > cat $src/$arpalm | tr -d '\r' | \
>> >>> >> >   utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt
>> >>> >> >
>> >>> >> > cat $src/$arpalm | tr -d '\r' | \
>> >>> >> >   arpa2fst - | fstprint | \
>> >>> >> >   utils/remove_oovs.pl $lang/lm_oovs.txt | \
>> >>> >> >   utils/eps2disambig.pl | utils/s2eps.pl | \
>> >>> >> >   fstcompile --isymbols=$lang/words.txt \
>> >>> >> >     --osymbols=$lang/words.txt --keep_isymbols=false \
>> >>> >> >     --keep_osymbols=false | \
>> >>> >> >   fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst
>> >>> >> >
>> >>> >> > -kkm
>> >>> >> >
>> >>> >> >
>> >>> >> >> -----Original Message-----
>> >>> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >>> >> >> Sent: 2015-06-15 2206
>> >>> >> >> To: Kirill Katsnelson
>> >>> >> >> Cc: kal...@li...
>> >>> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
>> >>> >> >> completes
>> >>> >> >>
>> >>> >> >> I don't recommend to look at the fstdeterminizestar algorithm
>> >>> >> >> itself- it's very complicated. Instead focus on the definition
>> >>> >> >> of "determinizable" and the twins property, and figure out
>> >>> >> >> what path you are taking through L.fst and G.fst. Trying to
>> >>> >> >> fstdeterminizestar G.fst directly, and seeing whether it
>> >>> >> >> terminates or not, may tell you something; if it fails, send
>> >>> >> >> the signal and see what happens.
>> >>> >> >> fstdeterminizestar does care about the weights, but only to
>> >>> >> >> the extent that they are the same or different from each
>> >>> >> >> other; and if your G.fst is generated from arpa2fst the
>> >>> >> >> pipeline should work for any ARPA-format language model- make
>> >>> >> >> sure you are using an up-to-date Kaldi though, there have been
>> >>> >> >> fixes as recently as a few months ago.
>> >>> >> >> The presence of SIL is not surprising, it is the
>> >>> >> >> optional-silence added by the lexicon. I think that script is
>> >>> >> >> adding #16 if it does *not* take the optional silence,
>> >>> >> >> otherwise it adds the phone SIL.
>> >>> >> >> Since you are calling your FST a "grammar" I'm wondering
>> >>> >> >> whether you have done something fancy with mapping words to
>> >>> >> >> FSTs or something like that, which is causing the result to
>> >>> >> >> not be determinizable.
>> >>> >> >>
>> >>> >> >> Dan
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson
>> >>> >> >> <kir...@sm...> wrote:
>> >>> >> >> > Thank you very much for your help Dan, but I am still stuck.
>> >>> >> >> >
>> >>> >> >> > First of all, a question: does the fstdeterminizestar
>> >>> >> >> > algorithm depend on actual backoff and n-gram probabilities,
>> >>> >> >> > i.e. will it behave differently if the numbers in arpa model
>> >>> >> >> > file are different? Or does it depend only on arc labels but
>> >>> >> >> > not weights? I am looking at the code but certainly I am far
>> >>> >> >> > from being able to understand it. I cheated by looking at
>> >>> >> >> > all if conditions in it, and this one in EpsilonClosure is
>> >>> >> >> > seemingly the only one dealing with weights:
>> >>> >> >> >
>> >>> >> >> >   if (! ApproxEqual(weight, iter->second.weight, delta_)) {
>> >>> >> >> >     // add extra part of weight to queue.
>> >>> >> >> >
>> >>> >> >> > (In ProcessFinal it also has "if (this_final_weight !=
>> >>> >> >> > Weight::Zero())" but I do not believe it is relevant?)
>> >>> >> >> >
>> >>> >> >> > I am trying to understand how to dig into the problem--are
>> >>> >> >> > weights in the picture actually.
>> >>> >> >> >
>> >>> >> >> > Also, just for a test, I ran the grammar through a "grep -v
>> >>> >> >> > 'real real'", and indeed got a similar loop on the word
>> >>> >> >> > "very" which is also often repeated. But the "real real" 2-
>> >>> >> >> > and 3-grams are there in the "good" grammar too.
>> >>> >> >> >
>> >>> >> >> > Another thing I do not understand is the presence of the SIL
>> >>> >> >> > ilabel in the backtrace.
>> >>> >> >> > Here's the beginning of the trace that leads to the infinite
>> >>> >> >> > loop as decoded with a little script I wrote (format is
>> >>> >> >> > ilabel [ TAB olabel ]):
>> >>> >> >> >
>> >>> >> >> > #16
>> >>> >> >> > #0
>> >>> >> >> > V_B
>> >>> >> >> > Y_I
>> >>> >> >> > UW1_I
>> >>> >> >> > Z_E     views
>> >>> >> >> > #2
>> >>> >> >> > SIL
>> >>> >> >> > #0
>> >>> >> >> > AH0_B
>> >>> >> >> > N_I
>> >>> >> >> > SH_I    unsure
>> >>> >> >> > UH1_I
>> >>> >> >> > R_E
>> >>> >> >> >
>> >>> >> >> > Note the presence of SIL at line 8. This is not in lexicon:
>> >>> >> >> >
>> >>> >> >> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
>> >>> >> >> > !SIL 1 0.20 1.00 1.00 SIL_S
>> >>> >> >> > $
>> >>> >> >> >
>> >>> >> >> > Is this a hint? How did it get there at all? I am using a
>> >>> >> >> > standard script to build the L_disambig.fst:
>> >>> >> >> >
>> >>> >> >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
>> >>> >> >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
>> >>> >> >> > utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
>> >>> >> >> >     data/local/dict/silprob.txt $silphone '#'$ndisambig | \
>> >>> >> >> >   fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
>> >>> >> >> >     --keep_isymbols=false --keep_osymbols=false | \
>> >>> >> >> >   fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
>> >>> >> >> >   fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
>> >>> >> >> >
>> >>> >> >> > I checked the lexicon, and there are indeed only real phones
>> >>> >> >> > at the beginning of each word, no empty positions and no #N
>> >>> >> >> > symbols.
>> >>> >> >> >
>> >>> >> >> > -kkm
>> >>> >> >> >
>> >>> >> >> >> -----Original Message-----
>> >>> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >>> >> >> >> Sent: 2015-06-15 1944
>> >>> >> >> >> To: Kirill Katsnelson
>> >>> >> >> >> Cc: kal...@li...
>> >>> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never
>> >>> >> >> >> completes
>> >>> >> >> >>
>> >>> >> >> >> I think the confusion is probably between two loops with
>> >>> >> >> >> "real" on them in G.fst: one loop where you always take the
>> >>> >> >> >> bigram probability, and one where you always take the
>> >>> >> >> >> unigram probability. Or maybe a similar confusion between a
>> >>> >> >> >> loop where you use the trigram "real real real" and the
>> >>> >> >> >> bigram "real real". Those loops are expected to exist.
>> >>> >> >> >> Probably the issue is that something happened at the start
>> >>> >> >> >> of the sequence which caused the FST to be confused about
>> >>> >> >> >> which of those two states it was in. If you have any empty
>> >>> >> >> >> words (words with empty pronunciation) in your lexicon this
>> >>> >> >> >> could possibly happen, as it would be confused between
>> >>> >> >> >> taking a normal word, then the backoff symbol, vs. taking a
>> >>> >> >> >> normal word, then the empty word, then the backoff symbol.
>> >>> >> >> >> I think the current Kaldi graph-creation script checks for
>> >>> >> >> >> empty words in the lexicon, for this reason.
>> >>> >> >> >>
>> >>> >> >> >> Dan
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >>
>> >>> >> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( )
>> >>> >> >> >> > #0 ( ) generally almost makes sense, given that #16 is the
>> >>> >> >> >> > last one in table, the silence disambiguation symbol. (Not
>> >>> >> >> >> > sure why "real" is emitted at L_E--I would rather expect
>> >>> >> >> >> > it to be emitted at #1.) What I do not understand is what
>> >>> >> >> >> > exactly the debug trace represents, and what should I make
>> >>> >> >> >> > out of it. It is a path through the FST graph, but I do
>> >>> >> >> >> > not understand what is this path exactly, and what does
>> >>> >> >> >> > this endless walk of this loop mean.
>> >>> >> >> >> >
>> >>> >> >> >> > -kkm
>> >>> >> >> >> >
>> >>> >> >> >> >> -----Original Message-----
>> >>> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >>> >> >> >> >> Sent: 2015-06-15 1858
>> >>> >> >> >> >> To: Kirill Katsnelson
>> >>> >> >> >> >> Cc: kal...@li...
>> >>> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G)
>> >>> >> >> >> >> never completes
>> >>> >> >> >> >>
>> >>> >> >> >> >> Look into the "backoff disambiguation symbol", normally
>> >>> >> >> >> >> called #0. The reason why it is needed should be
>> >>> >> >> >> >> explained in the hbka.pdf paper.
>> >>> >> >> >> >> Dan
>> >>> >> >> >> >>
>> >>> >> >> >> >>
>> >>> >> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
>> >>> >> >> >> >> <kir...@sm...> wrote:
>> >>> >> >> >> >> > Thank you! The output consists of some sequences as
>> >>> >> >> >> >> > you described, quickly falling into a short ever
>> >>> >> >> >> >> > repeated loop.
>> >>> >> >> >> >> >
>> >>> >> >> >> >> > The non-repeated section ends up with osymbols
>> >>> >> >> >> >> > (excluding epsilons) "whatsoever on vacation up", and
>> >>> >> >> >> >> > then the repeated part looks like
>> >>> >> >> >> >> > "#1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)".
>> >>> >> >> >> >> > The word "real" is spelled "R_B IY1_I L_E #1" in
>> >>> >> >> >> >> > L_disambig.
>> >>> >> >> >> >> >
>> >>> >> >> >> >> > Both LMs contain a bigram for "vacation up" and a
>> >>> >> >> >> >> > trigram "vacation up there". "up real" is a bigram in
>> >>> >> >> >> >> > both, with 3-grams "up real quick" and "up real
>> >>> >> >> >> >> > quickly". "up real" is also a tail of a few other
>> >>> >> >> >> >> > 3-grams, but these are also same in both models (up to
>> >>> >> >> >> >> > their weights).
>> >>> >> >> >> >> >
>> >>> >> >> >> >> > It looks I do not understand what should I make in the
>> >>> >> >> >> >> > end out of this debug data :(
>> >>> >> >> >> >> >
>> >>> >> >> >> >> > -kkm
>> >>> >> >> >> >> >
>> >>> >> >> >> >> >> -----Original Message-----
>> >>> >> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >>> >> >> >> >> >> Sent: 2015-06-15 1821
>> >>> >> >> >> >> >> To: Kirill Katsnelson
>> >>> >> >> >> >> >> Cc: kal...@li...
>> >>> >> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G)
>> >>> >> >> >> >> >> never completes
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> > I have a small set of sentences with repeat counts,
>> >>> >> >> >> >> >> > and generating an LM out of it. One is generated by
>> >>> >> >> >> >> >> > a horrible local tool I have trouble tracing
>> >>> >> >> >> >> >> > exactly how.
>> >>> >> >> >> >> >> > For this one L*G composition takes about 20 seconds
>> >>> >> >> >> >> >> > on my CPU. Another LM I just generated out of the
>> >>> >> >> >> >> >> > same files with srilm 1.7.1 ngram-count. This one
>> >>> >> >> >> >> >> > has been sitting in mkgraphs.sh on L_disambig*G
>> >>> >> >> >> >> >> > composition step for about 30 minutes, and still
>> >>> >> >> >> >> >> > churning. fstdeterminizestar --use-log=true is
>> >>> >> >> >> >> >> > running at 100%. L_disambig.fst is the same file in
>> >>> >> >> >> >> >> > both cases. Looks like the G is making it not
>> >>> >> >> >> >> >> > determinizable, although I have no idea how it came
>> >>> >> >> >> >> >> > to be.
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> > Anyone could share an advice on tracking down the
>> >>> >> >> >> >> >> > problem? Thanks.
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> You can send a signal to that program like
>> >>> >> >> >> >> >>   kill -SIGUSR1 process-id
>> >>> >> >> >> >> >> and it will print out some info about the symbol
>> >>> >> >> >> >> >> sequences involved, I think it is like
>> >>> >> >> >> >> >>   isymbol1 (osymbol1) isymbol2 (osymbol2)
>> >>> >> >> >> >> >> and so on. Usually there is a particular word
>> >>> >> >> >> >> >> sequence that is problematic.
>> >>> >> >> >> >> >> Dan
>> >>> >> >> >> >> >>
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> > -kkm
>> >>> >> >> >> >> >> >
>> >>> >> >> >> >> >> > ------------------------------------------------------------
>> >>> >> >> >> >> >> > _______________________________________________
>> >>> >> >> >> >> >> > Kaldi-users mailing list
>> >>> >> >> >> >> >> > Kal...@li...
>> >>> >> >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users
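
The root cause reported at the top of this thread (</s> appearing at the start of n-grams rather than the end) can be screened for before the LM ever reaches arpa2fst. What follows is only a rough Python sketch and not part of Kaldi: the function name find_suspicious_ngrams and the particular set of checks are assumptions made for illustration. It reads ARPA-format text and reports n-gram entries where </s> is not the last word, <s> is not the first word, the reserved symbols <eps> or #0 occur, or the field count is off (which a stray \r inside a line could cause).

```python
import re

# Symbols that should never appear as words in the ARPA file.
RESERVED = {"<eps>", "#0"}
# Matches section headers such as "\1-grams:" or "\3-grams:".
SECTION = re.compile(r"^\\(\d+)-grams:$")


def find_suspicious_ngrams(lines):
    """Return a list of (line_number, reasons, words) for odd-looking entries.

    `lines` is any iterable of text lines from an ARPA-format LM.
    This is a hypothetical helper, not an arpa2fst feature.
    """
    problems = []
    order = 0  # 0 means we are outside any \N-grams: section
    for lineno, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n").strip()
        match = SECTION.match(line)
        if match:
            order = int(match.group(1))
            continue
        if line == "\\end\\":
            order = 0
            continue
        if order == 0 or not line:
            continue
        # Entry layout: log10 probability, N words, optional backoff weight.
        fields = line.split()
        if len(fields) < order + 1:
            problems.append((lineno, "too few fields", fields))
            continue
        words = fields[1:order + 1]
        reasons = []
        if "</s>" in words[:-1]:
            reasons.append("</s> before the end of the n-gram")
        if "<s>" in words[1:]:
            reasons.append("<s> after the start of the n-gram")
        if RESERVED.intersection(words):
            reasons.append("reserved symbol <eps> or #0")
        if len(fields) > order + 2:
            reasons.append("too many fields")
        if reasons:
            problems.append((lineno, "; ".join(reasons), words))
    return problems
```

Passing an open file object for the ARPA LM to find_suspicious_ngrams and printing the returned tuples should be enough to spot entries like the ones that caused the loop here. The '<s> <s>', '</s> <s>' and '</s> </s>' cases filtered by grep in the WSJ script quoted above are all covered by the first two checks.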