From: Daniel P. <dp...@gm...> - 2015-06-16 22:26:29
It turns out the problem was probably caused by the end-of-sentence
symbol </s> appearing in inappropriate places in the LM, at the start
of n-grams rather than the end. Probably the training data was
contaminated somehow by </s>.
Dan

On Tue, Jun 16, 2015 at 2:07 PM, Daniel Povey <dp...@gm...> wrote:
>> I am currently trying to get a minimal reproduction with a script. Let it run for a while. I'll send you what remains of it, and hope it might give me an idea too.
>>
>> Looks like that fstdeterminize may have completed on this grammar (how do you call the thing symbolized as $G$? "grammar" sounded confusing, as I understand, but I have no other word not exceeding 2 syllables :))
>
> I would call it an LM.
>
>>> I have left one running by mistake before going to sleep, and it was done. I am running one again with the time command to make sure this is not a fluke. So it is possible that it is not exactly non-determinizable, but instead takes enormous time (hours on one LM, < 1 sec on another). Which is the same thing from the engineering standpoint, close enough, as those engineering vs mathematics jokes go. But jokes aside, I want something more bounded for a production system, so I need to understand what throws it off so badly.
>
> I would still call it a problem. Check if your ARPA contains <eps> or
> #0. I may need to add checks for this into arpa2fst (which we will
> rewrite at some point anyway). Another problem could be weird things
> like stray \r's which make one word seem like two in some
> circumstances.
> If I saw the output of arpa2fst I could probably figure out fairly
> quickly what the problem was. The way I would debug this is to trace
> through your LM FST from the start and follow those symbols (or
> epsilons) on that trace from the determinization failure, and see how
> there are two different paths.
> It's better if you share a couple different traces, not just one, so
> we can see what's in common.
>
>> Is fstdeterminizestar more than fstrmepsilon ∘ fstdeterminize (the latter with the kaldi patch)?
>
> No, it should be faster. fstrmepsilon ∘ fstdeterminize should fail too.
>
>> Ah, and this is a Linux machine. So everything looks very very standard (oops. Did I just create an infinite loop by repeating a word?).
>
> I am considering changing the way the LM disambig symbols are used to
> make this kind of problem less likely to happen in future, by having
> several disambig symbols for the LM, one per order, instead of just
> one.
>
> Dan
>
>
>
>>> -----Original Message-----
>>> From: Daniel Povey [mailto:dp...@gm...]
>>> Sent: 2015-06-15 2340
>>> To: Kirill Katsnelson
>>> Cc: kal...@li...
>>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>>
>>> In general SRILM language models are OK, but something weird could have
>>> happened, especially on an unusual platform like Windows.
>>> Look for duplicate lines with apparently the same n-gram on, and also
>>> send to me (but not to kaldi-user) the arpa LM.
>>> Dan
>>>
>>>
>>> On Tue, Jun 16, 2015 at 2:03 AM, Kirill Katsnelson
>>> <kir...@sm...> wrote:
>>> > Nope. The only thing I am thinking of doing is to bisect it somehow,
>>> > to get a minimal grammar that still refuses to determinize. I tried
>>> > different smoothing and played with other switches to ngram-count, but
>>> > it still does loop. Are there any known problems with srilm-generated
>>> > models?
>>> >
>>> > -kkm
>>> >
>>> >> -----Original Message-----
>>> >> From: Daniel Povey [mailto:dp...@gm...]
>>> >> Sent: 2015-06-15 2248
>>> >> To: Kirill Katsnelson
>>> >> Cc: kal...@li...
>>> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>> >>
>>> >> OOVs should be OK.
>>> >> Make sure there are no n-grams with things like <s> <s>
>>> >>
>>> >> e.g.
>>> >> see the lines
>>> >>   grep -v '<s> <s>' | \
>>> >>   grep -v '</s> <s>' | \
>>> >>   grep -v '</s> </s>' | \
>>> >>
>>> >> in the WSJ script:
>>> >>
>>> >> gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \
>>> >>   grep -v '<s> <s>' | \
>>> >>   grep -v '</s> <s>' | \
>>> >>   grep -v '</s> </s>' | \
>>> >>   arpa2fst - | fstprint | \
>>> >>   utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \
>>> >>   utils/eps2disambig.pl | utils/s2eps.pl | \
>>> >>   fstcompile --isymbols=$test/words.txt --osymbols=$test/words.txt \
>>> >>     --keep_isymbols=false --keep_osymbols=false | \
>>> >>   fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
>>> >>
>>> >> Dan
>>> >>
>>> >>
>>> >> On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson
>>> >> <kir...@sm...> wrote:
>>> >> > Bingo. G.fst is not determinizable (the "good" G.fst takes under a
>>> >> > second to determinize). And the bad one loops at the word "zero"
>>> >> > like this
>>> >> >
>>> >> > #0
>>> >> > unsure unsure
>>> >> > #0
>>> >> > of of
>>> >> > #0
>>> >> > yours yours
>>> >> > #0
>>> >> > is is
>>> >> > #0
>>> >> > your your
>>> >> > #0
>>> >> > zip zip
>>> >> > #0
>>> >> > wrong wrong
>>> >> > #0
>>> >> > with with
>>> >> > #0
>>> >> > zero zero
>>> >> > #0
>>> >> > zero zero
>>> >> > ....
>>> >> >
>>> >> > I am taking the LM straight from ngram-count to the standard
>>> >> > pipeline, nothing fancy. The only thing is it has a lot of OOVs:
>>> >> >
>>> >> > remove_oovs.pl: removed 4646 lines.
>>> >> >
>>> >> > Is this generally a problem? So does my "good" arpa LM. I grepped
>>> >> > both for the word zero, but could not spot anything outrageous. Can
>>> >> > you think of anything I can look for?
>>> >> >
>>> >> > My source is no more than 10 days old. Here's the pipeline, just
>>> >> > in case.
>>> >> >
>>> >> > cat $src/$arpalm | tr -d '\r' | \
>>> >> >   utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt
>>> >> >
>>> >> > cat $src/$arpalm | tr -d '\r' | \
>>> >> >   arpa2fst - | fstprint | \
>>> >> >   utils/remove_oovs.pl $lang/lm_oovs.txt | \
>>> >> >   utils/eps2disambig.pl | utils/s2eps.pl | \
>>> >> >   fstcompile --isymbols=$lang/words.txt --osymbols=$lang/words.txt \
>>> >> >     --keep_isymbols=false --keep_osymbols=false | \
>>> >> >   fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst
>>> >> >
>>> >> > -kkm
>>> >> >
>>> >> >
>>> >> >> -----Original Message-----
>>> >> >> From: Daniel Povey [mailto:dp...@gm...]
>>> >> >> Sent: 2015-06-15 2206
>>> >> >> To: Kirill Katsnelson
>>> >> >> Cc: kal...@li...
>>> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>> >> >>
>>> >> >> I don't recommend looking at the fstdeterminizestar algorithm itself-
>>> >> >> it's very complicated. Instead focus on the definition of
>>> >> >> "determinizable" and the twins property, and figure out what path you
>>> >> >> are taking through L.fst and G.fst. Trying to fstdeterminizestar
>>> >> >> G.fst directly, and seeing whether it terminates or not, may tell you
>>> >> >> something; if it fails, send the signal and see what happens.
>>> >> >> fstdeterminizestar does care about the weights, but only to the
>>> >> >> extent that they are the same or different from each other; and if
>>> >> >> your G.fst is generated from arpa2fst the pipeline should work for
>>> >> >> any ARPA-format language model- make sure you are using an up-to-date
>>> >> >> Kaldi though, there have been fixes as recently as a few months ago.
>>> >> >> The presence of SIL is not surprising, it is the optional-silence
>>> >> >> added by the lexicon. I think that script is adding #16 if it does
>>> >> >> *not* take the optional silence, otherwise it adds the phone SIL.
>>> >> >> Since you are calling your FST a "grammar" I'm wondering whether
>>> >> >> you have done something fancy with mapping words to FSTs or
>>> >> >> something like that, which is causing the result to not be
>>> >> >> determinizable.
>>> >> >>
>>> >> >> Dan
>>> >> >>
>>> >> >>
>>> >> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson
>>> >> >> <kir...@sm...> wrote:
>>> >> >> > Thank you very much for your help Dan, but I am still stuck.
>>> >> >> >
>>> >> >> > First of all, a question: does the fstdeterminizestar algorithm
>>> >> >> > depend on actual backoff and n-gram probabilities, i.e. will it
>>> >> >> > behave differently if the numbers in arpa model file are different?
>>> >> >> > Or does it depend only on arc labels but not weights? I am looking
>>> >> >> > at the code but certainly I am far from being able to understand it.
>>> >> >> > I cheated by looking at all if conditions in it, and this one in
>>> >> >> > EpsilonClosure is seemingly the only one dealing with weights:
>>> >> >> >
>>> >> >> >   if (! ApproxEqual(weight, iter->second.weight, delta_)) {  // add extra part of weight to queue.
>>> >> >> >
>>> >> >> > (In ProcessFinal it also has "if (this_final_weight !=
>>> >> >> > Weight::Zero())" but I do not believe it is relevant?)
>>> >> >> >
>>> >> >> > I am trying to understand how to dig into the problem--are
>>> >> >> > weights in the picture actually.
>>> >> >> >
>>> >> >> > Also, just for a test, I ran the grammar through a "grep -v 'real
>>> >> >> > real'", and indeed got a similar loop on the word "very" which is
>>> >> >> > also often repeated. But the "real real" 2- and 3-grams are there
>>> >> >> > in the "good" grammar too.
>>> >> >> >
>>> >> >> > Another thing I do not understand is the presence of the SIL ilabel
>>> >> >> > in the backtrace.
>>> >> >> > Here's the beginning of the trace that leads to the infinite loop,
>>> >> >> > as decoded with a little script I wrote (format is
>>> >> >> > ilabel [ TAB olabel ]):
>>> >> >> >
>>> >> >> > #16
>>> >> >> > #0
>>> >> >> > V_B
>>> >> >> > Y_I
>>> >> >> > UW1_I
>>> >> >> > Z_E views
>>> >> >> > #2
>>> >> >> > SIL
>>> >> >> > #0
>>> >> >> > AH0_B
>>> >> >> > N_I
>>> >> >> > SH_I unsure
>>> >> >> > UH1_I
>>> >> >> > R_E
>>> >> >> >
>>> >> >> > Note the presence of SIL at line 8. This is not in lexicon:
>>> >> >> >
>>> >> >> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
>>> >> >> > !SIL 1 0.20 1.00 1.00 SIL_S
>>> >> >> > $
>>> >> >> >
>>> >> >> > Is this a hint? How did it get there at all? I am using a
>>> >> >> > standard script to build the L_disambig.fst:
>>> >> >> >
>>> >> >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
>>> >> >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
>>> >> >> > utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
>>> >> >> >   data/local/dict/silprob.txt $silphone '#'$ndisambig | \
>>> >> >> >   fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
>>> >> >> >     --keep_isymbols=false --keep_osymbols=false | \
>>> >> >> >   fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
>>> >> >> >   fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
>>> >> >> >
>>> >> >> > I checked the lexicon, and there are indeed only real phones at
>>> >> >> > the beginning of each word, no empty positions and no #N symbols.
>>> >> >> >
>>> >> >> > -kkm
>>> >> >> >
>>> >> >> >> -----Original Message-----
>>> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>>> >> >> >> Sent: 2015-06-15 1944
>>> >> >> >> To: Kirill Katsnelson
>>> >> >> >> Cc: kal...@li...
>>> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>> >> >> >>
>>> >> >> >> I think the confusion is probably between two loops with "real" on
>>> >> >> >> them in G.fst: one loop where you always take the bigram probability,
>>> >> >> >> and one where you always take the unigram probability. Or maybe a
>>> >> >> >> similar confusion between a loop where you use the trigram "real real
>>> >> >> >> real" and the bigram "real real". Those loops are expected to exist.
>>> >> >> >> Probably the issue is that something happened at the start of
>>> >> >> >> the sequence which caused the FST to be confused about which of those
>>> >> >> >> two states it was in. If you have any empty words (words with empty
>>> >> >> >> pronunciation) in your lexicon this could possibly happen, as
>>> >> >> >> it would be confused between taking a normal word, then the backoff
>>> >> >> >> symbol, vs. taking a normal word, then the empty word, then the
>>> >> >> >> backoff symbol.
>>> >> >> >> I think the current Kaldi graph-creation script checks for empty
>>> >> >> >> words in the lexicon, for this reason.
>>> >> >> >>
>>> >> >> >> Dan
>>> >> >> >>
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( )
>>> >> >> >> > generally almost makes sense, given that #16 is the last one in
>>> >> >> >> > table, the silence disambiguation symbol. (Not sure why "real"
>>> >> >> >> > is emitted at L_E--I would rather expect it to be emitted at
>>> >> >> >> > #1.) What I do not understand is what exactly the debug trace
>>> >> >> >> > represents, and what should I make out of it.
>>> >> >> >> > It is a path through the FST graph, but I do not understand what
>>> >> >> >> > this path is exactly, and what this endless walk of this loop
>>> >> >> >> > means.
>>> >> >> >> >
>>> >> >> >> > -kkm
>>> >> >> >> >
>>> >> >> >> >> -----Original Message-----
>>> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>>> >> >> >> >> Sent: 2015-06-15 1858
>>> >> >> >> >> To: Kirill Katsnelson
>>> >> >> >> >> Cc: kal...@li...
>>> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>> >> >> >> >>
>>> >> >> >> >> Look into the "backoff disambiguation symbol", normally called #0.
>>> >> >> >> >> The reason why it is needed should be explained in the hbka.pdf paper.
>>> >> >> >> >> Dan
>>> >> >> >> >>
>>> >> >> >> >>
>>> >> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
>>> >> >> >> >> <kir...@sm...> wrote:
>>> >> >> >> >> > Thank you! The output consists of some sequences as you
>>> >> >> >> >> > described, quickly falling into a short ever repeated loop.
>>> >> >> >> >> >
>>> >> >> >> >> > The non-repeated section ends up with osymbols (excluding
>>> >> >> >> >> > epsilons) "whatsoever on vacation up", and then the repeated part
>>> >> >> >> >> > looks like " #1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)".
>>> >> >> >> >> > The word "real" is spelled "R_B IY1_I L_E #1" in L_disambig.
>>> >> >> >> >> >
>>> >> >> >> >> > Both LMs contain a bigram for "vacation up" and a trigram
>>> >> >> >> >> > "vacation up there". "up real" is a bigram in both, with 3-grams
>>> >> >> >> >> > "up real quick" and "up real quickly". "up real" is also a tail
>>> >> >> >> >> > of a few other 3-grams, but these are also same in both models
>>> >> >> >> >> > (up to their weights).
>>> >> >> >> >> >
>>> >> >> >> >> > It looks like I do not understand what I should make in the end
>>> >> >> >> >> > out of this debug data :(
>>> >> >> >> >> >
>>> >> >> >> >> > -kkm
>>> >> >> >> >> >
>>> >> >> >> >> >> -----Original Message-----
>>> >> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>>> >> >> >> >> >> Sent: 2015-06-15 1821
>>> >> >> >> >> >> To: Kirill Katsnelson
>>> >> >> >> >> >> Cc: kal...@li...
>>> >> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>> >> >> >> >> >>
>>> >> >> >> >> >> > I have a small set of sentences with repeat counts, and am
>>> >> >> >> >> >> > generating an LM out of it. One is generated by a horrible local
>>> >> >> >> >> >> > tool I have trouble tracing exactly how. For this one L*G
>>> >> >> >> >> >> > composition takes about 20 seconds on my CPU. Another LM I just
>>> >> >> >> >> >> > generated out of the same files with srilm 1.7.1 ngram-count.
>>> >> >> >> >> >> > This one has been sitting in mkgraphs.sh on L_disambig*G
>>> >> >> >> >> >> > composition step for about 30 minutes, and still churning.
>>> >> >> >> >> >> > fstdeterminizestar --use-log=true is running at 100%.
>>> >> >> >> >> >> > L_disambig.fst is the same file in both cases. Looks like the G
>>> >> >> >> >> >> > is making it not determinizable, although I have no idea how it
>>> >> >> >> >> >> > came to be.
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > Anyone could share an advice on tracking down the problem? Thanks.
>>> >> >> >> >> >>
>>> >> >> >> >> >> You can send a signal to that program like kill -SIGUSR1
>>> >> >> >> >> >> process-id and it will print out some info about the
>>> >> >> >> >> >> symbol sequences involved, I think it is like
>>> >> >> >> >> >> isymbol1 (osymbol1) isymbol2 (osymbol2) and so on.
>>> >> >> >> >> >> Usually there is a particular word sequence that is problematic.
>>> >> >> >> >> >> Dan
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >>
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > -kkm
>>> >> >> >> >> >> >
>>> >> >> >> >> >> > ------------------------------------------------------------------------------
>>> >> >> >> >> >> > _______________________________________________
>>> >> >> >> >> >> > Kaldi-users mailing list
>>> >> >> >> >> >> > Kal...@li...
>>> >> >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users
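
For anyone hitting the same symptom, here is a rough sketch of the ARPA sanity checks suggested in this thread: </s> anywhere but the end of an n-gram, <s> anywhere but the start, the reserved symbols <eps> and #0, and stray carriage returns. The function name check_arpa and the report format are made up for illustration and are not part of Kaldi or SRILM; the sketch assumes a standard ARPA layout with \N-grams: section headers and whitespace-separated fields.

```shell
#!/bin/sh
# check_arpa (hypothetical helper): read an ARPA LM on stdin and print one
# tab-separated line (tag, line number, n-gram line) per suspicious n-gram.
check_arpa() {
  awk '
    # Strip a trailing carriage return and count how many lines had one.
    sub(/\r$/, "") { crlf++ }
    # A section header such as \2-grams: sets the current n-gram order.
    /^\\[0-9]+-grams:/ { order = substr($0, 2) + 0; next }
    $0 == "\\end\\"    { order = 0; next }
    # n-gram line: logprob, then "order" words, then an optional backoff weight.
    order > 0 && NF > order {
      for (i = 2; i <= order + 1; i++) {
        pos = i - 1
        if ($i == "</s>" && pos < order) print "misplaced </s>\t" NR "\t" $0
        if ($i == "<s>" && pos > 1)      print "misplaced <s>\t" NR "\t" $0
        if ($i == "<eps>" || $i == "#0") print "reserved symbol\t" NR "\t" $0
      }
    }
    END { if (crlf) print "carriage returns\t" crlf " line(s)" }
  '
}
```

Something like `gunzip -c lm.arpa.gz | check_arpa` (lm.arpa.gz being a placeholder name) would list the offending n-grams. An empty report only means none of these particular patterns were found, not that the resulting G.fst is guaranteed to determinize.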