From: Daniel P. <dp...@gm...> - 2015-06-16 06:40:06
In general SRILM language models are OK, but something weird could have happened, especially on an unusual platform like Windows. Look for duplicate lines with apparently the same n-gram on, and also send to me (but not to kaldi-user) the arpa LM.

Dan

On Tue, Jun 16, 2015 at 2:03 AM, Kirill Katsnelson <kir...@sm...> wrote:
> Nope. The only thing I am thinking of doing is to bisect it somehow, to get a minimal grammar that still refuses to determinize. I tried different smoothing and played with other switches to ngram_count, but it still does loop. Are there any known problems with srilm-generated models?
>
> -kkm
>
>> -----Original Message-----
>> From: Daniel Povey [mailto:dp...@gm...]
>> Sent: 2015-06-15 2248
>> To: Kirill Katsnelson
>> Cc: kal...@li...
>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>
>> OOVs should be OK.
>> Make sure there are no n-grams with things like <s> <s>
>>
>> e.g. see the lines
>>   grep -v '<s> <s>' | \
>>   grep -v '</s> <s>' | \
>>   grep -v '</s> </s>' | \
>>
>> in the WSJ script:
>>
>> gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \
>>   grep -v '<s> <s>' | \
>>   grep -v '</s> <s>' | \
>>   grep -v '</s> </s>' | \
>>   arpa2fst - | fstprint | \
>>   utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \
>>   utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$test/words.txt \
>>     --osymbols=$test/words.txt --keep_isymbols=false --keep_osymbols=false | \
>>   fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst
>>
>> Dan
>>
>>
>> On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson <kir...@sm...> wrote:
>> > Bingo. G.fst is not determinizable (the "good" G.fst takes under a second to determinize). And the bad one loops at the word "zero" like this
>> >
>> > #0
>> > unsure unsure
>> > #0
>> > of of
>> > #0
>> > yours yours
>> > #0
>> > is is
>> > #0
>> > your your
>> > #0
>> > zip zip
>> > #0
>> > wrong wrong
>> > #0
>> > with with
>> > #0
>> > zero zero
>> > #0
>> > zero zero
>> > ....
>> >
>> > I am taking the LM straight from ngram_counts to the standard pipeline, nothing fancy. The only thing is it has a lot of OOVs:
>> >
>> > remove_oovs.pl: removed 4646 lines.
>> >
>> > Is this generally a problem? So does my "good" arpa LM. I grepped both for the word zero, but could not spot anything outrageous. Can you think of anything I can look for?
>> >
>> > My source is no longer than 10 days old. Here's the pipeline, just in case.
>> >
>> > cat $src/$arpalm | tr -d '\r' | \
>> >   utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt
>> >
>> > cat $src/$arpalm | tr -d '\r' | \
>> >   arpa2fst - | fstprint | \
>> >   utils/remove_oovs.pl $lang/lm_oovs.txt | \
>> >   utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$lang/words.txt \
>> >     --osymbols=$lang/words.txt --keep_isymbols=false --keep_osymbols=false | \
>> >   fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst
>> >
>> > -kkm
>> >
>> >
>> >> -----Original Message-----
>> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> Sent: 2015-06-15 2206
>> >> To: Kirill Katsnelson
>> >> Cc: kal...@li...
>> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >>
>> >> I don't recommend to look at the fstdeterminizestar algorithm itself- it's very complicated. Instead focus on the definition of "determinizable" and the twins property, and figure out what path you are taking through L.fst and G.fst. Trying to fstdeterminizestar G.fst directly, and seeing whether it terminates or not, may tell you something; if it fails, send the signal and see what happens.
>> >> fstdeterminizestar does care about the weights, but only to the extent that they are the same or different from each other; and if your G.fst is generated from arpa2fst the pipeline should work for any ARPA-format language model- make sure you are using an up-to-date Kaldi though, there have been fixes as recently as a few months ago. The presence of SIL is not surprising, it is the optional-silence added by the lexicon. I think that script is adding #16 if it does *not* take the optional silence, otherwise it adds the phone SIL. Since you are calling your FST a "grammar" I'm wondering whether you have done something fancy with mapping words to FSTs or something like that, which is causing the result to not be determinizable.
>> >>
>> >> Dan
>> >>
>> >>
>> >> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson <kir...@sm...> wrote:
>> >> > Thank you very much for your help Dan, but I am still stuck.
>> >> >
>> >> > First of all, a question: does the fstdeterminizestar algorithm depend on actual backoff and n-gram probabilities, i.e. will it behave differently if the numbers in arpa model file are different? Or does it depend only on arc labels but not weights? I am looking at the code but certainly I am far from being able to understand it. I cheated by looking at all if conditions in it, and this one in EpsilonClosure is seemingly the only one dealing with weights:
>> >> >
>> >> >   if (! ApproxEqual(weight, iter->second.weight, delta_))
>> >> >   { // add extra part of weight to queue.
>> >> >
>> >> > (In ProcessFinal it also has "if (this_final_weight != Weight::Zero())" but I do not believe it is relevant?)
>> >> >
>> >> > I am trying to understand how to dig into the problem--are weights in the picture actually.
>> >> >
>> >> > Also, just for a test, I ran the grammar through a "grep -v 'real real'", and indeed got a similar loop on the word "very" which is also often repeated. But the "real real" 2- and 3-grams are there in the "good" grammar too.
>> >> >
>> >> > Another thing I do not understand is the presence of the SIL ilabel in the backtrace. Here's the beginning of the trace that leads to the infinite loop as decoded with a little script I wrote (format is ilabel [ TAB olabel ]):
>> >> >
>> >> > #16
>> >> > #0
>> >> > V_B
>> >> > Y_I
>> >> > UW1_I
>> >> > Z_E views
>> >> > #2
>> >> > SIL
>> >> > #0
>> >> > AH0_B
>> >> > N_I
>> >> > SH_I unsure
>> >> > UH1_I
>> >> > R_E
>> >> >
>> >> > Note the presence of SIL at line 8. This is not in lexicon:
>> >> >
>> >> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
>> >> > !SIL 1 0.20 1.00 1.00 SIL_S
>> >> > $
>> >> >
>> >> > Is this a hint? How did it get there at all? I am using a standard script to build the L_disambig.fst:
>> >> >
>> >> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
>> >> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
>> >> > utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
>> >> >   data/local/dict/silprob.txt $silphone '#'$ndisambig | \
>> >> >   fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
>> >> >   --keep_isymbols=false --keep_osymbols=false | \
>> >> >   fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
>> >> >   fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
>> >> >
>> >> > I checked the lexicon, and there are indeed only real phones at the beginning of each word, no empty positions and no #N symbols.
>> >> >
>> >> > -kkm
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> >> Sent: 2015-06-15 1944
>> >> >> To: Kirill Katsnelson
>> >> >> Cc: kal...@li...
>> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >> >>
>> >> >> I think the confusion is probably between two loops with "real" on them in G.fst: one loop where you always take the bigram probability, and one where you always take the unigram probability. Or maybe a similar confusion between a loop where you use the trigram "real real real" and the bigram "real real". Those loops are expected to exist. Probably the issue is that something happened at the start of the sequence which caused the FST to be confused about which of those two states it was in. If you have any empty words (words with empty pronunciation) in your lexicon this could possibly happen, as it would be confused between taking a normal word, then the backoff symbol, vs. taking a normal word, then the empty word, then the backoff symbol. I think the current Kaldi graph-creation scripts check for empty words in the lexicon, for this reason.
>> >> >>
>> >> >> Dan
>> >> >>
>> >> >>
>> >> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( ) generally almost makes sense, given that #16 is the last one in table, the silence disambiguation symbol. (Not sure why "real" is emitted at L_E--I would rather expect it to be emitted at #1.) What I do not understand is what exactly the debug trace represents, and what should I make out of it. It is a path through the FST graph, but I do not understand what is this path exactly, and what does this endless walk of this loop mean.
>> >> >> >
>> >> >> > -kkm
>> >> >> >
>> >> >> >> -----Original Message-----
>> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> >> >> Sent: 2015-06-15 1858
>> >> >> >> To: Kirill Katsnelson
>> >> >> >> Cc: kal...@li...
>> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >> >> >>
>> >> >> >> Look into the "backoff disambiguation symbol", normally called #0. The reason why it is needed should be explained in the hbka.pdf paper.
>> >> >> >> Dan
>> >> >> >>
>> >> >> >>
>> >> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson <kir...@sm...> wrote:
>> >> >> >> > Thank you! The output consists of some sequences as you described, quickly falling into a short ever repeated loop.
>> >> >> >> >
>> >> >> >> > The non-repeated section ends up with osymbols (excluding epsilons) "whatsoever on vacation up", and then the repeated part looks like "#1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The word "real" is spelled "R_B IY1_I L_E #1" in L_disambig.
>> >> >> >> >
>> >> >> >> > Both LMs contain a bigram for "vacation up" and a trigram "vacation up there". "up real" is a bigram in both, with 3-grams "up real quick" and "up real quickly". "up real" is also a tail of a few other 3-grams, but these are also same in both models (up to their weights).
>> >> >> >> >
>> >> >> >> > It looks like I do not understand what should I make in the end out of this debug data :(
>> >> >> >> >
>> >> >> >> > -kkm
>> >> >> >> >
>> >> >> >> >> -----Original Message-----
>> >> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> >> >> >> Sent: 2015-06-15 1821
>> >> >> >> >> To: Kirill Katsnelson
>> >> >> >> >> Cc: kal...@li...
>> >> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >> >> >> >>
>> >> >> >> >> > I have a small set of sentences with repeat counts, and am generating an LM out of it. One is generated by a horrible local tool I have trouble tracing exactly how. For this one L*G composition takes about 20 seconds on my CPU. Another LM I just generated out of the same files with srilm 1.7.1 ngram-count. This one has been sitting in mkgraphs.sh on L_disambig*G composition step for about 30 minutes, and still churning. fstdeterminizestar --use-log=true is running at 100%. L_disambig.fst is the same file in both cases. Looks like the G is making it not determinizable, although I have no idea how it came to be.
>> >> >> >> >> >
>> >> >> >> >> > Could anyone share advice on tracking down the problem? Thanks.
>> >> >> >> >>
>> >> >> >> >> You can send a signal to that program like kill -SIGUSR1 process-id and it will print out some info about the symbol sequences involved, I think it is like
>> >> >> >> >> isymbol1 (osymbol1) isymbol2 (osymbol2) and so on.
>> >> >> >> >> Usually there is a particular word sequence that is problematic.
>> >> >> >> >> Dan
>> >> >> >> >>
>> >> >> >> >> >
>> >> >> >> >> > -kkm
>> >> >> >> >> >
>> >> >> >> >> > ----------------------------------------------------------------------
>> >> >> >> >> > _______________________________________________
>> >> >> >> >> > Kaldi-users mailing list
>> >> >> >> >> > Kal...@li...
>> >> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users
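
The two checks Dan suggests in this thread (duplicate lines carrying the same n-gram, and n-grams such as <s> <s>) could be run over an ARPA file roughly as follows. This is only a sketch under stated assumptions: the function names arpa_duplicate_ngrams and arpa_boundary_ngrams are made up for illustration and are not part of Kaldi or SRILM, and the code assumes the tab-separated ARPA layout (log-probability, n-gram, optional backoff weight).

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Kaldi or SRILM.
# Both functions read an ARPA LM on stdin and assume tab-separated fields:
#   log10prob <TAB> n-gram words <TAB> optional backoff weight

# Print "<section> <TAB> <n-gram>" for every n-gram listed more than once
# within the same \N-grams: section.
arpa_duplicate_ngrams() {
  tr -d '\r' | awk -F'\t' '
    $0 ~ /^\\[0-9]+-grams:/ { section = $0; next }  # start of an n-gram section
    $0 == "\\end\\"         { section = ""; next }  # end-of-model marker
    section != "" && NF >= 2 {
      key = section "\t" $2            # field 2 holds the n-gram itself
      if (++seen[key] == 2) print key  # report each duplicate once
    }'
}

# Print lines containing the boundary-token pairs that the WSJ recipe
# quoted in this thread filters out before arpa2fst.
arpa_boundary_ngrams() {
  tr -d '\r' | grep -E '<s> <s>|</s> <s>|</s> </s>'
}
```

For example, `arpa_duplicate_ngrams < $src/$arpalm` prints one line per repeated n-gram, and empty output from both functions means neither check found anything. Duplicates are compared only within the same order section, so a unigram and a bigram never collide. The `tr -d '\r'` mirrors the pipeline quoted in this thread, so files with Windows line endings are handled the same way.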