From: Daniel P. <dp...@gm...> - 2015-06-16 05:48:28
OOVs should be OK. Make sure there are no n-grams with things like
<s> <s> e.g. see the lines

  grep -v '<s> <s>' | \
  grep -v '</s> <s>' | \
  grep -v '</s> </s>' | \

in the WSJ script:

  gunzip -c $lmdir/lm_${lm_suffix}.arpa.gz | \
    grep -v '<s> <s>' | \
    grep -v '</s> <s>' | \
    grep -v '</s> </s>' | \
    arpa2fst - | fstprint | \
    utils/remove_oovs.pl $tmpdir/oovs_${lm_suffix}.txt | \
    utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$test/words.txt \
      --osymbols=$test/words.txt --keep_isymbols=false --keep_osymbols=false | \
    fstrmepsilon | fstarcsort --sort_type=ilabel > $test/G.fst

Dan

On Tue, Jun 16, 2015 at 1:42 AM, Kirill Katsnelson <kir...@sm...> wrote:
> Bingo. G.fst is not determinizable (the "good" G.fst takes under a
> second to determinize). And the bad one loops at the word "zero" like
> this
>
> #0
> unsure unsure
> #0
> of of
> #0
> yours yours
> #0
> is is
> #0
> your your
> #0
> zip zip
> #0
> wrong wrong
> #0
> with with
> #0
> zero zero
> #0
> zero zero
> ....
>
> I am taking the LM straight from ngram_counts to the standard
> pipeline, nothing fancy. The only thing is it has a lot of OOVs:
>
> remove_oovs.pl: removed 4646 lines.
>
> Is this generally a problem? So does my "good" arpa LM. I grepped both
> for the word zero, but could not spot anything outrageous. Can you
> think of anything I can look for?
>
> My source is no longer than 10 days old. Here's the pipeline, just in
> case.
>
> cat $src/$arpalm | tr -d '\r' | \
>   utils/find_arpa_oovs.pl $lang/words.txt > $lang/lm_oovs.txt
>
> cat $src/$arpalm | tr -d '\r' | \
>   arpa2fst - | fstprint | \
>   utils/remove_oovs.pl $lang/lm_oovs.txt | \
>   utils/eps2disambig.pl | utils/s2eps.pl | fstcompile --isymbols=$lang/words.txt \
>     --osymbols=$lang/words.txt --keep_isymbols=false --keep_osymbols=false | \
>   fstrmepsilon | fstarcsort --sort_type=ilabel > $lang/G.fst
>
> -kkm
>
>
>> -----Original Message-----
>> From: Daniel Povey [mailto:dp...@gm...]
>> Sent: 2015-06-15 2206
>> To: Kirill Katsnelson
>> Cc: kal...@li...
>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>
>> I don't recommend to look at the fstdeterminizestar algorithm itself-
>> it's very complicated. Instead focus on the definition of
>> "determinizable" and the twins property, and figure out what path you
>> are taking through L.fst and G.fst. Trying to fstdeterminizestar G.fst
>> directly, and seeing whether it terminates or not, may tell you
>> something; if it fails, send the signal and see what happens.
>> fstdeterminizestar does care about the weights, but only to the extent
>> that they are the same or different from each other; and if your G.fst
>> is generated from arpa2fst the pipeline should work for any ARPA-format
>> language model- make sure you are using an up-to-date Kaldi though,
>> there have been fixes as recently as a few months ago.
>> The presence of SIL is not surprising, it is the optional-silence added
>> by the lexicon. I think that script is adding #16 if it does
>> *not* take the optional silence, otherwise it adds the phone SIL.
>> Since you are calling your FST a "grammar" I'm wondering whether you
>> have done something fancy with mapping words to FSTs or something like
>> that, which is causing the result to not be determinizable.
>>
>> Dan
>>
>>
>> On Tue, Jun 16, 2015 at 12:55 AM, Kirill Katsnelson <kir...@sm...> wrote:
>> > Thank you very much for your help Dan, but I am still stuck.
>> >
>> > First of all, a question: does the fstdeterminizestar algorithm
>> > depend on actual backoff and n-gram probabilities, i.e. will it
>> > behave differently if the numbers in arpa model file are different?
>> > Or does it depend only on arc labels but not weights? I am looking
>> > at the code but certainly I am far from being able to understand it.
>> > I cheated by looking at all if conditions in it, and this one in
>> > EpsilonClosure is seemingly the only one dealing with weights:
>> >
>> >   if (! ApproxEqual(weight, iter->second.weight, delta_)) { // add extra part of weight to queue.
>> >
>> > (In ProcessFinal it also has "if (this_final_weight !=
>> > Weight::Zero())" but I do not believe it is relevant?)
>> >
>> > I am trying to understand how to dig into the problem--are weights
>> > in the picture actually.
>> >
>> > Also, just for a test, I ran the grammar trough a "grep -v 'real
>> > real'", and indeed got a similar loop on the word "very" which is
>> > also often repeated. But the "real real" 2- and 3-grams are there in
>> > the "good" grammar too.
>> >
>> > Another thing I do not understand is the presence of the SIL ilabel
>> > in the backtrace. Here's the beginning of the trace that leads to
>> > the infinite loop as decoded with a little script I wrote (format is
>> > ilabel [ TAB olabel ]:
>> >
>> > #16
>> > #0
>> > V_B
>> > Y_I
>> > UW1_I
>> > Z_E     views
>> > #2
>> > SIL
>> > #0
>> > AH0_B
>> > N_I
>> > SH_I    unsure
>> > UH1_I
>> > R_E
>> >
>> > Note the presence of SIL at line 8. This is not in lexicon:
>> >
>> > $ grep SIL data/lang_sa_generic_test/dict/lexiconp_silprob_disambig.txt
>> > !SIL 1 0.20 1.00 1.00 SIL_S
>> > $
>> >
>> > Is this a hint? How did it get there at all? I am using a standard
>> > script to build the L_disambig.fst:
>> >
>> > phone_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/phones.txt)
>> > word_disambig_symbol=$(awk '$1=="#0"{print $2}' $lang/words.txt)
>> > utils/make_lexicon_fst_silprob.pl $lang/dict/lexiconp_silprob_disambig.txt \
>> >   data/local/dict/silprob.txt $silphone '#'$ndisambig | \
>> >   fstcompile --isymbols=$lang/phones.txt --osymbols=$lang/words.txt \
>> >   --keep_isymbols=false --keep_osymbols=false | \
>> >   fstaddselfloops "echo $phone_disambig_symbol |" "echo $word_disambig_symbol |" | \
>> >   fstarcsort --sort_type=olabel > $lang/L_disambig.fst || exit 1;
>> >
>> > I checked the lexicon, and there are indeed only real phones at the
>> > beginning of each word, no empty positions and no #N symbols.
>> >
>> > -kkm
>> >
>> >> -----Original Message-----
>> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> Sent: 2015-06-15 1944
>> >> To: Kirill Katsnelson
>> >> Cc: kal...@li...
>> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >>
>> >> I think the confusion is probably between two loops with "real" on
>> >> them in G.fst: one loop where you always take the bigram
>> >> probability, and one where you always take the unigram probability.
>> >> Or maybe a similar confusion between a loop where you use the
>> >> trigram "real real real" and the bigram "real real". Those loops
>> >> are expected to exist.
>> >> Probably the issue is that something happened at the start of the
>> >> sequence which caused the FST to be confused about which of those
>> >> two states it was in. If you have any empty words (words with empty
>> >> pronunciation) in your lexicon this could possibly happen, as it
>> >> would be confused between taking a normal word, then the backoff
>> >> symbol, vs. taking a normal word, then the empty word, then the
>> >> backoff symbol.
>> >> I think the current Kaldi graph-creation script check for empty
>> >> words in the lexicon, for this reason.
>> >>
>> >> Dan
>> >>
>> >>
>> >>
>> >> > The sequence R_B ( ) IY1_I ( ) L_E (real) #1 ( ) #16 ( ) #0 ( )
>> >> > generally almost makes sense, given that #16 is the last one in
>> >> > table, the silence disambiguation symbol. (Not sure why "real" is
>> >> > emitted at L_E--I would rather expect it to be emitted at #1.)
>> >> > What I do not understand is what exactly the debug trace
>> >> > represents, and what should I make out if it. It is a path
>> >> > through the FST graph, but I do not understand what is this path
>> >> > exactly, and what does this endless walk of this loop mean.
>> >> >
>> >> > -kkm
>> >> >
>> >> >> -----Original Message-----
>> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> >> Sent: 2015-06-15 1858
>> >> >> To: Kirill Katsnelson
>> >> >> Cc: kal...@li...
>> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >> >>
>> >> >> Look into the "backoff disambiguation symbol", normally called
>> >> >> #0. The reason why it is needed should be explained in the
>> >> >> hbka.pdf paper.
>> >> >> Dan
>> >> >>
>> >> >>
>> >> >> On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson <kir...@sm...> wrote:
>> >> >> > Thank you! The output consists of some sequences as you
>> >> >> > described, quickly falling into a short ever repeated loop.
>> >> >> >
>> >> >> > The non-repeated section ends up with osymbols (excluding
>> >> >> > epsilons) "whatsoever on vacation up", and then the repeated
>> >> >> > part looks like " #1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E
>> >> >> > (real)". The word "real" is spelled "R_B IY1_I L_E #1" in
>> >> >> > L_disambig.
>> >> >> >
>> >> >> > Both LMs contain a bigram for "vacation up" and a trigram
>> >> >> > "vacation up there". "up real" is a bigram in both, with
>> >> >> > 3-grams "up real quick" and "up real quickly". "up real" is
>> >> >> > also a tail of a few other 3-grams, but these are also same in
>> >> >> > both models (up to their weights).
>> >> >> >
>> >> >> > It looks I do not understand what should I make in the end out
>> >> >> > of this debug data :(
>> >> >> >
>> >> >> > -kkm
>> >> >> >
>> >> >> >> -----Original Message-----
>> >> >> >> From: Daniel Povey [mailto:dp...@gm...]
>> >> >> >> Sent: 2015-06-15 1821
>> >> >> >> To: Kirill Katsnelson
>> >> >> >> Cc: kal...@li...
>> >> >> >> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>> >> >> >>
>> >> >> >> > I have a small set of sentences with repeat counts, and
>> >> >> >> > generating an LM out of it. One is generated by a horrible
>> >> >> >> > local tool I have trouble tracing exactly how. For this one
>> >> >> >> > L*G composition takes about 20 seconds on my CPU.
>> >> >> >> > Another LM I just generated out of the same files with
>> >> >> >> > srilm 1.7.1 ngram-count. This one has been sitting in
>> >> >> >> > mkgraphs.sh on L_disambig*G composition step for about 30
>> >> >> >> > minutes, and still churning. fstdeterminizestar
>> >> >> >> > --use-log=true is running at 100%. L_disambig.fst is the
>> >> >> >> > same file in both cases. Looks like the G making it not
>> >> >> >> > determinizable, although I have no idea how it came to be.
>> >> >> >> >
>> >> >> >> > Anyone could share an advice on tracking down the problem?
>> >> >> >> > Thanks.
>> >> >> >>
>> >> >> >> You can send a signal to that program like
>> >> >> >>   kill -SIGUSR1 process-id
>> >> >> >> and it will print out some info about the symbol sequences
>> >> >> >> involved, I think it is like
>> >> >> >>   isymbol1 (osymbol1) isymbol2 (osymbol2)
>> >> >> >> and so on.
>> >> >> >> Usually there is a particular word sequence that is
>> >> >> >> problematic.
>> >> >> >> Dan
>> >> >> >>
>> >> >> >> >
>> >> >> >> > -kkm
>> >> >> >> >
>> >> >> >> > ------------------------------------------------------------
>> >> >> >> > _______________________________________________
>> >> >> >> > Kaldi-users mailing list
>> >> >> >> > Kal...@li...
>> >> >> >> > https://lists.sourceforge.net/lists/listinfo/kaldi-users
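
The check Dan suggests at the top of the thread (no n-grams containing
<s> <s>, </s> <s> or </s> </s> in the ARPA file) can be sketched as a
small shell helper. This is only an illustration: the function name
count_bad_ngrams is hypothetical and not part of Kaldi. The three
patterns are the ones filtered out by the WSJ pipeline quoted above, and
the tr -d '\r' step mirrors Kirill's pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Kaldi): count the lines of an ARPA LM
# that contain the sentence-boundary sequences removed by the
# "grep -v" filters in the WSJ script.
count_bad_ngrams() {
  # $1: path to an ARPA language model, plain text or gzipped.
  local arpa=$1
  local reader=cat
  case "$arpa" in
    *.gz) reader="gunzip -c" ;;
  esac
  # grep -c prints the number of matching lines. It exits with status 1
  # when that number is 0, so "|| true" keeps the function's status at 0.
  $reader "$arpa" | tr -d '\r' | \
    grep -c -E '<s> <s>|</s> <s>|</s> </s>' || true
}
```

Running count_bad_ngrams $src/$arpalm before the arpa2fst step prints the
number of suspicious lines. A nonzero count suggests adding the three
grep -v filters ahead of arpa2fst, as in the WSJ script.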