Re: [Kaldi-users] fstdeterminizestar (L*G) never completes

Look into the "backoff disambiguation symbol", normally called #0.
The reason why it is needed should be explained in the hbka.pdf paper.
Dan
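
For readers following the thread, here is a minimal sketch of what putting #0 on the backoff arcs means in practice. It assumes G is available in OpenFst text format (one arc per line: source state, destination state, input label, output label, optional weight) and that the backoff arcs are the ones whose input label is <eps>, which is how ARPA-derived grammars usually look as far as I know. The function name eps_to_backoff_symbol and the toy arcs are made up for illustration; this is not Kaldi's own script.

```python
def eps_to_backoff_symbol(lines, backoff_symbol="#0", eps="<eps>"):
    """Replace an <eps> INPUT label with the backoff symbol on arc lines.

    Assumption: `lines` is an FST in OpenFst text format. Arc lines have at
    least four fields (src, dst, ilabel, olabel[, weight]); final-state
    lines have one or two fields and are passed through unchanged.
    Fields are re-joined with tabs.
    """
    result = []
    for line in lines:
        fields = line.split()
        # Only the input side (third field) is rewritten; the output label
        # stays <eps>, so the word sequence on the output is unaffected.
        if len(fields) >= 4 and fields[2] == eps:
            fields[2] = backoff_symbol
        result.append("\t".join(fields))
    return result


# Toy grammar fragment (hypothetical states and weights).
g_text = [
    "0\t1\tup\tup\t1.2",
    "1\t2\t<eps>\t<eps>\t0.7",  # backoff arc
    "2\t3\treal\treal\t2.1",
    "3\t0.0",                   # final state
]

relabeled = eps_to_backoff_symbol(g_text)
for arc in relabeled:
    print(arc)
```

After relabeling, the symbol table used to compile G would need an entry for #0, and, as far as I understand the Kaldi setup, the lexicon FST has to pass #0 through (L_disambig.fst normally has a self-loop for it). Whether a missing or misplaced #0 is the actual cause in this thread is not established by the messages below.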

On Mon, Jun 15, 2015 at 9:54 PM, Kirill Katsnelson
<kir...@sm...> wrote:
> Thank you! The output consists of sequences like the ones you described, and it quickly falls into a short, endlessly repeating loop.
>
> The non-repeating part ends with the osymbols (excluding epsilons) "whatsoever on vacation up", and the repeating part looks like " #1 ( ) #16 ( ) #0 ( ) R_B ( ) IY1_I ( ) L_E (real)". The word "real" is spelled "R_B IY1_I L_E #1" in L_disambig.
>
> Both LMs contain a bigram for "vacation up" and a trigram "vacation up there". "up real" is a bigram in both, with the 3-grams "up real quick" and "up real quickly". "up real" is also the tail of a few other 3-grams, but these are also the same in both models (up to their weights).
>
> It looks like I do not understand what I should make of this debug data in the end :(
>
>  -kkm
>
>> -----Original Message-----
>> From: Daniel Povey [mailto:dp...@gm...]
>> Sent: 2015-06-15 1821
>> To: Kirill Katsnelson
>> Cc: kal...@li...
>> Subject: Re: [Kaldi-users] fstdeterminizestar (L*G) never completes
>>
>> > I have a small set of sentences with repeat counts, and I am generating
>> > an LM out of it. One LM is generated by a horrible local tool whose
>> > workings I have trouble tracing exactly. For this one, L*G composition
>> > takes about 20 seconds on my CPU. The other LM I just generated from the
>> > same files with srilm 1.7.1 ngram-count. This one has been sitting in
>> > mkgraphs.sh on the L_disambig*G composition step for about 30 minutes
>> > and is still churning; fstdeterminizestar --use-log=true is running at
>> > 100%. L_disambig.fst is the same file in both cases. It looks like G is
>> > making it non-determinizable, although I have no idea how that came to be.
>> >
>> > Could anyone share advice on tracking down the problem? Thanks.
>>
>> You can send a signal to that program, e.g.  kill -SIGUSR1 process-id,
>> and it will print out some info about the symbol sequences involved. I
>> think the format is like
>>  isymbol1 (osymbol1)  isymbol2 (osymbol2) and so on.
>> Usually there is a particular word sequence that is problematic.
>> Dan
>>
>> >
>> >  -kkm
>> >