From: Mailing l. u. f. U. C. a. U. <kal...@li...> - 2013-07-25 16:10:15
Modifying it to use the standard s5 util scripts worked great. The output is now fully decoded with word timings:
lattice-1best "ark:gunzip -c exp/tri1/decode_cmu_test/lat.1.gz|" ark:- | lattice-align-words exp/tri1/graph/phones/word_boundary.int exp/tri1/final.mdl ark:- ark:- | nbest-to-ctm ark:- - | ./utils/int2sym.pl -f 5 exp/tri1/graph/words.txt
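In case it is useful to anyone else, here is roughly how I loop the same pipeline over all the lattice shards (this assumes the usual lat.N.gz naming under the decode directory; the combined CTM path is just an example):

# Run the 1-best/CTM pipeline over every lattice archive and collect one CTM.
# Assumes the standard lat.N.gz shards produced by the s5 decode scripts.
for lat in exp/tri1/decode_cmu_test/lat.*.gz; do
  lattice-1best "ark:gunzip -c $lat|" ark:- | \
    lattice-align-words exp/tri1/graph/phones/word_boundary.int exp/tri1/final.mdl ark:- ark:- | \
    nbest-to-ctm ark:- - | \
    ./utils/int2sym.pl -f 5 exp/tri1/graph/words.txt
done > exp/tri1/decode_cmu_test/decode.ctm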
I'm still using the Kaldi LM tools, but when I have more time I will switch to SRILM.
Your help was much appreciated.
Thanks,
Nathan
On Jul 19, 2013, at 3:28 PM, Daniel Povey wrote:
> You should be able to modify the script, yes. I'm surprised that
> variable doesn't work-- try removing everything from the directory
> where it's working. There's a variable set to 10000 that you should
> set to a smaller value. But the Kaldi LM tools aren't great-- I
> recommend SRILM.
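> Something like this might work, assuming the 10000 cutoff is the
> heldout_sent variable near the top of tools/kaldi_lm/train_lm.sh (I'm
> going from memory, so check the variable name in your copy; the paths
> below are the ones from your log):
>
> # Assumption: the heldout cutoff is heldout_sent=10000 in train_lm.sh
> # (GNU sed shown; on a Mac use sed -i '' instead).
> sed -i 's/^heldout_sent=10000/heldout_sent=500/' ../../../tools/kaldi_lm/train_lm.sh
> # Clear the working directory so stale files don't hide the change, then re-run:
> rm -rf data/local/lm/3gram-mincount
> train_lm.sh --arpa --lmtype 3gram-mincount data/local/lm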
> Dan
>
>
> On Fri, Jul 19, 2013 at 2:54 PM, Nathan Dunn <nd...@ca...> wrote:
>>
>> Sorry, I came to the same conclusion about the word_map afterwards.
>>
>> I have only 5K utterances, so that might be the problem.
>>
>> 1 - Can I use fewer utterances for the heldout set by modifying the script? There is a holdout_sent variable, but modifying it has little effect.
>> 2 - Is there a good example to look at for this (wsj/s5, swbd/s6, swbd/s5/swbd1_train_lms_edin, babel/s5/local/train_lms_srilm.sh)? A rough sketch of the SRILM invocation I have in mind is below these questions.
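>>
>> To be concrete about question 2, this is roughly what I have in mind for SRILM (untested, and the file paths are just placeholders for my data):
>>
>> # Rough SRILM sketch (placeholder paths): train a 3-gram on one-sentence-per-line text.
>> # -kndiscount can fail on very small corpora; -wbdiscount is a common fallback.
>> ngram-count -text data/local/lm/train.txt -order 3 -vocab data/local/lm/wordlist -kndiscount -interpolate -unk -lm data/local/lm/srilm.o3g.kn.gz
>> # Sanity-check perplexity on held-out text:
>> ngram -order 3 -lm data/local/lm/srilm.o3g.kn.gz -ppl data/local/lm/heldout.txt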
>>
>> Thanks,
>>
>> Nathan
>>
>>
>>
>> On Jul 19, 2013, at 11:26 AM, Daniel Povey wrote:
>>
>>> The word_map is correct; it has to do with a form of compression.
>>> How many utterances are in your training set? By default it uses the
>>> first 10k as the validation set; if your number of utterances is less
>>> than that, the training set will be empty. You should investigate SRILM
>>> too; you'll probably find it easier to use.
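>>>
>>> A quick way to check is just to count your training utterances and
>>> compare with the 10k cutoff, e.g. (data/train/text is the standard s5
>>> location; adjust if yours differs):
>>>
>>> # If this is well under 10000, essentially everything goes to the
>>> # validation set and the training portion will be empty.
>>> wc -l data/train/text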
>>> Dan
>>>
>>>
>>> On Fri, Jul 19, 2013 at 1:46 PM, Nathan Dunn <nd...@ca...> wrote:
>>>>
>>>> I am adapting an earlier script to mirror what switchboard/s5 is doing.
>>>>
>>>>
>>>>
>>>> When I run:
>>>>
>>>> train_lm.sh --arpa --lmtype 3gram-mincount $dir || exit 1;
>>>>
>>>> it fails on this line:
>>>> sort -m <(gunzip -c $subdir/heldout_ngrams.gz) - | compute_perplexity
>>>>
>>>> and I get the following output:
>>>> Not installing the kaldi_lm toolkit since it is already there.
>>>> Getting raw N-gram counts
>>>> discount_ngrams: for n-gram order 1, D=0.000000, tau=0.000000 phi=1.000000
>>>> discount_ngrams: for n-gram order 2, D=0.000000, tau=0.000000 phi=1.000000
>>>> discount_ngrams: for n-gram order 3, D=1.000000, tau=0.000000 phi=1.000000
>>>> Iteration 1/6 of optimizing discounting parameters
>>>> discount_ngrams: for n-gram order 1, D=0.600000, tau=0.675000 phi=2.000000
>>>> discount_ngrams: for n-gram order 2, D=0.800000, tau=0.675000 phi=2.000000
>>>> discount_ngrams: for n-gram order 3, D=0.000000, tau=0.825000 phi=2.000000
>>>> discount_ngrams: for n-gram order 1, D=0.600000, tau=0.900000 phi=2.000000
>>>> discount_ngrams: for n-gram order 2, D=0.800000, tau=0.900000 phi=2.000000
>>>> discount_ngrams: for n-gram order 3, D=0.000000, tau=1.100000 phi=2.000000
>>>> discount_ngrams: for n-gram order 1, D=0.600000, tau=1.215000 phi=2.000000
>>>> discount_ngrams: for n-gram order 2, D=0.800000, tau=1.215000 phi=2.000000
>>>> discount_ngrams: for n-gram order 3, D=0.000000, tau=1.485000 phi=2.000000
>>>> interpolate_ngrams: 116456 words in wordslist
>>>> compute_perplexity: for history-state "", no total-count % is seen
>>>> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>>>>
>>>> real 0m0.027s
>>>> user 0m0.016s
>>>> sys 0m0.004s
>>>> interpolate_ngrams: 116456 words in wordslist
>>>> interpolate_ngrams: 116456 words in wordslist
>>>> compute_perplexity: for history-state "", no total-count % is seen
>>>> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>>>>
>>>> real 0m0.031s
>>>> user 0m0.020s
>>>> sys 0m0.000s
>>>> compute_perplexity: for history-state "", no total-count % is seen
>>>> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>>>>
>>>> real 0m0.029s
>>>> user 0m0.020s
>>>> sys 0m0.000s
>>>> Usage: optimize_alpha.pl alpha1 perplexity@alpha1 alpha2 perplexity@alpha2 alpha3 perplexity@alph3 at /home/ndunn/svn/kaldi-stable/egs/childspeech/s5/../../../tools/kaldi_lm/optimize_alpha.pl line 23.
>>>> Preparing train and test data
>>>> No such file data/local/lm/3gram-mincount/lm_unpruned.gz
>>>> steps/make_mfcc.sh --nj 20 --cmd scripts/run.pl data/train exp/make_mfcc/train /home
>>>>
>>>> and then it dies.
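>>>>
>>>> For what it's worth, I can inspect the intermediate counts like this (guessing $subdir from the log, i.e. data/local/lm/3gram-mincount):
>>>>
>>>> # Are the heldout n-gram counts empty or tiny? (path is my guess at $subdir)
>>>> gunzip -c data/local/lm/3gram-mincount/heldout_ngrams.gz | wc -l
>>>> gunzip -c data/local/lm/3gram-mincount/heldout_ngrams.gz | head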
>>>>
>>>> THEORY - the get_word_map.pl script is generating a bad word_map. In vim, <80>, <81>, and <82> show up as non-ASCII characters, as shown below.
>>>>
>>>> Following the switchboard script, this line calls get_word_map.pl from the kaldi_lm tools:
>>>> cat $dir/unigram.counts | awk '{print $2}' | get_word_map.pl "<s>" "</s>" "<UNK>"
>>>>
>>>> But this generates the following, where the second column is wonky:
>>>>
>>>> S(/S/) s
>>>> AH(/AH/) v
>>>> IX(/IX/) w
>>>> THEIR x
>>>> /W y
>>>> SOME z
>>>> AA(/AA/) <80>
>>>> /K <81>
>>>> WATER <82>
>>>>
>>>> Similar results on Linux and the Mac.
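>>>>
>>>> To see what is actually in that second column (rather than vim's <80>/<81> rendering), I dumped the raw bytes; the path is a guess based on the swbd recipe, i.e. $dir/word_map:
>>>>
>>>> # Show the word_map bytes as characters/octal escapes instead of vim's rendering.
>>>> head data/local/lm/word_map | od -c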
>>>>
>>>> Any other ideas?
>>>>
>>>> Thanks,
>>>>
>>>> Nathan
>>>>
>>>>
>>>>
>>