From: Mailing l. u. f. U. C. a. U. <kal...@li...> - 2013-07-19 22:28:59
You should be able to modify the script, yes. I'm surprised that
variable doesn't work-- try removing everything from the directory
where it's working. There's a variable set to 10000 that you should
set to a smaller value. But the Kaldi LM tools aren't great-- I
recommend SRILM.
Dan
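
A minimal sketch of the change Dan describes, assuming the cutoff lives in a variable named heldout_sent near the top of tools/kaldi_lm/train_lm.sh (verify the exact name and path in your checkout):

  # In tools/kaldi_lm/train_lm.sh, change the held-out sentence count
  # from its default:
  heldout_sent=10000
  # to something smaller than your utterance count, e.g. for a 5k corpus:
  heldout_sent=500

  # Then clear the working directory so stale files from the failed run
  # are not reused (directory name hypothetical), and re-run:
  rm -rf data/local/lm/3gram-mincount
  train_lm.sh --arpa --lmtype 3gram-mincount data/local/lm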
On Fri, Jul 19, 2013 at 2:54 PM, Nathan Dunn <nd...@ca...> wrote:
>
> Sorry, I came to the same conclusion about the word_map afterwards.
>
> I have only 5K utterances, so that might be the problem.
>
> 1 - Can I hold out less data by modifying the script? There is a heldout_sent variable, but modifying it has little effect.
> 2 - Is there a good example to look at for this (wsj/s5, swbd/s6, swbd/s5/swbd1_train_lms_edin, or babel/s5/local/train_lms_srilm.sh)?
>
> Thanks,
>
> Nathan
>
>
>
> On Jul 19, 2013, at 11:26 AM, Daniel Povey wrote:
>
>> The word_map is correct; it has to do with a form of compression.
>> How many utterances are in your training set? By default it uses the
>> 1st 10k as validation set; if your #utts is less than this, then the
>> training set will be empty. You should investigate SRILM too, you'll
>> probably find it easier to use.
>> Dan
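
A quick way to check the utterance-count point from the shell, assuming the LM training text is one utterance per line (the path below is hypothetical):

  # If this count is at or below the 10000-sentence validation cutoff,
  # the training split is left empty, which would explain the
  # compute_perplexity errors quoted below:
  wc -l < data/local/lm/train.txt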
>>
>>
>> On Fri, Jul 19, 2013 at 1:46 PM, Nathan Dunn <nd...@ca...> wrote:
>>>
>>> I am adapting an earlier script to mirror what switchboard/s5 is doing.
>>>
>>>
>>>
>>> train_lm.sh --arpa --lmtype 3gram-mincount $dir || exit 1;
>>>
>>> It fails on this line:
>>> sort -m <(gunzip -c $subdir/heldout_ngrams.gz) - | compute_perplexity
>>>
>>> I get:
>>> Not installing the kaldi_lm toolkit since it is already there.
>>> Getting raw N-gram counts
>>> discount_ngrams: for n-gram order 1, D=0.000000, tau=0.000000 phi=1.000000
>>> discount_ngrams: for n-gram order 2, D=0.000000, tau=0.000000 phi=1.000000
>>> discount_ngrams: for n-gram order 3, D=1.000000, tau=0.000000 phi=1.000000
>>> Iteration 1/6 of optimizing discounting parameters
>>> discount_ngrams: for n-gram order 1, D=0.600000, tau=0.675000 phi=2.000000
>>> discount_ngrams: for n-gram order 2, D=0.800000, tau=0.675000 phi=2.000000
>>> discount_ngrams: for n-gram order 3, D=0.000000, tau=0.825000 phi=2.000000
>>> discount_ngrams: for n-gram order 1, D=0.600000, tau=0.900000 phi=2.000000
>>> discount_ngrams: for n-gram order 2, D=0.800000, tau=0.900000 phi=2.000000
>>> discount_ngrams: for n-gram order 3, D=0.000000, tau=1.100000 phi=2.000000
>>> discount_ngrams: for n-gram order 1, D=0.600000, tau=1.215000 phi=2.000000
>>> discount_ngrams: for n-gram order 2, D=0.800000, tau=1.215000 phi=2.000000
>>> discount_ngrams: for n-gram order 3, D=0.000000, tau=1.485000 phi=2.000000
>>> interpolate_ngrams: 116456 words in wordslist
>>> compute_perplexity: for history-state "", no total-count % is seen
>>> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>>>
>>> real 0m0.027s
>>> user 0m0.016s
>>> sys 0m0.004s
>>> interpolate_ngrams: 116456 words in wordslist
>>> interpolate_ngrams: 116456 words in wordslist
>>> compute_perplexity: for history-state "", no total-count % is seen
>>> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>>>
>>> real 0m0.031s
>>> user 0m0.020s
>>> sys 0m0.000s
>>> compute_perplexity: for history-state "", no total-count % is seen
>>> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>>>
>>> real 0m0.029s
>>> user 0m0.020s
>>> sys 0m0.000s
>>> Usage: optimize_alpha.pl alpha1 perplexity@alpha1 alpha2 perplexity@alpha2 alpha3 perplexity@alph3 at /home/ndunn/svn/kaldi-stable/egs/childspeech/s5/../../../tools/kaldi_lm/optimize_alpha.pl line 23.
>>> Preparing train and test data
>>> No such file data/local/lm/3gram-mincount/lm_unpruned.gz
>>> steps/make_mfcc.sh --nj 20 --cmd scripts/run.pl data/train exp/make_mfcc/train /home
>>>
>>> and then it dies.
>>>
>>> THEORY - the get_word_map.pl script is generating a bad word_map. In vim, <80>, <81>, and <82> show up as non-ASCII characters, as below:
>>>
>>> The switchboard script calls get_word_map.pl from the kaldi_lm tools on this line:
>>> cat $dir/unigram.counts | awk '{print $2}' | get_word_map.pl "<s>" "</s>" "<UNK>"
>>>
>>> But this generates the following, where the second column is wonky.
>>>
>>> S(/S/) s
>>> AH(/AH/) v
>>> IX(/IX/) w
>>> THEIR x
>>> /W y
>>> SOME z
>>> AA(/AA/) <80>
>>> /K <81>
>>> WATER <82>
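
As Dan notes above, output like this is expected: get_word_map.pl appears to map each word to a short byte string (printable characters first, then raw high bytes such as 0x80) so the n-gram files compress well; the second column is a compression code, not corruption. One way to inspect the map without the terminal mangling the bytes (path hypothetical):

  # cat -v prints high bytes in M- notation (0x80 shows as M-^@),
  # making it easy to see they are single deliberate bytes:
  cat -v data/local/lm/word_map | head -n 20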
>>>
>>> Similar results on Linux and the Mac.
>>>
>>> Any other ideas?
>>>
>>> Thanks,
>>>
>>> Nathan
>>>
>>>
>>>
>