From: Mailing l. u. f. U. C. a. U. <kal...@li...> - 2013-07-19 18:26:47
The word_map is correct, it has to do with a form of compression.
How many utterances are in your training set? By default it uses the
first 10k as the validation set; if your #utts is less than this, then
the training set will be empty. You should investigate SRILM too;
you'll probably find it easier to use.
Dan
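
[For illustration, the compression mentioned above works roughly like this: words are assigned short byte codes, with the most frequent words getting single bytes, so codes above 0x7F (what vim displays as <80>, <81>, <82>) are expected rather than corruption. A minimal sketch of the idea in Python -- not the actual get_word_map.pl algorithm; the alphabet and ordering here are assumptions:

# Sketch only: assigns each word a compact byte code, most frequent first.
# The real get_word_map.pl may reserve different bytes; this alphabet is made up.
def make_word_map(words_by_descending_count):
    alphabet = [bytes([b]) for b in range(33, 256)]  # skip control chars and space
    base = len(alphabet)
    word_map = {}
    for rank, word in enumerate(words_by_descending_count):
        code, n = b"", rank
        while True:                      # encode the rank in base `base`
            code = alphabet[n % base] + code
            n //= base
            if n == 0:
                break
        word_map[word] = code            # frequent words get one byte, e.g. b'\x80'
    return word_map

Decoding just inverts the same table, so an entry like <80> is perfectly valid as long as the same map is used consistently.]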
On Fri, Jul 19, 2013 at 1:46 PM, Nathan Dunn <nd...@ca...> wrote:
>
> I am adapting an earlier script to mirror what switchboard/s5 is doing.
>
>
>
> train_lm.sh --arpa --lmtype 3gram-mincount $dir || exit 1;
>
> On this line:
> sort -m <(gunzip -c $subdir/heldout_ngrams.gz) - | compute_perplexity
>
> I get:
> Not installing the kaldi_lm toolkit since it is already there.
> Getting raw N-gram counts
> discount_ngrams: for n-gram order 1, D=0.000000, tau=0.000000 phi=1.000000
> discount_ngrams: for n-gram order 2, D=0.000000, tau=0.000000 phi=1.000000
> discount_ngrams: for n-gram order 3, D=1.000000, tau=0.000000 phi=1.000000
> Iteration 1/6 of optimizing discounting parameters
> discount_ngrams: for n-gram order 1, D=0.600000, tau=0.675000 phi=2.000000
> discount_ngrams: for n-gram order 2, D=0.800000, tau=0.675000 phi=2.000000
> discount_ngrams: for n-gram order 3, D=0.000000, tau=0.825000 phi=2.000000
> discount_ngrams: for n-gram order 1, D=0.600000, tau=0.900000 phi=2.000000
> discount_ngrams: for n-gram order 2, D=0.800000, tau=0.900000 phi=2.000000
> discount_ngrams: for n-gram order 3, D=0.000000, tau=1.100000 phi=2.000000
> discount_ngrams: for n-gram order 1, D=0.600000, tau=1.215000 phi=2.000000
> discount_ngrams: for n-gram order 2, D=0.800000, tau=1.215000 phi=2.000000
> discount_ngrams: for n-gram order 3, D=0.000000, tau=1.485000 phi=2.000000
> interpolate_ngrams: 116456 words in wordslist
> compute_perplexity: for history-state "", no total-count % is seen
> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>
> real 0m0.027s
> user 0m0.016s
> sys 0m0.004s
> interpolate_ngrams: 116456 words in wordslist
> interpolate_ngrams: 116456 words in wordslist
> compute_perplexity: for history-state "", no total-count % is seen
> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>
> real 0m0.031s
> user 0m0.020s
> sys 0m0.000s
> compute_perplexity: for history-state "", no total-count % is seen
> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>
> real 0m0.029s
> user 0m0.020s
> sys 0m0.000s
> Usage: optimize_alpha.pl alpha1 perplexity@alpha1 alpha2 perplexity@alpha2 alpha3 perplexity@alph3 at /home/ndunn/svn/kaldi-stable/egs/childspeech/s5/../../../tools/kaldi_lm/optimize_alpha.pl line 23.
> Preparing train and test data
> No such file data/local/lm/3gram-mincount/lm_unpruned.gz
> steps/make_mfcc.sh --nj 20 --cmd scripts/run.pl data/train exp/make_mfcc/train /home
>
> It dies there.
>
> THEORY - the get_word_map.pl script is generating a bad word_map. In vim, <80>, <81>, and <82> show up as non-ASCII characters, as below:
>
> Using the switchboard script, this line calls get_word_map.pl from the kaldi_lm tools:
> cat $dir/unigram.counts | awk '{print $2}' | get_word_map.pl "<s>" "</s>" "<UNK>"
>
> But this generates the following, where the second column is wonky.
>
> S(/S/) s
> AH(/AH/) v
> IX(/IX/) w
> THEIR x
> /W y
> SOME z
> AA(/AA/) <80>
> /K <81>
> WATER <82>
>
> Similar results on Linux and the Mac.
>
> Any other ideas?
>
> Thanks,
>
> Nathan
>
>
>
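
[For reference, the cascade in the quoted log (compute_perplexity complaining about history-state "", the optimize_alpha.pl usage message, the missing lm_unpruned.gz) is what an empty training set looks like. A minimal sketch of the heldout split described in the reply above -- assumed behaviour with made-up names, not the actual train_lm.sh code:

NUM_HELDOUT = 10000  # first 10k utterances become the validation set

def split_corpus(utterances):
    heldout = utterances[:NUM_HELDOUT]
    train = utterances[NUM_HELDOUT:]
    if not train:
        # Corpus smaller than the heldout size: all data is held out,
        # the training n-gram counts are empty, and the perplexity
        # optimization downstream has nothing to work with.
        raise ValueError("training set is empty; corpus has fewer than 10k utterances")
    return train, heldout
]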