From: Mailing l. u. f. U. C. a. U. <kal...@li...> - 2013-07-25 16:10:15
Modifying it to use the standard s5 util scripts worked great. The output is now fully decoded with word timings:
lattice-1best "ark:gunzip -c exp/tri1/decode_cmu_test/lat.1.gz|" ark:- | lattice-align-words exp/tri1/graph/phones/word_boundary.int exp/tri1/final.mdl ark:- ark:- | nbest-to-ctm ark:- - | ./utils/int2sym.pl -f 5 exp/tri1/graph/words.txt
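In case it is useful to anyone else, here is roughly how I loop the same pipeline over all the lattice shards (this assumes the usual lat.N.gz naming under the decode directory; the combined CTM path is just an example):

# Run the 1-best/CTM pipeline over every lattice archive and collect one CTM.
# Assumes the standard lat.N.gz shards produced by the s5 decode scripts.
for lat in exp/tri1/decode_cmu_test/lat.*.gz; do
  lattice-1best "ark:gunzip -c $lat|" ark:- | \
    lattice-align-words exp/tri1/graph/phones/word_boundary.int exp/tri1/final.mdl ark:- ark:- | \
    nbest-to-ctm ark:- - | \
    ./utils/int2sym.pl -f 5 exp/tri1/graph/words.txt
done > exp/tri1/decode_cmu_test/decode.ctm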
I'm still using the Kaldi LM tools, but when I have more time I will switch to SRILM.
Your help was much appreciated.
Thanks,
Nathan
On Jul 19, 2013, at 3:28 PM, Daniel Povey wrote:
> You should be able to modify the script, yes. I'm surprised that
> variable doesn't work-- try removing everything from the directory
> where it's working. There's a variable set to 10000 that you should
> set to a smaller value. But the Kaldi LM tools aren't great-- I
> recommend SRILM.
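> Something like this might work, assuming the 10000 cutoff is the
> heldout_sent variable near the top of tools/kaldi_lm/train_lm.sh (I'm
> going from memory, so check the variable name in your copy; the paths
> below are the ones from your log):
>
> # Assumption: the heldout cutoff is heldout_sent=10000 in train_lm.sh
> # (GNU sed shown; on a Mac use sed -i '' instead).
> sed -i 's/^heldout_sent=10000/heldout_sent=500/' ../../../tools/kaldi_lm/train_lm.sh
> # Clear the working directory so stale files don't hide the change, then re-run:
> rm -rf data/local/lm/3gram-mincount
> train_lm.sh --arpa --lmtype 3gram-mincount data/local/lm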
> Dan
>
>
> On Fri, Jul 19, 2013 at 2:54 PM, Nathan Dunn <nd...@ca...> wrote:
>>
>> Sorry, I came to the same conclusion about the word_map afterwards.
>>
>> I have only 5K utterances, so that might be the problem.
>>
>> 1 - Can I use fewer utterances for the heldout set by modifying the script? There is a holdout_sent variable, but modifying it has little effect.
>> 2 - Is there a good example to look at for this (wsj/s5, swbd/s6, swbd/s5/swbd1_train_lms_edin, babel/s5/local/train_lms_srilm.sh)? A rough sketch of the SRILM invocation I have in mind is below these questions.
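>>
>> To be concrete about question 2, this is roughly what I have in mind for SRILM (untested, and the file paths are just placeholders for my data):
>>
>> # Rough SRILM sketch (placeholder paths): train a 3-gram on one-sentence-per-line text.
>> # -kndiscount can fail on very small corpora; -wbdiscount is a common fallback.
>> ngram-count -text data/local/lm/train.txt -order 3 -vocab data/local/lm/wordlist -kndiscount -interpolate -unk -lm data/local/lm/srilm.o3g.kn.gz
>> # Sanity-check perplexity on held-out text:
>> ngram -order 3 -lm data/local/lm/srilm.o3g.kn.gz -ppl data/local/lm/heldout.txt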
>>
>> Thanks,
>>
>> Nathan
>>
>>
>>
>> On Jul 19, 2013, at 11:26 AM, Daniel Povey wrote:
>>
>>> The word_map is correct; it has to do with a form of compression.
>>> How many utterances are in your training set? By default it uses the
>>> first 10k as the validation set; if your number of utterances is less
>>> than that, the training set will be empty. You should investigate SRILM
>>> too; you'll probably find it easier to use.
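>>>
>>> A quick way to check is just to count your training utterances and
>>> compare with the 10k cutoff, e.g. (data/train/text is the standard s5
>>> location; adjust if yours differs):
>>>
>>> # If this is well under 10000, essentially everything goes to the
>>> # validation set and the training portion will be empty.
>>> wc -l data/train/text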
>>> Dan
>>>
>>>
>>> On Fri, Jul 19, 2013 at 1:46 PM, Nathan Dunn <nd...@ca...> wrote:
>>>>
>>>> I am adapting an earlier script to mirror what switchboard/s5 is doing.
>>>>
>>>>
>>>>
>>>> When I run:
>>>>
>>>> train_lm.sh --arpa --lmtype 3gram-mincount $dir || exit 1;
>>>>
>>>> it fails on this line:
>>>> sort -m <(gunzip -c $subdir/heldout_ngrams.gz) - | compute_perplexity
>>>>
>>>> and I get the following output:
>>>> Not installing the kaldi_lm toolkit since it is already there.
>>>> Getting raw N-gram counts
>>>> discount_ngrams: for n-gram order 1, D=0.000000, tau=0.000000 phi=1.000000
>>>> discount_ngrams: for n-gram order 2, D=0.000000, tau=0.000000 phi=1.000000
>>>> discount_ngrams: for n-gram order 3, D=1.000000, tau=0.000000 phi=1.000000
>>>> Iteration 1/6 of optimizing discounting parameters
>>>> discount_ngrams: for n-gram order 1, D=0.600000, tau=0.675000 phi=2.000000
>>>> discount_ngrams: for n-gram order 2, D=0.800000, tau=0.675000 phi=2.000000
>>>> discount_ngrams: for n-gram order 3, D=0.000000, tau=0.825000 phi=2.000000
>>>> discount_ngrams: for n-gram order 1, D=0.600000, tau=0.900000 phi=2.000000
>>>> discount_ngrams: for n-gram order 2, D=0.800000, tau=0.900000 phi=2.000000
>>>> discount_ngrams: for n-gram order 3, D=0.000000, tau=1.100000 phi=2.000000
>>>> discount_ngrams: for n-gram order 1, D=0.600000, tau=1.215000 phi=2.000000
>>>> discount_ngrams: for n-gram order 2, D=0.800000, tau=1.215000 phi=2.000000
>>>> discount_ngrams: for n-gram order 3, D=0.000000, tau=1.485000 phi=2.000000
>>>> interpolate_ngrams: 116456 words in wordslist
>>>> compute_perplexity: for history-state "", no total-count % is seen
>>>> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>>>>
>>>> real 0m0.027s
>>>> user 0m0.016s
>>>> sys 0m0.004s
>>>> interpolate_ngrams: 116456 words in wordslist
>>>> interpolate_ngrams: 116456 words in wordslist
>>>> compute_perplexity: for history-state "", no total-count % is seen
>>>> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>>>>
>>>> real 0m0.031s
>>>> user 0m0.020s
>>>> sys 0m0.000s
>>>> compute_perplexity: for history-state "", no total-count % is seen
>>>> (perhaps you didn't put the training n-grams through interpolate_ngrams?)
>>>>
>>>> real 0m0.029s
>>>> user 0m0.020s
>>>> sys 0m0.000s
>>>> Usage: optimize_alpha.pl alpha1 perplexity@alpha1 alpha2 perplexity@alpha2 alpha3 perplexity@alph3 at /home/ndunn/svn/kaldi-stable/egs/childspeech/s5/../../../tools/kaldi_lm/optimize_alpha.pl line 23.
>>>> Preparing train and test data
>>>> No such file data/local/lm/3gram-mincount/lm_unpruned.gz
>>>> steps/make_mfcc.sh --nj 20 --cmd scripts/run.pl data/train exp/make_mfcc/train /home
>>>>
>>>> and then it dies.
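>>>>
>>>> For what it's worth, I can inspect the intermediate counts like this (guessing $subdir from the log, i.e. data/local/lm/3gram-mincount):
>>>>
>>>> # Are the heldout n-gram counts empty or tiny? (path is my guess at $subdir)
>>>> gunzip -c data/local/lm/3gram-mincount/heldout_ngrams.gz | wc -l
>>>> gunzip -c data/local/lm/3gram-mincount/heldout_ngrams.gz | head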
>>>>
>>>> THEORY - the get_word_map.pl script is generating a bad word_map. In vim, <80>, <81>, and <82> show up as non-ASCII characters, as shown below.
>>>>
>>>> Following the switchboard script, this line calls get_word_map.pl from the kaldi_lm tools:
>>>> cat $dir/unigram.counts | awk '{print $2}' | get_word_map.pl "<s>" "</s>" "<UNK>"
>>>>
>>>> But this generates the following, where the second column is wonky:
>>>>
>>>> S(/S/) s
>>>> AH(/AH/) v
>>>> IX(/IX/) w
>>>> THEIR x
>>>> /W y
>>>> SOME z
>>>> AA(/AA/) <80>
>>>> /K <81>
>>>> WATER <82>
>>>>
>>>> Similar results on Linux and the Mac.
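>>>>
>>>> To see what is actually in that second column (rather than vim's <80>/<81> rendering), I dumped the raw bytes; the path is a guess based on the swbd recipe, i.e. $dir/word_map:
>>>>
>>>> # Show the word_map bytes as characters/octal escapes instead of vim's rendering.
>>>> head data/local/lm/word_map | od -c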
>>>>
>>>> Any other ideas?
>>>>
>>>> Thanks,
>>>>
>>>> Nathan
>>>>
>>>>
>>>>
>>