From: Mailing l. u. f. U. C. a. U. <kal...@li...> - 2013-07-29 22:38:11
Thanks.

> It's unusual that the later stages of training are not better.
> Normally you get a substantial improvement.

I wonder if this is due to the very small amount of my training data. Is there a recommended recipe that I should follow for this type of data (20K words of training data, decoding 1-minute-long passages)? I tried to use swbd, but ended up going back to the settings that more closely matched Resource Management.

Nathan

On Jul 29, 2013, at 3:27 PM, Daniel Povey wrote:

>> 1 - I have a training set of around 5K words, though I could bring that up
>> to around 20K
>
> More language model training data will definitely help.
>
>> 2 - I am using kaldi_lm, though I could use SRILM ... not sure if it
>> would necessarily improve results
>
> Probably would make no difference -- more a usability issue.
>
>> 3 - I am decoding about 1 minute of text, though the training data is in 10
>> second epochs. I can mix some of the test data in if that would help.
>
> It's not considered good form to mix the test data in with training --
> this will give you unrealistically good results.
>
>> 4 - When I am training deltas I use a very small number of leaves / Gaussians
>> (100 / 1000) to get the best results. The best results are with tri1.
>> Further training yields worse results.
>
> It's unusual that the later stages of training are not better.
> Normally you get a substantial improvement.
>
> Dan
>
>> 5 - I use the same lexicon for training and decoding (though a more
>> restrictive language model for decoding).
>
>> Any help / thoughts are appreciated.
>>
>> Thanks,
>>
>> Nathan
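
For concreteness, a minimal sketch of the tri1 (delta) training step described in point 4, in the style of the Kaldi Resource Management recipe. The small leaf/Gaussian counts (100 / 1000) come from the thread; the data, lang, and alignment directory names, the job count, and the cmd.sh/path.sh setup are assumptions for illustration, not from the original message.

    #!/usr/bin/env bash
    # Illustrative sketch only: an RM-style tri1 step with deliberately
    # small model sizes for a ~20K-word training set. Directory names and
    # --nj are assumptions.
    . ./cmd.sh
    . ./path.sh

    # Align the training data with the monophone system first.
    steps/align_si.sh --nj 4 --cmd "$train_cmd" \
      data/train data/lang exp/mono exp/mono_ali

    # Train the first triphone (delta) system with 100 tree leaves and
    # 1000 total Gaussians, as described in point 4 above.
    steps/train_deltas.sh --cmd "$train_cmd" \
      100 1000 data/train data/lang exp/mono_ali exp/tri1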