From: Daniel P. <dp...@gm...> - 2015-03-06 08:26:05
I am forwarding this email thread to kaldi-developers as I think it will be of interest to people. Vimal found that we can improve sMBR-trained neural nets by recomputing the priors after sMBR training -- setting them to the average posterior computed by the neural net on randomly chosen training data. [This is a little bit like ensuring that the Gaussian mixture weights sum to one in a generative model, which is normally done in discriminative training even though in principle the objective function would make sense without ensuring that they sum to one.]

Karel has checked in the change for the nnet1 setup, and I believe he has also changed the script to make --one-silence-class true the default. "--one-silence-class true" tends to improve results by reducing the insertion rate, as well as making more sense as an objective function. Basically, the old objective function (standard MPE/SMBR/MPFE) had an asymmetry w.r.t. insertions: insertions into silence regions were not counted as errors. This never made sense but was done because it had seemed to work (this was in other toolkits though, like HTK). Anyway, --one-silence-class true makes the objective function more symmetric, and also makes it so that all silence phones (silence, noise, etc.) or silence pdfs are treated as a single class, so replacing silence with noise or vice versa is not counted as an error. This makes sense because it's similar to how we normally score the systems.

I'm hoping that in a couple of weeks we can check in the corresponding changes to the discriminative training setup for nnet2. I'd like to test it on a few setups first though.

Dan
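The prior recomputation discussed in this thread amounts to replacing the alignment-based label frequencies with the network's own average softmax output over a random subset of training frames; decoding still uses log-posterior minus log-prior as the acoustic score, only the prior vector changes. A rough sketch of the averaging step -- not the actual Kaldi scripts; forward_pass() and utterance_feats are placeholders for the trained network and a random subset of training utterances:

    import numpy as np

    def recompute_priors(forward_pass, utterance_feats, floor=1e-10):
        # Accumulate the DNN's softmax outputs over a subset of training
        # utterances and renormalize; the result replaces the priors
        # estimated from pdf-level alignment counts.
        total = None
        num_frames = 0
        for feats in utterance_feats:            # feats: (T, feat_dim) array
            post = forward_pass(feats)           # (T, num_pdfs) softmax outputs
            total = post.sum(axis=0) if total is None else total + post.sum(axis=0)
            num_frames += post.shape[0]
        priors = np.maximum(total / num_frames, floor)  # floor to avoid log(0)
        return priors / priors.sum()

    # Decoding then uses log p(pdf | o_t) - log prior(pdf) as before; after
    # sMBR training the averaging is simply rerun with the sMBR-trained
    # model (the *_PRIOR results below).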
---------- Forwarded message ----------
From: Vesely Karel <ive...@fi...>
Date: Thu, Mar 5, 2015 at 8:24 AM
Subject: Re: Large improvements by adjusting priors
To: dp...@gm..., Vimal Manohar <vim...@gm...>

Okay, thanks, I just committed the updated sMBR script which estimates the priors on the training data. It has fixed the problem of too many deletions appearing after sMBR training (there are errors in the training transcripts, so sMBR does not help much here):

%WER 78.4 | 2711 24825 | 24.8 47.8 27.5 3.2 78.4 99.6 | -1.103 | exp/dnn6b_butbn1_pretrain-dbn_dnn/decode_vllp.tune.seg1/scoring_lex_10/ctm.filt.sub.sys
%WER 78.4 | 2711 24825 | 24.4 46.6 29.0 2.8 78.4 99.6 | -1.208 | exp/dnn6b_butbn1_pretrain-dbn_dnn/decode_vllp.tune.seg1_PRIOR/scoring_lex_11/ctm.filt.sub.sys
=> no change on frame cross-entropy training

%WER 80.7 | 2711 24825 | 20.7 29.8 49.4 1.4 80.7 99.8 | -1.130 | exp/dnn6c_butbn1_pretrain-dbn_dnn_smbr/decode_vllp.tune.seg1/scoring_lex_9/ctm.filt.sub.sys
%WER 78.0 | 2711 24825 | 24.7 45.0 30.4 2.7 78.0 99.6 | -1.109 | exp/dnn6c_butbn1_pretrain-dbn_dnn_smbr/decode_vllp.tune.seg1_PRIOR/scoring_lex_11/ctm.filt.sub.sys
=> helpful with sMBR training

Also changed the defaults of sMBR as Dan suggested:
do_smbr=true
exclude_silphones=true
one_silence_class=true

Thanks,
Karel.

On 03/04/2015 10:50 PM, Daniel Povey wrote:

Karel, also note that the --one-silence-class thing seems to have been helpful in quite a few scenarios. We should consider making this the default. Anyway, the original formulation never made sense; it was always a hack. --one-silence-class makes more sense.

Dan

On Wed, Mar 4, 2015 at 4:43 PM, Vimal Manohar <vim...@gm...> wrote:

> Yes, that is correct. I found it helpful especially in the cases where the epoch4 model was performing worse than epoch3, like when using a high learning rate. But after recomputing priors (individually for both the models) at the end of sMBR training, the epoch4 model was much better than epoch3.
>
> On 03/04, Daniel Povey wrote:
>
>> Karel, I already do that at the end of my frame cross-entropy training. It was never clear that it made a big difference, but I felt it was the right way to do it.
>> I think what Vimal was saying is that he did the same at the end of sMBR training and it did make a difference.
>> Dan
>>
>> On Wed, Mar 4, 2015 at 4:28 PM, Karel Veselý <ive...@fi...> wrote:
>>
>>> Wow, that sounds good, just to check that I understand: instead of taking relative frequencies from the pdf-alignment, you compute the priors as the average DNN output on a subset of data at the end of frame cross-entropy training. And then the priors are fixed during the sMBR training...
>>> Did I get it correctly?
>>> Thanks,
>>> Karel.
>>>
>>> Dne 4. 3. 2015 v 18:43 Daniel Povey napsal(a):
>>>
>>>> I am getting large improvements by adjusting priors on my Fisher setup and also on some of the Babel systems.
>>>
>>> Great news! So I guess it means you recompute the prior term based on the average posteriors on a subset of data, just like at the end of the cross-entropy training script. Cc'ing Karel for his info, as he might want to put this into his SMBR script.
>>>
>>>> On the baseline discriminative supervised system, Nnet2_SMBR, the improvement is 0.4% over not adjusting priors.
>>>
>>> Cool. Since this is a minor change we can check it into the SMBR training scripts quite soon.
>>>
>>>> On the SMBR multilingual recipe semisupervised system Multilang2_SMBR, the improvement is around 0.2%.
>>>> On the lattice entropy stuff Multinnet2_NCE+SMBR, the improvement is again 0.2%.
>>>> These are just the improvements considering only the respective previous best systems.
>>>> Some of the other systems that were performing worse before seem to have been worse only because of a mismatch of priors. Some of the lattice entropy systems got around a 1% improvement, bringing them closer to the best lattice entropy system. Also, we had an issue before of the unsupervised part of the neural net performing better than the supervised part. This is mostly mitigated by adjusting priors.
>>>> I am testing the prior adjustment on the Babel languages.
>>>> Also, I tried SMBR with one-silence-class in some Babel languages; it gives around a 1% improvement. It looks to be mostly due to a decrease in insertions and substitutions, but a slight increase in deletions. I am now trying to see its effect in the supervised part of the lattice entropy semisupervised recipe.
>>>
>>> Cool!
>>>
>>>> Is there a way to extend one-silence-class to MMI or lattice entropy? Can we merge arcs at a particular time that have silence pdfs and then pass the gradients to all the silence pdfs in the DNN output layer?
>>>
>>> The one-silence-class thing is specific to MPE and SMBR; it's not applicable to MMI or cross-entropy.
>>> Dan
>>>
>>>> Regards,
>>>>
>>>> --
>>>> Vimal Manohar
>>>> Doctoral Student
>>>> Electrical & Computer Engineering
>>>> Johns Hopkins University
>>>> Baltimore, MD
>>>
>>> --
>>> Karel Vesely, Brno University of Tec...@fi..., +420-54114-1300
>
> --
> Vimal Manohar
> Doctoral Student
> Electrical & Computer Engineering
> Johns Hopkins University
> Baltimore, MD

--
Karel Vesely, Brno University of Tec...@fi..., +420-54114-1300
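To make the --one-silence-class change discussed above concrete: in the sMBR/MPE accuracy function, all silence pdfs (silence, noise, etc.) are pooled into one class, so any silence hypothesized over reference silence counts as correct, while hypothesizing speech over reference silence (an insertion) now loses accuracy; under the old behaviour silence hypotheses earned no credit, so such insertions were effectively free. A rough per-frame sketch of the idea -- schematic only, not the actual Kaldi lattice code; hyp_pdf, ref_pdf, and silence_pdfs are placeholders:

    def frame_accuracy(hyp_pdf, ref_pdf, silence_pdfs, one_silence_class=True):
        # Schematic accuracy term that the sMBR objective sums over frames,
        # weighted by lattice arc posteriors.
        hyp_sil = hyp_pdf in silence_pdfs
        ref_sil = ref_pdf in silence_pdfs
        if one_silence_class:
            # Any silence pdf over reference silence is correct, so a speech
            # pdf over reference silence (an insertion) now loses a full
            # frame of accuracy relative to the correct path.
            return 1.0 if (hyp_pdf == ref_pdf or (hyp_sil and ref_sil)) else 0.0
        # Old behaviour: silence hypotheses never earn credit, so no path is
        # rewarded for getting silence regions right and insertions there are
        # not penalized relative to hypothesizing silence.
        return 1.0 if (hyp_pdf == ref_pdf and not hyp_sil) else 0.0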