From: Daniel P. <dp...@gm...> - 2015-03-06 08:26:05
I am forwarding this email thread to kaldi-developers as I think it will be of interest to people. Vimal found that we can improve sMBR-trained neural nets by recomputing the priors after sMBR training -- setting them to the average posterior computed by the neural net on randomly chosen training data. [This is a little bit like ensuring that the Gaussian mixture weights sum to one in a generative model, which is normally done in discriminative training even though in principle the objective function would make sense without ensuring that they sum to one.]

Karel has checked in the change for the nnet1 setup, and I believe he has also changed the script to make --one-silence-class true the default. "--one-silence-class true" tends to improve results by reducing the insertion rate, as well as making more sense as an objective function. Basically, the old objective function (standard MPE/SMBR/MPFE) had an asymmetry w.r.t. insertions: insertions into silence regions were not counted as errors. This never made sense but was done because it had seemed to work (this was in other toolkits though, like HTK). Anyway, --one-silence-class true makes the objective function more symmetric, and also makes it so that all silence phones (silence, noise, etc.) or silence pdfs are treated as a single class, so replacing silence with noise or vice versa is not counted as an error. This makes sense because it's similar to how we normally score the systems.

I'm hoping that in a couple of weeks we can check in the corresponding changes to the discriminative training setup for nnet2. I'd like to test it on a few setups first though.

Dan
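The prior recomputation discussed in this thread amounts to replacing the alignment-based label frequencies with the network's own average softmax output over a random subset of training frames; decoding still uses log-posterior minus log-prior as the acoustic score, only the prior vector changes. A rough sketch of the averaging step -- not the actual Kaldi scripts; forward_pass() and utterance_feats are placeholders for the trained network and a random subset of training utterances:

    import numpy as np

    def recompute_priors(forward_pass, utterance_feats, floor=1e-10):
        # Accumulate the DNN's softmax outputs over a subset of training
        # utterances and renormalize; the result replaces the priors
        # estimated from pdf-level alignment counts.
        total = None
        num_frames = 0
        for feats in utterance_feats:            # feats: (T, feat_dim) array
            post = forward_pass(feats)           # (T, num_pdfs) softmax outputs
            total = post.sum(axis=0) if total is None else total + post.sum(axis=0)
            num_frames += post.shape[0]
        priors = np.maximum(total / num_frames, floor)  # floor to avoid log(0)
        return priors / priors.sum()

    # Decoding then uses log p(pdf | o_t) - log prior(pdf) as before; after
    # sMBR training the averaging is simply rerun with the sMBR-trained
    # model (the *_PRIOR results below).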
---------- Forwarded message ----------
From: Vesely Karel <ive...@fi...>
Date: Thu, Mar 5, 2015 at 8:24 AM
Subject: Re: Large improvements by adjusting priors
To: dp...@gm..., Vimal Manohar <vim...@gm...>

Okay, thanks, I just committed the updated sMBR script which estimates the priors on the training data. It has fixed the problem of too many deletions appearing after sMBR training (there are errors in the training transcripts, so sMBR does not help much here):

%WER 78.4 | 2711 24825 | 24.8 47.8 27.5 3.2 78.4 99.6 | -1.103 | exp/dnn6b_butbn1_pretrain-dbn_dnn/decode_vllp.tune.seg1/scoring_lex_10/ctm.filt.sub.sys
%WER 78.4 | 2711 24825 | 24.4 46.6 29.0 2.8 78.4 99.6 | -1.208 | exp/dnn6b_butbn1_pretrain-dbn_dnn/decode_vllp.tune.seg1_PRIOR/scoring_lex_11/ctm.filt.sub.sys
=> no change on frame cross-entropy training

%WER 80.7 | 2711 24825 | 20.7 29.8 49.4 1.4 80.7 99.8 | -1.130 | exp/dnn6c_butbn1_pretrain-dbn_dnn_smbr/decode_vllp.tune.seg1/scoring_lex_9/ctm.filt.sub.sys
%WER 78.0 | 2711 24825 | 24.7 45.0 30.4 2.7 78.0 99.6 | -1.109 | exp/dnn6c_butbn1_pretrain-dbn_dnn_smbr/decode_vllp.tune.seg1_PRIOR/scoring_lex_11/ctm.filt.sub.sys
=> helpful with sMBR training

Also changed the defaults of sMBR as Dan suggested:
do_smbr=true
exclude_silphones=true
one_silence_class=true

Thanks,
Karel.

On 03/04/2015 10:50 PM, Daniel Povey wrote:

Karel, also note that the --one-silence-class thing seems to have been helpful in quite a few scenarios. We should consider making this the default. Anyway, the original formulation never made sense; it was always a hack. --one-silence-class makes more sense.

Dan

On Wed, Mar 4, 2015 at 4:43 PM, Vimal Manohar <vim...@gm...> wrote:

> Yes, that is correct. I found it helpful especially in the cases where the epoch4 model was performing worse than epoch3, like when using a high learning rate. But after recomputing priors (individually for both the models) at the end of sMBR training, the epoch4 model was much better than epoch3.
>
> On 03/04, Daniel Povey wrote:
>
>> Karel, I already do that at the end of my frame cross-entropy training. It was never clear that it made a big difference, but I felt it was the right way to do it.
>> I think what Vimal was saying is that he did the same at the end of sMBR training and it did make a difference.
>> Dan
>>
>> On Wed, Mar 4, 2015 at 4:28 PM, Karel Veselý <ive...@fi...> wrote:
>>
>>> Wow, that sounds good, just to check that I understand: instead of taking relative frequencies from the pdf-alignment, you compute the priors as the average DNN output on a subset of data at the end of frame cross-entropy training. And then the priors are fixed during the sMBR training...
>>> Did I get it correctly?
>>> Thanks,
>>> Karel.
>>>
>>> Dne 4. 3. 2015 v 18:43 Daniel Povey napsal(a):
>>>
>>>> I am getting large improvements by adjusting priors on my Fisher setup and also on some of the Babel systems.
>>>
>>> Great news! So I guess it means you recompute the prior term based on the average posteriors on a subset of data, just like at the end of the cross-entropy training script. Cc'ing Karel for his info, as he might want to put this into his SMBR script.
>>>
>>>> On the baseline discriminative supervised system, Nnet2_SMBR, the improvement is 0.4% over not adjusting priors.
>>>
>>> Cool. Since this is a minor change we can check it into the SMBR training scripts quite soon.
>>>
>>>> On the SMBR multilingual recipe semisupervised system Multilang2_SMBR, the improvement is around 0.2%.
>>>> On the lattice entropy stuff Multinnet2_NCE+SMBR, the improvement is again 0.2%.
>>>> These are just the improvements considering only the respective previous best systems.
>>>> Some of the other systems that were performing worse before seem to have been worse only because of a mismatch of priors. Some of the lattice entropy systems got around a 1% improvement, bringing them closer to the best lattice entropy system. Also, we had an issue before of the unsupervised part of the neural net performing better than the supervised part. This is mostly mitigated by adjusting priors.
>>>> I am testing the prior adjustment on the Babel languages.
>>>> Also, I tried SMBR with one-silence-class in some Babel languages; it gives around a 1% improvement. It looks to be mostly due to a decrease in insertions and substitutions, but a slight increase in deletions. I am now trying to see its effect in the supervised part of the lattice entropy semisupervised recipe.
>>>
>>> Cool!
>>>
>>>> Is there a way to extend one-silence-class to MMI or lattice entropy? Can we merge arcs at a particular time that have silence pdfs and then pass the gradients to all the silence pdfs in the DNN output layer?
>>>
>>> The one-silence-class thing is specific to MPE and SMBR; it's not applicable to MMI or cross-entropy.
>>> Dan
>>>
>>>> Regards,
>>>>
>>>> --
>>>> Vimal Manohar
>>>> Doctoral Student
>>>> Electrical & Computer Engineering
>>>> Johns Hopkins University
>>>> Baltimore, MD
>>>
>>> --
>>> Karel Vesely, Brno University of Tec...@fi..., +420-54114-1300
>
> --
> Vimal Manohar
> Doctoral Student
> Electrical & Computer Engineering
> Johns Hopkins University
> Baltimore, MD

--
Karel Vesely, Brno University of Tec...@fi..., +420-54114-1300
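To make the --one-silence-class change discussed above concrete: in the sMBR/MPE accuracy function, all silence pdfs (silence, noise, etc.) are pooled into one class, so any silence hypothesized over reference silence counts as correct, while hypothesizing speech over reference silence (an insertion) now loses accuracy; under the old behaviour silence hypotheses earned no credit, so such insertions were effectively free. A rough per-frame sketch of the idea -- schematic only, not the actual Kaldi lattice code; hyp_pdf, ref_pdf, and silence_pdfs are placeholders:

    def frame_accuracy(hyp_pdf, ref_pdf, silence_pdfs, one_silence_class=True):
        # Schematic accuracy term that the sMBR objective sums over frames,
        # weighted by lattice arc posteriors.
        hyp_sil = hyp_pdf in silence_pdfs
        ref_sil = ref_pdf in silence_pdfs
        if one_silence_class:
            # Any silence pdf over reference silence is correct, so a speech
            # pdf over reference silence (an insertion) now loses a full
            # frame of accuracy relative to the correct path.
            return 1.0 if (hyp_pdf == ref_pdf or (hyp_sil and ref_sil)) else 0.0
        # Old behaviour: silence hypotheses never earn credit, so no path is
        # rewarded for getting silence regions right and insertions there are
        # not penalized relative to hypothesizing silence.
        return 1.0 if (hyp_pdf == ref_pdf and not hyp_sil) else 0.0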