
SAT: SphinxTrain/scripts_pl/60.sa_train

2009-08-04
2012-09-22
  • Tarun Pruthi

    Tarun Pruthi - 2009-08-04

    Hi:

    I have been using Sphinx to build a large-vocabulary recognizer for our custom application. It has been working pretty well, but we were seeing a lot of variation in accuracy across speakers, especially non-native speakers. So we decided to improve our models using Speaker Adaptive Training. Given that the revision of SphinxTrain I had included a set of scripts called 60.sa_train for speaker adaptive training, I assumed it wouldn't be too difficult. However, after I ran SAT using the 60.sa_train scripts, I actually observed a ~1.2% absolute reduction in accuracy on the test set (clean+noisy files from 35 speakers).

    After I ran an svn update, I noticed that the 60.sa_train directory was removed in r8525. Is there a good reason for that? Was there a problem with speaker adaptive training? I also noticed that a set of scripts for vtln_align was added in r8604. Does vtln_align work better than SAT?

    I would appreciate any help!
    Thanks
    Tarun Pruthi

     
    • Nickolay V. Shmyrev

      > I actually observed a ~1.2% absolute reduction in accuracy on the test set

      To debug this issue, it's worth finding out the improvement on the adaptation database itself. If accuracy increases there while it drops on the test set, then there is overtraining.
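
      For that comparison, you need the same error metric on both sets. A minimal word error rate helper (standard Levenshtein edit distance over words; the sample sentences below are invented, not from any real decode) could look like:

```python
# Minimal word error rate helper: Levenshtein (edit) distance over words,
# divided by the reference length. Comparing this number on the adaptation
# set vs. the held-out test set is one way to spot overtraining
# (adaptation error goes down while test error goes up).

def wer(ref, hyp):
    """Word error rate between a reference and a hypothesis string."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("go forward ten meters", "go forward ten meters"))  # 0.0
print(wer("go forward ten meters", "go four ten"))            # 0.5 (1 sub + 1 del)
```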

      > Is there a good reason for that?

      The way 60.sa_train adapted models was quite complicated and overdid the work needed for adaptation. The replacement is 80.mllr_adapt, which does almost the same thing in a less complicated way.
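
      To illustrate what such an MLLR transform does to a model: each Gaussian mean is passed through an affine map mu' = A·mu + b estimated from the speaker's data. A toy sketch (invented 2-D numbers, not SphinxTrain's actual .mllr file format or estimation code):

```python
# Toy illustration of applying an MLLR mean transform: mu' = A*mu + b.
# A and b here are invented numbers for illustration only; in SphinxTrain,
# the real transform is the .mllr file that 80.mllr_adapt produces.

def apply_mllr(mean, A, b):
    """Apply the affine transform mu' = A.mu + b to one Gaussian mean."""
    n = len(mean)
    return [sum(A[i][j] * mean[j] for j in range(n)) + b[i] for i in range(n)]

A = [[1.1, 0.0],
     [0.0, 0.9]]                     # scaling part (toy, 2-dimensional)
b = [0.5, -0.2]                      # bias part (toy)

means = [[1.0, 2.0], [3.0, 4.0]]     # one list per Gaussian mean (toy model)
adapted = [apply_mllr(m, A, b) for m in means]
print(adapted)
```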

      > I also observed that a set of scripts for vtln_align were added in r8604. Does vtln_align work better than SAT?

      Vtln_align is used to estimate the frequency warp factor for further training of the VTLN model. It's used if you enable VTLN, which is a completely different type of adaptation: online, whereas MLLR in its present form is offline.
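
      A common form of VTLN warps the frequency axis per speaker with a piecewise-linear function, and the warp factor is chosen by trying a small grid of candidates and keeping the best-scoring one. A toy sketch of such a warp function (the alpha grid, break point, and sample rate are illustrative, not SphinxTrain's exact configuration):

```python
# Sketch of a piecewise-linear VTLN frequency warp. The warp factor alpha
# compresses/stretches frequencies below a break point f0, with a linear
# segment above it that keeps the Nyquist frequency fixed. The alpha grid
# and f0 fraction below are illustrative values, not SphinxTrain's.

def warp(freq, alpha, f_nyquist=8000.0, f0_frac=0.85):
    """Map freq -> warped freq for warp factor alpha (alpha=1.0 is identity)."""
    f0 = f0_frac * f_nyquist
    if freq <= f0:
        return alpha * freq
    # linear segment from (f0, alpha*f0) to (f_nyquist, f_nyquist)
    slope = (f_nyquist - alpha * f0) / (f_nyquist - f0)
    return alpha * f0 + slope * (freq - f0)

# In vtln_align-style estimation, each speaker's data would be aligned once
# per candidate alpha, keeping the alpha with the best likelihood:
alphas = [0.88 + 0.04 * i for i in range(7)]   # 0.88 .. 1.12 (toy grid)
print([round(warp(1000.0, a), 1) for a in alphas])
print(round(warp(8000.0, 0.9), 6))   # the Nyquist end point stays fixed
```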

       
    • Tarun Pruthi

      Tarun Pruthi - 2009-08-05

      > The way to adapt models in 60.sa_train was quite complicated and was overdoing the work for adaptation.
      By "overdoing the work for adaptation" do you mean that you have also seen the test results go down with 60.sa_train? Further, do you know if the procedure for SAT, as coded in 60.sa_train, was accurate? On comparing the scripts in 60.sa_train with the inverse transform procedure described in "Practical implementations of speaker-adaptive training" by Spyros Matsoukas, Rich Schwartz, Hubert Jin and Long Nguyen, I feel there are some errors in the scripts and in the inverse transform code. Do you know if the procedure described in this paper is the one that was followed in the scripts?

      > The replacement is 80.mllr_adapt that does almost the same in a less complicated way.
      As far as I can understand, 80.mllr_adapt does not create new models. It simply creates MLLR transforms for every speaker in the adaptation database. So I don't see how it achieves the same thing as SAT.

       
      • Nickolay V. Shmyrev

        > By "overdoing the work for adaptation" do you mean that you have also seen the test results go down with 60.sa_train?

        Well, everything is possible; there could be bugs. But this adaptation does work. For example, you can take the an4 database and create a speaker list containing the single speaker an4_test. Create an4_test.ctl and an4_test.lsn just by copying an4_test.fileids and an4_test.transcription. Then define
        $CFG_SPEAKERLIST = "$CFG_BASE_DIR/etc/speakers";
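
        The file setup above can be sketched as follows (a scratch directory with invented stand-in contents, not the real an4 data; for a real run, base_dir would be your actual $CFG_BASE_DIR and the an4_test files the real ones):

```python
# Sketch of the per-speaker file setup described above, in a scratch
# directory with invented stand-in contents (not the real an4 data).
import os
import shutil
import tempfile

base_dir = tempfile.mkdtemp()               # stand-in for $CFG_BASE_DIR
etc = os.path.join(base_dir, "etc")
os.makedirs(etc)

# Stand-ins for the an4_test control and transcription files:
with open(os.path.join(etc, "an4_test.fileids"), "w") as f:
    f.write("utt1\nutt2\n")
with open(os.path.join(etc, "an4_test.transcription"), "w") as f:
    f.write("<s> HELLO WORLD </s> (utt1)\n<s> GOOD BYE </s> (utt2)\n")

# The speaker list contains the single speaker an4_test:
with open(os.path.join(etc, "speakers"), "w") as f:
    f.write("an4_test\n")

# The per-speaker .ctl/.lsn files are plain copies of fileids/transcription:
shutil.copy(os.path.join(etc, "an4_test.fileids"),
            os.path.join(etc, "an4_test.ctl"))
shutil.copy(os.path.join(etc, "an4_test.transcription"),
            os.path.join(etc, "an4_test.lsn"))

print(sorted(os.listdir(etc)))
```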
        Then, if you are using the latest SphinxTrain, you need to modify the scripts a bit:

        diff -upr 60.sa_train/baum_welch.pl 60.sa_train.new/baum_welch.pl
        --- 60.sa_train/baum_welch.pl 2006-12-30 01:27:10.000000000 +0300
        +++ 60.sa_train.new/baum_welch.pl 2009-08-05 20:07:34.000000000 +0400
        @@ -52,6 +52,9 @@ $| = 1; # Turn on autoflushing
        die "USAGE: $0 <iter> <speaker>" if @ARGV != 2;
        my ($iter, $speaker) = @ARGV;

        +use vars qw($MLLT_FILE $MODEL_TYPE);
        +$MLLT_FILE = catfile($ST::CFG_MODEL_DIR, "${ST::CFG_EXPTNAME}.mllt");
        +
        my $modelinitialname="${ST::CFG_EXPTNAME}.cd_${ST::CFG_DIRLABEL}${ST::CFG_N_TIED_STATES}";
        my $modelname="${ST::CFG_EXPTNAME}.sat${ST::CFG_DIRLABEL}";
        my $mdefname="${ST::CFG_EXPTNAME}.$ST::CFG_N_TIED_STATES.mdef";
        @@ -80,17 +83,39 @@ my $topn = 4;
        my $logdir = "$ST::CFG_LOG_DIR/$processname";
        mkdir ($logdir,0777);

        -# If there is an LDA transformation, use it
        -my @lda_args;
        -if (defined($ST::CFG_LDA_TRANSFORM) and -r $ST::CFG_LDA_TRANSFORM) {
        - push(@lda_args,
        - -ldafn => $ST::CFG_LDA_TRANSFORM,
        - -ldadim => $ST::CFG_LDA_DIMENSION);
        -}

        my ($listoffiles, $transcriptfile, $logfile);
        $listoffiles = catfile($ST::CFG_LIST_DIR, "$speaker.ctl");
        $transcriptfile = catfile($ST::CFG_LIST_DIR, "$speaker.lsn");
        +$logfile = "$logdir/${ST::CFG_EXPTNAME}.$iter-$speaker.bw.log";
        +
        +# Add the MLLT transform if it exists
        +my @extra_args;
        +if (defined($ST::CFG_SVSPEC)){
        + push(@extra_args, -svspec =>$ST::CFG_SVSPEC);
        +}
        +if (-r $MLLT_FILE) {
        + push(@extra_args,
        + -ldafn => $MLLT_FILE,
        + -ldadim => $ST::CFG_LDA_DIMENSION);
        +}
        +
        +if ($ST::CFG_CD_VITERBI eq 'yes') {
        + push(@extra_args, -viterbi => 'yes');
        +}
        +
        +if (defined($ST::CFG_PHSEG_DIR)) {
        + open INPUT,"${ST::CFG_LISTOFFILES}" or die "Failed to open $ST::CFG_LISTOFFILES: $!";
        + # Check control file format (determines if we need -outputfullpath)
        + my $line = <INPUT>;
        + if (split(" ", $line) ==1) {
        + # Use full file path
        + push(@extra_args, -outputfullpath => 'yes');
        + }
        + close INPUT;
        + push(@extra_args,
        + -phsegdir => $ST::CFG_PHSEG_DIR);
        +}
        my @mllr_args;
        # If we have an MLLR transform, apply it in Baum-Welch
        if ($iter > 1) {
        @@ -98,7 +123,6 @@ if ($iter > 1) {
        @mllr_args = (-mllrmat => catfile($ST::CFG_MODEL_DIR, $modelname,
        "$ST::CFG_EXPTNAME.$speaker.mllr"));
        }
        -$logfile = "$logdir/${ST::CFG_EXPTNAME}.$iter-$speaker.bw.log";

        my $ctl_counter = 0;
        open INPUT,"<$listoffiles" or die "Failed to open $listoffiles: $!";
        @@ -145,7 +169,7 @@ my $return_value = RunTool
        -diagfull => $ST::CFG_DIAGFULL,
        -feat => $ST::CFG_FEATURE,
        -ceplen => $ST::CFG_VECTOR_LENGTH,
        - @lda_args,
        + @extra_args,
        @mllr_args,
        -timing => "no");

        diff -upr 60.sa_train/norm.pl 60.sa_train.new/norm.pl
        --- 60.sa_train/norm.pl 2006-12-30 01:27:10.000000000 +0300
        +++ 60.sa_train.new/norm.pl 2009-08-05 20:01:41.000000000 +0400
        @@ -83,14 +83,6 @@ HTML_Print ("\t" . ImgSrc("$ST::CFG_BASE
        Log (" Normalization for iteration: $iter ");
        HTML_Print (FormatURL("$logfile", "Log File") . " ");

        -# if there is an LDA transformation, use it
        -my @feat;
        -if (defined($ST::CFG_LDA_TRANSFORM) and -r $ST::CFG_LDA_TRANSFORM) {
        - @feat = (-feat => '1s_c', -ceplen => $ST::CFG_LDA_DIMENSION);
        -}
        -else {
        - @feat = (-feat => $ST::CFG_FEATURE, -ceplen => $ST::CFG_VECTOR_LENGTH);
        -}
        my $return_value = RunTool
        ('norm', $logfile, 0,
        -accumdir => @bwaccumdirs,
        @@ -98,8 +90,7 @@ my $return_value = RunTool
        -tmatfn => $transition_matrices,
        -meanfn => $means,
        -varfn => $variances,
        - -fullvar => $ST::CFG_FULLVAR,
        - @feat
        + -fullvar => $ST::CFG_FULLVAR
        );

        if ($return_value) {

        Then retrain. In my experiment (exact numbers depend on dither), the results are the following:

        Before:

        SENTENCE ERROR: 60.0% (78/130) WORD ERROR RATE: 19.7% (151/773)

        After:

        SENTENCE ERROR: 33.1% (43/130) WORD ERROR RATE: 14.1% (108/773)

        > Further, do you know if the procedure for SAT, as coded in 60.sa_train, was accurate? On comparing the scripts in 60.sa_train with the inverse transform procedure described in "Practical implementations of speaker-adaptive training" by Spyros Matsoukas, Rich Schwartz, Hubert Jin and Long Nguyen, I feel there are some errors in the scripts and in the inverse transform code.

        There could be bugs, of course. If you can show them, that would be much appreciated.

        > As far as I can understand, 80.mllr_adapt does not create new models. It simply creates MLLR transforms for every speaker in the adaptation database. So I don't see how it achieves the same thing as SAT.

        Well, it's also a kind of adaptation. You could also adapt the model itself with map_adapt. See

        http://www.speech.cs.cmu.edu/cmusphinx/moinmoin/AcousticModelAdaptation

         
    • Tarun Pruthi

      Tarun Pruthi - 2009-08-05

      Alright, so I think the central problem here is a difference between our understanding of Speaker Adaptive Training itself.

      The way you have described it is: adapt to a particular speaker's data to create a speaker-dependent MLLR transform file, which can then be used to build speaker-dependent models. That is what you would do in, say, a speaker-dependent recognizer like Nuance, where you would first ask a speaker to "enroll" by reading a set of sentences, the output of the enrollment phase being a transform.mllr file for the speaker, which is then used in future live decoding sessions for this speaker to improve recognition accuracy. And this is exactly what 80.mllr_adapt accomplishes. We did write a script like that for MLLR adaptation, and it works very well. No questions about that.

      However, the way I have understood speaker adaptive training is that it is essentially a method to remove inter-speaker variability from the "training data" (not the adaptation data) so that the trained models are better tuned to the task at hand (i.e. distinguishing phonemes). To quote from a paper on SAT ("Speaker adaptive training: a maximum likelihood approach to speaker normalization" by Tasos Anastasakos, John McDonough and John Makhoul): "By accounting explicitly for the extraneous speaker-induced variation and reducing its effect in the training data, the resulting acoustic models are truly speaker independent with reduced cross-unit overlap." So the output of this procedure should simply be the newly trained models, without any concern for the MLLR files generated per speaker (since those MLLR files are just data for an intermediate step). I don't think this interpretation is the same as yours.

      Please correct me if I am wrong.
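
      The alternating estimation behind this reading of SAT — estimate a per-speaker transform, then re-estimate the canonical models with that transform factored out, and repeat — can be sketched in a toy form. This is 1-D with invented data and a simple scalar bias in place of full MLLR regression-class transforms, purely to show the alternating structure:

```python
# Toy sketch of the SAT alternation: (1) given the canonical means, estimate
# each speaker's transform (here just a scalar bias); (2) given the speaker
# biases, re-estimate the canonical means with the bias factored out.
# Real SAT uses full MLLR transforms; data and dimensions here are invented.

# data[speaker] = list of (phone_index, observation); each speaker shifts
# the two "phones" (true means 0.0 and 3.0) by a speaker-specific bias.
data = {
    "spk_a": [(0, 0.9), (1, 3.9), (0, 1.1)],   # bias around +1.0
    "spk_b": [(0, -2.1), (1, 1.0), (1, 0.9)],  # bias around -2.0
}

means = [0.0, 0.0]                   # canonical (speaker-independent) means
bias = {s: 0.0 for s in data}        # per-speaker transforms

for _ in range(10):                  # a few EM-style alternations
    # (1) given canonical means, estimate each speaker's bias
    for s, obs in data.items():
        bias[s] = sum(x - means[p] for p, x in obs) / len(obs)
    # (2) given speaker biases, re-estimate canonical means
    for p in range(len(means)):
        xs = [x - bias[s] for s, obs in data.items() for q, x in obs if q == p]
        means[p] = sum(xs) / len(xs)

# Note: the decomposition is only unique up to a constant shift between
# the canonical means and the biases; the differences are what converge.
print([round(m, 2) for m in means], {s: round(b, 2) for s, b in bias.items()})
```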

       
    • Nickolay V. Shmyrev

      > it is essentially a method to remove inter-speaker variability from the "training data" (not adaptation data) so that the trained models are better tuned to the task at hand

      That's what I call "speaker normalization"; it's also in the title of the article you quoted. VTLN is indeed a normalization followed by later online or offline adaptation. So you could probably enjoy VTLN; the only issue is to implement warp factor estimation in the decoder.

      About sa_train, actually I'd better wait for David's comment. I think it could be repaired if needed.

       
    • Tarun Pruthi

      Tarun Pruthi - 2009-08-05

      Yes, I have also been looking at VTLN as another speaker normalization technique. I will probably try that next.

      I will wait for David's comment on sa_train.

      Thanks
      Tarun

       
