
SAT: SphinxTrain/scripts_pl/60.sa_train

2009-08-04
2012-09-22
  • Tarun Pruthi

    Tarun Pruthi - 2009-08-04

    Hi:

    I have been using Sphinx to build a large-vocabulary recognizer for our custom application. It has been working pretty well, but we were seeing a lot of variation in accuracy across speakers, especially non-native speakers. So we decided to improve our models using Speaker Adaptive Training. Given that the revision of SphinxTrain I had included a set of scripts called 60.sa_train for speaker adaptive training, I assumed it wouldn't be too difficult. However, after I ran SAT using the 60.sa_train scripts, I actually observed a ~1.2% absolute reduction in accuracy on the test set (clean+noisy files from 35 speakers).

    After I ran an svn update, I noticed that the 60.sa_train directory was removed in r8525. Is there a good reason for that? Was there a problem with speaker adaptive training? I also noticed that a set of scripts for vtln_align was added in r8604. Does vtln_align work better than SAT?

    I would appreciate any help!
    Thanks
    Tarun Pruthi

     
    • Nickolay V. Shmyrev

      > I actually observed a ~1.2% absolute reduction in accuracy on the test set

      To debug this issue, it's worth finding out the improvement on the adaptation database itself. If accuracy increases there while it drops on the test set, then there is overtraining.
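
      For that comparison, you need the same error metric on both sets. A minimal word error rate helper (standard Levenshtein edit distance over words; the sample sentences below are invented, not from any real decode) could look like:

```python
# Minimal word error rate helper: Levenshtein (edit) distance over words,
# divided by the reference length. Comparing this number on the adaptation
# set vs. the held-out test set is one way to spot overtraining
# (adaptation error goes down while test error goes up).

def wer(ref, hyp):
    """Word error rate between a reference and a hypothesis string."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("go forward ten meters", "go forward ten meters"))  # 0.0
print(wer("go forward ten meters", "go four ten"))            # 0.5 (1 sub + 1 del)
```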

      > Is there a good reason for that?

      The way 60.sa_train adapted models was quite complicated and overdid the work needed for adaptation. The replacement is 80.mllr_adapt, which does almost the same thing in a less complicated way.
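
      To illustrate what such an MLLR transform does to a model: each Gaussian mean is passed through an affine map mu' = A·mu + b estimated from the speaker's data. A toy sketch (invented 2-D numbers, not SphinxTrain's actual .mllr file format or estimation code):

```python
# Toy illustration of applying an MLLR mean transform: mu' = A*mu + b.
# A and b here are invented numbers for illustration only; in SphinxTrain,
# the real transform is the .mllr file that 80.mllr_adapt produces.

def apply_mllr(mean, A, b):
    """Apply the affine transform mu' = A.mu + b to one Gaussian mean."""
    n = len(mean)
    return [sum(A[i][j] * mean[j] for j in range(n)) + b[i] for i in range(n)]

A = [[1.1, 0.0],
     [0.0, 0.9]]                     # scaling part (toy, 2-dimensional)
b = [0.5, -0.2]                      # bias part (toy)

means = [[1.0, 2.0], [3.0, 4.0]]     # one list per Gaussian mean (toy model)
adapted = [apply_mllr(m, A, b) for m in means]
print(adapted)
```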

      > I also observed that a set of scripts for vtln_align were added in r8604. Does vtln_align work better than SAT?

      Vtln_align is used to estimate the frequency warp factor for further training of the VTLN model. It's used if you enable VTLN, which is a completely different type of adaptation: online, whereas MLLR in its present form is offline.
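
      A common form of VTLN warps the frequency axis per speaker with a piecewise-linear function, and the warp factor is chosen by trying a small grid of candidates and keeping the best-scoring one. A toy sketch of such a warp function (the alpha grid, break point, and sample rate are illustrative, not SphinxTrain's exact configuration):

```python
# Sketch of a piecewise-linear VTLN frequency warp. The warp factor alpha
# compresses/stretches frequencies below a break point f0, with a linear
# segment above it that keeps the Nyquist frequency fixed. The alpha grid
# and f0 fraction below are illustrative values, not SphinxTrain's.

def warp(freq, alpha, f_nyquist=8000.0, f0_frac=0.85):
    """Map freq -> warped freq for warp factor alpha (alpha=1.0 is identity)."""
    f0 = f0_frac * f_nyquist
    if freq <= f0:
        return alpha * freq
    # linear segment from (f0, alpha*f0) to (f_nyquist, f_nyquist)
    slope = (f_nyquist - alpha * f0) / (f_nyquist - f0)
    return alpha * f0 + slope * (freq - f0)

# In vtln_align-style estimation, each speaker's data would be aligned once
# per candidate alpha, keeping the alpha with the best likelihood:
alphas = [0.88 + 0.04 * i for i in range(7)]   # 0.88 .. 1.12 (toy grid)
print([round(warp(1000.0, a), 1) for a in alphas])
print(round(warp(8000.0, 0.9), 6))   # the Nyquist end point stays fixed
```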

       
    • Tarun Pruthi

      Tarun Pruthi - 2009-08-05

      > The way to adapt models in 60.sa_train was quite complicated and was overdoing the work for adaptation.
      By "overdoing the work for adaptation" do you mean that you have also seen the test results go down with 60.sa_train? Further, do you know if the procedure for SAT, as coded in 60.sa_train, was accurate? On comparing the scripts in 60.sa_train with the inverse transform procedure described in "Practical implementations of speaker-adaptive training" by Spyros Matsoukas, Rich Schwartz, Hubert Jin and Long Nguyen, I feel there are some errors in the scripts and in the inverse transform code. Do you know if the procedure described in this paper is the one that was followed in the scripts?

      > The replacement is 80.mllr_adapt that does almost the same in a less complicated way.
      As far as I can understand, 80.mllr_adapt does not create new models. It simply creates MLLR transforms for every speaker in the adaptation database. So I don't see how it achieves the same thing as SAT.

       
      • Nickolay V. Shmyrev

        > By "overdoing the work for adaptation" do you mean that you have also seen the test results go down with 60.sa_train?

        Well, everything is possible; there could be bugs. But this adaptation does work. For example, you can take the an4 database and create a speaker list containing the single speaker an4_test. Create an4_test.ctl and an4_test.lsn just by copying an4_test.fileids and an4_test.transcription. Then define
        $CFG_SPEAKERLIST = "$CFG_BASE_DIR/etc/speakers";
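
        The file setup above can be sketched as follows (a scratch directory with invented stand-in contents, not the real an4 data; for a real run, base_dir would be your actual $CFG_BASE_DIR and the an4_test files the real ones):

```python
# Sketch of the per-speaker file setup described above, in a scratch
# directory with invented stand-in contents (not the real an4 data).
import os
import shutil
import tempfile

base_dir = tempfile.mkdtemp()               # stand-in for $CFG_BASE_DIR
etc = os.path.join(base_dir, "etc")
os.makedirs(etc)

# Stand-ins for the an4_test control and transcription files:
with open(os.path.join(etc, "an4_test.fileids"), "w") as f:
    f.write("utt1\nutt2\n")
with open(os.path.join(etc, "an4_test.transcription"), "w") as f:
    f.write("<s> HELLO WORLD </s> (utt1)\n<s> GOOD BYE </s> (utt2)\n")

# The speaker list contains the single speaker an4_test:
with open(os.path.join(etc, "speakers"), "w") as f:
    f.write("an4_test\n")

# The per-speaker .ctl/.lsn files are plain copies of fileids/transcription:
shutil.copy(os.path.join(etc, "an4_test.fileids"),
            os.path.join(etc, "an4_test.ctl"))
shutil.copy(os.path.join(etc, "an4_test.transcription"),
            os.path.join(etc, "an4_test.lsn"))

print(sorted(os.listdir(etc)))
```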
        Then, if you are using the latest SphinxTrain, you need to modify the scripts a bit:

        diff -upr 60.sa_train/baum_welch.pl 60.sa_train.new/baum_welch.pl
        --- 60.sa_train/baum_welch.pl 2006-12-30 01:27:10.000000000 +0300
        +++ 60.sa_train.new/baum_welch.pl 2009-08-05 20:07:34.000000000 +0400
        @@ -52,6 +52,9 @@ $| = 1; # Turn on autoflushing
        die "USAGE: $0 <iter> <speaker>" if @ARGV != 2;
        my ($iter, $speaker) = @ARGV;

        +use vars qw($MLLT_FILE $MODEL_TYPE);
        +$MLLT_FILE = catfile($ST::CFG_MODEL_DIR, "${ST::CFG_EXPTNAME}.mllt");
        +
        my $modelinitialname="${ST::CFG_EXPTNAME}.cd_${ST::CFG_DIRLABEL}${ST::CFG_N_TIED_STATES}";
        my $modelname="${ST::CFG_EXPTNAME}.sat${ST::CFG_DIRLABEL}";
        my $mdefname="${ST::CFG_EXPTNAME}.$ST::CFG_N_TIED_STATES.mdef";
        @@ -80,17 +83,39 @@ my $topn = 4;
        my $logdir = "$ST::CFG_LOG_DIR/$processname";
        mkdir ($logdir,0777);

        -# If there is an LDA transformation, use it
        -my @lda_args;
        -if (defined($ST::CFG_LDA_TRANSFORM) and -r $ST::CFG_LDA_TRANSFORM) {
        - push(@lda_args,
        - -ldafn => $ST::CFG_LDA_TRANSFORM,
        - -ldadim => $ST::CFG_LDA_DIMENSION);
        -}

        my ($listoffiles, $transcriptfile, $logfile);
        $listoffiles = catfile($ST::CFG_LIST_DIR, "$speaker.ctl");
        $transcriptfile = catfile($ST::CFG_LIST_DIR, "$speaker.lsn");
        +$logfile = "$logdir/${ST::CFG_EXPTNAME}.$iter-$speaker.bw.log";
        +
        +# Add the MLLT transform if it exists
        +my @extra_args;
        +if (defined($ST::CFG_SVSPEC)){
        + push(@extra_args, -svspec =>$ST::CFG_SVSPEC);
        +}
        +if (-r $MLLT_FILE) {
        + push(@extra_args,
        + -ldafn => $MLLT_FILE,
        + -ldadim => $ST::CFG_LDA_DIMENSION);
        +}
        +
        +if ($ST::CFG_CD_VITERBI eq 'yes') {
        + push(@extra_args, -viterbi => 'yes');
        +}
        +
        +if (defined($ST::CFG_PHSEG_DIR)) {
        + open INPUT,"${ST::CFG_LISTOFFILES}" or die "Failed to open $ST::CFG_LISTOFFILES: $!";
        + # Check control file format (determines if we need -outputfullpath)
        + my $line = <INPUT>;
        + if (split(" ", $line) ==1) {
        + # Use full file path
        + push(@extra_args, -outputfullpath => 'yes');
        + }
        + close INPUT;
        + push(@extra_args,
        + -phsegdir => $ST::CFG_PHSEG_DIR);
        +}
        my @mllr_args;
        # If we have an MLLR transform, apply it in Baum-Welch
        if ($iter > 1) {
        @@ -98,7 +123,6 @@ if ($iter > 1) {
        @mllr_args = (-mllrmat => catfile($ST::CFG_MODEL_DIR, $modelname,
        "$ST::CFG_EXPTNAME.$speaker.mllr"));
        }
        -$logfile = "$logdir/${ST::CFG_EXPTNAME}.$iter-$speaker.bw.log";

        my $ctl_counter = 0;
        open INPUT,"<$listoffiles" or die "Failed to open $listoffiles: $!";
        @@ -145,7 +169,7 @@ my $return_value = RunTool
        -diagfull => $ST::CFG_DIAGFULL,
        -feat => $ST::CFG_FEATURE,
        -ceplen => $ST::CFG_VECTOR_LENGTH,
        - @lda_args,
        + @extra_args,
        @mllr_args,
        -timing => "no");

        diff -upr 60.sa_train/norm.pl 60.sa_train.new/norm.pl
        --- 60.sa_train/norm.pl 2006-12-30 01:27:10.000000000 +0300
        +++ 60.sa_train.new/norm.pl 2009-08-05 20:01:41.000000000 +0400
        @@ -83,14 +83,6 @@ HTML_Print ("\t" . ImgSrc("$ST::CFG_BASE
        Log (" Normalization for iteration: $iter ");
        HTML_Print (FormatURL("$logfile", "Log File") . " ");

        -# if there is an LDA transformation, use it
        -my @feat;
        -if (defined($ST::CFG_LDA_TRANSFORM) and -r $ST::CFG_LDA_TRANSFORM) {
        - @feat = (-feat => '1s_c', -ceplen => $ST::CFG_LDA_DIMENSION);
        -}
        -else {
        - @feat = (-feat => $ST::CFG_FEATURE, -ceplen => $ST::CFG_VECTOR_LENGTH);
        -}
        my $return_value = RunTool
        ('norm', $logfile, 0,
        -accumdir => @bwaccumdirs,
        @@ -98,8 +90,7 @@ my $return_value = RunTool
        -tmatfn => $transition_matrices,
        -meanfn => $means,
        -varfn => $variances,
        - -fullvar => $ST::CFG_FULLVAR,
        - @feat
        + -fullvar => $ST::CFG_FULLVAR
        );

        if ($return_value) {

        Then retrain. In my experiment (exact numbers depend on dither), the results are the following:

        Before:

        SENTENCE ERROR: 60.0% (78/130) WORD ERROR RATE: 19.7% (151/773)

        After:

        SENTENCE ERROR: 33.1% (43/130) WORD ERROR RATE: 14.1% (108/773)

        > Further, do you know if the procedure for SAT, as coded in 60.sa_train, was accurate? On comparing the scripts in 60.sa_train with the inverse transform procedure described in "Practical implementations of speaker-adaptive training" by Spyros Matsoukas, Rich Schwartz, Hubert Jin and Long Nguyen, I feel there are some errors in the scripts and in the inverse transform code.

        There could be bugs, of course. If you can show them, that would be much appreciated.

        > As far as I can understand, 80.mllr_adapt does not create new models. It simply creates MLLR transforms for every speaker in the adaptation database. So I don't see how it achieves the same thing as SAT.

        Well, it's also a kind of adaptation. You could also adapt the model itself with map_adapt. See

        http://www.speech.cs.cmu.edu/cmusphinx/moinmoin/AcousticModelAdaptation

         
    • Tarun Pruthi

      Tarun Pruthi - 2009-08-05

      Alright, so I think the central problem here is a difference between our understanding of Speaker Adaptive Training itself.

      The way you have described it is: adapt to a particular speaker's data to create a speaker-dependent MLLR transform file, which can then be used to build speaker-dependent models. That is what you would do in, say, a speaker-dependent recognizer like Nuance, where you would first ask a speaker to "enroll" by reading a set of sentences, the output of the enrollment phase being a transform.mllr file for the speaker, which is then used in future live decoding sessions for this speaker to improve recognition accuracy. And this is exactly what 80.mllr_adapt accomplishes. We did write a script like that for MLLR adaptation, and it works very well. No questions about that.

      However, the way I have understood speaker adaptive training is that it is essentially a method to remove inter-speaker variability from the "training data" (not the adaptation data) so that the trained models are better tuned to the task at hand (i.e. distinguishing phonemes). To quote from a paper on SAT ("Speaker adaptive training: a maximum likelihood approach to speaker normalization" by Tasos Anastasakos, John McDonough and John Makhoul): "By accounting explicitly for the extraneous speaker-induced variation and reducing its effect in the training data, the resulting acoustic models are truly speaker independent with reduced cross-unit overlap." So the output of this procedure should simply be the newly trained models, without any concern for the MLLR files generated per speaker (since those MLLR files are just data for an intermediate step). I don't think this interpretation is the same as yours.

      Please correct me if I am wrong.
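
      The alternating estimation behind this reading of SAT — estimate a per-speaker transform, then re-estimate the canonical models with that transform factored out, and repeat — can be sketched in a toy form. This is 1-D with invented data and a simple scalar bias in place of full MLLR regression-class transforms, purely to show the alternating structure:

```python
# Toy sketch of the SAT alternation: (1) given the canonical means, estimate
# each speaker's transform (here just a scalar bias); (2) given the speaker
# biases, re-estimate the canonical means with the bias factored out.
# Real SAT uses full MLLR transforms; data and dimensions here are invented.

# data[speaker] = list of (phone_index, observation); each speaker shifts
# the two "phones" (true means 0.0 and 3.0) by a speaker-specific bias.
data = {
    "spk_a": [(0, 0.9), (1, 3.9), (0, 1.1)],   # bias around +1.0
    "spk_b": [(0, -2.1), (1, 1.0), (1, 0.9)],  # bias around -2.0
}

means = [0.0, 0.0]                   # canonical (speaker-independent) means
bias = {s: 0.0 for s in data}        # per-speaker transforms

for _ in range(10):                  # a few EM-style alternations
    # (1) given canonical means, estimate each speaker's bias
    for s, obs in data.items():
        bias[s] = sum(x - means[p] for p, x in obs) / len(obs)
    # (2) given speaker biases, re-estimate canonical means
    for p in range(len(means)):
        xs = [x - bias[s] for s, obs in data.items() for q, x in obs if q == p]
        means[p] = sum(xs) / len(xs)

# Note: the decomposition is only unique up to a constant shift between
# the canonical means and the biases; the differences are what converge.
print([round(m, 2) for m in means], {s: round(b, 2) for s, b in bias.items()})
```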

       
    • Nickolay V. Shmyrev

      > it is essentially a method to remove inter-speaker variability from the "training data" (not adaptation data) so that the trained models are better tuned to the task at hand

      That's what I call "speaker normalization"; it's also in the title of the article you quoted. VTLN is indeed a normalization followed by later online or offline adaptation. So you could probably enjoy VTLN; the only issue is to implement warp factor estimation in the decoder.

      About sa_train, actually I'd better wait for David's comment. I think it could be repaired if needed.

       
    • Tarun Pruthi

      Tarun Pruthi - 2009-08-05

      Yes, I have also been looking at VTLN as another speaker normalization technique. I will probably try that next.

      I will wait for David's comment on sa_train.

      Thanks
      Tarun

       
