gauden_scale_densities_bwd: Assertion `finite(den)' failed

Jbob
2010-08-24
2012-09-22
  • Jbob

    Jbob - 2010-08-24

    Moved here from Speech Recognition as requested. Sorry for the repost.

    Hi,

    I am trying to use SphinxTrain to train up some English acoustic models for a
    project. I have trained models with SphinxTrain on numerous prior occasions,
    but this time I have encountered a new error that I'm unfamiliar with.

    I have about 200 hours of speech data with transcripts, an appropriate
    lexicon, etc. I initially went through a training run with a small fraction
    of this data (~2%) just to confirm that things were more or less set up
    correctly. This completed successfully, so I started a run with the full
    corpus.

    The full corpus setup made it through the first several steps,


    ("$ST::CFG_SCRIPT_DIR/00.verify/verify_all.pl",
    "$ST::CFG_SCRIPT_DIR/02.falign_ci_hmm/slave_convg.pl",
    "$ST::CFG_SCRIPT_DIR/03.force_align/slave_align.pl",
    "$ST::CFG_SCRIPT_DIR/20.ci_hmm/slave_convg.pl",
    "$ST::CFG_SCRIPT_DIR/30.cd_hmm_untied/slave_convg.pl",
    "$ST::CFG_SCRIPT_DIR/40.buildtrees/slave.treebuilder.pl",
    "$ST::CFG_SCRIPT_DIR/45.prunetree/slave.state-tying.pl",


    but eventually bombed out on

    "$ST::CFG_SCRIPT_DIR/50.cd_hmm_tied/slave_convg.pl",

    on 4 gaussians, giving the following error,


    ....

    INFO: corpus.c(1346): Will process 17768 utts starting at 53304
    INFO: main.c(622): Reestimation: Baum-Welch
    bw: gauden.c:1377: gauden_scale_densities_bwd: Assertion `finite(den)' failed.
    ....


    I tracked this down in the source code to,

    gauden.c:

    gauden_scale_densities_bwd(...) {
    ....
    /* BHIKSHA converged g->n_density to g->n_top; possible bugfix, APR 6 98 */
    for (k = 0; k < g->n_top; k++) {
    /* BHIKSHA converged g->n_density to g->n_top; possible bugfix, END */
        den = EXPF(den - scl);
        assert(finite(den));
    ...
    }

    The above assertion seems to be what is causing the failure; however, I'm
    uncertain what this implies about my training data or about the training
    process itself.
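
    To make the failure mode concrete, here is a minimal standalone sketch (not
    SphinxTrain code, and the numbers are made up) of the two ways den can stop
    being finite at this point: a NaN propagated from earlier arithmetic, or a
    float overflow once the exponent gets large enough.

    #include <assert.h>
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        /* A NaN anywhere in the preceding sums makes den NaN, because any
           arithmetic involving NaN yields NaN. */
        float den = expf(NAN);
        printf("expf(NAN)    = %f, isfinite = %d\n", den, isfinite(den));

        /* A large enough argument overflows float range (~3.4e38) to +inf. */
        den = expf(200.0f);
        printf("expf(200.0f) = %f, isfinite = %d\n", den, isfinite(den));

        /* A sane scaled log-density stays finite and passes the assert. */
        den = expf(-8.0f);
        printf("expf(-8.0f)  = %f, isfinite = %d\n", den, isfinite(den));
        assert(isfinite(den));

        return 0;
    }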

    The only thing I'm doing differently from previous training exercises or
    projects is that I'm using Queue::POSIX with npart > 1; however, it seems
    unlikely that this is the cause of the problem.

    Any pointers as to what this error implies would be greatly appreciated!

    Since the initial post, I have also taken a closer look at the mixture_weights
    files and noted that several mixtures are populated with 'nan' values. After
    commenting out the assertion and re-running the training, I was actually able
    to decode, but this is obviously a pretty bad idea...

     
  • Nickolay V. Shmyrev

    NaN values come from your feature files. It looks like you extracted the
    features incorrectly, probably without using dither on zero-energy data.

    It also looks like you are using an outdated SphinxTrain.

     
  • Jbob

    Jbob - 2010-08-24

    Thanks for the reply.

    "NaN values come from your feature files. It looks like you extracted the
    features incorrectly, probably without using dither on zero-energy data."

    I thought about this and checked all the feature files for occurrences of odd
    or NaN values. I also used dither during feature extraction, so I don't think
    that this explains the problem.

    "It also looks like you are using an outdated SphinxTrain."

    I downloaded the latest version from the website just three days ago, so it
    shouldn't be out of date. I just double-checked this, and it seems there is
    only one version of SphinxTrain available anyway, at least from the
    recommended downloads area. I have the latest version of SphinxBase as well.
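
    For reference, this is roughly the kind of check I ran; a quick standalone
    sketch (not part of SphinxTrain, and the file name is only an example) that
    assumes the usual cepstra layout of a 4-byte count of float32 values followed
    by the values themselves, in the machine's native byte order:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file.mfc\n", argv[0]);
            return 1;
        }
        FILE *fp = fopen(argv[1], "rb");
        if (fp == NULL) {
            perror(argv[1]);
            return 1;
        }

        /* Leading 4-byte header: how many float32 values follow. */
        uint32_t count;
        if (fread(&count, sizeof count, 1, fp) != 1) {
            fprintf(stderr, "%s: short header\n", argv[1]);
            fclose(fp);
            return 1;
        }

        /* Scan every value for NaN or +/-inf. */
        uint32_t bad = 0;
        float val;
        for (uint32_t i = 0; i < count && fread(&val, sizeof val, 1, fp) == 1; i++) {
            if (!isfinite(val))
                bad++;
        }
        printf("%s: %u values declared, %u non-finite\n", argv[1], count, bad);

        fclose(fp);
        return bad != 0;
    }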

     
  • Nickolay V. Shmyrev

    Anyway, good practice if something fails is to train on each half of the
    training data separately and check which half fails. This way you can localize
    the problematic part of the database.
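
    One rough way to do that split (a sketch, not a SphinxTrain tool; the control
    file name is only an example, and the matching .transcription file has to be
    split at the same line): write the first half of the lines to <file>.half1
    and the rest to <file>.half2.

    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s etc/db_train.fileids\n", argv[0]);
            return 1;
        }
        FILE *in = fopen(argv[1], "r");
        if (in == NULL) {
            perror(argv[1]);
            return 1;
        }

        /* First pass: count the lines. */
        long lines = 0;
        int c;
        while ((c = fgetc(in)) != EOF)
            if (c == '\n')
                lines++;
        rewind(in);

        char name[1024];
        snprintf(name, sizeof name, "%s.half1", argv[1]);
        FILE *out1 = fopen(name, "w");
        snprintf(name, sizeof name, "%s.half2", argv[1]);
        FILE *out2 = fopen(name, "w");
        if (out1 == NULL || out2 == NULL) {
            perror("fopen");
            return 1;
        }

        /* Second pass: first half of the utterances to one list, rest to the other. */
        char buf[4096];
        long n = 0;
        while (fgets(buf, sizeof buf, in) != NULL)
            fputs(buf, n++ < lines / 2 ? out1 : out2);

        fclose(in);
        fclose(out1);
        fclose(out2);
        return 0;
    }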

     
  • Jbob

    Jbob - 2010-08-25

    I've run several more iterations of training, trying to narrow down the
    problem, but I am still not having any luck. The training process makes it to
    2 Gaussians for the CD models and then dies at the next split.

    In the past, when I have seen training die as a result of bad data, it has
    always done so right at the beginning, not halfway through the CD training
    stage. Any further insight into what might cause such an issue at such a late
    stage would be greatly appreciated.

     
  • Jbob

    Jbob - 2010-09-05

    I eventually resolved this: I had the endianness set wrong. Interestingly
    enough, I was able to train models with a small amount of data (< 5 hours) and
    run tests on them. Results were pretty bad, but bizarrely it actually worked.
    Increasing the amount of data caused the likelihoods to bottom out and the
    training process to crash. Anyway, after matching the endianness everything
    worked fine. A pretty embarrassing mistake.
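
    In hindsight, a simple header check would have caught it immediately. A
    sketch of that check (again assuming the 4-byte count header in the cepstra
    files; the file name is only an example): the count should equal
    (file size - 4) / 4, and if it only matches after byte-swapping, the features
    were written with the opposite endianness to the machine reading them.

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t swap32(uint32_t x)
    {
        return (x >> 24) | ((x >> 8) & 0x0000ff00u)
             | ((x << 8) & 0x00ff0000u) | (x << 24);
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s file.mfc\n", argv[0]);
            return 1;
        }
        FILE *fp = fopen(argv[1], "rb");
        if (fp == NULL) {
            perror(argv[1]);
            return 1;
        }

        uint32_t count;
        if (fread(&count, sizeof count, 1, fp) != 1) {
            fprintf(stderr, "%s: short header\n", argv[1]);
            fclose(fp);
            return 1;
        }
        fseek(fp, 0, SEEK_END);
        uint32_t expected = (uint32_t)((ftell(fp) - 4) / 4);
        fclose(fp);

        if (count == expected)
            printf("%s: header matches native byte order\n", argv[1]);
        else if (swap32(count) == expected)
            printf("%s: header only matches after byte-swapping (endianness mismatch)\n",
                   argv[1]);
        else
            printf("%s: header matches neither byte order\n", argv[1]);
        return 0;
    }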

     
