
Performance of sphinx3_align

Help
Willy
2019-05-08
2019-05-14
  • Willy

    Willy - 2019-05-08

    Hi, I've trained a continuous model with sphinxtrain and ran phone segmentation on the training set itself with sphinx3_align, but the performance seems to be very bad.
    The training set is librispeech train-clean-100, 16k, mono .wav file, using 200k-word dictionary.

    I wonder whether it's a model-training issue. Here is part of the setting in sphinx_train.cfg:
    $CFG_HMM_TYPE = '.cont.'
    $CFG_FINAL_NUM_DENSITIES = 32;
    $CFG_N_TIED_STATES = 2000;
    $CFG_QUEUE_TYPE = "Queue::POSIX"; # Using multi-CPU
    $CFG_NPART = 12;
    $DEC_CFG_NPART = 12;
    I've attached the complete config file to this post.

    Please give me some advice on this, thanks.

     
    • Nickolay V. Shmyrev

      I'm not really sure what you mean by "performance is really bad", so it is hard to help.

      You can also try Kaldi since you are working with librispeech, it will be more accurate in recognition.

       
  • Willy

    Willy - 2019-05-09

    Thanks for the advice. I have to use sphinx for personal reasons.
    Here is an example of the alignment result:

    Answer(sec)      Forced alignment(sec)  Forced alignment(Frm)   Label
    s_time  e_time   s_time  e_time         SFrm  EFrm              
    0       0        0       0.03           0     3                 <s>
    0.04    0.57     0.04    0.06           4     6                 <s>
    0.57    0.83     0.07    0.33           7     33                THIS
    0.83    1.11     0.34    0.64           34    64                LITTLE
    1.11    1.45     0.65    0.97           65    97                WORK
    1.45    1.74     0.98    1.23           98    123               WAS
    1.74    2.11     1.24    1.72           124   172               FINISHED
    2.15    2.36     1.73    1.87           173   187               IN
    2.37    2.47     1.88    1.95           188   195               THE
    2.48    2.73     1.96    2.2            196   220               YEAR
    2.73    3.24     2.21    2.76           221   276               EIGHTEEN
    3.24    3.43     2.77    2.94           277   294               O
    3.44    3.82     2.95    3.39           295   339               THREE
    ...
    ...
    

    Forced alignment(Frm) is produced by sphinx3_align, and I converted it to Forced alignment(sec) by assuming FRAME_SHIFT=0.01 (sec). As you can see, it is far from the manual alignment (Answer).
    Also, the forced alignment seems to end before the end of the wav file, so a section of audio at the end of the file is left unlabeled.
    I don't think this is a reasonable result even if the model is not well trained. It looks more like I missed some setup step when aligning.
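The frame-to-seconds conversion described above can be sketched in a few lines of Python. This is only an illustration: the rows are copied from the table in this post, FRAME_SHIFT=0.01 is the value assumed here, and the helper names `frames_to_seconds` and `start_offsets` are made up for the example.

```python
FRAME_SHIFT = 0.01  # seconds per frame, as assumed in the post

# (reference start in s, reference end in s, start frame, end frame, label),
# copied from the alignment table in the post
ROWS = [
    (0.57, 0.83, 7, 33, "THIS"),
    (0.83, 1.11, 34, 64, "LITTLE"),
    (1.11, 1.45, 65, 97, "WORK"),
    (1.45, 1.74, 98, 123, "WAS"),
]


def frames_to_seconds(frame, frame_shift=FRAME_SHIFT):
    """Convert a frame index to seconds, rounded to 10 ms."""
    return round(frame * frame_shift, 2)


def start_offsets(rows):
    """Return (label, reference start minus aligned start) for every row."""
    return [
        (label, round(ref_start - frames_to_seconds(start_frame), 2))
        for ref_start, _ref_end, start_frame, _end_frame, label in rows
    ]


if __name__ == "__main__":
    for label, offset in start_offsets(ROWS):
        print(f"{label:8s} reference starts {offset:.2f} s after the aligned start")
```

For these four words the offset stays close to half a second instead of fluctuating, which would fit a stretch of audio missing from the features rather than a poorly trained model.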

     

    Last edit: Willy 2019-05-09
    • Nickolay V. Shmyrev

      Pocketsphinx/sphinx_fe/sphinx3 remove silence by default, so timings might be off. Add -remove_silence no to the sphinx_fe feature extraction call and the times will be accurate.

       

      Last edit: Nickolay V. Shmyrev 2019-05-09
  • Willy

    Willy - 2019-05-10

    Thanks, silence must be the reason for this.
    But there is no difference after I used that option. My command is:
    sphinx3_align \
        -remove_silence no \
        -hmm train-clean-100/model_parameters/librispeech.cd_cont_2000 \
        -dict  train-clean-100/etc/librispeech.dic \
        -ctl train-clean-100/etc/librispeech_train.fileids \
        -cepdir train-clean-100/feat \
        -cepext .mfc \
        -insent train-clean-100/etc/librispeech_train.transcription \
        -outsent train-clean-100/align_out/librispeech.out \
        -wdsegdir align_out/wdlabdir
    

    Did I make any mistake here?

     

    Last edit: Willy 2019-05-10
  • Willy

    Willy - 2019-05-10

    I found an old post that said -remove_silence is an option to sphinx_fe.
    Does this mean I have to extract the features of the training set again with -remove_silence no?
    * I already have the training features, which were extracted automatically by the sphinxtrain command while training the model.

     

    Last edit: Willy 2019-05-10
    • Nickolay V. Shmyrev

      found an old post that said -remove_silence is an option to sphinx_fe.

      Yes, it is an option for sphinx_fe, not sphinx3_align. The command above is wrong.

      Does this mean I have to extract the feature of training set again with -remove_silence no?

      Yes

       
  • Willy

    Willy - 2019-05-10

    Thanks for the help.
    Can sphinx_fe read a list of input wav files, or can it only process one file per call?
    Thanks.

     
    • Nickolay V. Shmyrev

      Sure, it can. Check the model training scripts; the -c option (a control file listing the inputs) controls that.
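As a rough sketch of what such a batch call could look like, the Python snippet below only builds and prints a sphinx_fe command line; it does not run anything. The wav and output directories are hypothetical, the helper name `build_sphinx_fe_command` is made up, and -remove_silence is taken from this thread, so it depends on the installed sphinxbase version. The remaining front-end parameters are left out here and should match the ones used to train the model (the training scripts are the reference for those).

```python
import shlex


def build_sphinx_fe_command(ctl_file, wav_dir, feat_dir, samprate=16000):
    """Return the argv list for a batch sphinx_fe run with silence removal off."""
    return [
        "sphinx_fe",
        "-c", ctl_file,              # control file: one utterance ID per line
        "-di", wav_dir,              # directory containing the input audio
        "-do", feat_dir,             # directory for the output feature files
        "-ei", "wav",                # input file extension
        "-eo", "mfc",                # output file extension
        "-mswav", "yes",             # inputs are MS WAV files
        "-samprate", str(samprate),  # 16 kHz audio, as stated in the first post
        "-remove_silence", "no",     # keep silence so frame times match the audio
    ]


if __name__ == "__main__":
    argv = build_sphinx_fe_command(
        "train-clean-100/etc/librispeech_train.fileids",
        "train-clean-100/wav",         # hypothetical location of the wav files
        "train-clean-100/feat_nosil",  # hypothetical output directory
    )
    print(shlex.join(argv))
```

Writing the new features to a separate directory keeps the ones produced during training intact; sphinx3_align would then be pointed at the new directory through -cepdir.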

       
  • Willy

    Willy - 2019-05-14

    I got it. Thank you very much.

     
