CMU Sphinx / Forums / Help: Performance of sphinx3

Willy - 2019-05-08

Hi, I've trained a continuous model with sphinxtrain and ran phone segmentation on the training set itself with sphinx3_align, but the performance seems to be very bad.
The training set is LibriSpeech train-clean-100 (16 kHz, mono .wav files), with a 200k-word dictionary.

I wonder whether it's a model-training issue. Here are some of the settings in sphinx_train.cfg:
$CFG_HMM_TYPE = '.cont.';
$CFG_FINAL_NUM_DENSITIES = 32;
$CFG_N_TIED_STATES = 2000;
$CFG_QUEUE_TYPE = "Queue::POSIX"; # Using multi-CPU
$CFG_NPART = 12;
$DEC_CFG_NPART = 12;
I've also attached the complete config file to this post.

Please give me some advice on this, thanks.

sphinx_train.cfg

- Nickolay V. Shmyrev - 2019-05-08
  
  I'm not really sure what you mean by "performance is really bad", so it is hard to help.
  
  You can also try Kaldi since you are working with LibriSpeech; it will be more accurate in recognition.
  

Willy - 2019-05-09

Thanks for the advice. I have to use Sphinx for personal reasons.
Here is an example of the alignment result:

    Answer(sec)       Forced alignment(sec)    Forced alignment(Frm)    Label
    s_time  e_time    s_time  e_time           SFrm    EFrm
    0       0         0       0.03             0       3                <s>
    0.04    0.57      0.04    0.06             4       6                <s>
    0.57    0.83      0.07    0.33             7       33               THIS
    0.83    1.11      0.34    0.64             34      64               LITTLE
    1.11    1.45      0.65    0.97             65      97               WORK
    1.45    1.74      0.98    1.23             98      123              WAS
    1.74    2.11      1.24    1.72             124     172              FINISHED
    2.15    2.36      1.73    1.87             173     187              IN
    2.37    2.47      1.88    1.95             188     195              THE
    2.48    2.73      1.96    2.2              196     220              YEAR
    2.73    3.24      2.21    2.76             221     276              EIGHTEEN
    3.24    3.43      2.77    2.94             277     294              O
    3.44    3.82      2.95    3.39             295     339              THREE
    ...

Forced alignment(Frm) is produced by sphinx3_align, and I converted it to Forced alignment(sec) by assuming FRAME_SHIFT=0.01 (sec). As you can see, it is far from the manual alignment (Answer).
BTW, the forced alignment seems to end before the end of the wav file, so a section of audio at the end of the file is left unlabeled.
I don't think this is a reasonable result even if the model is not well trained. It looks more like I missed some setup step when aligning.
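
The conversion and comparison described here can be sketched in Python. This is only a minimal sketch, assuming the 10 ms frame shift stated above (FRAME_SHIFT=0.01); the function names are hypothetical and not part of any Sphinx tool. The sample values are the THIS, LITTLE and WORK rows of the listing above.

```python
import wave

FRAME_SHIFT = 0.01  # seconds per frame; an assumption, as stated in the post


def frames_to_seconds(frame, frame_shift=FRAME_SHIFT):
    """Convert a frame index to a time in seconds (rounded to 10 ms)."""
    return round(frame * frame_shift, 2)


def start_offsets(reference_starts, aligned_start_frames,
                  frame_shift=FRAME_SHIFT):
    """Return (reference start - aligned start) in seconds for each word."""
    return [
        round(ref - frames_to_seconds(frm, frame_shift), 2)
        for ref, frm in zip(reference_starts, aligned_start_frames)
    ]


def wav_duration_seconds(path):
    """Duration of a PCM wav file, for comparison with the last aligned frame."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


# Start times of THIS, LITTLE, WORK: manual answer (sec) and aligner output (frames).
reference = [0.57, 0.83, 1.11]
aligned_frames = [7, 34, 65]
offsets = start_offsets(reference, aligned_frames)
print(offsets)
```

For these three words the offsets all come out close to half a second, so the mismatch looks more like a shift of the time axis than random alignment errors. Comparing `wav_duration_seconds(...)` with `frames_to_seconds(last_frame)` is one way to quantify how much audio at the end of a file is left unlabeled.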

Last edit: Willy 2019-05-09
- Nickolay V. Shmyrev - 2019-05-09
  
  Pocketsphinx/sphinx_fe/sphinx3 remove silence by default, so timings might be off. Add -remove_silence no to the sphinx_fe feature extraction call and the times will be accurate.
  
  Last edit: Nickolay V. Shmyrev 2019-05-09
  

Thanks, silence must be the reason for this.
But there is no difference after I used that option. My command is:
    sphinx3_align \
        -remove_silence no \
        -hmm train-clean-100/model_parameters/librispeech.cd_cont_2000 \
        -dict train-clean-100/etc/librispeech.dic \
        -ctl train-clean-100/etc/librispeech_train.fileids \
        -cepdir train-clean-100/feat \
        -cepext .mfc \
        -insent train-clean-100/etc/librispeech_train.transcription \
        -outsent train-clean-100/align_out/librispeech.out \
        -wdsegdir align_out/wdlabdir

Did I make any mistake here?

Last edit: Willy 2019-05-10

Willy - 2019-05-10

I found an old post that said -remove_silence is an option to sphinx_fe.
Does this mean I have to extract the features of the training set again with -remove_silence no?
* I already have the training features, which were extracted automatically by the sphinxtrain command while training the model.

Last edit: Willy 2019-05-10

- Nickolay V. Shmyrev - 2019-05-10
  
  "I found an old post that said -remove_silence is an option to sphinx_fe."
  
  Yes, it is an option for sphinx_fe, not sphinx3_align. The command above is wrong.
  
  "Does this mean I have to extract the feature of training set again with -remove_silence no?"
  
  Yes
  

Willy - 2019-05-10

Thanks for the help.
Can sphinx_fe read a list of input wav files, or does it only take one file per command?
Thanks.

- Nickolay V. Shmyrev - 2019-05-13
  
  Sure. Check the model training scripts; the -c option controls that.
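
A batch re-extraction along these lines might look like the sketch below. It only builds the argument list and prints it for review; it does not run anything. The helper name and the wav and output directory names are hypothetical, and the flags used (-c, -di, -do, -ei, -eo, -mswav, -samprate, -remove_silence) should be checked against the sphinx_fe version installed. Any other front-end parameters used during training would presumably need to be passed as well so that the new features match the model.

```python
import shlex


def build_sphinx_fe_command(ctl_file, wav_dir, feat_dir,
                            wav_ext="wav", feat_ext="mfc", samprate=16000):
    """Build a sphinx_fe batch command that keeps silence frames.

    ctl_file lists one utterance ID per line (the same .fileids file
    used for training); sphinx_fe processes every entry in one run.
    """
    return [
        "sphinx_fe",
        "-c", ctl_file,            # control file: batch mode over many files
        "-di", wav_dir,            # input directory
        "-do", feat_dir,           # output directory
        "-ei", wav_ext,            # input extension
        "-eo", feat_ext,           # output extension
        "-mswav", "yes",           # input files are MS WAV
        "-samprate", str(samprate),
        "-remove_silence", "no",   # keep silence so frame times match the audio
    ]


cmd = build_sphinx_fe_command(
    "train-clean-100/etc/librispeech_train.fileids",
    "train-clean-100/wav",         # hypothetical wav location
    "train-clean-100/feat_nosil",  # hypothetical output location
)
print(shlex.join(cmd))
# To run it for real: subprocess.run(cmd, check=True)
```

Writing the new features to a separate directory keeps the original training features intact; sphinx3_align can then be pointed at the new directory with -cepdir.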
  

Willy - 2019-05-14

I got it. Thank you very much.

