Hi, I've trained a continuous model with sphinxtrain and run phone segmentation on the training set itself with sphinx3_align, but the alignment quality seems to be very bad.
The training set is LibriSpeech train-clean-100 (16 kHz, mono .wav files), with a 200k-word dictionary.
I wonder whether it's a model-training issue. Here is part of the configuration in sphinx_train.cfg:
$CFG_HMM_TYPE = '.cont.';
$CFG_FINAL_NUM_DENSITIES = 32;
$CFG_N_TIED_STATES = 2000;
$CFG_QUEUE_TYPE = "Queue::POSIX"; # Using multi-CPU
$CFG_NPART = 12;
$DEC_CFG_NPART = 12;
I've also attached the complete config file to this post.
Please give me some advice on this, thanks.
I'm not really sure what you mean by "performance is really bad", so it is hard to help.
You could also try Kaldi since you are working with LibriSpeech; it will be more accurate in recognition.
Thanks for the advice. I have to use Sphinx for personal reasons.
Here is an example of the alignment result:
Forced alignment (Frm) is produced by sphinx3_align, and I converted it into Forced alignment (sec) by assuming FRAME_SHIFT=0.01 (sec). As you can see, it is far from the manual alignment (Answer).
BTW, the forced alignment seems to end before the end of the wav file, so a section of audio at the end of the file is left unlabeled.
I don't think this is a reasonable result even if the model is not well trained. It looks more like I missed some setup step while aligning.
Last edit: Willy 2019-05-09
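The frame-to-seconds conversion and the "unlabeled tail" check described in the post can be sketched as below. This is only an illustration: the 0.01 s frame shift is the value assumed in the post, the segment tuples are made-up data rather than real sphinx3_align output, and treating the end frame as inclusive is an assumption about the segmentation format.

```python
FRAME_SHIFT_SEC = 0.01  # assumed frame shift (100 frames per second), as in the post


def frames_to_seconds(start_frame, end_frame, frame_shift=FRAME_SHIFT_SEC):
    """Convert a segment given in frames to (start, end) in seconds.

    Assumes end_frame is inclusive, so the segment ends where the
    following frame would begin.
    """
    start_sec = start_frame * frame_shift
    end_sec = (end_frame + 1) * frame_shift
    return round(start_sec, 3), round(end_sec, 3)


def unlabeled_tail(segments, wav_duration_sec, frame_shift=FRAME_SHIFT_SEC):
    """Return how many seconds at the end of the file carry no label."""
    last_end_frame = max(end for _, end, _ in segments)
    covered_sec = (last_end_frame + 1) * frame_shift
    return round(wav_duration_sec - covered_sec, 3)


# Hypothetical (start_frame, end_frame, phone) segments, for illustration only.
segments = [(0, 24, "SIL"), (25, 31, "HH"), (32, 40, "AH")]

if __name__ == "__main__":
    for start, end, phone in segments:
        print(phone, frames_to_seconds(start, end))
    print("unlabeled tail (sec):", unlabeled_tail(segments, wav_duration_sec=0.5))
```

If frames were dropped during feature extraction, the frame indices no longer map linearly onto the original audio, which would show up both as shifted boundaries and as a large unlabeled tail in a check like this.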
Pocketsphinx/sphinx_fe/sphinx3 remove silence by default, so timings might be off. Add -remove_silence no to the feature extraction call of sphinx_fe, and the times will be accurate.
Last edit: Nickolay V. Shmyrev 2019-05-09
Thanks, silence must be the reason for this.
But there is no difference after I used that option. My command is:
sphinx3_align \
Did I make a mistake here?
Last edit: Willy 2019-05-10
I found an old post that says -remove_silence is an option of sphinx_fe.
Does this mean I have to extract the features of the training set again with -remove_silence no?
* I already have the training features, which were extracted automatically by the sphinxtrain command while training the model.
Last edit: Willy 2019-05-10
Yes, it is an option for sphinx_fe, not sphinx3_align. The command above is wrong.
Yes
Thanks for the help.
Can sphinx_fe read a list of input wav files, or can it only process one file per call?
Thanks.
Sure, you can check the model training scripts; the -c option controls that.
I got it. Thank you very much.
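For anyone landing here later, a batch re-extraction along the lines discussed above might be prepared as in the sketch below. It only writes a control file and builds the argument list; it does not run anything. The helper names, directory names, and the feat.params path are hypothetical, and the exact set of options should be checked against what the sphinxtrain scripts pass to sphinx_fe for your model, since the features must match the ones used in training.

```python
from pathlib import Path


def write_control_file(wav_dir, ctl_path):
    """Write one file ID per line (path relative to wav_dir, no extension)."""
    wav_dir = Path(wav_dir)
    file_ids = sorted(
        p.relative_to(wav_dir).with_suffix("").as_posix()
        for p in wav_dir.rglob("*.wav")
    )
    Path(ctl_path).write_text("".join(fid + "\n" for fid in file_ids))
    return file_ids


def build_sphinx_fe_command(ctl_path, wav_dir, feat_dir, feat_params):
    """Build a sphinx_fe argument list for batch extraction with silence removal off."""
    return [
        "sphinx_fe",
        "-argfile", str(feat_params),  # assumed: feat.params from the trained model
        "-c", str(ctl_path),           # control file listing the inputs
        "-di", str(wav_dir),           # input directory
        "-do", str(feat_dir),          # output directory
        "-ei", "wav",                  # input extension
        "-eo", "mfc",                  # output extension
        "-mswav", "yes",               # inputs are MS WAV files
        "-remove_silence", "no",       # keep all frames so timings line up
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    ids = write_control_file("wav", "train.fileids")
    cmd = build_sphinx_fe_command("train.fileids", "wav", "feat", "model/feat.params")
    print(len(ids), "files listed")
    print(" ".join(cmd))
```

The printed command can be inspected and then run by hand, or passed to subprocess.run once the paths are confirmed.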