Hi Nickolay, I have a few questions regarding sphinxtrain. I am trying CMU Sphinx with the aim of creating a good acoustic model for 8kHz telephone transcription (large vocabulary) of British-accented English, and then using PocketSphinx to convert the audio to text. I have 1000+ hours of audio recorded in British English, with transcriptions.
The model should be able to transcribe spontaneous British English telephone audio (average length 10 seconds; roughly 1 out of 10 recordings is longer than 20 seconds) with satisfactory accuracy. It is not dictation: the context can be anything and the speaker is not known.
I tried model adaptation. By experimenting with the parameters and a custom language model, the best configuration I could achieve was a model adapted from en-us-8khz with a custom language model generated from the transcriptions. The accuracy is 52% on the test set (measured with word_align.pl from sphinxtrain); this number is inconclusive because the test set was very small, but it is not enough either way, so I am now trying full model training by following the tutorial at http://cmusphinx.sourceforge.net/wiki/tutorialam.
The aim is to tune training (and decoding in PocketSphinx) for accuracy rather than speed. The program processes audio files offline, not live, so speed is not the priority, and saving computational power at the expense of accuracy is not needed either.
20 xRT on average is still acceptable.
$CFG_STATESPERHMM = 3;
Does it make sense to increase it to 5 in order to improve accuracy? Does PocketSphinx support models with 5 HMM states?
I'm choosing continuous models for accuracy, because I read that PTM models are less accurate; on the other hand, I read that PTM models will be better supported in the future because of their good speed/accuracy ratio. Can a good PTM model match a good continuous model in terms of WER if both are trained on the same training database, or is PTM likely to always be roughly 10% less accurate?
$CFG_FINAL_NUM_DENSITIES = 32;
From reading the tutorial, a value of either 32 or 64 seems reasonable to me; would you try any other value?
$CFG_N_TIED_STATES = 8000;
A value in the range of 6000 to 12000 seems reasonable to me; would you try any other value?
# (yes/no) Train multiple-gaussian context-independent models (useful
# for alignment, use 'no' otherwise) in the models created
# specifically for forced alignment
$CFG_FALIGN_CI_MGAU = 'yes';
# (yes/no) Train multiple-gaussian context-independent models (useful
# for alignment, use 'no' otherwise)
$CFG_CI_MGAU = 'yes';
$CFG_FORCEDALIGN = 'yes';
I'm keeping these values set to 'yes', together with $CFG_FORCEDALIGN = 'yes' ("# Use force-aligned transcripts (if available) as input to training").
From what I understood, setting these values to 'yes' (together with CFG_FALIGN_CI_MGAU and CFG_CI_MGAU) causes sphinxtrain to filter out the samples that are not considered a good fit for their transcription. (To use it, I compiled sphinx3_align from Sphinx 3 and placed it in my sphinxtrain installation folder under libexec/sphinxtrain, as instructed by the corresponding script.)
This seems to increase the quality of the training in my case (I don't have conclusive results yet), because the transcriptions I have are not always accurate (approximately 1 out of 8 transcriptions is not 100% accurate), so this would filter out the audio samples that don't match their transcription well enough (where "well enough" depends on the $CFG_FORCE_ALIGN_BEAM value). Did I interpret this functionality correctly? Is there any place in logdir where I can see the number of discarded sentences as a percentage?
Is trainingTask/falignout/trainingTask.alignedfiles the list of files that survived the discarding/filtering done by sphinx3_align when forced alignment is enabled?
If I want to calculate the percentage of discarded samples, is it correct to take the number of lines in trainingTask/falignout/trainingTask.alignedfiles and compare it with the number of lines in trainingTask/etc/trainingTask_train.fileids?
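If that is the right way to read it, here is a minimal sketch of the comparison I have in mind (assuming .alignedfiles lists one surviving utterance per line, which is exactly what I would like to confirm):

# Rough estimate of how many utterances forced alignment discarded.
# Paths are examples from my trainingTask setup.
def count_lines(path):
    with open(path) as f:
        return sum(1 for line in f if line.strip())

total = count_lines("trainingTask/etc/trainingTask_train.fileids")
aligned = count_lines("trainingTask/falignout/trainingTask.alignedfiles")
discarded = total - aligned
print("discarded %d of %d utterances (%.1f%%)" % (discarded, total, 100.0 * discarded / total))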
I would like to make the forced alignment stricter (with the idea of keeping only the samples that match their transcription perfectly) by modifying $CFG_FORCE_ALIGN_BEAM. I don't mind discarding 50% of the samples if it means keeping only the good ones and getting a better WER/accuracy on the test sets; as long as there are enough samples in the database, discarding half of them to be sure I keep only the fit ones seems worth it to me. Do you think strict filtering is a good idea for this task?
Sometimes training crashes at module 00, Phase 3 (Check general format for the fileids file; utterance length (must be positive); files exist).
I get a long list of:
"WARNING: Error in '(trainingTask_train.fileids)'', the feature file '(path to file.mfc)' does not exist, or is empty
When I check the log in logdir/000.comp_feat, I notice that at some point sphinx_fe fails, giving:
This causes all the subsequent IDs in that part (I'm using $CFG_NPART with values > 1) to fail. From my understanding, this happens because some files in my database (maybe one out of 200,000 on average) are somehow corrupted:
(the ID that fails is not the first one)
So I ran soxi on the .wav file corresponding to the first error in the list, and indeed it says:
soxi FAIL formats: can't open input file `631163.wav': WAVE: RIFF header not found
(sometimes the offending audio file that breaks the chain reports an "input/output error" instead)
If I run soxi on the other files in the list of errors (for example the next one, corresponding to the second "the feature file '(path to file.mfc)' does not exist, or is empty" warning), they seem to be fine according to soxi; for example:
So my guess is that once an ID fails, it breaks the chain for all the other IDs in that NPART partition, and the error goes unnoticed by the scripts until MODULE 00, Phase 3.
I could just delete the ID from _train.fileids and _train.transcription, but I'm using a script that builds the training database from different files every time, so deleting the ID by hand would not solve my issue.
My question is: is there any easy way to make MODULE 000 discard IDs that produce errors, without crashing the training?
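In the meantime, the workaround I'm considering is to pre-scan the wav files each time my script rebuilds the database and drop any utterance whose RIFF header cannot be read, roughly like the sketch below (the file names and layout are examples from my setup, and it assumes plain PCM RIFF wavs, since Python's standard wave module cannot read compressed telephone formats; for those I would fall back to checking with soxi):

import wave

# Drop utterances whose wav file is missing or has a broken RIFF header,
# so sphinx_fe never sees them. Paths below are examples; the real
# fileids/transcription pair is produced by my database script.
def wav_is_readable(path):
    try:
        with wave.open(path, "rb") as w:
            return w.getnframes() > 0
    except (wave.Error, EOFError, OSError):
        return False

with open("etc/trainingTask_train.fileids") as f:
    ids = [line.strip() for line in f if line.strip()]
with open("etc/trainingTask_train.transcription") as f:
    trans = [line.rstrip("\n") for line in f]

# fileids and transcription are assumed to be line-aligned.
kept = [(i, t) for i, t in zip(ids, trans) if wav_is_readable("wav/%s.wav" % i)]

with open("etc/trainingTask_train.fileids", "w") as f:
    f.writelines(i + "\n" for i, _ in kept)
with open("etc/trainingTask_train.transcription", "w") as f:
    f.writelines(t + "\n" for _, t in kept)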
For the language model, I normally generate a language model with MITLM, building it from all the transcriptions in the training data (excluding the test transcriptions). I noticed that there are tools that allow me to optimize the language model to minimize perplexity; would this optimization significantly improve my final accuracy on the test sets (for example, by more than 5%)?
I had a look at the RNNLM Toolkit; if I understood correctly, it can create language models using a recurrent-neural-network approach. Is it likely to create language models that are better (when decoding the test sets) than the ones MITLM would create with default parameters from the same input text?
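If I understand correctly, both of these questions come down to perplexity on held-out transcriptions; a minimal sketch of what I would measure, where logprob is a placeholder for whatever per-word log10 probability the toolkit exposes (not a real MITLM or RNNLM API):

# Perplexity of a held-out word sequence under a language model.
# logprob(word, history) is a placeholder returning log10 P(word | history).
def perplexity(words, logprob):
    total = sum(logprob(w, words[:i]) for i, w in enumerate(words))
    return 10.0 ** (-total / len(words))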
Do you think training a model with the BEEP dictionary is a better alternative to cmudict 0.7 for spontaneous British English 8kHz telephone audio?
What other values/parameters would you experiment with to increase accuracy, besides the ones above?
One last question: I also had Kaldi in mind. Do you know Kaldi well? Could Kaldi be a better fit for my task (prioritizing accuracy over speed, offline audio-to-text conversion, speaker-independent, large-vocabulary, 8kHz unpredictable speech)?
Do you think using DNNs with Kaldi is a good choice for my task? Is it likely to give a better WER/accuracy for my task?
My first (and current) training results (Estimated Total Hours Training: 377.89) give me:
MODULE: DECODE Decoding using models previously trained
Aligning results to find error rate
SENTENCE ERROR: 89.3% (620/694)  WORD ERROR RATE: 64.5% (5486/8502)
as seen in the sphinxtrain output on screen, and:
TOTAL Words: 8502 Correct: 5095 Errors: 5486
TOTAL Percent correct = 59.93% Error = 64.53% Accuracy = 35.47%
TOTAL Insertions: 2079 Deletions: 929 Substitutions: 2478
That is the same training run's result, taken from result/trainingTask.align.
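As a sanity check on how word_align.pl arrives at those percentages (as I understand its conventions, errors = insertions + deletions + substitutions and accuracy = 100% minus WER), the totals above are consistent:

# Recomputing the summary line from result/trainingTask.align.
words, ins, dels, subs = 8502, 2079, 929, 2478
errors = ins + dels + subs                       # 5486
correct = words - dels - subs                    # 5095
print("Percent correct = %.2f%%" % (100.0 * correct / words))    # 59.93
print("Error = %.2f%%" % (100.0 * errors / words))               # 64.53
print("Accuracy = %.2f%%" % (100.0 * (words - errors) / words))  # 35.47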
Leaving everything at the default except NPART, my sphinx_train.cfg is:
# Feature extraction parameters
$CFG_WAVFILE_SRATE = 8000.0;
$CFG_NUM_FILT = 15; # For wideband speech it's 25, for telephone 8khz reasonable value is 15
$CFG_LO_FILT = 200; # For telephone 8kHz speech value is 200
$CFG_HI_FILT = 3500; # For telephone 8kHz speech value is 3500
$CFG_TRANSFORM = "dct"; # Previously legacy transform is used, but dct is more accurate
$CFG_LIFTER = "22"; # Cepstrum lifter is smoothing to improve recognition
$CFG_VECTOR_LENGTH = 13; # 13 is usually enough
$CFG_HMM_TYPE = '.cont.'; # Sphinx 4, PocketSphinx
$CFG_FINAL_NUM_DENSITIES = 32; # (in the continuous-model elsif section)
$CFG_STATESPERHMM = 4;
# (yes/no) Train multiple-gaussian context-independent models (useful
# for alignment, use 'no' otherwise) in the models created
# specifically for forced alignment
$CFG_FALIGN_CI_MGAU = 'yes';
# (yes/no) Train multiple-gaussian context-independent models (useful
# for alignment, use 'no' otherwise)
$CFG_CI_MGAU = 'yes';
# (yes/no) Train context-dependent models
$CFG_CD_TRAIN = 'yes';
# Number of tied states (senones) to create in decision-tree clustering
$CFG_N_TIED_STATES = 8000;
# How many parts to run Forward-Backward estimation in
$CFG_NPART = 46;
# Use force-aligned transcripts (if available) as input to training
$CFG_FORCEDALIGN = 'yes';
Last edit: Orest 2015-04-02
Does it make sense to increase it to 5 in order to improve accuracy?
No
Does PocketSphinx support models with 5 HMM states?
Yes
Can a good PTM model match a good continuous model in terms of WER if both are trained on the same training database?
No
Or is PTM likely to always be approximately 10% less accurate?
Yes
A value in the range of 6000 to 12000 seems reasonable to me; would you try any other value?
No
(approximately 1 out of 8 transcriptions is not 100% accurate), so this would filter out the audio samples that don't match their transcription well enough (where "well enough" depends on the $CFG_FORCE_ALIGN_BEAM value). Did I interpret this functionality correctly?
Not exactly. The purpose of alignment is to select among pronunciation variants in your dictionary and insert silence where appropriate.
Forced alignment is not a very good algorithm for filtering bad transcripts; it has significant disadvantages. For example, it can filter out correct transcripts as well.
It can be used to some extent, but strict filtering does not improve accuracy. Reducing the beam is not recommended, because it will filter out more correct transcripts and the model will not be able to learn from them.
Kaldi implements more advanced algorithms for cleaning up transcripts, which might be implemented in sphinxtrain one day. For example, there is a script, find_bad_utts.sh, which uses a unigram decoder to figure out whether a transcript matches the audio.
Is there any place in logdir where I can see the number of discarded sentences as a percentage?
You can compare the number of input utterances with the number of aligned utterances in the falignout folder.
causes all the subsequent IDs in that part (I'm using $CFG_NPART with values > 1) to fail.
I've just committed a fix; it should now issue an error and proceed. Thanks for the report.
For the language model, I normally generate a language model with MITLM, building it from all the transcriptions in the training data (excluding the test transcriptions). I noticed that there are tools that allow me to optimize the language model to minimize perplexity; would this optimization significantly improve my final accuracy on the test sets (for example, by more than 5%)?
It is better to consider introducing more language-model data from other sources; most algorithms are not as helpful as additional data. Another helpful thing would be to use RNNLM models, which tend to be more accurate. MITLM tricks are not going to be a significant advance.
Is it likely to create language models that are better (when decoding the test sets) than the ones MITLM would create with default parameters from the same input text?
Yes
Do you think training a model with the BEEP dictionary is a better alternative to cmudict 0.7 for spontaneous British English 8kHz telephone audio?
The BEEP dictionary has a non-commercial license. There is no good free dictionary for UK English.
Do you think using DNNs with Kaldi is a good choice for my task? Is it likely to give a better WER/accuracy for my task?
Yes, Kaldi DNN is significantly more accurate.
SENTENCE ERROR: 89.3% (620/694) WORD ERROR RATE: 64.5% (5486/8502)
This is not a very good result; maybe your input data is too dirty, or there are issues with the language model. It is hard to say offhand.
Thanks Nickolay, but it seems that it didn't completely solve the issue. I purposely added a wav for which the soxi command returns:
I added it to the corresponding _train.fileids and _train.transcription, on the last line, and set $CFG_NPART to 40.
In Phase 3 it now issues the warning as you said and proceeds, but then it stops without feedback at Phase 7:
In logdir, only the "000.comp_feat" folder is created, with 80 log files (40 for testing and 40 for training), and when I open the last train log file, at the end it says:
To be sure that was the reason, I deleted the offending file from _train.fileids and _train.transcription in etc/, deleted everything except etc/ and wav/, and ran the training again; this time the training worked.
I did this test after reinstalling sphinxbase from https://sourceforge.net/p/cmusphinx/code/HEAD/tree/trunk/sphinxbase/ (where I saw that you modified sphinx_fe.c)
(and my sphinxtrain is from GitHub).
How can I have sphinxtrain skip training/processing certain files that give errors during the sphinx_fe stage?
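Until there is a built-in way, a workaround I'm considering is to cross-check the fileids against the feature files that 000.comp_feat actually produced and drop every utterance whose .mfc file is missing or empty (rewriting the fileids/transcription pair the same way as in the wav pre-filter sketch earlier). A minimal check, with the feature directory and extension as examples from my setup:

import os

# Keep an utterance only if its feature file exists and is non-empty.
# "feat" and "mfc" are examples; adjust to wherever your configuration
# writes the feature files and whatever extension it uses.
def has_feat(uttid, feat_dir="feat", ext="mfc"):
    path = os.path.join(feat_dir, uttid + "." + ext)
    return os.path.isfile(path) and os.path.getsize(path) > 0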
Last edit: Orest 2015-04-08
It is better to consider introducing more language-model data from other sources; most algorithms are not as helpful as additional data.
I didn't understand this: why is it better to introduce more language-model data from other sources?
Let's assume I want to transcribe speaker-independent telephone audio on a certain topic (for example, product feedback),
and let's assume I have 1000 hours of audio with transcriptions, that those 1000 hours represent the target audio/topic/accent I need to transcribe, and that they correspond to 1,400,000 audio files (and 1,400,000 transcriptions).
I can train an acoustic model with sphinxtrain from this training database, but for the language model, why is it better to add external language-model data to my existing 1,400,000 transcriptions rather than just creating a language model from those 1,400,000 transcriptions alone?
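My current guess at the answer is that the in-domain transcriptions alone give sparse n-gram estimates, and that external text would be combined with them by interpolation rather than by simply concatenating the corpora; a toy sketch of linear interpolation (the probability functions and weight are placeholders, not a real MITLM workflow):

# Toy linear interpolation of an in-domain LM (built from the transcriptions)
# with a background LM (built from external text). p_indomain/p_background
# are placeholders for whatever P(word | history) a toolkit returns; the
# weight lam would normally be tuned on held-out in-domain text.
def p_interpolated(word, history, p_indomain, p_background, lam=0.7):
    return lam * p_indomain(word, history) + (1.0 - lam) * p_background(word, history)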
Another helpful thing would be to use RNNLM models, which tend to be more accurate. MITLM tricks are not going to be a significant advance.
I tried to familiarize myself with the RNNLM Toolkit (a very cool project) with the idea of creating an ARPA-format language model for PocketSphinx, but recurrent nets are not finite-state machines, so even if a function were implemented to generate an ARPA-format language model with the RNNLM Toolkit, it would only be an approximation and would not actually use the RNN language model during decoding.
Nickolay, do you think implementing RNN language models (in decoding) in CMU Sphinx would be a good idea?
Last edit: Orest 2015-04-22
It might be a good idea for you; we do not plan to implement them ourselves.