Hello all,

I am working on a speech assessment system that targets people with speech disorders, currently using pocketsphinx, and I need to recognize specific mispronunciation sounds (such as glottal stops, pharyngeal fricatives and hypernasal consonants) in addition to regular English sounds.
To do this, I would like to train new phones and add them to the default English acoustic model. I want to take advantage of the existing model so that I do not have to start the training procedure from scratch. For the training, I would use recordings that contain both regular English phones and mispronunciation phones, but I want to learn only the new phones from them. The existing acoustic model could help to generate a better segmentation of the recordings, so that the new phones are trained on the appropriate speech segments.
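To make this concrete, here is roughly what I have in mind for the new phone definitions, assuming the usual sphinxtrain database layout (GS and PF are just placeholder symbols I am using for a glottal stop and a pharyngeal fricative; they do not exist in the stock model). The phone list (etc/<db>.phone, one symbol per line) would contain the standard English set plus the new symbols:

AA
AE
...
ZH
SIL
GS
PF

and the dictionary (etc/<db>.dic) would contain pronunciation variants that use them, for example:

CAT       K AE T
CAT(2)    K AE GS

so that the transcripts of my recordings can reference both the regular and the mispronounced realizations.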
How can I do this using Sphinx? I guess I have to make some tweaks to sphinxtrain, but I don't understand the training procedure well enough to get started. Any clue, thought or opinion on the matter is welcome!
Thank you in advance for the help,
Cédric
"but I want to learn only the new phones from them."
An acoustic model contains context-dependent detectors for phones, not just phones. So you cannot simply add new phones to the model; you have to re-estimate it. For more information see the tutorial: http://cmusphinx.sourceforge.net/wiki/tutorial
Hello Nickolay, thank you for your answer.

"An acoustic model contains context-dependent detectors for phones, not just phones."
Sure, I understand that, but still, all the context-dependent phones that do not contain the new phones in their surrounding context have already been learned and do not have to be re-estimated, right? Also, I would think that the available models would help to segment the data and improve the training for the new models.

In any case, my problem is that I have very little data available and it would not be sufficient to train the entire set of English phones in addition to the new phones. I could add speech files from VoxForge to my dataset, but, if possible, I wanted to make use of the acoustic model provided with pocketsphinx, as it has so far given me better accuracy than the VoxForge model.
Just a question regarding "I would think that the available models would help to segment the data and improve the training for the new models."

How will segmenting the data help with the training of new models? AFAIK SphinxTrain does not take time-stamped transcripts for training.
Hello Pranav, sorry for the late answer.

It doesn't. "Segmenting the data" was not a good way to phrase it; let me explain what I meant in other words.
From what I understand, acoustic model training is an iterative process where the HMM parameters are re-estimated incrementally using a variant of the Baum-Welch algorithm. During each iteration, the probabilities of state occupation at each time frame of an utterance are computed using the current parameters. Then, using those probabilities, the parameters are updated.
If I use an existing model, the parameters of known HMMs will already be of good quality. Therefore, the computed probabilities will be more accurate, which in turn will allow for a better re-estimation of the unknown HMMs.
The state occupation probabilities give an implicit "alignment" of the data, because they say "it is very likely that at this time frame, we were in this state of this HMM" (the HMM representing a phoneme). This is what I meant by "segment the data".
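In formulas (using the usual alpha/beta notation for the forward and backward probabilities of the composite HMM): the occupation probability of state j at frame t is gamma_t(j) = alpha_t(j) * beta_t(j) / sum_k alpha_t(k) * beta_t(k), and taking the argmax over j of gamma_t(j) for every frame t gives exactly that implicit frame-to-state alignment.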
More details can be found in The HTK Book p.10-11 (this is a general description of the training process in the HMM framework and thus not specific to HTK):
Embedded training uses the same Baum-Welch procedure as for the isolated case but rather than training each model individually all models are trained in parallel. It works in the following steps:
1. Allocate and zero accumulators for all parameters of all HMMs.
2. Get the next training utterance.
3. Construct a composite HMM by joining in sequence the HMMs corresponding to the symbol transcription of the training utterance.
4. Calculate the forward and backward probabilities for the composite HMM.
5. Use the forward and backward probabilities to compute the probabilities of state occupation at each time frame and update the accumulators in the usual way.
6. Repeat from 2 until all training utterances have been processed.
7. Use the accumulators to calculate new parameter estimates for all of the HMMs.
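To make that loop concrete, here is a toy sketch of one such pass in Python with NumPy. This is not SphinxTrain code: it uses discrete observation symbols instead of real acoustic features, a fixed left-to-right transition structure, and it only re-estimates the emission distributions, but the control flow follows the seven steps above (the GS phone in the usage example is again my hypothetical glottal-stop symbol).

import numpy as np

def forward_backward(pi, A, B_obs):
    # Scaled forward/backward pass for one utterance (step 4).
    # B_obs[t, j] is the probability of the observed symbol at frame t in state j.
    T, N = B_obs.shape
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    scale = np.zeros(T)
    alpha[0] = pi * B_obs[0]
    scale[0] = alpha[0].sum()
    alpha[0] /= scale[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B_obs[t]
        scale[t] = alpha[t].sum()
        alpha[t] /= scale[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B_obs[t + 1] * beta[t + 1])) / scale[t + 1]
    gamma = alpha * beta                                    # state occupation probabilities (step 5)
    return gamma / gamma.sum(axis=1, keepdims=True)

def composite_hmm(phones, models, S):
    # Join the per-phone HMMs named in the transcription into one left-to-right
    # composite HMM (step 3). Transitions are kept fixed at 0.5 for brevity.
    n = len(phones) * S
    A = np.zeros((n, n))
    for s in range(n - 1):
        A[s, s] = 0.5
        A[s, s + 1] = 0.5
    A[n - 1, n - 1] = 1.0
    B = np.vstack([models[ph] for ph in phones])            # (n, V) emission probabilities
    return A, B

def train_iteration(models, utterances, S, V):
    # One embedded re-estimation pass over all utterances (steps 1-7).
    acc = {ph: np.zeros((S, V)) for ph in models}           # step 1: zeroed accumulators
    for phones, obs in utterances:                          # steps 2 and 6: next utterance
        A, B = composite_hmm(phones, models, S)
        pi = np.zeros(len(phones) * S)
        pi[0] = 1.0                                         # start in the first state of the first phone
        gamma = forward_backward(pi, A, B[:, obs].T)        # steps 4 and 5
        for i, ph in enumerate(phones):
            g = gamma[:, i * S:(i + 1) * S]                 # occupancy of this phone's states
            for t, o in enumerate(obs):
                acc[ph][:, o] += g[t]                       # step 5: update the accumulators
    for ph in models:                                       # step 7: new parameter estimates
        total = acc[ph].sum(axis=1, keepdims=True)
        models[ph] = np.where(total > 0, acc[ph] / np.maximum(total, 1e-12), models[ph])
    return models

# Toy usage: two "known" phones plus the hypothetical new one (GS), 3 states each, 8 symbols.
S, V = 3, 8
rng = np.random.default_rng(0)
models = {ph: rng.dirichlet(np.ones(V), size=S) for ph in ("AE", "K", "GS")}
utterances = [(["K", "AE", "GS"], rng.integers(0, V, size=40)),
              (["AE", "K"], rng.integers(0, V, size=30))]
for _ in range(5):
    models = train_iteration(models, utterances, S, V)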