Hi, I am working on boundary detection of syllables. Besides the acoustic source, we have another source which can provide the posterior probability of the syllables given the observed features.
We would like to incorporate this probability into the Viterbi decoding to improve the boundary detection. Where can I find the decoding part of the source code?
Thank you.
It is easier to integrate this new parameter into the feature vector; there is no need to modify the decoding algorithm itself. You just modify the features and the acoustic model. It is not clear whether Gaussian classifiers will still work well for you, though.
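As a concrete illustration of that suggestion, here is a minimal Python sketch of appending the per-frame syllable posterior from the second source as one extra dimension of the acoustic feature vector. This is not Sphinx code; the array names and the log-odds transform are assumptions made for the sketch.

import numpy as np

def augment_features(mfcc, syl_posterior):
    # mfcc: (num_frames, num_ceps) array of acoustic features
    # syl_posterior: (num_frames,) array of syllable posteriors in [0, 1]
    assert mfcc.shape[0] == syl_posterior.shape[0], "frame counts must match"
    # A log-odds transform avoids the hard [0, 1] bounds, which tends to suit
    # Gaussian models better than the raw probability does.
    eps = 1e-6
    p = np.clip(syl_posterior, eps, 1.0 - eps)
    log_odds = np.log(p / (1.0 - p))
    # Append as one extra dimension: result is (num_frames, num_ceps + 1)
    return np.hstack([mfcc, log_odds[:, None]])

Whether a Gaussian mixture still models the augmented vector well is exactly the caveat mentioned above.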
Right now our ASR performs well in terms of recognition. The only problem is the syllable boundary detection. I attach an image here: it shows the ground-truth boundaries and the boundaries detected by the ASR (acoustic source only). The green ones are the ground truth, the red ones are the detected ones. You can see that the decoder does not seem able to find small gaps in the utterance; each segment always starts at the end of the previous one. I am worried that even if I add the other source as a feature, it will still have this problem.
Do you know why we have this problem? (The ASR performs well, but the boundary detection is terrible.)
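For what it is worth, the mismatch visible in the image can be quantified with a small helper like the one below (purely illustrative; the function name, the 20 ms tolerance and the example times are made up): count how many ground-truth boundaries have a detected boundary within a tolerance window.

def boundary_hits(truth, detected, tol=0.02):
    # truth, detected: lists of boundary times in seconds
    # returns the number of ground-truth boundaries matched within tol
    hits = 0
    for t in truth:
        if any(abs(t - d) <= tol for d in detected):
            hits += 1
    return hits

# Example: if truth = [0.31, 0.74, 1.12] and detected = [0.30, 0.80, 1.13],
# boundary_hits(truth, detected) returns 2 (the boundary near 0.74 s is missed).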
An HMM model is not really guaranteed to converge to a proper detector of the phonemes, in particular if it cannot discriminate well or has no information from which to learn the real phone boundaries. This is fairly obvious if you consider the algorithm itself: it tries to optimize the probability of the whole result and does not care about particular phonemes.
The ways to improve alignment quality for an HMM system:
1) Design the database so that the variety of sound combinations is sufficient for the system to learn what each sound really means. For example, mix the words "pataka", "katapa" and "pakata" in your training database, or have separate "pa", "ta" and "ka" items.
2) Initialize the models with manual segments. This is quite helpful and can be done in HTK with HRest, for example (see the label-file sketch after this list).
3) Use more distinctive features; for example, use voicing features to make it easy for the model to distinguish /a/ from the plosives.
4) Have enough training data.
5) Apply MPE training to minimize the phone error.
There are many ways; you just need to be more careful with your training data and training process.
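To make point 2 concrete: HTK's bootstrap tools (HInit/HRest) read per-utterance label files whose start and end times are given in units of 100 ns. A small Python sketch of writing such a .lab file from manual syllable segments (the segment list and the file name here are just placeholders) could look like this:

def write_htk_lab(segments, path):
    # segments: list of (start_sec, end_sec, syllable) tuples
    # HTK label files use times in units of 100 ns, one segment per line
    with open(path, "w") as f:
        for start, end, name in segments:
            f.write(f"{round(start * 1e7)} {round(end * 1e7)} {name}\n")

write_htk_lab([(0.00, 0.31, "pa"), (0.31, 0.74, "ta"), (0.74, 1.12, "ka")],
              "utt001.lab")

HRest can then re-estimate each syllable model from the frames covered by its labels.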
We are exploring some discriminative features for that. We would also like to initialize the models with manual segments during training. However, how can I do that in Sphinx?
I have already got some manually labeled transcriptions with syllable timing boundaries, but I don't know where I should provide these transcriptions during the training.
Thank you.
Does anyone else know how to initialize models with manual segments in Sphinx?
Thanks.
It is not possible to initialize the models from manual segment labels with CMU Sphinx. As I wrote above, it is possible with HTK.
Thank you.
I will try that with HTK.