Hi, dear all,
I am working on a pronunciation-evaluation task. I need to evaluate whether speakers pronounce words well, given a "standard" model. My idea is to do forced alignment (I have the transcript for each speech recording) and get the probability p(o|model), i.e. the likelihood. However, I see that Sphinx only outputs an acoustic score, which is a "normalized" state likelihood plus a transition probability.
Now I just want to know whether it is OK to use this score directly to evaluate pronunciation. To clarify: I have the transcripts and only want to evaluate the words in the given text.
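To make the question concrete, here is a toy illustration (not Sphinx code, and all numbers are made up) of what forced alignment computes: the transcript constrains the HMM to a left-to-right state sequence, and Viterbi returns the best path's log-likelihood log p(o|W, model).

```python
import math

# Hypothetical per-frame emission log-probabilities: rows = frames,
# columns = states of the left-to-right HMM built from the known transcript.
emission = [
    [-1.0, -5.0, -6.0],
    [-1.2, -2.0, -5.0],
    [-4.0, -1.1, -4.5],
    [-5.0, -1.3, -2.0],
    [-6.0, -4.0, -0.9],
]
LOG_STAY, LOG_NEXT = math.log(0.6), math.log(0.4)  # assumed transition probs

def forced_align_loglik(emission):
    """Viterbi log-likelihood of the best path through a left-to-right HMM,
    forced to start in the first state and end in the last state."""
    n_states = len(emission[0])
    NEG_INF = float("-inf")
    v = [emission[0][0]] + [NEG_INF] * (n_states - 1)
    for frame in emission[1:]:
        new_v = []
        for s in range(n_states):
            stay = v[s] + LOG_STAY
            enter = v[s - 1] + LOG_NEXT if s > 0 else NEG_INF
            new_v.append(max(stay, enter) + frame[s])
        v = new_v
    return v[-1]

print(forced_align_loglik(emission))
```

The acoustic score Sphinx reports corresponds to this path score (in its own scaled log domain), which is why it mixes state likelihoods and transition probabilities.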
Thank you.
It is not clear what you mean by "OK".
It is also not clear what you mean by "use". Since you do not describe the algorithm in detail, it is hard to evaluate it.
Thank you.
I would determine some threshold on this score and say that a word is pronounced well if its score is above the threshold, and pronounced badly if its score is below it.
Of course, I will work on how to choose the threshold. For now I just want to verify that the idea can work.
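The decision rule I have in mind is just this (the scores and the threshold below are made-up placeholders):

```python
# Minimal sketch of the thresholding idea with hypothetical per-word
# acoustic scores (log-domain, so larger / less negative = better fit).
def classify(word_scores, threshold):
    """Label each word 'good' or 'bad' by comparing its score to a threshold."""
    return {w: ("good" if s >= threshold else "bad") for w, s in word_scores.items()}

scores = {"hello": -1200, "world": -4100}  # made-up per-word scores
print(classify(scores, threshold=-2500))   # {'hello': 'good', 'world': 'bad'}
```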
Thank you.
Are you "tfpeach" same as "qiqi"?
The score is a measure of fit between the model and the data. A perfectly pronounced word can get a worse score than a badly pronounced one, simply because it does not fit the model. The score would be best for speakers from the training database, not for the speakers who pronounce words properly.
Another issue is that the score is computed over the whole utterance. If someone mispronounces just a single phone, the score difference will be small; if there is noise but the pronunciation is perfect, the overall score will be very bad.
So there are disadvantages in the approach you selected.
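One way to mitigate the whole-utterance problem is to score each aligned segment separately and normalize by its duration, so a single bad phone stands out instead of being averaged away. A sketch with hypothetical alignment output:

```python
# Per-segment scoring instead of one utterance-level score.
# Each segment is (phone, n_frames, total_log_score) from a hypothetical alignment.
segments = [("HH", 8, -320.0), ("AH", 10, -390.0), ("L", 6, -900.0), ("OW", 12, -470.0)]

def per_frame_scores(segments):
    # Dividing by duration keeps long phones from dominating the comparison.
    return {ph: total / n for ph, n, total in segments}

norm = per_frame_scores(segments)
worst = min(norm, key=norm.get)
print(norm, worst)  # "L" stands out even though the utterance total may look fine
```

This does not fix the speaker-mismatch and noise issues, but it at least localizes where the score is bad.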
Thank you.
Sorry for the confusion about the user name. When I posted the first time, I was using someone else's computer, so it showed his name.
I see your point, and yes, I agree there will be disadvantages. However, if I assume the trained model is a "standard" model, then speech that does not fit this model can be treated as badly pronounced.
Regarding the noise: it is a problem because it affects the acoustic features. I had better develop an approach that adjusts the threshold according to the environment.
By the way, do you have any suggestions on this topic? Thank you very much!
Please read the theory first before asking questions.