Dear Community,
I'm working on a research project to provide machine-based scoring of pronunciation for language learning. After a bit of googling, I stumbled upon a handful of papers and projects that base this idea on forced alignment.
I managed to create a demo where the first step is to use sphinx_fe (from sphinxbase 0.8) to extract features (MFC), and the second step is to use sphinx3_align (from sphinx3-0.8) to align the recording with the given text. The resulting .phseg and .wdseg files contain an 'Acoustic Score' for each phoneme or word, which is supposed to say something about how good the samples are. (Correct me if I'm wrong here, but sphinx4 can't do this?)
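For reference, here is a minimal sketch of how the per-phone scores can be pulled out of the .phseg output, assuming the usual sphinx3 layout of "SFrm EFrm SegAScr Phone" columns (check the header line your build actually emits; the filename is a placeholder):

```python
# Minimal .phseg parser sketch. Assumes rows of the form
# "SFrm EFrm SegAScr Phone"; verify against your build's output.
def parse_phseg(path):
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            # Skip the header row and the trailing "Total score:" line.
            if len(parts) < 4 or not parts[0].isdigit():
                continue
            start_frame = int(parts[0])
            end_frame = int(parts[1])
            score = int(parts[2])
            phone = " ".join(parts[3:])  # context-dependent phones print with context
            segments.append((phone, start_frame, end_frame, score))
    return segments

for phone, start, end, score in parse_phseg("utt01.phseg"):
    print(phone, start, end, score)
```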
Though I read in several papers that the acoustic score can be either positive or negative, I'm seeing consistently negative scores in all of my experiments.
I'm wondering what exactly the acoustic score is, and how I can use it as a reference to determine whether the pronunciation is good. Or is it even adequate for that?
Thanks.
Check out this: http://cmusphinx.sourceforge.net/wiki/pronunciation_evaluation
About the sign of the likelihood and what the acoustic score is: the acoustic score per frame is essentially the log value of a continuous probability density function (PDF). In Sphinx's case, that density is a multi-dimensional Gaussian distribution. The acoustic score per phone is then the log likelihood of the phone HMM, and you can extend this definition to the word HMM.
As for the sign: if you think of a discrete probability distribution, this acoustic score should always be negative, because the log of a number between 0 and 1 is negative.
In the case of a Gaussian density, though, when the standard deviation is small it is possible for the density value to be larger than 1 (see also http://blog.stata.com/2011/02/16/positive-log-likelihood-values-happen/). Those are the times you will see a positive value.
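A quick numerical sketch of that point (plain Python, natural log, 1-D Gaussian for simplicity):

```python
import math

def log_gaussian_pdf(x, mean, sigma):
    """Natural log of a 1-D Gaussian density at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)

print(log_gaussian_pdf(0.0, 0.0, 1.0))  # ~ -0.92: density below 1, log is negative
print(log_gaussian_pdf(0.0, 0.0, 0.1))  # ~ +1.38: a narrow Gaussian peaks above 1
```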
One thing that might feel off is the magnitude of the likelihoods you see. Bear in mind that Sphinx2 and Sphinx3 use a very small log base, and that we are talking about a multi-dimensional Gaussian distribution; both make the numerical values much bigger.
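To give a feel for the scale, here is a sketch assuming the default -logbase of 1.0003 (check your configuration):

```python
import math

# Sphinx stores log likelihoods as integers in a tiny log base (1.0003 by
# default), so a modest natural-log value turns into a huge-looking score.
logbase = 1.0003
natural_log_likelihood = -10.0
sphinx_style_score = natural_log_likelihood / math.log(logbase)
print(round(sphinx_style_score))  # ~ -33338
```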
As for determining whether the pronunciation is good, what people usually do is calculate some kind of phone confidence. There should be plenty of papers on that topic; try googling a bit first.
Hope this helps.
Arthur
Hi Arthur
Can you suggest something to calculate phoneme confidence?
A small addition to Arthur's reply: the acoustic score per frame comes from two sources, [1] the Gaussian density and [2] the discrete distribution corresponding to the state transition matrix (the "a" transitions in https://en.wikipedia.org/wiki/File:HiddenMarkovModel.svg). The first one has the larger contribution.
Hey Pranav, thanks for your supplementary answer. You did provide more details for Mr. Shiming.
One point I want to clarify: the acoustic score per phone or per word does come from two sources, since the state-transition (or phone-to-phone transition) probabilities need to be included in it.
On the other hand, the acoustic score per frame usually just means the Gaussian (GMM) density, since the GMM density and the discrete transition probabilities are independent; the latter doesn't depend on the frame vector.
Of course, we might just be talking about the same thing with slightly different terms.
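A toy illustration of the distinction (all numbers are made up):

```python
import math

# Per-frame scores come from the emission density (they depend on the frame
# vector); the per-phone score additionally folds in the frame-independent
# state-transition log probabilities.
emission_logs = [-4.2, -3.1, -5.0]                        # one per frame, from the GMM
transition_logs = [math.log(p) for p in (0.6, 0.6, 0.4)]  # from the HMM transition matrix

per_frame_scores = emission_logs
per_phone_score = sum(emission_logs) + sum(transition_logs)
print(per_frame_scores, per_phone_score)
```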
Arthur
By the way, I put a short explanation post up on my blog. Feel free to take a look if you are interested.
http://grandjanitor.blogspot.com/2013/01/acoustic-score-and-its-signness.html
Arthur
Thanks for making the distinction between per-frame and per-phone. Do the scores in *.stseg come from the continuous Gaussian density alone, or from a combination of the continuous and discrete distributions? I thought it was the latter.
Last edit: Pranav Jawale 2013-01-26
Thank you all very much for the explanation.
I have been reviewing the scores manually, and it turns out there are limitations. For instance, the 'K' in 'activity' or the 'N' in 'men' will most likely get a low score (around -200000), while the 'V' and 'T' in 'activity' often get a high score (-18000 to -10000). It appears that quieter consonants are treated as inferior pronunciation.
Another limitation is that sphinx3_align sometimes dies with a 'final state not reached' error. This happens in cases where a user pronounced a word incorrectly; it looks like sphinx3_align stops searching as soon as it fails to align one word. Is there any way to improve this?
For phone confidence calculation, I've done a bit of searching already. It looks like I should get the TIMIT database as a reference baseline. I wonder if I'm heading in the right direction: should I run feature extraction and alignment on the TIMIT data, calculate the mean and standard deviation of the score for each phoneme, and then compare how far a user's sample deviates from the TIMIT mean to get a pronunciation score?
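Roughly, the comparison I have in mind would look like this sketch (phoneme names and score values are made-up placeholders):

```python
from statistics import mean, stdev

# Hypothetical reference table: per-phoneme acoustic scores collected by
# aligning TIMIT recordings (values here are placeholders, not real data).
timit_scores = {
    "K": [-150000.0, -180000.0, -120000.0],
    "V": [-15000.0, -12000.0, -18000.0],
}

def score_deviation(phoneme, user_score):
    """How many reference standard deviations the user's score sits from the TIMIT mean."""
    ref = timit_scores[phoneme]
    return (user_score - mean(ref)) / stdev(ref)

print(score_deviation("K", -200000.0))  # negative: well below the reference mean
```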
You need to understand that the acoustic score is a joint probability of the model and the observation sequence, and it has no relation to pronunciation quality. A confidence score does, if you construct one, but the acoustic score is not a confidence score.
Increase the beam.
What Nickolay said.
Absolute acoustic scores are not all that meaningful. If you are interested in scores per se, you might be better off looking at the distribution across classes per decoding unit. So if the next-best class is very close to the first-best, that's worse than if the best score is significantly better than the second-best. It's all relative.
But actually that's too complicated. You're better off examining posterior probabilities, i.e. the 'confidence' score.
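As a sketch of what that means in practice (made-up scores on a natural-log scale; Sphinx's integer scores would first need dividing by log(logbase)):

```python
import math

# Turn competing per-unit log-likelihood scores into posterior-like
# confidences with a softmax over the candidate classes.
scores = {"AE": -42.0, "EH": -43.5, "IH": -49.0}  # made-up log likelihoods

best = max(scores.values())
exps = {k: math.exp(v - best) for k, v in scores.items()}
total = sum(exps.values())
posteriors = {k: e / total for k, e in exps.items()}

print(posteriors)  # a clear winner yields high confidence; close rivals do not
```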
The 'final state not reached' condition, in my experience, is often attributable to bad end-pointing. That is, the recognition process assumes that all utterances end in silence; specifically, in a transition to a silence state. If the end of the utterance is noisy, or cuts off speech, so that there is no good silence, you get this error. I would suggest looking at the endpointing first.
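If the recordings do cut off abruptly, one low-tech fix is simply to pad trailing silence onto the file. A minimal sketch for 16-bit mono WAVs (the filename is a placeholder):

```python
import wave

# Append half a second of digital silence to a 16-bit mono WAV so the
# utterance ends in a clean silence region for the aligner.
with wave.open("utt01.wav", "rb") as src:
    params = src.getparams()
    frames = src.readframes(src.getnframes())

silence = b"\x00\x00" * int(params.framerate * 0.5)  # 0.5 s of 16-bit samples

with wave.open("utt01_padded.wav", "wb") as dst:
    dst.setparams(params)
    dst.writeframes(frames + silence)
```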
Alex
Thank you all for your help. I'll experiment with the beam settings and try to make sure all utterances end in silence.
[Update]: I tried fading the utterances out using `sox fade 1s 0 1s`. A weird thing happened: sphinx3_align spends a long time on each one and then dies with 'final state not reached'. Fade-in is okay, but fade-out causes this problem. Usually sphinx3_align spends 1-2 seconds per utterance, but for the faded-out ones it spends up to 10 seconds.
About the 'confidence' score, I'm wondering whether using the acoustic scores of TIMIT data is a good start for evaluation?
Last edit: He Shiming 2013-01-28