When I use Sphinx to recognize a sentence, how can I get a confidence score for every word?
Is it the acoustic score (AScr(UnNorm))?
And what is the meaning of the AScr(UnNorm) value?
However, I should mention that's done at the utterance level.
For word-level confidence scoring, word posterior probabilities are the generally accepted way to do things, although they don't work particularly well. There is code for this in Sphinx3, but I am not sure it is correct: it doesn't actually give you probabilities, just some magic numbers, and it contains a lot of mysterious scaling factors. The guy who wrote it was using its output as input to a neural network classifier, so he didn't really care whether it was correct as long as it gave good results.
See the code in sphinx3/src/libs3decoder/libconfidence if you dare. Or read the original paper: http://citeseer.ist.psu.edu/wessel98using.html
The most effective way to get actual word posterior probabilities is to dump out HTK format lattices (-outlatfmt htk -outlatdir .) from Sphinx3, then run SRILM's rescoring tool on them. SRILM is not free software but it's freely available for research purposes: http://www.speech.sri.com/projects/srilm/
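In case it helps to see what the lattice rescoring is computing, here is a toy sketch in plain Python (not Sphinx3 or SRILM code; the lattice, words, and scores are all invented, and real lattices carry separate acoustic and LM scores plus scaling factors) of how word posteriors fall out of a forward-backward pass over lattice arcs:

```python
import math
from collections import defaultdict

# Toy word lattice: arcs of (from_node, to_node, word, log score), with the
# combined acoustic + language model score collapsed into one made-up number.
# Node 0 is the start, node 3 the end; "quick" and "quit" compete for the
# same span, so their posteriors should sum to 1.
arcs = [
    (0, 1, "the",   math.log(1.0)),
    (1, 2, "quick", math.log(0.6)),
    (1, 2, "quit",  math.log(0.3)),
    (2, 3, "fox",   math.log(1.0)),
]

def arc_posteriors(arcs, start, end):
    """Posterior of each arc = forward mass into it * arc score * backward
    mass out of it, divided by the total mass over all paths.
    Assumes arcs are listed in topological order of their nodes."""
    fwd = defaultdict(float)
    fwd[start] = 1.0
    for f, t, _, s in arcs:
        fwd[t] += fwd[f] * math.exp(s)
    bwd = defaultdict(float)
    bwd[end] = 1.0
    for f, t, _, s in reversed(arcs):
        bwd[f] += math.exp(s) * bwd[t]
    total = fwd[end]
    return [(w, fwd[f] * math.exp(s) * bwd[t] / total) for f, t, w, s in arcs]

for word, p in arc_posteriors(arcs, 0, 3):
    print(word, round(p, 3))
```

Words on the only path through their region get posterior 1.0, while the competing "quick"/"quit" arcs split the probability mass in proportion to their scores.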
About confidence, see
https://sourceforge.net/forum/forum.php?thread_id=1847237&forum_id=5470
AScr is just a score from the acoustic model: the probability of the observation sequence given our HMM. What else could it be?
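For what it's worth, here is a toy illustration (the HMM parameters are entirely made up) of why a log-domain acoustic score computed from true probabilities comes out negative:

```python
import math

def forward_log_prob(pi, A, B, obs):
    """Forward algorithm for a discrete-emission HMM, returning
    log P(obs | model). With true probabilities this is always <= 0,
    since P(obs) <= 1."""
    n = len(pi)
    # initialize with the first observation
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for t in range(1, len(obs)):
        alpha = [
            sum(alpha[j] * A[j][i] for j in range(n)) * B[i][obs[t]]
            for i in range(n)
        ]
    return math.log(sum(alpha))

# Toy 2-state HMM over 2 observation symbols (all numbers invented)
pi = [0.6, 0.4]                   # initial state probabilities
A = [[0.7, 0.3], [0.4, 0.6]]      # transition probabilities
B = [[0.9, 0.1], [0.2, 0.8]]      # emission probabilities
score = forward_log_prob(pi, A, B, [0, 1, 0])
print(score)  # negative, because the sequence probability is < 1
```

(A real decoder works in the log domain throughout to avoid underflow on long utterances; the toy version above only takes the log at the end.)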
Thanks!
Yes, AScr is the probability of the observation sequence in the log domain. In my opinion, the probability must be less than 1, so AScr should be negative.
But when I use Sphinx3, the AScr output is sometimes a positive number, and I don't know why.
Hi,
This is a "feature" of Sphinx3. The actual observation probabilities that are used in search are not really probabilities, they are actually Gaussian densities. While the area under a Gaussian integrates to 1.0, the actual density value at any point can be greater than 1.0 (in fact, as the variance approaches zero, the density value for the mean approaches infinity).
Sphinx2 always normalizes Gaussian densities so that they appear to be probabilities. Sphinx3, for some reason, does not. Therefore the acoustic score can be positive.
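A quick numerical illustration in Python (the values are chosen just to show the effect):

```python
import math

def gaussian_log_density(x, mean, sigma):
    """Log of the Gaussian density N(x; mean, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) \
           - (x - mean) ** 2 / (2 * sigma ** 2)

# The area under the curve is always 1, but the density value at a point
# can exceed 1 when the variance is small -- so the log "score" goes positive.
print(gaussian_log_density(0.0, 0.0, 1.0))  # negative: density ~ 0.399
print(gaussian_log_density(0.0, 0.0, 0.1))  # positive: density ~ 3.989
```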
Thank you all.
But where can I find some documentation for this? For example, some formulas.
Thanks again.
> But where can I find some documentation for this? For example, some formulas.
There are a lot of articles and books on HMMs, and on Sphinx in particular. For example, fast GMM search is described here:
http://www.cs.cmu.edu/~jsherwan/pubs/icslp2004.pdf
but it's often hard to establish the relationship between the formulae and the code, and to combine them in a single document :(
In Sphinx4, I ended up grabbing all the confidence features I could find (from the acoustic model, the language model, etc.) and throwing them into a classifier that I trained on transcribed data. Basically what they describe in this paper: http://citeseer.ist.psu.edu/hazen00recognition.html
I just played around with classifiers in Weka, and we're getting about 85% accurate accept/reject decisions. The most useful feature is the span of the parse of the utterance: if a lot of it parsed, it was probably correct.
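A rough sketch of the shape of this in plain Python (the features, data, and the logistic-regression classifier here are all invented for illustration; Weka gives you many more classifiers to play with):

```python
import math

# Toy accept/reject classifier over per-utterance confidence features,
# in the spirit of the feature-combination approach described above.
# Invented features: [normalized acoustic score, LM score,
# fraction of the utterance covered by the parse]. Label 1 = accept.
data = [
    ([0.9, 0.8, 1.0], 1), ([0.8, 0.7, 0.9], 1), ([0.7, 0.9, 1.0], 1),
    ([0.3, 0.4, 0.2], 0), ([0.2, 0.5, 0.1], 0), ([0.4, 0.3, 0.0], 0),
]

def train_logreg(data, lr=0.5, epochs=500):
    """Plain logistic regression trained with stochastic gradient descent."""
    w = [0.0] * len(data[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def accept(x, w, b):
    """Accept the hypothesis if the decision function is positive."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

w, b = train_logreg(data)
accuracy = sum(accept(x, w, b) == bool(y) for x, y in data) / len(data)
print(accuracy)
```

With real data you would of course evaluate on a held-out set rather than the training set, and let the classifier tell you which features (like parse span) carry the most weight.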
Stefanie
Yes, that is the best approach. The probabilities the recognizer gives you (even posterior probabilities) are not very reliable for confidence scoring even with smart thresholding. This is pretty much the same approach that the Ravenclaw/Olympus dialog framework here at CMU uses - there is a confidence agent called Helios which integrates dialog, parsing, and ASR information to do accept/reject decisions. See: http://reports-archive.adm.cs.cmu.edu/anon/2002/CMU-CS-02-190.ps