When speaking strings of digits or alphabetic characters, users sometimes pause in the middle of the sequence. For example, when speaking a North American phone number, users will often pause after the first three digits and prior to the last four digits. I am having trouble with Sphinx inserting arbitrary numbers after recognizing a partial string when users pause in the middle of speaking a string of digits.
VoiceXML has some parameters for dealing with this type of problem, including the following:
(a) incomplete timeout (the required length of silence following user speech after which the recognizer returns an incomplete match of a grammar)
(b) sensitivity (how sensitive the recognizer is to quiet input versus background noise)
(c) confidence level (adjust the level of acceptance of the ASR-generated confidence level)
Are there equivalent parameters for Sphinx? Do you have recommendations for solving my problem of recognizing invalid strings when the user pauses?
Hm, it depends on the language model you are using. Can't you just insert a filler there?
As for confidence, sensitivity, and so on: there are silpenalty and the word insertion penalty, and the frontend can be tuned for the amount of silence too, but such changes aren't as easy as in commercial recognizers.
Can you provide a small test set for this problem? We can probably try to tune the recognition rate. Are you testing on TIDIGITS?
I don't understand how to insert a filler. Can you advise me on how I would do this (or point me to any relevant literature)?
Just give me the recording and I'll show you. I don't think sphinx4 configuration is described anywhere in the literature except the javadoc files:
http://cmusphinx.sourceforge.net/sphinx4/javadoc/index.html
I have put together a small test set. What is the best way to send you the files?
Upload it to mediafire.com and post a link.
http://web.cecs.pdx.edu/~ekuo/tenninetest.tar.gz
You just need to set a wider beam to get good accuracy. Take the wavfile demo and change it to use the wsj model. Set the relative beam width to 1e-120 and the word insertion probability to 1e-40. Also use a simple grammar with all the words in a loop. With those settings, everything is recognized correctly.
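For anyone following along, here is a rough sketch of where those two settings would usually go. I am assuming the wavfile demo's config.xml uses the same global property names as the other sphinx4 demo configs (relativeBeamWidth and wordInsertionProbability), so check your own file before copying. Note that a smaller relative beam value means a wider beam, because fewer paths get pruned.

```xml
<!-- Sketch only: global properties near the top of a sphinx4 config.xml. -->
<!-- The property names are assumed to match the ones your demo config already uses. -->
<config>
    <!-- Smaller value = wider beam = fewer hypotheses pruned (slower, more accurate). -->
    <property name="relativeBeamWidth" value="1E-120"/>
    <!-- Lower value = stronger penalty for inserting extra words. -->
    <property name="wordInsertionProbability" value="1E-40"/>
</config>
```

The components further down in the config file normally pick these up through ${relativeBeamWidth} and ${wordInsertionProbability} references, so changing the two values here should be enough if your config follows that pattern.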
Thank you for all your help. I will try that and see how it works.
Correct link:
http://www.mediafire.com/?zjtdrlnjto0
Check the complete example here:
http://www.mediafire.com/upload_complete.php?id=iegimlwbsla
Of course, good recognition performance alone isn't enough to build a stable system.
As for confidence, it doesn't work with JSGF grammars. You can use a trigram language model instead, as in the confidence demo. Timeouts can also be handled once you insert a speech marker and a non-speech data filter with the corresponding object properties.
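A minimal sketch of what the endpointing part of the frontend might look like in the config file. The component types are the ones in the edu.cmu.sphinx.frontend.endpoint package; the property names (threshold, endSilence) and the values shown are assumptions for illustration and should be verified against the javadoc for your sphinx4 version.

```xml
<!-- Sketch only: endpointing components for a sphinx4 frontend. -->
<!-- Property names and values are illustrative assumptions; verify them in the javadoc. -->
<config>
    <!-- Labels each audio frame as speech or non-speech based on its energy. -->
    <component name="speechClassifier"
               type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier">
        <property name="threshold" value="13"/>
    </component>

    <!-- Marks utterance start and end; endSilence is assumed to be the silence -->
    <!-- (in milliseconds) required before the utterance is considered finished. -->
    <component name="speechMarker"
               type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker">
        <property name="endSilence" value="500"/>
    </component>

    <!-- Drops the frames that fall outside the marked speech regions. -->
    <component name="nonSpeechDataFilter"
               type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>
</config>
```

These components also have to be listed, in this order, in the frontend's pipeline right after the audio source and data blocker. A longer end-of-speech silence would play roughly the role of the VoiceXML incomplete timeout, so a pause between digit groups is less likely to end the utterance early.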