Hello,
I would like to ask why the confidence score is not implemented for JSGF grammars (I guess because it is hard): ps_getProb always returns zero.
Since it is implemented for language models, what makes it hard (or impossible?) to implement it for grammars too?
Thanks in advance
Confidence estimation requires a good score for an alternative path. With a grammar of only a few words, it is hard to make a reliable estimate. A language model with few words does not work well either; currently you need a large vocabulary to estimate confidence reliably. Alternatively, you need to introduce a phone loop, as we did in keyword spotting.
Second, confidence estimation with a small vocabulary is still a research topic; for example, short words of 1-2 syllables are very hard to detect reliably with the current approach.
Third, grammars are most likely not a priority for pocketsphinx, because it is hard to build a good voice interface with grammars. In the future, pocketsphinx will probably support only large language models and small language models built from example phrases.
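To illustrate why an alternative path matters, here is a minimal Python sketch that turns a best-path score and a set of competing path scores into a posterior-style confidence. It is an illustration under stated assumptions, not pocketsphinx's actual confidence computation: scores are assumed to be natural-log likelihoods, whereas pocketsphinx reports scores in its own scaled log base. With no competing paths the result is always 1.0, which shows why a grammar with nothing to compete against gives no useful estimate.

```python
import math
from typing import Sequence


def path_confidence(best_score: float, alternative_scores: Sequence[float]) -> float:
    """Posterior-style confidence of the best path against its competitors.

    Assumption: scores are natural-log likelihoods (higher is better).
    Computes exp(best) / (exp(best) + sum(exp(alt))) using the
    log-sum-exp trick so very negative scores do not underflow.
    """
    all_scores = [best_score] + list(alternative_scores)
    peak = max(all_scores)
    # log(sum(exp(s))) = peak + log(sum(exp(s - peak)))
    log_total = peak + math.log(sum(math.exp(s - peak) for s in all_scores))
    return math.exp(best_score - log_total)
```

For example, a best path scoring -105.0 against a single competitor (such as a phone loop) scoring -100.0 gets a confidence of about 0.0067, while the same best path with no competitors gets 1.0 regardless of how well it matches the audio.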
Thanks for the quick response. I understand the constraints (within the limits of my knowledge of speech recognition engines).
I am currently developing an application that uses pocketsphinx with JSGF grammars for In-Vehicle Infotainment systems, and it is working quite well. However, when background noise (especially voices) is present and the user is not saying anything, the recognizer still returns a non-null hypothesis most of the time.
I tried to improve the robustness of the recognizer by tuning some parameters (mainly increasing vad_startspeech, vad_threshold and maxhmmpf) and saw some improvements, but being able to set a confidence threshold would have been a great solution. I am now thinking of adapting the Italian acoustic model with audio recorded in the vehicle (background noise and voices coming from the street). Do you think this could bring significant improvements?
We recommend keyword spotting mode for continuous listening. It can be tuned to avoid false alarms.
The current Italian model is pretty basic; you need much more data to make it reliable. When training a new model, you can include common noises in the training data.
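For the keyword spotting recommendation, pocketsphinx reads keyphrases from a list file (passed with the -kws option) where each line holds a phrase followed by its detection threshold between slashes, for example `oh mighty computer /1e-40/`. The minimal Python sketch below renders such a file; the phrases and threshold values are hypothetical, every word must exist in the pronunciation dictionary, and each threshold has to be tuned on test data to balance missed detections against false alarms.

```python
from typing import Dict


def format_kws_list(thresholds: Dict[str, float]) -> str:
    """Render keyphrases in the "phrase /threshold/" per-line format
    of a pocketsphinx keyword-spotting list file (the -kws option).

    Dict insertion order is preserved, so lines appear in the order given.
    """
    lines = []
    for phrase, threshold in thresholds.items():
        # Thresholds are tiny probabilities; the :g format prints them
        # compactly in exponent form, e.g. 1e-40.
        lines.append(f"{phrase} /{threshold:g}/")
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file gives a list that can be tuned one phrase at a time without touching the code.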
Actually, recognition is triggered with a button, so keyword spotting is not necessary. I guess I'll restrict the grammar to significantly long phrases to reduce false alarms, or maybe use two microphones with audio subtraction to achieve directionality.
Anyway, the Italian acoustic model works very well with grammars, despite being basic.
I just hope you will not move away from grammars entirely in the near future; I think they are fundamental for embedded systems used while driving (or for any kind of software/application).
Last edit: Carlo Benussi 2017-02-03
This would be a step in the right direction.
What about creating two recognizers: one with the grammar I actually need, and one with a filler grammar containing the most common short words in Italian (using a recursive grammar rule)?
Then I could send the audio input to both recognizers, get the scores (from ps_get_hyp), and compare them. If my grammar's score is higher than the filler grammar's score, the result is accepted; otherwise it is rejected. Does this seem feasible?
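The accept/reject rule described above can be sketched in a few lines of Python. This assumes both scores are the path scores that ps_get_hyp reports for the same utterance (log domain, higher is better); the `margin` parameter is a hypothetical tuning knob, not a pocketsphinx option, that requires the grammar to win by a given amount before accepting.

```python
from typing import Optional


def accept_hypothesis(grammar_hyp: Optional[str], grammar_score: int,
                      filler_score: int, margin: int = 0) -> Optional[str]:
    """Return the grammar hypothesis if its path beats the filler path.

    Assumption: both scores come from decoding the same audio and are
    log-domain path scores where a higher (less negative) value is better.
    With margin=0 this is exactly "accept if the grammar score is bigger".
    """
    if grammar_hyp is None:
        # The grammar decoder produced no hypothesis at all.
        return None
    if grammar_score - filler_score > margin:
        return grammar_hyp
    # The filler path explains the audio at least as well: reject.
    return None
```

A positive margin trades some missed commands for fewer false alarms; its value would have to be found experimentally on recordings from the vehicle.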
That is OK. You do not even need two recognizers; you can combine both grammars into a single one with two branches.
The only question is how fast and how reliable it would be. That requires experiments.
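A single grammar with two branches could be generated as in the minimal Python sketch below: one branch lists the real commands, the other is a recursive filler loop, and the decoded hypothesis shows which branch won. The grammar name, commands and filler tokens are hypothetical; every filler token must be an entry in the pronunciation dictionary, and the branch check assumes no command consists only of filler tokens.

```python
from typing import Iterable


def build_combined_jsgf(name: str, commands: Iterable[str],
                        filler_words: Iterable[str]) -> str:
    """Build a JSGF grammar whose public rule is either a command or filler.

    The filler rule uses "+" (one or more repetitions), so it can absorb
    speech or noise of any length that is not a command.
    """
    command_alts = " | ".join(commands)
    filler_alts = " | ".join(filler_words)
    return (
        "#JSGF V1.0;\n"
        f"grammar {name};\n"
        "public <utterance> = <command> | <filler>;\n"
        f"<command> = {command_alts};\n"
        f"<filler> = ({filler_alts})+;\n"
    )


def is_command(hyp: str, filler_words: Iterable[str]) -> bool:
    """True if the hypothesis contains any word outside the filler set,
    i.e. the decoder most likely took the <command> branch."""
    filler = set(filler_words)
    return any(word not in filler for word in hyp.split())
```

For example, `build_combined_jsgf("car", ["accendi la radio", "spegni la radio"], ["a", "e", "o"])` yields a grammar whose hypotheses can be passed to `is_command` to accept or reject the result, without running a second decoder.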
I tried using two recognizers as described above (the filler grammar has only one recursive rule, which is an alternative between all the phonemes), and the scoring results are quite good. But, as you pointed out, recognition is now much slower.
Which parameters of the filler recognizer can I tune to make recognition over the filler grammar faster (even if less reliable)? I was thinking about lowering maxhmmpf and/or maxwpf for pruning, but I am far from sure since it is not my field.
Thanks for your help, and sorry for bothering you so much.
Last edit: Carlo Benussi 2017-02-07
All parameters are described here:
http://cmusphinx.sourceforge.net/wiki/pocketsphinxhandhelds
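As a starting point for the filler decoder, the sketch below collects tighter pruning options and turns them into command-line arguments. The option names are real pocketsphinx flags, but the values are guesses to be tuned by experiment, not recommendations; whether every option (in particular -maxwpf) affects grammar (FSG) search is an assumption to verify.

```python
from typing import Dict, List

# Hypothetical starting values for a faster, less accurate filler decoder.
# Larger beam values (closer to 1) prune more aggressively than the defaults.
FAST_FILLER_OPTIONS: Dict[str, str] = {
    "-maxhmmpf": "3000",  # cap on active HMMs kept per frame
    "-maxwpf": "5",       # cap on distinct word exits per frame (may not affect FSG search)
    "-beam": "1e-20",     # main search beam
    "-pbeam": "1e-20",    # beam applied to phone transitions
    "-wbeam": "1e-15",    # beam applied to word exits
    "-ds": "2",           # compute acoustic scores on every 2nd frame only
}


def to_argv(options: Dict[str, str]) -> List[str]:
    """Flatten {flag: value} into ["flag", "value", ...] in insertion order."""
    argv: List[str] = []
    for flag, value in options.items():
        argv.extend([flag, value])
    return argv
```

These arguments would be added to the filler decoder's command line next to the usual -hmm, -dict and -jsgf options. Changing one value at a time makes it easier to see which option buys speed at the smallest cost in scoring reliability.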