Hello everyone, I'm currently trying to use pocketsphinx as part of a project. I want to run it in continuous kws mode while picking up certain keywords (around 50). I intend to have this project be used by multiple people in a loud setting. I've already run tests and the results are varied. The accuracy is all over the place depending on how loudly and clearly the speaker is speaking and, of course, on tinkering with the parameters. So here are a few questions:
1- Theoretically, can pocketsphinx (in continuous kws mode) have an accuracy close to 100% for multiple different users without any training?
2- The parameters I'm using to modify accuracy are kws_threshold, kws_delay, and individual thresholds for each word. Are there any other parameters that would help me improve accuracy?
3- I'm using a lapel microphone for stage use in order to cancel out the background noise and it works great, but is there a specific microphone type or hardware in general that might boost accuracy that long-time users here can recommend?
4- How do I maintain consistent accuracy?
Last edit: Bahgat A 2018-02-18
1- Theoretically, can pocketsphinx (in continuous kws mode) have an accuracy close to 100% for multiple different users without any training?
Keyword spotting is only supposed to work for 2-3 phrases. For 50 phrases spoken by many people, you need to run a good large-vocabulary recognizer and simply process its output.
2- The parameters I'm using to modify accuracy are kws_threshold, kws_delay, and individual thresholds for each word. Are there any other parameters that would help me improve accuracy?
When the overall approach is wrong, parameters do not matter.
3- I'm using a lapel microphone for stage use in order to cancel out the background noise and it works great, but is there a specific microphone type or hardware in general that might boost accuracy that long-time users here can recommend?
4- How do I maintain consistent accuracy?
The microphone does not matter much. The neural networks in the recognizer are much more important.
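To illustrate the "process the output" part of that advice, here is a minimal sketch of scanning a recognizer transcript for command phrases. The command list and the find_commands helper are made up for illustration; the transcript string would come from whichever large-vocabulary recognizer you end up using.

```python
import re

# Hypothetical command list; in practice this would be the ~50 project phrases.
COMMANDS = ["start engine", "stop engine", "lights on"]


def find_commands(transcript, commands=COMMANDS):
    """Return (word_index, command) pairs found in a recognizer transcript."""
    # Normalize to lowercase words so matching ignores case and punctuation.
    words = re.findall(r"[a-z']+", transcript.lower())
    found = []
    for i in range(len(words)):
        for command in commands:
            parts = command.split()
            # Compare the command against the word window starting at i.
            if words[i:i + len(parts)] == parts:
                found.append((i, command))
    return found


if __name__ == "__main__":
    # The transcript is assumed to be the recognizer's text output.
    print(find_commands("please start engine and then turn the lights on"))
```

Exact word matching like this is only a starting point; misrecognized words would need fuzzier matching, which is not shown here.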
I have the same question. In addition to the relatively low accuracy (and I have only about 20 (!) keywords, which often simply get "ignored"), I have a long delay after I finish pronouncing a keyword, up to 4-5 seconds on average. It seems like the silence recognition is not well tuned. What can I do about these issues?
Thank you!
Keyword spotting is not evaluated by accuracy but rather by false alarm/false rejection rates. A good working point requires careful tuning and good acoustic models. Silence is not very relevant here. To get help with the accuracy, you'd better provide test data.
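As a rough sketch of what measuring those two rates on test data could look like: the kws_rates helper below is hypothetical, and it makes a simplifying assumption by matching detections to spoken keywords by count only, ignoring timestamps.

```python
def kws_rates(reference, detected, audio_hours):
    """Compute the false rejection rate and false alarms per hour.

    reference:   keywords actually spoken in the test recording
    detected:    keywords reported by the spotter
    audio_hours: length of the test recording in hours
    """
    remaining = list(reference)
    false_alarms = 0
    for word in detected:
        if word in remaining:
            remaining.remove(word)  # matched one real occurrence
        else:
            false_alarms += 1       # reported but never spoken
    missed = len(remaining)         # spoken but never reported
    return {
        "false_rejection_rate": missed / len(reference) if reference else 0.0,
        "false_alarms_per_hour": false_alarms / audio_hours,
    }


if __name__ == "__main__":
    spoken = ["forward", "stop", "forward", "left"]
    heard = ["forward", "stop", "right"]
    print(kws_rates(spoken, heard, audio_hours=0.5))
```

Running the same test recording at several threshold values and comparing the two numbers is one way to look for a working point.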
Nickolay, thank you for your response. This is the first time I have dealt with voice recognition, so I am definitely a newbie in the field. What I am trying to do is add a command recognition option to our equipment control application. For that purpose I selected two sets of the most relevant commands: the "basic" set, containing 28 commands, and the "extended" set, with approximately 50 commands. At this time I am working with the basic set and, like I mentioned, it works, but (sorry for the wrong terminology!) the erroneous recognition rate (including reactions to environmental noise) is relatively high, as is the failure to recognize actual commands; in statistical terms, both type I and type II errors occur pretty often. And the delay after a correctly recognized command is, like I mentioned, usually pretty long. What would you recommend I tune, and where can I learn these procedure(s)? What kind of test data can I provide to clarify the problem?
Thank you once again for your support,
Mike Faynberg
Hi Nickolay, on an additional note: how can I reduce the delay between a command and its recognition? Is it described somewhere?
Best regards,
Mike Faynberg
I'm wondering this question too.
Have you tried model adaptation? Did it give some improvement in the accuracy?
Is there a delay? I have no delay here. You need to provide more information to get help.
Hi Nickolay, I would love to provide whatever information would be helpful, but I do not know what exactly is needed. I have a list of (as of today) about 20 commands. They are recognized (with a certain degree of success), but it takes approximately 3 to 5 seconds before the program responds. Attached is an excerpt from my code; it lacks only the command pre-processing, which is irrelevant at this time, since I do not have problems with it. The delay occurs after I finish pronouncing a command and before I see the "Heard something..." trace message. I would greatly appreciate it if you could suggest what additional evidence is needed to understand the problem.
In your code you check for a detected keyword only once silence appears. You can check for a detected keyword continuously, right after each process_raw call. Silence detection is not required.
The sample Python code is here:
https://github.com/cmusphinx/pocketsphinx/blob/master/swig/python/test/kws_test.py
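For readers who cannot open the link, here is a minimal sketch of the pattern described above: check for a hypothesis right after every process_raw call instead of waiting for silence. It assumes a decoder object that exposes start_utt, process_raw, hyp (returning None or an object with a hypstr attribute) and end_utt, as the pocketsphinx Python Decoder does to the best of my knowledge. FakeDecoder, FakeHyp and spot_keywords are hypothetical names used only so the sketch runs without audio or pocketsphinx installed.

```python
def spot_keywords(decoder, chunks):
    """Feed audio chunks to the decoder and report keywords as soon as they appear."""
    detections = []
    decoder.start_utt()
    for buf in chunks:
        decoder.process_raw(buf, False, False)
        # Check immediately after each chunk; no silence detection involved.
        hyp = decoder.hyp()
        if hyp is not None:
            detections.append(hyp.hypstr)
            # Restart the utterance so the search is reset for the next keyword.
            decoder.end_utt()
            decoder.start_utt()
    decoder.end_utt()
    return detections


class FakeHyp:
    """Hypothetical stand-in for a decoder hypothesis."""

    def __init__(self, hypstr):
        self.hypstr = hypstr


class FakeDecoder:
    """Hypothetical stand-in decoder: certain chunks 'contain' a keyword."""

    def __init__(self, triggers):
        self._triggers = triggers  # maps chunk bytes -> keyword text
        self._hyp = None

    def start_utt(self):
        self._hyp = None

    def end_utt(self):
        pass

    def process_raw(self, buf, no_search, full_utt):
        if buf in self._triggers:
            self._hyp = FakeHyp(self._triggers[buf])

    def hyp(self):
        return self._hyp


if __name__ == "__main__":
    decoder = FakeDecoder({b"\x01": "forward"})
    print(spot_keywords(decoder, [b"\x00", b"\x01", b"\x00", b"\x01"]))
```

With a real decoder, chunks would be the buffers read from the microphone stream, and the response delay should then be bounded roughly by the chunk size rather than by how long it takes for silence to be detected.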