I'm working on getting PocketSphinx to identify a handful of keywords. I have a keyword file and I switch it to keyword mode. I've created a variety of short recordings to help me test the keyword recognition, and it works great for some and not at all for others.
Here is my keyword file:
or(3) control /1e-30/
exit /1e-10/
tourniquet checkpoint /1e-50/
surgery start checkpoint /1e-50/
surgery end checkpoint /1e-50/
Works great for the first three. After that, it fails pretty badly (1 out of 4). I've tried adjusting the thresholds, but that either makes no difference or makes things worse. I read in another post that the limits were between 1 and 1e-50.
All of these words are in the English dictionary, but I added or(3) to get the correct pronunciation.
I'm at a loss at this point and unsure how to proceed.
Thanks in advance.
--Stephen
Take a long recording with a few occurrences of your keywords and some other sounds. You can use a movie soundtrack or something else. The audio should be approximately 1 hour long.
It is not really helpful to tune the threshold on short files like you are doing. Pocketsphinx does not have time to adapt to the volume and noise level, so you need to test on long files.
I'm attaching one of the failing sound files.
You also have noise and echo. If you are building a real-life system, it's better to look for a good microphone.
Several other things: if you want to use thresholds over 1e-50, you need to set the beam to a larger value, for example -beam 1e-150.
It is better to split long phrases into subphrases. This way you get a much more reliable estimate, since both of them are checked. You can check for "surgery start" and "checkpoint" instead of "surgery start checkpoint".
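If the phrase is split this way, the application has to decide when two separate detections count as one command. A minimal sketch of that check is below. It assumes the detections have already been parsed into (phrase, time in seconds) pairs, for example from the -time yes output. The function name `find_sequences` and the 1.5 second gap are hypothetical choices for illustration, not part of pocketsphinx.

```python
from typing import List, Tuple

Detection = Tuple[str, float]  # (phrase, reported time in seconds)


def find_sequences(detections: List[Detection], first: str, second: str,
                   max_gap: float = 1.5) -> List[float]:
    """Return the times of `first` detections that are followed by a
    `second` detection no more than `max_gap` seconds later.

    max_gap is an assumed value; it is measured between the two
    reported times and would need adjusting for real speech.
    """
    ordered = sorted(detections, key=lambda d: d[1])
    hits = []
    for i, (phrase, t) in enumerate(ordered):
        if phrase != first:
            continue
        for later_phrase, later_t in ordered[i + 1:]:
            if later_t - t > max_gap:
                break  # everything after this is too late to pair up
            if later_phrase == second:
                hits.append(t)
                break
    return hits


# Made-up detection times, for illustration only.
detections = [
    ("surgery start", 10.0),
    ("checkpoint", 10.9),
    ("checkpoint", 42.0),      # no "surgery start" shortly before it
    ("surgery start", 60.0),   # never followed by "checkpoint"
]
print(find_sequences(detections, "surgery start", "checkpoint"))
```

Only the pair at 10.0 s is accepted here; the lone "checkpoint" and the lone "surgery start" are ignored.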
I originally tried that, but my results were worse than with them combined. Ultimately a command will be something like "or control surgery start checkpoint".
Not sure where I'm going to come up with an hour-long audio file with a few references to things like "or control" and "surgery start checkpoint". Can I just record talking and occasionally say those things without any context to the rest of the talking?
As to noise and echo, I fully expect that our customer deployments will be worse. It will be in an operating room possibly with music in the background. Certainly people talking and noise from surgical equipment.
Other than the thresholds in the keywords file, is there anything I can tune?
Can I just record talking and occasionally say those things without any context to the rest of the talking?
You can
As to noise and echo, I fully expect that our customer deployments will be worse. It will be in an operating room possibly with music in the background. Certainly people talking and noise from surgical equipment.
You need hardware noise cancellation then. Either microphone array or headset.
Even if I had an hour-long audio file recorded with a good microphone, it still isn't clear to me how I would "tune". Actually, it isn't even clear that "tune" is the right word, although I'm convinced "train" is the wrong word.
Take a long recording with a few occurrences of your keywords and some other sounds. You can use a movie soundtrack or something else. The audio should be approximately 1 hour long.
Run keyword spotting on that file with different thresholds for every keyword, using the following command:
pocketsphinx_continuous -infile <your_file.wav> -keyphrase <"your keyphrase"> -kws_threshold \
<your_threshold> -time yes
It will print many lines, some of which are keywords with detection times and confidences. You can also send the extra logs to a file with the -logfn your_file.log option to avoid clutter.
From the keyword spotting results, count how many false alarms and missed detections you encountered.
Select the threshold with the smallest number of false alarms and missed detections.
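The counting and selection steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the detection times for each threshold have already been collected from the -time yes output, the true keyword times in the recording are known, and a detection within one second of a true time counts as a match. The names `score` and `best_threshold`, the one second tolerance, and all the numbers are hypothetical.

```python
from typing import Dict, List, Tuple


def score(detections: List[float], references: List[float],
          tolerance: float = 1.0) -> Tuple[int, int]:
    """Return (false_alarms, misses) for one keyword at one threshold.

    A detection matches a reference time if it is within `tolerance`
    seconds of it; each reference can be matched at most once.
    """
    unmatched = sorted(references)
    false_alarms = 0
    for t in sorted(detections):
        match = next((r for r in unmatched if abs(r - t) <= tolerance), None)
        if match is None:
            false_alarms += 1
        else:
            unmatched.remove(match)
    return false_alarms, len(unmatched)


def best_threshold(results: Dict[float, List[float]],
                   references: List[float],
                   tolerance: float = 1.0) -> float:
    """Pick the threshold with the fewest false alarms plus misses.

    On a tie, the threshold listed first in `results` wins.
    """
    def total_errors(threshold: float) -> int:
        false_alarms, misses = score(results[threshold], references, tolerance)
        return false_alarms + misses

    return min(results, key=total_errors)


# Made-up numbers: times (seconds) at which the keyword was really spoken,
# and the detection times reported at three candidate thresholds.
references = [12.0, 95.5, 300.2]
results = {
    1e-10: [12.1],
    1e-30: [12.1, 95.4, 300.0],
    1e-50: [12.1, 50.0, 95.4, 200.0, 300.0, 410.0],
}
for threshold, times in results.items():
    print(threshold, score(times, references))
print("best:", best_threshold(results, references))
```

False alarms and misses are weighted equally here. If a false alarm is more costly than a missed command in your application (or the reverse), the sum in `total_errors` could be weighted accordingly.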
Take a long recording with a few occurrences of your keywords and some other sounds. You can use a movie soundtrack or something else.
Should I splice in my keywords to the movie sound track? There aren't any movies that are going to have "or control surgery start checkpoint" in it. Or would the splicing likely invalidate the exercise because the audio quality and volume will be very different from the movie sound track?
Thanks for all your help.
Tutorial says:
It is not really helpful to tune the threshold on short files like you are doing. Pocketsphinx does not have time to adapt to the volume and noise level, so you need to test on long files.
You also have noise and echo. If you are building a real-life system, it's better to look for a good microphone.
Several other things: if you want to use thresholds over 1e-50, you need to set the beam to a larger value, for example -beam 1e-150.
It is better to split long phrases into subphrases. This way you get a much more reliable estimate, since both of them are checked. You can check for "surgery start" and "checkpoint" instead of "surgery start checkpoint".
Tutorial also mentions that:
"Surgery start checkpoint" is only 6 syllables.... so nothing I have is over 10 syllables. Do you still recommend splitting them?
Why not? It increases the reliability of detection.
I originally tried that, but my results were worse than with them combined. Ultimately a command will be something like "or control surgery start checkpoint".
Not sure where I'm going to come up with an hour-long audio file with a few references to things like "or control" and "surgery start checkpoint". Can I just record talking and occasionally say those things without any context to the rest of the talking?
As to noise and echo, I fully expect that our customer deployments will be worse. It will be in an operating room possibly with music in the background. Certainly people talking and noise from surgical equipment.
Other than the thresholds in the keywords file, is there anything I can tune?
You can
You need hardware noise cancellation then. Either microphone array or headset.
Would a fairly inexpensive microphone array like the one below make a difference?
https://www.amazon.com/Andrea-Electronics-Technology-Microphone-C1-1024200-1/dp/B00H85ANIE?SubscriptionId=AKIAILSHYYTFIVPWUY6Q&tag=duckduckgo-d-20&linkCode=xm2&camp=2025&creative=165953&creativeASIN=B00H85ANIE
Headsets are not going to work.
Yes, something like that should be much better than what you have now.
I'm attaching one of the failing sound files.
Even if I had an hour-long audio file recorded with a good microphone, it still isn't clear to me how I would "tune". Actually, it isn't even clear that "tune" is the right word, although I'm convinced "train" is the wrong word.
I've read http://cmusphinx.sourceforge.net/wiki/tutorialtuning, but it is unclear what is being tuned and how. What do I do with my results? What should I be changing to improve them?
I've also read some of http://www.speech.cs.cmu.edu/sphinx/tutorial.html, but that seems like overkill for some simple adjustments.
http://cmusphinx.sourceforge.net/wiki/tutoriallm#keyword_lists
Should I splice in my keywords to the movie sound track? There aren't any movies that are going to have "or control surgery start checkpoint" in it. Or would the splicing likely invalidate the exercise because the audio quality and volume will be very different from the movie sound track?
Thanks for all your help.
Yes, something like that should work. The threshold is a single number, so you don't need much data to estimate it.
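For anyone who wants to try the splicing approach, a rough sketch is below. It works on raw PCM bytes and simply overwrites the background with the keyword clip at chosen offsets, returning the offsets that were actually used so they can serve as the reference times when counting misses and false alarms. It assumes both signals are mono, 16-bit, 16 kHz; the function name `splice` and the offsets are made up for illustration, and nothing here addresses the volume or quality mismatch raised in the question.

```python
from typing import Iterable, List, Tuple


def splice(background: bytes, clip: bytes, offsets_s: Iterable[float],
           rate: int = 16000, width: int = 2) -> Tuple[bytes, List[float]]:
    """Overwrite `background` with `clip` at each offset (in seconds).

    Both inputs are raw PCM with the same format (assumed mono, `rate`
    samples per second, `width` bytes per sample). Returns the new audio
    and the list of offsets where the clip was actually placed.
    """
    out = bytearray(background)
    placed = []
    for offset in sorted(offsets_s):
        start = round(offset * rate) * width   # byte index, sample aligned
        end = start + len(clip)
        if end > len(out):
            continue  # clip would run past the end of the background
        out[start:end] = clip
        placed.append(offset)
    return bytes(out), placed


# Stand-in data: 3 s of silence and a 0.5 s non-silent "clip".
background = bytes(16000 * 2 * 3)
clip = b"\x01\x00" * 8000
audio, reference_times = splice(background, clip, [1.0, 2.8])
print(len(audio), reference_times)
```

In practice the PCM bytes could be read and written with Python's standard wave module. Mixing the clip into the background instead of overwriting it might be closer to a real room, but that is a separate design choice.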
Also there is "adapting" (http://cmusphinx.sourceforge.net/wiki/tutorialadapt), but that seems wrong because it is for one person or one environment.