Good day!
I am writing a wrapper around pocketsphinx with PHP. The intended goal is to process WAV files (podcasts and public speech) and "auto-tag" each processed audio file according to a pre-compiled keyword list. The results would be weighted, and tags would then be created from the resulting words with the highest scores.
Moving on, I am having a little difficulty wrapping my mind around the KWS_THRESHOLD argument for pocketsphinx. Can I use decimal values (0.95), or is e-notation (1e-10) required? If I use a threshold value of 1e-10, it seems reasonably accurate (a handful of false positives), but if I raise it to 1e-30, the number of false positives increases significantly.
I am using the following CLI call:
pocketsphinx_continuous -infile some.wav -dict /usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict -kws keywords.txt -kws_threshold 0.95
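For a wrapper, the call above can be assembled programmatically. The following is only a minimal sketch in Python (the wrapper in this thread is PHP, so treat it as an illustration of the idea). The helper name build_kws_command is hypothetical, and it uses only the flags that already appear in the command above. It builds the argument list and does not run the decoder.

```python
import shlex


def build_kws_command(wav_path, dict_path, kws_path, threshold):
    """Assemble the argument list for the pocketsphinx_continuous call above.

    build_kws_command is a hypothetical helper name. It only builds the
    arguments; running them (e.g. with subprocess.run) is left to the caller.
    """
    return [
        "pocketsphinx_continuous",
        "-infile", wav_path,
        "-dict", dict_path,
        "-kws", kws_path,
        # ":g" prints small floats in e-notation, e.g. 1e-10 -> "1e-10".
        "-kws_threshold", f"{threshold:g}",
    ]


if __name__ == "__main__":
    cmd = build_kws_command(
        "some.wav",
        "/usr/local/share/pocketsphinx/model/en-us/cmudict-en-us.dict",
        "keywords.txt",
        1e-10,
    )
    # Show the command as a single shell-quoted string.
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell-quoting problems when file names contain spaces.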
What are the best practices regarding keyword spotting and assigning thresholds, and can I depend upon the accuracy (obviously not fully) of pocketsphinx to process the audio and recognize keywords accordingly? Is it speaker independent?
Thanks in advance for your help, and please forgive my ignorance on this subject.
cool
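You should use e-notation for the threshold, because very large or very small values are expected.
That is expected. 1e-10 > 1e-30, so going from 1e-10 to 1e-30 lowers the threshold rather than raising it. A bigger threshold drops uncertain occurrences of the keyword, which decreases the number of false alarms.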
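To make the ordering concrete, here is a small Python sketch using the three values mentioned in this thread. It only illustrates how the numbers compare; it says nothing about which notations pocketsphinx itself accepts.

```python
# Threshold values mentioned in the thread, written as command-line strings.
thresholds = ["1e-30", "0.95", "1e-10"]

# Sort from the smallest (most permissive) to the largest (strictest) value.
ordered = sorted(thresholds, key=float)
print(ordered)  # ['1e-30', '1e-10', '0.95']

# 1e-10 is the bigger threshold, so it rejects more uncertain detections
# than 1e-30 does.
print(float("1e-10") > float("1e-30"))  # True
```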
As for best practices: collect some audio data containing your keyword and select the best threshold for it (the optimal position on the ROC curve). The right value depends on the length of the keyword and on the mismatch between the acoustic model and the provided audio. Accurate keyword spotting certainly requires accurate acoustic modeling.
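One way to sketch that selection step in Python: run keyword spotting on labelled test audio at several candidate thresholds, count correct hits and false alarms for each, and keep the threshold with the best trade-off. The counts below are made-up placeholder numbers, not real measurements, and using Youden's J statistic (true positive rate minus false positive rate) to define the "optimal position" is my assumption; other criteria are just as valid.

```python
def pick_threshold(results, n_positive, n_negative):
    """Return (threshold, J) for the threshold maximising J = TPR - FPR.

    results maps threshold -> (true_hits, false_alarms), measured on
    labelled audio. n_positive is the number of real keyword occurrences,
    n_negative the number of tested segments without the keyword.
    """
    best_threshold, best_j = None, float("-inf")
    for threshold, (hits, false_alarms) in sorted(results.items()):
        tpr = hits / n_positive          # true positive rate
        fpr = false_alarms / n_negative  # false positive rate
        j = tpr - fpr
        if j > best_j:
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Hypothetical measurements: threshold -> (true hits, false alarms).
measurements = {
    1e-40: (19, 60),
    1e-30: (18, 35),
    1e-20: (17, 12),
    1e-10: (15, 4),
    1e-5: (9, 1),
}

best, score = pick_threshold(measurements, n_positive=20, n_negative=200)
print(best, round(score, 2))  # 1e-20 0.79
```

With these placeholder numbers, the smallest thresholds find almost every occurrence but raise many false alarms, while the largest one misses more than half of the keywords; the chosen value sits between the two.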
Speaker independence depends on the acoustic model you're using. The default acoustic model (AM) provided with pocketsphinx is speaker independent.
Thanks for your assistance!
Any resources to help a layman better understand acoustic models? What would be the ideal way to customize an acoustic model?
One more thing...
I am assuming, by your reply, that I am able to assign a threshold to individual keywords? How is this achieved within the keyword text file?
Last edit: Troy McComas 2015-05-08
http://cmusphinx.sourceforge.net/wiki/tutorialadapt
http://stackoverflow.com/questions/25748113/recognizing-multiple-keywords-using-pocketsphinx
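As I understand the pages linked above, a per-keyword threshold is written at the end of each line of the keyword file, between slashes, with one keyphrase per line. Please treat that format as an assumption to check against those pages. The Python sketch below generates such a file body; the helper name format_keyword_list, the phrases, and the threshold values are all made up for illustration.

```python
def format_keyword_list(keywords):
    """Render {phrase: threshold} as lines of the form 'phrase /threshold/'.

    The 'phrase /threshold/' layout is assumed from the linked resources;
    format_keyword_list is a hypothetical helper name.
    """
    lines = [f"{phrase} /{threshold:g}/" for phrase, threshold in keywords.items()]
    return "\n".join(lines) + "\n"


# Hypothetical keyphrases with individually chosen thresholds.
keywords = {
    "machine learning": 1e-30,
    "election": 1e-10,
}

text = format_keyword_list(keywords)
print(text, end="")
# machine learning /1e-30/
# election /1e-10/
```

The resulting text could be saved as keywords.txt and passed with -kws as in the command earlier in the thread.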