I've been testing PocketSphinx (pocketsphinx_continuous on Linux) to see if it could be used (in keyword spotting mode) to detect a small subset of simple words. Right now I'm trying to detect some Dutch words, which works quite well. However, one of them is 'aap' (monkey) and it often doesn't get recognized. The .dict I use has 'aap a p' and there's also an entry for it in the .list file I use.
However, no matter whether the entry in the .list file is 'aap /1/' or 'aap /1e-100/' or 'aap /1000/', the detection doesn't get any better or worse. I understand detecting a keyword should be easier with more complex words (more to go on), but still: how come the threshold doesn't seem to have any effect? Thanks for any insight into this matter!
To get help on this issue, please provide the audio file you are using for tests and the command line you are running.
Thanks for getting back to me (and quickly, wow). The command I'm using is this:
pocketsphinx_continuous -hmm cmusphinx-nl-5.2/model_parameters/voxforge_nl_sphinx.cd_cont_2000 -lm cmusphinx-nl-5.2/etc/voxforge_nl_sphinx.lm.bin -inmic yes -kws keyphrase.list -dict keyphrase.dict
I'm simply trying (through my laptop's mic) to get PocketSphinx to reliably detect a few words, pronounced by various native Dutch speakers (my girlfriend, my daughter). I don't actually need pronunciation to be perfect; that's why I'm trying to really lower the threshold, to have it detect the words in almost all cases. But configuring the threshold doesn't seem to have any effect at all...
You need to experiment with an audio file first, not with a microphone. That enables us to reproduce your problems and help you.
Ok, so I tried with a different word: 'slang' (Dutch word for snake). I pronounced it three times in the attached sound file ('slang', 'sang' and 'lang'). Right now running the command:
pocketsphinx_continuous -hmm cmusphinx-nl-5.2/model_parameters/voxforge_nl_sphinx.cd_cont_2000 -lm cmusphinx-nl-5.2/etc/voxforge_nl_sphinx.lm.bin -infile slang.wav -kws keyphrase.list -dict keyphrase.dict
...seems to detect the first two (?) pronunciations, and I'm not able to get different results when I significantly modify the threshold for 'slang' in keyphrase.list (e.g. "slang /1e-100/" vs "slang /1/"). Even removing the line completely from the .list file doesn't change my results, which leads me to believe that somehow the .list file might not even be interpreted...?! (It does read the correct file though, I checked.)
Any help much appreciated!
PS. My .dict file contains 'slang s l aa nn'.
Last edit: Jelmer Feenstra 2017-08-27
With a dictionary test.dict
and kws file test.kws
and command line (note that -lm conflicts with kws)
result is
Thanks for your response, not using the -lm option fixed things for me with regard to thresholds!
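For anyone following along, here is a rough sketch of the setup that ended up working for me: keyword spotting with -kws and -dict, and no -lm. It is not the test.dict / test.kws from the reply above. The helper names (write_kws_files, build_command) are made up for illustration, the thresholds are placeholders to tune, and the model path is simply copied from my earlier command.

```python
from pathlib import Path

# Acoustic model path copied from the command earlier in this thread;
# adjust it to wherever your model lives.
HMM_DIR = "cmusphinx-nl-5.2/model_parameters/voxforge_nl_sphinx.cd_cont_2000"


def write_kws_files(keywords, list_path="keyphrase.list", dict_path="keyphrase.dict"):
    """Write 'word /threshold/' lines to the list file and 'word phones' lines to the dict.

    keywords maps each word to a (threshold, phones) pair of strings.
    """
    list_lines = [f"{word} /{threshold}/" for word, (threshold, _) in keywords.items()]
    dict_lines = [f"{word} {phones}" for word, (_, phones) in keywords.items()]
    Path(list_path).write_text("\n".join(list_lines) + "\n", encoding="utf-8")
    Path(dict_path).write_text("\n".join(dict_lines) + "\n", encoding="utf-8")


def build_command(wav_path, list_path="keyphrase.list", dict_path="keyphrase.dict"):
    """Build the argument list for a keyword-spotting run, deliberately without -lm."""
    return [
        "pocketsphinx_continuous",
        "-hmm", HMM_DIR,
        "-infile", wav_path,
        "-kws", list_path,
        "-dict", dict_path,
    ]


if __name__ == "__main__":
    # Placeholder thresholds: starting points to tune, not recommended values.
    write_kws_files({"slang": ("1e-20", "s l aa nn"), "aap": ("1e-10", "a p")})
    print(" ".join(build_command("slang.wav")))
```

The argument list can be passed to subprocess.run as-is, which makes it easy to rerun the same wav file with different thresholds.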
My intended use case is perhaps somewhat strange: I want to be very forgiving in recognizing just a handful of words, so a kid could mispronounce the (Dutch) word 'slang' and still have the word correctly recognized. I can somewhat allow for this by adding various extra pronunciations for the word in my .dict file (combined with low thresholds), but when I'm trying to recognize words that are rather short, it often results in several of them being recognized at once. For example, the Dutch word for monkey is 'aap', which is often recognized as part of other words that contain the 'a' vowel (in that case I could just pick the longest word... I guess).
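To illustrate the extra pronunciations: a small sketch that generates the dictionary lines, assuming the usual CMU dictionary convention of marking alternates with a (2), (3), ... suffix. The helper name dict_entries is made up, and the two extra variants are only my guesses, modeled on the 'sang' and 'lang' mispronunciations from my test file.

```python
def dict_entries(word, pronunciations):
    """Return dictionary lines for a word; alternates get a (2), (3), ... suffix."""
    lines = []
    for index, phones in enumerate(pronunciations, start=1):
        name = word if index == 1 else f"{word}({index})"
        lines.append(f"{name} {phones}")
    return lines


if __name__ == "__main__":
    # First entry is the pronunciation from my .dict; the rest are guessed variants.
    for line in dict_entries("slang", ["s l aa nn", "s aa nn", "l aa nn"]):
        print(line)
```

The keyword list still only needs the single entry for 'slang'; the variants live in the dictionary.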
Are there better ways of getting to know which word was presumably intended? I don't think there's such a thing as word boundaries in PocketSphinx (or speech recognition for that matter), right? Maybe recognized length combined with some form of confidence?
Using more complex words is a 'solution' to this problem, but I'm wondering whether there are ways to make this work a little better with short / simple words as well.
Thanks again.
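To make the "pick the longest word" idea concrete, this is roughly what I have in mind: a post-processing step over detections, not anything built into PocketSphinx. It assumes I can somehow get (word, start, end, confidence) for each hit; the Detection type and pick_best are hypothetical names. Overlapping hits are resolved by preferring the longer one, with confidence as a tie-breaker.

```python
from typing import List, NamedTuple


class Detection(NamedTuple):
    word: str
    start: float       # seconds
    end: float         # seconds
    confidence: float  # whatever score the recognizer reports

    @property
    def duration(self) -> float:
        return self.end - self.start


def overlaps(a: Detection, b: Detection) -> bool:
    """True if the two time spans share any interval."""
    return a.start < b.end and b.start < a.end


def pick_best(detections: List[Detection]) -> List[Detection]:
    """Greedy selection: rank by (duration, confidence), keep non-overlapping hits."""
    kept: List[Detection] = []
    ranked = sorted(detections, key=lambda d: (d.duration, d.confidence), reverse=True)
    for candidate in ranked:
        if not any(overlaps(candidate, k) for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda d: d.start)


if __name__ == "__main__":
    hits = [
        Detection("aap", 0.40, 0.62, 0.5),    # falls inside the 'slang' hit
        Detection("slang", 0.30, 0.85, 0.4),
        Detection("aap", 1.50, 1.80, 0.6),    # stands on its own
    ]
    for hit in pick_best(hits):
        print(hit.word, hit.start, hit.end)
```

Preferring duration over confidence is a deliberate bias against short words like 'aap' swallowing parts of longer ones; whether that is the right trade-off would have to be tested on real recordings.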
Well, children's speech recognition requires a specialized acoustic model anyway; it will not work accurately out of the box. And it will never work reliably with short words like "aap" unless you modify the algorithm to detect silence around the word. The detected word length is available on the command line with -time yes and through the ps_seg API.
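A small sketch of how the timing output could be collected for further processing. It assumes that with -time yes each detected word is printed on its own line as "word start end confidence", whitespace-separated with times in seconds; please verify that against the actual output of your build, since the format is an assumption here. The function name parse_timing_lines is made up, and multi-word keyphrases would need a different split.

```python
def parse_timing_lines(text):
    """Parse lines assumed to look like 'word start end confidence'.

    Lines that do not fit that shape (log output, blank lines) are skipped.
    Returns a list of (word, start, end, confidence) tuples.
    """
    results = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 4:
            continue  # not a single-word timing line
        word, *numbers = parts
        try:
            start, end, confidence = (float(n) for n in numbers)
        except ValueError:
            continue  # four tokens, but not three numbers
        results.append((word, start, end, confidence))
    return results


if __name__ == "__main__":
    # Made-up sample text in the assumed format, not real recognizer output.
    sample = "slang 0.300 0.850 0.400000\nINFO: some log line\naap 1.500 1.800 0.600000\n"
    for word, start, end, confidence in parse_timing_lines(sample):
        print(f"{word}: {end - start:.3f}s, confidence {confidence}")
```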