I have a few .wav files which are roughly 20-30 seconds each. I edited these files to be mono (22050 Hz), 16-bit PCM and normalized them using Audacity. I'm using the latest sphinxbase and pocketsphinx libraries. I have an arbitrary list of keywords as follows:
university /1e-20/
awful /1e-10/
athletes /1e-20/
terrible /1e-25/
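Each line in the list above pairs a phrase with a detection threshold written between slashes. As a sanity check before tuning, the file can be parsed to confirm every threshold reads as the number intended. The following is only a minimal sketch: the `KeywordList` class and `Parse` method are hypothetical helpers, not part of pocketsphinx or of any wrapper, and the sketch assumes every line carries a threshold, as in the list above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of pocketsphinx.
static class KeywordList
{
    // Matches lines of the form: phrase /threshold/
    static readonly Regex LinePattern =
        new Regex(@"^\s*(.+?)\s*/([^/]+)/\s*$");

    public static List<KeyValuePair<string, double>> Parse(IEnumerable<string> lines)
    {
        var entries = new List<KeyValuePair<string, double>>();
        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line))
                continue; // skip blank lines

            Match m = LinePattern.Match(line);
            if (!m.Success)
                throw new FormatException("Malformed keyword line: " + line);

            // Parse with the invariant culture so "1e-20" is read the same
            // way regardless of the machine's regional settings.
            double threshold = double.Parse(
                m.Groups[2].Value, NumberStyles.Float, CultureInfo.InvariantCulture);

            entries.Add(new KeyValuePair<string, double>(m.Groups[1].Value, threshold));
        }
        return entries;
    }
}
```

Calling `KeywordList.Parse(File.ReadAllLines(path))` and printing the pairs makes it easy to see at a glance which threshold is attached to which phrase.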
The problem I'm having is that I get a lot of false positives, both on words with 3+ syllables (for example, "terrible" from the list above isn't in any of the audio files but is still detected) and on shorter words like "awful".
Do I need to train or adapt the default en-us acoustic model for keyword spotting to be more accurate? I tried adjusting the thresholds to limit false positives, but with no luck. Is there any way to improve accuracy for shorter words ("awful") other than adjusting the thresholds?
If it helps, below is my code for doing this. I'm using a C# wrapper that calls the pocketsphinx library through P/Invoke. I'm reading in a .wav file as a byte array and processing it 1024 bytes at a time:
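using System.IO;
using System.Linq;

int index = 0;
// Note: ReadAllBytes returns the whole file, including the WAV header.
byte[] wavBytes = File.ReadAllBytes("../speech.wav");

// Configure the decoder with the default en-us acoustic model and dictionary.
Config c = Decoder.default_config();
c.set_string("-hmm", "../../../pocketsphinx/model/en-us/en-us");
c.set_string("-dict", "../../../pocketsphinx/model/en-us/cmudict-en-us.dict");
c.set_float("-samprate", 22050);
c.set_int("-nfft", 1024);
Decoder d = new Decoder(c);

// Register the keyword list and make it the active search.
d.set_kws("keywords", "../../../pocketsphinx/model/en-us/en-us/keywords.list");
d.set_search("keywords");

d.start_utt();
while (index < wavBytes.Length)
{
    // Feed the next 1024 bytes (fewer at the end of the file) to the decoder.
    byte[] subset = wavBytes.Skip(index).Take(1024).ToArray();
    d.process_raw(subset, subset.Length, false, false);

    var hyp = d.hyp();
    if (hyp != null)
    {
        // A keyword was detected; restart the utterance to look for the next one.
        var h = hyp.hypstr;
        d.end_utt();
        d.start_utt();
    }
    index += 1024;
}
// Close the utterance that is still open when the loop exits.
d.end_utt();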
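One detail of the loop above is that `File.ReadAllBytes` returns the RIFF header along with the samples, so the header bytes are passed to the decoder as if they were audio. Whether that contributes to the false positives is not established here. The following is only a minimal sketch of locating the PCM payload and splitting it into chunks that end on 16-bit sample boundaries. `WavPcm`, `FindDataChunk` and `Chunks` are hypothetical names, not part of pocketsphinx, and the sketch assumes an uncompressed PCM file read on a little-endian machine.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of pocketsphinx.
static class WavPcm
{
    // Returns the offset and length of the "data" chunk payload
    // inside a RIFF/WAVE byte array.
    public static (int Offset, int Length) FindDataChunk(byte[] wav)
    {
        if (wav.Length < 12 ||
            Encoding.ASCII.GetString(wav, 0, 4) != "RIFF" ||
            Encoding.ASCII.GetString(wav, 8, 4) != "WAVE")
            throw new FormatException("Not a RIFF/WAVE file");

        int pos = 12; // first chunk starts right after the 12-byte RIFF header
        while (pos + 8 <= wav.Length)
        {
            string id = Encoding.ASCII.GetString(wav, pos, 4);
            // Chunk sizes are stored little-endian; BitConverter follows the
            // machine's byte order, hence the little-endian assumption.
            int size = BitConverter.ToInt32(wav, pos + 4);
            if (size < 0)
                throw new FormatException("Unsupported chunk size");

            if (id == "data")
                return (pos + 8, Math.Min(size, wav.Length - (pos + 8)));

            pos += 8 + size + (size & 1); // chunks are padded to an even size
        }
        throw new FormatException("No data chunk found");
    }

    // Yields the PCM payload in pieces of at most chunkBytes bytes.
    public static IEnumerable<byte[]> Chunks(byte[] wav, int chunkBytes)
    {
        if (chunkBytes <= 0 || chunkBytes % 2 != 0)
            throw new ArgumentException("chunkBytes must be a positive even number");

        var (offset, length) = FindDataChunk(wav);
        int end = offset + length;
        for (int i = offset; i < end; i += chunkBytes)
        {
            int n = Math.Min(chunkBytes, end - i);
            var chunk = new byte[n];
            Buffer.BlockCopy(wav, i, chunk, 0, n);
            yield return chunk;
        }
    }
}
```

With this in place, the `while` loop could be replaced by `foreach (byte[] chunk in WavPcm.Chunks(wavBytes, 1024))`, passing `chunk` and `chunk.Length` to `process_raw` exactly as before.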
You can share the files and keyword list to get help on this issue.