I'm a noob with PocketSphinx. But, I've looked over all the docs and think what I'm trying to do is correct.
I have a file that contains audio. The only spoken word in the audio is: "remarks".
The word is in the pocketsphinx dictionary, so I would assume it should be recognized correctly. But apparently not. I'm testing the simple application from the CMU PocketSphinx website: https://cmusphinx.github.io/wiki/tutorialpocketsphinx/
My file is attached. It is called foo.mp3. I convert it to 16-bit, single-channel, 16 kHz sample rate using the following command:
ffmpeg -i file.mp3 -f s16le -ar 16K -ac 1 -acodec pcm_s16le file.pcm
Yet the recognized text is incorrect.
Here is what pocketsphinx says it hears:
Recognized: what are our act
Which is not correct. I've attached the log output from pocketsphinx.
Any insight/help is most appreciated!
-brad w.
Last edit: Brad Walker 2017-06-09
Here are my files to help with my question..
https://cmusphinx.github.io/wiki/faq/#q-what-is-sample-rate-and-how-does-it-affect-accuracy
Okay.. Makes sense..
So I downloaded the 8k acoustic model from the CMU Sphinx download directory..
Created an 8k sample rate file from my mp3 using the following command:
ffmpeg -i foo.mp3 -ac 1 -ar 8k foo.wav
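Before decoding, it can help to confirm that the converted file really has the format the acoustic model expects. The following is a minimal sketch using Python's standard wave module. The helper names describe_wav and matches_expected_format are hypothetical, and the 16-bit mono requirement is taken from the conversion described earlier in this thread.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate_hz) read from a WAV header."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth(), wav.getframerate()


def matches_expected_format(path, expected_rate):
    """True if the file is 16-bit, mono, and sampled at expected_rate Hz."""
    channels, width, rate = describe_wav(path)
    return channels == 1 and width == 2 and rate == expected_rate


# Example (assumes foo.wav from the ffmpeg command above exists):
#   print(describe_wav("foo.wav"))
#   print(matches_expected_format("foo.wav", 8000))
```

This only checks the header. It says nothing about how much of the spectrum actually contains speech energy.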
Then I used pocketsphinx_continuous as follows:
pocketsphinx_continuous -samprate 8000 -hmm en-us-8khz -infile foo.wav
I won't bore you with the logs.. But, I did see the following on the output..
INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 0 words
And there were no recognized words.
Any advice?
-brad w.
BTW. Thanks very much for the prompt response!!
Your audio bandwidth is less than 8 kHz, so you have to train an acoustic model to recognize it.
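To illustrate what "audio bandwidth" means here: the sample rate in the file header can be higher than the range of frequencies that actually carry energy, for example after lossy compression. A rough sketch of one way to estimate it is below, using a naive DFT and a spectral rolloff measure in plain Python. The function names and the 99% threshold are assumptions chosen for illustration. This is not how PocketSphinx itself evaluates the signal, and viewing a spectrogram in an audio editor gives the same information more conveniently.

```python
import cmath
import math


def power_spectrum(samples):
    """Naive DFT power for bins 0..N/2 (O(N^2), so use short windows only)."""
    n = len(samples)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(s * cmath.exp(-2j * math.pi * k * i / n)
                  for i, s in enumerate(samples))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def rolloff_hz(samples, sample_rate, fraction=0.99):
    """Lowest bin frequency at which the cumulative energy reaches `fraction` of the total."""
    spectrum = power_spectrum(samples)
    total = sum(spectrum)
    if total == 0:
        return 0.0
    running = 0.0
    for k, power in enumerate(spectrum):
        running += power
        if running >= fraction * total:
            return k * sample_rate / len(samples)
    return sample_rate / 2


# Example: a synthetic 1 kHz tone sampled at 16 kHz, 256-sample window.
tone = [math.sin(2 * math.pi * 1000 * i / 16000) for i in range(256)]
estimated = rolloff_hz(tone, 16000)
```

If the estimate for real speech comes out well below half the sample rate, the upper part of the spectrum is effectively empty, which is the situation described above.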
This is not a good idea for a technical forum.