I'm using pocketsphinx_continuous from svn (r11331) on Ubuntu 11.10 as follows:
% pocketsphinx_continuous -input mic -lm mymodel.lm -dict mydict.dic
mymodel.lm and mydict.dic were generated from
http://www.speech.cs.cmu.edu/tools/lmtool-new.html using a corpus file that
has about 30 words. The intended use is as a command-and-control interface for
a Qt GUI application. I've already adapted continuous.c into my Qt code and am
able to use it successfully there. But the issue I will describe occurs in the
pocketsphinx_continuous program as well, so I'll keep that the focus of the
conversation for clarity.
What I've noticed is that sometimes the response time is slow -- about 1.5
seconds. Other times it is quite fast, returning a (correct) hypothesis almost
instantly. There doesn't seem to be any in between: it's either fast or
(relatively) slow.
Furthermore, I've noticed that if I speak a word into the mic immediately
after the previous word is recognized and printed to the screen, the new word
is almost always recognized quickly. Thus I can successfully have many words
recognized quickly in succession if I immediately say the next word after
pocketsphinx_continuous says "READY....", and before it says "Listening...."
If I wait a moment (until "Listening..." appears or later) to speak, then most
of the time -- but not always -- the recognition of the word takes ~ 1.5
seconds.
I discovered that the first call to ps_process_raw(...) in recognize_from_microphone() (right after ps_start_utt(...)) is the one
that is taking up all the time when the recognition goes slowly. I've tested
the speed perhaps hundreds of times, and ps_process_raw(...) either takes 5ms,
1290ms +/- 5ms, or 1510ms +/- 5ms on my system. Those three values are
encountered consistently unless the utterance is 2-3 words, in which case the
speed is often a multiple of one of those, such as 2580ms or 3020ms. It's very
consistent, which would indicate a deterministic issue rather than a random
one.
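For what it's worth, I'm timing it with a plain gettimeofday() wrapper around the
call inside the read loop, roughly like this (variable names as in continuous.c;
needs <sys/time.h> and <stdio.h>):

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    rem = ps_process_raw(ps, adbuf, k, FALSE, FALSE);
    gettimeofday(&t1, NULL);
    printf("ps_process_raw: %ld ms\n",
           (long) ((t1.tv_sec - t0.tv_sec) * 1000 + (t1.tv_usec - t0.tv_usec) / 1000));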
When recognizing the word "LINK" on two subsequent attempts, here are the INFO
messages:
I noticed some differences in the first (slow) and second (fast) case,
especially the fwdtree 1.35 wall vs 0.02 wall, which I guess indicates the
area where things went slowly the first time. I also saw the "... not found in
last frame" message on the slower attempt, but I haven't yet figured out
the significance of that or whether it is related to the slower search.
The accuracy of the recognition is outstanding, and it was easy to integrate
into my Qt app -- it took maybe 1 hour. If I can get it to consistently
recognize the commands quickly rather than a frustrating 1.5s delay, this will
significantly enhance the user experience for my GUI. I would appreciate any
hints on what might be causing the issues I'm seeing, or ideas on what I could
do differently in order to achieve faster results for a command & control
application.
Thanks!
I should also mention that I'm using a smaller value than
DEFAULT_SAMPLES_PER_SEC to determine the end of the utterance. I believe the
default is 16000, which results in a required 1s of silence before the
utterance is ended. I'm using 1000, which allows the utterance to end after
1/16th of a second of silence (I think).
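For context, the check I changed is the silence test in the read loop of
recognize_from_microphone(); from memory it looks roughly like this (so the
details may be slightly off), with the timestamps counted in samples at 16 kHz:

    if ((k = cont_ad_read(cont, adbuf, 4096)) < 0)
        E_FATAL("Failed to read audio\n");
    if (k == 0) {
        /* No new speech data: if more than the threshold (in samples; normally
         * DEFAULT_SAMPLES_PER_SEC = 16000, i.e. 1 s) has passed since the last
         * speech, end the utterance. I use 1000 (~1/16 s) instead. */
        if ((cont->read_ts - ts) > DEFAULT_SAMPLES_PER_SEC)
            break;
    }
    else {
        ts = cont->read_ts;   /* note timestamp of most recent speech */
    }
    rem = ps_process_raw(ps, adbuf, k, FALSE, FALSE);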
A followup, just to save anyone from wasting any time on this. Somehow I
missed the following message that appeared just before "READY...":
Warning: Could not find Mic element
I must have seen it before but forgotten about it/ignored it since the mic did
appear to work properly. At any rate, when I was playing around with
pocketsphinx_continuous yesterday, I made an accidental discovery: the
recognition in my Qt app (based on continuous.c) was very fast when I already
had an instance of the python/gstreamer demoapp.py (using the gstreamer
pocketsphinx plugin) running.
I haven't quite pieced together why two instances running at the same time
would speed up the ps_process_raw function, but I suspected it had something
to do with the state of the audio buffer for my mic. So I looked more closely
at the error statements and that's when I noticed the "Could not find Mic"
warning.
A little more googling led me to try "-adcdev plughw:2,0" instead of "-input
mic", and that was the key. Using -adcdev results in fast recognition with no
more long delays. Fantastic!
In case anyone is wondering, I did:
% cat /proc/asound/cards
to determine that my USB device was device #2. That's where the 2 comes from
in plughw:2,0.
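So the full working command line for me is:
% pocketsphinx_continuous -adcdev plughw:2,0 -lm mymodel.lm -dict mydict.dic
where the 2 in plughw:2,0 is whatever card number /proc/asound/cards lists for
your mic, and 0 is the device on that card.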
I hope this information is useful to someone who might run into the same
issues I did.
A big thanks to the authors of the software. It works flawlessly for my
application. I might investigate the possibility of using partial results to
anticipate and recognize a valid spoken command even before the user is done
saying it, which would speed things up even more.
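If I do try that, my rough (untested) plan is to poll ps_get_hyp() for a partial
hypothesis inside the read loop and break out as soon as it matches one of my
~30 commands, leaving the normal ps_end_utt() handling after the loop to finish
up. Something like this, where is_valid_command() is my own (hypothetical)
lookup against the command list:

    char const *hyp, *uttid;
    int32 score;

    rem = ps_process_raw(ps, adbuf, k, FALSE, FALSE);
    hyp = ps_get_hyp(ps, &score, &uttid);        /* partial hypothesis so far */
    if (hyp != NULL && is_valid_command(hyp)) {  /* my own lookup, not part of pocketsphinx */
        /* Already have a complete command: stop reading audio early and fall
         * through to the usual end-of-utterance code below the loop. */
        break;
    }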
You are welcome. It's nice that your problem got resolved so fast, wish you
even more adventures in the future.
Your wish came true, re: more adventures in the future.
After getting sphinxbase and pocketsphinx built on my Mac, I had trouble using
pocketsphinx_continuous with -input mic. Searching led me to
https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4005379,
which I assume is still the case: no live mic input on OSX.
Since OSX is one of the supported platforms for my Qt application, I'm
motivated to figure out the feasibility/effort required to support live mic
input on OSX. Do you have any suggestions on who I could talk to, where I
could look in the code, and any other resources I might consider in attempting
to add this feature? I'm happy to work on this myself, but I could use some
advice on how to get started.
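My current understanding (please correct me if I'm off) is that a live-input
backend boils down to implementing the small ad.h interface from sphinxbase, so
a CoreAudio version would need to fill in roughly these entry points (skeleton
only, bodies left as TODOs):

    #include "ad.h"   /* sphinxbase audio-device API: ad_rec_t, DEFAULT_SAMPLES_PER_SEC */

    ad_rec_t *
    ad_open_dev(const char *dev, int32 samples_per_sec)
    {
        /* TODO: open the CoreAudio input device "dev" at samples_per_sec */
        return NULL;
    }

    ad_rec_t *
    ad_open_sps(int32 samples_per_sec)
    {
        return ad_open_dev(NULL, samples_per_sec);
    }

    ad_rec_t *
    ad_open(void)
    {
        return ad_open_sps(DEFAULT_SAMPLES_PER_SEC);
    }

    int32 ad_start_rec(ad_rec_t *r) { /* TODO: start capture */ return 0; }
    int32 ad_stop_rec(ad_rec_t *r)  { /* TODO: stop capture */ return 0; }

    int32
    ad_read(ad_rec_t *r, int16 *buf, int32 max)
    {
        /* TODO: copy up to max samples from the capture buffer into buf;
         * return the number copied (0 if none yet, negative on error). */
        return 0;
    }

    int32 ad_close(ad_rec_t *r)     { /* TODO: stop and release the device */ return 0; }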
I had code for it, but never really integrated it. The issue is that ad is not
really flexible enough to support Mac CoreAudio. Check it here:
http://dl.dropbox.com/u/26073448/mac-ad.tar.gz