First of all, I would like to thank you for providing this free speech recognition engine; everything I have heard about it has been positive.
I should mention that I am new to programming (C/C++).
I plan to use the Kinect to recognize words with CMU Sphinx.
I have already read the tutorial and tried recognizing from a file; then I tried recognizing from a microphone using the continuous code, but I have some trouble:
- the code is a little hard to understand.
- I have seen that using the Kinect with CMU Sphinx will be different, and I do not know how to handle that.
Can you provide an example of how to use the recognizer with a microphone (C code)?
If it is not too much to ask, can you tell me how to use the Kinect with Sphinx using only C code?
I would prefer not to use an external tool to make this possible.
I thank you in advance for your time and consideration.
You can read the recognize_from_microphone function here:
https://github.com/cmusphinx/pocketsphinx/blob/master/src/programs/continuous.c#L228
Just as with a microphone, you record audio data from the Kinect; you can find an example here:
https://msdn.microsoft.com/en-us/library/jj883681.aspx
Then you pass the recorded audio data into the decoder in a loop.
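For reference, the core of that recognize_from_microphone loop looks roughly like this (condensed from the continuous.c linked above, with error handling trimmed; sleep_msec is a small helper defined in that file):

    ad_rec_t *ad;
    int16 adbuf[2048];
    uint8 utt_started, in_speech;
    int32 k;
    char const *hyp;

    /* Open the capture device at the configured sample rate and start recording. */
    ad = ad_open_dev(cmd_ln_str_r(config, "-adcdev"),
                     (int) cmd_ln_float32_r(config, "-samprate"));
    ad_start_rec(ad);

    ps_start_utt(ps);
    utt_started = FALSE;
    printf("READY....\n");

    for (;;) {
        /* Read up to 2048 samples and feed them to the decoder. */
        k = ad_read(ad, adbuf, 2048);
        ps_process_raw(ps, adbuf, k, FALSE, FALSE);

        in_speech = ps_get_in_speech(ps);
        if (in_speech && !utt_started) {
            utt_started = TRUE;
            printf("Listening...\n");
        }
        if (!in_speech && utt_started) {
            /* Speech-to-silence transition: finish the utterance, print the
             * hypothesis, then restart so the loop keeps listening. */
            ps_end_utt(ps);
            hyp = ps_get_hyp(ps, NULL);
            if (hyp != NULL)
                printf("%s\n", hyp);
            ps_start_utt(ps);
            utt_started = FALSE;
            printf("READY....\n");
        }
        sleep_msec(100);
    }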
Thank you, Nickolay V., for your answer.
I tried to use the code of recognize_from_microphone, and the program is stuck at "READY...".
I have some questions about the way recognize_from_microphone works:
What does
ad_open_dev(cmd_ln_str_r(config, "-adcdev"), (int) cmd_ln_float32_r(config, "-samprate"))
do?
Does it return NULL if no mic is detected? If it is not NULL, does that mean the Kinect mic has been successfully detected? (Sorry if the answer is obvious, but I have to be sure about this.)
In the for (;;) loop, at the line ad_read(ad, adbuf, 2048): is that where I should pass the recorded audio data, or is it more complicated?
Thank you again for your help.
ad_open_dev does not support the Kinect; you have to use the Kinect SDK to capture audio, so ad_open_dev must be replaced with SDK-specific audio recorder initialization.
ad_read reads audio from the device; you have to replace it with your Kinect-specific code for reading the audio.
Audio data is processed with the ps_process_raw function; you pass the audio you recorded with the Kinect into it, as sketched below.
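In other words, the loop keeps the same shape; only the capture calls change. A minimal sketch, assuming a hypothetical kinect_read_samples() wrapper around your Kinect SDK capture code (that function and its name are illustrative, not part of any SDK):

    int16 adbuf[2048];
    int32 k;

    /* Kinect SDK audio capture is initialized elsewhere (see the MSDN sample
     * above). kinect_read_samples() is assumed to fill 'adbuf' with up to
     * 2048 16-bit, 16 kHz, mono samples and return how many it wrote. */
    for (;;) {
        k = kinect_read_samples(adbuf, 2048);   /* replaces ad_read() */
        if (k > 0)
            ps_process_raw(ps, adbuf, k, FALSE, FALSE);
        /* ...then the same ps_get_in_speech()/ps_end_utt() logic as in
         * continuous.c to detect utterance boundaries... */
    }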
Ok,
I use the code provided with the AudioBasics-D2D sample, where the audio data is used to calculate its energy:
    // Calculate energy from audio
    for (UINT i = 0; i < cbProduced; i += 2)
    {
        // Compute the sum of squares of audio samples that will get
        // accumulated into a single energy value.
        short audioSample = static_cast<short>(pProduced[i] | (pProduced[i + 1] << 8));
        ...
        ++m_iAccumulatedSampleCount;
        ...
    }
So each time we enter the loop, I call
    // buffer = Kinect audio data, sBuffer = size of the buffer
    processAudio(buffer, sBuffer);
where buffer is built from the audioSample values.
Then I use them:
    if (sBuffer > 0) // if the buffer is not empty
    {
        ps_process_raw(ps, buffer, sBuffer, FALSE, FALSE);
    }
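One detail worth double-checking here (an assumption on my part; the thread never states it explicitly): ps_process_raw takes the number of 16-bit samples, not the number of bytes, so if sBuffer is a byte count, as cbProduced is in the Kinect sample, it needs to be halved:

    /* ps_process_raw(ps, data, n_samples, no_search, full_utt):
     * n_samples counts int16 samples. If sBuffer is a size in bytes,
     * divide by two before passing it. */
    ps_process_raw(ps, buffer, sBuffer / 2, FALSE, FALSE);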
But the program is stuck at "READY...".
I do not know what is wrong. I guess it never detects when the user starts or stops speaking.
Thank you for your support, I really appreciate it.
Probably there is something wrong with the data you're passing for recognition. You can dump the audio you feed to ps_process_raw and check whether it's OK. Add "-rawlogdir path/to/some/dir" to the pocketsphinx initialization; it will dump the audio being recognized into audio files in the specified dir. Check those too. If it is still unclear what is wrong, share the files.
So the pocketsphinx initialization is:
    config = cmd_ln_init(NULL,
        ps_args(),
        TRUE,
        "-hmm", MODELDIR "/en-us/en-us",
        "-lm", MODELDIR "/en-us/en-us.lm.dmp",
        "-dict", MODELDIR "/en-us/cmudict-en-us.dict",
        "-rawlogdir", "C:/Users/Cen/Desktop/sphinx/pocketsphinx/recognized",
        NULL);
When running the program I get this message: Writing raw audio log file "<the path from the initialization>/000000000.raw",
then it's stuck at READY... and no recognition happens.
I checked my Kinect audio stream information:
// Format of Kinect audio stream
static const WORD AudioFormat = WAVE_FORMAT_PCM;
// Number of channels in Kinect audio stream
static const WORD AudioChannels = 1;
// Samples per second in Kinect audio stream
static const DWORD AudioSamplesPerSecond = 16000;
// Average bytes per second in Kinect audio stream
static const DWORD AudioAverageBytesPerSecond = 32000;
// Block alignment in Kinect audio stream
static const WORD AudioBlockAlign = 2;
// Bits per audio sample in Kinect audio stream
static const WORD AudioBitsPerSample = 16;
And I am wondering whether WAVE_FORMAT_PCM could be the problem.
Thank you.
Ok,
I managed to solve the problem. As you said, it was in the data I processed: I made a mistake with the buffer taken as an argument by my processAudio function.
Now it goes to Listening..., then recognizes, then goes back to READY... and never listens again.
So it recognizes only once each time I run the program.
Do you have an idea what the problem could be?
Thank you.
Please share the code you wrote to make it easier to help you.
Here is my code; I have attached the 2 .cpp files I use for the recognition.
Thank you.
The problem is how you get audio out of SpeechRecognition::processAudio. For some reason you dynamically create an inner buffer:
Where do you free it? You assign the pointer to a new address inside the function:
That has no effect outside the function.
I believe you can just assign the values into buffer directly:
unless cbProduced / 2 <= 4096.
I suggested checking the audio dumped with -rawlogdir. You could have seen the problem if you had opened that 0000.raw file.
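To illustrate the pitfall described above (a minimal sketch with hypothetical names; the poster's actual code was in an attachment not shown here): a pointer parameter is passed by value, so reassigning it inside a function is invisible to the caller.

    #include <stdlib.h>
    #include <string.h>

    /* BROKEN: 'out' is a local copy of the caller's pointer. Reassigning it
     * points only the copy at the new allocation; the caller never sees the
     * data, and the allocation leaks. */
    void process_audio_broken(short *out, const unsigned char *raw, size_t nbytes)
    {
        out = malloc(nbytes);
        memcpy(out, raw, nbytes);
    }

    /* FIXED: write into the buffer the caller already owns. */
    void process_audio_fixed(short *out, const unsigned char *raw, size_t nbytes)
    {
        memcpy(out, raw, nbytes);
    }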
Thank you for your answer; I changed my code as you suggested.
Now it is working continuously.
I have another problem now: it seems to work badly. It does not recognize the words I am saying, and when I am no longer talking it keeps listening and then recognizes something totally wrong.
Do you have any idea what the problem could be?
Record several phrases with -rawlogdir and share them via Dropbox or something; let me check.
If I give you all of this in a .rar archive, is that OK for you?
Sure.
I have run some tests, and it seems to work well with single short words like 'yes', 'no', 'true', 'wrong'. It works badly when the input level is high, and better when I reduce the microphone gain.
I should add that my English pronunciation is not very good; I am sorry for not mentioning that earlier.
The attachments come with this post.
Thank you.
Well, it looks fine to me; I was able to recognize your phrases correctly. To get help on accuracy you'll need to record some audio and run tests to get objective performance numbers: http://cmusphinx.sourceforge.net/wiki/tutorialadapt#testing_the_adaptation. You are using a generic language model. What is your use case? I assume you want to use speech recognition in games (since you're running pocketsphinx with a Kinect). Maybe recognizing with a grammar will be enough for you: http://cmusphinx.sourceforge.net/wiki/tutoriallm#building_language_model
One more thing, you have:
Speech recognition is a resource-consuming task. I'm not sure about the Kinect hardware, but if it isn't powerful enough you'll get gaps in your audio, and that will lead to a drastic accuracy drop. You should remove the sleep statement and check that long phrases are passed for recognition without gaps.
I could not answer earlier because I was away for the weekend.
I am using this recognition for a project, the creation of an embodied conversational agent. We chose the Kinect because it allows us to track the face/body and to record audio for sound-source localization and speech recognition.
The recognition must work for any new user who speaks to the ECA, so using adaptation is not the best option in my opinion.
With the engine Windows provides, the recognizer processes the noise and we get false recognitions, especially on short words like yes and no, words we use in a scenario. Lowering the input audio signal can help improve recognition, but then sound-source localization will not work unless the user speaks very loudly, and we want the user to speak naturally.
With pocketsphinx, false recognitions should not be such a problem. At maximum input audio volume we sometimes get false recognitions, but fewer than before and not on the words we need.
I changed the language model to French and the recognition does not seem to work well.
Is that because the model is not good enough?
Concerning the grammar, I have a couple of questions.
Is creating a grammar a way to restrict the set of words we want the engine to recognize?
Once it is created, should I load it in the config like this:
"-dict", MODELDIR "path/to/grammar.gram"?
Thank you for your help again.
Most likely.
Yes; we also expect phrases of a certain structure, while an LM allows any word transitions.
"-jsgf", not "-dict".
So,
I change the "-lm", "path/to/lm.dmp", with "-jsfg", "path/to/grammar.gram", and tested the recognition in the same conditions than before (noise).
As expected it is quite good and we have less false recognitions compare with the windows engine and I am pretty happy about that, I can only imagine how good it can be with better language model.
I have two questions :
Is it possible to hide all the INFO appearing ?
I would like to see informations concerning the word(s) recognized only.
Is it possible to get the tag information instead of the word recognized?
Thank you again for your time and your diligence.
-logfn /dev/null
Actually, there is no false-recognition suppression in grammar search. If it's possible for you to limit recognition to several words (yes/no, for example), I suggest trying keyword spotting mode. Remember you can switch between searches.
I tried to add -logfn /dev/null in the initialization code:
config = cmd_ln_init(etc...)
and it does not work; I assume I am misunderstanding something.
Concerning the keyword spotting mode: I tried it, and it is not better than the other solution, probably because the language model is not good enough.
To recap how I used the keyword spotting:
I create a keyword.txt file which contains:
yes
no
hello
(all of these in French)
config with the "-lm" argument, then:
ps_set_kws(ps, "keyword", "path/to/keyword.txt")
ps_set_search(ps, "keyword")
I tried to add -logfn /dev/null in the initialization code
I can't say whether you're doing it right without seeing the code. By the way, since you're using the API, you can also try err_set_logfile(char *filename).
I tried it, and it is not better than the other solution
It requires per-keyword threshold tuning, since your keywords are very short (one syllable); see the example below. Check the forum for this; there have been a lot of topics on it recently.
config with the "-lm" argument
I am not sure you actually used kws; the ngram search probably ran instead. To be sure, try the config with "-kws", "path/to/keyword.txt" instead.
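On the threshold tuning mentioned above: in a pocketsphinx keyword list file, each line can carry its own detection threshold between slashes. A sketch (the values are placeholders to be tuned, not recommendations):

    yes /1e-1/
    no /1e-1/
    hello /1e-5/

As a rule of thumb from the CMUSphinx tutorial, shorter keyphrases take thresholds closer to 1 (such as 1e-1) and longer ones take much smaller values (down to about 1e-50); the exact values have to be tuned on test data.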
The code concerning -logfn is:
    config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm", MODELDIR "/broadcastnews",
        "-kws", "path/to/keyword.txt",
        "-dict", MODELDIR "/frenchWords62K.dic",
        "-logfn", "/dev/null",
        NULL);
I used thresholds on the one-syllable words; I am afraid it does not improve the recognition.
Thank you for your help.
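A closing note on the -logfn attempt above (an assumption on my part; the thread does not resolve it): the paths earlier in the thread suggest Windows, where /dev/null does not exist. The Windows null device is named NUL, so the initialization would then be:

    /* Assumption: on Windows the null device is "NUL";
     * "/dev/null" exists only on POSIX systems. */
    config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm", MODELDIR "/broadcastnews",
        "-kws", "path/to/keyword.txt",
        "-dict", MODELDIR "/frenchWords62K.dic",
        "-logfn", "NUL",
        NULL);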