Hello,
I am nearing completion of a training simulator that uses PocketSphinx for voice recognition. I need to tune PocketSphinx so it runs a little quicker and I can maintain smooth, near-real-time operation (a ~3-5 second delay is acceptable). I read the suggestions at http://cmusphinx.sourceforge.net/wiki/pocketsphinxhandhelds and found that I can indeed make PocketSphinx run fast enough for my application; however, I would prefer to have more information on what each parameter does so I'm not just "hitting buttons randomly and hoping for the best result."
My default configuration (that runs a little slow) is:
samprate 8000
fwdflat yes
bestpath yes
hmm hub4wsj_sc_8k
lm and -dict were generated using the online quick tool and a tiny corpus (58 sentences)
I used Very Sleepy to do some profiling and found my top 3 most expensive functions were:
Function (library)                    Exclusive %   Inclusive %
ps_lattice_bestpath (pocketsphinx)         40.36%        52.39%
cont_ad_read (sphinxbase)                  10.79%        11.44%
ps_get_fe (pocketsphinx)                    0.29%         0.29%
Everything else I have control over is less than 0.01%.
Do you have any suggestions on performance improvement without sacrificing too much accuracy? Is there someplace I can read more detail about the different config params so I can make a more informed decision?
Thanks!
Hi
Performance tuning should start with a test database. You need about 50 transcribed utterances to measure accuracy and speed. Do you have something like that?
You can probably disable the bestpath search altogether with -fwdflat no -bestpath no.
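In cmd_ln_init terms, that suggestion is just two flags flipped (a sketch only; pathToHmm, pathToLm, pathToDict, and freq are the asker's own variables from the snippet later in this thread):

```c
/* Configuration fragment with the second-pass searches disabled, per the
 * suggestion above. The path/frequency variables are the application's own. */
speechConfig = cmd_ln_init(NULL, ps_args(), TRUE,
                           "-hmm", pathToHmm,
                           "-lm", pathToLm,
                           "-dict", pathToDict,
                           "-samprate", freq,
                           "-fwdflat", "no",
                           "-bestpath", "no",
                           NULL);
```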
To understand the parameters you need to understand the algorithm used, which is not trivial in itself; the description could take a book. I don't think that's the right way to "understand" things. You could probably just share the database and I will help you choose the right params.
Thank you, you are very kind!
As a quick background: This training simulator is being built for the University of Iowa's Theatre department to train Stage Managers. Each "simulation" is composed of a video of a stage performance and an (optionally muted) audio track that demonstrates EXACTLY (and ONLY contains) what the student is expected to say and when. Would a collection of audio tracks like this be useful to build the test database you mentioned?
How would you define an "utterance" in this context? If an utterance is roughly equivalent to an English sentence, I currently have approximately 30 phrases/sentences (about 10 minutes of recording with lots of silent spaces). However, I am currently only working on a short demo module and a very short tutorial module. The amount of recorded audio I will have in the future will be much larger.
I have attached the current corpus file. It is currently at 50 phrases, but many of the phrases are duplicates with very small variances (e.g. lights vs. light) that are not reflected in the recorded audio. This corpus will be expanded as I build additional simulation modules.
Thanks again!
Yes
It's enough to optimize speed and accuracy. Please share the audio data for this corpus, and also provide information about the models you are using and the exact command line parameters.
I've attached the more interesting of the two audio segments (the one with the least silence). The audio actually starts at 30 seconds and is most "intense" at 4:42. This version is encoded as MP3 to keep the file size down, but I could probably get a wav version tomorrow if needed.
I just use the lm and dict files generated from this corpus at http://www.speech.cs.cmu.edu/tools/lmtool-new.html (I would attach them but don't know how to attach more than 1 file per SF forum post).
The exact params I'm using on cmd_ln_init are:
speechConfig = cmd_ln_init(NULL, ps_args(), TRUE,
                           "-hmm", pathToHmm,   /* the included "hub4wsj_sc_8k" */
                           "-lm", pathToLm,
                           "-dict", pathToDict,
                           "-samprate", freq,   /* currently "8000" */
                           "-fwdflat", "yes",
                           "-bestpath", "yes",
                           NULL);
Last edit: Scorx Ion 2013-03-14
You need to split the database into utterances and put the files in a certain layout. For more details see the testing section of the adaptation tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialadapt#testing_the_adaptation
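For reference, the layout that tutorial describes looks roughly like the sketch below (the filenames and phrases here are made up; the wiki page is authoritative on the exact format). Each test utterance gets its own audio file, an ID line in a fileids file, and a transcription line ending with that ID in parentheses:

```
etc/test.fileids:
  cue_0001
  cue_0002

etc/test.transcription:
  stand by light cue five (cue_0001)
  go light cue five (cue_0002)
```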
Last edit: Nickolay V. Shmyrev 2013-03-14
We are trying to build pocketsphinx version 4 for an embedded platform to enable the speech-assistance feature for our solution. Our objective is command-by-command speech recognition for menu navigation.
For improving the recognition accuracy we did the following, as described in the wiki: modified the language model, modified the dictionary, and adapted the acoustic model.
Our voice commands are standard menu-navigation commands, which are often single-word commands (e.g. "Home", "Next", "Previous", etc.).
But sometimes we observed that the accuracy is poor and the system picks a random word from the dictionary which does not match the voice command.
Kindly suggest whether we are missing something.
Please help.
Please provide the test set you are using to estimate the accuracy.
Please see the attached test sets.
Last edit: Renu K Pillai 2013-07-26
You need to share the audio
Please check the audio file attached.
Last edit: Renu K Pillai 2013-07-29
You need to share whole audio set used for adaptation, not just 1 file.
The corpus file we created contains only one line. The wave file corresponding to that line is arctic_0001.wav. No additional audio files were created since our corpus file has a single line.
An arctic_0001.mfc file was created after executing sphinx_fe.
Please see the attached arctic_0001.mfc.
We have created a new test set whose corpus file contains 28 lines; the 28 corresponding wave files and .mfc files were created. Please see the attached TestData.zip, which contains the corpus files, wave files, mfc files and the mllr_matrix generated.
Last edit: Renu K Pillai 2013-07-30
What is the WER before and after the adaptation?
Hi Nickolay,
I am not able to calculate the WER since the hypothesis file is not generated. Could you suggest a way to generate the hypothesis file using PocketSphinxAndroid?
Also, when we tested speech recognition on two different Android devices (Samsung and Acer) with the same lm and hmm, performance differed.
Kindly suggest a solution for the above issue as well.
Last edit: Renu K Pillai 2013-07-31
WER is unrelated to Android. The WER estimation procedure is described in the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialadapt#testing_the_adaptation
Thank you.
The WER before and after adaptation is the same: 14.925373%.
The only input argument I changed is the hmm model when executing pocketsphinx_batch.
Do I need to change any other input arguments to generate the hyp file for calculating the WER before and after adaptation?
Case with Android devices:
We have installed PocketSphinxAndroidDemo.apk on two Android devices, but the two devices produce different output for the same speech input with the same hmm and lm.
One device (Acer) shows better performance, but performance is poor on the Samsung tablet. Most of the time the Samsung tablet's logcat shows a "Recognition failure" output or some random word from the dictionary that does not match the voice command.
Do we need to calibrate or change any speech parameters depending on the device used?
Also, sometimes the speech command "video" is recognized as "media", or "map" as "back". How can we avoid such issues?
Kindly suggest solutions for the above issues.
Last edit: Renu K Pillai 2013-08-01
It looks like something went wrong with the adaptation. Please provide the adaptation logs. Also share the latest adaptation folder, together with the language model used for testing.
No
To ask about unrelated issues, start separate threads. One question per thread.
Here I am attaching screenshots of the logs after executing mllr_solve and map_adapt.
Also please find attached zip files of the hmm (hub4wsj_sc_8kadapt.zip) and lm. Kindly note that the lm folder contains the lm file, DMP files, and dic files in the en_US folder.
Last edit: Renu K Pillai 2013-08-02
Screenshots are useless. When you ask about a technical problem, learn to provide text logs, not screenshots. To store the output of a tool to a log you can use redirection:
command >& command.log
Or the tee command, to show the log on the output and store it to a file:
command 2>&1 | tee command.log
You need to provide the logs of all the tools, not two random ones.
You need to provide an archive of the whole training folder, not the parts of it.
My adaptation archive file can be downloaded from the link below. Please check:
https://docs.google.com/file/d/0B4G2VoYYkIJMLVhVSW1sQ2c1RUk/edit?usp=sharing
Last edit: Renu K Pillai 2013-08-05
We are trying to develop an Android app which continuously listens to speech using pocketsphinx-android (recognition is not limited to the interval between a button press and release as in PocketSphinxAndroidDemo). Every command switches to a different activity.
But the commands are not recognized as expected. For example, when the command "Camera" is correctly recognized as "CAMERA" it switches to the camera activity, but it is often recognized as "A AN CAMERA" or "A CAMERA" etc., which results in a no-match situation.
Please help.