Hello all,
<p>May I introduce myself quickly: my name is Eakachai Charoenchaimonkon. I'm a blind student doing my master's in Information Management at AIT.</p>
<p>I'm interested in doing research on combining ASR with assistive technologies for visually impaired people like myself, who have worked with a screen reader for several years. I tried Sphinx-4, which is pure Java, and got it running successfully on my PC; however, the application I plan to integrate the ASR feature into is pure VC++. To avoid implementing JNI, I decided instead to learn how to develop a simple ASR application using Sphinx-3, which is fully C++.</p>
<p>So far, I have managed to build Sphinx-3 and SphinxBase under VC++ 2005. However, I don't know how to develop a small application that makes use of the bundled trained models to do simple English speech recognition. For example, Sphinx-4 comes with many demos such as 'HelloDigits' or 'HelloWorld', but I don't think I can find anything like them in Sphinx-3.</p>
<p>A friend of mine advised me to use the WSJ-trained models from Cambridge. I downloaded them from the web, but again I have no idea how to use them. Would anyone in the forum kindly describe what Sphinx3_decoder is and how I can use it, or kindly show me sample code or a small tutorial to help me learn to develop a small ASR application myself? As a first stage, I dream of developing a 'HelloWorld' application closely resembling the one I found in Sphinx-4.</p>
<p>Your kind assistance is highly valued and would shed some light on my research.</p>
After being away from the forum for a few days, I am still confused by the recognition results I get.
My current config file for sphinx3_livedecode is:
-mdef .\model_architecture\wsj_all_cont_3no_8000.mdef
-mean .\model_parameters\wsj_all_cont_3no_8000_16.cd\means
-var .\model_parameters\wsj_all_cont_3no_8000_16.cd\variances
-mixw .\model_parameters\wsj_all_cont_3no_8000_16.cd\mixture_weights
-tmat .\model_parameters\wsj_all_cont_3no_8000_16.cd\transition_matrices
-dict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.dic
-fdict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.filler
-lm .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp_3gram.arpa.dmp
-op_mode 2
-fsg hello.fsg
-hyp rec.hyp
I call sphinx3_livedecode followed by this config file and hit Enter to start recording. A stream of text appears on the screen continuously until I hit Enter again to stop, which returns me to the shell prompt. The best recognition result I got was 'good' (from 'good morning', maybe). Sometimes I get 'Paul', sometimes 'will'. Are there any configuration parameters I should add here, since I only get one word back after recognition? Or did I make some particular mistake?
As an example you can run
sphinx3_livedecode.exe with the argument sphinx3/model/lm/an4/args.an4
It's the analog of the HelloWorld sample from sphinx4. The source of the sample program sphinx3_livedecode is located in
sphinx3/src/programs/main_livedecode.c
For a reference on the sphinx3 decoder API, read
sphinx3/include/s3_decode.h
or online doxygen docs:
http://www.speech.cs.cmu.edu/sphinx/doc/doxygen/sphinx3/s3__decode_8h.html
Thank you very much for your quick reply. However, I think the argument sphinx3/model/lm/an4/args.an4 is not available in my VC++ version. Let me list all the files I have in the an4 folder: args.an4.in, args.an4.test.cls.in, args.an4.test.fsg.in, args.an4.test.in, args.an4.test.mllr.in, and args.an4.test.win32. I also tried calling some of them instead of args.an4, which seems to exist only in the Unix-like version, but it fails at some points.
I'll try forcing a rebuild of the VC++ project solution with the latest version of Sphinx-3 from SVN. I hope it'll work fine.
I used to work with the SphinxSimpleRec project from Cambridge at:
http://www.inference.phy.cam.ac.uk/kv227/simplerec/, but I failed to build PortAudioRecPlay, and the sample code there is not up to date.
args.an4 is just a list of arguments; you can build it yourself, taking args.an4.in as an example. Also look into args.an4.test.win32.
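To illustrate, a hand-built args file is just one decoder flag per line. Here is a minimal sketch; the flag names are the ones used elsewhere in this thread, but the model paths are placeholders, not the real an4 file names:

```text
-mdef model\an4.mdef
-mean model\means
-var model\variances
-mixw model\mixture_weights
-tmat model\transition_matrices
-dict model\an4.dic
-fdict model\fillerdict
-lm model\an4.lm.DMP
```

You would then pass this file as the single argument to sphinx3_livedecode.exe, as shown earlier in the thread.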
> but it fails at some points.
What point exactly? Provide more information if you need help.
Yeah, if I pass /model/lm/an4/args.an4.test.in, it stops and complains: FATAL_ERROR: "mdef.c", line 679: no mdef-file.
If I change the argument from args.an4.test.in to just args.an4.in, the output changes a little, but it is lengthy:
SYSTEM_ERROR: "mdef.c", line 68: fopen(@prefix@/share/sphinx3/model/hmm/hub4_....) failed; no such file or directory.
However, when I tried /model/lm/an4/args.an4.test.win32, the result was slightly different: it printed lots of information, like a Windows console application, and prompted me to hit Enter to start recording. I hit Enter and got a rapid stream of text that I can't easily read through my screen reader. I hit Enter one more time and found out.raw in the Sphinx3 root directory. I don't know how to make use of this raw file, but it reminds me of the one I got from PortAudioRecPlay.
Am I on the right track, and what is the next move? Thank you very much indeed; I feel like a very small and innocent baby in this forum.
> However, when I tried /model/lm/an4/args.an4.test.win32, the result was slightly different: it printed lots of information, like a Windows console application, and prompted me to hit Enter to start recording. I hit Enter and got a rapid stream of text that I can't easily read through my screen reader.
Great, so you are making progress. There is indeed a lot of debugging information printed, but that is not a problem. In this demo you have to press ENTER, say something, then press ENTER again. The recognized text will be printed to the console, among the other output, in the following form:
FWDVIT: TURN LEFT (* 108 6 4Z111232)
Alternatively, add the following line to the args.an4.test.win32 file:
-hyp rec.hyp
then the hypothesis will be written to the rec.hyp file as well.
To avoid such extensive logging, you can try changing the code in sphinxbase/src/libsphinxbase/util/err.c so that it does not write to stderr by default.
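Before patching err.c, note that these diagnostics go to stderr, so you can also divert them at the command line; the `2>` redirection works in cmd.exe as well. The decoder invocation in the comment is illustrative; the runnable lines use a stand-in command to demonstrate the redirection itself:

```shell
# Sphinx's diagnostic chatter is written to stderr, so redirecting stream 2
# keeps the console quiet for a screen reader. The real call would look like:
#   sphinx3_livedecode.exe args.an4.test.win32 2> decode.log
# Stand-in command: one line to stderr (captured), one to stdout (shown):
sh -c 'echo "INFO: lots of debugging output" >&2; echo "FWDVIT: TURN LEFT"' 2> decode.log
cat decode.log   # the diagnostic line ended up in the log, not on the console
```

Only the hypothesis line then remains on the console, while the full log is preserved in decode.log for later inspection.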
Thank you very much for your quick reply and attentive support.
I did that already; I wasn't aware that args.an4.test.win32 could be viewed in Notepad or another text editor. However, I wonder why the recognition accuracy is much lower than what I found with Sphinx-4. It's a long way from the 'HelloDigits' or 'HelloWorld' demos in Sphinx-4. Are there any particular techniques for gaining more accuracy?
I also tried creating a new configuration file to make it work with the WSJ-trained models; however, the accuracy did not improve either:
..\sphinx\sphinx3\bin\release\sphinx3_livedecode new-config.txt
Inside new-config.txt, there are;
-mdef .\model_architecture\wsj_all_cont_3no_8000.mdef
-mean .\model_parameters\wsj_all_cont_3no_8000_16.cd\means
-var .\model_parameters\wsj_all_cont_3no_8000_16.cd\variances
-mixw .\model_parameters\wsj_all_cont_3no_8000_16.cd\mixture_weights
-tmat .\model_parameters\wsj_all_cont_3no_8000_16.cd\transition_matrices
-dict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.dic
-fdict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.filler
-lm .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp_3gram.arpa.dmp
-hyp rec.hyp
The results I got are funny :). Or is my English accent extremely poor?
Well, HelloWorld was an example of very limited-vocabulary recognition, while you are trying a rather big 5k-word vocabulary. If you try to use the same JSGF with sphinx3, it will work nicely as well. To do that, you need to convert the JSGF to FSG with sphinx_jsgf2fsg from sphinxbase and then pass it to the decoder with
-op_mode 2
-fsg your.fsg
If you are interested in large-vocabulary recognition, you probably need to submit a sample of your recording so we can try to reproduce your results.
Thank you again for your quick reply. I accidentally found this online tutorial:
http://sphinx.subwiki.com/sphinx/index.php/Hello_World_Decoder_QuickStart_Guide
Although it runs on Unix, I think it is closely related to my situation.
This guide has one parameter in its config file that is different from mine: -hmm. So I decided to add it here, hoping it wouldn't hurt my recognition results:
-hmm ..\sphinx\sphinx3\model\hmm\hub4_cd_continuous_8gau_1s_c_d_dd
-mdef .\model_architecture\wsj_all_cont_3no_8000.mdef
-mean .\model_parameters\wsj_all_cont_3no_8000_16.cd\means
-var .\model_parameters\wsj_all_cont_3no_8000_16.cd\variances
-mixw .\model_parameters\wsj_all_cont_3no_8000_16.cd\mixture_weights
-tmat .\model_parameters\wsj_all_cont_3no_8000_16.cd\transition_matrices
-dict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.dic
-fdict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.filler
-lm .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp_3gram.arpa.dmp
-hyp rec.hyp
As for converting JSGF to FSG with sphinx_jsgf2fsg: I found this tool in ../sphinx/sphinxbase/bin/release, but I don't know how to use it. I searched Google and looked for .JSGF files on my hard drive, but unfortunately found nothing; I have only about 10 .fsg files.
Would you mind if I left you with two questions:
- I can build PortAudioRecPlay from Keith's website. His tool produces a *.raw file after recording my voice. If I record, for example, 'turn left' and save it as 'left.raw', what parameters should I use so that sphinx3_livedecode can read 'left.raw' as input and return the recognition result? Should I use this tool, or preferably switch to sphinx3_livepretend?
- Second, how can I convert with, or let's say use, sphinx_jsgf2fsg in sphinxbase? How does it work? Will it improve recognition accuracy?
> This guide has one parameter in its config file that is different from mine: -hmm. So I decided to add it here, hoping it wouldn't hurt my recognition results:
-hmm is a replacement for -mdef, -mean, -var and so on. If you put everything into one folder (and copy the something.mdef file to just mdef), you can use
-hmm ./model_parameters
instead of -mean ./model_params -var ./model_params and so on. For example, try
-hmm sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd
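For the WSJ setup from the config earlier in this thread, and assuming all its model files were copied into one folder under the canonical names (mdef, means, variances, mixture_weights, transition_matrices), the five separate flags would collapse into a single line; a sketch, with the path taken from that config:

```text
-hmm .\model_parameters\wsj_all_cont_3no_8000_16.cd
```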
> Second, how can I convert with, or let's say use, sphinx_jsgf2fsg in sphinxbase? How does it work? Will it improve recognition accuracy?
In sphinx4, under src/demo/sphinx/HelloWorld, there is hello.gram; it's a finite-state grammar in JSGF format. You can convert it to FSG with
sphinx_jsgf2fsg.exe hello.gram > hello.fsg
and then use hello.fsg in sphinx3. The contents of the FSG file are just text:
FSG_BEGIN <hello.greet>
NUM_STATES 15
START_STATE 0
FINAL_STATE 1
Transitions
TRANSITION 2 4 0.500000 Hello
TRANSITION 4 3 1.000000
TRANSITION 2 5 0.500000 Good
TRANSITION 5 6 1.000000 morning
TRANSITION 6 3 1.000000
TRANSITION 0 2 1.000000
TRANSITION 7 9 0.166667 Will
TRANSITION 9 8 1.000000
TRANSITION 7 10 0.166667 Rita
TRANSITION 10 8 1.000000
TRANSITION 7 11 0.166667 Philip
TRANSITION 11 8 1.000000
TRANSITION 7 12 0.166667 Paul
TRANSITION 12 8 1.000000
TRANSITION 7 13 0.166667 Evandro
TRANSITION 13 8 1.000000
TRANSITION 7 14 0.166667 Bhiksha
TRANSITION 14 8 1.000000
TRANSITION 3 7 1.000000
TRANSITION 8 1 1.000000
FSG_END
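For reference, the hello.gram that produces an FSG like this is a JSGF file along these lines. This is reconstructed from the transition labels above, so treat it as a sketch rather than the exact sphinx4 demo file:

```text
#JSGF V1.0;

grammar hello;

public <greet> = (Good morning | Hello)
                 (Bhiksha | Evandro | Paul | Philip | Rita | Will);
```

Each alternative inside a parenthesized group becomes one branch of transitions in the FSG, with the probability mass split evenly, which matches the 0.5 and 0.166667 weights in the output above.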
> I can build PortAudioRecPlay from Keith's website.
No idea; what problem do you have with building it? You can use Audacity for recording speech, I suppose.
> What parameters should I use so that sphinx3_livedecode can read 'left.raw' as input and return the recognition result?
You need sphinx3_livepretend. Create a ctl file with the list of file names:
test
Record a test.raw (it must be 16 kHz, 16-bit audio), then run livepretend with the arguments:
sphinx3_livepretend.exe test.ctl . test.args
Check sphinx3/src/tests/regression/test-livepretend.sh for a Unix sample.
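To tie this together for the 'left.raw' case from the earlier question, the batch run might be set up as below. The file names come from this thread; the decoder invocation itself is commented out, since it depends on your local build and args file:

```shell
# The control file lists one utterance basename per line, without the .raw
# extension; sphinx3_livepretend appends the extension itself.
printf 'left\n' > test.ctl

# left.raw must be raw 16 kHz, 16-bit mono PCM, sitting in the directory
# given as the second argument to livepretend (here: the current directory).

# Batch decode (uncomment once sphinx3_livepretend.exe and an args file exist):
# sphinx3_livepretend.exe test.ctl . test.args

cat test.ctl   # sanity check: the control file contains just "left"
```

Because livepretend reads files instead of the microphone, the same recording can be decoded repeatedly while you tune the configuration, which makes accuracy comparisons much easier than with live input.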