Hello all,
<p>May I introduce myself quickly: my name is Eakachai Charoenchaimonkon. I'm a blind student doing my master's in Information Management at AIT.</p>
<p>I'm interested in doing research on combining ASR with assistive technologies for visually impaired people like myself, who have worked with a screen reader for several years. I tried Sphinx-4, which is pure Java, and got it running successfully on my PC; however, the application I plan to integrate the ASR feature into is pure VC++. To avoid implementing JNI, I decided instead to learn how to develop a simple ASR application using Sphinx-3, which is fully C++.</p>
<p>So far, I have managed to build Sphinx-3 and SphinxBase under VC++ 2005. However, I don't know how to develop a small application that makes use of the bundled trained models to do simple English speech recognition. For example, Sphinx-4 comes with many demos such as 'HelloDigits' or 'HelloWorld', but I don't think I can find anything like them in Sphinx-3.</p>
<p>A friend of mine advised me to use the WSJ-trained models from Cambridge. I downloaded them from the web, but again I have no idea how to use them. Would anyone in the forum kindly describe what Sphinx3_decoder is and how I can use it, or kindly show me sample code or a small tutorial to help me learn to develop a small ASR application myself? As a first stage, I dream of developing a 'HelloWorld' application closely resembling the one I found in Sphinx-4.</p>
<p>Your kind assistance is highly valued and would shed some light on my research.</p>
After being away from the forum for a few days, I am still confused by the recognition results I get.
My current config file for sphinx3_livedecode is:
-mdef .\model_architecture\wsj_all_cont_3no_8000.mdef
-mean .\model_parameters\wsj_all_cont_3no_8000_16.cd\means
-var .\model_parameters\wsj_all_cont_3no_8000_16.cd\variances
-mixw .\model_parameters\wsj_all_cont_3no_8000_16.cd\mixture_weights
-tmat .\model_parameters\wsj_all_cont_3no_8000_16.cd\transition_matrices
-dict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.dic
-fdict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.filler
-lm .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp_3gram.arpa.dmp
-op_mode 2
-fsg hello.fsg
-hyp rec.hyp
I call sphinx3_livedecode followed by this config file and hit Enter to start recording. A stream of text appears on the screen continuously until I hit Enter again to stop, which returns me to the shell prompt. The best recognition result I got was 'good' (from 'good morning', maybe). Sometimes I get 'Paul', sometimes 'will'. Are there any configuration parameters I should add here, since I only get one word back after recognition? Or did I make some particular mistake?
As an example you can run
sphinx3_livedecode.exe with the argument sphinx3/model/lm/an4/args.an4
It's the analog of the HelloWorld sample from sphinx4. The source of the sample program sphinx3_livedecode is located in
sphinx3/src/programs/main_livedecode.c
For a reference on the sphinx3 decoder API, read
sphinx3/include/s3_decode.h
or online doxygen docs:
http://www.speech.cs.cmu.edu/sphinx/doc/doxygen/sphinx3/s3__decode_8h.html
Thank you very much for your quick reply. However, I think the argument sphinx3/model/lm/an4/args.an4 is not available in my VC++ version. Let me list all the files I have in the an4 folder: args.an4.in, args.an4.test.cls.in, args.an4.test.fsg.in, args.an4.test.in, args.an4.test.mllr.in, and args.an4.test.win32. I also tried calling some of them instead of args.an4, which seems to exist only in the Unix-like version, but it fails at some points.
I'll try forcing a rebuild of the VC++ project solution with the latest version of Sphinx-3 from SVN. I hope it'll work fine.
I used to work with the SphinxSimpleRec project from Cambridge at:
http://www.inference.phy.cam.ac.uk/kv227/simplerec/, but I failed to build PortAudioRecPlay, and the sample code there is not up to date.
args.an4 is just a list of arguments; you can build it yourself, taking args.an4.in as an example. Also look into args.an4.test.win32.
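To illustrate, a hand-built args file is just one decoder flag per line. Here is a minimal sketch; the flag names are the ones used elsewhere in this thread, but the model paths are placeholders, not the real an4 file names:

```text
-mdef model\an4.mdef
-mean model\means
-var model\variances
-mixw model\mixture_weights
-tmat model\transition_matrices
-dict model\an4.dic
-fdict model\fillerdict
-lm model\an4.lm.DMP
```

You would then pass this file as the single argument to sphinx3_livedecode.exe, as shown earlier in the thread.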
> but it fails at some points.
What point exactly? Provide more information if you need help.
Yeah, if I pass /model/lm/an4/args.an4.test.in, it stops and complains: FATAL_ERROR: "mdef.c", line 679: no mdef-file.
If I change the argument from args.an4.test.in to just args.an4.in, the output changes a little, but it is lengthy:
SYSTEM_ERROR: "mdef.c", line 68: fopen(@prefix@/share/sphinx3/model/hmm/hub4_....) failed; no such file or directory.
However, when I tried /model/lm/an4/args.an4.test.win32, the result was slightly different: it printed lots of information, like a Windows console application, and prompted me to hit Enter to start recording. I hit Enter and got a rapid stream of text that I can't easily read through my screen reader. I hit Enter one more time and found out.raw in the Sphinx3 root directory. I don't know how to make use of this raw file, but it reminds me of the one I got from PortAudioRecPlay.
Am I on the right track, and what is the next move? Thank you very much indeed; I feel like a very small and innocent baby in this forum.
> However, when I tried /model/lm/an4/args.an4.test.win32, the result was slightly different: it printed lots of information, like a Windows console application, and prompted me to hit Enter to start recording. I hit Enter and got a rapid stream of text that I can't easily read through my screen reader.
Great, so you are making progress. There is indeed a lot of debugging information printed, but that is not a problem. In this demo you have to press ENTER, say something, then press ENTER again. The recognized text will be printed to the console, among the other output, in the following form:
FWDVIT: TURN LEFT (* 108 6 4Z111232)
Alternatively, add the following line to the args.an4.test.win32 file:
-hyp rec.hyp
then the hypothesis will be written to the rec.hyp file as well.
To avoid such extensive logging, you can try changing the code in sphinxbase/src/libsphinxbase/util/err.c so that it does not write to stderr by default.
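Before patching err.c, note that these diagnostics go to stderr, so you can also divert them at the command line; the `2>` redirection works in cmd.exe as well. The decoder invocation in the comment is illustrative; the runnable lines use a stand-in command to demonstrate the redirection itself:

```shell
# Sphinx's diagnostic chatter is written to stderr, so redirecting stream 2
# keeps the console quiet for a screen reader. The real call would look like:
#   sphinx3_livedecode.exe args.an4.test.win32 2> decode.log
# Stand-in command: one line to stderr (captured), one to stdout (shown):
sh -c 'echo "INFO: lots of debugging output" >&2; echo "FWDVIT: TURN LEFT"' 2> decode.log
cat decode.log   # the diagnostic line ended up in the log, not on the console
```

Only the hypothesis line then remains on the console, while the full log is preserved in decode.log for later inspection.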
Thank you very much for your quick reply and attentive support.
I did that already; I wasn't aware that args.an4.test.win32 could be viewed in Notepad or another text editor. However, I wonder why the recognition accuracy is much lower than what I found with Sphinx-4. It's a long way from the 'HelloDigits' or 'HelloWorld' demos in Sphinx-4. Are there any particular techniques for gaining more accuracy?
I also tried creating a new configuration file to make it work with the WSJ-trained models; however, the accuracy did not improve either:
..\sphinx\sphinx3\bin\release\sphinx3_livedecode new-config.txt
Inside new-config.txt, there are;
-mdef .\model_architecture\wsj_all_cont_3no_8000.mdef
-mean .\model_parameters\wsj_all_cont_3no_8000_16.cd\means
-var .\model_parameters\wsj_all_cont_3no_8000_16.cd\variances
-mixw .\model_parameters\wsj_all_cont_3no_8000_16.cd\mixture_weights
-tmat .\model_parameters\wsj_all_cont_3no_8000_16.cd\transition_matrices
-dict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.dic
-fdict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.filler
-lm .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp_3gram.arpa.dmp
-hyp rec.hyp
The results I got are funny :). Or is my English accent extremely poor?
Well, HelloWorld was an example of very limited-vocabulary recognition, while you are trying a rather big 5k-word vocabulary. If you try to use the same JSGF with sphinx3, it will work nicely as well. To do that, you need to convert the JSGF to FSG with sphinx_jsgf2fsg from sphinxbase and then pass it to the decoder with
-op_mode 2
-fsg your.fsg
If you are interested in large-vocabulary recognition, you probably need to submit a sample of your recording so we can try to reproduce your results.
Thank you again for your quick reply. I accidentally found this online tutorial:
http://sphinx.subwiki.com/sphinx/index.php/Hello_World_Decoder_QuickStart_Guide
Although it runs on Unix, I think it is closely related to my situation.
This guide has one parameter in its config file that is different from mine: -hmm. So I decided to add it here, hoping it wouldn't hurt my recognition results:
-hmm ..\sphinx\sphinx3\model\hmm\hub4_cd_continuous_8gau_1s_c_d_dd
-mdef .\model_architecture\wsj_all_cont_3no_8000.mdef
-mean .\model_parameters\wsj_all_cont_3no_8000_16.cd\means
-var .\model_parameters\wsj_all_cont_3no_8000_16.cd\variances
-mixw .\model_parameters\wsj_all_cont_3no_8000_16.cd\mixture_weights
-tmat .\model_parameters\wsj_all_cont_3no_8000_16.cd\transition_matrices
-dict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.dic
-fdict .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp.sphinx.filler
-lm .\lm_giga_5k_nvp_3gram\lm_giga_5k_nvp_3gram.arpa.dmp
-hyp rec.hyp
As for converting JSGF to FSG with sphinx_jsgf2fsg: I found this tool in ../sphinx/sphinxbase/bin/release, but I don't know how to use it. I searched Google and looked for .JSGF files on my hard drive, but unfortunately found nothing; I have only about 10 .fsg files.
Would you mind if I left you with two questions:
- I can build PortAudioRecPlay from Keith's website. His tool produces a *.raw file after recording my voice. If I record, for example, 'turn left' and save it as 'left.raw', what parameters should I use so that sphinx3_livedecode can read 'left.raw' as input and return the recognition result? Should I use this tool, or preferably switch to sphinx3_livepretend?
- Second, how can I convert with, or let's say use, sphinx_jsgf2fsg in sphinxbase? How does it work? Will it improve recognition accuracy?
> This guide has one parameter in its config file that is different from mine: -hmm. So I decided to add it here, hoping it wouldn't hurt my recognition results:
-hmm is a replacement for -mdef, -mean, -var and so on. If you put everything into one folder (and copy the something.mdef file to just mdef), you can use
-hmm ./model_parameters
instead of -mean ./model_params -var ./model_params and so on. For example, try
-hmm sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd
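For the WSJ setup from the config earlier in this thread, and assuming all its model files were copied into one folder under the canonical names (mdef, means, variances, mixture_weights, transition_matrices), the five separate flags would collapse into a single line; a sketch, with the path taken from that config:

```text
-hmm .\model_parameters\wsj_all_cont_3no_8000_16.cd
```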
> Second, how can I convert with, or let's say use, sphinx_jsgf2fsg in sphinxbase? How does it work? Will it improve recognition accuracy?
In sphinx4, under src/demo/sphinx/HelloWorld, there is hello.gram; it's a finite-state grammar in JSGF format. You can convert it to FSG with
sphinx_jsgf2fsg.exe hello.gram > hello.fsg
and then use hello.fsg in sphinx3. The contents of the FSG file are just text:
FSG_BEGIN <hello.greet>
NUM_STATES 15
START_STATE 0
FINAL_STATE 1
Transitions
TRANSITION 2 4 0.500000 Hello
TRANSITION 4 3 1.000000
TRANSITION 2 5 0.500000 Good
TRANSITION 5 6 1.000000 morning
TRANSITION 6 3 1.000000
TRANSITION 0 2 1.000000
TRANSITION 7 9 0.166667 Will
TRANSITION 9 8 1.000000
TRANSITION 7 10 0.166667 Rita
TRANSITION 10 8 1.000000
TRANSITION 7 11 0.166667 Philip
TRANSITION 11 8 1.000000
TRANSITION 7 12 0.166667 Paul
TRANSITION 12 8 1.000000
TRANSITION 7 13 0.166667 Evandro
TRANSITION 13 8 1.000000
TRANSITION 7 14 0.166667 Bhiksha
TRANSITION 14 8 1.000000
TRANSITION 3 7 1.000000
TRANSITION 8 1 1.000000
FSG_END
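For reference, the hello.gram that produces an FSG like this is a JSGF file along these lines. This is reconstructed from the transition labels above, so treat it as a sketch rather than the exact sphinx4 demo file:

```text
#JSGF V1.0;

grammar hello;

public <greet> = (Good morning | Hello)
                 (Bhiksha | Evandro | Paul | Philip | Rita | Will);
```

Each alternative inside a parenthesized group becomes one branch of transitions in the FSG, with the probability mass split evenly, which matches the 0.5 and 0.166667 weights in the output above.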
> I can build PortAudioRecPlay from Keith's website.
No idea; what problem do you have with building it? You can use Audacity for recording speech, I suppose.
> What parameters should I use so that sphinx3_livedecode can read 'left.raw' as input and return the recognition result?
You need sphinx3_livepretend. Create a ctl file with the list of file names:
test
Record a test.raw (it must be 16 kHz, 16-bit audio), then run livepretend with the arguments:
sphinx3_livepretend.exe test.ctl . test.args
Check sphinx3/src/tests/regression/test-livepretend.sh for a Unix sample.
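To tie this together for the 'left.raw' case from the earlier question, the batch run might be set up as below. The file names come from this thread; the decoder invocation itself is commented out, since it depends on your local build and args file:

```shell
# The control file lists one utterance basename per line, without the .raw
# extension; sphinx3_livepretend appends the extension itself.
printf 'left\n' > test.ctl

# left.raw must be raw 16 kHz, 16-bit mono PCM, sitting in the directory
# given as the second argument to livepretend (here: the current directory).

# Batch decode (uncomment once sphinx3_livepretend.exe and an args file exist):
# sphinx3_livepretend.exe test.ctl . test.args

cat test.ctl   # sanity check: the control file contains just "left"
```

Because livepretend reads files instead of the microphone, the same recording can be decoded repeatedly while you tune the configuration, which makes accuracy comparisons much easier than with live input.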