I need to use the Sphinx suite for a real-world speech recognition task that involves a large vocabulary.
For simplicity I skipped training models myself and relied on the resources at
http://www.speech.cs.cmu.edu/sphinx/models/.
I used the HUB4 (broadcast news) acoustic model for wideband (16 kHz) speech (6000 senones, 8-Gaussian continuous density) together with the dictionary, filler dictionary, phone list, and language model from the resources at the above link.
I used those resources to run sphinx3_livedecode.exe with the following parameters.
-dither yes
-feat 1s_c_d_dd
-cmn current
-agc none
-alpha 0.97
-frate 100
-wlen 0.0256
-nfilt 40
-lowerf 133.33334
-upperf 6855.49756
-mdef mdef
-mean means
-var variances
-mixw mixture_weights
-tmat transition_matrices
-dict cmudict.06d
-fdict fillerdict.txt
-lm bn99_64000_lm.arpa
-hyp out.txt
-op_mode 4
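For reference, assembled into a single invocation the parameters above would look roughly like this (a sketch; all model and dictionary files are assumed to sit in the working directory):

```shell
sphinx3_livedecode.exe \
    -dither yes -feat 1s_c_d_dd -cmn current -agc none \
    -alpha 0.97 -frate 100 -wlen 0.0256 -nfilt 40 \
    -lowerf 133.33334 -upperf 6855.49756 \
    -mdef mdef -mean means -var variances \
    -mixw mixture_weights -tmat transition_matrices \
    -dict cmudict.06d -fdict fillerdict.txt \
    -lm bn99_64000_lm.arpa -hyp out.txt -op_mode 4
```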
here is the initial output of the decoder:
+++++++++++++++++++++++++++++++++++++++++++++++++++++
Current configuration:
[NAME] [DEFLT] [VALUE]
-agc none none
-alpha 0.97 9.700000e-001
-backtrace yes yes
-beam 1.0e-55 1.000000e-055
-bestpath no no
-bestpathlw 0.000000e+000
-bestscoredir
-bestsenscrdir
-bghist no no
-bptbldir
-bptblsize 32768 32768
-cb2mllr .1cls. .1cls.
-cep2spec no no
-ceplen 13 13
-ci_pbeam 1e-80 1.000000e-080
-cmn current current
-cond_ds no no
-ctl
-ctlcount 1000000000 1000000000
-ctloffset 0 0
-ctl_lm
-ctl_mllr
-dagfudge 2 2
-dict cmudict.06d
-dist_ds no no
-dither no yes
-doublebw no no
-ds 1 1
-epl 3 3
-fdict fillerdict.txt
-feat 1s_c_d_dd 1s_c_d_dd
-fillpen
-fillprob 0.1 1.000000e-001
-frate 100 100
-fsg
-fsgusealtpron yes yes
-fsgusefiller yes yes
-gs
-gs4gs yes yes
-hmm
-hmmdump no no
-hmmdumpef 200000000 200000000
-hmmdumpsf 200000000 200000000
-hmmhistbinsize 5000 5000
-hyp out.txt
-hypseg
-hypsegscore_unscale yes yes
-inlatdir
-inlatwin 50 50
-input_endian little little
-kdmaxbbi -1 -1
-kdmaxdepth 0 0
-kdtree
-latcompress yes yes
-latext lat.gz lat.gz
-lda
-ldadim 0 0
-lextreedump 0 0
-lifter 0 0
-lm bn99_64000_lm.arpa
-lmctlfn
-lmdumpdir
-lmname
-log3table yes yes
-logbase 1.0003 1.000300e+000
-logspec no no
-lowerf 133.33334 1.333333e+002
-lts_mismatch no no
-lw 9.5 9.500000e+000
-machine_endian little little
-maxcdsenpf 100000 100000
-maxedge 2000000 2000000
-maxhistpf 100 100
-maxhmmpf 20000 20000
-maxhyplen 1000 1000
-maxlmop 100000000 100000000
-maxlpf 40000 40000
-maxppath 1000000 1000000
-maxwpf 20 20
-mdef mdef
-mean means
-min_endfr 3 3
-mixw mixture_weights
-mixwfloor 0.0000001 1.000000e-007
-mllr
-mode fwdtree fwdtree
-nbest 200 200
-nbestdir
-nbestext nbest.gz nbest.gz
-ncep 13 13
-nfft 512 512
-nfilt 40 40
-Nlextree 3 3
-Nstalextree 25 25
-op_mode -1 4
-outlatdir
-outlatfmt s3 s3
-pbeam 1.0e-50 1.000000e-050
-pheurtype 0 0
-phonepen 1.0 1.000000e+000
-phypdump yes yes
-pl_beam 1.0e-80 1.000000e-080
-pl_window 1 1
-ppathdebug no no
-ptranskip 0 0
-rawext .raw .raw
-remove_dc no no
-round_filters yes yes
-samprate 16000 1.600000e+004
-seed -1 -1
-senmgau .cont. .cont.
-silprob 0.1 1.000000e-001
-smoothspec no no
-spec2cep no no
-subvq
-subvqbeam 3.0e-3 3.000000e-003
-svq4svq no no
-tighten_factor 0.5 5.000000e-001
-tmat transition_matrices
-tmatfloor 0.0001 1.000000e-004
-topn 4 4
-tracewhmm
-transform legacy legacy
-treeugprob yes yes
-unit_area yes yes
-upperf 6855.4976 6.855498e+003
-uw 0.7 7.000000e-001
-var variances
-varfloor 0.0001 1.000000e-004
-varnorm no no
-verbose no no
-vqeval 3 3
-warp_params
-warp_type inverse_linear inverse_linear
-wbeam 1.0e-35 1.000000e-035
-wend_beam 1.0e-80 1.000000e-080
-wip 0.7 7.000000e-001
-wlen 0.025625 2.560000e-002
-worddumpef 200000000 200000000
-worddumpsf 200000000 200000000
+++++++++++++++++++++++++++++++++++++++++++++++++++++
With this setup the decoder initializes successfully, but it reports around 800 errors saying that certain words are not found in the dictionary and are being ignored.
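Those warnings usually mean the language model contains words the pronunciation dictionary cannot cover. As a sketch of how one could list them (the file names such as cmudict.06d come from the setup above; the parsing assumes a cmudict-style layout with one entry per line, the word first, and alternates marked like WORD(2)):

```python
# Sketch: find language-model words missing from the pronunciation dictionary,
# the likely source of the ~800 "word not found" warnings.

def load_dict_words(lines):
    """Collect base words from cmudict-style dictionary lines."""
    words = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip comments and blanks
            continue
        word = line.split()[0]
        words.add(word.split("(")[0])  # fold WORD(2) variants into WORD
    return words

def oov_words(lm_vocab, dict_words):
    """Words the language model uses but the dictionary cannot pronounce."""
    return sorted(w for w in lm_vocab if w not in dict_words)

# Illustrative data, not the real files:
dict_lines = ["HELLO  HH AH L OW", "HELLO(2)  HH EH L OW", "WORLD  W ER L D"]
lm_vocab = ["HELLO", "WORLD", "SPHINX"]
print(oov_words(lm_vocab, load_dict_words(dict_lines)))  # ['SPHINX']
```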
However, the recognition accuracy of the system is extremely poor: it recognized only one word ("hello"), once, and in all other runs it did not recognize a single word correctly. I used a fairly good microphone and sound card.
I'm sure I'm doing something wrong here, because a system with a community and support of this size should perform much better. I'm not sure about the configuration or about the audio/dictionary data I used. I'd appreciate some help with this.
Thank you.
It's not quite clear which language model you are using - bn99_64000_lm.arpa? Also, why are you using op_mode 4? It would also be helpful if you could upload a speech sample.
I downloaded the language model from the link specified (it has bigram and trigram probabilities).
I initially didn't specify any op_mode (the default is -1, I think), and the results were poor, so I tried op_mode 3 and 4 as well. With op_mode 3 it constantly gave <s> as the partial hypothesis but never responded to the microphone, no matter how loudly I spoke.
Share your audio then; upload it to RapidShare or MediaFire, for example, and give us a link.
I still don't understand where you got your large model.
Hi,
I uploaded some sample files I'm playing with.
Here is the link:
http://rapidshare.com/files/101998812/up.zip.html
Thanks
Everything is fine with your files. Check my example; they decoded perfectly:
http://www.mediafire.com/?sukdxwdtjdm
The only difference is a proper language model. You have to build your own with lmtool anyhow, and the one you are using is definitely not correct. Again, there is no such file as bn99_64000.arpa on the page you linked to.
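As a sketch of building a small domain language model offline (assuming the CMU-Cambridge SLM toolkit is installed; for small vocabularies the web-based lmtool is the simpler route), the usual pipeline is:

```shell
# corpus.txt: one sentence per line, wrapped in <s> ... </s> markers
text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab
text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
idngram2lm -vocab_type 0 -idngram corpus.idngram \
    -vocab corpus.vocab -arpa corpus.arpa
```

The resulting corpus.arpa can then be passed to the decoder via the -lm flag.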
This is the exact location of the language model I used. It's a large-vocabulary language model:
http://www.speech.cs.cmu.edu/sphinx/models/hub4opensrc_jan2002/language_model.arpaformat.gz
Never use it. Check the page again and download the proper models - wsj1 and lm_giga. Also be aware that with a 60,000-word vocabulary, accuracy won't be better than 60%. You have to use a smaller vocabulary (no more than 20,000 words) to get acceptable results.
Thanks, I'll try the others.
But why doesn't HUB4 work? Is it only suited for Sphinx-4? I've seen that a lot of people have used it.
> but why doesn't hub4 work?
Because it's targeted at broadcast news decoding. That domain is actually very specific: the speakers are mostly professionals, and the vocabulary is unusual too. Every model has its domain of use, and if you attempt to use it for another domain you'll most likely get bad results.
> because i've seen lot of people have used it.
That was reasonable a few years ago; now there are better models like wsj that use new Sphinx-3 features such as MLLT.
Hi,
I've retried with different acoustic/language models, but the results are the same.
Try with the example I uploaded for you first. Then replace the language model with the better one.
Yes, I used your models as well, but the results are the same (100% word error rate even with your models).
Did you try the system with the models you've uploaded?
The following is the content of the output file I got from your models and the audio data I uploaded:
THE (Attention)
TO AND (Completed)
THE (Nice)
THE (Sorry)
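For context, the 100% word error rate mentioned in this thread follows from the standard WER definition: word-level edit distance between reference and hypothesis, divided by the reference length. A minimal sketch (the example strings are illustrative, not taken from the actual transcripts):

```python
# Word error rate via Levenshtein edit distance over word tokens.

def wer(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("attention please", "the"))  # 2 errors / 2 ref words = 1.0
```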
What Sphinx version are you using? Try the recent trunk.
I used a recent version of Sphinx-3, and I think it should be mature enough.
If possible, could you please upload all the files related to this from a working setup of yours (I mean the language models, binaries, libraries, acoustic models, etc.) so that I can try with those? It would be very helpful, as I'm running out of options at the moment.
manusha
Your only option is to be precise. Could you use s3flat? All my files are uploaded. You can check out the source code from svn.
I downloaded those models from
http://www.speech.cs.cmu.edu/sphinx/models/