Hi All,
I tried to use my own dictionary (English words) and a JSGF file to recognize a set of 50 words using the hub4wsj_sc_8k and en_US models with pocketsphinx_batch, for accuracy testing.
I am getting the error below, and the words are not recognized. When I tried the same thing with pocketsphinx_continuous, the words were recognized. Please let me know how to resolve this issue. Do I have to specify anything explicitly as a command line argument?
INFO: fsg_lextree.c(255): Allocated 9728 bytes (9 KiB) for all lextree nodes
INFO: fsg_lextree.c(258): Allocated 7680 bytes (7 KiB) for lextree leafnodes
INFO: cmn.c(175): CMN: 9.30 0.27 0.02 -0.03 -0.26 0.02 -0.25 -0.03 -0.12 -0.22 -0.16 -0.08 -0.16
INFO: fsg_search.c(1032): 599 frames, 4311 HMMs (7/fr), 21320 senones (35/fr), 598 history entries (0/fr)
ERROR: "fsg_search.c", line 1104: Final result does not match the grammar in frame 599
ERROR: "fsg_search.c", line 1104: Final result does not match the grammar in frame 599
ERROR: "fsg_search.c", line 1104: Final result does not match the grammar in frame 599
INFO: batch.c(760): 14_01_28_27_38_myvoice: 5.99 seconds speech, 0.01 seconds CPU, 0.01 seconds wall
INFO: batch.c(762): 14_01_28_27_38_myvoice: 0.00 xRT (CPU), 0.00 xRT (elapsed)
INFO: batch.c(774): TOTAL 5.99 seconds speech, 0.01 seconds CPU, 0.01 seconds wall
INFO: batch.c(776): AVERAGE 0.00 xRT (CPU), 0.00 xRT (elapsed)
Thanks,
Jack
Last edit: Jack 2014-01-28
To get help on accuracy issues you need to provide test data to reproduce your problems.
Hi,
I tried with hub4wsj_sc_8k and also with the en_US model, and ran into the same issue.
Find my dic, jsgf, and log files below.
~~~~~~~~~~~~~~~~
#JSGF V1.0;
/*
 * JSGF Grammar for directions
 */
grammar direction;
public <result> = (WEST|EAST|NORTH|SOUTH) ENTRANCE | (WEST|EAST) EXIT;
~~~~~~~~~~~~~~~
The dictionary contains,
~~~~~~~~~~~~
EAST IY S T
ENTRANCE EH N T R AH N S
EXIT EH G Z IH T
EXIT(2) EH K S AH T
NORTH N AO R TH
SOUTH S AW TH
WEST W EH S T
~~~~~~~~~~~~~~
and the log is below,
~~~~~~~~~~~~~~~~~~~~~
-dither yes \
-nfilt 20 \
-lowerf 1 \
-upperf 4000 \
-wlen 0.025 \
-transform dct \
-round_filters no \
-remove_dc yes \
-svspec 0-12/13-25/26-38 \
-feat 1s_c_d_dd \
-agc none \
-cmn current \
-varnorm no
Current configuration:
[NAME] [DEFLT] [VALUE]
-agc none none
-agcthresh 2.0 2.000000e+00
-alpha 0.97 9.700000e-01
-ceplen 13 13
-cmn current current
-cmninit 8.0 8.0
-dither no yes
-doublebw no no
-feat 1s_c_d_dd 1s_c_d_dd
-frate 100 100
-input_endian little little
-lda
-ldadim 0 0
-lifter 0 0
-logspec no no
-lowerf 133.33334 1.000000e+00
-ncep 13 13
-nfft 512 512
-nfilt 40 20
-remove_dc no yes
-round_filters yes no
-samprate 16000 1.600000e+04
-seed -1 -1
-smoothspec no no
-svspec 0-12/13-25/26-38
-transform legacy dct
-unit_area yes yes
-upperf 6855.4976 4.000000e+03
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wlen 0.025625 2.500000e-02
INFO: acmod.c(246): Parsed model-specific feature parameters from /home/jack/models/hub4/feat.params
INFO: fe_interface.c(299): You are using the internal mechanism to generate the seed.
INFO: feat.c(713): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
INFO: acmod.c(167): Using subvector specification 0-12/13-25/26-38
INFO: mdef.c(517): Reading model definition: /home/home/jack/hub4wsj_sc_8k/mdef
INFO: mdef.c(528): Found byte-order mark BMDF, assuming this is a binary mdef file
INFO: bin_mdef.c(336): Reading binary model definition: /home/home/jack/hub4wsj_sc_8k/mdef
INFO: bin_mdef.c(513): 50 CI-phone, 143047 CD-phone, 3 emitstate/phone, 150 CI-sen, 5150 Sen, 27135 Sen-Seq
INFO: tmat.c(205): Reading HMM transition probability matrices: /home/home/jack/hub4wsj_sc_8k/transition_matrices
INFO: acmod.c(121): Attempting to use SCHMM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: /home/home/jack/hub4wsj_sc_8k/means
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: /home/home/jack/hub4wsj_sc_8k/variances
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size:
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(294): 256x13
INFO: ms_gauden.c(354): 0 variance values floored
INFO: s2_semi_mgau.c(903): Loading senones from dump file /home/home/jack/hub4wsj_sc_8k/sendump
INFO: s2_semi_mgau.c(927): BEGIN FILE FORMAT DESCRIPTION
INFO: s2_semi_mgau.c(990): Rows: 256, Columns: 5150
INFO: s2_semi_mgau.c(1022): Using memory-mapped I/O for senones
INFO: s2_semi_mgau.c(1296): Maximum top-N: 4 Top-N beams: 0 0 0
INFO: dict.c(317): Allocating 4114 * 32 bytes (128 KiB) for word entries
INFO: dict.c(332): Reading main dictionary: /home/home/jack/hub4wsj_sc_8k/direction.dic
INFO: dict.c(211): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(335): 7 words read
INFO: dict.c(341): Reading filler dictionary: /home/home/jack/hub4wsj_sc_8k/noisedict
INFO: dict.c(211): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(344): 11 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(404): Allocating 50^3 * 2 bytes (244 KiB) for word-initial triphones
INFO: dict2pid.c(131): Allocated 60400 bytes (58 KiB) for word-final triphones
INFO: dict2pid.c(195): Allocated 60400 bytes (58 KiB) for single-phone word triphones
INFO: fsg_search.c(145): FSG(beam: -1080, pbeam: -1080, wbeam: -634; wip: -26, pip: 0)
INFO: jsgf.c(581): Defined rule: PUBLIC <direction.result>
INFO: fsg_model.c(215): Computing transitive closure for null transitions
INFO: fsg_model.c(270): 0 null transitions added
INFO: fsg_model.c(421): Adding silence transitions for <sil> to FSG
INFO: fsg_model.c(441): Added 6 silence word transitions
INFO: fsg_model.c(421): Adding silence transitions for ++NOISE++ to FSG
INFO: fsg_model.c(441): Added 6 silence word transitions
INFO: fsg_model.c(421): Adding silence transitions for ++BREATH++ to FSG
INFO: fsg_model.c(441): Added 6 silence word transitions
INFO: fsg_model.c(421): Adding silence transitions for ++SMACK++ to FSG
INFO: fsg_model.c(441): Added 6 silence word transitions
INFO: fsg_model.c(421): Adding silence transitions for ++COUGH++ to FSG
INFO: fsg_model.c(441): Added 6 silence word transitions
INFO: fsg_model.c(421): Adding silence transitions for ++LAUGH++ to FSG
INFO: fsg_model.c(441): Added 6 silence word transitions
INFO: fsg_model.c(421): Adding silence transitions for ++TONE++ to FSG
INFO: fsg_model.c(441): Added 6 silence word transitions
INFO: fsg_model.c(421): Adding silence transitions for ++UH++ to FSG
INFO: fsg_model.c(441): Added 6 silence word transitions
INFO: fsg_model.c(421): Adding silence transitions for ++UM++ to FSG
INFO: fsg_model.c(441): Added 6 silence word transitions
INFO: fsg_search.c(366): Added 1 alternate word transitions
INFO: fsg_lextree.c(108): Allocated 612 bytes (0 KiB) for left and right context phones
INFO: fsg_lextree.c(253): 80 HMM nodes in lextree (61 leaves)
INFO: fsg_lextree.c(255): Allocated 10240 bytes (10 KiB) for all lextree nodes
INFO: fsg_lextree.c(258): Allocated 7808 bytes (7 KiB) for lextree leafnodes
INFO: cmn.c(175): CMN: 8.70 0.31 0.06 -0.06 -0.23 0.00 -0.20 -0.05 -0.04 -0.13 -0.16 -0.09 -0.12
INFO: fsg_search.c(1032): 599 frames, 4888 HMMs (8/fr), 21239 senones (35/fr), 599 history entries (1/fr)
ERROR: "fsg_search.c", line 1104: Final result does not match the grammar in frame 599
ERROR: "fsg_search.c", line 1104: Final result does not match the grammar in frame 599
ERROR: "fsg_search.c", line 1104: Final result does not match the grammar in frame 599
INFO: batch.c(760): file_14_01_28_06_27_myvoice: 5.99 seconds speech, 0.01 seconds CPU, 0.01 seconds wall
INFO: batch.c(762): file__14_01_28_06_27_myvoice: 0.00 xRT (CPU), 0.00 xRT (elapsed)
INFO: batch.c(774): TOTAL 5.99 seconds speech, 0.01 seconds CPU, 0.01 seconds wall
INFO: batch.c(776): AVERAGE 0.00 xRT (CPU), 0.00 xRT (elapsed)
~~~~~~~~~~~~~~~
Last edit: Nickolay V. Shmyrev 2014-01-29
You need to provide raw audio dumps. You can create them by adding the -rawlogdir <dir> option to the decoder configuration.
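For example, adding a line like the following to the arg file (the directory name here is just illustrative) should make the decoder write the raw audio for each utterance into that directory:

~~~~~~~~~~~~~~~~
-rawlogdir ./rawdumps
~~~~~~~~~~~~~~~~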
Last edit: Nickolay V. Shmyrev 2014-01-28
Hi, I have attached both the raw file from -rawlogdir and the wav file I tried earlier.
The raw file was generated with 0 bytes. Please have a look and let me know how to resolve this.
Many thanks...
I tried to decode your file with sphinxbase/pocketsphinx trunk with the following command:
The result is as expected.
Please make sure you are running the latest version; if not, please update.
Hi,
I tried again with pocketsphinx version 0.8, and this time received the error below (copying the last 10 lines):
INFO: fsg_lextree.c(255): Allocated 6528 bytes (6 KiB) for all lextree nodes
INFO: fsg_lextree.c(258): Allocated 5120 bytes (5 KiB) for lextree leafnodes
INFO: cmn.c(175): CMN: 9.52 0.01 0.10 -0.03 -0.15 -0.00 -0.32 -0.15 -0.10 -0.09 -0.13 -0.08 -0.11
INFO: fsg_search.c(1032): 299 frames, 993 HMMs (3/fr), 7969 senones (26/fr), 298 history entries (0/fr)
ERROR: "fsg_search.c", line 1104: Final result does not match the grammar in frame 299
INFO: batch.c(760): file: 2.99 seconds speech, 0.01 seconds CPU, 0.01 seconds wall
INFO: batch.c(762): file: 0.00 xRT (CPU), 0.00 xRT (elapsed)
INFO: batch.c(774): TOTAL 2.99 seconds speech, 0.01 seconds CPU, 0.01 seconds wall
INFO: batch.c(776): AVERAGE 0.00 xRT (CPU), 0.00 xRT (elapsed)
I checked my JSGF grammar, dic, wav, and mfc files, and everything seems correct. They are attached for your reference.
I also tried pocketsphinx_continuous as you did:
./pocketsphinx_continuous -dict /home/jack/hub4wsj_sc_8k/direction.dic -jsgf /home/jack/hub4wsj_sc_8k/direction.gram -infile /home/home/jack/file.wav
INFO: fsg_model.c(421): Adding silence transitions for ++UM++ to FSG
INFO: fsg_model.c(441): Added 4 silence word transitions
INFO: fsg_search.c(366): Added 1 alternate word transitions
INFO: fsg_lextree.c(108): Allocated 408 bytes (0 KiB) for left and right context phones
INFO: fsg_lextree.c(253): 51 HMM nodes in lextree (40 leaves)
INFO: fsg_lextree.c(255): Allocated 6528 bytes (6 KiB) for all lextree nodes
INFO: fsg_lextree.c(258): Allocated 5120 bytes (5 KiB) for lextree leafnodes
INFO: continuous.c(371): ./pocketsphinx_continuous COMPILED ON: Jan 30 2014, AT: 15:26:59
FATAL_ERROR: "continuous.c", line 153: Failed to calibrate voice activity detection
My personal model (which I trained myself) works well, but I had no success with recognition using hub4wsj_sc_8k.
The dic file contains:
west W EH S T
exit EH G Z IH T
exit(2) EH K S AH T
and the gram file is attached.
Thanks,
jack
Last edit: Jack 2014-01-30
As I wrote above, you need to check out the code from the Subversion trunk.
Hi Nickolay,
I have come across a similar problem to Jack's. My pocketsphinx is checked out from the svn trunk here: http://sourceforge.net/p/cmusphinx/code/HEAD/tree/trunk/
I have 540 test utts, with an FSG and a dictionary. If I run pocketsphinx_batch on these 540 utts, I get 10 utts with empty hyps. The errors look like:
INFO: batch.c(721): Decoding 'Part1/077'
INFO: cmn.c(183): CMN: 66.62 2.24 -2.53 0.03 -2.90 -1.38 -1.78 -0.76 -0.54 -0.63 -0.97 -0.27 -0.85
INFO: fsg_search.c(843): 183 frames, 7395 HMMs (40/fr), 30039 senones (164/fr), 464 history entries (2/fr)
ERROR: "fsg_search.c", line 910: Final result does not match the grammar in frame 183
INFO: batch.c(753): Part1/077: 1.83 seconds speech, 0.04 seconds CPU, 0.04 seconds wall
INFO: batch.c(755): Part1/077: 0.02 xRT (CPU), 0.02 xRT (elapsed)
(Part1/077 -25131)
Part1/077 done --------------------------------------
If I extract these 10 empty utts and run pocketsphinx_batch on them alone, 9 are still empty (1 utt is recognized correctly). This puzzles me. My arg file looks like:
-samprate 16000
-hmm ./acoustic_models/wsj_all_sc.cd_semi_5000
-cmn current
-cmninit 35
-dither no
-adcin yes
-agc none
-cepext .raw
-cepdir Data
-dict dict_123_ah
-fdict wsj_all_sc.cd_semi_5000/noisedict
-fsg ./fsg
-ctl ./test.ctl
-logfn ./fsg.test.parsable.log
-hyp ./fsg.test.parsable.result
Originally I thought there might be some randomness in my arg file, but when I batch decoded the 540 utts again, the results were identical (the word-level and confidence-level outputs are all the same). And I have already set -dither to no.
As you suggested, I used pocketsphinx_continuous to decode the empty utts, using basically the same settings as in batch mode, shown below. Some get correct results but some are still empty.
-samprate 16000
-hmm ./wsj_all_sc.cd_semi_5000
-cmn current
-cmninit 35
-dither no
-adcin yes
-agc none
-dict ./dict_123_ah
-fdict ./wsj_all_sc.cd_semi_5000/noisedict
-fsg ./fsg_123_end_ah_fix
-logfn ./fsg.test.parsable.empty.log
-infile ./468.raw
Thank you very much!
Ming
It is telling you that the decoding result does not match the grammar, so you should update your grammar to make it more flexible; a generic sketch of what that can look like follows.
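For illustration only (this is a generic digits grammar with assumed word names, not Ming's actual FSG), one common way to loosen a fixed-pattern grammar is to accept any non-empty digit sequence, so the best decoding path is less likely to end up outside the grammar:

~~~~~~~~~~~~~~~~
#JSGF V1.0;

grammar digits;

// Any non-empty digit string is in-grammar.
public <digits> = <digit>+;

<digit> = OH | ZERO | ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE;
~~~~~~~~~~~~~~~~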
To use older models with trunk you need to add '-remove_noise no'. Overall, it is better to use the en-us-generic model rather than the older WSJ one.
I also shuffled the 540 test utts and batch decoded them again. Now 15 utts are empty instead of 10. Is it possible that some variables are not being cleared out after one utt is decoded? I hope the problem is just a wrong setting in my arg file.
Here is the location of the 15 empty utts within 540 test set (after shuffle):
79: (Part3/118.raw)
95: (Part1/468.raw)
120: (Part3/087.raw)
134: (Part1/077.raw)
253: (Part1/630.raw)
276: (Part1/674.raw)
302: (Part3/372.raw)
304: (Part2/412.raw)
305: (Part2/384.raw)
328: (Part3/139.raw)
376: (Part3/468.raw)
379: (Part3/050.raw)
426: (Part3/123.raw)
470: (Part3/085.raw)
517: (Part2/486.raw)
Here is the location of the 10 empty utts within 540 test set (before shuffle):
13: (Part1/077.raw)
64: (Part1/413.raw)
76: (Part1/468.raw)
133: (Part1/630.raw)
269: (Part2/384.raw)
309: (Part2/486.raw)
372: (Part3/085.raw)
391: (Part3/139.raw)
452: (Part3/372.raw)
487: (Part3/468.raw)
I don't see a clear pattern in how the position of an utt in the control file affects the decoding result.
Comparing the results of my first decoding and the second (shuffled) decoding: apart from the 5 additional empty utts, the decoding hypotheses differ for about 20 utts.
Thanks a lot!
Ming
Hi Nickolay,
Thanks for your reply!
I tried to set "-remove_noise no" in my arg file with wsj AM and did a batch decoding again.
Running the same arg file twice gives me the same decoding hyps. However, if I shuffle the ctl file and batch decode again, the result is different: one run has 15 empty hyps and the other 18. I used pocketsphinx_continuous (with -remove_noise no) to decode the empty ones one by one; some were recognized correctly this way, but most still produce empty hyps. As for the test set (540 utts), the references are all accepted by my FSG. I listened to the empty ones and the audio seems to be okay.
So I still have these two problems: 1) shuffling the .ctl leads to different decoding results; and 2) FSG-acceptable utterances are decoded as empty.
I tried the en-us-semi-full AM as well (leaving -remove_noise at its default). It reduces the empty hyps a lot (from ~15 to ~5). However, if I shuffle the ctl file, the decoding result is still different, there are still empty hyps, and with pocketsphinx_continuous some of the empty ones can be recognized correctly.
Here is my arg file with en-us-semi-full AM:
-samprate 16000
-hmm ./en-us-semi-full
-cmn current
-cmninit 35
-dither no
-adcin yes
-agc none
-cepext .raw
-cepdir ./Data
-dict ./dict_123_ah
-fdict ./en-us-semi-full/noisedict
-fsg ./fsg_123_end_ah_fix
-ctl ./test_parsable.ctl
-logfn ./fsg.test.parsable.log
-hyp ./fsg.test.parsable.result
So, from my experiments with the WSJ and en-us-semi-full models, there are still three issues: 1) shuffling the ctl file changes the decoding result; 2) some FSG-acceptable utterances are still decoded as empty; 3) decoding those empty utts with pocketsphinx_continuous sometimes recognizes them correctly.
Thanks a lot!
You are welcome to provide the files to get help on this issue.
Hi Nickolay,
Please see the attachment. There are 4 utts to be decoded. The ctl files are 1) the batch ctl and 2) the shuffled batch ctl. The acoustic model I am using is en-us-semi-full; I did not include the AM in the attachment, so please modify -hmm in your arg file. As you will see in the result folder, the normal ctl generates four empty hyps, while the shuffled one has 1 correct result and 3 empty ones. If you decode in continuous mode with args/***continuous.arg, it will generate the correct result for utt 405.
To get good recognition results with your data you can add the following arguments:
To learn more about what they mean, you can read
http://nshmyrev.blogspot.de/2012/01/dealing-with-pruning-issues.html
Hi Nickolay,
I was playing with some parameters, and thanks for letting me know about these important ones.
I am still a bit confused about why shuffling the .ctl file leads to different decoding results. It seems there is some dependency across utterances; this happens with both the ngram LM and the FSG. I double-checked the cmn setting in my arg files and decoding log files: I set it to "current" in the arg files, and in the logs the string 'cmn' co-occurs with 'current' everywhere, yet the CMN values for the same set of utts are still slightly different.
For the original set:
INFO: batch.c(721): Decoding 'data/384'
INFO: cmn.c(183): CMN: 57.12 14.03 2.99 8.80 -18.04 0.89 -7.74 -8.37 -5.73 3.16 2.64 -1.86 3.31
INFO: fsg_search.c(843): 274 frames, 13265 HMMs (48/fr), 32085 senones (117/fr), 1240 history entries (4/fr)
ERROR: "fsg_search.c", line 910: Final result does not match the grammar in frame 274
INFO: batch.c(753): data/384: 2.74 seconds speech, 0.07 seconds CPU, 0.07 seconds wall
INFO: batch.c(755): data/384: 0.02 xRT (CPU), 0.02 xRT (elapsed)
(data/384 119)
data/384 done --------------------------------------
INFO: batch.c(721): Decoding 'data/486'
INFO: cmn.c(183): CMN: 55.54 13.47 5.54 16.30 -18.17 -8.33 -13.05 -2.30 4.92 2.88 1.03 -1.15 1.66
INFO: fsg_search.c(843): 384 frames, 23663 HMMs (61/fr), 51088 senones (133/fr), 2638 history entries (6/fr)
ERROR: "fsg_search.c", line 910: Final result does not match the grammar in frame 384
INFO: batch.c(753): data/486: 3.84 seconds speech, 0.09 seconds CPU, 0.09 seconds wall
INFO: batch.c(755): data/486: 0.02 xRT (CPU), 0.02 xRT (elapsed)
(data/486 119)
data/486 done --------------------------------------
INFO: batch.c(721): Decoding 'data/372'
INFO: cmn.c(183): CMN: 70.65 -4.43 17.07 -7.11 -4.86 -14.88 -1.09 2.18 -2.29 -3.67 -14.99 4.49 -7.11
INFO: fsg_search.c(843): 160 frames, 17876 HMMs (111/fr), 42828 senones (267/fr), 832 history entries (5/fr)
ERROR: "fsg_search.c", line 910: Final result does not match the grammar in frame 160
INFO: batch.c(753): data/372: 1.60 seconds speech, 0.04 seconds CPU, 0.04 seconds wall
INFO: batch.c(755): data/372: 0.02 xRT (CPU), 0.02 xRT (elapsed)
(data/372 119)
data/372 done --------------------------------------
INFO: batch.c(721): Decoding 'data/405'
INFO: cmn.c(183): CMN: 69.57 -12.59 16.85 -6.76 8.07 -11.92 -10.20 -0.49 -5.62 2.56 -10.64 5.17 -7.26
INFO: fsg_search.c(843): 349 frames, 7300 HMMs (20/fr), 23618 senones (67/fr), 858 history entries (2/fr)
For the shuffled set:
INFO: batch.c(721): Decoding 'data/372'
INFO: cmn.c(183): CMN: 72.28 -5.37 18.54 -4.05 -5.02 -15.77 -1.43 1.18 -2.44 -5.57 -16.51 4.75 -7.57
INFO: fsg_search.c(843): 136 frames, 11183 HMMs (82/fr), 27575 senones (202/fr), 491 history entries (3/fr)
ERROR: "fsg_search.c", line 910: Final result does not match the grammar in frame 136
INFO: batch.c(753): data/372: 1.36 seconds speech, 0.03 seconds CPU, 0.03 seconds wall
INFO: batch.c(755): data/372: 0.03 xRT (CPU), 0.03 xRT (elapsed)
(data/372 119)
data/372 done --------------------------------------
INFO: batch.c(721): Decoding 'data/384'
INFO: cmn.c(183): CMN: 57.05 14.34 2.89 8.92 -18.11 0.91 -7.90 -8.36 -5.79 3.00 2.61 -1.87 3.34
INFO: fsg_search.c(843): 272 frames, 12799 HMMs (47/fr), 31539 senones (115/fr), 1225 history entries (4/fr)
ERROR: "fsg_search.c", line 910: Final result does not match the grammar in frame 272
INFO: batch.c(753): data/384: 2.72 seconds speech, 0.06 seconds CPU, 0.06 seconds wall
INFO: batch.c(755): data/384: 0.02 xRT (CPU), 0.02 xRT (elapsed)
(data/384 119)
data/384 done --------------------------------------
INFO: batch.c(721): Decoding 'data/405'
INFO: cmn.c(183): CMN: 67.86 -11.97 16.72 -8.00 7.80 -10.64 -9.14 0.45 -4.88 3.04 -9.56 5.00 -6.40
INFO: fsg_search.c(843): 407 frames, 18828 HMMs (46/fr), 51360 senones (126/fr), 1450 history entries (3/fr)
INFO: batch.c(753): data/405: 4.07 seconds speech, 0.10 seconds CPU, 0.10 seconds wall
INFO: batch.c(755): data/405: 0.02 xRT (CPU), 0.02 xRT (elapsed)
SEVEN SIX FOUR SEVEN FOUR FOUR THREE SIX OH FIVE (data/405 -55335)
data/405 done --------------------------------------
INFO: batch.c(721): Decoding 'data/486'
INFO: cmn.c(183): CMN: 55.51 13.56 5.47 16.33 -18.16 -8.33 -13.06 -2.29 4.92 2.88 1.02 -1.13 1.65
INFO: fsg_search.c(843): 384 frames, 23623 HMMs (61/fr), 51000 senones (132/fr), 2638 history entries (6/fr)
ERROR: "fsg_search.c", line 910: Final result does not match the grammar in frame 384
You can see that for the same utt, the CMN values are different. Is it possible that this causes the different decoding results on the same test set when the ctl order changes?
Also, I am curious why batch and continuous mode produce different results on the same utterance (e.g., batch mode decodes it as empty while continuous mode gives me the correct hyp). Could CMN be the problem there as well?
Thanks again!
Last edit: mings 2014-09-08
Yes, CMN estimation has a big effect on results.
Right: in batch mode CMN is estimated from the whole utterance at once, while in continuous mode an initial estimate is adjusted as new audio arrives.
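To make the distinction concrete, here is a minimal C sketch of the two styles. It is not PocketSphinx's actual code (that lives in sphinxbase's cmn.c, cited in the logs above); the update weight is illustrative:

~~~~~~~~~~~~~~~~
#define NUM_CEP 13  /* matches -ceplen 13 in the logs above */

/* Batch-style CMN: compute the mean over the whole utterance first,
 * then subtract it, so the result does not depend on any previously
 * decoded utterance. */
void cmn_batch(float cep[][NUM_CEP], int nfr)
{
    float mean[NUM_CEP] = {0.0f};
    int i, j;
    for (i = 0; i < nfr; i++)
        for (j = 0; j < NUM_CEP; j++)
            mean[j] += cep[i][j] / nfr;
    for (i = 0; i < nfr; i++)
        for (j = 0; j < NUM_CEP; j++)
            cep[i][j] -= mean[j];
}

/* Continuous-style CMN: normalize each frame with a running mean that
 * is carried across calls and nudged toward the new data, so the
 * estimate depends on all audio seen so far. */
void cmn_live(float cep[][NUM_CEP], int nfr, float mean[NUM_CEP])
{
    const float alpha = 0.005f;  /* illustrative update weight */
    int i, j;
    for (i = 0; i < nfr; i++)
        for (j = 0; j < NUM_CEP; j++) {
            float orig = cep[i][j];
            cep[i][j] = orig - mean[j];
            mean[j] = (1.0f - alpha) * mean[j] + alpha * orig;
        }
}
~~~~~~~~~~~~~~~~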
That's good to know!
So is there any way to enforce that, in batch mode, the CMN estimation of an utterance does not depend on where it appears in the ctl file? I have set "cmn" to "current" already, but it does not help: the CMN value still changes when the ctl file is reordered.
Thanks!
Hi Nickolay,
I still have a problem with CMN: its value is not the same for the same utterance when the order of the ctl file changes. I am wondering if I am missing something in my arg file. I set "cmn" to "current", and the log file confirms that "cmn" is "current". Is it possible that between two utterances some variables are not cleared or reset?
Thanks a lot!
My arg file looks like:
-samprate 16000
-hmm ./en-us-semi-full
-cmn current
-dither no
-adcin yes
-agc none
-cepext .raw
-cepdir ./Data
-dict ./dict_123_ah
-fdict ./en-us-semi-full/noisedict
-fsg ./fsg_123_end_ah_fix
-ctl ./test_parsable.ctl
-logfn ./fsg.test.parsable.log
-hyp ./fsg.test.parsable.result
There are other parameters, such as the noise estimation, which may persist across utterances.
I set remove_noise to no and remove_silence to no, and now it gives me the same hyps/CMN values no matter what the order is. Thanks for your hint! I guess that with both set to no, the system does not estimate the SNR (in the fe_track_snr function) and assumes it is in_speech all the time; otherwise it tracks the SNR, which will differ for the same utt in different orders.
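For reference, these are the two lines that change in the arg file (everything else stays as posted above):

~~~~~~~~~~~~~~~~
-remove_noise no
-remove_silence no
~~~~~~~~~~~~~~~~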
Last edit: mings 2014-09-23
We also recently implemented the ps_start_stream function to reset the SNR estimates. You can use it to get reproducible results.
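A minimal usage sketch, assuming a trunk-era API where ps_start_utt() and ps_get_hyp() take the argument lists shown (they differ slightly between PocketSphinx versions), with file handling kept deliberately simple:

~~~~~~~~~~~~~~~~
#include <stdio.h>
#include <pocketsphinx.h>

/* Decode one raw file with a fresh stream state, so that noise/SNR
 * tracking cannot carry over from previously decoded utterances. */
static void decode_one(ps_decoder_t *ps, const char *raw_path)
{
    int16 buf[512];
    size_t n;
    const char *hyp;
    FILE *fh = fopen(raw_path, "rb");

    if (fh == NULL)
        return;
    ps_start_stream(ps);   /* reset stream-level (SNR) estimates */
    ps_start_utt(ps);
    while ((n = fread(buf, sizeof(int16), 512, fh)) > 0)
        ps_process_raw(ps, buf, n, FALSE, FALSE);
    ps_end_utt(ps);
    fclose(fh);

    hyp = ps_get_hyp(ps, NULL);
    printf("%s: %s\n", raw_path, hyp ? hyp : "(empty)");
}
~~~~~~~~~~~~~~~~

Calling ps_start_stream() before each utterance should make the per-utterance results independent of their order in the ctl file, which is exactly the reproducibility problem discussed in this thread.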