CMU Sphinx / Forums / Help: ERROR: "ngram_model

daniel chen - 2010-12-14

ERROR: "ngram_model_arpa.c", line 76: No \data\ mark in LM file

I downloaded sphinxbase-0.6.1, pocketsphinx-0.6.1 and sphinxtrain-1.0 from the website:
http://cmusphinx.sourceforge.net/wiki/download/

They compiled without problems, but when I run the pocketsphinx_continuous
project this error occurs:

ERROR: "ngram_model_arpa.c", line 76: No \data\ mark in LM file

It still runs and shows "READY...." and "Listening...", but when I say a
word the ASR result is wrong.

I use the AM and LM from "pocketsphinx-0.6.1\model".
The parameters I pass to the program are:

-hmm D:\PocketSphinx\pocketsphinx-0.6.1\model\hmm\zh\tdt_sc_8k
-lm D:\PocketSphinx\pocketsphinx-0.6.1\model\lm\zh_CN\gigatdt.5000.DMP
-dict D:\PocketSphinx\pocketsphinx-0.6.1\model\lm\zh_CN\mandarin_notone.dic

I tried both Chinese and English, and the ASR results are wrong for both.

I checked pocketsphinx.args:

-hmm ../../../model/hmm/wsj0
-lm ../../../model/lm/turtle/turtle.lm.DMP
-dict ../../../model/lm/turtle/turtle.dic
-ctl ../../../model/lm/turtle/turtle.ctl
-cepdir ../../../model/lm/turtle
-cepext .16k
-adcin TRUE

I can't find the "turtle" directory, so I don't know what is in the file
turtle.ctl.

I also looked at "zh_broadcastnews"; it contains only two files:
zh_broadcastnews_64000_utf8.DMP and zh_broadcastnews_utf8.dic.

Another question:

Should the parameters for the program be given in this format:

-hmm D:\PocketSphinx\pocketsphinx-0.6.1\model\hmm\zh\tdt_sc_8k
-lm D:\PocketSphinx\pocketsphinx-0.6.1\model\lm\zh_CN\gigatdt.5000.DMP
-dict D:\PocketSphinx\pocketsphinx-0.6.1\model\lm\zh_CN\mandarin_notone.dic

or

-hmm ../../../model/hmm/wsj0
-lm ../../../model/lm/turtle/turtle.lm.DMP
-dict ../../../model/lm/turtle/turtle.dic
-ctl ../../../model/lm/turtle/turtle.ctl
-cepdir ../../../model/lm/turtle
-cepext .16k
-adcin TRUE

daniel chen - 2010-12-14

I searched the forum and found a similar topic, "How to try the
pocketsphinx".

I downloaded PocketSphinx and the AM and LM from the website, and I run the
pocketsphinx_continuous project
in VS2008, where the command line is a cfg file containing:

-hmm D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k
-lm D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_64000_utf8.DMP
-dict D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_utf8.dic

When I run it, an error occurs: "ERROR: "ngram_model_arpa.c", line 79: No \data\
mark in LM file".
As nshmyrev said in the topic "How to try the pocketsphinx", the message
ERROR: "ngram_model_arpa.c", line 76: No \data\ mark in LM file
can be ignored.
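
An editorial aside on why this error can be ignored: judging by the log in this post, the loader appears to try the ARPA text reader first (ARPA files introduce their n-gram counts with a line reading \data\), and when that fails the binary DMP reader takes over and loads the model. The following is a minimal sketch, not part of sphinxbase, for checking which kind of file you have. The names looks_like_arpa and file_looks_like_arpa are hypothetical, and scanning only the first 64 KiB is an assumption.

```python
ARPA_MARK = b"\\data\\"  # the line that opens the n-gram counts in an ARPA text LM


def looks_like_arpa(head):
    """Return True if some line in `head` (bytes) is exactly the ARPA mark."""
    return any(line.strip() == ARPA_MARK for line in head.splitlines())


def file_looks_like_arpa(path, max_bytes=65536):
    """Apply the check to the first `max_bytes` of a file (hypothetical helper)."""
    with open(path, "rb") as f:
        return looks_like_arpa(f.read(max_bytes))
```

If file_looks_like_arpa returns False for a .DMP file, that is expected: the file is binary, and the ARPA error only reports the failed first attempt.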

I ignored the message, but the ASR results are all wrong. Is there some
issue I have missed, and how should I test online (live) ASR?

When I use the English AM and LM, the ASR results are wrong too.

Here is part of my test log:

INFO: cmd_ln.c(512): Parsing command line:
\
-hmm D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k \
-lm D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_64000_utf8.DMP \
-dict D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_utf8.dic

Current configuration:

-adcdev
-agc none none
-agcthresh 2.0 2.000000e+000
-alpha 0.97 9.700000e-001
-argfile
-ascale 20.0 2.000000e+001
-backtrace no no
-beam 1e-48 1.000000e-048
-bestpath yes yes
-bestpathlw 9.5 9.500000e+000
-bghist no no
-ceplen 13 13
-cmn current current
-cmninit 8.0 8.0
-compallsen no no
-debug 0
-dict D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_utf8.dic
-dictcase no no
-dither no no
-doublebw no no
-ds 1 1
-fdict
-feat 1s_c_d_dd 1s_c_d_dd
-featparams
-fillprob 1e-8 1.000000e-008
-frate 100 100
-fsg
-fsgusealtpron yes yes
-fsgusefiller yes yes
-fwdflat yes yes
-fwdflatbeam 1e-64 1.000000e-064
-fwdflatefwid 4 4
-fwdflatlw 8.5 8.500000e+000
-fwdflatsfwin 25 25
-fwdflatwbeam 7e-29 7.000000e-029
-fwdtree yes yes
-hmm D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k
-input_endian little little
-jsgf
-kdmaxbbi -1 -1
-kdmaxdepth 0 0
-kdtree
-latsize 5000 5000
-lda
-ldadim 0 0
-lextreedump 0 0
-lifter 0 0
-lm D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_64000_utf8.DMP
-lmctl
-lmname default default
-logbase 1.0001 1.000100e+000
-logfn
-logspec no no
-lowerf 133.33334 1.333333e+002
-lpbeam 1e-40 1.000000e-040
-lponlybeam 7e-29 7.000000e-029
-lw 6.5 6.500000e+000
-maxhmmpf -1 -1
-maxnewoov 20 20
-maxwpf -1 -1
-mdef
-mean
-mfclogdir
-mixw
-mixwfloor 0.0000001 1.000000e-007
-mllr
-mmap yes yes
-ncep 13 13
-nfft 512 512
-nfilt 40 40
-nwpen 1.0 1.000000e+000
-pbeam 1e-48 1.000000e-048
-pip 1.0 1.000000e+000
-pl_beam 1e-10 1.000000e-010
-pl_pbeam 1e-5 1.000000e-005
-pl_window 0 0
-rawlogdir
-remove_dc no no
-round_filters yes yes
-samprate 16000 1.600000e+004
-seed -1 -1
-sendump
-senmgau
-silprob 0.005 5.000000e-003
-smoothspec no no
-svspec
-tmat
-tmatfloor 0.0001 1.000000e-004
-topn 4 4
-topn_beam 0 0
-toprule
-transform legacy legacy
-unit_area yes yes
-upperf 6855.4976 6.855498e+003
-usewdphones no no
-uw 1.0 1.000000e+000
-var
-varfloor 0.0001 1.000000e-004
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wbeam 7e-29 7.000000e-029
-wip 0.65 6.500000e-001
-wlen 0.025625 2.562500e-002

INFO: cmd_ln.c(512): Parsing command line:
\
-nfilt 20 \
-lowerf 1 \
-upperf 4000 \
-wlen 0.025 \
-transform dct \
-round_filters no \
-remove_dc yes \
-feat 1s_c_d_dd \
-svspec 0-12/13-25/26-38 \
-agc none \
-cmn current \
-cmninit 54,-1,2 \
-varnorm no

Current configuration:

-agc none none
-agcthresh 2.0 2.000000e+000
-alpha 0.97 9.700000e-001
-ceplen 13 13
-cmn current current
-cmninit 8.0 54,-1,2
-dither no no
-doublebw no no
-feat 1s_c_d_dd 1s_c_d_dd
-frate 100 100
-input_endian little little
-lda
-ldadim 0 0
-lifter 0 0
-logspec no no
-lowerf 133.33334 1.000000e+000
-ncep 13 13
-nfft 512 512
-nfilt 40 20
-remove_dc no yes
-round_filters yes no
-samprate 16000 1.600000e+004
-seed -1 -1
-smoothspec no no
-svspec 0-12/13-25/26-38
-transform legacy dct
-unit_area yes yes
-upperf 6855.4976 4.000000e+003
-varnorm no no
-verbose no no
-warp_params
-warp_type inverse_linear inverse_linear
-wlen 0.025625 2.500000e-002

INFO: acmod.c(238): Parsed model-specific feature parameters from D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k/feat.params
INFO: feat.c(848): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean= 12.00, mean= 0.0
INFO: acmod.c(163): Using subvector specification 0-12/13-25/26-38
INFO: mdef.c(520): Reading model definition: D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k/mdef
INFO: mdef.c(531): Found byte-order mark BMDF, assuming this is a binary mdef file
INFO: bin_mdef.c(330): Reading binary model definition: D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k/mdef
INFO: bin_mdef.c(508): 70 CI-phone, 65021 CD-phone, 3 emitstate/phone, 210 CI-sen, 5210 Sen, 11271 Sen-Seq
INFO: tmat.c(205): Reading HMM transition probability matrices: D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k/transition_matrices
INFO: acmod.c(117): Attempting to use SCHMM computation module
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k/means
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size 256x13 256x13 256x13
INFO: ms_gauden.c(198): Reading mixture gaussian parameter: D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k/variances
INFO: ms_gauden.c(292): 1 codebook, 3 feature, size 256x13 256x13 256x13
INFO: ms_gauden.c(356): 0 variance values floored
INFO: s2_semi_mgau.c(897): Loading senones from dump file D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k/sendump
INFO: s2_semi_mgau.c(921): BEGIN FILE FORMAT DESCRIPTION
INFO: s2_semi_mgau.c(1016): Using memory-mapped I/O for senones
INFO: s2_semi_mgau.c(1293): Maximum top-N: 4 Top-N beams: 0 0 0
INFO: dict.c(294): Allocating 101598 * 20 bytes (1984 KiB) for word entries
INFO: dict.c(306): Reading main dictionary: D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_utf8.dic
INFO: dict.c(206): Allocated 737 KiB for strings, 977 KiB for phones
INFO: dict.c(309): 97495 words read
INFO: dict.c(314): Reading filler dictionary: D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k/noisedict
INFO: dict.c(206): Allocated 0 KiB for strings, 0 KiB for phones
INFO: dict.c(317): 7 words read
INFO: dict2pid.c(396): Building PID tables for dictionary
INFO: dict2pid.c(405): Allocating 70^3 * 2 bytes (669 KiB) for word-initial triphones
INFO: dict2pid.c(131): Allocated 59080 bytes (57 KiB) for word-final triphones
INFO: dict2pid.c(195): Allocated 59080 bytes (57 KiB) for single-phone word triphones
ERROR: "ngram_model_arpa.c", line 79: No \data\ mark in LM file
INFO: ngram_model_dmp.c(141): Will use memory-mapped I/O for LM file
INFO: ngram_model_dmp.c(195): ngrams 1=63944, 2=16600781, 3=20708460
INFO: ngram_model_dmp.c(241): 63944 = LM.unigrams(+trailer) read
INFO: ngram_model_dmp.c(289): 16600781 = LM.bigrams(+trailer) read
INFO: ngram_model_dmp.c(314): 20708460 = LM.trigrams read
INFO: ngram_model_dmp.c(338): 32337 = LM.prob2 entries read
INFO: ngram_model_dmp.c(357): 24468 = LM.bo_wt2 entries read
INFO: ngram_model_dmp.c(377): 27937 = LM.prob3 entries read
INFO: ngram_model_dmp.c(405): 32424 = LM.tseg_base entries read
INFO: ngram_model_dmp.c(461): 63944 = ascii word strings read
INFO: ngram_search_fwdtree.c(99): 476 unique initial diphones
INFO: ngram_search_fwdtree.c(147): 0 root, 0 non-root channels, 121 single-phone words
INFO: ngram_search_fwdtree.c(186): Creating search tree
INFO: ngram_search_fwdtree.c(191): before: 0 root, 0 non-root channels, 121 single-phone words
INFO: ngram_search_fwdtree.c(324): after: max nonroot chan increased to 87980
INFO: ngram_search_fwdtree.c(333): after: 461 root, 87852 non-root channels, 26 single-phone words
INFO: ngram_search_fwdflat.c(153): fwdflat: min_ef_width = 4, max_sf_win = 25
Allocating 32 buffers of 2500 samples each
INFO: continuous.c(276): d:\PocketSphinx\pocketsphinx\bin\Debug\pocketsphinx_continuous.exe COMPILED ON: Dec 14 2010, AT: 15:34:00

READY....
Listening...
Stopped listening, please wait...
INFO: cmn_prior.c(121): cmn_prior_update: from < 54.00 -1.00 2.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.0
0.00 >
INFO: cmn_prior.c(139): cmn_prior_update: to < 36.39 -4.76 3.55 1.55 -2.98
0.61 -0.52 -1.13 -0.78 0.29 -0.54 -0.4
2.18 >
INFO: ngram_search_fwdtree.c(1513): 2331 words recognized (14/fr)
INFO: ngram_search_fwdtree.c(1515): 396081 senones evaluated (2386/fr)
INFO: ngram_search_fwdtree.c(1517): 2306035 channels searched (13891/fr),
60502 1st, 173412 last
INFO: ngram_search_fwdtree.c(1521): 10332 words for which last channels
evaluated (62/fr)
INFO: ngram_search_fwdtree.c(1524): 816412 candidate words for entering last
phone (4918/fr)
INFO: ngram_search_fwdflat.c(295): Utterance vocabulary contains 46 words
INFO: ngram_search_fwdflat.c(912): 817 words recognized (5/fr)
INFO: ngram_search_fwdflat.c(914): 57668 senones evaluated (347/fr)
INFO: ngram_search_fwdflat.c(916): 78665 channels searched (473/fr)
INFO: ngram_search_fwdflat.c(918): 4991 words searched (30/fr)
INFO: ngram_search_fwdflat.c(920): 2302 word transitions (13/fr)
WARNING: "ngram_search.c", line 1087: </s> not found in last frame, using
<sil> instead
INFO: ngram_search.c(1137): lattice start node <s>.0 end node <sil>.157
INFO: ps_lattice.c(1228): Normalizer P(O) = alpha(<sil>:157:164) = -1243759
INFO: ps_lattice.c(1266): Joint P(O,S) = -1249832 P(S|O) = -6073
000000000: 瀹?(-23769124)
READY....
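
The "Current configuration" tables in the log above print each parameter name followed by its default and then the value in effect, so the settings changed by the model's feature parameters can be picked out mechanically. This is a minimal sketch, assuming whitespace-separated columns as in this log. The name overridden is hypothetical; rows where a column is empty (for example -svspec, which has no default) are skipped because the two columns cannot be told apart, and the 1e-5 tolerance for comparing rounded numbers is an assumption.

```python
import math


def overridden(config_text, rel_tol=1e-5):
    """Return (name, default, value) for rows whose value differs from the default."""
    changed = []
    for line in config_text.splitlines():
        parts = line.split()
        # Only rows with all three columns (name, default, value) can be compared.
        if len(parts) != 3 or not parts[0].startswith("-"):
            continue
        name, default, value = parts
        try:
            # Numbers are printed in two styles (16000 vs 1.600000e+004).
            same = math.isclose(float(default), float(value), rel_tol=rel_tol)
        except ValueError:
            # Non-numeric settings such as yes/no or dct are compared as text.
            same = default == value
        if not same:
            changed.append((name, default, value))
    return changed


# A few rows copied from the second table in the log above.
sample = """\
-nfilt 40 20
-samprate 16000 1.600000e+004
-upperf 6855.4976 4.000000e+003
-cmninit 8.0 54,-1,2
-svspec 0-12/13-25/26-38
"""

for row in overridden(sample):
    print(*row)
```

On the sample rows this reports -nfilt, -upperf and -cmninit as changed and leaves -samprate alone.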

Nickolay V. Shmyrev - 2010-12-14

ASR results are not supposed to be 100% accurate; they always contain some
amount of errors, even more so if the words you are trying to recognize are not
part of the language model.

There are a few issues you can fix to improve accuracy. For example, the
recording level might be too high, or there might be issues with audio input on
Windows, or something else. Try to record speech to a file, make sure the file
has the proper format, and try to recognize from the file first.
You can get an introduction to development with CMUSphinx by reading the
tutorial:

http://cmusphinx.sourceforge.net/wiki/tutorial

P.S. Please avoid discussing the same issue multiple times in different
topics. I don't think it's productive for either of us.
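
Checking that a recording has the proper format can be done before running the recognizer. This is a minimal sketch using Python's standard wave module. The expected values (mono, 16-bit, 16000 Hz) are assumptions based on the -samprate 16000 setting shown in the log earlier in the thread and should be adjusted to whatever your acoustic model expects. The names wav_format and format_problems are hypothetical.

```python
import wave


def wav_format(path):
    """Read channel count, bit depth and sample rate from a WAV file header."""
    with wave.open(path, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "bits": w.getsampwidth() * 8,
            "rate": w.getframerate(),
        }


def format_problems(fmt, channels=1, bits=16, rate=16000):
    """List mismatches between `fmt` and the expected format (assumed defaults)."""
    problems = []
    if fmt["channels"] != channels:
        problems.append(f"expected {channels} channel(s), got {fmt['channels']}")
    if fmt["bits"] != bits:
        problems.append(f"expected {bits}-bit samples, got {fmt['bits']}-bit")
    if fmt["rate"] != rate:
        problems.append(f"expected {rate} Hz, got {fmt['rate']} Hz")
    return problems
```

Calling format_problems(wav_format("test.wav")) on a recording (test.wav is a placeholder name) returns an empty list when the file matches.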

daniel chen - 2010-12-14

Thanks. I say words that are included in the dic file, but they are all
recognized wrongly, never right even once.

Nickolay V. Shmyrev - 2010-12-14

Then it might be a bug with sound input from a microphone on Windows. Add
"-rawlogdir . " to the command line (don't miss the dot) to dump audio during
recognition to the filesystem. Share the audio after that.
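
The dumped .raw files carry no header, so most audio players will not open them directly. This is a minimal sketch that wraps such data in a WAV header with Python's standard wave module, assuming 16-bit little-endian mono samples at 16000 Hz (matching the -samprate and -input_endian values in the log earlier in the thread; adjust them if your setup differs). The names pcm_to_wav_bytes and raw_file_to_wav are hypothetical.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, rate=16000, channels=1, sample_width=2):
    """Return the bytes of a WAV file containing the given raw PCM data."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width)  # bytes per sample: 2 means 16-bit
        w.setframerate(rate)
        w.writeframes(pcm)
    return buf.getvalue()


def raw_file_to_wav(raw_path, wav_path, **fmt):
    """Convert one headerless .raw file to .wav (format values are assumptions)."""
    with open(raw_path, "rb") as src:
        pcm = src.read()
    with open(wav_path, "wb") as dst:
        dst.write(pcm_to_wav_bytes(pcm, **fmt))
```

If the converted file sounds too fast, too slow or like noise, the assumed rate or sample format is probably wrong for that recording.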

daniel chen - 2010-12-14

I added "-rawlogdir . " to the command line:
D:\PocketSphinx\cfg_zh.txt -rawlogdir .

Error output:

INFO: feat.c(848): Initializing feature stream to type
INFO: cmn.c(142): mean= 12.00, mean= 0.0
ERROR: "acmod.c", line 84: Must specify -mdef or -hmm

But I have already set -hmm in cfg_zh.txt:

-hmm D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k
-lm D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_64000_utf8.DMP
-dict D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_utf8.dic

daniel chen - 2010-12-14

I figured out the command-line parameters:

D:\PocketSphinx\demo>pocketsphinx_continuous.exe -hmm
D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k -lm
D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_64000_utf8.DMP -dict
D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\zh_broadcastnews_utf8.dic -rawlogdir .
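
Judging by the two posts above, appending -rawlogdir . after the argument file did not work, while writing every option out on the command line did. The thread does not explain why. One workaround is to expand the file into explicit arguments yourself. This is a minimal sketch, assuming options and values are separated by whitespace and that no path contains spaces; build_command is a hypothetical helper, not part of PocketSphinx.

```python
def read_arg_file(text):
    """Split the contents of an argument file into a flat list of tokens.

    Plain whitespace splitting keeps Windows backslashes intact, but it
    would break a path that contains spaces (assumed not to occur here).
    """
    return text.split()


def build_command(exe, arg_file_text, extra=()):
    """Return the full argument list: executable, file options, extra options."""
    return [exe] + read_arg_file(arg_file_text) + list(extra)


cfg = """\
-hmm D:\\PocketSphinx\\pocketsphinx\\model\\hmm\\zh\\tdt_sc_8k
-lm D:\\PocketSphinx\\pocketsphinx\\model\\lm\\zh_CN\\zh_broadcastnews_64000_utf8.DMP
-dict D:\\PocketSphinx\\pocketsphinx\\model\\lm\\zh_CN\\zh_broadcastnews_utf8.dic
"""

cmd = build_command("pocketsphinx_continuous.exe", cfg, ["-rawlogdir", "."])
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd) to start the recognizer with all options spelled out.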

daniel chen - 2010-12-14

How do I upload the audio file here? Thanks.

daniel chen - 2010-12-15

I checked the audio file and I think the audio is OK. How do I upload it
here? Thanks!

Nickolay V. Shmyrev - 2010-12-15

You can use a public file sharing service to share the files.

daniel chen - 2010-12-15

I can't find a link for sharing files. Would you please give me a
link? Thanks.

Nickolay V. Shmyrev - 2010-12-15

http://mediafire.com

daniel chen - 2010-12-15

I shared the audio files here:

http://www.mediafire.com/file/6p5oxwk862soest/sound_data.rar

http://www.mediafire.com/?6p5oxwk862soest

I ran online ASR, but none of the results is right. Is there any issue? Thanks.

Nickolay V. Shmyrev - 2010-12-15

Well, it seems it returns close results, for example 00000005.raw is
recognized as:

000000000: 林口

Isn't it a good result? I don't understand Chinese, but it seems pretty close
in pinyin to what is pronounced:

lín kǒu

daniel chen - 2010-12-16

Hi nshmyrev,
The data I shared before was recorded with the English AM and LM.

I ran online ASR again in both English and Chinese, but the results are
wrong in both languages. They are seldom right.

Testing in English:
pocketsphinx_continuous.exe -hmm
D:\PocketSphinx\pocketsphinx\model\hmm\en_US\hub4wsj_sc_8k -lm
D:\PocketSphinx\pocketsphinx\model\lm\en_US\hub4.5000.DMP -dict
D:\PocketSphinx\pocketsphinx\model\lm\en_US\cmu07a.dic -rawlogdir .

I took the following words as examples:
ability, about, equipment, month, morning, please, yellow.

None of them was recognized correctly.

Testing in Chinese:

pocketsphinx_continuous.exe -hmm
D:\PocketSphinx\pocketsphinx\model\hmm\zh\tdt_sc_8k -lm
D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\gigatdt.5000.DMP -dict
D:\PocketSphinx\pocketsphinx\model\lm\zh_CN\mandarin_notone.dic -rawlogdir .
I took the following words as examples:

全国 , 公共事业 , 国家 , 国际足联 , 图书馆 , 国际能源 , 圆周率 , 土匪, 圣诞节, 在某种程度上, 地方各级人民政府, 地板.

The ASR results are also wrong.

PS: I asked another person to test Chinese online ASR, and some words were
recognized correctly. Does this mean the issue is related to
the Chinese acoustic model? Is the AM insufficient, so that I need to
train the AM?

I checked the audio file with the Cool Edit software; its parameters are 16k, 16 bits, mono.

I think the voice signal strength is sufficient for ASR. I think an SN
between 3000 and 5000 is suitable for ASR. Do you agree?
If not, what level of voice SN is suitable for ASR? Thanks!
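
If "SN" here refers to the sample amplitude shown in an audio editor (an assumption; the thread does not say), the level of a dumped recording can be measured directly. This is a minimal sketch that reports the peak and RMS amplitude of 16-bit little-endian PCM data, where full scale is 32767. It only measures; the thread gives no figure for which level suits the recognizer. The name pcm16_levels is hypothetical.

```python
import array
import math
import sys


def pcm16_levels(pcm):
    """Return (peak, rms) for raw 16-bit little-endian PCM bytes."""
    samples = array.array("h")  # signed 16-bit integers
    # Drop a trailing odd byte, if any, so the data divides into whole samples.
    samples.frombytes(pcm[: len(pcm) - len(pcm) % 2])
    if sys.byteorder != "little":
        samples.byteswap()  # the data is little-endian regardless of the host
    if not samples:
        return 0, 0.0
    peak = max(abs(s) for s in samples)
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return peak, rms
```

To use it on a dumped file, read the file in binary mode and pass the bytes in, for example pcm16_levels(open("00000005.raw", "rb").read()). A peak at or near 32767 would suggest clipping, which is one way a recording level can be too high.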

The new audio files (both English and Chinese) are at:
http://www.mediafire.com/?lzouu8jv8ls5y9x
http://www.mediafire.com/file/lzouu8jv8ls5y9x/sound_data-1216.rar

Thanks!

===============================================================
MSN: danielchendc@live.cn
E-Mail: danielchendc@yahoo.com.cn

daniel chen - 2010-12-16

I run online ASR with PocketSphinx on Windows. Does this affect the ASR
results?
