Hi,
I am trying to make a language model using the CMU SLMTK. I am using the Windows binaries which are currently available on SourceForge, but I have also compiled and used the Linux distribution, and both give the same problem. I am using the tools as described in the documentation (http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html).
The problem is that the tools only seem to calculate unigrams and fail to calculate the bigrams and trigrams.
I have processed my text in the appropriate way (i.e. <s> some text </s> for each sentence) and have included how I am using each tool below:
text2wfreq.exe < input.txt > file.wfreq
wfreq2vocab.exe < file.wfreq > file.vocab
text2idngram.exe -vocab file.vocab < parsed.txt > file.idngram.gz
idngram2lm.exe -idngram file.idngram.gz -vocab file.vocab -arpa out.arpa
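For reference, the text piped into these tools is laid out one delimited sentence per line, roughly like this (the sentences here are only illustrative):
<s> the cat sat on the mat </s>
<s> she said she will see you later </s>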
The .arpa file is then converted to a .dmp file using sphinx_lm_convert from SphinxBase (again I am using the Windows binaries, but have also tried the Linux version).
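The conversion step is essentially the following command (the same invocation also appears in the log further down):
sphinx_lm_convert.exe -i out.arpa -o out.dmp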
The output of the whole process is given below:
text2wfreq : Reading text from standard input...
text2wfreq : Done.
wfreq2vocab : Will generate a vocabulary containing the most
frequent 20000 words. Reading wfreq stream from stdin...
wfreq2vocab : Done.
text2idngram
Vocab : file.vocab
N-gram buffer size : 100
Hash table size : 2000000
Temp directory : /usr/tmp/
Max open files : 20
FOF size : 10
n : 3
Initialising hash table...
Reading vocabulary...
Allocating memory for the n-gram buffer...
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
Sorting n-grams...
Writing sorted n-grams to temporary file e:\DOCUME~1\gz902298\LOCALS~1\Temp\text2idngram.temp.21
Merging 1 temporary files...
2-grams occurring:    N times    > N times    Sug. -spec_num value
           0                           110                     121
           1              106            4                      14
           2                4            0                      10
           3                0            0                      10
           4                0            0                      10
           5                0            0                      10
           6                0            0                      10
           7                0            0                      10
           8                0            0                      10
           9                0            0                      10
          10                0            0                      10
3-grams occurring:    N times    > N times    Sug. -spec_num value
           0                           116                     127
           1              116            0                      10
           2                0            0                      10
           3                0            0                      10
           4                0            0                      10
           5                0            0                      10
           6                0            0                      10
           7                0            0                      10
           8                0            0                      10
           9                0            0                      10
          10                0            0                      10
text2idngram : Done.
n : 3
Input file : file.idngram.gz (binary format)
Output files :
ARPA format : out.arpa
Vocabulary file : file.vocab
Cutoffs :
2-gram : 0 3-gram : 0
Vocabulary type : Open - type 1
Minimum unigram count : 0
Zeroton fraction : 1
Counts will be stored in two bytes.
Count table size : 65535
Discounting method : Good-Turing
Discounting ranges :
1-gram : 1 2-gram : 7 3-gram : 7
Memory allocation for tree structure :
Allocate 100 MB of memory, shared equally between all n-gram tables.
Back-off weight storage :
Back-off weights will be stored in four bytes.
Reading vocabulary.
read_wlist_into_siht: a list of 83 words was read from "file.vocab".
read_wlist_into_array: a list of 83 words was read from "file.vocab".
WARNING: appears as a vocabulary item, but is not labelled as a context cue.
Allocated space for 5000000 2-grams.
Allocated space for 12500000 3-grams.
table_size 84
Allocated 60000000 bytes to table for 2-grams.
Allocated (2+25000000) bytes to table for 3-grams.
Processing id n-gram file.
20,000 n-grams processed for each ".", 1,000,000 for each line.
'cat' is not recognized as an internal or external command,
operable program or batch file.
Calculating discounted counts.
Warning : 1-gram : f-of-f = 0 --> 1-gram discounting is disabled.
Warning : 2-gram : f-of-f = 0 --> 2-gram discounting is disabled.
Warning : 3-gram : f-of-f = 0 --> 3-gram discounting is disabled.
Unigrams's discount mass is 0 (n1/N = 0)
prob = 1
WARNING: 83 non-context-cue words have zero probability
Incrementing contexts...
Calculating back-off weights...
Warning : P( 0 ) == 1
Warning : Back off weight for <unk>(id 0) is set to 0.
May cause problems with zero probabilities.
Writing out language model...
ARPA-style 3-gram will be written to out.arpa
idngram2lm : Done.
INFO: cmd_ln.c(512): Parsing command line:
e:\Documents and Settings\gz902298\Desktop\Language Modelling\Sphinx Base\sphinx_lm_convert.exe \
-i out.arpa \
-o out.dmp
Current configuration:
-case
-debug 0
-help no no
-i out.arpa
-ienc
-ifmt
-logbase 1.0001 1.000100e+000
-mmap no no
-o out.dmp
-oenc utf8 utf8
-ofmt
INFO: ngram_model_arpa.c(476): ngrams 1=84, 2=1, 3=1
INFO: ngram_model_arpa.c(135): Reading unigrams
INFO: ngram_model_arpa.c(515): 84 = #unigrams created
INFO: ngram_model_arpa.c(194): Reading bigrams
INFO: ngram_model_arpa.c(531): 1 = #bigrams created
INFO: ngram_model_arpa.c(532): 2 = #prob2 entries
INFO: ngram_model_arpa.c(539): 2 = #bo_wt2 entries
INFO: ngram_model_arpa.c(291): Reading trigrams
INFO: ngram_model_arpa.c(552): 1 = #trigrams created
INFO: ngram_model_arpa.c(553): 2 = #prob3 entries
INFO: ngram_model_dmp.c(492): Building DMP model...
INFO: ngram_model_dmp.c(522): 84 = #unigrams created
INFO: ngram_model_dmp.c(621): 1 = #bigrams created
INFO: ngram_model_dmp.c(622): 2 = #prob2 entries
INFO: ngram_model_dmp.c(629): 2 = #bo_wt2 entries
INFO: ngram_model_dmp.c(633): 1 = #trigrams created
INFO: ngram_model_dmp.c(634): 1 = #prob3 entries
The probabilities of all the unigrams are always the same; a sample is shown below:
-99.0000 rustled 0.0000
-99.0000 s 0.0000
-99.0000 said 0.0000
-99.0000 see 0.0000
Regardless of the size of the text, there is always exactly 1 bigram and 1 trigram calculated, which are:
\2-grams:
-0.0000 <unk> <unk> 0.0000
\3-grams:
-0.0000 <unk> <unk> <unk>
I have searched Google and the forum and have not been able to find any information.
Any help would be greatly appreciated. I have tried my best to provide as much information about the problem as possible; please let me know if any other information is required.
Kind Regards,
Marco
Hi
Be careful, this is the obsolete one.
I have a feeling the issue is that you are using the .gz extension here, while you don't actually compress the output of text2idngram. Please try without the .gz extension.
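For example, with the commands from the post above that would just be (nothing else changes):
text2idngram.exe -vocab file.vocab < parsed.txt > file.idngram
idngram2lm.exe -idngram file.idngram -vocab file.vocab -arpa out.arpa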
If it's still broken, please try to build the latest snapshot on Linux and send all the files you created (you can upload them to a public file sharing server and give a link here). Also please provide logs. With the snapshot the commands have changed a bit.
Hi nshmyrev,
Thanks for your quick response :-)
I have built and used the latest version (2.05) on Linux and initially found that it produced the same error.
However, I noticed that the text2idngram function produced an empty .idngram file.
From the logs and the code I found that the function attempts to create a temporary file in the location defined by the -temp argument, which defaults to /usr/tmp/ if not set. As this directory did not exist on my file system, text2idngram encountered an error and produced an empty file, which caused my problem. It was therefore overcome by setting the -temp parameter to a valid folder.
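Concretely, that means pointing -temp at a directory that actually exists and is writable, for example something like this (with /tmp only as an illustration, and the .gz extension dropped as suggested above):
text2idngram -vocab file.vocab -temp /tmp < parsed.txt > file.idngram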
I hope this information helps anyone who encounters the same problem.
This solution did not, however, solve the problem with the Windows binaries.
Kind regards
Marco
Thanks, this temp file issue must be fixed in trunk now.