Hi,
I am trying to make a language model using the CMU SLMTK. I am using the Windows binaries which are currently available on SourceForge, but I have also compiled and used the Linux distribution, and both give the same problem. I am using the tools as described in the documentation (http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html).
The problem is that the tools only seem to calculate unigrams and fail to calculate the bigrams and trigrams.
I have processed my text in the appropriate way (i.e. <s> some text </s> for each sentence) and have included how I am using each tool below:
text2wfreq.exe < input.txt > file.wfreq
wfreq2vocab.exe < file.wfreq > file.vocab
text2idngram.exe -vocab file.vocab < parsed.txt > file.idngram.gz
idngram2lm.exe -idngram file.idngram.gz -vocab file.vocab -arpa out.arpa
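For reference, the text piped into these tools is laid out one delimited sentence per line, roughly like this (the sentences here are only illustrative):
<s> the cat sat on the mat </s>
<s> she said she will see you later </s>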
The .arpa file is then converted to a .dmp file using sphinx_lm_convert from SphinxBase (again I am using the Windows binaries, but have also tried the Linux version).
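The conversion step is essentially the following command (the same invocation also appears in the log further down):
sphinx_lm_convert.exe -i out.arpa -o out.dmp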
The output of the whole process is given below:
text2wfreq : Reading text from standard input...
text2wfreq : Done.
wfreq2vocab : Will generate a vocabulary containing the most
frequent 20000 words. Reading wfreq stream from stdin...
wfreq2vocab : Done.
text2idngram
Vocab : file.vocab
N-gram buffer size : 100
Hash table size : 2000000
Temp directory : /usr/tmp/
Max open files : 20
FOF size : 10
n : 3
Initialising hash table...
Reading vocabulary...
Allocating memory for the n-gram buffer...
Reading text into the n-gram buffer...
20,000 n-grams processed for each ".", 1,000,000 for each line.
Sorting n-grams...
Writing sorted n-grams to temporary file e:\DOCUME~1\gz902298\LOCALS~1\Temp\text2idngram.temp.21
Merging 1 temporary files...
2-grams occurring:    N times    > N times    Sug. -spec_num value
           0                           110                     121
           1              106            4                      14
           2                4            0                      10
           3                0            0                      10
           4                0            0                      10
           5                0            0                      10
           6                0            0                      10
           7                0            0                      10
           8                0            0                      10
           9                0            0                      10
          10                0            0                      10
3-grams occurring:    N times    > N times    Sug. -spec_num value
           0                           116                     127
           1              116            0                      10
           2                0            0                      10
           3                0            0                      10
           4                0            0                      10
           5                0            0                      10
           6                0            0                      10
           7                0            0                      10
           8                0            0                      10
           9                0            0                      10
          10                0            0                      10
text2idngram : Done.
n : 3
Input file : file.idngram.gz (binary format)
Output files :
ARPA format : out.arpa
Vocabulary file : file.vocab
Cutoffs :
2-gram : 0 3-gram : 0
Vocabulary type : Open - type 1
Minimum unigram count : 0
Zeroton fraction : 1
Counts will be stored in two bytes.
Count table size : 65535
Discounting method : Good-Turing
Discounting ranges :
1-gram : 1 2-gram : 7 3-gram : 7
Memory allocation for tree structure :
Allocate 100 MB of memory, shared equally between all n-gram tables.
Back-off weight storage :
Back-off weights will be stored in four bytes.
Reading vocabulary.
read_wlist_into_siht: a list of 83 words was read from "file.vocab".
read_wlist_into_array: a list of 83 words was read from "file.vocab".
WARNING: appears as a vocabulary item, but is not labelled as a context cue.
Allocated space for 5000000 2-grams.
Allocated space for 12500000 3-grams.
table_size 84
Allocated 60000000 bytes to table for 2-grams.
Allocated (2+25000000) bytes to table for 3-grams.
Processing id n-gram file.
20,000 n-grams processed for each ".", 1,000,000 for each line.
'cat' is not recognized as an internal or external command,
operable program or batch file.
Calculating discounted counts.
Warning : 1-gram : f-of-f = 0 --> 1-gram discounting is disabled.
Warning : 2-gram : f-of-f = 0 --> 2-gram discounting is disabled.
Warning : 3-gram : f-of-f = 0 --> 3-gram discounting is disabled.
Unigrams's discount mass is 0 (n1/N = 0)
prob = 1
WARNING: 83 non-context-cue words have zero probability
Incrementing contexts...
Calculating back-off weights...
Warning : P( 0 ) == 1
Warning : Back off weight for <unk>(id 0) is set to 0.
May cause problems with zero probabilities.
Writing out language model...
ARPA-style 3-gram will be written to out.arpa
idngram2lm : Done.
INFO: cmd_ln.c(512): Parsing command line:
e:\Documents and Settings\gz902298\Desktop\Language Modelling\Sphinx Base\sphinx_lm_convert.exe \
-i out.arpa \
-o out.dmp
Current configuration:
-case
-debug 0
-help no no
-i out.arpa
-ienc
-ifmt
-logbase 1.0001 1.000100e+000
-mmap no no
-o out.dmp
-oenc utf8 utf8
-ofmt
INFO: ngram_model_arpa.c(476): ngrams 1=84, 2=1, 3=1
INFO: ngram_model_arpa.c(135): Reading unigrams
INFO: ngram_model_arpa.c(515): 84 = #unigrams created
INFO: ngram_model_arpa.c(194): Reading bigrams
INFO: ngram_model_arpa.c(531): 1 = #bigrams created
INFO: ngram_model_arpa.c(532): 2 = #prob2 entries
INFO: ngram_model_arpa.c(539): 2 = #bo_wt2 entries
INFO: ngram_model_arpa.c(291): Reading trigrams
INFO: ngram_model_arpa.c(552): 1 = #trigrams created
INFO: ngram_model_arpa.c(553): 2 = #prob3 entries
INFO: ngram_model_dmp.c(492): Building DMP model...
INFO: ngram_model_dmp.c(522): 84 = #unigrams created
INFO: ngram_model_dmp.c(621): 1 = #bigrams created
INFO: ngram_model_dmp.c(622): 2 = #prob2 entries
INFO: ngram_model_dmp.c(629): 2 = #bo_wt2 entries
INFO: ngram_model_dmp.c(633): 1 = #trigrams created
INFO: ngram_model_dmp.c(634): 1 = #prob3 entries
The probabilities of all the unigrams are always the same; a sample is shown below:
-99.0000 rustled 0.0000
-99.0000 s 0.0000
-99.0000 said 0.0000
-99.0000 see 0.0000
Regardless of the size of the text, there is always exactly 1 bigram and 1 trigram calculated, which are:
\2-grams:
-0.0000 <unk> <unk> 0.0000
\3-grams:
-0.0000 <unk> <unk> <unk>
I have searched Google and the forum and have not been able to find any information.
Any help would be greatly appreciated. I have tried my best to provide as much information about the problem as possible; please let me know if any other information is required.
Kind Regards,
Marco
Hi
Be careful, this is the obsolete one.
I have a feeling the issue is that you are using the .gz extension here, while you don't actually compress the output of text2idngram. Please try without the .gz extension.
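For example, with the commands from the post above that would just be (nothing else changes):
text2idngram.exe -vocab file.vocab < parsed.txt > file.idngram
idngram2lm.exe -idngram file.idngram -vocab file.vocab -arpa out.arpa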
If it's still broken, please try to build the latest snapshot on Linux and send all the files you created (you can upload them to a public file sharing server and give a link here). Also please provide logs. With the snapshot the commands have changed a bit.
Hi nshmyrev,
Thanks for your quick response :-)
I have built and used the latest version (2.05) on Linux and initially found that it produced the same error.
However, I noticed that the text2idngram function produced an empty .idngram file.
From the logs and the code I found that the function attempts to create a temporary file in the location defined by the -temp argument, which defaults to /usr/tmp/ if not set. As this directory did not exist on my file system, text2idngram encountered an error and produced an empty file, which caused my problem. It was therefore overcome by setting the -temp parameter to a valid folder.
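Concretely, that means pointing -temp at a directory that actually exists and is writable, for example something like this (with /tmp only as an illustration, and the .gz extension dropped as suggested above):
text2idngram -vocab file.vocab -temp /tmp < parsed.txt > file.idngram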
I hope this information helps anyone who encounters the same problem.
This solution did not, however, solve the problem with the Windows binaries.
Kind regards
Marco
Thanks, this temp file issue must be fixed in trunk now.