Hi,
OS : Ubuntu 14.10
CMUclmtk version : v2
I am working on creating a language model for Malayalam (one of the languages of India). I followed the documentation and used the commands from its typical usage section.
The command I ran before the issue:
idngram2lm -idngram a.idngram -vocab ml_corpus.vocab -arpa ml.arpa -ascii_input
(I used the -write_ascii switch while running text2idngram on the ml_corpus.vocab)
I got this as the result:
n : 3
Input file : a.idngram (ascii format)
Output files :
ARPA format : ml.arpa
Vocabulary file : ml_corpus.vocab
Cutoffs :
2-gram : 0 3-gram : 0
Vocabulary type : Open - type 1
Minimum unigram count : 0
Zeroton fraction : 1
Counts will be stored in two bytes.
Count table size : 65535
Discounting method : Good-Turing
Discounting ranges :
1-gram : 1 2-gram : 7 3-gram : 7
Memory allocation for tree structure :
Allocate 100 MB of memory, shared equally between all n-gram tables.
Back-off weight storage :
Back-off weights will be stored in four bytes.
Reading vocabulary.
read_wlist_into_siht: a list of 16179 words was read from "ml_corpus.vocab".
read_wlist_into_array: a list of 16179 words was read from "ml_corpus.vocab".
Allocated space for 5000000 2-grams.
Allocated space for 12500000 3-grams.
Allocated 50000000 bytes to table for 2-grams.
Allocated 50000000 bytes to table for 3-grams.
Processing id n-gram file.
20,000 n-grams processed for each ".", 1,000,000 for each line.
Error in idngram stream. This is most likely to be caused by trying to read
a gzipped file as if it were uncompressed. Ensure that all gzipped files have
a .gz extension. Other causes might be confusion over whether the file is in
ascii or binary format.
I then ran idngram2stats with the command:
idngram2stats -ascii_input <a.idngram>.stats
output:
n = 3
fof_size = 50
Processing id n-gram file.
20,000 n-grams processed for each ".", 1,000,000 for each line.
Error : Repeated ngram in idngram stream.
Since I saw "Error : Repeated ngram in idngram stream.", I opened a.idngram in gedit, and the only thing I saw was:
65535 65535 65535 0
65535 65535 65535 0
65535 65535 65535 0
65535 65535 65535 0
65535 65535 65535 0
65535 65535 65535 0
65535 65535 65535 0
65535 65535 65535 0
65535 65535 65535 0
65535 65535 65535 0
(and so on)
(I did not check the whole file, as scrolling through it took a long time.)
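For a file this large, reading only the first few lines is quicker than opening it in an editor. Below is a minimal sketch, assuming each ASCII line is a run of whitespace-separated word IDs followed by a count, as in the lines shown above. The helper name `summarize_idngram_head` is hypothetical and is not part of CMUclmtk.

```python
from itertools import islice
import os


def summarize_idngram_head(lines, limit=1000):
    """Summarize the first `limit` lines of an ASCII id n-gram stream.

    Assumes each non-blank line is "<id> <id> ... <count>".
    """
    total = 0
    repeated = 0      # lines whose n-gram equals the previous line's n-gram
    zero_counts = 0   # lines whose count field is 0
    distinct = set()
    previous = None
    for line in islice(lines, limit):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        *ids, count = (int(field) for field in fields)
        ngram = tuple(ids)
        total += 1
        if ngram == previous:
            repeated += 1
        if count == 0:
            zero_counts += 1
        distinct.add(ngram)
        previous = ngram
    return {
        "lines": total,
        "distinct": len(distinct),
        "repeated": repeated,
        "zero_counts": zero_counts,
    }


if __name__ == "__main__":
    path = "a.idngram"  # file name taken from the commands in this post
    if os.path.exists(path):
        # Iterating the file object reads lazily, so only the head is loaded.
        with open(path) as handle:
            print(summarize_idngram_head(handle))
```

A high `repeated` value or a non-zero `zero_counts` in the summary would be consistent with the "Repeated ngram" error reported by idngram2stats.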
This is my first CMU Sphinx project and I am no expert in it yet. Can someone help me find out what's wrong?
One more thing: the "a.idngram" file is 572.6GB. Is this normal for an idngram file built from a vocabulary of that many words? (Just curious because of the size.)
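As a rough sanity check on that size, the number of lines in the file can be estimated from the length of one line. This is only a sketch under two assumptions: that "GB" means 10**9 bytes, and that every line looks like the ones shown above. The function name `estimate_line_count` is made up for illustration.

```python
def estimate_line_count(file_size_bytes, sample_line):
    """Estimate the line count of a file whose lines all resemble sample_line."""
    bytes_per_line = len(sample_line.encode("ascii")) + 1  # +1 for the newline
    return file_size_bytes // bytes_per_line


# Assumption: 572.6GB is 572.6 * 10**9 bytes.
file_size_bytes = 572_600_000_000
lines = estimate_line_count(file_size_bytes, "65535 65535 65535 0")
print(f"{lines:,} lines")  # about 28.6 billion lines under these assumptions
```

Under these assumptions the file would hold tens of billions of n-gram entries, which is far more than the 12,500,000 3-grams that idngram2lm reports allocating space for.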
Thank you.
Use SRILM.
Sorry for the late response. I'll try it and report back.
Thanks