Hello everyone,
I want to create a language model with cmuclmtk for a large vocabulary, but when I
use wngram2idngram with "write_ascii" it gives the error "Write error
encountered while attempting to merge temporary files", and when I use it
without "write_ascii" it gives the error "rr_fwrite: problems writing n-gram
ids. Only 0 of 1 elements were written".
How can I solve this? I changed the buffer size and the number of files, but it
still gives the same error.
I also noticed that every time a .idngram file of exactly 2 GB is created,
whether or not I use write_ascii and regardless of how many words are in my
vocabulary. Is there a default upper limit on file size that could be causing
this error?
Waiting for your reply.
Regards,
Anurag
Could you paste the command that you used? Also, perhaps you can upload the
related files somewhere (with a README) so that others can check whether the
problem can be reproduced.
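The fact that the file stops at exactly 2 GB also sounds like a 32-bit file-offset limit. As a quick check, and this is only a guess assuming a glibc-based Linux, you can ask the platform for its large-file compile flags and rebuild the toolkit with them:
# print the flags needed for 64-bit file offsets (typically -D_FILE_OFFSET_BITS=64)
getconf LFS_CFLAGS
# rebuild cmuclmtk with those flags
CFLAGS="$(getconf LFS_CFLAGS)" ./configure
make && sudo make install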
Hi,
I used the following command to generate the idngram file:
wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii <languagemodel.wngram> languagemodel.idngram
But I got the following error:
Vocab : languagemodel.vocab
Output idngram : languagemodel.idngram
Buffer size : 100
Hash table size : 200000
Max open files : 20
n : 3
FOF size : 10
buffer size = 4166600
Initialising hash table...
Reading vocabulary...
Allocating memory for the buffer...
Writing non-OOV counts to temporary file cmuclmtk-Y7QVNP/1
Write error encountered while attempting to merge temporary files.
Aborting, but keeping temporary files.
Which cmuclmtk version are you using? What OS do you have?
It worked when I passed n=1 to text2wngram and then wngram2idngram, but when I
used idngram2lm I got a new error.
I used the following command:
idngram2lm -idngram languagemodel.idngram -vocab languagemodel.vocab -arpa
languagemodel.arpa -ascii_input -n 1
and got the following output:
Input file : languagemodel.idngram (ascii format)
Output files :
ARPA format : languagemodel.arpa
Vocabulary file : languagemodel.vocab
Cutoffs :
Vocabulary type : Open - type 1
Minimum unigram count : 0
Zeroton fraction : 1
Counts will be stored in two bytes.
Count table size : 65535
Discounting method : Good-Turing
Discounting ranges :
1-gram : 1
Memory allocation for tree structure :
Allocate 100 MB of memory, shared equally between all n-gram tables.
Back-off weight storage :
Back-off weights will be stored in four bytes.
Reading vocabulary.
...............................................................
read_wlist_into_siht: a list of 63998 words was read from
"languagemodel.vocab".
read_wlist_into_array: a list of 63998 words was read from
"languagemodel.vocab".
table_size 63999
Allocated (2+255996) bytes to table for 1-grams.
Processing id n-gram file.
20,000 n-grams processed for each ".", 1,000,000 for each line.
Error in idngram stream. This is most likely to be caused by trying to read
a gzipped file as if it were uncompressed. Ensure that all gzipped files have
a .gz extension. Other causes might be confusion over whether the file is in
ascii or binary format.
I am using the cmuclmtk snapshot version, which I downloaded two days ago, and
my OS is Ubuntu 10.10.
Hello,
This bug should be fixed in trunk; please update again, the new snapshot should
work fine.
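If you built from a snapshot tarball, updating is just a matter of downloading the latest snapshot and rebuilding. A typical sequence, assuming the usual autotools layout (the archive name below is a placeholder, use whatever you actually downloaded):
tar xzf cmuclmtk-snapshot.tar.gz
cd cmuclmtk-snapshot
./configure
make
sudo make install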
Btw, you don't need wngrams; you can just use text2idngram to create the idngram
from the text directly. See the documentation for more details.
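For example, something like this builds the idngram straight from the training text (assuming your text is in languagemodel.txt; adjust the name to your corpus):
text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram < languagemodel.txt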
Hi nsh,
Thanks for your reply.
It worked when I used the following commands:
cat simple.txt | text2wfreq > simple.wfreq
cat simple.wfreq | wfreq2vocab -top 70000 > simple.vocab
cat simple.txt | text2idngram -vocab simple.vocab -idngram simple.idngram >
simple.idngram
idngram2lm -idngram simple.idngram -vocab simple.vocab -arpa simple.arpa
But it is still not working when I use
wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii <languagemodel.wngram> languagemodel.idngram
or
text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii
However, the previous commands did the job for me.
Regards
Anurag
wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii <languagemodel.wngram> languagemodel.idngram
In this command you create the idngram twice: the -idngram option already writes
languagemodel.idngram, and the trailing redirection then overwrites the correct
result. The command should be
wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii < languagemodel.wngram
What doesn't work in this case exactly?
Hello nsh,
When I run the command
text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii
it says "Allocating memory for the n-gram buffer", and then nothing happens.
The following is the output:
text2idngram
Vocab : simple.vocab
Output idngram : languagemodel.idngram
N-gram buffer size : 100
Hash table size : 2000000
Temp directory : cmuclmtk-2WYtmG
Max open files : 20
FOF size : 10
n : 3
Initialising hash table...
Reading vocabulary...
Allocating memory for the n-gram buffer...
This command waits for text on stdin. It should be something like
text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii < languagemodel.txt
or
cat languagemodel.txt | text2idngram -vocab languagemodel.vocab -idngram
languagemodel.idngram -write_ascii
Thanks a lot, nsh.
It worked. Damn, I wasted so much time on this small thing.
Regards
Anurag
I have downloaded the CMU toolkit and am following the documentation from this link: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html#changes
When I executed those commands, I got the following error:
n : 3
Input file : - (binary format)
Output files :
Binary format : a.binlm
Vocabulary file : a.vocab
Cutoffs :
2-gram : 0 3-gram : 0
Vocabulary type : Open - type 1
Minimum unigram count : 0
Zeroton fraction : 1
Counts will be stored in two bytes.
Count table size : 65535
Discounting method : Good-Turing
Discounting ranges :
1-gram : 1 2-gram : 7 3-gram : 7
Memory allocation for tree structure :
Memory requirement specified.
2-gram : 5000000 3-gram : 15000000
Back-off weight storage :
Back-off weights will be stored in four bytes.
Reading vocabulary.
text2idngram
Vocab : a.vocab
N-gram buffer size : 100
Hash table size : 200000
Temp directory : /usr/tmp/
Max open files : 20
FOF size : 10
n : 3
Initialising hash table...
read_wlist_into_siht: a list of 16307 words was read from "a.vocab".
read_wlist_into_array: a list of 16307 words was read from "a.vocab".
Allocated 50000000 bytes to table for 2-grams.
Allocated 60000000 bytes to table for 3-grams.
Processing id n-gram file.
20,000 n-grams processed for each ".", 1,000,000 for each line.
Reading vocabulary...
Allocating memory for the n-gram buffer...
Merging temporary files...
Error in idngram stream. This is most likely to be caused by trying to read
a gzipped file as if it were uncompressed. Ensure that all gzipped files have
a .gz extension. Other causes might be confusion over whether the file is in
ascii or binary format.
Please help me.
Could you post the commands that I should follow, or do I have to change the code?
My goal is to compute the perplexity.
The tutorial is here: http://cmusphinx.github.io/wiki/tutoriallm
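As a quick sketch of the perplexity step (the file names are placeholders, assuming the a.binlm you built above and a held-out test text in test.txt): load the model into evallm and run the perplexity command at its prompt:
evallm -binary a.binlm
evallm : perplexity -text test.txt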