Hello everyone,
I want to create a language model with cmuclmtk for a large vocabulary, but when I
use wngram2idngram with "write_ascii" it gives the error "Write error
encountered while attempting to merge temporary files", and when I use it
without "write_ascii" it gives the error "rr_fwrite: problems writing n-gram
ids. Only 0 of 1 elements were written".
How can I solve this? I changed the buffer size and the number of files, but it
still gives the same error.
I also noticed that every time a .idngram file of exactly 2 GB is created,
whether or not I use write_ascii and regardless of how many words are in my
vocabulary. Is there a default upper limit on file size that could be causing
this error?
Waiting for your reply.
Regards,
Anurag
Could you paste the command that you used? Also, perhaps you can upload the
related files somewhere (with a README) so that others can check whether the
problem can be reproduced.
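The fact that the file stops at exactly 2 GB also sounds like a 32-bit file-offset limit. As a quick check, and this is only a guess assuming a glibc-based Linux, you can ask the platform for its large-file compile flags and rebuild the toolkit with them:
# print the flags needed for 64-bit file offsets (typically -D_FILE_OFFSET_BITS=64)
getconf LFS_CFLAGS
# rebuild cmuclmtk with those flags
CFLAGS="$(getconf LFS_CFLAGS)" ./configure
make && sudo make install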
Hi,
I used the following command to generate the idngram file:
wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii <languagemodel.wngram> languagemodel.idngram
But I got the following error:
Vocab : languagemodel.vocab
Output idngram : languagemodel.idngram
Buffer size : 100
Hash table size : 200000
Max open files : 20
n : 3
FOF size : 10
buffer size = 4166600
Initialising hash table...
Reading vocabulary...
Allocating memory for the buffer...
Writing non-OOV counts to temporary file cmuclmtk-Y7QVNP/1
Write error encountered while attempting to merge temporary files.
Aborting, but keeping temporary files.
Which cmuclmtk version are you using? What OS do you have?
It worked when I passed n=1 to text2wngram and then wngram2idngram, but when I
used idngram2lm I got a new error.
I used the following command:
idngram2lm -idngram languagemodel.idngram -vocab languagemodel.vocab -arpa
languagemodel.arpa -ascii_input -n 1
and got the following output:
Input file : languagemodel.idngram (ascii format)
Output files :
ARPA format : languagemodel.arpa
Vocabulary file : languagemodel.vocab
Cutoffs :
Vocabulary type : Open - type 1
Minimum unigram count : 0
Zeroton fraction : 1
Counts will be stored in two bytes.
Count table size : 65535
Discounting method : Good-Turing
Discounting ranges :
1-gram : 1
Memory allocation for tree structure :
Allocate 100 MB of memory, shared equally between all n-gram tables.
Back-off weight storage :
Back-off weights will be stored in four bytes.
Reading vocabulary.
...............................................................
read_wlist_into_siht: a list of 63998 words was read from
"languagemodel.vocab".
read_wlist_into_array: a list of 63998 words was read from
"languagemodel.vocab".
table_size 63999
Allocated (2+255996) bytes to table for 1-grams.
Processing id n-gram file.
20,000 n-grams processed for each ".", 1,000,000 for each line.
Error in idngram stream. This is most likely to be caused by trying to read
a gzipped file as if it were uncompressed. Ensure that all gzipped files have
a .gz extension. Other causes might be confusion over whether the file is in
ascii or binary format.
I am using the cmuclmtk snapshot version, which I downloaded two days ago, and
my OS is Ubuntu 10.10.
Hello,
This bug should be fixed in trunk; please update again, the new snapshot should
work fine.
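If you built from a snapshot tarball, updating is just a matter of downloading the latest snapshot and rebuilding. A typical sequence, assuming the usual autotools layout (the archive name below is a placeholder, use whatever you actually downloaded):
tar xzf cmuclmtk-snapshot.tar.gz
cd cmuclmtk-snapshot
./configure
make
sudo make install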
Btw, you don't need wngrams; you can just use text2idngram to create the idngram
from the text directly. See the documentation for more details.
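For example, something like this builds the idngram straight from the training text (assuming your text is in languagemodel.txt; adjust the name to your corpus):
text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram < languagemodel.txt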
Hi nsh,
Thanks for your reply.
It worked when I used the following commands:
cat simple.txt | text2wfreq > simple.wfreq
cat simple.wfreq | wfreq2vocab -top 70000 > simple.vocab
cat simple.txt | text2idngram -vocab simple.vocab -idngram simple.idngram >
simple.idngram
idngram2lm -idngram simple.idngram -vocab simple.vocab -arpa simple.arpa
But it is still not working when I use
wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii <languagemodel.wngram> languagemodel.idngram
or
text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii
However, the previous commands did the job for me.
Regards
Anurag
wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii <languagemodel.wngram> languagemodel.idngram
In this command you create the idngram twice: the -idngram option already writes
languagemodel.idngram, and the trailing redirection then overwrites the correct
result. The command should be
wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii < languagemodel.wngram
What doesn't work in this case exactly?
Hello nsh,
When I run the command
text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii
it says "Allocating memory for the n-gram buffer", and then nothing happens.
The following is the output:
text2idngram
Vocab : simple.vocab
Output idngram : languagemodel.idngram
N-gram buffer size : 100
Hash table size : 2000000
Temp directory : cmuclmtk-2WYtmG
Max open files : 20
FOF size : 10
n : 3
Initialising hash table...
Reading vocabulary...
Allocating memory for the n-gram buffer...
This command waits for text on stdin. It should be something like
text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
-write_ascii < languagemodel.txt
or
cat languagemodel.txt | text2idngram -vocab languagemodel.vocab -idngram
languagemodel.idngram -write_ascii
Thanks a lot, nsh.
It worked. Damn, I wasted so much time on this small thing.
Regards
Anurag
I have downloaded the CMU toolkit and am following the documentation from this link: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html#changes
When I executed those commands, I got the following error:
n : 3
Input file : - (binary format)
Output files :
Binary format : a.binlm
Vocabulary file : a.vocab
Cutoffs :
2-gram : 0 3-gram : 0
Vocabulary type : Open - type 1
Minimum unigram count : 0
Zeroton fraction : 1
Counts will be stored in two bytes.
Count table size : 65535
Discounting method : Good-Turing
Discounting ranges :
1-gram : 1 2-gram : 7 3-gram : 7
Memory allocation for tree structure :
Memory requirement specified.
2-gram : 5000000 3-gram : 15000000
Back-off weight storage :
Back-off weights will be stored in four bytes.
Reading vocabulary.
text2idngram
Vocab : a.vocab
N-gram buffer size : 100
Hash table size : 200000
Temp directory : /usr/tmp/
Max open files : 20
FOF size : 10
n : 3
Initialising hash table...
read_wlist_into_siht: a list of 16307 words was read from "a.vocab".
read_wlist_into_array: a list of 16307 words was read from "a.vocab".
Allocated 50000000 bytes to table for 2-grams.
Allocated 60000000 bytes to table for 3-grams.
Processing id n-gram file.
20,000 n-grams processed for each ".", 1,000,000 for each line.
Reading vocabulary...
Allocating memory for the n-gram buffer...
Merging temporary files...
Error in idngram stream. This is most likely to be caused by trying to read
a gzipped file as if it were uncompressed. Ensure that all gzipped files have
a .gz extension. Other causes might be confusion over whether the file is in
ascii or binary format.
Please help me.
Could you post the commands that I should follow, or do I have to change the code?
My goal is to compute the perplexity.
The tutorial is here: http://cmusphinx.github.io/wiki/tutoriallm
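As a quick sketch of the perplexity step (the file names are placeholders, assuming the a.binlm you built above and a held-out test text in test.txt): load the model into evallm and run the perplexity command at its prompt:
evallm -binary a.binlm
evallm : perplexity -text test.txt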