cmuclmtk, wngram2idngram error

Forum: Help
Created: 2011-06-14, updated: 2012-09-22
  • Anurag Jain

    Anurag Jain - 2011-06-14

    Hello Everyone,
    I want to create a language model using cmuclmtk for a large vocabulary,
    but when I use wngram2idngram with "write_ascii" it gives the error "Write
    error encountered while attempting to merge temporary files", and when I
    use it without "write_ascii" it gives the error "rr_fwrite: problems
    writing n-gram ids. Only 0 of 1 elements were written".
    How can I solve this? I changed the buffer size and the number of files,
    but it still gives the same error.
    However, I noticed that every time an .idngram file of exactly 2 GB is
    created, whether or not I use write_ascii and whether or not I change the
    number of words in my vocabulary. Is there an upper limit on file size by
    default that is causing this error?
    Waiting for your reply.

    Regards
    Anurag
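
    An output that stops at exactly 2 GB is the classic signature of 32-bit
    file offsets: a signed 32-bit off_t overflows at 2^31 bytes. A quick way
    to check whether a shell limit or the build's offset size could be in
    play (pure diagnostics, nothing cmuclmtk-specific):

```shell
# If this prints anything other than "unlimited", the shell itself caps
# the size of files that child processes may write:
ulimit -f

# On Linux, ask libc which compiler flags enable 64-bit file offsets;
# a 32-bit build compiled without them stops writing at 2 GB:
getconf LFS_CFLAGS    # e.g. -D_FILE_OFFSET_BITS=64 (may be empty on 64-bit)
```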

     
  • Pranav Jawale

    Pranav Jawale - 2011-06-14

    Could you paste the command that you used? Also, perhaps you could upload
    the related files somewhere (with a README) so that others can see
    whether the problem can be reproduced.

     
  • Anurag Jain

    Anurag Jain - 2011-06-14

    Hi,
    I used the following command to generate idngram
    wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
    -write_ascii <languagemodel.wngram> languagemodel.idngram

    But I got the following error:
    Vocab : languagemodel.vocab
    Output idngram : languagemodel.idngram
    Buffer size : 100
    Hash table size : 200000
    Max open files : 20
    n : 3
    FOF size : 10
    buffer size = 4166600
    Initialising hash table...
    Reading vocabulary...
    Allocating memory for the buffer...
    Writing non-OOV counts to temporary file cmuclmtk-Y7QVNP/1
    Write error encountered while attempting to merge temporary files.
    Aborting, but keeping temporary files.

     
  • Nickolay V. Shmyrev

    Which cmuclmtk version are you using? What OS do you have?

     
  • Anurag Jain

    Anurag Jain - 2011-06-14

    It worked when I passed n=1 to text2wngram and then to wngram2idngram.
    But when I ran idngram2lm I got a new error.

    I used the following command
    idngram2lm -idngram languagemodel.idngram -vocab languagemodel.vocab -arpa
    languagemodel.arpa -ascii_input -n 1

    and got the following output

    Input file : languagemodel.idngram (ascii format)
    Output files :
    ARPA format : languagemodel.arpa
    Vocabulary file : languagemodel.vocab
    Cutoffs :

    Vocabulary type : Open - type 1
    Minimum unigram count : 0
    Zeroton fraction : 1
    Counts will be stored in two bytes.
    Count table size : 65535
    Discounting method : Good-Turing
    Discounting ranges :
    1-gram : 1
    Memory allocation for tree structure :
    Allocate 100 MB of memory, shared equally between all n-gram tables.
    Back-off weight storage :
    Back-off weights will be stored in four bytes.
    Reading vocabulary.
    ...............................................................
    read_wlist_into_siht: a list of 63998 words was read from
    "languagemodel.vocab".
    read_wlist_into_array: a list of 63998 words was read from
    "languagemodel.vocab".
    table_size 63999
    Allocated (2+255996) bytes to table for 1-grams.
    Processing id n-gram file.
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    Error in idngram stream. This is most likely to be caused by trying to read
    a gzipped file as if it were uncompressed. Ensure that all gzipped files have
    a .gz extension. Other causes might be confusion over whether the file is in
    ascii or binary format.

     
  • Anurag Jain

    Anurag Jain - 2011-06-14

    I am using the cmuclmtk snapshot version, which I downloaded 2 days ago,
    and my OS is Ubuntu 10.10.

     
  • Nickolay V. Shmyrev

    Hello

    This bug should be fixed in trunk; please update again, the new snapshot
    should work fine.

    Btw, you don't need wngrams; you can just use text2idngram to create the
    idngram from text directly. See the documentation for more details.

     
  • Anurag Jain

    Anurag Jain - 2011-06-16

    Hi nsh,
    Thanks for your reply
    It worked when I used the following commands

    cat simple.txt | text2wfreq > simple.wfreq
    cat simple.wfreq | wfreq2vocab -top 70000 > simple.vocab
    cat simple.txt | text2idngram -vocab simple.vocab -idngram simple.idngram >
    simple.idngram
    idngram2lm -idngram simple.idngram -vocab simple.vocab -arpa simple.arpa

    But it is still not working when I use

    wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
    -write_ascii <languagemodel.wngram> languagemodel.idngram
    or
    text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
    -write_ascii

    However, the previous commands did the job for me.

    Regards
    Anurag

     
  • Nickolay V. Shmyrev

    wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
    -write_ascii <languagemodel.wngram> languagemodel.idngram

    In this command you create the idngram twice and overwrite the correct
    result. The commands should be

    wngram2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram -write_ascii <languagemodel.wngram
    
    text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram -write_ascii
    

    What doesn't work in this case exactly?
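
    The overwrite happens because the shell truncates every '>' target before
    the command runs, so the file wngram2idngram itself wrote through -idngram
    gets wiped. A stand-in demonstration in plain shell (no cmuclmtk
    involved; 'true' plays the role of the tool):

```shell
set -eu
# Pretend wngram2idngram already wrote a good idngram file:
printf 'valid idngram contents\n' > model.idngram

# A stray '> model.idngram' makes the shell truncate that same file
# before the command even starts ('true' stands in for the tool):
true > model.idngram

# The previously written output is gone:
wc -c < model.idngram    # -> 0
```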

     
  • Anurag Jain

    Anurag Jain - 2011-06-16

    Hello nsh,
    When I run the command
    text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram
    -write_ascii

    it says "Allocating memory for the n-gram buffer", and then nothing
    happens. Following is the output:

    text2idngram
    Vocab : simple.vocab
    Output idngram : languagemodel.idngram
    N-gram buffer size : 100
    Hash table size : 2000000
    Temp directory : cmuclmtk-2WYtmG
    Max open files : 20
    FOF size : 10
    n : 3
    Initialising hash table...
    Reading vocabulary...
    Allocating memory for the n-gram buffer...

     
  • Nickolay V. Shmyrev

    This command waits for text on stdin. It should be something like

    cat text.txt | text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram -write_ascii
    

    or

    text2idngram -vocab languagemodel.vocab -idngram languagemodel.idngram -write_ascii < text.txt
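
    The hang is generic to any filter that reads stdin: with no input
    attached it simply blocks. A stand-in with wc shows both working forms
    (nothing cmuclmtk-specific here):

```shell
set -eu
printf 'one two three\n' > corpus.txt

# Without '<' or a pipe, a stdin filter would sit and wait forever,
# exactly as text2idngram appeared to do. With input attached:
wc -w < corpus.txt          # counts 3 words
cat corpus.txt | wc -w      # same thing, fed through a pipe
```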
    
     
  • Anurag Jain

    Anurag Jain - 2011-06-16

    Thanks a lot, nsh.
    It worked. Damn, I wasted so much time on this small thing.

    Regards
    Anurag

     
  • Abhishek mamidi

    Abhishek mamidi - 2017-10-21

    I have downloaded the CMU toolkit and am following the documentation from this link: http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html#changes

    When I executed those commands I got the following error:
    n : 3
    Input file : - (binary format)
    Output files :
    Binary format : a.binlm
    Vocabulary file : a.vocab
    Cutoffs :
    2-gram : 0 3-gram : 0
    Vocabulary type : Open - type 1
    Minimum unigram count : 0
    Zeroton fraction : 1
    Counts will be stored in two bytes.
    Count table size : 65535
    Discounting method : Good-Turing
    Discounting ranges :
    1-gram : 1 2-gram : 7 3-gram : 7
    Memory allocation for tree structure :
    Memory requirement specified.
    2-gram : 5000000 3-gram : 15000000
    Back-off weight storage :
    Back-off weights will be stored in four bytes.
    Reading vocabulary.
    text2idngram
    Vocab : a.vocab
    N-gram buffer size : 100
    Hash table size : 200000
    Temp directory : /usr/tmp/
    Max open files : 20
    FOF size : 10
    n : 3
    Initialising hash table...
    read_wlist_into_siht: a list of 16307 words was read from "a.vocab".
    read_wlist_into_array: a list of 16307 words was read from "a.vocab".
    Allocated 50000000 bytes to table for 2-grams.
    Allocated 60000000 bytes to table for 3-grams.
    Processing id n-gram file.
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    Reading vocabulary...
    Allocating memory for the n-gram buffer...
    Merging temporary files...
    Error in idngram stream. This is most likely to be caused by trying to read
    a gzipped file as if it were uncompressed. Ensure that all gzipped files have
    a .gz extension. Other causes might be confusion over whether the file is in
    ascii or binary format.

    Please help me.
    Could you post the commands that I should follow, or do I have to change
    the code? My goal is to compute the perplexity.
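
    Once a binary LM builds cleanly, the toolkit's evallm program reports
    perplexity. A sketch following the linked documentation (the filenames
    a.binlm and b.text are placeholders; check evallm's own help in your
    build):

```shell
# Load the binary LM into evallm, then request perplexity on a test
# text at its interactive prompt:
evallm -binary a.binlm
# evallm : perplexity -text b.text
```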

     
