
CMUCLMTK Error while idngram2lm

2015-07-03
2017-10-21
  • Sreenadh TC

    Sreenadh TC - 2015-07-03

    Hi,
    OS : Ubuntu 14.10
    CMUclmtk version : v2

    I was working on creating a language model for Malayalam (one of the languages of India). I followed the documentation and used the commands from its typical usage section.

    The command I ran before the issue:

    idngram2lm -idngram a.idngram -vocab ml_corpus.vocab -arpa ml.arpa -ascii_input

    (I used the -write_ascii switch while running text2idngram on the ml_corpus.vocab)

    I got this as the result:

    n : 3
    Input file : a.idngram (ascii format)
    Output files :
    ARPA format : ml.arpa
    Vocabulary file : ml_corpus.vocab
    Cutoffs :
    2-gram : 0 3-gram : 0
    Vocabulary type : Open - type 1
    Minimum unigram count : 0
    Zeroton fraction : 1
    Counts will be stored in two bytes.
    Count table size : 65535
    Discounting method : Good-Turing
    Discounting ranges :
    1-gram : 1 2-gram : 7 3-gram : 7
    Memory allocation for tree structure :
    Allocate 100 MB of memory, shared equally between all n-gram tables.
    Back-off weight storage :
    Back-off weights will be stored in four bytes.
    Reading vocabulary.
    read_wlist_into_siht: a list of 16179 words was read from "ml_corpus.vocab".
    read_wlist_into_array: a list of 16179 words was read from "ml_corpus.vocab".
    Allocated space for 5000000 2-grams.
    Allocated space for 12500000 3-grams.
    Allocated 50000000 bytes to table for 2-grams.
    Allocated 50000000 bytes to table for 3-grams.
    Processing id n-gram file.
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    Error in idngram stream. This is most likely to be caused by trying to read
    a gzipped file as if it were uncompressed. Ensure that all gzipped files have
    a .gz extension. Other causes might be confusion over whether the file is in
    ascii or binary format.

    I then ran idngram2stats with the command:

    idngram2stats -ascii_input <a.idngram>.stats

    output:

    n = 3
    fof_size = 50
    Processing id n-gram file.
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    Error : Repeated ngram in idngram stream.

    Since I saw "Error : Repeated ngram in idngram stream.", I opened a.idngram in gedit, and the only thing I saw was:

    65535 65535 65535 0
    65535 65535 65535 0
    65535 65535 65535 0
    65535 65535 65535 0
    65535 65535 65535 0
    65535 65535 65535 0
    65535 65535 65535 0
    65535 65535 65535 0
    65535 65535 65535 0
    65535 65535 65535 0 and so on....

    (I did not check the whole file as it took a lot of time scrolling through.)
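A file this large can be inspected without an editor by streaming only its first lines. The sketch below is a minimal illustration, not part of CMUCLMTK: it assumes the ASCII id n-gram layout shown above (n whitespace-separated word ids followed by a count), and it assumes that valid ids do not exceed the vocabulary size of 16179 reported in the log. The function name `summarize_idngram` and its parameters are hypothetical.

```python
from collections import Counter
from itertools import islice
import io


def summarize_idngram(lines, n=3, vocab_size=16179, limit=100000):
    """Scan up to `limit` lines of an ASCII id n-gram stream and count obvious problems.

    Assumption: each line holds n integer word ids followed by one integer count.
    """
    seen = Counter()      # how often each id tuple appears in the sample
    out_of_range = 0      # lines with at least one id above vocab_size
    malformed = 0         # lines that do not match the assumed layout
    for line in islice(lines, limit):
        fields = line.split()
        if len(fields) != n + 1 or not all(f.isdigit() for f in fields):
            malformed += 1
            continue
        ids = tuple(int(f) for f in fields[:n])
        seen[ids] += 1
        if any(i > vocab_size for i in ids):
            out_of_range += 1
    return {
        "lines": sum(seen.values()) + malformed,
        "distinct": len(seen),
        "repeated_ngrams": sum(1 for c in seen.values() if c > 1),
        "out_of_range": out_of_range,
        "malformed": malformed,
    }


if __name__ == "__main__":
    # Stand-in for the real file: ten copies of the line seen in gedit.
    sample = io.StringIO("65535 65535 65535 0\n" * 10)
    print(summarize_idngram(sample))
```

For the real file, the same function can be called on an open file handle, for example `with open("a.idngram") as f: summarize_idngram(f)`. Only the first `limit` lines are read, so the size of the file does not matter.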

    This is my first CMU Sphinx project and I am no expert in it yet. Can someone help me find out what's wrong?
    One more thing: the "a.idngram" file is 572.6 GB in size. Is that normal for an idngram file built from a vocabulary of that many words? (Just curious because of the size.)
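The reported size can be sanity-checked with simple arithmetic. The sketch below assumes that "GB" means 10^9 bytes, that every line looks like the one shown above, and that lines end in a single newline byte; the helper name `estimated_lines` is hypothetical.

```python
def estimated_lines(file_size_bytes, sample_line):
    """Estimate the line count of a file whose lines all have the same length as sample_line."""
    return file_size_bytes // len(sample_line.encode("ascii"))


if __name__ == "__main__":
    size_bytes = 5726 * 10**8              # 572.6 GB, assuming 1 GB = 10**9 bytes
    line = "65535 65535 65535 0\n"         # the line seen in gedit, 20 bytes long
    lines = estimated_lines(size_bytes, line)
    allocated_trigrams = 12_500_000        # "Allocated space for 12500000 3-grams." in the log
    print(lines, lines / allocated_trigrams)
```

Under these assumptions the file would hold roughly 28.6 billion lines, about 2,290 times the space idngram2lm allocated for 3-grams. Since the number of distinct 3-grams cannot exceed the number of words in the training text, a count of that magnitude would only be plausible for a corpus with billions of words.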

    Thank you.

     
    • Nickolay V. Shmyrev

      Use SRILM.

       
      • Sreenadh TC

        Sreenadh TC - 2015-07-11

        Sorry for the late response. I'll try it and report back.

        thanks

         
