
SLMTK only calculating uni-grams

2010-08-25
2012-09-22
  • Marco Volino

    Marco Volino - 2010-08-25

    Hi,
    I am trying to make a language model using the CMU SLMTK. I am using the
    Windows binaries currently available on SourceForge, but I have also compiled
    and used the Linux distribution; both give the same problem. I am using the
    functions as described in the documentation
    (http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html).

    The problem is that the functions only seem to calculate uni-grams and fail
    to calculate the bi-grams and tri-grams.
    I have processed my text in the appropriate way (i.e. <s> some text </s>
    for each sentence) and have included how I am using each function below:

    text2wfreq.exe < input.txt > file.wfreq
    wfreq2vocab.exe < file.wfreq > file.vocab
    text2idngram.exe -vocab file.vocab < parsed.txt > file.idngram.gz
    idngram2lm.exe -idngram file.idngram.gz -vocab file.vocab -arpa out.arpa
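
    For reference, this is roughly how I produce the per-sentence markup in
    parsed.txt (a minimal sed sketch; the file names simply mirror the commands
    above):

```shell
# Minimal sketch: wrap each line of raw text in the <s> ... </s>
# sentence markers the toolkit expects (assumes one sentence per line).
printf 'hello world\nsecond line\n' > input.txt   # stand-in for real text
sed 's/^/<s> /; s/$/ <\/s>/' input.txt > parsed.txt
```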

    The .arpa file is then converted to a .dmp file using sphinx_lm_convert from
    sphinxbase (again I am using the Windows binaries but have also tried the
    Linux version).

    The output of the whole process is given below:

    text2wfreq : Reading text from standard input...
    text2wfreq : Done.
    wfreq2vocab : Will generate a vocabulary containing the most
    frequent 20000 words. Reading wfreq stream from stdin...
    wfreq2vocab : Done.
    text2idngram
    Vocab : file.vocab
    N-gram buffer size : 100
    Hash table size : 2000000
    Temp directory : /usr/tmp/
    Max open files : 20
    FOF size : 10
    n : 3
    Initialising hash table...
    Reading vocabulary...
    Allocating memory for the n-gram buffer...
    Reading text into the n-gram buffer...
    20,000 n-grams processed for each ".", 1,000,000 for each line.

    Sorting n-grams...
    Writing sorted n-grams to temporary file
    e:\DOCUME~1\gz902298\LOCALS~1\Temp\text2idngram.temp.21
    Merging 1 temporary files...

    2-grams occurring:  N times  > N times  Sug. -spec_num value
            0                      110            121
            1             106        4             14
            2               4        0             10
            3               0        0             10
            4               0        0             10
            5               0        0             10
            6               0        0             10
            7               0        0             10
            8               0        0             10
            9               0        0             10
           10               0        0             10

    3-grams occurring:  N times  > N times  Sug. -spec_num value
            0                      116            127
            1             116        0             10
            2               0        0             10
            3               0        0             10
            4               0        0             10
            5               0        0             10
            6               0        0             10
            7               0        0             10
            8               0        0             10
            9               0        0             10
           10               0        0             10
    text2idngram : Done.
    n : 3
    Input file : file.idngram.gz (binary format)
    Output files :
    ARPA format : out.arpa
    Vocabulary file : file.vocab
    Cutoffs :
    2-gram : 0 3-gram : 0
    Vocabulary type : Open - type 1
    Minimum unigram count : 0
    Zeroton fraction : 1
    Counts will be stored in two bytes.
    Count table size : 65535
    Discounting method : Good-Turing
    Discounting ranges :
    1-gram : 1 2-gram : 7 3-gram : 7
    Memory allocation for tree structure :
    Allocate 100 MB of memory, shared equally between all n-gram tables.
    Back-off weight storage :
    Back-off weights will be stored in four bytes.
    Reading vocabulary.

    read_wlist_into_siht: a list of 83 words was read from "file.vocab".
    read_wlist_into_array: a list of 83 words was read from "file.vocab".
    WARNING: appears as a vocabulary item, but is not labelled as a
    context cue.
    Allocated space for 5000000 2-grams.
    Allocated space for 12500000 3-grams.
    table_size 84
    Allocated 60000000 bytes to table for 2-grams.
    Allocated (2+25000000) bytes to table for 3-grams.
    Processing id n-gram file.
    20,000 n-grams processed for each ".", 1,000,000 for each line.
    'cat' is not recognized as an internal or external command,
    operable program or batch file.

    Calculating discounted counts.
    Warning : 1-gram : f-of-f = 0 --> 1-gram discounting is disabled.
    Warning : 2-gram : f-of-f = 0 --> 2-gram discounting is disabled.
    Warning : 3-gram : f-of-f = 0 --> 3-gram discounting is disabled.
    Unigrams's discount mass is 0 (n1/N = 0)
    prob = 1
    WARNING: 83 non-context-cue words have zero probability

    Incrementing contexts...
    Calculating back-off weights...
    Warning : P( 0 ) == 1
    Warning : Back off weight for <unk>(id 0) is set to 0.
    May cause problems with zero probabilities.
    Writing out language model...
    ARPA-style 3-gram will be written to out.arpa
    idngram2lm : Done.
    INFO: cmd_ln.c(512): Parsing command line:
    e:\Documents and Settings\gz902298\Desktop\Language Modelling\Sphinx
    Base\sphinx
    _lm_convert.exe \
    -i out.arpa \
    -o out.dmp

    Current configuration:

    -case
    -debug 0
    -help no no
    -i out.arpa
    -ienc
    -ifmt
    -logbase 1.0001 1.000100e+000
    -mmap no no
    -o out.dmp
    -oenc utf8 utf8
    -ofmt

    INFO: ngram_model_arpa.c(476): ngrams 1=84, 2=1, 3=1
    INFO: ngram_model_arpa.c(135): Reading unigrams
    INFO: ngram_model_arpa.c(515): 84 = #unigrams created
    INFO: ngram_model_arpa.c(194): Reading bigrams
    INFO: ngram_model_arpa.c(531): 1 = #bigrams created
    INFO: ngram_model_arpa.c(532): 2 = #prob2 entries
    INFO: ngram_model_arpa.c(539): 2 = #bo_wt2 entries
    INFO: ngram_model_arpa.c(291): Reading trigrams
    INFO: ngram_model_arpa.c(552): 1 = #trigrams created
    INFO: ngram_model_arpa.c(553): 2 = #prob3 entries
    INFO: ngram_model_dmp.c(492): Building DMP model...
    INFO: ngram_model_dmp.c(522): 84 = #unigrams created
    INFO: ngram_model_dmp.c(621): 1 = #bigrams created
    INFO: ngram_model_dmp.c(622): 2 = #prob2 entries
    INFO: ngram_model_dmp.c(629): 2 = #bo_wt2 entries
    INFO: ngram_model_dmp.c(633): 1 = #trigrams created
    INFO: ngram_model_dmp.c(634): 1 = #prob3 entries

    The probabilities of all the unigrams are always the same; a sample is shown
    below:

    -99.0000 rustled 0.0000

    -99.0000 s 0.0000

    -99.0000 said 0.0000

    -99.0000 see 0.0000

    and regardless of the size of the text there is always 1 bi-gram and 1 tri-
    gram calculated, which are:
    \2-grams:
    -0.0000 <unk> <unk> 0.0000

    \3-grams:
    -0.0000 <unk> <unk> <unk>

    I have searched Google and the forum and have not been able to find any
    information. Any help would be greatly appreciated; I have tried my best to
    provide as much information about the problem as I can, so please let me
    know if anything else is required.

    Kind Regards,

    Marco

     
  • Nickolay V. Shmyrev

    Hi

    I am using the functions as described in the documentation
    (http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html).

    Be careful, this is the obsolete documentation.

    text2wfreq.exe < input.txt > file.wfreq
    wfreq2vocab.exe < file.wfreq > file.vocab
    text2idngram.exe -vocab file.vocab < parsed.txt > file.idngram.gz
    idngram2lm.exe -idngram file.idngram.gz -vocab file.vocab -arpa out.arpa

    I have a feeling the issue is that you are using a .gz extension here while
    you don't actually compress the output of text2idngram. Please try without
    the .gz.
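
    To illustrate with a small sketch (nothing SLMTK-specific): a shell redirect
    never compresses, so a .gz file produced that way is not real gzip data, and
    anything that later tries to uncompress it will fail.

```shell
# Sketch: redirection writes plain bytes no matter what the file is
# called, so "> file.idngram.gz" is NOT gzip data.  Compress explicitly.
printf 'example idngram data\n' > file.idngram   # plain output, honest name
gzip -f file.idngram                             # produces a real file.idngram.gz
gzip -t file.idngram.gz && echo "valid gzip"
```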

    If it's still broken, please try to build the latest snapshot on Linux and
    send all the files you created (you can upload them to a public file-sharing
    server and post a link here). Also please provide logs. With the snapshot
    the commands have changed a bit:

    text2wfreq < a.txt > a.wfreq
    wfreq2vocab < a.wfreq > a.vocab
    text2idngram -vocab a.vocab -idngram a.idngram < a.txt
    idngram2lm -vocab a.vocab -idngram a.idngram -arpa a.arpa
    
     
  • Marco Volino

    Marco Volino - 2010-08-25

    Hi nshmyrev,

    Thanks for your quick response :-)

    I have built and used the latest version (2.05) on Linux and initially found
    that it produced the same error.

    However, I noticed that the text2idngram function produced an empty .idngram
    file.
    From the logs and the code I found that the function attempts to create a
    temporary file in the destination defined by the -temp argument, which
    defaults to /usr/tmp/ if not set. As this directory did not exist on my file
    system, text2idngram encountered an error and produced an empty file, which
    caused my problem. It was therefore overcome by setting the -temp parameter
    to a valid folder.
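
    In other words, the workaround is just to make sure the temp directory
    exists and pass it explicitly (a sketch; /tmp/slmtk is an arbitrary choice,
    and the command is guarded so the snippet is safe to paste):

```shell
# Sketch: create a temp directory that definitely exists and hand it to
# text2idngram via -temp (its default, /usr/tmp/, may not exist).
TEMPDIR=/tmp/slmtk
mkdir -p "$TEMPDIR"   # no error if the directory is already there
if command -v text2idngram >/dev/null; then
  text2idngram -vocab file.vocab -idngram file.idngram -temp "$TEMPDIR" < parsed.txt
fi
```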

    I hope this information helps anyone who encounters the same problem.

    This solution did not, however, fix the problem with the Windows binaries.

    Kind regards

    Marco

     
  • Nickolay V. Shmyrev

    Thanks, this temp file issue must be fixed in trunk now.

     
