LM Training issues using CMU-LM v2.05

  • Horizon

    Horizon - 2008-05-26

    I am having issues training an LM with the CMU-LM Toolkit v2.05 on a 1k-word dictionary.
    I have word-frequency counts for 1-, 2-, and 3-gram word sequences (*.wngram files).
    When I run the various experiments, the toolkit crashes. I will focus on two main issues in this post.

    1. Some word frequencies I have are greater than 2^31 (they exceed MAX_SIGNED_INT), so while processing unigrams the toolkit fails with an error stating Error: SUM(P) = 100.xxx.
      To get around this, I take the mod of the counts (for now), thereby bypassing the error. Has anyone come across this issue before? Any suggestions on tackling it?
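
      For reference, the mod workaround is roughly the following (a quick Python sketch, not my exact script; the file name is a placeholder and the 2^31 modulus is what I am assuming for MAX_SIGNED_INT):

      # Sketch only: wrap every count in a *.wngram file so it stays below 2^31.
      # "corpus.w1gram" is a placeholder file name; lines are assumed to be
      # "w_1 ... w_N count".
      MAX_SIGNED_INT = 2**31

      with open("corpus.w1gram") as fin, open("corpus.mod.w1gram", "w") as fout:
          for line in fin:
              fields = line.split()
              words, count = fields[:-1], int(fields[-1])
              count %= MAX_SIGNED_INT          # the "mod of counts" workaround
              fout.write(" ".join(words + [str(count)]) + "\n")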

    2. The other issue has to do with the *.idngram file that gets created. I'll start with the sequence of commands I am executing:

    wfreq2vocab    (INPUT *.wfreq; OUTPUT *.vocab)
    for ($n = ngram) {                                                  // create these files iteratively to use in successive runs
        wngram2idngram (INPUT *.vocab, *.wngram; OUTPUT *.idNgram)      // N = $n
        mergeidngram   (INPUT *.idNgram; OUTPUT *.idngram)              // N = 1..$n
        idngram2lm     (INPUT *.idngram; OUTPUT *.arpa)
    }

    Everything seems to run fine until the very last step, where I use the merged *.idngram file; there it fails with the error "_cygtls::handle_exceptions: Error dumping state".
    On some debugging, it appears that the merged idngram file has some odd entries.

    Sample file contents (with line numbers):
    *.id2gram (WID_1 WID_2 WFREQ)
    1. 2 3 7483647
    2. 2 4 9577454
    3. 2 5 24266
    4.
    5.
    .
    .
    x. 849 849 1124

    *.idngram (merged *.id1gram & *.id2gram)
    1. 1 65535 2 <--- where did this come from? for a 1k dictionary, a 65k entry??
    2. 2 3 2483647
    3. 2 4 2077454
    4. 2 5 24266
    5.
    .
    .
    x. 849 849 1124
    x+1. 65535 3 2483647
    x+2. 4 65535 5
    x+3. 10980 6 660658
    x+4. 7 21530 8
    .
    .

    What are these new entries, which seem erroneous and cause the program to crash?
    When I manually go in and delete these extra entries, I am able to get a .arpa file, but with warnings about backoffs being set to '0' (I suspect this is because the stripped-down file is now simply a .id2gram, with no unigram info).
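
    For reference, the manual clean-up amounts to roughly this (a rough Python sketch; 849 is my vocabulary size and the file names are placeholders - obviously the real fix is to stop these entries appearing in the first place):

    # Sketch: drop merged-idngram lines that contain a word id outside the
    # vocabulary range (e.g. the 65535 entries above).  Assumes plain-text
    # lines of the form "WID_1 ... WID_N COUNT".
    VOCAB_SIZE = 849   # words in my dictionary

    with open("merged.idngram") as fin, open("merged.clean.idngram", "w") as fout:
        for line in fin:
            *ids, _count = line.split()
            if all(1 <= int(wid) <= VOCAB_SIZE for wid in ids):
                fout.write(line)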

    Lastly, I am using the toolkit in a Cygwin environment.

    Any inputs would be great.

    cheers,
    Horrizon

     
    • Horizon

      Horizon - 2008-06-04

      Hi Nickolay,

      Finally got back to your suggestions - excuse the delay! I was having some problems getting SimpleLM to work out of the box, but that is now resolved. To test SimpleLM, I used it to train an LM on RM1 data (using just the expected results as training text) with 'cmulmtk'. There is a problem in 'wngram2idngram': the contents of the generated *.idngram (clearly wrong) are below:

      65535 65535 65535 4016015
      65535 65535 65535 82
      65535 65535 65535 83
      65535 65535 65535 82
      65535 65535 65535 1

      However, when I point the binaries to cmu-cam v2.05 instead of cmulmtk, I am able to get a *.arpabo file which looks to have valid numbers. For cmulmtk, as you suggested, I made sure that words with digits are not part of the vocab (roughly via the check sketched below). Are you aware of any other bugs?
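
      The check I used to make sure no words with digits slipped into the vocab is roughly this (a quick Python sketch; the vocab file name is a placeholder and header lines are assumed to start with '#'):

      # Sketch: list vocabulary entries that contain digits, since the cmulmtk
      # trunk currently mishandles them.  Assumes one word per line, with
      # '#'-prefixed header/comment lines.
      with open("corpus.vocab") as f:
          for word in (line.strip() for line in f):
              if word and not word.startswith("#") and any(ch.isdigit() for ch in word):
                  print("word with digits:", word)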

      I will report back if the trained dummy RM1 LM works reasonably well (as a sanity check), and then it's back to my dataset!

      Horrizon

       
    • Horizon

      Horizon - 2008-05-26

      For the sake of accuracy, I think my post should be updated slightly as follows:

      It's an 849-word dictionary.

      wfreq2vocab    (INPUT *.wfreq; OUTPUT *.vocab)
      for ($n) {                                                        // create these files iteratively to use in successive runs
          wngram2idngram (INPUT *.vocab, *.wNgram; OUTPUT *.idNgram)    // N = $n
          mergeidngram   (INPUT *.id1gram ... *.idNgram; OUTPUT *.idngram)
          idngram2lm     (INPUT *.idngram; OUTPUT N.arpa)
      }

      *.idngram for training a bigram (merged *.id1gram & *.id2gram)
      1. 1 65535 2 <--- where did this come from? for an 849-word dictionary, a 65k word-id entry??
      2. 2 3 2483647
      3. 2 4 2077454
      4. 2 5 24266
      5.
      .
      .
      x+1. 849 849 1124
      x+2. 65535 3 2483647 <--- ?
      x+3. 4 65535 5 <--- ?
      x+4. 10980 6 660658 <--- ?
      x+5. 7 21530 8 <--- ?
      . <--- ?
      . <--- ?

      Reports error and exits here in idngram2lm.c:
      ...
      pos_of_novelty = ng.n;
      for (i = 0; i <= ng.n-1; i++) {
          if (current_ngram.id_array[i] > previous_ngram.id_array[i]) {
              pos_of_novelty = i;
              i = ng.n;
          }
          else {
              if (current_ngram.id_array[i] < previous_ngram.id_array[i]) {
                  if (nlines < 5) { /* Error occurred early - file format? */
                      quit(-1,"Error : n-gram ordering problem - could be due to using wrong file format.\nCheck whether id n-gram file is in ascii or binary format.\n");
                  }
                  else {
                      quit(-1,"Error : n-grams are not correctly ordered. Error occurred at ngram %d.\n",nlines);   /* <-- reports error here */
                  }
              }
          }
      }
      ...

      Format of input files:
      *.wNgram files:  W_1 ... W_N  COUNT
      *.idngram files: WID_1 ... WID_N  COUNT
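
      From the code above, idngram2lm seems to require the id n-grams in ascending lexicographic order of the word ids, an ordering my merged file clearly violates. This is the kind of check I run on it (a rough Python sketch; the file name is a placeholder and the file is assumed to be in ascii format):

      # Sketch: verify that an ascii id n-gram file is lexicographically sorted
      # on its word-id columns, which is what the ordering check in
      # idngram2lm.c above appears to enforce.
      previous = None
      with open("merged.idngram") as f:
          for lineno, line in enumerate(f, start=1):
              *ids, _count = line.split()
              current = tuple(int(wid) for wid in ids)
              if previous is not None and current < previous:
                  print("ordering problem at line", lineno, ":", current, "<", previous)
              previous = current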

      Would it be possible for someone to post sample intermediate files for me to cross-check against?
      Any pointers to LM training recipes using the CMU-Cam LM Toolkit would be quite helpful as well.

      Horrizon

       
      • Nickolay V. Shmyrev

        It would be helpful if you could pack your files into an archive and give us a link.

         
    • Horizon

      Horizon - 2008-05-27

      Hi Nickolay,

      The files mentioned above are packed in an archive (http://rapidshare.com/files/118006694/SampleLM_Files.rar.html).
      You can run 'a.pl' (it assumes the CMU-Cam LM executables are in the same directory; I ran this under the latest Cygwin release).

      Thanks,
      Horrizon

       
    • Horizon

      Horizon - 2008-05-28

      That was a good suggestion, Nickolay! Hope that helps.

      Could someone take a brief look at the files I posted (link in the previous post) and see if something funny is happening? For starters, you don't even have to run the thing - just look at the files and file formats to see if anything anomalous can be spotted. It would be great if someone could post a sample set of input/intermediate/final output files generated with CMU-Cam LM.

      I am really curious, though: is there no good training recipe for CMU-Cam Toolkit LM training? I think Keith has done an excellent job, which is no doubt very helpful for people using SRILM.

      Horrizon

       
      • Nickolay V. Shmyrev

        Well, there are multiple issues here.

        1) Why use the old CMU-Cam toolkit when there is a newer, supported cmulmtk with the same code?
        2) Why do you merge the idngrams? The whole process should be simpler:

        text -> wfreq
        wfreq -> vocab
        text + vocab -> 3grams (wngrams or idngrams directly)
        3grams + smoothing for backoff -> language model

        You can check SimpleLM for an example, or the scripts inside cmulmtk; a rough script version of the pipeline is also sketched at the end of this post.

        3) The recent cmulmtk trunk has a bug in hashing that doesn't allow numbers in words, so your A0000 will not work out of the box. That bug needs to be fixed.
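
        For illustration, the pipeline from point 2 can be driven by a few lines of script, roughly as follows (a sketch only: the tool names are the standard cmulmtk / CMU-Cam binaries, but check the exact flags for your version, and the file names are placeholders):

        import subprocess

        def pipe(cmd, src, dst):
            # run one toolkit command as:  cmd < src > dst
            with open(src, "rb") as fin, open(dst, "wb") as fout:
                subprocess.check_call(cmd, stdin=fin, stdout=fout)

        pipe(["text2wfreq"], "corpus.txt", "corpus.wfreq")         # text -> wfreq
        pipe(["wfreq2vocab"], "corpus.wfreq", "corpus.vocab")      # wfreq -> vocab
        pipe(["text2idngram", "-vocab", "corpus.vocab", "-n", "3"],
             "corpus.txt", "corpus.idngram")                       # text + vocab -> 3grams
        subprocess.check_call(["idngram2lm", "-idngram", "corpus.idngram",
                               "-vocab", "corpus.vocab", "-n", "3",
                               "-arpa", "corpus.arpa"])            # 3grams + smoothing -> language model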

         
    • Horizon

      Horizon - 2008-05-29

      Thanks for the pointers, Nickolay - quite helpful. OK, so progress for sure, but it's kind of back to square one! At least it's not crashing any more, though :).

      The reason I thought of using mergeidngram was that when I use just the 1/2/3-gram files, the backoffs are either -99.99 or 0.00 (I am using -linear smoothing for now, though I don't think that affects the backoffs). I wish I had the text files from which the n-grams were extracted, but unfortunately I only have the n-gram counts. Given that, if I don't use mergeidngram, how do I pass the lower-order n-gram info (it looks like idngram2lm requires a single idngram file)? As it stands, I am able to generate an LM for a bigram, albeit with erroneous backoff weights.

      Another issue I am running into is "Error: Sum[P(w)] = 10.792" (reported when computing unigram probabilities; it appears to be the first set of results reported when processing a 3-gram). On other data I have observed sums as high as slightly above 100! Could this be happening because of the large word counts?

      With verbosity set to 4, I get the following output before the program quits on the Sum[] error (it indicates a few other issues as well):

      Warning: 1-gram: f-of-f[1] = 0 --> 1-gram discounting is disabled.
      Warning: 2-gram: f-of-f[1] = 0 --> 2-gram discounting is disabled.
      Warning: 3-gram: f-of-f[1] = 0 --> 3-gram discounting is disabled.
      prob[1] = 1e-99 count = 0 ## This is a filler word
      .
      .
      prob[N] = 1.308e-05 count = 22924
      .
      .

      Unigram's discount mass is -9.80
      Discount mass was rounded to zero (n1/N = 0)
      prob[unk] = 1e-99
      WARNING: 31 non-context-cue words have zero probability

      THE FINAL UNIGRAM:
      unigram[1] = 1e-99 ## This is a filler word which was mentioned in filler.ccs
      .
      .
      unigram[N] = 0.195
      .
      .

      Horrizon

       
      • Nickolay V. Shmyrev

        > The reason I thought of using mergeidngram was that when I use just the 1/2/3-gram files, the backoffs are either -99.99 or 0.00 (I am using -linear smoothing for now, though I don't think that affects the backoffs). I wish I had the text files from which the n-grams were extracted, but unfortunately I only have the n-gram counts. Given that, if I don't use mergeidngram, how do I pass the lower-order n-gram info (it looks like idngram2lm requires a single idngram file)? As it stands, I am able to generate an LM for a bigram, albeit with erroneous backoff weights.

        I think the bigram counts are contained in the 3-gram counts (the assumption is that you have all the trigrams from a text), so there is really no point in merging the 3-gram counts with the 2-gram ones; a small illustration is below. Try to get a simple text and build a model from it: the process will just dump the 3-grams into an idngram file and then build a correct LM from it.
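
        To illustrate the point: under that assumption the bigram counts can be recovered from the trigram counts by summing over the last word, so a separate 2-gram file adds nothing. A rough Python sketch, assuming a plain-text file of "w1 w2 w3 count" lines (the file name is a placeholder):

        from collections import defaultdict

        # Sum trigram counts over the last word to recover bigram counts.
        bigram_counts = defaultdict(int)
        with open("corpus.w3gram") as f:
            for line in f:
                w1, w2, w3, count = line.split()
                bigram_counts[(w1, w2)] += int(count)   # sum over w3

        for (w1, w2), count in sorted(bigram_counts.items()):
            print(w1, w2, count)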

        > Another issue I am running into is "Error: Sum[P(w)] = 10.792" (reported when computing unigram probabilities; it appears to be the first set of results reported when processing a 3-gram). On other data I have observed sums as high as slightly above 100! Could this be happening because of the large word counts?

        About the large counts: such big, precise values aren't important for an LM, of course. Did you try simply normalizing them (dividing by 1000)?

        > THE FINAL UNIGRAM:
        > unigram[1] = 1e-99 ## This is a filler word which was mentioned in filler.ccs

        I'll try to reproduce it myself; it's probably just the bug with numbers in words that I mentioned.

         
        • Nickolay V. Shmyrev

          To make sure we are working with the same code, did you migrate to cmuclmtk trunk?

           
