Hi,
I am training a recognizer on about 18 hours of data. I am adapting the callhome_egyptian s5 recipe (because part of my data is from the Callhome set), but I am having trouble with the call to callhome_train_lms.sh. I get the following perplexity info:
Perplexity over 139962.000000 words is 2704.151374
Perplexity over 139337.000000 words (excluding 625.000000 OOVs) is 2768.970163
2704.151374
First of all, the perplexity is extremely high, so I assume something is wrong (a perplexity around 2700 means the model is, on average, about as uncertain as picking uniformly from roughly 2700 words). Another odd indicator is the word count (139337). The set I am training on is a subset of a larger dataset, and running callhome_train_lms.sh on that original full dataset gives me the following output:
Perplexity over 97006.000000 words is 161.256061
Perplexity over 96103.000000 words (excluding 903.000000 OOVs) is 163.822326
161.256061
The word count here (96103) is roughly 45000 words smaller than for the subset, which seems wrong: a subset should not contain more words than the full set it was taken from. I have looked at train_lm.sh and compute_perplexity.cc, but I still can't figure out where this word count comes from, why it might be wrong, and therefore why the perplexity is so high.
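For what it's worth, this is the kind of raw token count I can check by hand; the paths below are just where the text files live in my setup, and the recipe may keep its LM training/heldout text elsewhere. The first column of a Kaldi 'text' file is the utterance ID, so one field per line is dropped from the count:

# count words in the training and heldout text (example paths)
awk '{ n += NF - 1 } END { print n, "words" }' data/train/text
awk '{ n += NF - 1 } END { print n, "words" }' data/dev/text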
Also, I have tried using the SRILM tools to build a bigram instead. This gives me the following stats:
0 zeroprobs, logprob= -32617.5 ppl= 205.187 ppl1= 386.593
and a G.fst with min/max weights 0 and -0.162378 (from fstisstochastic). I know that a minimum of 0 is ideal, but it rarely happens in practice, so I suspect this is wrong as well.
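For reference, the SRILM side was along these lines; the file names are placeholders rather than the exact paths in my setup:

# build and score a bigram with SRILM (placeholder file names)
ngram-count -order 2 -text lm_train.txt -lm bigram.arpa -kndiscount -interpolate
ngram -order 2 -lm bigram.arpa -ppl lm_heldout.txt
# the ngram -ppl call prints the "0 zeroprobs, logprob= ... ppl= ... ppl1= ..." line above

# the min/max stochasticity numbers come from running fstisstochastic on the
# compiled grammar, e.g. (example path):
fstisstochastic data/lang_test/G.fst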
Does someone have a suggestion as to where I should look, and what I should look for?
Thanks in advance,
Luise
Last edit: Luise Rygaard 2015-07-07
You might find the SRILM tools to be easier to use-- have a look for a setup that uses SRILM in its LM training.
Dan
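(A quick way to find such setups, assuming a Kaldi source checkout: the grep below simply lists recipe-local scripts that call SRILM's ngram-count.)

# run from the top of a Kaldi checkout; matches depend on the Kaldi version
grep -rl "ngram-count" egs/*/*/local/ 2>/dev/null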
On Tue, Jul 7, 2015 at 11:14 AM, Luise Rygaard luisev@users.sf.net wrote:
Thanks, Dan. As I included in the edited post above, I actually already tried SRILM. Would you say that I should not be concerned by the 0 minimum weight and just continue?
Your edited post said:
"Also, I have tried to use the SRILM tools instead to build a bigram
instead. This gives me the following stats
0 zeroprobs, logprob= -32617.5 ppl= 205.187 ppl1= 386.593
and a G.fst with min/max weights 0 and -0.162378 (from
fstisstochastic). I know that a min of 0 is ideal, but it just doesn't
really happen usually, so I feel like this is wrong as well."
That looks fine to me.
Dan
On Tue, Jul 7, 2015 at 11:57 AM, Luise Rygaard luisev@users.sf.net wrote:
Perfect. Thank you for your help!
Luise