Very high perplexity

2015-07-07
  • Luise Rygaard

    Luise Rygaard - 2015-07-07

    Hi,

    I am training a recognizer on about 18 hours of data. I am adapting the callhome_egyptian s5 recipe (because part of my data is from the Callhome set), but I am having trouble with the call to callhome_train_lms.sh. I get the following perplexity info:

    Perplexity over 139962.000000 words is 2704.151374
    Perplexity over 139337.000000 words (excluding 625.000000 OOVs) is 2768.970163
    2704.151374

    First of all, the perplexity is extremely high, so I assume something is wrong. Another indicator is the number of words (139337). The set I am training on is a subset of a different dataset. The original full dataset gives me the following output after calling callhome_train_lms.sh:

    Perplexity over 97006.000000 words is 161.256061
    Perplexity over 96103.000000 words (excluding 903.000000 OOVs) is 163.822326
    161.256061

    The word count here (96103) is significantly smaller (by about 45,000 words) than for the subset, which seems wrong. I have looked at train_lm.sh and compute_perplexity.cc, but I still can't figure out where the word count comes from or why it might be wrong - and thus why the perplexity is so high.
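
    As a rough sanity check, the token counts of the text files that the perplexity is computed over can be compared with the figures above; the paths below are only a guess at where the recipe writes its LM data:

    # hypothetical paths -- adjust to wherever callhome_train_lms.sh puts its LM text
    wc -w data/local/lm/train.txt   # tokens in the LM training text
    wc -w data/local/lm/dev.txt     # tokens in the held-out text the perplexity is reported on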

    Also, I have tried using the SRILM tools to build a bigram instead. This gives me the following stats:

    0 zeroprobs, logprob= -32617.5 ppl= 205.187 ppl1= 386.593

    and a G.fst with min/max weights 0 and -0.162378 (from fstisstochastic). I know that 0 is the ideal value, but it usually doesn't come out exactly 0 in practice, so I feel like this is wrong as well.
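
    For reference, SRILM derives these numbers from the total log-probability and the token counts, so they can be cross-checked against each other:

    ppl  = 10^(-logprob / (words - OOVs - zeroprobs + sentences))
    ppl1 = 10^(-logprob / (words - OOVs - zeroprobs))

    With logprob = -32617.5 and ppl = 205.187, this works out to roughly 32617.5 / log10(205.187), i.e. about 14,100 scored tokens including sentence ends, which can be compared against the word counts reported above.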

    Does someone have a suggestion as to where I should look, and what I should look for?

    Thanks in advance,
    Luise

     

    Last edit: Luise Rygaard 2015-07-07
    • Daniel Povey

      Daniel Povey - 2015-07-07

      You might find the SRILM tools to be easier to use -- have a look for a setup that uses SRILM in its LM training.
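
      A minimal sketch of what that might look like (file names are placeholders, and modified Kneser-Ney is just one choice of smoothing):

      # build a bigram LM with SRILM and score a held-out set
      ngram-count -order 2 -text train.txt -lm bigram.arpa -kndiscount -interpolate
      ngram -order 2 -lm bigram.arpa -ppl heldout.txt

      The resulting ARPA file can then be converted to G.fst the same way the recipe already does it.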
      Dan

      • Luise Rygaard

        Luise Rygaard - 2015-07-07

        Thanks, Dan. As I included in the edited post above, I actually already tried SRILM. Would you say that I should not be concerned by the 0 minimum weight and just continue?

         
        • Daniel Povey

          Daniel Povey - 2015-07-07

          Your edited post said:
          "Also, I have tried using the SRILM tools to build a bigram instead. This gives me the following stats:

          0 zeroprobs, logprob= -32617.5 ppl= 205.187 ppl1= 386.593

          and a G.fst with min/max weights 0 and -0.162378 (from fstisstochastic). I know that 0 is the ideal value, but it usually doesn't come out exactly 0 in practice, so I feel like this is wrong as well."

          That looks fine to me.
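
          For reference, the two numbers that fstisstochastic prints are the minimum and maximum, over all states, of the negated log of the total probability mass leaving the state (arcs plus final probability). Ideally both are 0, and a small negative value like -0.162378 is typical for a backoff LM, because the backoff paths add a little extra mass. The check itself is just (the path is only an example and depends on the recipe):

          fstisstochastic data/lang_test/G.fst
          # prints two numbers; values close to zero mean the FST is close to stochastic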

          Dan


           
  • Luise Rygaard

    Luise Rygaard - 2015-07-07

    Perfect. Thank you for your help!
    Luise