Hi,
I am training a recognizer on about 18 hours of data. I am adapting the callhome_egyptian s5 recipe (because part of my data is from the Callhome set), but I am having trouble with the call to callhome_train_lms.sh. I get the following perplexity info:
Perplexity over 139962.000000 words is 2704.151374
Perplexity over 139337.000000 words (excluding 625.000000 OOVs) is 2768.970163
2704.151374
First of all, the perplexity is extremely high, so I assume something is wrong (a perplexity around 2700 means the model is, on average, about as uncertain as picking uniformly from roughly 2700 words). Another odd indicator is the word count (139337). The set I am training on is a subset of a larger dataset, and running callhome_train_lms.sh on that original full dataset gives me the following output:
Perplexity over 97006.000000 words is 161.256061
Perplexity over 96103.000000 words (excluding 903.000000 OOVs) is 163.822326
161.256061
The word count here (96103) is roughly 45000 words smaller than for the subset, which seems wrong: a subset should not contain more words than the full set it was taken from. I have looked at train_lm.sh and compute_perplexity.cc, but I still can't figure out where this word count comes from, why it might be wrong, and therefore why the perplexity is so high.
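For what it's worth, this is the kind of raw token count I can check by hand; the paths below are just where the text files live in my setup, and the recipe may keep its LM training/heldout text elsewhere. The first column of a Kaldi 'text' file is the utterance ID, so one field per line is dropped from the count:

# count words in the training and heldout text (example paths)
awk '{ n += NF - 1 } END { print n, "words" }' data/train/text
awk '{ n += NF - 1 } END { print n, "words" }' data/dev/text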
Also, I have tried using the SRILM tools to build a bigram instead. This gives me the following stats:
0 zeroprobs, logprob= -32617.5 ppl= 205.187 ppl1= 386.593
and a G.fst with min/max weights 0 and -0.162378 (from fstisstochastic). I know that a minimum of 0 is ideal, but it rarely happens in practice, so I suspect this is wrong as well.
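For reference, the SRILM side was along these lines; the file names are placeholders rather than the exact paths in my setup:

# build and score a bigram with SRILM (placeholder file names)
ngram-count -order 2 -text lm_train.txt -lm bigram.arpa -kndiscount -interpolate
ngram -order 2 -lm bigram.arpa -ppl lm_heldout.txt
# the ngram -ppl call prints the "0 zeroprobs, logprob= ... ppl= ... ppl1= ..." line above

# the min/max stochasticity numbers come from running fstisstochastic on the
# compiled grammar, e.g. (example path):
fstisstochastic data/lang_test/G.fst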
Does someone have a suggestion as to where I should look, and what I should look for?
Thanks in advance,
Luise
Last edit: Luise Rygaard 2015-07-07
You might find the SRILM tools to be easier to use-- have a look for a setup that uses SRILM in its LM training.
Dan
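(A quick way to find such setups, assuming a Kaldi source checkout: the grep below simply lists recipe-local scripts that call SRILM's ngram-count.)

# run from the top of a Kaldi checkout; matches depend on the Kaldi version
grep -rl "ngram-count" egs/*/*/local/ 2>/dev/null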
On Tue, Jul 7, 2015 at 11:14 AM, Luise Rygaard luisev@users.sf.net wrote:
Thanks, Dan. As I included in the edited post above, I actually already tried SRILM. Would you say that I should not be concerned by the 0 minimum weight and just continue?
Your edited post said:
"Also, I have tried to use the SRILM tools instead to build a bigram
instead. This gives me the following stats
0 zeroprobs, logprob= -32617.5 ppl= 205.187 ppl1= 386.593
and a G.fst with min/max weights 0 and -0.162378 (from
fstisstochastic). I know that a min of 0 is ideal, but it just doesn't
really happen usually, so I feel like this is wrong as well."
That looks fine to me.
Dan
On Tue, Jul 7, 2015 at 11:57 AM, Luise Rygaard luisev@users.sf.net wrote:
Perfect. Thank you for your help!
Luise