I want to perform word recognition "guided" by transcripts. The idea is to generate a language model based on the transcript, then use ngram_model_set_init to combine that with the generic US English language model, forming a biased language model.
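For concreteness, here is roughly how I build the combined model (a sketch only; the file names are placeholders for my setup, and the weights are the very values I'm asking about):

    #include <sphinxbase/logmath.h>
    #include <sphinxbase/ngram_model.h>

    int main(void)
    {
        /* Log-math context needed by the LM reader. */
        logmath_t *lmath = logmath_init(1.0001, 0, 0);

        /* Generic LM plus the per-transcript LM (placeholder paths). */
        ngram_model_t *models[2];
        models[0] = ngram_model_read(NULL, "en-us.lm.bin", NGRAM_AUTO, lmath);
        models[1] = ngram_model_read(NULL, "transcript.lm", NGRAM_AUTO, lmath);

        const char *names[] = { "generic", "transcript" };
        float32 weights[] = { 0.5f, 0.5f };  /* <-- how should these be chosen? */

        /* Interpolate the two models into one biased LM. */
        ngram_model_t *biased =
            ngram_model_set_init(NULL, models, (char **)names, weights, 2);

        /* ... hand "biased" to the decoder, e.g. via ps_set_lm() ... */
        (void)biased;
        return 0;
    }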
How do I determine the lm weights so that the transcript lm dominates, but the generic lm still applies whenever the transcript is clearly wrong? -- If I wanted to do this only once, I could figure out the weights by trial and error. But I want a formula that works for any transcript lm.
So here's what I was thinking: The higher the number of unigrams in the transcript lm, the lower the average probability for each unigram. So using fixed weights for both language models would give too much preference to the transcript lm if it contains few unigrams, and too little if it contains many. Instead, I'd determine the weight for the transcript lm by dividing a fixed number α by the number of unigrams in the transcript lm. After that, I'd normalize the weights to add up to 1.0.
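To make that concrete, a sketch of the weighting I have in mind (α and the normalization scheme are my own proposal, not an established recipe):

    /* Proposed weighting: the transcript LM gets alpha / n_unigrams,
     * the generic LM keeps a raw weight of 1.0, and both are then
     * normalized to sum to 1.0 before being passed to
     * ngram_model_set_init(). */
    void propose_weights(float alpha, int n_unigrams,
                         float *w_transcript, float *w_generic)
    {
        float raw = alpha / (float)n_unigrams;
        *w_transcript = raw / (raw + 1.0f);
        *w_generic = 1.0f / (raw + 1.0f);
    }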
Does this approach make sense or is there a better way? Has somebody already figured out a value for α that works well in general?
Fixed value should work ok.
I successfully created a biased LM using ngram_model_set_init and weights [1.0, 1.0]. I then experimented with different weights. At one time, I passed [1000.0, 1000.0]. Given that the ratio is 1:1, I expected the same results as for [1.0, 1.0]. To my surprise, the result was not only different, but actually better!
So my question is: What should these weights add up to? Are there any guidelines? And what exactly changes with absolute values rather than the ratio?
There is no normalization; the weight simply adds to the score calculated by the model. So if 1000 is better, it probably means your language weight is not perfectly calibrated; a slightly different language weight should have the same effect as the set weight.
I recommend that the weights sum to 1.0, but it is not enforced.
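In other words (a sketch of my understanding, not a spec): with set weights \lambda_i, the combined score is roughly

    \mathrm{score}(w \mid h) \approx \log \lambda_i + \log P_i(w \mid h)

so multiplying both weights by 1000 adds the constant \log 1000 to every word's LM score. Relative to the acoustic scores, that acts like a changed word insertion penalty, not a reweighting between the two models.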
I'm using the CMUSphinx US English generic acoustic model (server version) along with the default language model that comes with Pocketsphinx. I also use the default language model weight of 6.5.
Now I'm wondering:
The readme file for the acoustic model says, "To use this model for large vocabulary speech recognition download also CMUDict and US English generic language model." Does this refer to the files that ship with Pocketsphinx and which I already use, or are there larger versions of the dictionary and LM available?
Assuming that I'm using the correct dictionary and LM, what is the best way to determine a good language model weight?
The readme file for the acoustic model says, "To use this model for large vocabulary speech recognition download also CMUDict and US English generic language model." Does this refer to the files that ship with Pocketsphinx and which I already use, or are there larger versions of the dictionary and LM available?
A big language model is available on the website, but it is still not perfect:
https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English/en-70k-0.2.lm.gz/download
It is not currently supported in pocketsphinx, unfortunately, due to a small bug which I wanted to fix:
https://sourceforge.net/p/cmusphinx/discussion/sphinx4/thread/e2a8acc9/
The modern state of the art is to use large RNNLM models; those are much more accurate but have to be trained separately.
As for the weight: create a test set as described in http://cmusphinx.sourceforge.net/wiki/tutorialtuning, then run with different language weights and pick the one that gives the smallest WER.
I've collected about 30 minutes of transcribed audio and I'm running pocketsphinx_batch in order to determine a good language model weight. Everything is working fine, but processing is taking rather long (about real time) on my machine:
1.05 xRT (CPU), 1.07 xRT (elapsed)
The tutorial mentions decoding times about 30 times faster (0.03 xRT). Am I doing something wrong?
I'm using an almost up-to-date version of Pocketsphinx, built in Release mode on Windows. My machine is an i5 with good overall performance.
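For reference, my invocation looks roughly like this (a sketch; the paths are placeholders, and the flags are the ones the tuning tutorial uses):

    pocketsphinx_batch \
        -adcin yes -cepdir wav -cepext .wav \
        -ctl test.fileids -hyp test.hyp \
        -hmm en-us -lm en-us.lm.bin -dict cmudict-en-us.dict \
        -lw 6.5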
It is ok for a continuous model. The 0.03 xRT figure is for a PTM model and a small grammar.
I performed 20 test passes with varying language model weight. All tests used a 27-minute body of high-quality speech. All tests used the CMUSphinx US English generic acoustic model (server version) along with the default language model that comes with Pocketsphinx.
Here are my results:
Best accuracy was achieved with language model weights between 2.0 and 6.5. Within this range, the total accuracy was between 72% and 73%.
Language model weights lower than 2.0 resulted in slightly worse accuracy, along with much higher processing times.
Language model weights higher than 6.5 resulted in much lower accuracy.
I noted earlier that for a biased language model, higher language model weights seem to give better results. Intuitively, that makes perfect sense, since a biased language model already "knows" the correct words and word tuples.
Thus, the default language model weight of 6.5 appears to be a good compromise for both unbiased and biased language models. Based on my tests, I might choose a slightly lower value (around 5.0) for unbiased language models and a higher value (tbd.) for biased ones.
This is reasonable. Please note that there are 3 stages - fwdtree, fwdflat, bestpath - each with a different language weight. You need to disable fwdflat and bestpath to tune the fwdtree lw, then tune the corresponding fwdflat and bestpath lw. Overall, the defaults should be ok; it's just that they might be different if you have a very specialized lm.
I started this thread asking how to choose weights for a biased language model. The following are my results; they may be useful to others, too.
To recap: I'm loading the default language model that comes with Pocketsphinx ("default model") as well as a language model generated from the transcript of a single recording ("dialog model"). I then combine those into a single biased language model by calling ngram_model_set_init. The two weights I pass always add up to 1.0, so in the following, I'll only mention the dialog model weights.
First, I wanted to know the range of dialog model weights that would still result in a sufficiently biased model. So I chose a short recording that resulted in 7 errors using the default model. (By "errors" I mean the Levenshtein distance between the actual and the recognized words.) I started with a dialog model weight of 0.9, which reduced the number of errors to 0. I then gradually lowered the dialog model weight to see how low I could go. Surprisingly (to me), I could go down to 0.000000000000000008 (8e-18), while still getting 0 errors. Only when reducing the dialog model weight to 5e-18 did I get the first error.
Next, I wanted to know whether the required dialog model weight depends on the size of the dialog model. So I artificially bloated the dialog model by adding nonsense words to the dialog file. This increased the model size tenfold from 15 to 150 unigrams. My intuition was that I would now need ten times the dialog model weight to get 0 errors. However, the results were identical to the earlier tests: 8e-18 gave 0 errors; 5e-18 gave 1 error. I don't understand why this is the case, but it certainly makes matters easier: It allows me to choose a fixed weight rather than calculating it dynamically based on the dialog.
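My best guess at an explanation (pure speculation on my part): the set weight seems to enter the combined score additively in log space, so what matters is \log \lambda rather than \lambda itself, e.g.

    \log_{10}(8 \times 10^{-18}) \approx -17.1

which is a fixed penalty that the dialog model's much higher n-gram probabilities can apparently absorb. And bloating the model with nonsense unigrams barely changes the probabilities of the real bigrams and trigrams that drive recognition, which would explain why the break-even weight did not move.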
Clearly, there is a huge range of possible weights that will give good results (in my case, 8e-18 .. 0.9), as long as the provided transcript is correct. But what if it isn't?
To find that out, I created a fake transcript, where each word sounds similar to the actual one, but is in fact wrong. This is the worst possible case. I used a dialog weight of 0.9, expecting to get a result containing all the errors from the fake transcript. To my surprise, however, only a single false word from the transcript ("butt" instead of "but") ended up in the result. Apart from that, the result was identical to the one I got using only the default model.
This is great news: It means that you have great freedom in choosing weights. Even an extremely low dialog model weight will still improve the results. And even a very high dialog model weight won't allow a false transcript to ruin the results.
tl;dr: Use fixed weights of 0.9 : 0.1, or anything similar.
You mentioned that there are three language model weights. If I understand correctly, -lw only affects the 'fwdtree' stage, while -fwdflatlw and -bestpathlw affect the 'flat lexicon' and 'bestpath' stages, respectively. Is that correct?
For biased language models, I'd like to experiment with higher language model weights. I noted before that passing a combined weight of 2000 to ngram_model_set_init led to better results. Now I'd like to get similar results with normalized model set weights and higher lm weights instead. Does that mean that I should multiply the three lm weights by 2000, passing -lw 13000 -fwdflatlw 17000 -bestpathlw 19000?
You propose to tweak the three lm weights individually. Does that mean setting two out of three to 0? Will that give meaningful results?
You mentioned that there are three language model weights. If I understand correctly, -lw only affects the 'fwdtree' stage, while -fwdflatlw and -bestpathlw affect the 'flat lexicon' and 'bestpath' stages, respectively. Is that correct?
Yes
For biased language models, I'd like to experiment with higher language model weights. I noted before that passing a combined weight of 2000 to ngram_model_set_init led to better results. Now I'd like to get similar results with normalized model set weights and higher lm weights instead. Does that mean that I should multiply the three lm weights by 2000, passing -lw 13000 -fwdflatlw 17000 -bestpathlw 19000?
This is strange; such a high weight should not work at all. A reasonable weight is between 3.0 and 30.0, not more.
You propose to tweak the three lm weights individually. Does that mean setting two out of three to 0? Will that give meaningful results?
You disable fwdflat and bestpath first with -fwdflat no -bestpath no, then enable fwdflat and tune fwdflatlw, then enable bestpath.
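Concretely, something like this (a sketch; the remaining arguments are elided, and the starting values shown are just the defaults):

    # Stage 1: fwdtree only -- tune -lw
    pocketsphinx_batch ... -fwdflat no -bestpath no -lw 6.5
    # Stage 2: re-enable fwdflat -- tune -fwdflatlw
    pocketsphinx_batch ... -fwdflat yes -bestpath no -fwdflatlw 8.5
    # Stage 3: re-enable bestpath -- tune -bestpathlw
    pocketsphinx_batch ... -fwdflat yes -bestpath yes -bestpathlw 9.5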
You disable fwdflat and bestpath first with -fwdflat no -bestpath no, then enable fwdflat and tune fwdflatlw, then enable bestpath.
That makes sense, thank you. My plan was to raise the lm weights just high enough that I minimize the error rate, assuming that the biased lm is based on a correct transcript. Does this approach make sense, or will disabling fwdflat and bestpath lead to widely different results?
It should be ok. I don't think a very high weight is needed; it should be something different.
I did some tests, but the results are sobering. I couldn't get consistently lower error rates for biased language models by raising the lm weights. So I'm dropping the idea for the time being.
Getting back to this topic. I have a number of recordings that sound very clear (to me). For each recording, I have a correct transcript, which I turn into a language model. I then combine this specialized language model with the default language model using 0.9:0.1 weights. The result is a biased language model that strongly favors the correct words and word tuples.
What I don't understand is why PocketSphinx still recognizes so many words incorrectly. My understanding is that raising the language model weights would tell PocketSphinx to give more regard to the probabilities in the language model, which should boost the correct words from the transcript. But raising the lm weights only increases the error rate for me.
Am I doing something wrong? Is there another way to tell PocketSphinx to pay more attention to the lm probabilities?
The acoustic model comes first, so it is more important. The beam controls that: it can cut an acoustic hypothesis before it survives to the end for lm scoring. You can try a very large beam; you should see that your lm starts to work better. But still, it's very important to get the acoustic model right.
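For example (the values are only illustrative; smaller thresholds mean wider beams):

    pocketsphinx_batch ... -beam 1e-80 -pbeam 1e-80 -wbeam 1e-60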
To reduce other effects, you can disable fwdflat and bestpath, testing only fwdtree.
If you care a lot about accuracy, you can also try DNN decoders like Kaldi; they should be quite accurate.