Hi,
I have a question regarding the use of a bigram language model in Sphinx 3. We can only train a trigram model using the CMU lmtool utility. What is the way to train bigram models?
And what is the way to make Sphinx 3 work with bigram models? I really appreciate your help.
Thanks a lot,
abhishek.
> What is the way to train bigram models?
http://www.speech.sri.com/projects/srilm/
> What is the way to make Sphinx 3 work with bigram models?
There is no difference, just pass -lm bigram.lm.
Check sphinx3/src/tests/performance/rm1/ARGS.rm1_bigram
and sphinx3/src/tests/performance/rm1/RM.2845.bigram.arpa.DMP for an example.
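For example, a minimal sketch of the two steps (corpus.txt and the output names here are hypothetical, and all decoder arguments besides -lm are omitted):

    # Train a bigram LM with SRILM's ngram-count (-order 2 makes it a bigram model)
    ngram-count -text corpus.txt -order 2 -lm bigram.lm

    # Decode with Sphinx 3 as usual, just pointing -lm at the bigram model
    # (the remaining acoustic-model arguments are the same as in ARGS.rm1_bigram)
    sphinx3_decode -lm bigram.lm ...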
Thanks Nickolay, it's working, although I had to make some changes to the model generated by SRILM. It doesn't put the backoff number at the end of the first token in the unigram section; I had to put a number there to make it readable by lm3g2dmp. Also, the second token (in my case <s>) has a very low probability score of -99 regardless of the text I give, while the first token </s> has a reasonable probability score. I couldn't understand the reason for it.
> Although I had to make some changes to the model generated by SRILM. It doesn't put the backoff number at the end of the first token in the unigram section
Yes, it's documented in SRILM: you need to use the add-dummy-bows script to add the missing backoff weights. Also, you need to use sort-lm to sort the LM.
> Also, the second token (in my case <s>) has a very low probability score of -99 regardless of the text I give.
It should be so: every LM produced by SRILM has -99 as the log probability of <s>, because the sentence start marker only ever appears as context and is never predicted.
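To illustrate, the unigram section of a typical SRILM ARPA file looks like this (an excerpt with made-up numbers; each line is log10 probability, word, then the optional log10 backoff weight):

    \1-grams:
    -1.2041 </s>
    -99     <s>    -0.3010
    -0.8451 hello  -0.2218

Note that </s> has no backoff weight here, which is exactly what add-dummy-bows fills in, and that <s> is the token carrying the -99 probability.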
I used add-dummy-bows on the bigram LM obtained from SRILM. It puts a zero as the backoff weight on the first unigram (</s>). Is that OK?
I didn't use sort-lm before, and lm3g2dmp didn't give me any error. Do I still need to use it for some other reason?
Thanks a lot,
abhishek.
> I used add-dummy-bows on the bigram LM obtained from SRILM. It puts a zero as the backoff weight on the first unigram (</s>). Is that OK?
Yes.
> I didn't use sort-lm before, and lm3g2dmp didn't give me any error. Do I still need to use it for some other reason?
Yes, you do. lm3g2dmp assumes its input is sorted; it will not work if you don't sort.
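For reference, the whole conversion then looks roughly like this (a sketch; the intermediate file names are hypothetical, and I'm assuming the SRILM scripts read stdin and write stdout, and that lm3g2dmp takes the ARPA file plus an output directory):

    # Add the dummy backoff weights SRILM leaves out (e.g. on </s>)
    add-dummy-bows < bigram.lm > bigram.bows.lm

    # Sort the n-grams; lm3g2dmp assumes sorted input
    sort-lm < bigram.bows.lm > bigram.sorted.lm

    # Convert the ARPA file to the binary DMP format Sphinx 3 reads
    lm3g2dmp bigram.sorted.lm .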