I want to use a class-based language model with Sphinx3. Is it supported?
Can each word in a class have its own probability (instead of a
uniform prior)?
I see source code supporting it, but I do not see an example anywhere.
Can someone please point me to an example where each word in the class
has a probability associated with it?
Yes, it's supported. It works the same as in Sphinx2 - you make a "control file" listing each language model with its associated classes, and a class definition file which lists the words in each class with their probabilities.
Look at the example in sphinx3/model/lm/an4, specifically these files:
args.an4.test.cls
an4.ug.cls.lmctl
an4.cls.probdef
The last file doesn't have any probabilities in it since there is only one member in the class (I don't know why the test was made this way, it isn't a very good test!). You can enter probabilities like this:
LMCLASS [v_class]
A 0.25
E 0.3
I 0.1
O 0.25
U 0.1
END [v_class]
It's probably a good idea for them to add up to one within each class. The code should just normalize them for you, but for some reason it doesn't.
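Since the probabilities apparently are not normalized for you, one option is to normalize the class definitions yourself before handing them to the decoder. The following is only a rough sketch, not part of Sphinx3: it assumes the LMCLASS ... END layout shown above, and the function names parse_classes and normalize are made up for illustration. How it treats a word with no probability (it gets an equal share of whatever mass is left over) is my own assumption, not documented Sphinx3 behavior.

```python
from collections import OrderedDict


def parse_classes(text):
    """Parse LMCLASS ... END blocks into {class_name: {word: prob or None}}."""
    classes = OrderedDict()
    current = None
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue
        if parts[0] == "LMCLASS":
            current = parts[1]              # e.g. "[v_class]"
            classes[current] = OrderedDict()
        elif parts[0] == "END":
            current = None
        elif current is not None:
            # A member line is "WORD" or "WORD PROB".
            prob = float(parts[1]) if len(parts) > 1 else None
            classes[current][parts[0]] = prob
    return classes


def normalize(members):
    """Return a copy of one class whose probabilities sum to 1.

    Assumption (not Sphinx3 behavior): words listed without a probability
    split the remaining mass equally before rescaling.
    """
    known = sum(p for p in members.values() if p is not None)
    missing = [w for w, p in members.items() if p is None]
    share = max(0.0, 1.0 - known) / len(missing) if missing else 0.0
    filled = {w: (share if p is None else p) for w, p in members.items()}
    total = sum(filled.values())
    if total <= 0:
        raise ValueError("class has no probability mass")
    return {w: p / total for w, p in filled.items()}
```

Calling normalize(parse_classes(text)["[v_class]"]) on the example above leaves the values essentially unchanged, since they already sum to one; a class written with raw counts instead of probabilities would be rescaled.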
Thanks a lot. I assume you meant the following files:
args.an4.test.cls.in
an4.ug.cls.lmctl.in
and an4.cls.probdef
Sphinx3 seems to be ignoring the class-based probabilities. I say this because I built both a class-based LM and
a regular trigram LM.
The perplexity of the sentence to be recognized is about 235 using the class-based LM (measured with the
equivalent class definition file and the SRILM toolkit). Even after I increase the LM weight to 14, the recognized string still contains
words that have a class-conditional probability of only 1e-7.
The perplexity of the recognized text (measured using SRILM) is 2.8e6. This is possible only if Sphinx3 (I used livepretend) is ignoring the class conditionals and assigning a uniform prior instead.
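To make the comparison above concrete: in a class-based LM the score of a word is normally the class n-gram probability times the within-class probability, P(w | h) = P(c | h) * P(w | c). The sketch below shows how the per-word log probability and the sentence perplexity shift if P(w | c) is replaced by a uniform 1/|c|. It is only an illustration of the arithmetic, not Sphinx3 or SRILM code; the function names are hypothetical, and the class n-gram log probabilities are assumed to come from whatever tool you use to score the text.

```python
import math


def word_logprob(class_logprob, word, members, uniform=False):
    """log10 P(w | h) = log10 P(c | h) + log10 P(w | c).

    class_logprob: log10 P(c | h) from the class n-gram model.
    members:       {word: P(word | class)} for the class containing `word`.
    uniform:       if True, ignore the listed probabilities and use 1/|c|.
    """
    p = 1.0 / len(members) if uniform else members[word]
    return class_logprob + math.log10(p)


def perplexity(log10_probs):
    """Perplexity of a word sequence given its per-word log10 probabilities."""
    return 10 ** (-sum(log10_probs) / len(log10_probs))
```

Scoring the recognized string both ways (uniform=False and uniform=True) and comparing each result against the decoder's own LM scores is one way to check which of the two the decoder is actually applying.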