We want to do phone recognition. We have N triphone HMMs available (N could be on the order of 50^3) and we create a unigram phone language model (say, over 50 phones).
During search, should the language model (in theory) be expanded to 50^3 possible paths? Otherwise the triphone models would never be used.
If so, is this actually done in practice (say, in Sphinx or HTK)?
Last edit: dovark 2013-04-03
HTK: yes. There is a flag in HVite that allows full expansion of single-phone words. So in phoneme recognition you will find that HVite slows down tremendously.
Btw, if I remember correctly, HTK also has an optional flag for silence expansion. For obvious reasons, you might not want to expand silence.
Sphinx: in the allphone mode of Sphinx3, a full triphone expansion was also done. Unfortunately I don't know Sphinx4 or PocketSphinx well enough to answer for them, so I will leave those to other experts.
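To make "full expansion" concrete, here is a minimal Python sketch of a phone loop expanded into triphone arcs. It is only an illustration, not code from HTK or Sphinx: the `l-c+r` label format follows the usual HTK naming convention, and `PHONES` and `expand_phone_loop` are hypothetical names invented for the example. Note that a unigram LM score would still depend only on the centre phone; it is the search network, not the LM itself, that grows to P^3 arcs.

```python
from itertools import product

# Toy phone set (an assumption); a real system would have around 50 phones.
PHONES = ["a", "b", "c"]

def expand_phone_loop(phones):
    """Expand a context-independent phone loop into triphone arcs.

    Each network state remembers a (left, centre) pair; the arc to
    state (centre, right) carries the triphone label left-centre+right.
    """
    arcs = []
    for left, centre, right in product(phones, repeat=3):
        label = f"{left}-{centre}+{right}"
        arcs.append(((left, centre), (centre, right), label))
    return arcs

arcs = expand_phone_loop(PHONES)
print(len(arcs))   # 27 arcs for 3 phones (3**3)
print(50 ** 3)     # 125000 arcs for a 50-phone set
```

With 50 phones the same loop yields 125000 arcs, which is consistent with the slowdown described for phoneme recognition.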
Arthur
I think the same problem also arises with a word LM. Since the last phone of a word can be followed by any of (say) K possible next phones, would the acoustic scores of all K paths be computed and stored separately?
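As a rough illustration of the fan-out in question, the following Python sketch lists the word-final triphone variants for one word. The lexicon, the phone symbols, and the function name `word_final_triphones` are all hypothetical, and the sketch assumes every word has at least two phones.

```python
# Hypothetical toy lexicon: word -> phone sequence.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "sat": ["s", "ae", "t"],
    "on": ["aa", "n"],
    "mat": ["m", "ae", "t"],
}

def word_final_triphones(word, lexicon):
    """List the triphone variants of a word's last phone, one per
    distinct phone that can start the following word."""
    phones = lexicon[word]
    left, last = phones[-2], phones[-1]
    # Every distinct word-initial phone is a possible right context.
    next_phones = sorted({p[0] for p in lexicon.values()})
    return [f"{left}-{last}+{right}" for right in next_phones]

variants = word_final_triphones("cat", LEXICON)
print(variants)   # K = 4 variants here, one per distinct word-initial phone
```

In a full expansion each of these K variants would be a separate path with its own acoustic score, which is why recognizers often approximate the right context instead.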
Yes. In HTK, again, you can fully expand it. In other recognizers you can find dozens of different implementations. A full account is beyond the scope of this reply, so I will just give some examples and throw out some jargon without getting into detail:
In Sphinx3 mode=flat, the left context is fully expanded, whereas the right context is approximated by multiplexed triphones.
In Sphinx3 mode=tree, composite triphones are used.
Of course, many sophisticated recognizers also use a two-stage paradigm: first generate a lattice, then do the full triphone expansion on the lattice in the second stage.
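A minimal Python sketch of the second-stage idea, assuming a first-pass phone lattice is already available as a list of arcs. The lattice contents, the `sil` boundary symbol, and the function name `expand_lattice` are assumptions for illustration; a real second pass would also have to split lattice nodes so that contexts stay consistent along each path.

```python
from collections import defaultdict

# Hypothetical first-pass phone lattice: (start_node, end_node, phone).
LATTICE = [
    (0, 1, "k"),
    (1, 2, "ae"),
    (1, 2, "eh"),
    (2, 3, "t"),
]

def expand_lattice(arcs, boundary="sil"):
    """Attach to every arc the triphone labels licensed by its
    neighbouring arcs; the lattice edges use `boundary` as context."""
    incoming = defaultdict(set)   # node -> phones on arcs ending there
    outgoing = defaultdict(set)   # node -> phones on arcs starting there
    for start, end, phone in arcs:
        outgoing[start].add(phone)
        incoming[end].add(phone)

    expanded = []
    for start, end, phone in arcs:
        lefts = sorted(incoming[start]) or [boundary]
        rights = sorted(outgoing[end]) or [boundary]
        for left in lefts:
            for right in rights:
                expanded.append((start, end, f"{left}-{phone}+{right}"))
    return expanded

expanded = expand_lattice(LATTICE)
for arc in expanded:
    print(arc)
```

Only the contexts that actually occur in the lattice are expanded, so the number of triphone arcs depends on the lattice size rather than on the cube of the phone-set size.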
Arthur