Speech Recognition Toolkit

How is the search graph created when decoding phones using triphone acoustic models?

Forum: Speech Recognition Theory

Creator: dovark

Created: 2013-04-03

Updated: 2013-04-03

dovark - 2013-04-03

Hi,

I'm curious about the following question.

We want to do phone recognition. We have N triphone HMM models available (N could be of the order of 50^3) and we create a unigram phone language model (say of 50 phones).

During search, (theoretically) should the language model be expanded to 50^3 possible paths? Because otherwise the triphone models will not be utilized.

If yes, is this also done in practice (say in Sphinx/HTK)?

Last edit: dovark 2013-04-03

The Grand Janitor - 2013-04-03

Woo... Juicy Question. ;) (Juicy ?)

Short answer:

HTK: yes. There is a flag in HVite which allows full expansion of a single-phone word. So in phoneme recognition, you would find that HVite slows down tremendously.

Btw, if I remember correctly, HTK also has an optional flag for silence expansion. For trivial reasons, you might not want to expand silence.

Sphinx: in the allphone mode of sphinx3, a full expansion was also done for triphones. Unfortunately I don't know s4/ps well enough to give you answers on them. I will leave them to other experts.
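To make "full expansion" concrete, here is a minimal sketch of what expanding a unigram phone loop into a triphone-level graph could look like. It is not taken from HTK or Sphinx; the function name `expand_phone_loop` and the toy phone set are made up for illustration, and acoustic and language model scores are left out.

```python
from itertools import product

def expand_phone_loop(phones):
    """Expand a unigram phone loop into a triphone-level graph.

    Each node is a (left, center, right) triple. An arc joins l-c+r to
    c-r+x for every phone x, so contexts stay consistent along any path.
    """
    nodes = list(product(phones, repeat=3))
    arcs = {
        (l, c, r): [(c, r, x) for x in phones]
        for (l, c, r) in nodes
    }
    return nodes, arcs

# Hypothetical toy phone set; a real system would have about 50 phones.
phones = ["AA", "B", "K", "S"]
nodes, arcs = expand_phone_loop(phones)
n_arcs = sum(len(successors) for successors in arcs.values())
print(len(nodes), n_arcs)  # 64 nodes (4^3) and 256 arcs (4^4)
```

With 50 phones the same construction gives 50^3 = 125,000 nodes and 50^4 = 6,250,000 arcs, which is one way to see why a fully expanded phone loop decodes so slowly.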

Arthur

dovark - 2013-04-03

Thanks Arthur.

The same problem also arises with a word LM, I think. Since the last phoneme of a word can be followed by any of (say) K possible next phones, would the acoustic scores of all K paths be computed and stored separately?

The Grand Janitor - 2013-04-03

Yes. In HTK, again, you can fully expand it. Other recognizers handle this in dozens of different ways. A full account is beyond the scope of this post, so I will just give you some examples and throw out some jargon without getting into detail:

In Sphinx3 mode=flat, the left context is fully expanded, whereas the right context is approximated by multiplexed triphones.

In Sphinx3 mode=tree, composite triphones are used.

Of course, many sophisticated recognizers also use a two-stage paradigm: first generate a lattice, then do the full triphone expansion on the lattice in the second stage.
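As a rough illustration of the cross-word fan-out being discussed, the sketch below lists the triphone variants needed at each position of one word under full expansion. It does not reproduce how HTK or Sphinx3 build their graphs; `word_triphones`, the `l-c+r` naming, and the toy phone set are assumptions made for the example.

```python
def word_triphones(pron, phones):
    """List triphone variants for each phone position of one word.

    Word-internal phones have a single variant. The first phone fans out
    over every possible left context and the last phone over every
    possible right context, since the neighbouring word is unknown.
    """
    variants = []
    last = len(pron) - 1
    for i, center in enumerate(pron):
        lefts = [pron[i - 1]] if i > 0 else phones
        rights = [pron[i + 1]] if i < last else phones
        variants.append([f"{l}-{center}+{r}" for l in lefts for r in rights])
    return variants

# Hypothetical toy phone set and pronunciation.
phones = ["K", "AE", "T", "S"]
cat = word_triphones(["K", "AE", "T"], phones)
print([len(v) for v in cat])  # [4, 1, 4]

# A single-phone word needs every (left, right) pair: K * K variants.
single = word_triphones(["AE"], phones)
print(len(single[0]))  # 16
```

The single-phone case is the expensive one: with K phones it needs K^2 variants, which fits the earlier remark about single-phone words slowing HVite down.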

Arthur
