Hi, I read (http://cmusphinx.sourceforge.net/wiki/acousticmodeltypes) that the ptm model uses around 5000 gaussians and the semi-continuous model uses around 700 gaussians. When training either of these two models, it is recommended to set the number of gaussian densities to 256. If I understand correctly, the 256 densities one sets in the cfg file are used for VQ of the feature vectors. The other 5000 (or 700) are used to build the senones (gaussian mixtures for the state output probabilities of the HMMs).
Is this correct?
And are there any lectures, papers or books where one can get more insight into the theory behind the CMU Sphinx speech recognition toolkit?
This manual (http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html) has given me some great info so far, but I was hoping to find something similar that is up to date.
"the 256 densities one sets in the cfg file are used for VQ of the feature vectors. The other 5000 (or 700) are used to build the senones (gaussian mixtures for the state output probabilities of the HMMs)."
There are no "other" gaussians. In the configuration file you set 256 gaussians per stream. Semi-continuous models usually have 3 streams, so 3 * 256 = 768, i.e. roughly 700 gaussians. In ptm models the gaussians are phone-dependent, but the default setting is 64 gaussians per stream, so 3 streams * 64 gaussians per stream * 30 phones = 5760, i.e. roughly 5000 gaussians in the model.
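To make the arithmetic concrete, here is a small sketch in Python. The function name and its parameters are made up for illustration and are not part of SphinxTrain; the numbers are the ones quoted above.

```python
def total_gaussians(streams: int, per_stream: int, codebooks: int = 1) -> int:
    """Total gaussians = streams * gaussians per stream * number of codebooks.

    'codebooks' is a hypothetical name: 1 for a semi-continuous model
    (one shared codebook), or the number of phones for a ptm model
    (one codebook per phone).
    """
    return streams * per_stream * codebooks


# Semi-continuous: 3 streams, 256 gaussians per stream, one shared codebook.
semi = total_gaussians(streams=3, per_stream=256)

# ptm: 3 streams, 64 gaussians per stream, 30 phones.
ptm = total_gaussians(streams=3, per_stream=64, codebooks=30)

print(semi, ptm)  # 768 5760
```

So the "700" and "5000" figures are just rounded products of the per-stream setting, not a second set of gaussians.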
"And are there any lectures, papers or books where one can get more insight into the theory behind the CMU Sphinx speech recognition toolkit?"
Spoken Language Processing
http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165
One more question: when you say 3 streams, do you mean feature streams?
Yes, for semi-continuous models we analyze the features, the feature deltas and the feature delta-deltas in separate streams, each with its own gaussians.
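As a rough illustration of what "separate streams" means, the sketch below derives deltas and delta-deltas from a few made-up feature frames and keeps each as its own stream. This is a simplification: the difference window, the edge handling and all names here are assumptions, and the actual Sphinx front end uses its own window widths and stream layout.

```python
from typing import Dict, List

Frame = List[float]


def deltas(frames: List[Frame]) -> List[Frame]:
    """Symmetric difference frames[t+1] - frames[t-1], clamping at the edges.

    Assumed window of +/- 1 frame, chosen only to keep the example short.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([b - a for a, b in zip(prev, nxt)])
    return out


def split_streams(cepstra: List[Frame]) -> Dict[str, List[Frame]]:
    """Return three streams; each would be modelled by its own gaussians."""
    d = deltas(cepstra)
    dd = deltas(d)
    return {"cepstra": cepstra, "delta": d, "delta_delta": dd}


# Toy input: 3 frames of 2 coefficients each (made-up values).
cepstra = [[0.0, 1.0], [1.0, 3.0], [4.0, 4.0]]
streams = split_streams(cepstra)
for name, frames in streams.items():
    print(name, frames)
```

With 256 gaussians per stream, each of the three lists above would get its own set of 256, which is where the 3 * 256 figure comes from.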
Thank you