Is there somewhere that describes the Sphinx algorithms in detail? I can step through the code, but that's a little inefficient.
For instance it uses HMM routines, but what exactly is it modeling? This is what I'm assuming it does:
Am I completely off base?
Well, 1 and 2 are OK...
3 - Render the FFT down to something almost like a "cepstrum": a way of looking at a waveform that strips out much of what makes each speaker an individual and concentrates on the commonality that lets us communicate.
(Thing is, 1, 2, and 3 are well documented and implemented, if not entirely understood, and they make up maybe 7% of the system no matter how you look at it: 7% of the code, 7% of the processor time and memory used, etc.) The big, big thing is...
4 - Search a bunch of HMMs for the sequences that best match the cepstral frames. (HMMs don't "match" things; you have to make things match the HMMs, and a good search algorithm is the secret sauce that makes an efficient recognizer.)
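To make steps 3 and 4 a bit more concrete, here is a rough toy sketch in plain Python. It is not Sphinx's code. The first half computes a textbook real cepstrum (inverse DFT of the log magnitude spectrum), whereas Sphinx's front end produces mel-frequency cepstral coefficients. The second half is a plain Viterbi search over one tiny HMM with made-up discrete symbols; a real recognizer scores continuous cepstral vectors and prunes the search heavily. All names, states, and probabilities below are assumptions chosen for illustration.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform; fine for a toy frame."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, floor=1e-10):
    """Textbook real cepstrum: inverse DFT of the log magnitude spectrum."""
    # The floor avoids log(0) for empty frequency bins.
    log_mag = [math.log(max(abs(c), floor)) for c in dft(frame)]
    return [c.real for c in dft(log_mag, inverse=True)]


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return (log score, state path) of the most likely path for obs."""
    # best[s] = (log score of the best path ending in s, that path)
    best = {s: (log_start[s] + log_emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        step = {}
        for s in states:
            # Pick the best predecessor state for s.
            score, path = max(
                ((best[p][0] + log_trans[p][s], best[p][1]) for p in states),
                key=lambda pair: pair[0],
            )
            step[s] = (score + log_emit[s][o], path + [s])
        best = step
    return max(best.values(), key=lambda pair: pair[0])


def to_log(table):
    """Convert a dict of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in table.items()}


# Hypothetical two-state model: silence vs. speech, observed as
# coarse "low"/"high" energy labels instead of real cepstral frames.
states = ("sil", "speech")
log_start = to_log({"sil": 0.8, "speech": 0.2})
log_trans = {
    "sil": to_log({"sil": 0.7, "speech": 0.3}),
    "speech": to_log({"sil": 0.2, "speech": 0.8}),
}
log_emit = {
    "sil": to_log({"low": 0.9, "high": 0.1}),
    "speech": to_log({"low": 0.2, "high": 0.8}),
}

score, path = viterbi(["low", "high", "high", "low"],
                      states, log_start, log_trans, log_emit)
print(path)
```

With these made-up numbers the best path comes out as sil, speech, speech, sil. Note that the search works out which state sequence explains the observations best, which is the "make things match the HMMs" point above.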
One of the best overviews of how Sphinx works is the first 48 pages of David Huggins-Daines's Ph.D. thesis. Don't let that put you off; it's very approachable.
http://www.lti.cs.cmu.edu/research/thesis/2011/david_hugginsdaines.pdf
If you want to dive deep, create language models, etc., the aptly named "The Hieroglyphs" is worth a look.
http://www-2.cs.cmu.edu/~archan/documentation/sphinxDocDraft3.pdf
There's also the CMUSphinx Wiki
http://cmusphinx.sourceforge.net/wiki/start
Good luck.
Here is an updated link to David Huggins-Daines's Ph.D. thesis.
http://www.lti.cs.cmu.edu/sites/default/files/research/thesis/2011/david_huggins-daines_an_architecture_for_scalable_universal_speech_recognition.pdf
Last edit: Roslyn Debacker 2017-07-21
Another very readable document for understanding more about Sphinx is the Ph.D. thesis
of Mosur Ravishankar, "Efficient Algorithms for Speech Recognition"
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.72.3560
And a very good overview of HMMs in speech recognition is provided in Rabiner's classic
paper, "A Tutorial on Hidden Markov Models and Selected Applications in
Speech Recognition"
http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf