Hey
qq - in which file of the acoustic model do we store the MFCC mapping for a particular phoneme with its probability of occurrence?
Say I would like to recognize the word "LAST". There has to be a mapping somewhere from the MFCC feature vector to the first phoneme "L" that might be used as the starting point in the whole recognition process.
And a wider question: is there any detailed description of the acoustic model files?
Regards
Oleg
There is no such thing.
In speech recognition we evaluate the MFCC feature vector against the precomputed probability distribution of an HMM state. This distribution is estimated from the training database, based on the MFCC vectors observed for the state. So there is no mapping from MFCC to state, but there is a mapping from state to probability distribution. This mapping is described in the mdef file: each line maps a phonetic context (like L after SIL and before AH) to three probability distributions corresponding to the 3 HMM states. The probability distribution parameters are stored in the means and variances files. Every distribution is identified by an id in the mdef file, and the same id is used as the index into the means file, which is essentially a multidimensional array.
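Scoring a feature vector against a state's distribution can be sketched like this (a minimal single-Gaussian illustration; real models use mixtures of Gaussians, and the `means`/`variances` arrays here are random toy data standing in for the real model files, with the row index playing the role of the id from the mdef file):

```python
import numpy as np

# Toy stand-ins for the means/variances files: one row per HMM state.
rng = np.random.default_rng(0)
n_states, n_dims = 4, 13          # 13-dimensional MFCC, for illustration
means = rng.normal(size=(n_states, n_dims))
variances = np.full((n_states, n_dims), 0.5)

def state_log_likelihood(mfcc_vec, state_id):
    """Log-likelihood of one MFCC frame under the diagonal Gaussian
    of the given HMM state (single Gaussian for simplicity)."""
    mu = means[state_id]
    var = variances[state_id]
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (mfcc_vec - mu) ** 2 / var)

frame = rng.normal(size=n_dims)   # one observed MFCC vector
scores = [state_log_likelihood(frame, s) for s in range(n_states)]
best = int(np.argmax(scores))
print(best, scores[best])
```

In a real decoder, per-state scores computed along these lines feed into the search over HMM state sequences rather than being used in isolation.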
You can find some information on the wiki; I suggest you read:
http://cmusphinx.sourceforge.net/wiki/tutorialconcepts
http://cmusphinx.sourceforge.net/wiki/acousticmodelformat
For a quick tutorial on recognition you can read chapter 4 of the HTKBook; if you are looking for a more in-depth picture, a textbook on speech recognition makes more sense. For example you can read http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165
There are also a few online courses which were discussed on this forum before.
Thanks, Nickolay.
Where does the "precomputed probability distribution of HMM state" come from, and how?
Last edit: Oleg Chervonogradsky 2014-05-17
Distributions are estimated during acoustic model training.
Thanks. So, during recognition, once we receive an MFCC vector, a distribution is computed from it, and then this distribution is searched for and mapped in the HMM state mapping table (mdef)?
No. During recognition, all possible word sequences are first constructed. Then those word sequences are converted to HMM state sequences using the mdef file. Then every HMM state sequence is scored against the sequence of MFCC vectors: each state is scored against each MFCC vector to get a state-to-MFCC alignment probability, and the total probability is computed. The sequence with the maximum probability is returned as the result.
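The word-to-state expansion step can be sketched like this (a toy illustration: the pronunciation and the 3-states-per-phone topology are hard-coded here, whereas a real decoder reads them from the pronunciation dictionary and the mdef file, and also uses context-dependent phones):

```python
# Hypothetical toy data: a real system reads the phones from the
# dictionary file and the phone-to-state ids from the mdef file.
DICTIONARY = {"LAST": ["L", "AE", "S", "T"]}   # pronunciation dictionary entry
STATES_PER_PHONE = 3                            # typical HMM topology

def word_to_state_sequence(word):
    """Expand a word into its flat HMM state sequence:
    each phone becomes STATES_PER_PHONE consecutive states."""
    states = []
    for phone in DICTIONARY[word]:
        for i in range(STATES_PER_PHONE):
            states.append(f"{phone}/{i}")
    return states

print(word_to_state_sequence("LAST"))
# 12 states: L/0, L/1, L/2, AE/0, ..., T/2
```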
Okay, once we have an MFCC vector as input to recognize, it is scored against each state (which was computed during acoustic modeling). How does this scoring happen?
Is that just another computation of a distribution based on the MFCC vector we received, and then searching for the best match among the 3 HMM states?
Okay, let me put the question another way: do we have detailed documentation on the recognition process?
Yes, I linked the textbook in the answer above.
I was more interested in pocketsphinx, but thanks anyway.
Hi Nickolay,
Do we have to split the utterance by silence? If the input speech is continuous, say 20 seconds long, what shall we do? Split anywhere in the speech, or buffer the whole speech segment?
Hello James
To ask a new question, please start a new thread. Don't hijack an unrelated discussion.
Thanks, it is getting clearer now, but I still have some questions.
How does this scoring happen?
If that is too long to explain, please point me to the source code function and I'll try to dig into it myself.
Please read at least
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.7829&rep=rep1&type=pdf
It's only a few pages. I don't think it's productive to read the code without understanding the theory; the code is pretty complex. In sphinx4, for example, the search is implemented in SimpleSearchManager.
There are simpler HMM decoder implementations which might be easier to read, but they can lack critical features. You can also check http://en.wikipedia.org/wiki/Viterbi_algorithm
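A bare-bones version of that Viterbi search can be sketched as follows (a toy 2-state example with hand-picked probabilities; in a real decoder the emission scores come from the acoustic model's Gaussian distributions and the search space is vastly larger):

```python
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """Most probable state path. log_emit[t, s]: emission log-score of
    frame t under state s; log_trans[i, j]: log P(state j | state i)."""
    T, S = log_emit.shape
    delta = log_init + log_emit[0]           # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)       # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_trans    # shape (prev_state, state)
        back[t] = np.argmax(cand, axis=0)
        delta = cand[back[t], np.arange(S)] + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):            # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy example: the first two frames favor state 0, the last two state 1.
log_emit = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]))
log_trans = np.log(np.array([[0.7, 0.3], [0.1, 0.9]]))
log_init = np.log(np.array([0.9, 0.1]))
best_path = viterbi(log_emit, log_trans, log_init)
print(best_path)   # → [0, 0, 1, 1]
```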
And in pocketsphinx, the search is implemented in components specific to each search module. For the FSG module, the search space is constructed in fsg_lextree.c. The search itself is performed in fsg_search.c, but the individual HMM states are advanced in hmm.c.
thanks, one more quick question
acoustic model generation itself has the following process then:
0. calculating the MFCC vector for each frame, and storing it
1. identification of utterances by finding frames with minimum power, aligned to the number of words specified in the transcription file
2. working on one utterance means:
- select the frames for the utterance
- in this set of frames, using the MFCC vectors, find the transition vector(s) from one phoneme of the word to another, by means of what? (is this comparing?)
- using those transition MFCC vectors, break the utterance set into phoneme-specific MFCC vector sets; for example, BIG will have 3 sets of MFCC vectors, one for each of the phonemes B, I, G. Each phoneme is still an MFCC vector set based on frames, with the window shifting by 10 ms each time
- based on this phoneme-related MFCC vector set, we calculate a probability distribution, mapping the state of a phoneme, let's say "B", to the distribution of the MFCC vector set that represents its state
- calculate a transition distribution based on those MFCC vectors used to distinguish one phoneme from another; in the example of the word BIG, that is the set identifying the transition from B to I
- store the distribution parameters by mapping them to the specified state and to the specified transition
So, we have two distributions:
- state to MFCC
- transition-of-state to MFCC
Am I right?
Training is well covered in Rabiner's HMM tutorial; it's worth reading if you are too lazy to read the whole book.
There is no such step. Training is performed on a database which is already split into utterances, and for every utterance there is a transcription. You can find an example of such a database in the tutorial http://cmusphinx.sourceforge.net/wiki/tutorialam
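For illustration, a transcription file in such a database pairs each utterance's words with its utterance id, roughly like this (the ids and text here are made up):

```
<s> hello world </s> (speaker1_001)
<s> this is a test </s> (speaker1_002)
```

Each id corresponds to an audio file listed in the fileids file, so the trainer knows which words were spoken in which recording.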
Training is an iterative process where you perform multiple iterations to find the best model. You start from an approximation of the model, then make it more and more precise.
Training operates on HMM states, which are smaller units than phonemes. Every phoneme is expanded to 3 or 5 states. Phonemes are not accounted for in training at all, only states are.
On every iteration we look for an alignment of states to frames, and this alignment is not exact but probabilistic: there is a certain probability that each frame belongs to each state.
Once such an alignment is calculated, the model parameters are updated using it.
Then the next iteration is performed. Eventually the iterations converge to an optimal model.
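The soft-alignment update described above can be illustrated with a deliberately stripped-down example (this ignores the transition structure entirely, so it reduces to an EM update for a 2-state mixture of 1-D Gaussians with fixed variance; the frames and initial means are made-up toy data):

```python
import numpy as np

frames = np.array([0.1, 0.2, 0.0, 2.9, 3.1, 3.0])   # fake 1-D "MFCC" frames
means = np.array([1.0, 2.0])                          # rough initial model
var = 1.0                                             # fixed variance

for _ in range(20):
    # E-step: soft alignment -- probability that each frame belongs to each state
    log_lik = -0.5 * (frames[:, None] - means[None, :]) ** 2 / var
    resp = np.exp(log_lik)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate state means using the soft alignment as weights
    means = (resp * frames[:, None]).sum(axis=0) / resp.sum(axis=0)

print(means)   # converges near the two frame clusters (~0.1 and ~3.0)
```

Each pass refines both the alignment and the model, which is the same converge-by-iteration behavior described above, just without the HMM machinery.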
This thinking is in the right direction, but unfortunately it lacks important vocabulary and important concepts. That's why it's better to read something in detail first.