
Entry point in search

2014-05-17
2014-05-30
  • Oleg Chervonogradsky

    Hey

    Quick question: in which file of the acoustic model do we store the MFCC mapping for a particular phoneme, along with its probability of occurrence?

    Say I would like to recognize the word "LAST": there has to be, somewhere, a mapping from an MFCC feature vector to the first phoneme "L" that could be used as the starting letter of the whole recognition process.

    And a wider question: is there a detailed description of the acoustic model files anywhere?

    Regards
    Oleg

  • Nickolay V. Shmyrev

    In which file of the acoustic model do we store the MFCC mapping for a particular phoneme, along with its probability of occurrence?

    There is no such thing.

    Say I would like to recognize the word "LAST": there has to be, somewhere, a mapping from an MFCC feature vector to the first phoneme "L" that could be used as the starting letter of the whole recognition process.

    In speech recognition we evaluate an MFCC feature vector against the precomputed probability distribution of an HMM state. This distribution is estimated from the training database, based on the MFCC vectors observed for the state. So there is no mapping from MFCC to state, but there is a mapping from state to probability distribution.

    This mapping is described in the mdef file. Each line contains a mapping from a phonetic context (like L after SIL and before AH) to three probability distributions corresponding to the 3 HMM states. The probability distribution parameters are stored in the means and variances files. Every distribution is identified by an id in the mdef file, and the same id is used in the means file, which is essentially a multidimensional array.
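
    To make the lookup-and-score idea concrete, here is a minimal sketch in Python. The mdef, means and variances dictionaries below are toy stand-ins for the real files, and real models use mixtures of Gaussians rather than the single diagonal Gaussian per state assumed here:

        import math

        # Toy stand-in for the mdef mapping: a phonetic context
        # (phone, left, right) maps to the ids of its 3 HMM state distributions.
        mdef = {("L", "SIL", "AH"): [0, 1, 2]}

        # Toy stand-ins for the means/variances files, indexed by the same ids.
        means = {0: [1.0, -0.5], 1: [0.3, 0.8], 2: [-1.2, 0.1]}
        variances = {0: [1.0, 1.0], 1: [0.5, 2.0], 2: [1.5, 0.7]}

        def log_likelihood(state_id, mfcc):
            # Log probability of an MFCC vector under the state's diagonal Gaussian.
            ll = 0.0
            for x, m, v in zip(mfcc, means[state_id], variances[state_id]):
                ll += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
            return ll

        # Score one (truncated, 2-dimensional) feature vector against each
        # state of the triphone L(SIL, AH).
        frame = [0.9, -0.4]
        for sid in mdef[("L", "SIL", "AH")]:
            print(sid, log_likelihood(sid, frame))

    Note that the score is the likelihood of the frame given the state, not a classification of the frame into a phoneme.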

    And a wider question: is there a detailed description of the acoustic model files anywhere?

    You can find some information on the wiki. I suggest you read:

    http://cmusphinx.sourceforge.net/wiki/tutorialconcepts

    http://cmusphinx.sourceforge.net/wiki/acousticmodelformat

    For a quick tutorial on recognition you can read chapter 4 of the HTK Book. If you are looking for a more in-depth picture, a textbook on speech recognition makes more sense; for example, you can read http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165

    There are also a few online courses which were discussed on this forum before.

    • Oleg Chervonogradsky

      Thanks, Nickolay.

      In speech recognition we evaluate an MFCC feature vector against the precomputed probability distribution of an HMM state. This distribution is estimated from the training database, based on the MFCC vectors observed for the state.

      Where/how does the "precomputed probability distribution of an HMM state" come from?


      Last edit: Oleg Chervonogradsky 2014-05-17
      • Nickolay V. Shmyrev

        Where/how does the "precomputed probability distribution of an HMM state" come from?

        Distributions are estimated during acoustic model training.

        • Oleg Chervonogradsky

          Thanks. So, during recognition, once we receive an MFCC vector, is a distribution computed from it, and is this distribution then looked up and mapped in the HMM state mapping table (mdef)?

  • Nickolay V. Shmyrev

    Thanks. So, during recognition, once we receive an MFCC vector, is a distribution computed from it, and is this distribution then looked up and mapped in the HMM state mapping table (mdef)?

    No. During recognition all possible word sequences are constructed first. Then those word sequences are converted to HMM state sequences with the mdef file. Then every HMM state sequence is scored against the sequence of MFCC vectors: each state is scored against each MFCC vector to get a state-to-MFCC alignment probability, and the total probability is computed. The sequence with the maximum probability is returned as the result.
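
    As a rough illustration of how one state sequence is scored against a frame sequence, here is a minimal Viterbi sketch in Python for a left-to-right HMM. The emit and trans functions are hypothetical placeholders: in a real decoder the emission score comes from the Gaussian distributions in the means/variances files and the transition score comes from the transition matrices:

        import math

        def viterbi_score(frames, n_states, emit, trans):
            # Best log-probability of aligning `frames` to a left-to-right
            # chain of `n_states` states, starting in state 0 and ending
            # in the last state.
            best = [-math.inf] * n_states
            best[0] = emit(0, frames[0])
            for frame in frames[1:]:
                new = [-math.inf] * n_states
                for j in range(n_states):
                    stay = best[j] + trans(j, j)
                    move = best[j - 1] + trans(j - 1, j) if j > 0 else -math.inf
                    new[j] = max(stay, move) + emit(j, frame)
                best = new
            return best[-1]

        # Toy usage with made-up scores: 3 states, 5 one-dimensional "frames".
        emit = lambda state, frame: -abs(frame - state)   # stand-in for Gaussian scoring
        trans = lambda i, j: math.log(0.5)                # uniform stay/advance
        print(viterbi_score([0.1, 0.3, 1.1, 1.9, 2.2], 3, emit, trans))

    A real decoder evaluates this kind of recursion over the whole network of word hypotheses at once and returns the word sequence whose state sequence scores best.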

    • Oleg Chervonogradsky

      Each state is scored against each MFCC vector to get a state-to-MFCC alignment probability, and the total probability is computed.

      Okay, so once we have an MFCC vector in the input to recognize, it is scored against each state (which was computed during acoustic modeling). How does this scoring happen?
      Is that just another computation of a distribution based on the MFCC vector we received, and then a search for the best match among the 3 HMM states?

    • Oleg Chervonogradsky

      Okay, let me put the question another way: do we have detailed documentation on the recognition process?

      • Nickolay V. Shmyrev

        Yes, I linked the textbook in the answer above.

        • Oleg Chervonogradsky

          I was more interested in pocketsphinx, but thanks anyway.

    • James Young

      2014-05-22

      Hi Nickolay,

      Do we have to split the utterance by silence? If the input speech is continuous, say 20 seconds long, what shall we do: split anywhere in the speech, or buffer the whole speech segment?

      • Nickolay V. Shmyrev

        Hello James

        To ask a new question please start a new thread. Don't hijack an unrelated discussion.

    • Oleg Chervonogradsky

      Thanks, it is getting clearer now, but still a question:

      Each state is scored against each MFCC vector to get a state-to-MFCC alignment probability, and the total probability is computed.

      How does this scoring happen?
      If that is too long to explain, please point me to the source code function and I'll try to dig into it myself.

  • Nickolay V. Shmyrev

    Please read at least

    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.17.7829&rep=rep1&type=pdf

    It's only a few pages. I don't think it's productive to read the code without understanding the theory; the code is pretty complex. In sphinx4, for example, the search is implemented in SimpleSearchManager.

    There are simpler HMM decoder implementations which might be easier to read, but they can lack critical features. For example, you can also check http://en.wikipedia.org/wiki/Viterbi_algorithm

    • Nickolay V. Shmyrev

      And in pocketsphinx the search is implemented in components specific to each search module. For the FSG module the search space is constructed in fsg_lextree.c. The search itself is performed in fsg_search.c, and the individual HMM states are advanced in hmm.c.

    • Oleg Chervonogradsky

      Thanks, so far a quick question.
      Acoustic model generation itself then has the following process:
      0. Calculate the MFCC vector for each frame and store it.
      1. Identify utterances by finding frames with minimum power, aligned to the number of words specified in the transcription file.
      2. Work on one utterance, which means:
      - select the frames for the utterance
      - in this set of frames, using the MFCC vectors, find the transition vector(s) from one phoneme in the word to another by means of what ...? (is this comparing?)
      - using those transition MFCC vectors, break the utterance set into phoneme-related MFCC vector sets; for example, BIG will have 3 sets of MFCC vectors, one for each of the phonemes B I G. Each phoneme is still a set of MFCC vectors based on frames, with the cursor shifted by 10 ms each time.
      - based on this phoneme-related MFCC vector set, calculate a probability distribution, mapping the state of a phoneme, let's say "B", to the distribution of the MFCC vector set that represents its state
      - calculate a transition distribution based on those MFCC vectors used to distinguish one phoneme from another; in the example of the word BIG, that is the set that identifies the transition from B to I
      - store the distribution parameters by mapping them to the specified state and the specified transition

      So we have two distributions:
      - state to MFCC
      - transition-of-state to MFCC

      Am I right?

      • Nickolay V. Shmyrev

        Training is well covered in Rabiner's HMM tutorial; it's worth reading if you don't want to go through the whole book.

        Identify utterances by finding frames with minimum power, aligned to the number of words specified in the transcription file

        There is no such step. Training is performed on a database which is already split into utterances, and for every utterance there is a transcription. You can find an example of such a database in the tutorial http://cmusphinx.sourceforge.net/wiki/tutorialam

        Work on one utterance, which means

        Training is an iterative process where you perform multiple iterations to find the best model. You start from an approximation of the model and then make it more and more precise.

        in this set of frames, using the MFCC vectors, find the transition vector(s) from one phoneme in the word to another by means of what ...? (is this comparing?)

        The training operates on HMM states, which are smaller than phonemes. Every phoneme is expanded to 3 or 5 states. Phonemes are not accounted for in training at all, only states are.
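
        For example, a hypothetical 3-states-per-phone expansion (ignoring context dependency) could look like this:

            def expand_to_states(phones, states_per_phone=3):
                # Expand a phone sequence into a flat left-to-right state sequence.
                return [f"{p}[{i}]" for p in phones for i in range(states_per_phone)]

            print(expand_to_states(["B", "IH", "G"]))
            # ['B[0]', 'B[1]', 'B[2]', 'IH[0]', 'IH[1]', 'IH[2]', 'G[0]', 'G[1]', 'G[2]']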

        On every iteration we compute an alignment of states to frames, and this alignment is not exact but probabilistic: there is a certain probability that a given frame belongs to a given state.

        Once such an alignment is calculated, the model parameters are updated using it.

        Then the next iteration is performed. Finally the iterations converge to an optimal model.
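
        Here is a toy sketch of that align-then-update loop in Python, with one-dimensional "features", two states with a single Gaussian each, and fixed transition probabilities. It is only meant to show the shape of the iteration, not what SphinxTrain actually does:

            import math

            def gauss_pdf(x, mean, var):
                return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

            def soft_align(frames, means, vrs, trans):
                # Forward-backward: posterior probability gamma[t][j] that
                # frame t belongs to state j (probabilistic, not exact).
                n, T = len(means), len(frames)
                b = [[gauss_pdf(x, means[j], vrs[j]) for j in range(n)] for x in frames]
                alpha = [[b[0][j] / n for j in range(n)]]
                for t in range(1, T):
                    alpha.append([b[t][j] * sum(alpha[t-1][i] * trans[i][j] for i in range(n))
                                  for j in range(n)])
                beta = [[1.0] * n for _ in range(T)]
                for t in range(T - 2, -1, -1):
                    beta[t] = [sum(trans[i][j] * b[t+1][j] * beta[t+1][j] for j in range(n))
                               for i in range(n)]
                gamma = []
                for t in range(T):
                    w = [alpha[t][j] * beta[t][j] for j in range(n)]
                    s = sum(w)
                    gamma.append([x / s for x in w])
                return gamma

            def reestimate(frames, gamma, n):
                # Update each state's Gaussian from the soft alignment.
                means, vrs = [], []
                for j in range(n):
                    occ = sum(g[j] for g in gamma)
                    m = sum(g[j] * x for g, x in zip(gamma, frames)) / occ
                    v = sum(g[j] * (x - m) ** 2 for g, x in zip(gamma, frames)) / occ
                    means.append(m)
                    vrs.append(max(v, 1e-3))  # floor the variance
                return means, vrs

            frames = [0.1, 0.2, 0.0, 1.9, 2.1, 2.0]
            means, vrs = [0.5, 1.5], [1.0, 1.0]   # rough initial model
            trans = [[0.7, 0.3], [0.3, 0.7]]
            for _ in range(5):                    # iterations converge
                gamma = soft_align(frames, means, vrs, trans)
                means, vrs = reestimate(frames, gamma, 2)
            print(means, vrs)

        Printing the final parameters shows the two Gaussians moving from the rough initial guesses toward the two clusters in the data, which is the convergence described above.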

        • using those transition MFCC vectors, break the utterance set into phoneme-related MFCC vector sets; for example, BIG will have 3 sets of MFCC vectors, one for each of the phonemes B I G. Each phoneme is still a set of MFCC vectors based on frames, with the cursor shifted by 10 ms each time.
        • based on this phoneme-related MFCC vector set, calculate a probability distribution, mapping the state of a phoneme, let's say "B", to the distribution of the MFCC vector set that represents its state
        • calculate a transition distribution based on those MFCC vectors used to distinguish one phoneme from another; in the example of the word BIG, that is the set that identifies the transition from B to I
        • store the distribution parameters by mapping them to the specified state and the specified transition

        This thinking is in the right direction, but unfortunately it lacks important vocabulary and important concepts. That's why it's better to read something in detail first.

