
Logical steps of Continuous speech recognition

Vamsi
2014-07-16
2014-08-07
  • Vamsi

    Vamsi - 2014-07-16

    I am trying to understand logical steps for Continuous speech recognition. Could someone please validate if my understanding is right?
    1. The spoken voice sample is converted into strings of connected word combinations by leveraging the acoustic model and the lexicon.
    2. Pattern matching of subword HMMs is leveraged to create candidate connected word strings.
    3. The connected word combinations are hypotheses for the spoken input.
    4. Multiple possible hypotheses are stored in a lattice structure.
    5. The language model and dynamic programming are leveraged to prune the lattice structure and select the most probable candidate.

     
  • Nickolay V. Shmyrev

    if my understanding is right?

    Your understanding is not 100% correct. For a correct description of the speech recognition process you can read any textbook, or at least our tutorial:

    http://cmusphinx.sourceforge.net/wiki/tutorialconcepts

     
  • Vamsi

    Vamsi - 2014-07-17

    Thanks Nickolay!

    I have been reading 'Fundamentals of Speech Recognition' by Rabiner/Juang/Yegnanarayana and was trying to map what I grasped from the textbook to the results I saw from running the Sphinx4 examples.

    I have now read the tutorial you referred to.

    While I'll anyway spend more time reading the book, I wanted to know whether the LM is leveraged even to create word-level matches, because my current understanding is that it is used only during the sentence-level matching phase of recognition.

     
  • Nickolay V. Shmyrev

    I wanted to know whether the LM is leveraged even to create word-level matches

    Yes, the LM is used to construct the original HMM search graph.

    my current understanding is that it is used only during the sentence-level matching phase of recognition

    There is no sentence level or word level: the first-stage search is performed with an expanded graph which combines information from the LM, the dictionary, and the acoustic model together. There can be a later lattice rescoring stage, but it is not used in Sphinx4 by default.
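    To make the idea of a single expanded graph concrete, here is a toy sketch. The dictionary, the bigram table, and the state naming are invented for illustration only (this is not the Sphinx4 API); it just shows how the dictionary's phone sequences and LM-weighted inter-word arcs can be flattened into one graph before any search happens:

    ```python
    import math

    # Hypothetical dictionary: word -> phone sequence
    DICTIONARY = {"one": ["W", "AH", "N"], "two": ["T", "UW"]}

    # Hypothetical bigram LM: (previous word, word) -> probability
    LM = {("one", "two"): 0.7, ("one", "one"): 0.3,
          ("two", "one"): 0.5, ("two", "two"): 0.5}

    def build_search_graph():
        """Return one flat arc list combining dictionary and LM.

        States are (word, phone_index) pairs; in a real decoder each
        phone would itself expand into acoustic-model HMM states.
        """
        arcs = []  # (source_state, destination_state, log_weight)
        # Intra-word arcs: chain each word's phones together.
        for word, phones in DICTIONARY.items():
            for i in range(len(phones) - 1):
                arcs.append(((word, i), (word, i + 1), 0.0))
        # Inter-word arcs: last phone of `prev` to first phone of
        # `word`, weighted by the bigram LM log-probability.
        for (prev, word), p in LM.items():
            arcs.append(((prev, len(DICTIONARY[prev]) - 1),
                         (word, 0), math.log(p)))
        return arcs

    graph = build_search_graph()
    ```

    The search then runs over `graph` directly, so there is no separate "word level" and "sentence level" pass: LM weights are already baked into the inter-word arcs.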

     
  • Vamsi

    Vamsi - 2014-08-06

    I have been reading the Token Passing algorithm paper by S. J. Young to understand how information from the AM, LM, and dictionary is used by the decoder to recognize the spoken text. I have tried to integrate what I read there with what I know about acoustic modelling and language modelling.

    Here are a few further questions I have.

    1. Am I right in understanding that the token cost is the 'minimal cost of aligning the test audio with the dynamically generated SentenceHMM'?
    2. Is it the case that the token cost (for intra-word transitions) is a function of the state transition probability and the emission probability values for the HMMs trained during AM?
    3. Are Language Model probabilities used to compute the token cost for inter-word transitions?
     
  • Nickolay V. Shmyrev

    Am I right in understanding that the token cost is the 'minimal cost of aligning the test audio with the dynamically generated SentenceHMM'?

    The token cost is not the minimal cost; it's the cost of one of the possible alignments. The best token cost is the minimal cost.

    2. Is it the case that the token cost (for intra-word transitions) is a function of the state transition probability and the emission probability values for the HMMs trained during AM?

    The token cost also contains the language model probability.

    3. Are Language Model probabilities used to compute the token cost for inter-word transitions?

    Yes
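    The three answers above fit together in a small sketch. This is a hypothetical illustration (the `Token` class and the probability values are invented, not taken from Sphinx4 or the Young paper): each token accumulates the cost of one possible alignment, intra-word steps add HMM transition and emission costs, inter-word steps additionally add the LM cost, and the best token is simply the one with the minimal accumulated cost:

    ```python
    import math

    class Token:
        """One token = the accumulated cost of ONE possible alignment."""
        def __init__(self, cost=0.0, words=()):
            self.cost = cost    # accumulated negative log-probability
            self.words = words  # word history of this alignment

    def intra_word_step(tok, trans_p, emit_p):
        """Inside a word: add HMM transition and emission costs only."""
        return Token(tok.cost - math.log(trans_p) - math.log(emit_p),
                     tok.words)

    def inter_word_step(tok, word, lm_p):
        """Crossing a word boundary: the LM cost is added as well."""
        return Token(tok.cost - math.log(lm_p), tok.words + (word,))

    # Two competing tokens after one frame and one word boundary.
    # Neither cost is "the" minimal cost; each is the cost of one
    # particular alignment through the search graph.
    a = inter_word_step(intra_word_step(Token(), 0.9, 0.5), "one", 0.6)
    b = inter_word_step(intra_word_step(Token(), 0.9, 0.2), "two", 0.4)

    # The best token is the one with the minimal accumulated cost.
    best = min((a, b), key=lambda t: t.cost)
    ```

    In a real decoder many tokens per frame are propagated and pruned, but the cost bookkeeping is the same: acoustic terms at every step, LM terms at word transitions.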

     
  • Vamsi

    Vamsi - 2014-08-07

    Thank you Nickolay! Your response has strengthened my understanding.

     
