
Using Visemes to enhance accuracy of recognition

2016-02-16
2016-05-06
  • Bharat Mallapur

    Bharat Mallapur - 2016-02-16

    Hello,

    I am currently working on a voice-recognition-based app which will be used
    by children (ages 5-10).
    The children will read the text on the screen either letter by letter
    (phonetic) or word-by-word.

    An example sentence might be : "The Chicken crossed the street."

    My app's job is to check if the child has recognized the letter (or in the
    case of older children, the word) accurately, and if so, go on to the next
    letter/word.
    If not, the child will be "helped" in recognizing it in multiple ways.

    Now my queries:

    1. Considering that the text being read is known in advance to us (but
      obviously not the actual spoken word/letter), could this be used to predict
      the spoken text better?
      Note that I do not aim for semantic recognition, just plain speech
      recognition to match reference text vs. spoken.

    2.
    a) Would using visemes along with speech recognition increase the
    accuracy of the recognition even further, or make it more confident?
    b) Would this help to resolve consonant confusions?

    3. I was planning to use the OpenCV library to recognize the visemes being
      pronounced.
      a) Is there any known way to use the viseme information to help the
      speech recognition make more informed decisions?
      b) If so, could you point me to it?

    4. The research paper Large Vocabulary Automatic Speech Recognition for
    Children (
    http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44268.pdf
    )
    states:
    "Speech recognition for adults has improved significantly over the last
    few years; however less progress has been made in recognizing speech
    produced by children well [1, 2]. Many factors make recognizing
    children’s speech challenging. As children learn to speak, their ability to
    accurately realize speech sounds properly changes [3, 4]. Spectrally,
    children’s smaller vocal tracts lead to higher fundamental and formant
    frequencies. Children’s overall speaking rate is slower, and they have more
    variability in speaking rate, vocal effort, and spontaneity of speech [5].
    Linguistically, children are more likely to use “imaginative words,
    ungrammatical phrases and incorrect pronunciations” [6]. By training
    directly on children’s speech it was shown that this mismatch in
    performance can be reduced on a digit recognition task, although accuracy
    is still worse than on adults [7]. All these aspects evolve rapidly as
    children grow [2]."

    My question: Considering the above conditions set in my app (fixed text
    input, voice model base training with say 100 children and continuous
    addition of new data to the existing model), could an accuracy of around
    85% be achieved? Or is there something intrinsic that makes children's
    speech a lot more difficult to recognize?

    Thanks in advance,
    Bharat

     
  • Nickolay V. Shmyrev

    Considering that the text being read is known in advance to us (but
    obviously not the actual spoken word/letter), could this be used to predict
    the spoken text better?

    Well, your task is not recognition but verification and the text is not known in advance.

    Would using visemes along with speech recognition be used to increase
    accuracy of the recognition even further / more confidently?
    b) Would this help to resolve consonant confusions?

    Due to lighting and face-positioning problems, I doubt visemes would be of any help.

    would the accuracy level attained be achieved around 85%.

    Verification is not characterized by accuracy.

     
  • Bharat Mallapur

    Bharat Mallapur - 2016-02-16

    Thanks for the clarifications. Speech Verification IS what I'm interested in performing (at least as far as Speech is concerned)... I just didn't know that there is an actual term for that! :)

    But I am wondering what you meant when you said "Well, your task is not recognition but verification and the text is not known in advance". In my case, the text would be known in advance since the kid would be reading from the text in question. Could you clarify?

    Also, thanks to your mentioning speech verification, I found some links similar to what I'm trying to achieve. Posting them here for other forum users. Of course, they look quite dated in their approach (for example using noise cancelling microphones etc) but still quite interesting to see how the thought process was a decade or more ago!

    SPEECH TECHNOLOGY IN COMPUTER-AIDED LANGUAGE LEARNING: STRENGTHS AND LIMITATIONS OF A NEW CALL PARADIGM http://llt.msu.edu/vol2num1/article3/index.html

    USING AUTOMATIC SPEECH PROCESSING FOR FOREIGN LANGUAGE PRONUNCIATION TUTORING: SOME ISSUES AND A PROTOTYPE http://llt.msu.edu/vol2num2/article3/index.html

    I agree that good lighting and face positioning seem difficult to achieve, especially with kids!

    Also, I am still not sure what you meant by "Verification is not characterized by accuracy". Could you explain in a bit more detail? Sorry if I'm not on the ball here, but I am a newbie to speech recognition.

    Thanks again for replying so helpfully!

     

    Last edit: Bharat Mallapur 2016-02-17
    • Nickolay V. Shmyrev

      In my case, the text would be known in advance since the kid would be reading from the text in question. Could you clarify?

      The kid will say many other things, so you do not really know what he will say, and you should not assume that you do. This problem is well covered in research on verification; you can just read the papers.
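
      To illustrate why the spoken words cannot be assumed to equal the reference text, here is a minimal sketch that aligns a recognizer's text output against the reference sentence and marks which reference words were actually found. It is plain text alignment with Python's difflib, not a method taken from the verification literature; the function names and the sample hypothesis string are made up for illustration, and it assumes the recognizer already returns a word string.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and strip simple punctuation so words compare cleanly."""
    return [w.strip('.,!?"').lower() for w in text.split()]


def matched_reference_words(reference, hypothesis):
    """Return (word, found) pairs for each word of the reference text.

    Extra words in the hypothesis (fillers, restarts) are ignored;
    reference words with no aligned match are marked False.
    """
    ref, hyp = normalize(reference), normalize(hypothesis)
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    found = [False] * len(ref)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            found[i] = True
    return list(zip(ref, found))


# Hypothetical recognizer output with fillers and one misread word.
result = matched_reference_words(
    "The Chicken crossed the street.",
    "um the chicken crossed uh the road",
)
for word, ok in result:
    print(f"{word}: {'ok' if ok else 'missing'}")
```

      Here only "street" is reported as missing, while "um" and "uh" are simply skipped. A real system would also need per-word confidence scores, which this sketch does not model.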

      For recent research on the subject, I suggest you check
      https://www.slate2015.org/files/SLaTE2015-Proceedings.pdf

      "Verification is not characterized by accuracy". Could you explain in a bit more detail?

      There are measures like "equal error rate"; again, you can find details in the paper.
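
      As a rough illustration of the idea, the sketch below computes an equal error rate from two lists of verification scores: one for readings that really were correct and one for misreadings. The scores are invented, the function names are hypothetical, and it assumes a higher score means "more likely read correctly". The decision threshold is swept, and the point where the false-accept rate and false-reject rate are closest is reported.

```python
def error_rates(genuine, impostor, threshold):
    """Accept a reading when score >= threshold.

    Returns (false_accept_rate, false_reject_rate).
    """
    far = sum(s >= threshold for s in impostor) / len(impostor)
    frr = sum(s < threshold for s in genuine) / len(genuine)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Sweep candidate thresholds and return (eer, threshold) at the
    point where false accepts and false rejects are closest."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = error_rates(genuine, impostor, t)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Invented scores, for illustration only.
correct_readings = [0.9, 0.8, 0.75, 0.6, 0.4]
misreadings = [0.7, 0.5, 0.3, 0.2, 0.1]

eer, threshold = equal_error_rate(correct_readings, misreadings)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

      With these made-up scores, one of five correct readings is rejected and one of five misreadings is accepted at the chosen threshold, which is the trade-off a single accuracy figure would hide.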

       
