Hello,
I am currently working on a voice-recognition-based app which will be used
by children (5-10 year age group).
The children will read the text on the screen either letter by letter
(phonetic) or word by word.
An example sentence might be: "The Chicken crossed the street."
My app's job is to check if the child has recognized the letter (or in the
case of older children, the word) accurately, and if so, go on to the next
letter/word.
If not, the child will be "helped" in recognizing it in multiple ways.
Now my queries:
1. Considering that the text being read is known to us in advance (but
obviously not the actual spoken word/letter), could this be used to predict
the spoken text better?
Note that I do not aim for semantic recognition, just plain speech
recognition to match the reference text against the spoken text.
2.
a) Would using visemes along with speech recognition increase the
accuracy of the recognition even further / more confidently?
b) Would this help to resolve consonant confusions?
3. I was planning to use the OpenCV library to recognize the visemes being
pronounced.
a) Is there any known way to use the viseme information to help the
speech recognition make more informed decisions?
b) If so, could you point me to it?
4) The research paper "Large Vocabulary Automatic Speech Recognition for
Children"
( http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/44268.pdf )
mentions that
"Speech recognition for adults has improved significantly over the last
few years; however less progress has been made in recognizing speech
produced by children well [1, 2]. Many factors make recognizing
children’s speech challenging. As children learn to speak, their ability to
accurately realize speech sounds properly changes [3, 4]. Spectrally,
children’s smaller vocal tracts lead to higher fundamental and formant
frequencies. Children’s overall speaking rate is slower, and they have more
variability in speaking rate, vocal effort, and spontaneity of speech [5].
Linguistically, children are more likely to use “imaginative words,
ungrammatical phrases and incorrect pronunciations” [6]. By training
directly on children’s speech it was shown that this mismatch in
performance can be reduced on a digit recognition task, although accuracy
is still worse than on adults [7]. All these aspects evolve rapidly as
children grow [2]."
My question: Considering the above conditions set in my app (fixed text
input, base voice model trained on, say, 100 children, and continuous
addition of new data to the existing model), could an accuracy level of
around 85% be achieved? Or is there something intrinsic that makes it a
lot more difficult to recognize children's speech?
Thanks in advance,
Bharat
Considering that the text being read is known in advance to us (but
obviously not the actual spoken word/letter), could this be used to predict
the spoken text better?
Well, your task is not recognition but verification, and the text is not known in advance.
Would using visemes along with speech recognition be used to increase
accuracy of the recognition even further / more confidently?
b) Would this help to resolve consonant confusions?
Due to lighting and face-positioning problems, I doubt visemes would be of any help.
would the accuracy level attained be achieved around 85%.
Verification is not characterized by accuracy.
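To make the recognition/verification distinction a bit more concrete, here is a minimal sketch in Python. It assumes some recognizer has already produced a word-level transcript (the `hypothesis` string below is invented for illustration), and it only checks which words of the reference text show up, in order, in what was said. Real verification systems work on acoustic scores rather than on plain text, so treat this as an illustration of the idea, not as a recommended implementation. The function names are hypothetical.

```python
from difflib import SequenceMatcher
import string


def normalize(text):
    """Lowercase and strip punctuation so 'Chicken' and 'chicken.' compare equal."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def verify_reading(reference, hypothesis):
    """Return (reference_word, was_matched) pairs.

    The hypothesis may contain extra words that are not in the reference;
    those are simply skipped by the alignment.
    """
    ref = normalize(reference)
    hyp = normalize(hypothesis)
    matched = [False] * len(ref)
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    # Each matching block is a run of reference words found, in order,
    # in the hypothesis.
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            matched[i] = True
    return list(zip(ref, matched))


# Hypothetical recognizer output with fillers and off-script speech.
result = verify_reading(
    "The Chicken crossed the street.",
    "um the chicken uh crossed a street mommy look",
)
for word, ok in result:
    print(word, "OK" if ok else "MISSED")
```

In this made-up example the second "the" is flagged as missed because the child said "a" instead, while the fillers and the off-script words at the end are ignored.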
Thanks for the clarifications. Speech Verification IS what I'm interested in performing (at least as far as Speech is concerned)... I just didn't know that there is an actual term for that! :)
But I am wondering what you meant when you said "Well, your task is not recognition but verification and the text is not known in advance". In my case, the text would be known in advance since the kid would be reading from the text in question. Could you clarify?
Also, thanks to your mentioning speech verification, I found some links similar to what I'm trying to achieve. Posting them here for other forum users. Of course, they look quite dated in their approach (for example, using noise-cancelling microphones, etc.), but it is still quite interesting to see what the thought process was a decade or more ago!
SPEECH TECHNOLOGY IN COMPUTER-AIDED LANGUAGE LEARNING: STRENGTHS AND LIMITATIONS OF A NEW CALL PARADIGM http://llt.msu.edu/vol2num1/article3/index.html
USING AUTOMATIC SPEECH PROCESSING FOR FOREIGN LANGUAGE PRONUNCIATION TUTORING: SOME ISSUES AND A PROTOTYPE http://llt.msu.edu/vol2num2/article3/index.html
I agree that lighting and face positioning seem difficult to get right, esp. with kids!
Also, I am still not sure what you meant by "Verification is not characterized by accuracy". Could you explain in a bit more detail? Sorry if I'm not on the ball here, but I am a newbie to speech recognition.
Thanks again for replying so helpfully!
Last edit: Bharat Mallapur 2016-02-17
The kid will say many other things, so you do not really know what he will say, and you should not assume that you do. This problem is well covered in the research on verification; you can just read the papers.
For recent research on the subject, I suggest you check
https://www.slate2015.org/files/SLaTE2015-Proceedings.pdf
There are metrics like "equal error rate"; again, you can find the details in the papers.
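For readers new to the term, here is a rough sketch of how an equal error rate can be estimated. It assumes the verifier gives each attempt a score (higher means "more likely read correctly") and that you have labelled attempts of both kinds. The scores below are made up purely for illustration, and the function names are hypothetical. The idea is to sweep a decision threshold and find the point where the false accept rate and the false reject rate are (as close as possible to) equal.

```python
def error_rates(correct_scores, incorrect_scores, threshold):
    """Return (false_accept_rate, false_reject_rate) at a given threshold.

    An attempt is accepted when its score is >= threshold.
    """
    false_rejects = sum(1 for s in correct_scores if s < threshold)
    false_accepts = sum(1 for s in incorrect_scores if s >= threshold)
    return (false_accepts / len(incorrect_scores),
            false_rejects / len(correct_scores))


def equal_error_rate(correct_scores, incorrect_scores):
    """Sweep thresholds over all observed scores; return (eer, threshold)."""
    best = None
    for threshold in sorted(set(correct_scores) | set(incorrect_scores)):
        far, frr = error_rates(correct_scores, incorrect_scores, threshold)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # With finite data the two rates rarely cross exactly,
            # so report their average at the closest point.
            best = (gap, (far + frr) / 2, threshold)
    return best[1], best[2]


# Invented scores: attempts where the word was really read correctly...
correct = [0.9, 0.8, 0.75, 0.6, 0.4]
# ...and attempts where it was misread or something else was said.
incorrect = [0.7, 0.5, 0.3, 0.2, 0.1]

eer, threshold = equal_error_rate(correct, incorrect)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

With these toy numbers the two error rates meet at 20% when the threshold is 0.6. A single "accuracy" figure hides this trade-off, since moving the threshold lowers one kind of error at the cost of raising the other.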