I am developing a pronunciation scoring application for non-native speakers
using Sphinx 4. The main goal is to give the user feedback about their
pronunciation of single isolated words. The score should contain a rating
(correctness) for each phoneme of the word. Additional information will list
the mistakes made by the user: misspoken phonemes (missing phonemes,
substituted phonemes, and wrong phonemes added to the pronunciation).
The simplest way to achieve that is to get a phoneme transcription of the
spoken phrase (whatever was spoken) and compare it to the correct one.
However, getting an exact transcription from speech with Sphinx is very
inaccurate, and it would be very difficult to generate a grammar flexible
enough to produce a transcription that corresponds exactly to what the user
said.
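If a phoneme sequence for what was actually spoken were available (for example from a phone-level decoding pass; obtaining it from Sphinx4 is not shown and is assumed here), comparing it with the correct transcription is a standard edit-distance alignment. A minimal sketch, with all class and method names hypothetical: it labels each position as a match, a substitution, a deletion (missing phoneme) or an insertion (extra phoneme).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/**
 * Hypothetical helper: aligns a recognized phoneme sequence with the expected
 * (canonical) one using Levenshtein edit distance.
 */
public class PhonemeAligner {

    public enum Op { MATCH, SUBSTITUTION, DELETION, INSERTION }

    /** One alignment step; expected is null for insertions, recognized is null for deletions. */
    public record Step(Op op, String expected, String recognized) {
        @Override
        public String toString() {
            return op + "(" + expected + " -> " + recognized + ")";
        }
    }

    public static List<Step> align(List<String> expected, List<String> recognized) {
        int n = expected.size();
        int m = recognized.size();
        // cost[i][j] = edit distance between the first i expected and first j recognized phonemes
        int[][] cost = new int[n + 1][m + 1];
        for (int i = 0; i <= n; i++) cost[i][0] = i;
        for (int j = 0; j <= m; j++) cost[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int sub = expected.get(i - 1).equals(recognized.get(j - 1)) ? 0 : 1;
                cost[i][j] = Math.min(cost[i - 1][j - 1] + sub,
                        Math.min(cost[i - 1][j] + 1, cost[i][j - 1] + 1));
            }
        }
        // Walk back from the bottom-right cell to recover one optimal alignment.
        List<Step> steps = new ArrayList<>();
        int i = n;
        int j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0) {
                String e = expected.get(i - 1);
                String r = recognized.get(j - 1);
                int sub = e.equals(r) ? 0 : 1;
                if (cost[i][j] == cost[i - 1][j - 1] + sub) {
                    steps.add(new Step(sub == 0 ? Op.MATCH : Op.SUBSTITUTION, e, r));
                    i--;
                    j--;
                    continue;
                }
            }
            if (i > 0 && cost[i][j] == cost[i - 1][j] + 1) {
                steps.add(new Step(Op.DELETION, expected.get(i - 1), null));   // missing phoneme
                i--;
            } else {
                steps.add(new Step(Op.INSERTION, null, recognized.get(j - 1))); // extra phoneme
                j--;
            }
        }
        Collections.reverse(steps);
        return steps;
    }

    public static void main(String[] args) {
        List<String> apple = List.of("AE", "P", "AH", "L");
        List<String> spoken = List.of("AE", "P", "AH", "L", "S"); // "APPLES"
        for (Step step : align(apple, spoken)) {
            System.out.println(step);
        }
    }
}
```

For APPLE (AE P AH L) against APPLES (AE P AH L S), `main` prints four MATCH steps followed by `INSERTION(null -> S)`. When several alignments have the same cost, the backtrace prefers the diagonal step, so ties are resolved in favour of matches and substitutions.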
I am now using the WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz model from Sphinx
and a grammar that contains only the single word whose pronunciation I want
to score. The dictionary has an entry containing the correct phoneme
transcription of the given word and, optionally, a couple of entries
containing the most common mistakes made by non-native speakers, i.e.
commonly substituted phonemes.
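The dictionary strategy described above can be automated. A minimal sketch, assuming the plain-text Sphinx dictionary layout (a word followed by its phonemes, with alternate pronunciations written as WORD(2), WORD(3), and so on); the class name and the substitution table (AE to EH, AH to AA) are purely illustrative, not a claim about which errors learners actually make.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical helper: writes Sphinx-style dictionary lines for one word, the
 * canonical pronunciation first and then one alternate per single-phoneme
 * substitution found in a table of anticipated learner mistakes.
 */
public class VariantDictionary {

    public static List<String> entries(String word, List<String> canonical,
                                       Map<String, List<String>> substitutions) {
        List<String> lines = new ArrayList<>();
        lines.add(word + " " + String.join(" ", canonical));
        int variant = 2; // alternates are numbered WORD(2), WORD(3), ...
        for (int i = 0; i < canonical.size(); i++) {
            for (String replacement : substitutions.getOrDefault(canonical.get(i), List.of())) {
                List<String> phones = new ArrayList<>(canonical);
                phones.set(i, replacement); // exactly one phoneme differs from the canonical entry
                lines.add(word + "(" + variant++ + ") " + String.join(" ", phones));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Illustrative substitution table only; replace with errors observed in real learners.
        Map<String, List<String>> subs = Map.of(
                "AE", List.of("EH"),
                "AH", List.of("AA"));
        entries("APPLE", List.of("AE", "P", "AH", "L"), subs).forEach(System.out::println);
    }
}
```

Each alternate differs from the canonical entry in exactly one phoneme, so if the decoder reports which pronunciation it matched, the variant number would identify the substitution. Generating every combination of substitutions instead would grow the dictionary multiplicatively and make the variants harder to tell apart.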
Using that strategy, I have a couple of questions about how to develop the
rest of the system:
I would like to access the log-likelihood scores of all recognized phonemes of
that word. What would be the best approach to get a score for each phoneme?
Could it be the plain sum or the average of the acoustic scores given by the
Sphinx recognition results? Is it possible to get these scores even if Sphinx
doesn't recognize the given speech sample?
For example: I want to score the pronunciation of the word APPLE (AE P AH L),
and I run recognition on an utterance containing a bad pronunciation like
APPLES (AE P AH L S). In this case Sphinx keeps returning a null result
because it doesn't recognize the word from the grammar (APPLE). Is it
possible to get scores for the first four phonemes anyway and tell the user
that something is wrong with the end of the word?
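On the sum-versus-average question: a sum of frame log-likelihoods grows in magnitude with segment length, so a per-frame average (reported together with the duration in frames) is easier to compare across phonemes. A minimal sketch, assuming you already have per-frame acoustic log-likelihoods and a phone segmentation (for example from a forced alignment); how to extract these from a Sphinx4 result is not shown, the class is hypothetical, and the numbers in `main` are toy values.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: per-phone acoustic scores from frame scores and a phone segmentation. */
public class PhoneSegmentScores {

    /** A phone occupying frames [startFrame, endFrame) of the utterance. */
    public record Segment(String phone, int startFrame, int endFrame) { }

    /** Total log-likelihood, duration in frames, and the per-frame average for one phone. */
    public record PhoneScore(String phone, double totalLogLikelihood, int frames) {
        public double averageLogLikelihood() {
            return totalLogLikelihood / frames;
        }
    }

    public static List<PhoneScore> score(double[] frameLogLikelihoods, List<Segment> segments) {
        List<PhoneScore> scores = new ArrayList<>();
        for (Segment s : segments) {
            if (s.startFrame() < 0 || s.endFrame() > frameLogLikelihoods.length
                    || s.endFrame() <= s.startFrame()) {
                throw new IllegalArgumentException("invalid segment: " + s);
            }
            double sum = 0.0;
            for (int f = s.startFrame(); f < s.endFrame(); f++) {
                sum += frameLogLikelihoods[f];
            }
            scores.add(new PhoneScore(s.phone(), sum, s.endFrame() - s.startFrame()));
        }
        return scores;
    }

    public static void main(String[] args) {
        // Toy values only: eight frames of log-likelihoods and a segmentation of APPLE.
        double[] frames = {-2, -2, -4, -4, -4, -1, -3, -3};
        List<Segment> apple = List.of(
                new Segment("AE", 0, 2), new Segment("P", 2, 5),
                new Segment("AH", 5, 6), new Segment("L", 6, 8));
        for (PhoneScore p : score(frames, apple)) {
            System.out.printf("%s frames=%d total=%.1f avg=%.2f%n",
                    p.phone(), p.frames(), p.totalLogLikelihood(), p.averageLogLikelihood());
        }
    }
}
```

Here P has the lowest total only because it lasts three frames; its per-frame average (-4.0) is what singles it out as the weakest segment.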
I would be very thankful for answers to these questions. Or maybe you have a
better idea/strategy for solving this problem: scoring pronunciation with
Sphinx.
Regards,
Tomek
Anonymous
-
2010-08-20
There are 3 measures of pronunciation quality popular in many papers:
- segment log-likelihood (for each phoneme)
- segment duration in frames
- phone log-posterior probability scores
No, there is no easy way to calculate this. It requires you both to develop a
search manager that uses the phone space to match the audio and to develop
accumulators that store phone likelihoods and turn them into posteriors.
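Once such accumulators exist, turning likelihoods into the phone log-posterior score is the easy part. A minimal sketch, assuming a hypothetical matrix `frameLogLik[t][j]` holding log p(o_t | q_j) for every phone j at every frame t; with equal phone priors the frame posterior is p(o_t | q_i) / sum_j p(o_t | q_j), and the score is the average of its log over the segment's frames, which is the usual frame-averaged formulation of this measure. Producing the matrix itself is the hard part described above and is not shown.

```java
/**
 * Hypothetical helper: frame-averaged phone log-posterior score, assuming
 * equal phone priors.
 */
public class PhonePosterior {

    /** Numerically stable log(sum_j exp(x[j])). */
    static double logSumExp(double[] x) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : x) {
            max = Math.max(max, v);
        }
        if (max == Double.NEGATIVE_INFINITY) {
            return max;
        }
        double sum = 0.0;
        for (double v : x) {
            sum += Math.exp(v - max);
        }
        return max + Math.log(sum);
    }

    /**
     * frameLogLik[t][j] = log p(o_t | q_j) for every phone j in the inventory.
     * Returns (1/d) * sum over frames t in [start, end) of log P(q_target | o_t),
     * where d = end - start and P(q_target | o_t) = p(o_t | q_target) / sum_j p(o_t | q_j).
     */
    public static double logPosteriorScore(double[][] frameLogLik, int target, int start, int end) {
        if (end <= start) {
            throw new IllegalArgumentException("empty segment");
        }
        double total = 0.0;
        for (int t = start; t < end; t++) {
            total += frameLogLik[t][target] - logSumExp(frameLogLik[t]);
        }
        return total / (end - start);
    }

    public static void main(String[] args) {
        // Toy input: two frames, two phones that are equally likely in every frame.
        double[][] logLik = {{-3.0, -3.0}, {-3.0, -3.0}};
        System.out.println(logPosteriorScore(logLik, 0, 0, 2)); // approximately log(0.5) = -0.693
    }
}
```

Working in the log domain with log-sum-exp avoids underflow, since raw acoustic likelihoods are far too small to exponentiate directly. A score near 0 means the target phone clearly beat all competitors; strongly negative scores flag frames where other phones fit the audio better.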
And, given the age of your article, I wouldn't recommend following its
methods. Nowadays language acquisition tools target specific mistakes that
non-natives make. For example, they try to catch the typical mistakes US
students make in French. You will not get that by looking at posteriors.
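Targeting specific mistakes can start as simply as a rule table keyed on (expected phoneme, produced phoneme). A minimal sketch, assuming the recognized sequence has the same length as the canonical one, as it would if it came from one of the substitution-only dictionary entries described in the original question; the class and both rule entries are hypothetical placeholders for a real list of errors typical of a given learner group.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: turns known, learner-group-specific substitutions into targeted feedback. */
public class TargetedErrorCheck {

    /** Placeholder rules keyed by "expected>produced"; a real table would come from learner data. */
    static final Map<String, String> KNOWN_ERRORS = Map.of(
            "AE>EH", "Known error for this learner group: AE pronounced as EH.",
            "TH>S", "Known error for this learner group: TH pronounced as S.");

    public static List<String> feedback(List<String> expected, List<String> recognized) {
        if (expected.size() != recognized.size()) {
            throw new IllegalArgumentException("sequences must be aligned position by position");
        }
        List<String> messages = new ArrayList<>();
        for (int i = 0; i < expected.size(); i++) {
            String e = expected.get(i);
            String r = recognized.get(i);
            if (!e.equals(r)) {
                // Fall back to a generic message for substitutions not covered by the table.
                messages.add(KNOWN_ERRORS.getOrDefault(e + ">" + r,
                        "Unexpected substitution at position " + (i + 1) + ": " + e + " -> " + r + "."));
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        List<String> apple = List.of("AE", "P", "AH", "L");
        List<String> heard = List.of("EH", "P", "AH", "L");
        feedback(apple, heard).forEach(System.out::println);
    }
}
```

Mismatches that are not in the table still produce a generic message, so the table can start small and grow as typical errors for a learner group are collected.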
Anonymous
-
2010-08-23
Thanks for your replies, I'll consider your ideas...
I also have a slightly different question.
I want to use Sphinx4 as the recognition engine in a Java applet on my
website. The problem is that users will reload this site very often. The
grammar contains only one word to be recognized, but unfortunately the whole
WSJ package must be loaded every time. This package is quite heavy (10MB),
but the search graph created from it is just a small part of it.
So, my question is: is it possible to create the SearchGraph from the
linguist separately, save it to a file somehow, and then send only this
specific search graph to the applet to perform recognition on it? I think it
would greatly improve the performance of this application.
Regards,
Michał
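Mechanically, saving a graph to a file and reloading it is plain Java serialization. A minimal sketch using a hypothetical `Node` class; it does not assume that Sphinx4's own SearchGraph or its states are Serializable. Decoding would still need the acoustic-model parameters the graph's states refer to, so a structure-only file like this would not by itself replace the model download.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example: saving and reloading a small graph with Java serialization. */
public class GraphCache {

    /** Hypothetical graph node; this is NOT a Sphinx4 class. */
    public static class Node implements Serializable {
        private static final long serialVersionUID = 1L;
        public final String label;
        public final List<Node> successors = new ArrayList<>();

        public Node(String label) {
            this.label = label;
        }
    }

    public static void save(Node root, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            // Writes every node reachable from root, preserving shared references and cycles.
            out.writeObject(root);
        }
    }

    public static Node load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (Node) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Node start = new Node("<sil>");
        Node ae = new Node("AE");
        start.successors.add(ae);
        ae.successors.add(new Node("P"));
        File file = File.createTempFile("graph", ".ser");
        file.deleteOnExit();
        save(start, file);
        Node copy = load(file);
        System.out.println(copy.label + " -> " + copy.successors.get(0).label); // <sil> -> AE
    }
}
```

Default serialization recurses along object links, so a graph with millions of nodes in long chains can exhaust the stack; a real cache of that size would need a flat, custom format.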
This package is quite heavy (10MB) but created search graph is just a little
part of it.
I'm not sure what you mean by search graph, but the recognizer's search space
is really huge (millions of nodes).
So, my question is: Is it possible to separately create SearchGraph from
linguist and somehow save it to a file, and then only send this specific
search graph to the applet to perform recognition on it? I think it will
greatly improve performance of this application.
Being heavyweight by nature, speech recognition is unlikely to fit into the
lightweight client paradigm. I suggest you consider using another technology
as the base for your application. There has been a lot of success recently
with Red5 + Flash setups. For example, you can visit http://speechapi.com
A forum search for related threads on pronunciation scoring:
https://sourceforge.net/search/?group_id=1904&type_of_search=forums&words=pronunciation+scoring&search=Search
There are 3 measures of pronunciation quality popular in many papers:
- segment log-likelihood (for each phoneme)
- segment duration in frames
- phone log-posterior probability scores
I am wondering whether it is possible to easily get the last one from a
Sphinx4 result. This measure is described, for example, in "AUTOMATIC
PRONUNCIATION SCORING FOR LANGUAGE INSTRUCTION" by SRI International
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.4549&rep=rep1&type=pdf).
Do you know any easy way to calculate this measure (phone log-posterior
probability score)?
Regards,
Michał