Hi All,
I just got a link to your speech toolkit and wanted to ask some basic
questions before I delve too deep. I want to write a program to check the
pronunciation of a foreign language. I will have a native speaker sample and a
non-native speaker sample, and I want to compare the two to come up with a
percentage of how close the non-native speaker was to the correct (native)
pronunciation.
Does this sound like something that would be possible using this toolkit? I've
been doing some reading about formant analysis; would that be the method to
use for something like this? Any help or advice would be appreciated. At the
moment I have passed both samples through a time-windowed FFT to produce
spectrograms and am simply comparing the spectrograms (though my code does
handle time warping) to see how close they are. It kind of works, but it's not
very accurate.
Thanks
Ray
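For illustration, here is a minimal sketch of the kind of comparison Ray describes: dynamic time warping over two spectrogram-like feature sequences. The frame size, the Euclidean frame distance, and the length normalisation are all assumptions, not anything prescribed by the toolkit.

/* Sketch: DTW distance between two spectrograms, each a sequence of
 * NBINS-dimensional frames. Sizes and the distance measure are assumed. */
#include <math.h>

#define NBINS     64    /* spectral bins per frame (assumed) */
#define MAXFRAMES 512   /* maximum frames per sample (assumed) */

static double frame_dist(const double *a, const double *b)
{
    double d = 0.0;
    for (int k = 0; k < NBINS; k++) {
        double diff = a[k] - b[k];
        d += diff * diff;
    }
    return sqrt(d);
}

/* Classic DTW recurrence: D[i][j] = dist(i,j) + min of the three
 * predecessor cells. Returns the accumulated cost normalised by length. */
double dtw(const double x[][NBINS], int n, const double y[][NBINS], int m)
{
    static double D[MAXFRAMES + 1][MAXFRAMES + 1];

    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= m; j++)
            D[i][j] = HUGE_VAL;
    D[0][0] = 0.0;

    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= m; j++) {
            double best = D[i - 1][j - 1];
            if (D[i - 1][j] < best) best = D[i - 1][j];
            if (D[i][j - 1] < best) best = D[i][j - 1];
            D[i][j] = frame_dist(x[i - 1], y[j - 1]) + best;
        }
    }
    return D[n][m] / (n + m);  /* crude length normalisation */
}

A lower score means a closer match; turning that into a percentage requires choosing a scale, which is essentially the calibration problem the replies below point out.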
Hello,
CMUSphinx can be used for language learning and pronunciation assessment, but
the algorithms involved are different from what you have been doing. They do
not compare against a native speaker recording; instead they use a
native-speaker acoustic model together with a model of typical learner
mistakes.
I suggest you read these papers on HMM-based pronunciation evaluation:
A method for measuring the intelligibility and nonnativeness of phone in
foreign language pronunciation training
Goh Kawai and Keikichi Hirose
http://www.shlrc.mq.edu.au/proceedings/icslp98/PDF/AUTHOR/SL980782.PDF
The SRI EduSpeak(TM) System: Recognition and Pronunciation Scoring
Franco et al.
http://www.speech.sri.com/people/hef/papers/EduSpeak.ps
Hi. Thank you for your reply.
The first link was good reading, but the second link is broken (or at least I
cannot access it).
Regarding the first link, interestingly they also make use of an existing
speech recognition engine to generate the phoneme data which they use for
comparison. I believe your code can also be used to produce a list of phonemes
from a speech sample? Can this be done with PocketSphinx (i.e., the C version
of the code)?
I will have a read about HMMs, but I've also heard of MFCCs being used for
this kind of work. Are you able to comment on the effectiveness of either? May
I ask what algorithms are used in Sphinx to produce the phoneme output?
Thanks
Ray
Also, I should explain that I am using voice synthesis to produce the native
speaker voice samples. That being the case, couldn't I just compare the
phoneme output of the synthesized voice sample with the phoneme output from
the user recorded sample to get an approximation of pronunciation correctness?
Thanks
Ray
> The first link was good reading, but the second link is broken (or at least
> I cannot access it).
I think you can easily find it by googling the title.
> I believe your code can also be used to produce a list of phonemes from a
> speech sample? Can this be done with PocketSphinx (i.e., the C version of
> the code)?
The accuracy of unconstrained phoneme recognition is very low. That's why the
papers try to constrain the phonetic variants considered during recognition;
that's their main idea.
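For what it's worth, recent PocketSphinx versions do expose a phoneme-recognition ("allphone") mode. The sketch below follows the pattern used in the CMUSphinx tutorials, but the model paths are assumptions that depend on your installation, and the API shown is the 5prealpha-era one (older releases lack -allphone, and details such as ps_start_utt's signature have changed between versions).

/* Sketch: phoneme decoding with PocketSphinx's allphone mode.
 * Paths and the 5prealpha-era API are assumptions; check your version. */
#include <pocketsphinx.h>
#include <stdio.h>

int main(void)
{
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm", "model/en-us/en-us",                    /* acoustic model (assumed path) */
        "-allphone", "model/en-us/en-us-phone.lm.bin",  /* phonetic LM (assumed path) */
        "-lw", "2.0",
        NULL);
    ps_decoder_t *ps = ps_init(config);

    FILE *fh = fopen("sample.raw", "rb");  /* 16 kHz, 16-bit mono raw PCM */
    int16 buf[512];
    size_t nread;
    int32 score;

    ps_start_utt(ps);
    while ((nread = fread(buf, sizeof(int16), 512, fh)) > 0)
        ps_process_raw(ps, buf, nread, FALSE, FALSE);
    ps_end_utt(ps);

    /* prints a space-separated phone string, e.g. "SIL HH AH L OW SIL" */
    printf("phonemes: %s\n", ps_get_hyp(ps, &score));

    fclose(fh);
    ps_free(ps);
    cmd_ln_free_r(config);
    return 0;
}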
> I've also heard of MFCCs being used for this kind of work. Are you able to
> comment on the effectiveness of either?
MFCC is just a feature type; it can be used for DTW against the native speaker
recording or for HMM-based recognition.
> May I ask what algorithms are used in Sphinx to produce the phoneme output?
The algorithm is called Viterbi search. You can find more information in
Rabiner's HMM tutorial.
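To make the idea concrete, here is a toy Viterbi decoder over a discrete-observation HMM, with all model numbers invented for illustration. Real recognizers like Sphinx run the same dynamic program in the log domain over Gaussian-mixture state likelihoods, with beam pruning.

/* Toy Viterbi decoder: finds the most likely hidden state sequence
 * for an observation sequence under a small, invented HMM. */
#include <stdio.h>

#define NSTATES 2
#define NOBS    4

int main(void)
{
    double pi[NSTATES]         = {0.6, 0.4};               /* initial probs */
    double A[NSTATES][NSTATES] = {{0.7, 0.3}, {0.4, 0.6}}; /* transitions */
    double B[NSTATES][3]       = {{0.5, 0.4, 0.1},         /* emissions */
                                  {0.1, 0.3, 0.6}};
    int obs[NOBS] = {0, 1, 2, 2};  /* observed symbol indices */

    double delta[NOBS][NSTATES]; /* best path probability ending in state j */
    int    psi[NOBS][NSTATES];   /* best predecessor for backtrace */

    for (int j = 0; j < NSTATES; j++)
        delta[0][j] = pi[j] * B[j][obs[0]];

    for (int t = 1; t < NOBS; t++) {
        for (int j = 0; j < NSTATES; j++) {
            double best = -1.0;
            int arg = 0;
            for (int i = 0; i < NSTATES; i++) {
                double p = delta[t - 1][i] * A[i][j];
                if (p > best) { best = p; arg = i; }
            }
            delta[t][j] = best * B[j][obs[t]];
            psi[t][j] = arg;
        }
    }

    /* backtrace the most likely state sequence */
    int path[NOBS];
    path[NOBS - 1] = delta[NOBS - 1][0] > delta[NOBS - 1][1] ? 0 : 1;
    for (int t = NOBS - 1; t > 0; t--)
        path[t - 1] = psi[t][path[t]];

    for (int t = 0; t < NOBS; t++)
        printf("t=%d state=%d\n", t, path[t]);
    return 0;
}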
> That being the case, couldn't I just compare the phoneme output of the
> synthesized voice sample with the phoneme output from the user recorded
> sample to get an approximation of pronunciation correctness?
I don't think that will work.
Thanks again for a great reply. One last question: I've been reading a lot of
articles about MFCCs being a feature set used for speech recognition, but none
of the articles explain exactly WHAT these numbers represent. Do they
represent anything real? I mean, do they represent, for example, the formant
information without the harmonic distortion, or something like that?
Thanks
Ray
> I've been reading a lot of articles about MFCCs being a feature set used for
> speech recognition, but none of the articles explain exactly WHAT these
> numbers represent.
The physical interpretation of MFCCs is hard. For the linear cepstrum it's
quite easy: the zeroth coefficient is the energy, and the n-th coefficient
corresponds nonlinearly to the degree to which the signal is autocorrelated at
a time delay of n * Ts (where Ts is the sampling period).
For the linear cepstrum, this makes 13 coefficients seem reasonable at 8 kHz,
since the last coefficient then reflects autocorrelation at a delay of
12/8000 s, or 1.5 ms, which can safely be assumed to be a lower bound on
pitch-related delays (which we are not interested in for recognition of
non-tonal speech, since pitch is a function of speaker and intonation but not
of content).
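To make that interpretation concrete, here is a sketch of the linear (real) cepstrum computed directly from its definition, c = IDFT(log|DFT(x)|), using a naive O(N^2) transform; the frame length is an assumption. At 8 kHz sampling, cepstrum index n then corresponds to a delay of n/8000 s.

/* Real cepstrum of one frame, straight from the definition. For real
 * input, log|DFT(x)| is real and even, so the inverse DFT reduces to a
 * cosine sum. Naive O(N^2) transforms, for clarity only. */
#include <math.h>

#define N 64  /* frame length (assumed) */

void real_cepstrum(const double x[N], double c[N])
{
    double logmag[N];

    for (int k = 0; k < N; k++) {
        double re = 0.0, im = 0.0;
        for (int n = 0; n < N; n++) {
            double ang = -2.0 * M_PI * k * n / N;
            re += x[n] * cos(ang);
            im += x[n] * sin(ang);
        }
        logmag[k] = log(sqrt(re * re + im * im) + 1e-12); /* guard log(0) */
    }

    for (int n = 0; n < N; n++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++)
            sum += logmag[k] * cos(2.0 * M_PI * k * n / N);
        c[n] = sum / N;  /* c[0] is the (log) energy term */
    }
}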
However, at mel-scaled frequencies this interpretation (roughly reproduced
from Alan V. Oppenheim and Ronald W. Schafer, "From Frequency to Quefrency: A
History of the Cepstrum", IEEE Signal Processing Magazine) is not applicable
as such, and the use of 12 or 13 coefficients seems to be due to historical
reasons in many of the reported cases. The choice of the number of MFCCs to
include in an ASR system is largely empirical: historically, people tried
increasing the number of coefficients until a law of diminishing returns
kicked in. In practice, the optimal number of coefficients depends on the
quantity of training data, the details of the training algorithm (in
particular how well the PDFs can be modelled as the dimensionality of the
feature space increases), the number of Gaussian mixtures in the HMMs, the
speaker and background noise characteristics, and sometimes the available
computing resources.
To understand why any specific number of cepstral coefficients is used, you
could do worse than look at very early (pre-HMM) papers. When using DTW with
Euclidean or even Mahalanobis distances, it quickly became apparent that the
very high cepstral coefficients were not helpful for recognition, and to a
lesser extent, neither were the very low ones. The most common solution was to
"lifter" the MFCCs, i.e. apply a weighting function to them to emphasise the
mid-range coefficients. These liftering functions were "optimised" by a number
of researchers, but they almost always ended up being close to zero by the
time you got to the 12th coefficient.
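As one concrete example of such a weighting (though not one of the zero-tapering lifters described above), the HTK-style sinusoidal lifter is probably the most widely seen today:

#include <math.h>

/* HTK-style sinusoidal lifter: w[n] = 1 + (L/2) * sin(pi * n / L),
 * applied as cep[n] *= w[n]. L = 22 is the HTK default; it boosts the
 * mid-range cepstral coefficients relative to the lowest and highest. */
void lifter(double *cep, int ncep, int L)
{
    for (int n = 0; n < ncep; n++)
        cep[n] *= 1.0 + (L / 2.0) * sin(M_PI * n / L);
}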
And if you don't know much about DSP (like me), it is probably enough for now
to remember that MFCCs are correlated with the configuration of the acoustic
filter formed by the vocal tract, i.e. the position of the tongue, lips and so
on, while ignoring (to some extent) the information about the fundamental
frequency produced by the larynx (thus making MFCCs speaker-independent).
This is great information, thank you! I have coded up some MFCC code in C to
test this out, based on this document:
http://mirlab.org/jang/books/audioSignalProcessing/speechFeatureMfcc.asp?title=12-2%20MFCC
But there are a lot of steps. Does anyone have any proven test data to QA my
code? Ideally 64 floating-point input values and (I guess) 12 or 13 output
values?
Thanks all!
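For anyone checking their own implementation step by step: the last two stages of the pipeline (log of the mel filterbank energies, then a DCT-II truncated to 13 coefficients) are compact enough to verify in isolation. The filterbank size and coefficient count below are common choices, not requirements.

/* Final MFCC stages: log filterbank energies followed by a DCT-II,
 * c[n] = sum_{k=0}^{NFILT-1} log(E[k]) * cos(pi * n * (k + 0.5) / NFILT).
 * Some implementations add a sqrt(2/NFILT) scaling and treat c[0]
 * specially; values will differ accordingly. */
#include <math.h>

#define NFILT 26  /* mel filterbank channels (a common choice, assumed) */
#define NCEP  13  /* cepstral coefficients kept */

void mfcc_from_fbank(const double fbank[NFILT], double cep[NCEP])
{
    for (int n = 0; n < NCEP; n++) {
        double sum = 0.0;
        for (int k = 0; k < NFILT; k++)
            sum += log(fbank[k] + 1e-12) * cos(M_PI * n * (k + 0.5) / NFILT);
        cep[n] = sum;
    }
}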
> Does anyone have any proven test data to QA my code?
You can use any sound file with the CMUSphinx feature extraction code to
create tests. But you need to remember that there are different variants of
MFCC extraction. They differ in their parameters and produce completely
different values, but recognition accuracy is the same or almost the same.
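For example, something along these lines should produce a .mfc file to compare against (sphinx_fe and sphinx_cepview ship with sphinxbase; flag names vary between versions, so treat this invocation as a sketch and check the help output of your build):

# extract MFCCs from a 16 kHz WAV file (flags are version-dependent)
sphinx_fe -i sample.wav -o sample.mfc -mswav yes -samprate 16000
# print the resulting 13-dimensional cepstra as text
sphinx_cepview -f sample.mfc -d 13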