Pronunciation Checking

Raeldor
2010-12-10
2012-09-22
  • Raeldor

    Raeldor - 2010-12-10

    Hi All,

    I just got a link to your speech toolkit and wanted to ask some basic
    questions before I delve too deep. I want to write a program to check the
    pronunciation of a foreign language. I will have a native speaker sample
    and a non-native speaker sample, and I want to compare the two to come up
    with a percentage of how close the non-native speaker was to the correct
    (native) pronunciation.

    Does this sound like something that would be possible using this toolkit?
    I've been doing some reading about formant analysis; would that be the
    method to use for something like this? Any help or advice would be
    appreciated. At the moment I pass both samples through a time-windowed FFT
    to produce spectrograms and simply compare the spectrograms (though my
    code does handle warping) to see how close they are. It kind of works, but
    it's not very accurate.
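
    For illustration, a minimal sketch of this kind of warped comparison
    (dynamic time warping over frame vectors, with a Euclidean frame distance
    assumed; this is only an example, not the actual code discussed above):

        /* Minimal DTW between two sequences of feature frames (for example
         * spectrogram columns or MFCC vectors), stored as flat arrays of
         * nref*dim and ntest*dim doubles.  Illustration only. */
        #include <stdlib.h>
        #include <math.h>
        #include <float.h>

        static double frame_dist(const double *a, const double *b, int dim)
        {
            double s = 0.0;
            for (int i = 0; i < dim; i++) {
                double d = a[i] - b[i];
                s += d * d;
            }
            return sqrt(s);
        }

        /* Returns the DTW cost normalized by path length; lower means closer. */
        double dtw_cost(const double *ref, int nref,
                        const double *test, int ntest, int dim)
        {
            int w = ntest + 1;
            double *D = malloc((size_t)(nref + 1) * w * sizeof(double));
            for (int i = 0; i <= nref; i++)
                for (int j = 0; j <= ntest; j++)
                    D[i * w + j] = DBL_MAX;
            D[0] = 0.0;

            for (int i = 1; i <= nref; i++) {
                for (int j = 1; j <= ntest; j++) {
                    double cost = frame_dist(ref + (i - 1) * dim,
                                             test + (j - 1) * dim, dim);
                    double up   = D[(i - 1) * w + j];
                    double left = D[i * w + (j - 1)];
                    double diag = D[(i - 1) * w + (j - 1)];
                    double best = diag < up ? diag : up;
                    if (left < best) best = left;
                    D[i * w + j] = cost + best;
                }
            }
            double total = D[nref * w + ntest];
            free(D);
            return total / (nref + ntest);
        }

    Turning such a cost into a percentage would still need some calibration,
    e.g. mapping typical native/native and native/learner costs onto a scale.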

    Thanks
    Ray

     
  • Nickolay V. Shmyrev

    Hello

    CMUSphinx can be used for language learning and pronunciation assessment,
    but the algorithms involved are different from what you have been doing.
    They do not rely on a native speaker recording; instead they use a
    native-speaker acoustic model together with a model of typical learner
    mistakes.

    I suggest you read these papers on HMM-based pronunciation evaluation:

    A method for measuring the intelligibility and nonnativeness of phone in
    foreign language pronunciation training

    Goh Kawai and Keikichi Hirose
    http://www.shlrc.mq.edu.au/proceedings/icslp98/PDF/AUTHOR/SL980782.PDF

    The SRI EduSpeak™ System: Recognition and Pronunciation Scoring
    Franco et al.
    http://www.speech.sri.com/people/hef/papers/EduSpeak.ps

     
  • Raeldor

    Raeldor - 2010-12-11

    Hi. Thank you for your reply.

    The first link was good reading, but the second link is broken (or at least I
    cannot access it).

    Regarding the first link, interestingly they also make use of an existing
    speech recognition engine to generate the phoneme data which they use for
    comparison. I believe your code can also be used to produce a list of
    phonemes from a speech sample? Can this be done from PocketSphinx (i.e.,
    the C version of the code)?

    I will read up on HMMs, but I've also heard of MFCCs being used for this
    kind of work. Are you able to comment on the effectiveness of either? May
    I ask what algorithms are used in Sphinx to produce the phoneme output?

    Thanks
    Ray

     
  • Raeldor

    Raeldor - 2010-12-11

    Also, I should explain that I am using speech synthesis to produce the
    native speaker voice samples. That being the case, couldn't I just compare
    the phoneme output of the synthesized voice sample with the phoneme output
    from the user-recorded sample to get an approximation of pronunciation
    correctness?

    Thanks
    Ray

     
  • Nickolay V. Shmyrev

    The first link was good reading, but the second link is broken (or at least
    I cannot access it).

    I think you can easily Google the title.

    I believe your code can also be used to produce a list of phonemes from a
    speech sample? Can this be done from PocketSphinx (i.e., the C version of
    the code)?

    The accuracy of unconstrained phoneme recognition is very low. That is why
    those papers constrain the phonetic variants considered during
    recognition; that is their main idea.

    I've also heard of MFCCs being used for this kind of work. Are you able to
    comment on the effectiveness of either?

    MFCC is just a feature type; it can be used either for DTW against the
    native speaker recording or for HMM-based recognition.

    May I ask what algorithms are used in Sphinx to produce the phoneme output?

    This algorithm is called Viterbi search. You can find more information in
    Rabiner's HMM tutorial.
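
    A toy illustration of the Viterbi recursion over a small discrete HMM (the
    transition, emission, and initial probabilities below are made-up numbers,
    purely to show the dynamic programming; see Rabiner's tutorial for the
    full treatment):

        /* Toy Viterbi decoder: most likely state sequence for a short
         * observation sequence.  All probability tables are invented
         * example numbers, kept in log space to avoid underflow. */
        #include <stdio.h>
        #include <math.h>

        #define N 3   /* states */
        #define M 2   /* observation symbols */
        #define T 4   /* observations */

        int main(void)
        {
            double logA[N][N] = {                       /* transitions */
                { log(0.6), log(0.3), log(0.1) },
                { log(0.2), log(0.6), log(0.2) },
                { log(0.1), log(0.3), log(0.6) }
            };
            double logB[N][M] = {                       /* emissions */
                { log(0.7), log(0.3) },
                { log(0.4), log(0.6) },
                { log(0.1), log(0.9) }
            };
            double logPi[N] = { log(0.5), log(0.3), log(0.2) };
            int obs[T] = { 0, 1, 1, 0 };

            double delta[T][N];   /* best log prob ending in state j at t */
            int    psi[T][N];     /* backpointers */

            for (int j = 0; j < N; j++) {
                delta[0][j] = logPi[j] + logB[j][obs[0]];
                psi[0][j] = 0;
            }
            for (int t = 1; t < T; t++) {
                for (int j = 0; j < N; j++) {
                    int argmax = 0;
                    double best = delta[t - 1][0] + logA[0][j];
                    for (int i = 1; i < N; i++) {
                        double v = delta[t - 1][i] + logA[i][j];
                        if (v > best) { best = v; argmax = i; }
                    }
                    delta[t][j] = best + logB[j][obs[t]];
                    psi[t][j] = argmax;
                }
            }
            int path[T], last = 0;
            for (int j = 1; j < N; j++)
                if (delta[T - 1][j] > delta[T - 1][last]) last = j;
            path[T - 1] = last;
            for (int t = T - 1; t > 0; t--)
                path[t - 1] = psi[t][path[t]];

            for (int t = 0; t < T; t++)
                printf("t=%d state=%d\n", t, path[t]);
            return 0;
        }

    In a real recognizer the states belong to phone HMMs and the emission
    scores come from the acoustic model, but the recursion is the same.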

    That being the case, couldn't I just compare the phoneme output of the
    synthesized voice sample with the phoneme output from the user recorded sample
    to get an approximation of pronunciation correctness?

    I don't think that will work.

     
  • Raeldor

    Raeldor - 2010-12-12

    Thanks again for a great reply. One last question. I've been reading a lot
    of articles about MFCCs being a feature set used for speech recognition,
    but none of the articles explain exactly WHAT these numbers represent. Do
    they represent anything real? I mean, do they represent, for example, the
    formant information without the harmonic distortion, or something like
    that?

    Thanks
    Ray

     
  • Nickolay V. Shmyrev

    I've been reading a lot of articles about MFCCs being a feature set used
    for speech recognition, but none of the articles explain exactly WHAT
    these numbers represent.

    A physical interpretation of MFCCs is hard. For the linear cepstrum it is
    quite easy: the zeroth coefficient is the energy, and the n-th coefficient
    corresponds nonlinearly to the degree to which the signal is
    autocorrelated at the time delay n * Ts (Ts being the sampling period).

    For the linear cepstrum this makes 13 coefficients seem reasonable at
    8 kHz: the last coefficient then shows autocorrelation at a delay of
    12/8000 s, or 1.5 ms, which can safely be assumed to lower-bound
    pitch-related delays (which we are not interested in for recognition of
    non-tonal speech, since pitch is a function of speaker and intonation but
    not of content).
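
    For concreteness, the linear (real) cepstrum referred to above is just the
    inverse transform of the log magnitude spectrum. A naive O(N^2) sketch
    (any real front end would use an FFT instead; this is only to show what
    the coefficients are):

        /* Real cepstrum of one frame: c = IDFT(log|DFT(x)|).  The q-th
         * coefficient measures periodic structure in the log spectrum at a
         * delay of q samples, i.e. q / sample_rate seconds. */
        #include <stdlib.h>
        #include <math.h>

        #ifndef M_PI
        #define M_PI 3.14159265358979323846
        #endif

        void real_cepstrum(const double *x, int n, double *c)
        {
            double *logmag = malloc(n * sizeof(double));
            for (int k = 0; k < n; k++) {
                double re = 0.0, im = 0.0;
                for (int t = 0; t < n; t++) {
                    double ang = -2.0 * M_PI * k * t / n;
                    re += x[t] * cos(ang);
                    im += x[t] * sin(ang);
                }
                logmag[k] = log(sqrt(re * re + im * im) + 1e-12);
            }
            /* Inverse DFT of the log spectrum; since log|X[k]| is real and
             * symmetric for real input, the cosine sum is the full result. */
            for (int q = 0; q < n; q++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += logmag[k] * cos(2.0 * M_PI * k * q / n);
                c[q] = s / n;
            }
            free(logmag);
        }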

    However, at mel-scaled frequencies this interpretation (roughly reproduced
    from Alan V. Oppenheim and Ronald W. Schafer, "From Frequency to
    Quefrency: A History of the Cepstrum", IEEE Signal Processing Magazine) is
    not applicable as such, and the use of 12 or 13 coefficients seems to be
    due to historical reasons in many of the reported cases. The choice of the
    number of MFCCs to include in an ASR system is largely empirical.
    Historically, people tried increasing the number of coefficients until a
    law of diminishing returns kicked in. In practice, the optimal number of
    coefficients depends on the quantity of training data, the details of the
    training algorithm (in particular how well the PDFs can be modelled as the
    dimensionality of the feature space increases), the number of Gaussian
    mixtures in the HMMs, the speaker and background noise characteristics,
    and sometimes the available computing resources.

    To understand why any specific number of cepstral coefficients is used,
    you could do worse than look at very early (pre-HMM) papers. When using
    DTW with Euclidean or even Mahalanobis distances, it quickly became
    apparent that the very high cepstral coefficients were not helpful for
    recognition, and to a lesser extent, neither were the very low ones. The
    most common solution was to "lifter" the MFCCs, i.e. apply a weighting
    function to them to emphasise the mid-range coefficients. These liftering
    functions were "optimised" by a number of researchers, but they almost
    always ended up being close to zero by the time you got to the 12th
    coefficient.
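
    As an illustration, one well-known family of lifters is the raised-sine
    weighting sketched below (L = 22 is just an example value; the early
    DTW-era lifters mentioned above used various shapes, many tapering toward
    zero at the highest coefficients):

        /* Raised-sine cepstral liftering: emphasises mid-range coefficients.
         * L = 22 is an arbitrary example; the shape varies between systems. */
        #include <math.h>

        #ifndef M_PI
        #define M_PI 3.14159265358979323846
        #endif

        void lifter(double *c, int ncep, double L)
        {
            for (int n = 0; n < ncep; n++)
                c[n] *= 1.0 + (L / 2.0) * sin(M_PI * n / L);
        }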

     
  • Vassil Panayotov

    And if you don't know much about DSP (like me), it is probably enough for
    now to remember that MFCCs are correlated with the configuration of the
    acoustic filter formed by the vocal tract, i.e. the position of the
    tongue, lips and so on, while ignoring (to some extent) the information
    about the fundamental frequency produced by the larynx (thus making MFCCs
    largely speaker-independent).

     
  • Raeldor

    Raeldor - 2010-12-13

    This is great information, thank you! I have written some MFCC code in C
    to test this out, based on this document:

    http://mirlab.org/jang/books/audioSignalProcessing/speechFeatureMfcc.asp?title=12-2%20MFCC

    But there are a lot of steps. Does anyone have any proven test data to QA my
    code? Ideally 64 floating point input values and (I guess) 12 or 13 output
    values?
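
    As a cross-check on the last stages, here is a small sketch of the mel
    conversion and the final DCT step (the FFT and filterbank construction are
    left out; 26 filters and 13 coefficients are common defaults, not
    requirements, and different front ends apply different scale factors, so
    absolute values will not match between implementations):

        /* Final MFCC stage: DCT-II of the log mel filterbank energies. */
        #include <math.h>

        #ifndef M_PI
        #define M_PI 3.14159265358979323846
        #endif

        /* Hz <-> mel conversions (the usual 2595 * log10(1 + f/700) form). */
        double hz_to_mel(double hz)  { return 2595.0 * log10(1.0 + hz / 700.0); }
        double mel_to_hz(double mel) { return 700.0 * (pow(10.0, mel / 2595.0) - 1.0); }

        /* DCT-II of nfilt log energies, keeping the first ncep coefficients.
         * Some front ends add an orthonormal scale factor here. */
        void logmel_to_mfcc(const double *logmel, int nfilt,
                            double *mfcc, int ncep)
        {
            for (int n = 0; n < ncep; n++) {
                double s = 0.0;
                for (int m = 0; m < nfilt; m++)
                    s += logmel[m] * cos(M_PI * n * (m + 0.5) / nfilt);
                mfcc[n] = s;
            }
        }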

    Thanks all!

     
  • Nickolay V. Shmyrev

    Does anyone have any proven test data to QA my code?

    You can use any sound file with the CMUSphinx feature extraction code to
    create tests. But keep in mind that there are different variants of MFCC
    extraction. They differ in their parameters and produce completely
    different values, yet recognition accuracy is the same or almost the same.
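
    For example, something along these lines should dump reference feature
    files to compare against (the exact flag names can differ between
    sphinxbase versions, so check the tools' help output first):

        sphinx_fe -samprate 16000 -mswav yes -i test.wav -o test.mfc
        sphinx_cepview -f test.mfc -d 13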

     
