
What is 'Acoustic Score'?

  • He Shiming

    He Shiming - 2013-01-25

    Dear Community,

    I'm working on a research project to provide machine based scoring for pronunciation in language learning process. After a bit of googling, I stumbled upon a handful of papers and projects that based this idea on the alignment function.

    I managed to create a demo, where the first step is to use sphinx_fe (from sphinxbase 0.8) to extract features (mfc), and the second step is to use sphinx3_align (from sphinx3-0.8) to align the recording with the given text. Eventually, the resulting .phseg and .wdseg files contain an 'Acoustic Score' for each phoneme or word, which is supposed to say something about how good the samples are. (Correct me if I'm wrong here, but sphinx4 can't do this?)

    Though I read in several papers that the acoustic score can be both positive and negative, I'm seeing consistently negative scores in all of my experiments.

    I'm wondering what exactly the acoustic score is, how I can use it as a reference to determine whether the pronunciation is good, and whether it is even adequate for that purpose.

    Thanks.

     
  • The Grand Janitor

    About the sign of the likelihood and what the acoustic score is: the acoustic score per frame is essentially the log value of the probability density function (pdf). In Sphinx's case, the pdf is a multi-dimensional Gaussian distribution. So the acoustic score per phone will be the log likelihood of the phone HMM. You can extend this definition to word HMMs.

    For the sign: if you think of a discrete probability distribution, this acoustic score should always be negative, because the log of a number less than 1 is negative.

    In the case of a Gaussian density, though, when the standard deviation is small, the density value can be larger than 1
    (see also http://blog.stata.com/2011/02/16/positive-log-likelihood-values-happen/). Those are the times you will see a positive value.

    One thing that might seem odd is the magnitude of the likelihoods you see. Bear in mind that Sphinx2 and Sphinx3 use a very small log base, and that we are talking about a multi-dimensional Gaussian distribution; both make the numerical values larger.
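    As a quick sanity check of the sign discussion above, here is a small sketch (not Sphinx code; the function name and numbers are my own) showing that the log density of a Gaussian is negative for ordinary variances but can turn positive when the variances are very small:

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log density of a diagonal-covariance multivariate Gaussian."""
    ll = 0.0
    for xi, mu, v in zip(x, mean, var):
        ll += -0.5 * (math.log(2 * math.pi * v) + (xi - mu) ** 2 / v)
    return ll

# With unit variances the density is below 1, so the log is negative:
print(gaussian_log_likelihood([0.0, 0.0], [0.0, 0.0], [1.0, 1.0]))    # negative

# With very small variances the density exceeds 1, so the log is positive:
print(gaussian_log_likelihood([0.0, 0.0], [0.0, 0.0], [0.01, 0.01]))  # positive
```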

    As for determining whether the pronunciation is good, what people usually do is calculate some kind of phone confidence. There should be plenty of papers on that topic; try googling it a bit first.
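    A common first normalization step before any real confidence measure (this is just a sketch, not one of the published methods) is to divide each phone's total acoustic score by its duration in frames, so long phones aren't penalized simply for spanning more frames:

```python
def per_frame_score(total_acoustic_score, start_frame, end_frame):
    """Average log-likelihood per frame for one aligned phone
    (frame indices inclusive, as in .phseg output)."""
    n_frames = end_frame - start_frame + 1
    return total_acoustic_score / n_frames

# e.g. a phone spanning frames 10..29 with total score -200000:
print(per_frame_score(-200000, 10, 29))  # -10000.0
```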

    Hope this helps.

    Arthur

     
    • sumit kumar

      sumit kumar - 2018-12-27

      Hi Arthur
      Can you suggest something to calculate phoneme confidence?

       
  • Pranav Jawale

    Pranav Jawale - 2013-01-26

    Small addition to Arthur's reply: the acoustic score per frame comes from two sources, [1] the Gaussian pdf and [2] the discrete pdf corresponding to the state transition matrix (the "a" transitions in https://en.wikipedia.org/wiki/File:HiddenMarkovModel.svg). The first one has the larger contribution.
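    To make the two sources concrete, here is an illustrative sketch (made-up numbers, not Sphinx internals): for a fixed state path, the total log score is the sum of the emission log-likelihoods and the transition log-probabilities from the HMM's "a" matrix.

```python
import math

def path_log_score(emission_ll, path, log_trans):
    """Total log score of one state path.
    emission_ll[t] = log p(frame_t | state path[t]);
    log_trans[(i, j)] = log a_ij from the transition matrix."""
    score = emission_ll[0]
    for t in range(1, len(path)):
        score += log_trans[(path[t - 1], path[t])] + emission_ll[t]
    return score

# Toy 2-state left-to-right HMM:
log_trans = {(0, 0): math.log(0.6), (0, 1): math.log(0.4),
             (1, 1): math.log(0.7)}
emission_ll = [-5.0, -4.0, -6.0]   # the dominant contribution
path = [0, 0, 1]
print(path_log_score(emission_ll, path, log_trans))
```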

     
  • The Grand Janitor

    Hey Pranav, thanks for your supplementary answer. You did provide more details for Mr. Shiming.

    I guess one point I want to clarify: the acoustic score per phone or per word does come from two sources, and there the state-transition or phone-to-phone transition probability needs to be calculated.

    On the other hand, the acoustic score per frame usually means just the Gaussian density. (The Gaussian density and the discrete transition pdf are independent; the latter doesn't depend on the frame vector.)

    Of course, we might be just talking about something with slightly different terms.

    Arthur

    Btw, I put a little bit of an explanation post on my blog. Feel free to take a look if you are interested.
    http://grandjanitor.blogspot.com/2013/01/acoustic-score-and-its-signness.html


     
  • Pranav Jawale

    Pranav Jawale - 2013-01-26

    Thanks for making the distinction between per frame and per phone. Do the scores in *.stseg come from the Gaussian continuous density alone, or from a combination of both the continuous and discrete densities? I thought it was the latter.

    (I don't know why fonts are messed up after sourceforge site modifications).

     

    Last edit: Pranav Jawale 2013-01-26
  • He Shiming

    He Shiming - 2013-01-26

    Thank you all very much for the explanation.

    I have been reviewing the scores manually. It turns out there are limitations. For instance, the 'K' in 'activity', or the 'N' in 'men', will most likely get a low score (around -200000), while the 'V' and 'T' in 'activity' often get a high score (-18000 to -10000). It appears silent consonants are treated as inferior pronunciation.

    Another limitation is that sometimes sphinx3_align dies with a 'final state not reached' error. This happens in cases where a user pronounced a word incorrectly. It looks like sphinx3_align stops searching as soon as it fails to align one word. Is there any way to improve this?

    For phone confidence calculation, I've done a bit of searching already. It looks like I should get the TIMIT database for a reference baseline. I wonder if I'm heading in the right direction. Should I run feature extraction and alignment on the TIMIT data, calculate the mean and standard deviation of each phoneme's score, and then compare the user's sample against the TIMIT statistics to get a pronunciation score?
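    The comparison described above could be sketched as a simple z-score against per-phoneme reference statistics. This is only a hypothetical illustration of the plan, not an endorsed method, and the numbers below are made up:

```python
def pronunciation_z_score(user_score, ref_mean, ref_std):
    """How many standard deviations the user's score sits from the
    reference mean; values near 0 mean 'typical' pronunciation."""
    return (user_score - ref_mean) / ref_std

# phoneme -> (mean, std) of reference scores; values are made up
ref_stats = {"K": (-150000.0, 40000.0)}

z = pronunciation_z_score(-200000.0, *ref_stats["K"])
print(z)  # -1.25
```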

     
  • Nickolay V. Shmyrev

    I have been reviewing the scores manually. It turns out there are limitations. For instance, the 'K' in 'activity', or the 'N' in 'men', will most likely get a low score (around -200000), while the 'V' and 'T' in 'activity' often get a high score (-18000 to -10000). It appears silent consonants are treated as inferior pronunciation.

    You need to understand that the acoustic score is a joint probability of the model and the observation sequence and has no relation to pronunciation quality. A confidence score does, if you construct it, but the acoustic score is not a confidence score.

    Another limitation is that sometimes sphinx3_align dies with a 'final state not reached' error. This happens in cases where a user pronounced a word incorrectly. It looks like sphinx3_align stops searching as soon as it fails to align one word. Is there any way to improve this?

    Increase beam

     
    • Alex Rudnicky

      Alex Rudnicky - 2013-01-26

      What Nickolay said.

      Absolute acoustic scores are not all that meaningful. If you are interested
      in scores per se, you might be better off looking at the distribution across
      classes per decoding unit. So, if the next-best class is very close to
      the first-best, that's worse than if the best score is significantly better
      than the 2nd-best. It's all relative.

      But actually that's too complicated. You're better off examining posterior
      probabilities, i.e. the 'confidence' score.
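      The relative-score idea can be sketched as a softmax over competing class log scores (hypothetical numbers; a real posterior would be computed over a lattice): normalizing with log-sum-exp turns the best class's share of the probability mass into a confidence between 0 and 1.

```python
import math

def posterior_confidence(log_scores):
    """Softmax probability of the best class among competing classes,
    computed stably by shifting by the maximum log score."""
    m = max(log_scores)
    denom = sum(math.exp(s - m) for s in log_scores)
    return 1.0 / denom  # exp(m - m) / denom for the best class

# Best score well clear of the runner-up -> high confidence:
print(posterior_confidence([-100.0, -110.0, -120.0]))
# Best score barely ahead -> low confidence:
print(posterior_confidence([-100.0, -100.5, -101.0]))
```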

      The 'final-state-not-reached' condition, in my experience, is often
      attributable to bad end-pointing. That is, the recognition process assumes
      that all utts end in a silence; specifically, a transition to a silence
      state. If the end of the utt is noisy, or cuts off speech, meaning there's
      no good silence, you get this error. I would suggest first looking at the
      endpointing.

      Alex


       
  • He Shiming

    He Shiming - 2013-01-27

    Thank you all for your help. I'll experiment with beam settings and try making sure all utterances end in silence.

    [Update]: I tried fading the utterances out to silence using sox fade 1s 0 1s. A weird thing happened: sphinx3_align spends a long time on each and then dies with 'final state not reached'. Fade-in is okay, but fade-out causes this problem. Usually sphinx3_align spends 1-2 seconds on each utterance, but for the faded-out ones it spends up to 10 seconds.

    About the 'confidence' score: I'm wondering if using the acoustic scores of the TIMIT data is a good starting point for evaluation?

     

    Last edit: He Shiming 2013-01-28
