Dear Community,
I'm working on a research project to provide machine-based scoring of pronunciation for language learning. After a bit of googling, I stumbled upon a handful of papers and projects that base this idea on forced alignment.
I managed to create a demo where the first step is to use sphinx_fe (from sphinxbase 0.8) to extract features (MFC), and the second step is to use sphinx3_align (from sphinx3-0.8) to align the recording with the given text. The resulting .phseg and .wdseg files contain an 'Acoustic Score' for each phoneme or word, which is supposed to say something about how good the samples are. (Correct me if I'm wrong here, but sphinx4 can't do this?)
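For reference, here is a minimal sketch of how the per-phone scores can be pulled out of the .phseg output, assuming the usual sphinx3 layout of "SFrm EFrm SegAScr Phone" columns (check the header line your build actually emits; the filename is a placeholder):

```python
# Minimal .phseg parser sketch. Assumes rows of the form
# "SFrm EFrm SegAScr Phone"; verify against your build's output.
def parse_phseg(path):
    segments = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            # Skip the header row and the trailing "Total score:" line.
            if len(parts) < 4 or not parts[0].isdigit():
                continue
            start_frame = int(parts[0])
            end_frame = int(parts[1])
            score = int(parts[2])
            phone = " ".join(parts[3:])  # context-dependent phones print with context
            segments.append((phone, start_frame, end_frame, score))
    return segments

for phone, start, end, score in parse_phseg("utt01.phseg"):
    print(phone, start, end, score)
```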
Though I read in several papers that the acoustic score can be either positive or negative, I'm seeing consistently negative scores in all of my experiments.
I'm wondering what exactly the acoustic score is, and how I can use it as a reference to determine whether the pronunciation is good. Or is it even adequate for that?
Thanks.
Check out this: http://cmusphinx.sourceforge.net/wiki/pronunciation_evaluation
About the sign of the likelihood and what the acoustic score is: the acoustic score per frame is essentially the log value of a continuous probability density function (PDF). In Sphinx's case, that density is a multi-dimensional Gaussian distribution. The acoustic score per phone is then the log likelihood of the phone HMM, and you can extend this definition to the word HMM.
As for the sign: if you think of a discrete probability distribution, this acoustic score should always be negative, because the log of a number between 0 and 1 is negative.
In the case of a Gaussian density, though, when the standard deviation is small it is possible for the density value to be larger than 1 (see also http://blog.stata.com/2011/02/16/positive-log-likelihood-values-happen/). Those are the times you will see a positive value.
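A quick numerical sketch of that point (plain Python, natural log, 1-D Gaussian for simplicity):

```python
import math

def log_gaussian_pdf(x, mean, sigma):
    """Natural log of a 1-D Gaussian density at x."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mean) ** 2 / (2 * sigma ** 2)

print(log_gaussian_pdf(0.0, 0.0, 1.0))  # ~ -0.92: density below 1, log is negative
print(log_gaussian_pdf(0.0, 0.0, 0.1))  # ~ +1.38: a narrow Gaussian peaks above 1
```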
One thing that might feel off is the magnitude of the likelihoods you see. Bear in mind that Sphinx2 and Sphinx3 use a very small log base, and that we are talking about a multi-dimensional Gaussian distribution; both make the numerical values much bigger.
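To give a feel for the scale, here is a sketch assuming the default -logbase of 1.0003 (check your configuration):

```python
import math

# Sphinx stores log likelihoods as integers in a tiny log base (1.0003 by
# default), so a modest natural-log value turns into a huge-looking score.
logbase = 1.0003
natural_log_likelihood = -10.0
sphinx_style_score = natural_log_likelihood / math.log(logbase)
print(round(sphinx_style_score))  # ~ -33338
```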
As for determining whether the pronunciation is good, what people usually do is calculate some kind of phone confidence. There should be plenty of papers on that topic; try googling a bit first.
Hope this helps.
Arthur
Hi Arthur
Can you suggest something to calculate phoneme confidence?
A small addition to Arthur's reply: the acoustic score per frame comes from two sources, [1] the Gaussian density and [2] the discrete distribution corresponding to the state transition matrix (the "a" transitions in https://en.wikipedia.org/wiki/File:HiddenMarkovModel.svg). The first one has the larger contribution.
Hey Pranav, thanks for your supplementary answer. You did provide more details for Mr. Shiming.
One point I want to clarify: the acoustic score per phone or per word does come from two sources, since the state-transition (or phone-to-phone transition) probabilities need to be included in it.
On the other hand, the acoustic score per frame usually just means the Gaussian (GMM) density, since the GMM density and the discrete transition probabilities are independent; the latter doesn't depend on the frame vector.
Of course, we might just be talking about the same thing with slightly different terms.
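A toy illustration of the distinction (all numbers are made up):

```python
import math

# Per-frame scores come from the emission density (they depend on the frame
# vector); the per-phone score additionally folds in the frame-independent
# state-transition log probabilities.
emission_logs = [-4.2, -3.1, -5.0]                        # one per frame, from the GMM
transition_logs = [math.log(p) for p in (0.6, 0.6, 0.4)]  # from the HMM transition matrix

per_frame_scores = emission_logs
per_phone_score = sum(emission_logs) + sum(transition_logs)
print(per_frame_scores, per_phone_score)
```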
Arthur
By the way, I put a short explanation post up on my blog. Feel free to take a look if you are interested.
http://grandjanitor.blogspot.com/2013/01/acoustic-score-and-its-signness.html
Arthur
Thanks for making the distinction between per-frame and per-phone. Do the scores in *.stseg come from the continuous Gaussian density alone, or from a combination of the continuous and discrete distributions? I thought it was the latter.
Last edit: Pranav Jawale 2013-01-26
Thank you all very much for the explanation.
I have been reviewing the scores manually, and it turns out there are limitations. For instance, the 'K' in 'activity' or the 'N' in 'men' will most likely get a low score (around -200000), while the 'V' and 'T' in 'activity' often get a high score (-18000 to -10000). It appears that quieter consonants are treated as inferior pronunciation.
Another limitation is that sphinx3_align sometimes dies with a 'final state not reached' error. This happens in cases where a user pronounced a word incorrectly; it looks like sphinx3_align stops searching as soon as it fails to align one word. Is there any way to improve this?
For phone confidence calculation, I've done a bit of searching already. It looks like I should get the TIMIT database as a reference baseline. I wonder if I'm heading in the right direction: should I run feature extraction and alignment on the TIMIT data, calculate the mean and standard deviation of the score for each phoneme, and then compare how far a user's sample deviates from the TIMIT mean to get a pronunciation score?
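Roughly, the comparison I have in mind would look like this sketch (phoneme names and score values are made-up placeholders):

```python
from statistics import mean, stdev

# Hypothetical reference table: per-phoneme acoustic scores collected by
# aligning TIMIT recordings (values here are placeholders, not real data).
timit_scores = {
    "K": [-150000.0, -180000.0, -120000.0],
    "V": [-15000.0, -12000.0, -18000.0],
}

def score_deviation(phoneme, user_score):
    """How many reference standard deviations the user's score sits from the TIMIT mean."""
    ref = timit_scores[phoneme]
    return (user_score - mean(ref)) / stdev(ref)

print(score_deviation("K", -200000.0))  # negative: well below the reference mean
```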
You need to understand that the acoustic score is a joint probability of the model and the observation sequence, and it has no relation to pronunciation quality. A confidence score does, if you construct one, but the acoustic score is not a confidence score.
Increase the beam.
What Nickolay said.
Absolute acoustic scores are not all that meaningful. If you are interested in scores per se, you might be better off looking at the distribution across classes per decoding unit. So if the next-best class is very close to the first-best, that's worse than if the best score is significantly better than the second-best. It's all relative.
But actually that's too complicated. You're better off examining posterior probabilities, i.e. the 'confidence' score.
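As a sketch of what that means in practice (made-up scores on a natural-log scale; Sphinx's integer scores would first need dividing by log(logbase)):

```python
import math

# Turn competing per-unit log-likelihood scores into posterior-like
# confidences with a softmax over the candidate classes.
scores = {"AE": -42.0, "EH": -43.5, "IH": -49.0}  # made-up log likelihoods

best = max(scores.values())
exps = {k: math.exp(v - best) for k, v in scores.items()}
total = sum(exps.values())
posteriors = {k: e / total for k, e in exps.items()}

print(posteriors)  # a clear winner yields high confidence; close rivals do not
```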
The 'final state not reached' condition, in my experience, is often attributable to bad end-pointing. That is, the recognition process assumes that all utterances end in silence; specifically, in a transition to a silence state. If the end of the utterance is noisy, or cuts off speech, so that there is no good silence, you get this error. I would suggest looking at the endpointing first.
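If the recordings do cut off abruptly, one low-tech fix is simply to pad trailing silence onto the file. A minimal sketch for 16-bit mono WAVs (the filename is a placeholder):

```python
import wave

# Append half a second of digital silence to a 16-bit mono WAV so the
# utterance ends in a clean silence region for the aligner.
with wave.open("utt01.wav", "rb") as src:
    params = src.getparams()
    frames = src.readframes(src.getnframes())

silence = b"\x00\x00" * int(params.framerate * 0.5)  # 0.5 s of 16-bit samples

with wave.open("utt01_padded.wav", "wb") as dst:
    dst.setparams(params)
    dst.writeframes(frames + silence)
```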
Alex
Thank you all for your help. I'll experiment with the beam settings and try to make sure all utterances end in silence.
[Update]: I tried fading the utterances out using `sox fade 1s 0 1s`. A weird thing happened: sphinx3_align spends a long time on each one and then dies with 'final state not reached'. Fade-in is okay, but fade-out causes this problem. Usually sphinx3_align spends 1-2 seconds per utterance, but for the faded-out ones it spends up to 10 seconds.
About the 'confidence' score, I'm wondering whether using the acoustic scores of TIMIT data is a good start for evaluation?
Last edit: He Shiming 2013-01-28