For the Sphinx 3 aligner output, I was wondering why the acoustic scores for some phones are sometimes positive numbers. If these are log-likelihood probabilities in log base 1.0001, shouldn't they all be negative? Is there scaling happening? If so, is there a way I can obtain the real scores?
SFrm EFrm  SegAScr Phone
   0    2   -54898 SIL
   3    5  -219021 SIL
   6   12  -307350 M SIL IY b
  13   32   131837 IY M SIL e
  33   44   345816 SIL
  45   68   176492 SIL
  69  117   126858 SIL
Total score: 199734
Acoustic scores are densities, not probabilities. They are not necessarily less
than 1.
Sphinx3 aligner output is unscaled.
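A minimal sketch of the conversion, assuming the -logbase of 1.0001 mentioned above: it turns a SegAScr value back into a natural-log density and shows that the corresponding density can legitimately be greater than 1.

    import math

    # The aligner's log base (1.0001 in this thread); sphinx3 stores scores
    # as integer logarithms in this base.
    LOGBASE = 1.0001

    def segascr_to_natural_log(seg_ascr):
        """Convert an integer SegAScr value to a natural-log density."""
        return seg_ascr * math.log(LOGBASE)

    # Example row from the output above: phone IY scored 131837.
    ln_density = segascr_to_natural_log(131837)
    print(ln_density)            # roughly 13.18 in natural log
    print(math.exp(ln_density))  # much greater than 1: a density, not a probability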
Thanks! Is there a way to obtain likelihood probabilities for the phones in a
word using the aligner?
I am trying to see if I can rate the phonetic breakup of the pronunciation of
a word using the aligner. For example, if the user says PEAK (P IY K), I
would want to determine the quality of the individual (context-dependent)
phones and then give feedback on pronunciation.
No, the aligner doesn't print that. The aligner is for alignment, not for phone
evaluation.
For pronunciation evaluation, please see the FAQ:
http://cmusphinx.sourceforge.net/wiki/faq#qhow_to_implement_pronunciation_evaluation
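Roughly, the usual goodness-of-pronunciation idea behind such evaluation is to compare the forced-alignment acoustic score of each phone against the best unconstrained (phone-loop) score over the same frames. The sketch below only illustrates that idea, not the exact recipe from the FAQ, and both score inputs are hypothetical numbers.

    import math

    def phone_gop(aligned_score, phoneloop_score, n_frames, logbase=1.0001):
        """Goodness-of-pronunciation style value for one phone: the per-frame
        gap between the forced-alignment score and the best unconstrained
        phone-loop score, both given as integer logs in `logbase`."""
        gap = (aligned_score - phoneloop_score) * math.log(logbase)
        return gap / max(n_frames, 1)

    # Hypothetical numbers: a phone whose forced-alignment score is close to
    # the free phone-loop score (value near 0) is probably well pronounced.
    print(phone_gop(aligned_score=131837, phoneloop_score=140000, n_frames=20))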
I see. Thanks again! Is there documentation on exactly what the aligner scores
represent then? I could find resources for the decoder, but it is not clear
what the aligner output means. Any pointers would be nice.
Any help on this would be appreciated. Is there a detailed description of the
acoustic scores for Sphinx3 somewhere?
I ended up finding answers to my own questions! :) In case anyone ever follows
this post, this is how they got resolved:
http://cmusphinx.sourceforge.net/wiki/sphinx4:outstandingissues#acoustic_scoring
Hello nshymrev,
I'm upscaling the sphinx3_align scores using the .bsenscr files produced by
sphinx3_decode (using the word segmentation info in .wdseg, I add the
corresponding scores from *.bsenscr).
Is this procedure correct?
If it is, why would the same word's score as given by sphinx3_decode be
different from the one obtained by the method above? Internally, sphinx3_decode
does this upscaling by itself and reports the score.
I'm getting different scores even when the word boundaries in sphinx3_decode
and sphinx3_align are exactly the same!
Could it be because the phone segmentation assumed by sphinx3_decode is
different from the one assumed by sphinx3_align?
Thanks.
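A minimal sketch of the procedure described above, assuming .wdseg rows look like "SFrm EFrm Ascr Word" and .bsenscr provides one scale value per frame; the exact file layouts depend on the sphinx3 build and options, so treat this as an illustration only (file names below are hypothetical).

    def load_frame_scales(bsenscr_path):
        """Hypothetical reader: one integer scale factor per frame, one per line."""
        with open(bsenscr_path) as f:
            return [int(line.split()[-1]) for line in f if line.strip()]

    def unscaled_word_scores(wdseg_path, frame_scales):
        """Add the per-frame scale factors back onto each word's scaled score."""
        words = []
        with open(wdseg_path) as f:
            for line in f:
                parts = line.split()
                if len(parts) < 4 or not parts[0].isdigit():
                    continue  # skip the header and the total-score line
                sfrm, efrm, ascr = int(parts[0]), int(parts[1]), int(parts[2])
                word = " ".join(parts[3:])
                ascr_unscaled = ascr + sum(frame_scales[sfrm:efrm + 1])
                words.append((word, sfrm, efrm, ascr_unscaled))
        return words

    # scales = load_frame_scales("utt0001.bsenscr")
    # for word, s, e, score in unscaled_word_scores("utt0001.wdseg", scales):
    #     print(word, s, e, score)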
Hi Pranav,
In your place I would disable scaling in s3 altogether in the sources and go
to sleep in a good mood.