Hi,
I have referred to some threads on a similar topic but I'm still confused. I'm
using sphinx 3.8 and got the following information in the decoder log file:
How do I compute the log likelihood log P(O|W) for any of the words? Should I
use the information after fv: and simply do
(AScr + LMScr) + AScale
or should I look at the scores after FWDXCT and add
wa: acoustic score for the word
and
wl: LM score for the word
(following the convention in
http://www.cs.cmu.edu/~archan/s_info/Sphinx3/doc/s3_description.html#sec_hypseg)?
But these numbers differ from those after fv: (for example, AScr(unnorm) for
<sil> is 587490 in one place whereas it is -1538274 in another). Do we add some
other scaling number to get the same scores?
Secondly, as I figure from
http://www.cs.cmu.edu/~archan/s_info/Sphinx3/doc/s3_description.html#sec_hypseg,
the AScrs for all the words (in hypseg) should be <= 0.
Which of these two sets of numbers should be used if one wants to know the log
likelihood log P(O|W) for a word/filler?
Thanks.
Oops, that log was created by sphinx 3.6, not 3.8.
Sorry, I realized that these two sets of numbers show the same information.
The format for FWDXCT is first AScr, then LMScr, and then the word (I was
considering the word to be first).
But one question remains: should the AScale factor be added to AScr + LMScr
(in order to get log P(O|W) + log P(W), or a monotonic function of it)?
Thanks.
The issue is that the acoustic score is not a probability even at the training
stage, because the gaussians are not normalized. You will need to rescore it
anyway. Why do you need that log P(O|W) + log P(W) at all?
Hi,
I need log P(O|W) because I'm trying to implement "Confidence Measures in
Speech Recognition based on Probability Distribution of Likelihoods".
There they talk of b(O) (eqn. 3, p. 3), which is basically the state-level
acoustic likelihood of the observation vector O for a given frame. Is the
(unnormalized) acoustic score for a word, as given in the Sphinx decoder log,
equal to the sum of b(O) over all frames of that word?
The authors used the sphinx3 decoder and were able to find out which frames
the individual states belong to; having obtained the state-level acoustic
log-likelihood scores, they use them in their algorithm to get phone-level and
word-level confidence measures. This is why I need to know whether the
acoustic score given by Sphinx is akin to b(O) (i.e., summed over all the
phones, with the proper mixture weights multiplied into the individual
gaussians).
Secondly, I am not able to figure out how they got to know which frames belong
to which states. For example, suppose the word "ON" was decoded between frames
20 and 40, with /o/ covering frames 20-35 and /n/ covering frames 36-40. If
the HMM of /o/ has three states (state1, state2, state3), then state1 might
cover frames 20-23, state2 frames 24-26, and so on.
Is there a flag to be set in the decoder which will also give this state-level
decomposition of frames?
Please help. Thanks in advance for any pointers!
P.S. I gather that we need to force-align the phone transcription of the
decoded output with the utterance to get the individual phone boundaries and
acoustic scores (please correct me if this method is incorrect). But I'm not
sure how to find out which frames have been assigned to which individual
states within the phones.
(Perhaps I should start another thread for the second question.) I found that
there is a flag -stsegdir with which the state-level segmentation can be
dumped, but the .stseg file that is created is in binary format. According to
this decoder overview,
http://www.cs.cmu.edu/~archan/s_info/Sphinx3/doc/s3_overview.html, there is a
utility stseg-read which can be used to read this binary file.
But I could not locate stseg-read in either sphinx3 or SphinxTrain. Was it
removed?
Thanks.
This is not a very clean paper, and its terminology is muddled, though the
idea they present is OK. I would suggest you look for some other paper on the
same subject. Actually, the method they are using was first proposed by
Sheryl R. Young in 1994 in "Recognition Confidence Measures: Detection of
Misrecognitions and Out-Of-Vocabulary Words".
You can unscale the log-likelihood score, since the scale factor per frame is
known in the decoder. And you don't need to have a probability in each state;
you can do the same procedure with the score, which is not a probability but
has the same properties.
I'm also not aware of such a tool. It may have existed but was later lost to
history. You can work out the format of the stseg file by looking at the
source code.
Hey, thanks for the reference, I'm reading it.
As far as stseg is concerned, I modified main_align.c (the write_stseg
function) so that it writes the information in a readable format, so I can now
see the state-level segmentation/scores :)
I just have a couple of questions regarding "You can unscale the log-likelihood
score, since the scale factor per frame is known in the decoder":
I'm going through the source code (it's huge for me!) to understand the
scaling, etc. Could you please tell me how to access this "scale factor for
each frame" (the variable name, if you could)?
I gather that I can first run sphinx3_decode and then feed the decoder output
(as -insent) to sphinx3_align to get the various (phone/state) segmentations.
Is this segmentation information also available from the decoder itself, which
would save calling sphinx3_align?
Thanks.
int32 *ascale; /**< Same as senscale but it records the senscale for the whole
sentence. The default size is 3000 frames. */
in include/srch.h
I would modify the sphinx3 decoder to track the phone sequence of the best
match. It doesn't do that currently, but it should be easy to add. Actually, I
recommend you start working with sphinx4: all those things are WAY EASIER to
implement in sphinx4 than in sphinx3.
Hi pranavi,
Do you know how to get the score? I also need this score for pronunciation
evaluation, but I have the same problem as you now. How did you solve it? Can
you tell me?
Thank you for any help!
amy