Hello,
I’m trying to get confidence scores for words decoded using Sphinx3, and after
reading previous posts I think I have a rough outline of how this is done. I
added +GARBAGE+ to my phone set, then added the entry
++GARBAGE++ +GARBAGE+
to my filler dictionary. I then added ++GARBAGE++ to my transcripts in place
of OOV words or poorly spoken words and recreated the acoustic model. Now,
when I decode with this new model, I believe I’m supposed to output the
lattice structure and measure each recognized word’s distance to a ++GARBAGE++
node. Is that correct? Is this done by just summing the acoustic scores
between nodes until I reach the closest ++GARBAGE++ node? I'm not having
much luck doing that so far, so perhaps I'm creating the model incorrectly?
Thanks,
Nate
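For reference, the setup Nate describes would look something like this; the sample sentence and utterance id are hypothetical:

```
# filler dictionary entry (filler word mapped to the garbage phone)
++GARBAGE++  +GARBAGE+

# training transcript line with an OOV word replaced
<s> SHOW ME ++GARBAGE++ FROM BOSTON </s> (utt001)
```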
There is a specialized binary, sphinx3_conf, that calculates posterior
probabilities from a lattice. An example of its usage can be found in
/sphinx3/src/tests/regression/test-conf.sh
Thanks. My current objective is to just get a confidence score for the entire
utterance, not really worrying about individual words yet. I’ve looked at the
overall confidence score reported at the beginning of the output from
sphinx3_conf, but it seems to correlate more with the length of the utterance
than with the actual WER. I also tried averaging the individual scores for all
non-filler words in the output, but got something that looked similar.
So I was just wondering if I should be normalizing by length, or if there is
something else I need to take into account. I've only looked at 10
documents so far; their WER ranges from 11% to 16%, so I'm assuming
that's not enough data to draw any strong conclusions yet.
Hi nlinde, here is what David said on chat:
(18:28:03) dhd: nshm: utterance confidence score is P(utt|speech) =
P(utt,speech) / P(speech), where the numerator is the best path score and the
denominator is a sum over all paths in the lattice
(18:28:25) dhd: (which can be done easily with the forward algorithm)
(18:28:27) nshm: right, so there must be completely different calculation code
to do that
(18:28:35) nshm: it's what nlinde asking on forums
(18:28:43) dhd: no, ps_lattice.c does it
(18:28:50) nshm: he's asking about sphinx3
(18:28:53) dhd: ohhhhh
(18:29:16) dhd: if you understand DP over graphs it's easy to implement :)
(18:29:33) nshm: let me reply him with that
(18:30:02) dhd: ps lattice code is capable of rescoring lattices obtained from
s3
(18:30:07) dhd: but there's no command-line utility to do it
(18:30:18) nshm: k, good
(18:30:33) dhd: i think the python api can do it
(18:30:40) dhd: if not i should make that possible
OK, thanks, I'll look into the PocketSphinx code and see if I can extract
what you guys mentioned.

I'm using ps_lattice_read to import the lattice that sphinx3 is outputting,
then I'm filling that into search->dag and running ngram_search_bestpath. This
calls ps_lattice_posterior and fills the result into search->post (which I
think is the value I want). Are there any other steps I need to be taking?
David mentions rescoring the lattice, but I wasn't quite sure what he meant by
that -- based on the numbers I'm getting I think I still must be missing a
step.
Hm, what's the problem exactly? It should be a single posterior number, shouldn't it?
Yes it is, but the numbers aren't really lining up as well as I had hoped for
my 10 samples:
WER (%) -- posterior score
10.7 -2394781
10.8 -2400967
11.2 -2439725
11.6 -2202656
11.9 -1820605
12.4 -2097273
12.5 -1422584
12.8 -2422531
13.5 -4501824
15.9 -1809219
The scores still seem to vary more depending on the length of the utterance
than on the WER, which makes me think I'm probably missing something.
This is basically all I'm calling to get the score. My first attempt to
paste the function got mangled; it was supposed to be this:
int32
ngram_search_bestpath2(ps_decoder_t *ps, ps_lattice_t *lat)
{
    int32 blank;

    /* Attach the imported DAG, then run the best-path search;
     * it calls ps_lattice_posterior() and stores the result in
     * search->post, which is what we return. */
    ps->search->dag = lat;
    ngram_search_bestpath(ps->search, &blank, 0);
    return ps->search->post;
}
If I read all this correctly, your score is based on path posterior
probabilities rather than word posterior probabilities.
If that's the case, you won't necessarily find any correlation between it
and WER.