Let's fix the confidence scoring already. I'll post a DET curve here in a
little bit with the results I'm getting right now, but they're not great, so
needless to say we have some room to work on it.
Here's my first little fix: The score function in MAPConfidenceScorer.java has
a loop which has some funky behavior.
    /* start with the first slot */
    int slot = 0;
    for (Token wordToken : wordTokens) {
        String word = wordToken.getWord().getSpelling();
        WordResult wr = null;
        ConfusionSet cs = null;
        /* track through all the slots to find the word */
        while (slot < s.size() && wr == null) {
            cs = s.getConfusionSet(slot);
            wr = cs.getWordResult(word);
            if (wr == null) {
                slot++;
            }
        }
        if (wr != null) {
            mapPath.add(wr);
        } else {
            cs.dump("Slot " + slot);
            throw new Error("Can't find WordResult in ConfidenceResult slot "
                    + slot + " for word " + word);
        }
        slot++;
    }
I believe that slot++ at the end needs to be slot = 0; otherwise that error
occasionally gets thrown.
Sorry if this fix has already been made; I'm wary of touching the SVN because
of the extent of the changes I've made to the code (to make it recognize
handwriting).
Anyway, is anyone else looking into the confidence scoring? I'd like to toss
around ideas on how to improve it. I think one of the major problems is that
for whatever reason the output is essentially binary. It's very rare to get
scores between 0.99xxx and 0.001xxx.
> Sorry if this fix has already been made; I'm wary of touching the SVN
> because of the extent of the changes I've made to the code (to make it
> recognize handwriting).
Oh yeah, this "Can't find slot" message was bothering me for a long time. But
I don't think that slot = 0 is a proper fix here. Look, the idea as I
understand it is to take the best token path and map it onto the sausage. You
don't want to start from the beginning each time, since words can repeat and
you'd hit the same first occurrence many times. You do indeed need to advance
the slot each time a word match is found. The issue is that the sausage
algorithm can (I suppose, since it's the only explanation) map two consecutive
words from the best path to the same confusion set in the sausage. I think
that's a bug in the sausage algorithm.
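To make the repeated-word problem concrete, here's a toy sketch (plain Java on hypothetical slot/path data, not the actual Sphinx-4 classes) comparing the two traversal strategies:

```java
import java.util.*;

public class SlotTraversalDemo {
    /** Map each best-path word to a sausage slot, advancing monotonically
     *  (what the original loop does). */
    static List<Integer> advance(List<Set<String>> slots, List<String> path) {
        List<Integer> mapping = new ArrayList<>();
        int slot = 0;
        for (String word : path) {
            while (slot < slots.size() && !slots.get(slot).contains(word)) slot++;
            if (slot == slots.size())
                throw new IllegalStateException("Can't find slot for " + word);
            mapping.add(slot);
            slot++; // move past the matched slot
        }
        return mapping;
    }

    /** Same, but restarting from slot 0 for every word (the proposed slot = 0 fix). */
    static List<Integer> restart(List<Set<String>> slots, List<String> path) {
        List<Integer> mapping = new ArrayList<>();
        for (String word : path) {
            int slot = 0;
            while (slot < slots.size() && !slots.get(slot).contains(word)) slot++;
            if (slot == slots.size())
                throw new IllegalStateException("Can't find slot for " + word);
            mapping.add(slot);
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<Set<String>> slots =
            List.of(Set.of("the"), Set.of("cat"), Set.of("the"), Set.of("mat"));
        List<String> path = List.of("the", "cat", "the", "mat");
        System.out.println(advance(slots, path)); // [0, 1, 2, 3] -- correct alignment
        System.out.println(restart(slots, path)); // [0, 1, 0, 3] -- second "the" maps
                                                  // back to the first occurrence
    }
}
```

With restarting, the repeated word maps back to slot 0, which is exactly the first-occurrence problem described above; advancing is the right behavior as long as the sausage maker never merges two consecutive best-path words into one slot.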
> Anyway, is anyone else looking into the confidence scoring? I'd like to toss
> around ideas on how to improve it. I think one of the major problems is that
> for whatever reason the output is essentially binary. It's very rare to get
> scores between 0.99xxx and 0.001xxx.
Nobody is working on it right now, so any help would be much appreciated.
Yeah, I thought it seemed like too easy of a fix.
It worked fine for me because I was testing on single words. I'll delve into
it further right now.
Anyway, here's the type of output the confidence scorer currently gives:
The ROC is a little odd once you get past ~0.35 False alarm rate because the
two quantities being compared are both asymptotic; the values past 0.35 should
just be ignored.
But basically what both curves tell you is that it doesn't matter where you
put the confidence threshold, you essentially get the same output.
Which makes sense, because what the "Confidence Scorer" is really giving you
is whether or not the output path with the highest likelihood agrees with the
path with the lowest WER.
Calling it a confidence score is deceptive; it really just gives a 1 if a word
belongs to the most likely path AND the approximate MAP path. If not, it gives
a zero.
There's really only one case where it will give scores that aren't 0 or 1.
That's when the word hypothesis from the path with the highest likelihood also
has a high posterior in the MAP scorer, AND there's another hypothesis from
the MAP scorer in the same slot with a similar posterior probability.
Then you have two (or more) posteriors which are similar (one of which is part
of the maximum likelihood output), and so you'll get a somewhat meaningful
confidence score output.
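As a toy illustration of that near-binary behavior (hypothetical numbers and a made-up helper, not the Sphinx-4 API):

```java
import java.util.Map;

public class SlotConfidenceDemo {
    /** Confidence of the best-path word = its share of the posterior mass in the slot. */
    static double confidence(Map<String, Double> slotPosteriors, String bestPathWord) {
        double total = slotPosteriors.values().stream()
                .mapToDouble(Double::doubleValue).sum();
        return slotPosteriors.getOrDefault(bestPathWord, 0.0) / total;
    }

    public static void main(String[] args) {
        // Usual case: one hypothesis dominates its slot, so the score is near-binary
        System.out.println(confidence(Map.of("seven", 0.999, "eleven", 0.001), "seven"));
        // The rare case described above: a competing hypothesis with a similar
        // posterior yields a genuinely informative score near 0.5
        System.out.println(confidence(Map.of("seven", 0.51, "heaven", 0.49), "seven"));
    }
}
```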
Yeah, so you're right. It's incorrectly combining things into one time slot,
because the clustering process doesn't know what to do with these complicated
overlap cases. And I'm not sure there is actually a "right" way to combine
them. I think if it's changed, it will wind up having this error for a
different case. It doesn't seem possible to simplify the lattices like this
without occasionally causing a mismatch with the decoder output.
In my case, for example, I'm testing on one word. And the sausage maker
correctly combines everything into one slot:

    --- ----
    <sil> </sil>

Thing is, the decoder output inserted sil in front of the word. So it finds
the sil, then looks for the word in that slot, and can't find it.
Of course the correct sausage output would have been:

    <noop> </noop>
    <sil> </sil>

But again, I don't think it's possible to design a set of "sausage maker"
rules which correctly do that without causing a similar problem for other
cases.
A better solution might be to just watch out for cases like this and check
the previous slot in the confidence scorer, or something along those lines.
Screw it, I'm going home.
I'll figure out a creative way to fix the score function tomorrow...
But as I said, I don't think it's possible to make a "perfect" sausage maker.
I don't have the time it would take in any case, especially to fix something
that's part of an algorithm whose usefulness is questionable to begin with.
jpapon, your observations match mine (on standard speech recognition). Somehow
the best path through the lattice almost always dominates everything else,
except in a few very rare cases with one-word utterances where the two
alternative words get about the same score (so confidence is around 0.5). I
think the key problem is within the lattice itself (rather than the sausage
maker) since the posterior probabilities (computed directly on the lattice)
reflect the same trends. I didn't spend a whole lot of time on this but I
haven't found any obvious issue with the forward-backward code to compute the
posteriors in Lattice.java. It might be an issue with how the lattices are
built in the first place?
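For reference, here's a minimal forward-backward sketch on a tiny hand-built lattice (generic DAG code, not the actual Lattice.java implementation). It also shows how quickly a modest score gap pushes posteriors toward 0 and 1:

```java
import java.util.*;

public class LatticePosteriorDemo {
    /** Numerically stable log(exp(a) + exp(b)). */
    static double logAdd(double a, double b) {
        if (a == Double.NEGATIVE_INFINITY) return b;
        if (b == Double.NEGATIVE_INFINITY) return a;
        double m = Math.max(a, b);
        return m + Math.log1p(Math.exp(Math.min(a, b) - m));
    }

    /** Edge posteriors via forward-backward. Nodes must be topologically
     *  numbered (0 = start, n-1 = end) and edges sorted by source node. */
    static double[] edgePosteriors(int n, int[][] edges, double[] logScores) {
        double[] alpha = new double[n], beta = new double[n];
        Arrays.fill(alpha, Double.NEGATIVE_INFINITY);
        Arrays.fill(beta, Double.NEGATIVE_INFINITY);
        alpha[0] = 0.0;     // forward pass from the start node
        beta[n - 1] = 0.0;  // backward pass from the end node
        for (int e = 0; e < edges.length; e++)
            alpha[edges[e][1]] =
                logAdd(alpha[edges[e][1]], alpha[edges[e][0]] + logScores[e]);
        for (int e = edges.length - 1; e >= 0; e--)
            beta[edges[e][0]] =
                logAdd(beta[edges[e][0]], beta[edges[e][1]] + logScores[e]);
        double logZ = alpha[n - 1];  // total probability mass of the lattice
        double[] post = new double[edges.length];
        for (int e = 0; e < edges.length; e++)
            post[e] = Math.exp(alpha[edges[e][0]] + logScores[e]
                               + beta[edges[e][1]] - logZ);
        return post;
    }

    public static void main(String[] args) {
        // Two competing word arcs between start (0) and end (3), via nodes 1 and 2
        int[][] edges = {{0, 1}, {0, 2}, {1, 3}, {2, 3}};
        double[] scores = {-1, -3, -1, -1};  // log scores
        double[] post = edgePosteriors(4, edges, scores);
        // A gap of only 2 nats already pushes the posteriors to ~0.88 vs ~0.12
        System.out.printf("%.3f %.3f%n", post[0], post[1]);
    }
}
```

A useful sanity check on any forward-backward implementation is that the posteriors of competing arcs in the same region sum to 1, as they do here.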
Well there's one obvious bug in MAPConfidenceScorer.score. It takes the best
Token path, and the output from the sausage maker, and attempts to traverse
the sausage using the words from the best token path. Unfortunately there are
cases where the best token path and the force-aligned sausage paths don't line
up. This occurs when you have multiple tokens of the same word, one of which
is very low scoring, and spans two slots. The tokens for the word all then get
merged incorrectly.
I think the best way to fix it would be to do an initial pruning, where a
low-scoring word token gets pruned if a higher-scoring token for the same word
exists in a similar time slot. I'll try this right now.
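A rough sketch of that pruning idea (the WordTok record is hypothetical, standing in for whichever fields the real word tokens expose):

```java
import java.util.*;

/** Hypothetical word token: just the fields the pruning idea needs,
 *  not Sphinx-4's actual Token class. */
record WordTok(String word, int startFrame, int endFrame, double logScore) {}

public class TokenPruneDemo {
    static boolean overlaps(WordTok a, WordTok b) {
        return a.startFrame() < b.endFrame() && b.startFrame() < a.endFrame();
    }

    /** Drop a token when a higher-scoring token for the same word
     *  overlaps it in time. */
    static List<WordTok> prune(List<WordTok> toks) {
        List<WordTok> kept = new ArrayList<>();
        for (WordTok t : toks) {
            boolean dominated = false;
            for (WordTok o : toks)
                if (o != t && o.word().equals(t.word()) && overlaps(o, t)
                        && o.logScore() > t.logScore())
                    dominated = true;
            if (!dominated) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<WordTok> toks = List.of(
            new WordTok("cat", 0, 10, -5.0),
            new WordTok("cat", 2, 18, -40.0),  // low-scoring duplicate spanning two slots
            new WordTok("mat", 10, 20, -6.0));
        System.out.println(prune(toks)); // only the -40.0 "cat" token is removed
    }
}
```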
The other issue is that it outputs the scores of the best token path; it would
be more interesting (I think) to output the best candidate from each slot, as
well as alternatives if there are any with scores within a certain %.
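Sketching that per-slot output idea (hypothetical data, reading "within a certain %" as a ratio threshold on the posteriors):

```java
import java.util.*;

public class SlotAlternativesDemo {
    /** For one confusion set (word -> posterior), return the best candidate
     *  plus any alternatives whose posterior is within `ratio` of the best. */
    static List<String> candidates(Map<String, Double> slot, double ratio) {
        double best = Collections.max(slot.values());
        List<String> out = new ArrayList<>();
        slot.entrySet().stream()
            .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
            .filter(e -> e.getValue() >= best * ratio)
            .forEach(e -> out.add(e.getKey()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> slot = Map.of("seven", 0.50, "heaven", 0.45, "eleven", 0.05);
        // keep alternatives within 80% of the best posterior
        System.out.println(candidates(slot, 0.8)); // [seven, heaven]
    }
}
```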
Ok, I think I've fixed it.
I added the following function to SausageMaker.java:
And then I inserted a call to it in SausageMaker.makeSausage between the intra
and inter word clustering functions:
Test it out if you want; as far as I can tell it eliminates the problem of
unlikely words mucking up the sausage making.
Not the most graceful solution, but I couldn't think of any other way to fix
it.
Hi Jeremie
I actually wonder if this is the right approach at all, to pull the best path
from the sausage. It's probably better to take another best path from the
sausage than to search for something impossible.
Some time ago I wrote a plan for the confidence work:
1. Get some baseline results on what we have now. We need to fix the numbers.
2. Check how various scoring codes (sphinx3, SRILM) perform with lattices produced by the sphinx3 decoder.
3. Check if lattices built with various decoders (julius/HDecode/pocketsphinx) perform better.
4. Check if decoder parameters affect the quality of the lattices. Best accuracy doesn't mean the best confidence measure.
5. Think about more advanced confidence measures, like combinations of MAP, semantics, and duration, which get a lot of attention in current research.
The idea is basically to try to combine sphinx4 with reliable components like
SRILM. We'll probably get closer to understanding the problem then.