Confidence Scoring Fix

2010-04-20
2012-09-22
  • Jeremie Papon

    Jeremie Papon - 2010-04-20

    Let's fix the confidence scoring already. I'll post a DET curve here in a
    little bit with the results I'm getting right now, but they're not great, so
    needless to say we have some room to work on it.

    Here's my first little fix: The score function in MAPConfidenceScorer.java has
    a loop which has some funky behavior.

        /* start with the first slot */
        int slot = 0;
        for (Token wordToken : wordTokens) {
            String word = wordToken.getWord().getSpelling();
            WordResult wr = null;
            ConfusionSet cs = null;

            /* track through all the slots to find the word */
            while (slot < s.size() && wr == null) {
                cs = s.getConfusionSet(slot);
                wr = cs.getWordResult(word);
                if (wr == null) {
                    slot++;
                }
            }
            if (wr != null) {
                mapPath.add(wr);
            } else {
                cs.dump("Slot " + slot);
                throw new Error("Can't find WordResult in ConfidenceResult slot "
                        + slot + " for word " + word);
            }

            slot++;
        }
    

    I believe that

        slot++

    at the end needs to be

        slot = 0

    otherwise that error occasionally gets thrown.

    Sorry if this fix has already been made; I'm wary of touching the SVN because
    of the extent of the changes I've made to the code (to make it recognize
    handwriting).

    Anyway, is anyone else looking into the confidence scoring? I'd like to toss
    around ideas on how to improve it. I think one of the major problems is that
    for whatever reason the output is essentially binary. It's very rare to get
    scores between 0.99xxx and 0.001xxx.

     
  • Nickolay V. Shmyrev

    Sorry if this fix has already been made; I'm wary of touching the SVN
    because of the extent of the changes I've made to the code (to make it
    recognize handwriting).

    Oh yeah, this "Can't find slot" message was bothering me for a long time. But
    here I don't think that slot=0 is a proper fix. Look, the idea as I understand
    it is to take the best token path and map it to a sausage. You don't want to
    start from the beginning each time, since words can repeat and you'd hit the
    same first occurrence many times. You do need to advance the slot each time a
    word match is found. The issue is that the sausage algorithm can (I suppose,
    since it's the only explanation) map two consecutive words from the best path
    to the same confusion set in the sausage. That's a bug in the sausage
    algorithm, I think.

    Anyway, is anyone else looking into the confidence scoring? I'd like to toss
    around ideas on how to improve it. I think one of the major problems is that
    for whatever reason the output is essentially binary. It's very rare to get
    scores between 0.99xxx and 0.001xxx.

    Nobody is working on it right now, so any help would be much appreciated.
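
    The repeated-word hazard described above can be sketched with plain lists
    standing in for the best token path and the sausage slots; the class and
    method names below are hypothetical stand-ins, not Sphinx4 code. Resetting
    the search index re-matches the earliest slot for a repeated word, while
    advancing it keeps the alignment monotonic:

```java
import java.util.Arrays;
import java.util.List;

/**
 * Toy illustration (not Sphinx4 code): why resetting the slot index to 0
 * before each word can map a repeated word to the wrong confusion set,
 * while a monotonically advancing index keeps the alignment in order.
 */
public class SlotResetDemo {

    /** Match each word to a slot, restarting the search at slot 0 every time. */
    static int[] mapWithReset(List<String> path, List<List<String>> slots) {
        int[] mapping = new int[path.size()];
        for (int w = 0; w < path.size(); w++) {
            int slot = 0;                          // reset: re-scans earlier slots
            while (slot < slots.size() && !slots.get(slot).contains(path.get(w))) {
                slot++;
            }
            mapping[w] = slot;
        }
        return mapping;
    }

    /** Match each word to a slot, advancing monotonically (never re-scanning). */
    static int[] mapAdvancing(List<String> path, List<List<String>> slots) {
        int[] mapping = new int[path.size()];
        int slot = 0;
        for (int w = 0; w < path.size(); w++) {
            while (slot < slots.size() && !slots.get(slot).contains(path.get(w))) {
                slot++;
            }
            mapping[w] = slot;
            slot++;                                // move past the matched slot
        }
        return mapping;
    }

    public static void main(String[] args) {
        // Best path repeats "the"; the sausage has one slot per word.
        List<String> path = Arrays.asList("the", "cat", "the");
        List<List<String>> slots = Arrays.asList(
                Arrays.asList("the", "a"),
                Arrays.asList("cat", "cap"),
                Arrays.asList("the", "thee"));

        System.out.println(Arrays.toString(mapWithReset(path, slots)));   // [0, 1, 0]
        System.out.println(Arrays.toString(mapAdvancing(path, slots)));   // [0, 1, 2]
    }
}
```

    Under the reset strategy the second "the" maps back to slot 0, which is
    exactly the wrong-occurrence behavior described above.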

     
  • Jeremie Papon

    Jeremie Papon - 2010-04-20

    Yeah, I thought it seemed like too easy of a fix.
    It worked fine for me because I was testing on single words. I'll delve into
    it further right now.
    Anyway, here's the type of output the confidence scorer currently gives:

    The ROC is a little odd once you get past ~0.35 False alarm rate because the
    two quantities being compared are both asymptotic; the values past 0.35 should
    just be ignored.

     
  • Jeremie Papon

    Jeremie Papon - 2010-04-20

    But basically what both curves tell you is that it doesn't matter where you
    put the confidence threshold, you essentially get the same output.
    Which makes sense, because what the "Confidence Scorer" is really giving you
    is whether or not the output path with the highest likelihood agrees with the
    path with the lowest WER.
    Calling it a confidence score is deceptive; it really just gives a 1 if a word
    belongs to the most likely path AND the approximate MAP path. If not, it gives
    a zero.
    There's really only one case where it will give scores that aren't 0 or 1.
    That's when the word hypothesis from the path with the highest likelihood also
    has a high posterior in the MAP scorer, AND there's another hypothesis from
    the MAP scorer in the same slot with a similar posterior probability.
    Then you have two (or more) posteriors which are similar (one of which is part
    of the maximum likelihood output), and so you'll get a somewhat meaningful
    confidence score output.
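
    The near-binary behavior can be made concrete with a toy per-slot score
    (invented numbers, not actual scorer output): treat a word's confidence as
    its normalized posterior within the confusion slot. The score only leaves
    the extremes when two hypotheses carry comparable mass, as described above.

```java
/**
 * Toy numbers, not Sphinx4 output: a per-slot confidence computed as the
 * normalized posterior of the top hypothesis. When one hypothesis dominates
 * the slot, the score saturates near 1 (effectively binary); a mid-range
 * score only appears when two hypotheses carry comparable mass.
 */
public class SlotConfidence {

    /** Posterior of the best hypothesis, normalized over the slot. */
    static double confidence(double[] posteriors) {
        double sum = 0, best = Double.NEGATIVE_INFINITY;
        for (double p : posteriors) {
            sum += p;
            best = Math.max(best, p);
        }
        return best / sum;
    }

    public static void main(String[] args) {
        // Typical slot: one path overdominates -> score ~ 1
        System.out.println(confidence(new double[]{0.999, 0.0007, 0.0003}));
        // Rare slot: two competing hypotheses -> a genuinely informative score
        System.out.println(confidence(new double[]{0.55, 0.45}));
    }
}
```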

     
  • Jeremie Papon

    Jeremie Papon - 2010-04-21

    sfadhgl;ashgf'ka

    Screw it, I'm going home.
    I'll figure out a creative way to fix the score function tomorrow...
    But as I said, I don't think it's possible to make a "perfect" sausage maker.
    I don't have the time it would take in any case, especially to fix something
    that's part of an algorithm whose usefulness is questionable to begin with.

     
  • Antoine Raux

    Antoine Raux - 2010-04-21

    jpapon, your observations match mine (on standard speech recognition). Somehow
    the best path through the lattice almost always overdominates everything else,
    except in a few very rare cases with one-word utterances where the two
    alternative words get about the same score (so confidence is around 0.5). I
    think the key problem is within the lattice itself (rather than the sausage
    maker) since the posterior probabilities (computed directly on the lattice)
    reflect the same trends. I didn't spend a whole lot of time on this but I
    haven't found any obvious issue with the forward-backward code to compute the
    posteriors in Lattice.java. It might be an issue with how the lattices are
    built in the first place?
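
    For intuition, here is a miniature forward-backward pass over a hand-built
    word lattice, done in the probability domain for readability; the topology
    and edge weights are invented, and Lattice.java does the equivalent
    computation in log space over real acoustic/language scores. The point is
    only that a node's posterior is alpha(node) * beta(node) / total path mass.

```java
import java.util.*;

/**
 * A miniature forward-backward pass over a hand-built word lattice, in the
 * probability domain for readability. Topology and edge weights are invented
 * for illustration only.
 */
public class ToyForwardBackward {

    /** edges.get(from).get(to) = transition probability; order must be topological. */
    static Map<String, Double> posteriors(List<String> order,
                                          Map<String, Map<String, Double>> edges) {
        Map<String, Double> alpha = new HashMap<>(), beta = new HashMap<>();
        for (String n : order) alpha.put(n, 0.0);
        for (String n : order) beta.put(n, 0.0);

        alpha.put(order.get(0), 1.0);                      // forward pass
        for (String from : order)
            edges.getOrDefault(from, Map.of()).forEach((to, p) ->
                    alpha.merge(to, alpha.get(from) * p, Double::sum));

        beta.put(order.get(order.size() - 1), 1.0);        // backward pass
        for (int i = order.size() - 1; i >= 0; i--) {
            String from = order.get(i);
            edges.getOrDefault(from, Map.of()).forEach((to, p) ->
                    beta.merge(from, beta.get(to) * p, Double::sum));
        }

        double total = alpha.get(order.get(order.size() - 1));
        Map<String, Double> post = new LinkedHashMap<>();
        for (String n : order) post.put(n, alpha.get(n) * beta.get(n) / total);
        return post;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> edges = Map.of(
                "<s>", Map.of("the", 0.6, "a", 0.4),
                "the", Map.of("cat", 0.5, "cap", 0.5),
                "a",   Map.of("cat", 1.0),
                "cat", Map.of("</s>", 1.0),
                "cap", Map.of("</s>", 1.0));
        List<String> order = List.of("<s>", "the", "a", "cat", "cap", "</s>");
        posteriors(order, edges).forEach((n, p) -> System.out.println(n + " " + p));
    }
}
```

    On this toy lattice "cat" gets posterior 0.7 (reachable via both "the" and
    "a"), so a dominance problem like the one observed would have to come from
    the scores feeding the lattice rather than from this computation itself.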

     
  • Jeremie Papon

    Jeremie Papon - 2010-04-21

    Well there's one obvious bug in MAPConfidenceScorer.score. It takes the best
    Token path, and the output from the sausage maker, and attempts to traverse
    the sausage using the words from the best token path. Unfortunately there are
    cases where the best token path and the force-aligned sausage paths don't line
    up. This occurs when you have multiple tokens of the same word, one of which
    is very low scoring and spans two slots. All the tokens for the word then get
    merged incorrectly.
    I think the best way to fix it would be to do an initial pruning, where low
    scoring word tokens that have a higher scoring token for the same word in a
    similar time slot get pruned. I'll try this right now.

    The other issue is that it outputs the scores of the best token path; it would
    be more interesting (I think) to output the best candidate from each slot, as
    well as alternatives if there are any with scores within a certain %.
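
    That per-slot output could look roughly like the sketch below, where a
    plain map stands in for Sphinx4's ConfusionSet and the 80% margin is an
    arbitrary choice:

```java
import java.util.*;

/**
 * Sketch of per-slot output: for each confusion slot, emit the best-scoring
 * hypothesis first, then any alternatives whose posterior is within a fixed
 * fraction of the winner. The Map-based "slot" is a stand-in for a real
 * confusion set; the margin value is arbitrary.
 */
public class SlotCandidates {

    /** Return the slot's best word first, then alternatives within margin * best. */
    static List<String> candidates(Map<String, Double> slot, double margin) {
        double best = Collections.max(slot.values());
        List<String> out = new ArrayList<>();
        slot.entrySet().stream()
                .sorted((a, b) -> Double.compare(b.getValue(), a.getValue()))
                .filter(e -> e.getValue() >= margin * best)
                .forEach(e -> out.add(e.getKey()));
        return out;
    }

    public static void main(String[] args) {
        Map<String, Double> slot = new LinkedHashMap<>();
        slot.put("cat", 0.50);
        slot.put("cap", 0.45);   // close competitor: worth reporting
        slot.put("cut", 0.05);   // far below the winner: dropped
        System.out.println(candidates(slot, 0.8));   // [cat, cap]
    }
}
```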

     
  • Jeremie Papon

    Jeremie Papon - 2010-04-21

    Ok, I think I've fixed it.
    I added the following function to SausageMaker.java:

        /*
         * Eliminate words with extremely low scores which have better scoring
         * tokens of the same word in the same cluster. They just make the
         * alignment more difficult.
         */
        protected void eliminatebadWords(List<Cluster> clusters) {
            for (Cluster c : clusters) {
                double max = Double.NEGATIVE_INFINITY;
                String maxNodeId = "";
                for (Node node : c.getElements()) {
                    if (node.getPosterior() > max) {
                        max = node.getPosterior();
                        maxNodeId = node.getId();
                    }
                }
                List<Node> tempNodeList = c.getElements();
                Vector<Node> removeNodes = new Vector<Node>();
                for (Node testNode : c) {
                    if (Math.abs(testNode.getPosterior() * 0.05) > Math.abs(max))
                        removeNodes.add(testNode);
                }
                for (Node removeNode : removeNodes)
                    tempNodeList.remove(removeNode);
                c.setElements(tempNodeList);
            }
        }
    

    And then I inserted a call to it in SausageMaker.makeSausage between the intra
    and inter word clustering functions:

        /**
         * Turn the lattice contained in this sausage maker into a sausage object.
         *
         * @return the sausage produced by collapsing the lattice.
         */
        public Sausage makeSausage() {
            List<Cluster> clusters = new ArrayList<Cluster>(lattice.nodes.size());
            for (Node n : lattice.nodes.values()) {
                n.cacheDescendants();
                Cluster bucket = new Cluster(n);
                clusters.add(bucket);
            }
            intraWordCluster(clusters);
            eliminatebadWords(clusters);
            interWordCluster(clusters);
            clusters = topologicalSort(clusters);
            return sausageFromClusters(clusters);
        }
    

    Test it out if you want; as far as I can tell, it eliminates the problem of
    unlikely words mucking up the sausage making.

    Not the most graceful solution, but I couldn't think of any other way to
    fix it.

     
  • Nickolay V. Shmyrev

    Hi Jeremie,

    I actually wonder whether it's the right approach at all to map the best
    path onto the sausage. It's probably better to take another best path from
    the sausage than to search for an impossible thing.

    Some time ago I wrote a plan for the confidence work:

    1. Get some baseline results on what we have now. We need to pin down the
    numbers.

    2. Check how various scoring codes (sphinx3, SRILM) perform with lattices
    produced by the sphinx3 decoder.

    3. Check if lattices built with various decoders (julius/HDecode/pocketsphinx)
    perform better.

    4. Check if decoder parameters affect the quality of the lattices. Best
    accuracy doesn't mean the best confidence measure.

    5. Think about more advanced confidence measures, like combinations of MAP,
    semantics, and duration, which get a lot of attention in current research.

    The idea is basically to try to combine sphinx4 with reliable components like
    SRILM. Then we'll probably get closer to understanding the problem.

     
