
Word confidence in Sphinx 3

  • n lindle

    n lindle - 2010-03-09

    Hello,
    I’m trying to get confidence scores for words decoded using Sphinx3, and after
    reading previous posts I think I have a rough outline of how this is done. I
    added +GARBAGE+ to my phone set, then added the entry
    ++GARBAGE++ +GARBAGE+
    to my filler dictionary. I then added ++GARBAGE++ to my transcripts in place
    of OOV words or poorly spoken words and recreated the acoustic model. Now,
    when I decode with this new model, I believe I’m supposed to output the
    lattice structure and measure each recognized word’s distance to a ++GARBAGE++
    node. Is that correct? Is this done by just summing the acoustic scores
    between nodes until I can reach the closest ++GARBAGE++ node? I’m not having
    much luck doing that so far; perhaps I’m creating the model incorrectly?
    Thanks,
    Nate
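
    For reference, a filler dictionary set up this way might look roughly like
    the sketch below; the silence entries are just the usual defaults, and only
    the ++GARBAGE++ line is the one described in this post:

    <s>          SIL
    </s>         SIL
    <sil>        SIL
    ++GARBAGE++  +GARBAGE+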

     
  • Nickolay V. Shmyrev

    There is a specialized binary, sphinx3_conf, that calculates posterior
    probabilities from a lattice. An example of its usage can be found in
    /sphinx3/src/tests/regression/test-conf.sh

     
  • n lindle

    n lindle - 2010-03-15

    Thanks. My current objective is to just get a confidence score for the entire
    utterance, not really worrying about individual words yet. I’ve looked at the
    overall confidence score reported at the beginning of the output from
    sphinx3_conf, but it seems to correlate more with the length of the utterance
    than with the actual WER. I also tried averaging the individual scores for all
    non-filler words in the output, but got something that looked similar.

    So I was just wondering if I should be normalizing by length, or if there was
    something else I need to be taking into account. I’ve only looked at 10
    documents so far; the WER on them ranges from 11% to 16%, so I’m assuming
    that’s not enough data to draw any strong conclusions yet.

     
  • Nickolay V. Shmyrev

    Hi nlindle, here is what David said in chat:

    (18:28:03) dhd: nshm: utterance confidence score is P(utt|speech) =
    P(utt,speech) / P(speech), where the numerator is the best path score and the
    denominator is a sum over all paths in the lattice
    (18:28:25) dhd: (which can be done easily with the forward algorithm)
    (18:28:27) nshm: right, so there must be completely different calculation code
    to do that
    (18:28:35) nshm: it's what nlinde asking on forums
    (18:28:43) dhd: no, ps_lattice.c does it
    (18:28:50) nshm: he's asking about sphinx3
    (18:28:53) dhd: ohhhhh
    (18:29:16) dhd: if you understand DP over graphs it's easy to implement :)
    (18:29:33) nshm: let me reply him with that
    18:30
    (18:30:02) dhd: ps lattice code is capable of rescoring lattices obtained from
    s3
    (18:30:07) dhd: but there's no command-line utility to do it
    (18:30:18) nshm: k, good
    (18:30:33) dhd: i think the python api can do it
    (18:30:40) dhd: if not i should make that possible
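
    To make the DP over the lattice that David mentions concrete, here is a
    minimal sketch of the calculation. It is written against hypothetical
    node_t/link_t types rather than the real ps_lattice_t structures, and
    lattice_log_posterior() is only an illustrative function, not part of the
    Sphinx API: the forward pass log-sums the scores of all paths, the Viterbi
    pass keeps the best one, and the utterance log-posterior is their difference.

    #include <math.h>

    #define MAX_LINKS 64

    typedef struct link_s link_t;

    typedef struct node_s {
        double alpha;            /* log-sum of all path scores reaching this node */
        double best;             /* log score of the best path reaching this node */
        int n_in;                /* number of incoming links                      */
        link_t *in[MAX_LINKS];   /* incoming links                                */
    } node_t;

    struct link_s {
        node_t *from;            /* source node of the link                       */
        double score;            /* acoustic + (scaled) language model log score  */
    };

    /* Numerically stable log(exp(a) + exp(b)). */
    static double
    log_add(double a, double b)
    {
        if (a < b) { double t = a; a = b; b = t; }
        return a + log1p(exp(b - a));
    }

    /* nodes[] is assumed to be topologically sorted, with nodes[0] the start
     * node and nodes[n-1] the end node.  Returns
     *   log P(utt|speech) = best-path log score - log-sum over all paths,
     * i.e. the ratio David describes, computed with the forward algorithm. */
    double
    lattice_log_posterior(node_t **nodes, int n)
    {
        int i, j;
        nodes[0]->alpha = nodes[0]->best = 0.0;
        for (i = 1; i < n; ++i) {
            node_t *nd = nodes[i];
            nd->alpha = nd->best = -HUGE_VAL;
            for (j = 0; j < nd->n_in; ++j) {
                link_t *l = nd->in[j];
                nd->alpha = log_add(nd->alpha, l->from->alpha + l->score);
                if (l->from->best + l->score > nd->best)
                    nd->best = l->from->best + l->score;
            }
        }
        return nodes[n - 1]->best - nodes[n - 1]->alpha;
    }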

     
  • n lindle

    n lindle - 2010-03-15

    OK, thanks, I'll look into the PocketSphinx code and see if I can extract
    what you guys mentioned.

     
  • n lindle

    n lindle - 2010-03-18

    I'm using ps_lattice_read to import the lattice that sphinx3 is outputting,
    then I'm filling that into search->dag and running ngram_search_bestpath. This
    calls ps_lattice_posterior and fills the result into search->post (which I
    think is the value I want). Are there any other steps I need to be taking?
    David mentions rescoring the lattice, but I wasn't quite sure what he meant by
    that -- based on the numbers I'm getting I think I still must be missing a
    step.

     
  • Nickolay V. Shmyrev

    David mentions rescoring the lattice, but I wasn't quite sure what he meant
    by that -- based on the numbers I'm getting I think I still must be missing a
    step.

    Hm, what's the problem exactly? It should be a single posterior number,
    shouldn't it?

     
  • n lindle

    n lindle - 2010-03-22

    Yes it is, but the numbers aren't really lining up as well as I had hoped for
    my 10 samples:

    WER (%) -- score
    10.7 -2394781
    10.8 -2400967
    11.2 -2439725
    11.6 -2202656
    11.9 -1820605
    12.4 -2097273
    12.5 -1422584
    12.8 -2422531
    13.5 -4501824
    15.9 -1809219

    The scores still seem to vary more depending on the length of the utterance
    than on the WER, which makes me think I'm probably missing something.

    This is basically all I'm calling to get the score:

    config = cmd_ln_parse_file_r(NULL, ps_args_def, argv[1], TRUE);
    ps = ps_init(config);
    lat = ps_lattice_read(ps, latfn);
    pp = ngram_search_bestpath2(ps, lat);
    E_INFO("Probability: %d\n", pp);
    

    which uses this function:

    int32
    ngram_search_bestpath2(ps_decoder_t *ps, ps_lattice_t *lat)
    {
        int32 blank;
        ps->search->dag = lat;
        ngram_search_bestpath(ps->search, &blank, 0);
        return ps->search->post;
    }
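
    A side note on interpreting the value returned above: search->post is kept in
    the decoder's internal integer log domain, so its magnitude tends to track
    utterance length. Below is a minimal sketch of converting it to a probability;
    ps_get_logmath() and logmath_exp() are standard PocketSphinx/SphinxBase calls,
    while report_confidence() and the per-word normalization are just illustrative
    assumptions, not something the API prescribes:

    #include <stdio.h>
    #include <pocketsphinx.h>

    /* Report the raw best-path posterior and a length-normalized
     * (per-word geometric mean) variant, so utterances of different
     * lengths become roughly comparable. */
    static void
    report_confidence(ps_decoder_t *ps, int32 post, int n_words)
    {
        logmath_t *lmath = ps_get_logmath(ps);

        /* logmath_exp() maps the integer log value back to a probability. */
        double p = logmath_exp(lmath, post);

        /* Integer division is fine here as an approximation. */
        double p_per_word = logmath_exp(lmath, post / n_words);

        printf("log posterior = %d, P = %e, per-word P = %e\n",
               (int)post, p, p_per_word);
    }

    Here n_words would be the number of non-filler words in the hypothesis, which
    ties back to the earlier question about normalizing by length.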
    
     
  • n lindle

    n lindle - 2010-03-22

    Hmm that function got a little messed up, it was supposed to be this:

    int32
    ngram_search_bestpath2(ps_decoder_t *ps, ps_lattice_t *lat)
    {
        int32 blank;
        ps->search->dag = lat;
        ngram_search_bestpath(ps->search, &blank, 0);
        return ps->search->post;
    }

     
  • Jeremie Papon

    Jeremie Papon - 2010-03-31

    If I read all this correctly, your score is based on path posterior
    probabilities, rather than word posterior probabilities.
    If that's the case, then you won't necessarily find any correlation between it
    and WER.

     
