Phone loop acoustic scores lower than single word acoustic scores

Help
2015-08-29
2015-12-17
  • Sean Robertson

    Sean Robertson - 2015-08-29

    Hello,

    I'm trying to compare acoustic scores between a phone loop recognizer (either a uniform monophone LM or an FSG with single-phone words) and those of decoders forced to recognize a certain word. I would expect that the freedom to recognize any phone, rather than only the few sequential phones of the word or its alternates, would give the phone loop recognizer generally higher acoustic scores than the single-word decoders.

    Actually, the opposite trend seems to occur. Below are the first 10 lines of the hypothesis segment files of pocketsphinx. I feel like I'm probably missing some easy answer.

    Before that, here's some relevant info:
    Pocketsphinx rev 13106
    Acoustic model adapted from LIUM French continuous model
    Feature files have been segmented by word
    Backtrace is enabled
    Word recognition is decomposed into an FSG of the phones of alternate pronunciations

    I've fiddled around with word- and phone- insertion penalties. Here's an excerpt of the hypseg file of the single word recognition
    51063_NEXTID_59_NEXTID_0 S 0 T -1151 A -1151 L 0 0 -149 0 mm 11 -256 0 aa 15 -170 0 dd 26 -434 0 aa 38 -142 0 mm 45
    25256_NEXTID_146_NEXTID_0 S 0 T -281 A -281 L 0 0 -281 0 au 16
    48891_NEXTID_13_NEXTID_0 S 0 T -946 A -946 L 0 0 -150 0 bb 4 -182 0 on 18 -194 0 jj 38 -152 0 ou 41 -268 0 rr 52
    98885_NEXTID_131_NEXTID_0 S 0 T -227 A -227 L 0 0 -227 0 au 13
    18352_NEXTID_88_NEXTID_0 S 0 T -920 A -920 L 0 0 -292 0 ii 6 -306 0 ss 23 -322 0 ii 29
    59801_NEXTID_78_NEXTID_0 S 0 T -2723 A -2723 L 0 0 -254 0 ll 6 -144 0 aa 13 -569 0 pp 34 -1756 0 in 58
    70597_NEXTID_124_NEXTID_0 S 0 T -959 A -959 L 0 0 -249 0 ii 3 -340 0 ss 17 -370 0 ii 36
    40155_NEXTID_176_NEXTID_0 S 0 T -910 A -910 L 0 0 -153 0 mm 21 -300 0 ai 24 -194 0 rr 38 -154 0 ss 55 -109 0 ii 64
    88140_NEXTID_54_NEXTID_0 S 0 T -473 A -473 L 0 0 -323 0 ww 11 -150 0 ii 22
    88140_NEXTID_54_NEXTID_1 S 0 T -619 A -619 L 0 0 -137 0 vv 6 -482 0 ou 20

    And the corresponding phone loop entries:
    51063_NEXTID_59_NEXTID_0 S 0 T -3341 A -3341 L 0 0 -426 0 eu 4 -364 0 mm 12 -510 0 ai 16 -769 0 tt 27 -753 0 un 39 -519 0 nn 45
    25256_NEXTID_146_NEXTID_0 S 0 T -1239 A -1239 L 0 0 -1239 0 an 16
    48891_NEXTID_13_NEXTID_0 S 0 T -2778 A -2778 L 0 0 -963 0 an 17 -780 0 jj 36 -1035 0 eu 52
    98885_NEXTID_131_NEXTID_0 S 0 T -774 A -774 L 0 0 -774 0 ff 13
    18352_NEXTID_88_NEXTID_0 S 0 T -1974 A -1974 L 0 0 -1056 0 zz 17 -918 0 ai 29
    59801_NEXTID_78_NEXTID_0 S 0 T -3824 A -3824 L 0 0 -2190 0 rr 37 -938 0 ii 47 -696 0 nn 58
    70597_NEXTID_124_NEXTID_0 S 0 T -1889 A -1889 L 0 0 -1052 0 ff 23 -837 0 ai 36
    40155_NEXTID_176_NEXTID_0 S 0 T -3173 A -3173 L 0 0 -908 0 mm 20 -1054 0 in 36 -439 0 ss 55 -772 0 ii 64
    88140_NEXTID_54_NEXTID_0 S 0 T -1320 A -1320 L 0 0 -708 0 ww 11 -612 0 ii 22
    88140_NEXTID_54_NEXTID_1 S 0 T -1451 A -1451 L 0 0 -400 0 ff 5 -555 0 au 14 -496 0 pp 20

    Here's with a wip and pip of 0.001 and single word:
    96950_NEXTID_74_NEXTID_0 S 0 T -1971 A -1971 L 0 0 -953 0 ww 10 -1018 0 ii 34
    96950_NEXTID_74_NEXTID_1 S 0 T 169 A 169 L 0 34
    98885_NEXTID_164_NEXTID_0 S 0 T -1424 A -1424 L 0 0 -1424 0 jj 28
    98885_NEXTID_164_NEXTID_1 S 0 T -2216 A -2216 L 0 0 -870 0 pp 5 -1346 0 eu 17

    0.001 phone loop
    96950_NEXTID_74_NEXTID_0 S 0 T -2938 A -2938 L 0 0 -999 0 ww 10 -1939 0 ii 34
    96950_NEXTID_74_NEXTID_1 S 0 T -1556 A -1556 L 0 0 -1556 0 yy 12
    98885_NEXTID_164_NEXTID_0 S 0 T -2248 A -2248 L 0 0 -2248 0 jj 28
    98885_NEXTID_164_NEXTID_1 S 0 T -2251 A -2251 L 0 0 -2251 0 vv 17

    Interestingly, I tried setting the wip and pip to 0, and I think stuff must've underflowed, which is really weird since it shouldn't touch the acoustic score. Single word:
    27066_NEXTID_215_NEXTID_0 S 0 T 10481341 A 10481341 L 0 0 2097041 0 tt 3 2096998 0 rr 6 2096835 0 ww 9 2096879 0 aa 12 2093588 0 zz 105
    96950_NEXTID_44_NEXTID_0 S 0 T 10482052 A 10482052 L 0 0 2097010 0 ll 3 2096817 0 ii 6 2097017 0 nn 9 2097058 0 dd 12 2094150 0 aa 63
    50302_NEXTID_64_NEXTID_0 S 0 T 10483426 A 10483426 L 0 0 2097081 0 bb 3 2096644 0 on 6 2096564 0 jj 9 2096950 0 ou 12 2096187 0 rr 49

    Phone loop:
    27066_NEXTID_215_NEXTID_0 S 0 T 73393952 A 73393952 L 0 0 2097055 0 dd 3 2096967 0 tt 6 2097016 0 vv 9 2097033 0 ee 12 2096945 0 pp 15 2096920 0 tt 18 2096916 0 ss 21 2096935 0 ss 24 2097024 0 ff 27 2096961 0 vv 30 2096923 0 ww 33 2096891 0 rr 36 2096970 0 oe 39 2096982 0 un 42 2097019 0 aa 45 2096908 0 un 48 2096876 0 un 51 2096928 0 un 54 2096929 0 un 57 2097076 0 aa 60 2096946 0 ll 63 2097030 0 bb 66 2096964 0 dd 69 2097016 0 bb 72 2096996 0 dd 75 2096983 0 nn 78 2096961 0 tt 81 2096995 0 vv 84 2096861 0 nn 87 2097006 0 tt 90 2096933 0 ss 93 2097047 0 zz 96 2096969 0 tt 99 2096969 0 rr 102 2097002 0 dd 104
    96950_NEXTID_44_NEXTID_0 S 0 T 44035725 A 44035725 L 0 0 2096961 0 in 3 2097016 0 ai 6 2096946 0 ll 9 2097002 0 ll 12 2096983 0 ll 15 2096869 0 uy 18 2096889 0 ei 21 2096865 0 ii 24 2096919 0 yy 27 2096926 0 uu 30 2096975 0 nn 33 2097076 0 nn 36 2097052 0 nn 39 2096949 0 gg 42 2096990 0 tt 45 2096857 0 yy 48 2096718 0 ei 51 2096914 0 un 54 2096876 0 aa 57 2096900 0 aa 60 2097042 0 aa 62
    50302_NEXTID_64_NEXTID_0 S 0 T 33551773 A 33551773 L 0 0 2097050 0 vv 3 2097002 0 vv 6 2096962 0 rr 9 2096989 0 un 12 2096950 0 on 15 2096893 0 on 18 2097002 0 dd 21 2097023 0 gg 24 2096948 0 jj 27 2097014 0 yy 30 2096937 0 uy 33 2097020 0 ll 36 2096970 0 eu 39 2097019 0 un 42 2096956 0 eu 45 2097038 0 vv 47
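
    To compare the listings above utterance by utterance, a small parser helps. This is only a sketch: the field layout (a header `uttid S n T total A acoustic L language`, then repeating `start_frame acoustic language word` groups, then a final end frame) is inferred from the excerpts in this post, not taken from pocketsphinx documentation, and the names `Segment`, `Hyp` and `parse_hypseg` are made up for illustration.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: int   # first frame of the segment
    ascr: int    # acoustic score of the segment
    lscr: int    # language score of the segment
    word: str


class Hyp(NamedTuple):
    uttid: str
    total: int
    ascr: int
    lscr: int
    segments: list
    end: int     # last frame number on the line


def parse_hypseg(line):
    """Parse one hypseg line (layout assumed from the excerpts above)."""
    tok = line.split()
    if [tok[1], tok[3], tok[5], tok[7]] != ["S", "T", "A", "L"]:
        raise ValueError("unexpected hypseg header: " + line)
    total, ascr, lscr = int(tok[4]), int(tok[6]), int(tok[8])
    rest = tok[9:]
    # Groups of four fields per segment, plus one trailing end frame.
    if len(rest) % 4 != 1:
        raise ValueError("unexpected segment fields: " + line)
    segments = [
        Segment(int(rest[i]), int(rest[i + 1]), int(rest[i + 2]), rest[i + 3])
        for i in range(0, len(rest) - 1, 4)
    ]
    return Hyp(tok[0], total, ascr, lscr, segments, int(rest[-1]))


single = parse_hypseg(
    "51063_NEXTID_59_NEXTID_0 S 0 T -1151 A -1151 L 0 0 -149 0 mm 11 "
    "-256 0 aa 15 -170 0 dd 26 -434 0 aa 38 -142 0 mm 45"
)
loop = parse_hypseg(
    "51063_NEXTID_59_NEXTID_0 S 0 T -3341 A -3341 L 0 0 -426 0 eu 4 "
    "-364 0 mm 12 -510 0 ai 16 -769 0 tt 27 -753 0 un 39 -519 0 nn 45"
)
# Difference between the single-word and phone-loop acoustic totals.
print(single.ascr - loop.ascr)
```

    For the first utterance the per-segment acoustic scores sum to the `A` field in both files, which is what suggests this reading of the format.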

    Thanks,
    Sean

     
    • Nickolay V. Shmyrev

      I'm sorry, I don't quite understand what is going on there. Could you provide code examples and data?

      Are you using context-dependent phones for words? What phones are used for loop, are they context-independent? Generally those should give different results.

      If you use CI phones in both cases phone score should be lower given pip is zero. Ideally you need non-zero pip for best phone loop accuracy and to match that you need to insert pip into word case as well. In that case you will see the positive difference.

       
  • Sean Robertson

    Sean Robertson - 2015-08-30

    Thanks for responding and I appreciate your help! I'm sorry that I wasn't clear. I tend to ramble a bit.

    I presented two issues, I suppose. I've come up with pretty minimal steps to reproduce. I'm running the latest version of pocketsphinx (r13106).

    1. Positive total acoustic scores given in the hypseg file. To reproduce, simply record a wav file, make a control file ("fileids") with just that file name, and run the following:
      pocketsphinx_batch \
          -adchdr 44 \
          -adcin yes \
          -cepdir . \
          -cepext .wav \
          -ctl fileids \
          -hypseg hypseg \
          -remove_noise no \
          -remove_silence no \
          -allphone /path/to/en_US/en-phone.lm.DMP
      

    My acoustic score totals tend to be positive when I do this. You can get greater positive values if you set -pip to 0.

    2. Generally lower phone loop scores than one-path scores. It's probably most clear by example. Please download the LIUM French acoustic model and the files I've uploaded here.

    You'll note two fsg files. One, "bonjour_phone.fsg," simply follows sequentially through the phones of the word "bonjour." The other, "phone_loop.fsg," has a loop at the start node and a transition to the end node for every phone. You can run the former with

        pocketsphinx_batch \
            -cepdir . \
            -cepext .mfcc \
            -hmm /path/to/lium_french_f0 \
            -dict phone_dict.dic \
            -fsg bonjour_phone.fsg \
            -ctl fileids \
            -cmn none \
            -agc none \
            -hypseg hypseg \
            -remove_noise no \
            -remove_silence no \
            -fsgusefiller no
    

    and the latter with

         pocketsphinx_batch \
            -cepdir . \
            -cepext .mfcc \
            -hmm /path/to/lium_french_f0 \
            -dict phone_dict.dic \
            -fsg phone_loop.fsg \
            -ctl fileids \
            -cmn none \
            -agc none \
            -hypseg hypseg2 \
            -remove_noise no \
            -remove_silence no \
            -fsgusefiller no
    

    The score in "hypseg" is greater than that in "hypseg2." You can fiddle with -wip. If you set it to 0, the scores become immensely positive.
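
    For readers without the uploaded files, the two grammars described above could be generated roughly as follows. This is a sketch under assumptions: the state layout and the uniform probabilities are guesses from the description in this post, the short phone list stands in for the model's full phone set, the function names are made up, and the output uses the plain-text Sphinx FSG keywords (`FSG_BEGIN`, `NUM_STATES`, `START_STATE`, `FINAL_STATE`, `TRANSITION`, `FSG_END`) as I understand them.

```python
def sequence_fsg(name, phones):
    """FSG that walks through the phones one after another."""
    lines = [
        f"FSG_BEGIN {name}",
        f"NUM_STATES {len(phones) + 1}",
        "START_STATE 0",
        f"FINAL_STATE {len(phones)}",
    ]
    for i, phone in enumerate(phones):
        lines.append(f"TRANSITION {i} {i + 1} 1.0 {phone}")
    lines.append("FSG_END")
    return "\n".join(lines) + "\n"


def phone_loop_fsg(name, phones):
    """FSG with a self-loop and an exit transition for every phone."""
    prob = 1.0 / (2 * len(phones))  # uniform over all outgoing arcs (assumed)
    lines = [
        f"FSG_BEGIN {name}",
        "NUM_STATES 2",
        "START_STATE 0",
        "FINAL_STATE 1",
    ]
    for phone in phones:
        lines.append(f"TRANSITION 0 0 {prob:.6f} {phone}")  # stay in the loop
        lines.append(f"TRANSITION 0 1 {prob:.6f} {phone}")  # leave on this phone
    lines.append("FSG_END")
    return "\n".join(lines) + "\n"


# Phones of "bonjour" as they appear in the hypseg excerpts of this thread.
bonjour = sequence_fsg("bonjour_phone", ["bb", "on", "jj", "ou", "rr"])
# Illustrative subset only; the real loop would list every phone in the model.
loop = phone_loop_fsg("phone_loop", ["aa", "bb", "ii", "jj", "on", "ou", "rr"])
print(bonjour)
print(loop)
```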

    Thanks for your time,
    Sean

     

    Last edit: Sean Robertson 2015-08-30
    • Nickolay V. Shmyrev

      > Positive total acoustic scores given in hypseg file. To reproduce, simply record a wav file and, make a control file ("fileids") with just that file name and run the following

      This is a valid problem, thank you. I've just fixed it in trunk.

      > The score in "hypseg" is greater than that in "hypseg2." You can fiddle with -wip. If you set it to 0, the scores become immensely positive.

      A wip of 0 does not really make sense since it's a probability. It should be 1.0 if you want to skip wip in scoring. I will look into the problem in more detail a bit later, though.
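
      A quick way to see why 1.0, not 0, is the neutral value: penalties are probabilities that get converted to log scores before being added. The sketch below assumes a log base of 1.0001 (which I believe is the sphinxbase default) and uses floating point, whereas the decoder works with integers, so it illustrates the idea rather than the actual conversion code. `penalty_to_log` is a made-up name.

```python
import math

LOG_BASE = 1.0001  # assumed log base; adjust if your setup differs


def penalty_to_log(prob):
    """Convert an insertion penalty (a probability) to a log-domain score."""
    if prob <= 0.0:
        # log(0) is undefined; a fixed-width integer cannot represent this.
        return -math.inf
    return math.log(prob) / math.log(LOG_BASE)


for p in (1.0, 0.001, 0.0):
    print(p, penalty_to_log(p))
```

      A penalty of 1.0 maps to a log score of 0 and so leaves the path score untouched, while 0 has no finite log at all.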

       
      • Sean Robertson

        Sean Robertson - 2015-09-27

        Hi again,

        Thanks for taking a look. Unfortunately, when I updated sphinxbase and pocketsphinx (r13107), the positive acoustic scores persist. I've uploaded the audio file I used for testing here, called "yo.wav." Here's the hypseg output I get when I use the above command.

            yo S 0 T -7399 A 10392 L -17791 0 -95 0 SIL 3 -187 -326 AY 7 -1317 -449 R 45 -1447 -540 NG 97 135 -583 +NSN+ 105 13930 -14475 D 108 -309 -444 JH 126 -154 -199 N 129 -40 -411 EY 135 -124 -364 SIL 159
        

        Note the highly positive acoustic score for D. I wonder if it has to do with the segment only being 3 frames? The latter problem I mentioned above might disappear when this is fixed.

        Best,
        Sean

         
  • Sean Robertson

    Sean Robertson - 2015-11-02

    No worries, just making sure this doesn't get buried :) Thanks!

     
  • Sean Robertson

    Sean Robertson - 2015-12-03

    How's this looking?

     
    • Nickolay V. Shmyrev

      Dear Sean

      It's interesting, I fixed this score problem just a couple of days ago; it was another bug with fillers in phonetic search (the NSN phone you had). Could you please update and try again? Thank you.

      As for the second problem, I set wip to 1.0 and the difference is smaller. Checking further.

       
      • Nickolay V. Shmyrev

        Ok, I investigated the second problem as well.

        The thing is that the acoustic model normalizes scores to keep them in range, so it shifts everything by the best acoustic score.

        In a forced sequence you basically do not have enough diversity, and the norm is different from the loop, where you have more diversity and a different normalizer.

        As a result, it is not possible to compare scores across multiple runs; you can only compare scores within a single run.

        Not a good thing, but this is how cmusphinx has always worked.
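
        A toy illustration of the effect described above (not the pocketsphinx implementation): each frame's scores are shifted by the best score among the senones that are active in that run. The frame scores, the senone names and the function `normalized_path_score` are all invented, and here higher means better.

```python
def normalized_path_score(frames, path, active):
    """Sum per-frame scores along `path`, each shifted by the best active score."""
    total = 0
    for scores, senone in zip(frames, path):
        best = max(scores[s] for s in active)  # per-frame normalizer
        total += scores[senone] - best
    return total


# Invented raw log scores for three frames and three senones.
frames = [
    {"a": -50, "b": -20, "c": -90},
    {"a": -40, "b": -35, "c": -10},
    {"a": -30, "b": -60, "c": -25},
]
path = ["a", "a", "a"]

# Forced run: only the senone on the path is active, so it is always "best".
forced = normalized_path_score(frames, path, active={"a"})
# Loop run: every senone is active, so the normalizer is the global best.
loop = normalized_path_score(frames, path, active={"a", "b", "c"})
print(forced, loop)
```

        The same path over the same frames gets a different total in each run, purely because the normalizer differs, which is why only scores from a single run are comparable.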

         
        • Sean Robertson

          Sean Robertson - 2015-12-17

          Thank you very much for looking into it and fixing what you could, Nickolay. :) I've taken a look at the code and see what you're talking about. I'll probably hack around a bit for my purposes.

          Quick question if you have the time: in acmod.c vs ms_mgau.c, acmod->senone_active is treated as both a score in acmod_best_score and as an index to mixtures in ms_cont_mgau_frame_eval. How can these be resolved?

          Best,
          Sean

           
  • Nickolay V. Shmyrev

    Hi Sean

    In this code

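~~~~~~~~~
int16 *senscr;

/* senscr walks through the array of senone scores */
senscr = acmod->senone_scores;
for (i = 0; i < acmod->n_senone_active; ++i) {
    /* senone_active[i] is the delta to the next active senone */
    senscr += acmod->senone_active[i];
    if (*senscr < best) {
        best = *senscr;
        *out_best_senid = i;
    }
}
~~~~~~~~~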

    the variable name is a bit confusing: it is called senscr, but it is actually a pointer into the array of senone scores, not a score itself. senone_active uses a tricky compression method to save memory: it contains the deltas between active senone indices, not the indices themselves, so you need to accumulate the deltas to get the index of the next active senone.

    So the code is correct. See for details acmod_flags2list.
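
    The delta scheme can be sketched in a few lines. This is a simplified illustration, not a port of acmod_flags2list: `encode_deltas` and `best_active_score` are made-up names, the score values are invented, and any extra handling the real code does for its compact storage type is left out. As in the C snippet, a lower score wins.

```python
def encode_deltas(active_indices):
    """Store each active senone index as the gap from the previous one."""
    deltas, prev = [], 0
    for idx in sorted(active_indices):
        deltas.append(idx - prev)
        prev = idx
    return deltas


def best_active_score(senone_scores, deltas):
    """Walk the score array by accumulating deltas, tracking the lowest score."""
    pos = 0
    best, best_id = None, None
    for delta in deltas:
        pos += delta  # same role as `senscr += acmod->senone_active[i]`
        if best is None or senone_scores[pos] < best:
            best, best_id = senone_scores[pos], pos
    return best, best_id


deltas = encode_deltas([3, 7, 8, 20])
scores = [100] * 21
scores[8] = -5    # active senone with the lowest score
scores[10] = -50  # lower still, but inactive, so it must be ignored
print(deltas, best_active_score(scores, deltas))
```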

    I think some of the pocketsphinx methods are too complex and could probably be simplified, but this is the way it goes. And it allows a very compact memory representation.

     
  • Sean Robertson

    Sean Robertson - 2015-12-17

    Thanks very much! You're awesome.

     
