CMU Sphinx / Forums / Help: Phone loop acoustic scores lower than single word acoustic scores

Sean Robertson - 2015-08-29

Hello,

I'm trying to compare acoustic scores between a phone loop recognizer (either a uniform monophone LM or an FSG with single-phone words) and decoders forced to recognize a certain word. The phone loop is free to recognize any phone, while the forced decoder is restricted to the short, fixed phone sequence of the word (or its alternates), so I would expect the phone loop recognizer to have generally higher acoustic scores than the single-word decoder.

Actually, the opposite trend seems to occur. Below are the first 10 lines of the hypothesis segment files of pocketsphinx. I feel like I'm probably missing some easy answer.

Before that, here's some relevant info:
Pocketsphinx rev 13106
Acoustic model adapted from LIUM French continuous model
Feature files have been segmented by word
Backtrace is enabled
Word recognition is decomposed into an FSG of the phones of alternate pronunciations

I've fiddled around with the word and phone insertion penalties (wip and pip). Here's an excerpt of the hypseg file for single-word recognition:
51063_NEXTID_59_NEXTID_0 S 0 T -1151 A -1151 L 0 0 -149 0 mm 11 -256 0 aa 15 -170 0 dd 26 -434 0 aa 38 -142 0 mm 45
25256_NEXTID_146_NEXTID_0 S 0 T -281 A -281 L 0 0 -281 0 au 16
48891_NEXTID_13_NEXTID_0 S 0 T -946 A -946 L 0 0 -150 0 bb 4 -182 0 on 18 -194 0 jj 38 -152 0 ou 41 -268 0 rr 52
98885_NEXTID_131_NEXTID_0 S 0 T -227 A -227 L 0 0 -227 0 au 13
18352_NEXTID_88_NEXTID_0 S 0 T -920 A -920 L 0 0 -292 0 ii 6 -306 0 ss 23 -322 0 ii 29
59801_NEXTID_78_NEXTID_0 S 0 T -2723 A -2723 L 0 0 -254 0 ll 6 -144 0 aa 13 -569 0 pp 34 -1756 0 in 58
70597_NEXTID_124_NEXTID_0 S 0 T -959 A -959 L 0 0 -249 0 ii 3 -340 0 ss 17 -370 0 ii 36
40155_NEXTID_176_NEXTID_0 S 0 T -910 A -910 L 0 0 -153 0 mm 21 -300 0 ai 24 -194 0 rr 38 -154 0 ss 55 -109 0 ii 64
88140_NEXTID_54_NEXTID_0 S 0 T -473 A -473 L 0 0 -323 0 ww 11 -150 0 ii 22
88140_NEXTID_54_NEXTID_1 S 0 T -619 A -619 L 0 0 -137 0 vv 6 -482 0 ou 20

And the corresponding phone loop entries:
51063_NEXTID_59_NEXTID_0 S 0 T -3341 A -3341 L 0 0 -426 0 eu 4 -364 0 mm 12 -510 0 ai 16 -769 0 tt 27 -753 0 un 39 -519 0 nn 45
25256_NEXTID_146_NEXTID_0 S 0 T -1239 A -1239 L 0 0 -1239 0 an 16
48891_NEXTID_13_NEXTID_0 S 0 T -2778 A -2778 L 0 0 -963 0 an 17 -780 0 jj 36 -1035 0 eu 52
98885_NEXTID_131_NEXTID_0 S 0 T -774 A -774 L 0 0 -774 0 ff 13
18352_NEXTID_88_NEXTID_0 S 0 T -1974 A -1974 L 0 0 -1056 0 zz 17 -918 0 ai 29
59801_NEXTID_78_NEXTID_0 S 0 T -3824 A -3824 L 0 0 -2190 0 rr 37 -938 0 ii 47 -696 0 nn 58
70597_NEXTID_124_NEXTID_0 S 0 T -1889 A -1889 L 0 0 -1052 0 ff 23 -837 0 ai 36
40155_NEXTID_176_NEXTID_0 S 0 T -3173 A -3173 L 0 0 -908 0 mm 20 -1054 0 in 36 -439 0 ss 55 -772 0 ii 64
88140_NEXTID_54_NEXTID_0 S 0 T -1320 A -1320 L 0 0 -708 0 ww 11 -612 0 ii 22
88140_NEXTID_54_NEXTID_1 S 0 T -1451 A -1451 L 0 0 -400 0 ff 5 -555 0 au 14 -496 0 pp 20
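For anyone who wants to compare the two listings programmatically, here is a minimal C sketch that pulls the total acoustic score out of a hypseg line and takes the difference for one utterance. The field layout (`<uttid> S <scale> T <total> A <acoustic> L <lm>`, followed by the segments) is inferred from the excerpts above rather than taken from documentation, and the names `hypseg_header_t`, `hypseg_parse_header` and `hypseg_acoustic_gap` are made up for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Header fields of one hypseg line. The layout is an assumption read off the
 * excerpts in this thread, not a documented format:
 *   <uttid> S <scale> T <total> A <acoustic> L <lm> <segments...> <end frame> */
typedef struct {
    char uttid[128];
    long total;     /* value after "T" */
    long acoustic;  /* value after "A" */
    long lm;        /* value after "L" */
} hypseg_header_t;

/* Parse the header of a hypseg line. Returns 0 on success, -1 on failure. */
int hypseg_parse_header(const char *line, hypseg_header_t *out)
{
    long scale;
    int n;

    assert(line != NULL && out != NULL);
    n = sscanf(line, "%127s S %ld T %ld A %ld L %ld",
               out->uttid, &scale, &out->total, &out->acoustic, &out->lm);
    return n == 5 ? 0 : -1;
}

/* Store (single-word acoustic score) - (phone-loop acoustic score) in *gap.
 * Returns 0 on success, -1 if a line fails to parse or the ids differ. */
int hypseg_acoustic_gap(const char *word_line, const char *loop_line, long *gap)
{
    hypseg_header_t w, l;

    assert(gap != NULL);
    if (hypseg_parse_header(word_line, &w) != 0) return -1;
    if (hypseg_parse_header(loop_line, &l) != 0) return -1;
    if (strcmp(w.uttid, l.uttid) != 0) return -1;
    *gap = w.acoustic - l.acoustic;
    return 0;
}
```

A positive gap means the single-word run reported the higher acoustic score for that utterance, which is the unexpected direction described above.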

Here's the single-word output with a wip and pip of 0.001:
96950_NEXTID_74_NEXTID_0 S 0 T -1971 A -1971 L 0 0 -953 0 ww 10 -1018 0 ii 34
96950_NEXTID_74_NEXTID_1 S 0 T 169 A 169 L 0 34
98885_NEXTID_164_NEXTID_0 S 0 T -1424 A -1424 L 0 0 -1424 0 jj 28
98885_NEXTID_164_NEXTID_1 S 0 T -2216 A -2216 L 0 0 -870 0 pp 5 -1346 0 eu 17

And the phone loop with a wip and pip of 0.001:
96950_NEXTID_74_NEXTID_0 S 0 T -2938 A -2938 L 0 0 -999 0 ww 10 -1939 0 ii 34
96950_NEXTID_74_NEXTID_1 S 0 T -1556 A -1556 L 0 0 -1556 0 yy 12
98885_NEXTID_164_NEXTID_0 S 0 T -2248 A -2248 L 0 0 -2248 0 jj 28
98885_NEXTID_164_NEXTID_1 S 0 T -2251 A -2251 L 0 0 -2251 0 vv 17

Interestingly, I tried setting the wip and pip to 0, and I think stuff must've underflowed, which is really weird since it shouldn't touch the acoustic score. Single word:
27066_NEXTID_215_NEXTID_0 S 0 T 10481341 A 10481341 L 0 0 2097041 0 tt 3 2096998 0 rr 6 2096835 0 ww 9 2096879 0 aa 12 2093588 0 zz 105
96950_NEXTID_44_NEXTID_0 S 0 T 10482052 A 10482052 L 0 0 2097010 0 ll 3 2096817 0 ii 6 2097017 0 nn 9 2097058 0 dd 12 2094150 0 aa 63
50302_NEXTID_64_NEXTID_0 S 0 T 10483426 A 10483426 L 0 0 2097081 0 bb 3 2096644 0 on 6 2096564 0 jj 9 2096950 0 ou 12 2096187 0 rr 49

Phone loop:
27066_NEXTID_215_NEXTID_0 S 0 T 73393952 A 73393952 L 0 0 2097055 0 dd 3 2096967 0 tt 6 2097016 0 vv 9 2097033 0 ee 12 2096945 0 pp 15 2096920 0 tt 18 2096916 0 ss 21 2096935 0 ss 24 2097024 0 ff 27 2096961 0 vv 30 2096923 0 ww 33 2096891 0 rr 36 2096970 0 oe 39 2096982 0 un 42 2097019 0 aa 45 2096908 0 un 48 2096876 0 un 51 2096928 0 un 54 2096929 0 un 57 2097076 0 aa 60 2096946 0 ll 63 2097030 0 bb 66 2096964 0 dd 69 2097016 0 bb 72 2096996 0 dd 75 2096983 0 nn 78 2096961 0 tt 81 2096995 0 vv 84 2096861 0 nn 87 2097006 0 tt 90 2096933 0 ss 93 2097047 0 zz 96 2096969 0 tt 99 2096969 0 rr 102 2097002 0 dd 104
96950_NEXTID_44_NEXTID_0 S 0 T 44035725 A 44035725 L 0 0 2096961 0 in 3 2097016 0 ai 6 2096946 0 ll 9 2097002 0 ll 12 2096983 0 ll 15 2096869 0 uy 18 2096889 0 ei 21 2096865 0 ii 24 2096919 0 yy 27 2096926 0 uu 30 2096975 0 nn 33 2097076 0 nn 36 2097052 0 nn 39 2096949 0 gg 42 2096990 0 tt 45 2096857 0 yy 48 2096718 0 ei 51 2096914 0 un 54 2096876 0 aa 57 2096900 0 aa 60 2097042 0 aa 62
50302_NEXTID_64_NEXTID_0 S 0 T 33551773 A 33551773 L 0 0 2097050 0 vv 3 2097002 0 vv 6 2096962 0 rr 9 2096989 0 un 12 2096950 0 on 15 2096893 0 on 18 2097002 0 dd 21 2097023 0 gg 24 2096948 0 jj 27 2097014 0 yy 30 2096937 0 uy 33 2097020 0 ll 36 2096970 0 eu 39 2097019 0 un 42 2096956 0 eu 45 2097038 0 vv 47

Thanks,
Sean

- Nickolay V. Shmyrev - 2015-08-29
  
  I'm sorry, I don't quite understand what is going on there. Could you provide code examples and data?
  
  Are you using context-dependent phones for words? What phones are used for the loop; are they context-independent? Generally those should give different results.
  
  If you use CI phones in both cases, the phone score should be lower given that pip is zero. Ideally you need a non-zero pip for best phone loop accuracy, and to match that you need to insert pip into the word case as well. In that case you will see the positive difference.
  

Sean Robertson - 2015-08-30

Thanks for responding and I appreciate your help! I'm sorry that I wasn't clear. I tend to ramble a bit.

I presented two issues, I suppose. I've come up with pretty minimal steps to reproduce. I'm running the latest version of pocketsphinx (r13106).

Positive total acoustic scores given in the hypseg file. To reproduce, simply record a wav file, make a control file ("fileids") containing just that file's name, and run the following:
pocketsphinx_batch \
    -adchdr 44 \
    -adcin yes \
    -cepdir . \
    -cepext .wav \
    -ctl fileids \
    -hypseg hypseg \
    -remove_noise no \
    -remove_silence no \
    -allphone /path/to/en_US/en-phone.lm.DMP

My acoustic score totals tend to be positive when I do this. You can get greater positive values if you set -pip to 0.

Phone loop scores generally lower than single-path scores. This is probably clearest by example. Please download the LIUM French acoustic model and the files I've uploaded here.

You'll note two fsg files. One, "bonjour_phone.fsg," simply steps sequentially through the phones of the word "bonjour." The other, "phone_loop.fsg," has, for every phone, a loop at the start node and a transition to the end node. You can run the former with

pocketsphinx_batch \
    -cepdir . \
    -cepext .mfcc \
    -hmm /path/to/lium_french_f0 \
    -dict phone_dict.dic \
    -fsg bonjour_phone.fsg \
    -ctl fileids \
    -cmn none \
    -agc none \
    -hypseg hypseg \
    -remove_noise no \
    -remove_silence no \
    -fsgusefiller no

and the latter with

pocketsphinx_batch \
    -cepdir . \
    -cepext .mfcc \
    -hmm /path/to/lium_french_f0 \
    -dict phone_dict.dic \
    -fsg phone_loop.fsg \
    -ctl fileids \
    -cmn none \
    -agc none \
    -hypseg hypseg2 \
    -remove_noise no \
    -remove_silence no \
    -fsgusefiller no

The score in "hypseg" is greater than that in "hypseg2." You can fiddle with -wip. If you set it to 0, the scores become immensely positive.

Thanks for your time,
Sean

Last edit: Sean Robertson 2015-08-30
- Nickolay V. Shmyrev - 2015-09-09
  
  Positive total acoustic scores given in hypseg file. To reproduce, simply record a wav file and, make a control file ("fileids") with just that file name and run the following
  
  This is a valid problem, thank you. I've just fixed it in trunk.
  
  The score in "hypseg" is greater than that in "hypseg2." You can fiddle with -wip. If you set it to 0, the scores become immensely positive.
  
  A wip of 0 does not really make sense, since wip is a probability. It should be 1.0 if you want wip to have no effect on scoring. I will look into the problem in more detail a bit later, though.
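The point about wip being a probability can be written out as a short worked equation. This is only a sketch under an assumption: that the penalty enters a log-domain path score as one added log term per word, which is the usual convention for log-domain decoders. It is not a statement of what pocketsphinx computes internally, and the symbols are chosen here for illustration.

```latex
\text{score}(\text{path})
  = \sum_{t} \log p(o_t \mid s_t) \;+\; N_{\text{words}} \cdot \log(\text{wip}),
\qquad
\log(1.0) = 0,
\qquad
\lim_{\text{wip} \to 0^{+}} \log(\text{wip}) = -\infty
```

Under that assumption, a wip of 1.0 adds nothing to the score, while a wip of exactly 0 has no finite log value, so whatever the implementation substitutes for it is an edge case rather than a neutral setting.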
  
  - Sean Robertson - 2015-09-27
    
    Hi again,
    
    Thanks for taking a look. Unfortunately, after I updated sphinxbase and pocketsphinx (r13107), the positive acoustic scores persist. I've uploaded the audio file I used for testing here, called "yo.wav." Here's the hypseg output I get when I use the above command.
    
    yo S 0 T -7399 A 10392 L -17791 0 -95 0 SIL 3 -187 -326 AY 7 -1317 -449 R 45 -1447 -540 NG 97 135 -583 +NSN+ 105 13930 -14475 D 108 -309 -444 JH 126 -154 -199 N 129 -40 -411 EY 135 -124 -364 SIL 159
    
    Note the highly positive acoustic score for D. I wonder if it has to do with the segment only being 3 frames? The latter problem I mentioned above might disappear when this is fixed.
    
    Best,
    Sean
    

Sean Robertson - 2015-11-02

No worries, just making sure this doesn't get buried :) Thanks!


Sean Robertson - 2015-12-03

How's this looking?

- Nickolay V. Shmyrev - 2015-12-11
  
  Dear Sean
  
  It's interesting: I fixed this problem just a couple of days ago. It was another bug with fillers in phonetic search (the NSN phone you had). Could you please update and try again? Thank you.
  
  As for the second problem, I set wip to 1.0 and the difference is smaller. Checking further.
  
  - Nickolay V. Shmyrev - 2015-12-11
    
    Ok, I investigated the second problem as well.
    
    The thing is that the acoustic model performs score normalization to keep the scores in range, so it shifts everything by the best acoustic score.
    
    In a forced sequence you basically do not have enough diversity, so the normalizer is different from the one in the loop, where you have more diversity.
    
    As a result, it is not possible to compare scores across multiple runs; you can only compare scores within a single run.
    
    Not a good thing, but this is how cmusphinx has always worked.
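The effect of a run-dependent normalizer can be seen in a toy C sketch. It is not pocketsphinx code: it assumes raw per-frame log-scores where higher is better, a normalizer equal to the best score among the senones active in that run, and the hypothetical helpers `best_active` and `normalized_score`.

```c
#include <assert.h>
#include <stddef.h>

/* Best (highest) raw log-score among the senones active in one frame.
 * `active` holds indices into `raw`. Toy convention: higher is better. */
static int best_active(const int *raw, const int *active, size_t n_active)
{
    int best;
    size_t i;

    assert(n_active > 0);
    best = raw[active[0]];
    for (i = 1; i < n_active; ++i) {
        if (raw[active[i]] > best)
            best = raw[active[i]];
    }
    return best;
}

/* Score of one senone after shifting by the frame's best active score.
 * The result depends on which senones happen to be active in the run. */
int normalized_score(const int *raw, const int *active, size_t n_active,
                     int senone)
{
    return raw[senone] - best_active(raw, active, n_active);
}
```

With raw scores {-50, -20, -35}, senone 0 normalizes to 0 when it is the only active senone (a forced path) and to -30 when all three are active (a loop). The raw score of senone 0 is identical in both cases, so the two normalized values are not comparable.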
    
    - Sean Robertson - 2015-12-17
      
      Thank you very much for looking into it and fixing what you could, Nickolay. :) I've taken a look at the code and see what you're talking about. I'll probably hack around a bit for my purposes.
      
      Quick question if you have the time: in acmod.c vs ms_mgau.c, acmod->senone_active is treated as both a score in acmod_best_score and as an index to mixtures in ms_cont_mgau_frame_eval. How can these be resolved?
      
      Best,
      Sean
      

Nickolay V. Shmyrev - 2015-12-17

Hi Sean

In this code

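~~~~~~~~~
int16 *senscr;

/* senscr walks through the array of senone scores */
senscr = acmod->senone_scores;
for (i = 0; i < acmod->n_senone_active; ++i) {
    /* senone_active[i] is the offset to the next active senone */
    senscr += acmod->senone_active[i];
    if (*senscr < best) {
        best = *senscr;
        *out_best_senid = i;
    }
}
~~~~~~~~~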

The variable name is a bit confusing: it is called senscr, but it is actually a pointer into the array of senone scores, not a score itself. senone_active uses a tricky compression method to save memory: it contains the deltas between active senone indices, not the indices themselves, so you need to accumulate the deltas to get the index of the next active senone.

So the code is correct. See acmod_flags2list for details.
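The delta encoding described above can be sketched in a few lines of standalone C. This is a simplified illustration, not the pocketsphinx implementation: it uses plain `int` for both scores and deltas, and the names `indices_to_deltas` and `best_active_score` are hypothetical. As in the snippet, lower scores are treated as better.

```c
#include <assert.h>
#include <stddef.h>

/* Encode sorted absolute indices as deltas: deltas[0] is the first index,
 * deltas[k] is idx[k] - idx[k-1] for k > 0. */
void indices_to_deltas(const int *idx, size_t n, int *deltas)
{
    int prev = 0;
    size_t k;

    for (k = 0; k < n; ++k) {
        assert(idx[k] >= prev);  /* indices must be sorted */
        deltas[k] = idx[k] - prev;
        prev = idx[k];
    }
}

/* Walk the score array by accumulating deltas and return the lowest score
 * among the active entries. *out_best_index receives its absolute index. */
int best_active_score(const int *scores, const int *deltas, size_t n,
                      int *out_best_index)
{
    const int *p;  /* pointer into the score array, like senscr above */
    int best;
    size_t k;

    assert(n > 0);
    p = scores + deltas[0];
    best = *p;
    *out_best_index = (int)(p - scores);
    for (k = 1; k < n; ++k) {
        p += deltas[k];  /* advance to the next active entry */
        if (*p < best) {
            best = *p;
            *out_best_index = (int)(p - scores);
        }
    }
    return best;
}
```

For active indices {2, 5, 6} the delta list is {2, 3, 1}, and walking the scores with it visits exactly entries 2, 5 and 6. Storing small deltas instead of full indices is what allows the compact representation.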

I think some of the pocketsphinx methods are too complex and could probably be simplified, but this is the way it goes. And it allows a very compact memory representation.


Sean Robertson - 2015-12-17

Thanks very much! You're awesome.

