
change cmn mode

  • Jonas Helm

    Jonas Helm - 2016-09-15

    Hi, I'm using pocketsphinx 0.1.0 with Python 3.5 and I have a few questions concerning cepstral mean normalization (CMN).

    1) What are the exact differences between "current", "live" and "prior" CMN?

    2) When I change the feat.params file to say "-cmn prior", the "Current configuration:" text printed while the decoder is being configured shows "-cmn [VALUE] prior", but the first INFO line then says:

    INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='live', VARNORM='no', AGC='none'

    While decoding, it also says CMN live:
    INFO: cmn_live.c(88): Update from < 40.00 3.00 -1.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 >
    INFO: cmn_live.c(105): Update to < 58.21 14.48 -0.80 16.94 -5.80 0.03 1.24 0.23 -1.61 -2.09 -2.75 -5.72 -5.38 >

    3) The reason I wanted to use "prior" was this:
    I am decoding different audio files (each shorter than ~10 s) one after another, with varying quality (different recording systems, different environmental influences, SNRs etc.). Now, if I decode a file once, the CMN update happens at some point during the utterance and the accuracy of that output is not so good. If I then decode the same file again, the accuracy is higher (my guess is that the normalization is now done with the new CMN coefficients, derived from that same file, so the feature extraction is more robust or accurate). So my idea was to process a file once just to obtain fitting CMN coefficients, and then a second time to collect the recognition result (I'm working with prerecorded files, so the extra time is not an issue).
    I don't know if prior is even the right choice for this, so I would appreciate any help!
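
    To make the idea concrete, here is a rough sketch of the two-pass loop I have in mind (simplified; the import path and decoder configuration depend on the pocketsphinx package version, and audio_files is just a placeholder list of WAV paths):

    import wave
    from pocketsphinx.pocketsphinx import Decoder  # or: from pocketsphinx import Decoder, depending on the version

    config = Decoder.default_config()
    # config.set_string('-hmm', ...), config.set_string('-lm', ...), config.set_string('-dict', ...) as usual
    decoder = Decoder(config)

    def decode_once(path):
        # read the whole file and decode it as one utterance
        with wave.open(path, 'rb') as wav:
            data = wav.readframes(wav.getnframes())
        decoder.start_utt()
        decoder.process_raw(data, False, False)
        decoder.end_utt()
        return decoder.hyp().hypstr if decoder.hyp() else ''

    for path in audio_files:
        decode_once(path)            # first pass: only to let the CMN estimate adapt to this file
        result = decode_once(path)   # second pass: the hypothesis I actually keep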

    4) Out of interest, maybe an expert can tell me how different CMN coefficients can be calculated for the same file (maybe it has something to do with which portions of the audio are used for this!?). And how often is the calculation done (I read something about 500 frames in this forum? How long is one frame?)

    Thank you in advance!!! ;)

     
    • Nickolay V. Shmyrev

      1) What are the exact differences between "current", "live" and "prior" CMN?

      "live" is a new name of "prior". "current" is same as "batch". They were renamed for conistency.

      While decoding, it also says CMN live:

      In continuous processing mode, live CMN is automatically enabled. Batch mode is only used in pocketsphinx_batch.

      So my idea was to process a file once just to obtain fitting CMN coefficients, and then a second time to collect the recognition result (I'm working with prerecorded files, so the extra time is not an issue).
      I don't know if prior is even the right choice for this, so I would appreciate any help!

      I would simply set a reasonable cmninit estimate in the feat.params file and it should work ok. You can take the values printed in the log. You could also reprocess the beginning of the file, but you would have to modify the pocketsphinx code for that.
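
      For example, something like this in feat.params (the numbers are simply the "Update to" values from the log you posted; the option takes a comma-separated list):

      -cmninit 58.21,14.48,-0.80,16.94,-5.80,0.03,1.24,0.23,-1.61,-2.09,-2.75,-5.72,-5.38

      The same option can also be set from Python before creating the decoder, roughly with config.set_string('-cmninit', '58.21,14.48,-0.80,...').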

      4) Out of interest, maybe an expert can tell me how different CMN coefficients can be calculated for the same file (maybe it has something to do with which portions of the audio are used for this!?). And how often is the calculation done (I read something about 500 frames in this forum? How long is one frame?)

      Yes, it also depends on the history. The estimate is updated with a sliding window every 5 seconds, i.e. every 500 frames; one frame is 1/100 of a second.

       
  • Jonas Helm

    Jonas Helm - 2016-09-15

    Thank you for the quick answer!
    As I understand it, when I set the cmninit values, they are set in the configuration but "only" used for the first five seconds of decoding, until the next cmn_live estimate or update is done, right? If, for example, after decoding a file about 6 s long, I then choose another file with totally different kinds of convolutional distortions, the cmninit values would of course no longer have any influence, because they would already have been updated.
    So if I always gave the audio file to the decoder twice (like I'm doing at the moment), it could be that during the first decoding pass the CMN values are sometimes updated and sometimes not, depending on the length and the history.
    And if every file were at least 5 seconds long, my approach would work and I could be sure that I always have updated CMN values when starting to process the file the second time?
    You see my "problem": cmninit may not help me, because I'm processing one file after another in a loop, and the files may have very different acoustic properties.
    I hope my thoughts can be understood :D

     

    Last edit: Jonas Helm 2016-09-15
    • Nickolay V. Shmyrev

      If your files are all short and you don't need a very fast response, you can use ps_process_raw with the last argument (full_utt) set to TRUE. Then it will use batch CMN and process the whole file at once.
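
      In the Python bindings that would look roughly like this (a sketch; full_utt is the third argument of process_raw, the decoder is assumed to be already configured, and the file is assumed to be 16 kHz, 16-bit mono raw PCM):

      with open('utterance.raw', 'rb') as f:
          data = f.read()

      decoder.start_utt()
      decoder.process_raw(data, False, True)   # no_search=False, full_utt=True: whole file at once -> batch CMN
      decoder.end_utt()
      print(decoder.hyp().hypstr if decoder.hyp() else '')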

       
  • Jonas Helm

    Jonas Helm - 2016-09-16

    That's interesting.
    I did this now, but I still get a worse recognition result the first time I decode the file and a better result in the second run.
    I thought that with this approach the first run would already be fine, because the CMN update would be done directly for this file? Or is it rather the case that with this approach I can be sure that during the second run the right CMN values are used for normalization?

    Or can I somehow have the CMN values calculated and then freeze them for some time?

    Also, how long or short would you suggest the files should be to use batch CMN?

    The first run looks like this:

    INFO: cmn_live.c(120): Update from < 56.54 15.29  1.58  8.86 -4.96 12.71 -8.18  2.65 -5.99 -4.90  4.62 -1.29  2.24 >
    INFO: cmn_live.c(138): Update to   < 61.94 10.59 -16.40 -1.37 -13.80  2.51 -1.88 -6.83  1.16 -7.47 -3.23 -1.32  0.22 >
    INFO: ngram_search.c(467): Resized score stack to 200000 entries
    INFO: ngram_search.c(459): Resized backpointer table to 10000 entries
    INFO: ngram_search.c(467): Resized score stack to 400000 entries
    INFO: ngram_search.c(459): Resized backpointer table to 20000 entries
    INFO: ngram_search_fwdtree.c(949): cand_sf[] increased to 64 entries
    INFO: cmn_live.c(120): Update from < 61.94 10.59 -16.40 -1.37 -13.80  2.51 -1.88 -6.83  1.16 -7.47 -3.23 -1.32  0.22 >
    INFO: cmn_live.c(138): Update to   < 61.94 10.59 -16.40 -1.37 -13.80  2.51 -1.88 -6.83  1.16 -7.47 -3.23 -1.32  0.22 >
    INFO: ngram_search_fwdtree.c(1550):    16863 words recognized (62/fr)
    INFO: ngram_search_fwdtree.c(1552):   919966 senones evaluated (3407/fr)
    INFO: ngram_search_fwdtree.c(1556):  5205933 channels searched (19281/fr), 159141 1st, 427843 last
    INFO: ngram_search_fwdtree.c(1559):    25153 words for which last channels evaluated (93/fr)
    INFO: ngram_search_fwdtree.c(1561):   411977 candidate words for entering last phone (1525/fr)
    INFO: ngram_search_fwdtree.c(1564): fwdtree 5.94 CPU 2.201 xRT
    INFO: ngram_search_fwdtree.c(1567): fwdtree 6.08 wall 2.251 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 666 words
    INFO: ngram_search_fwdflat.c(948):    15756 words recognized (58/fr)
    INFO: ngram_search_fwdflat.c(950):   508763 senones evaluated (1884/fr)
    INFO: ngram_search_fwdflat.c(952):  1359512 channels searched (5035/fr)
    INFO: ngram_search_fwdflat.c(954):    72329 words searched (267/fr)
    INFO: ngram_search_fwdflat.c(957):    39579 word transitions (146/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 2.25 CPU 0.832 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 2.24 wall 0.829 xRT
    INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.253
    INFO: ngram_search.c(1276): Eliminated 3 nodes before end node
    INFO: ngram_search.c(1381): Lattice has 1978 nodes, 66293 links
    INFO: ps_lattice.c(1380): Bestpath score: -12349
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:253:268) = -618981
    INFO: ps_lattice.c(1441): Joint P(O,S) = -853917 P(S|O) = -234936
    INFO: ngram_search.c(1027): bestpath 0.72 CPU 0.267 xRT
    INFO: ngram_search.c(1030): bestpath 0.74 wall 0.276 xRT
    

    The second run looks like this:

    INFO: cmn_live.c(120): Update from < 61.94 10.59 -16.40 -1.37 -13.80  2.51 -1.88 -6.83  1.16 -7.47 -3.23 -1.32  0.22 >
    INFO: cmn_live.c(138): Update to   < 61.86 10.41 -16.26 -1.32 -13.85  2.55 -2.02 -6.96  1.25 -7.46 -3.24 -1.28  0.13 >
    INFO: ngram_search.c(467): Resized score stack to 800000 entries
    INFO: cmn_live.c(120): Update from < 61.86 10.41 -16.26 -1.32 -13.85  2.55 -2.02 -6.96  1.25 -7.46 -3.24 -1.28  0.13 >
    INFO: cmn_live.c(138): Update to   < 61.86 10.41 -16.26 -1.32 -13.85  2.55 -2.02 -6.96  1.25 -7.46 -3.24 -1.28  0.13 >
    INFO: ngram_search_fwdtree.c(1550):    19376 words recognized (75/fr)
    INFO: ngram_search_fwdtree.c(1552):   918877 senones evaluated (3562/fr)
    INFO: ngram_search_fwdtree.c(1556):  5355185 channels searched (20756/fr), 152000 1st, 550723 last
    INFO: ngram_search_fwdtree.c(1559):    31010 words for which last channels evaluated (120/fr)
    INFO: ngram_search_fwdtree.c(1561):   409664 candidate words for entering last phone (1587/fr)
    INFO: ngram_search_fwdtree.c(1564): fwdtree 5.54 CPU 2.147 xRT
    INFO: ngram_search_fwdtree.c(1567): fwdtree 5.54 wall 2.148 xRT
    INFO: ngram_search_fwdflat.c(302): Utterance vocabulary contains 924 words
    INFO: ngram_search_fwdflat.c(948):    14048 words recognized (54/fr)
    INFO: ngram_search_fwdflat.c(950):   508828 senones evaluated (1972/fr)
    INFO: ngram_search_fwdflat.c(952):  1427446 channels searched (5532/fr)
    INFO: ngram_search_fwdflat.c(954):    77711 words searched (301/fr)
    INFO: ngram_search_fwdflat.c(957):    51757 word transitions (200/fr)
    INFO: ngram_search_fwdflat.c(960): fwdflat 2.15 CPU 0.834 xRT
    INFO: ngram_search_fwdflat.c(963): fwdflat 2.18 wall 0.846 xRT
    INFO: ngram_search.c(1250): lattice start node <s>.0 end node </s>.214
    INFO: ngram_search.c(1276): Eliminated 4 nodes before end node
    INFO: ngram_search.c(1381): Lattice has 1863 nodes, 49661 links
    INFO: ps_lattice.c(1380): Bestpath score: -11688
    INFO: ps_lattice.c(1384): Normalizer P(O) = alpha(</s>:214:256) = -657338
    INFO: ps_lattice.c(1441): Joint P(O,S) = -914868 P(S|O) = -257530
    INFO: ngram_search.c(1027): bestpath 0.39 CPU 0.152 xRT
    INFO: ngram_search.c(1030): bestpath 0.38 wall 0.150 xRT
    
     

    Last edit: Jonas Helm 2016-09-16
    • Nickolay V. Shmyrev

      Or can I somehow have the CMN values calculated and then freeze them for some time?

      This feature is not supported by pocketsphinx yet.

      Also, how long or short would you suggest the files should be to use batch CMN?

      3-10 seconds is enough.

      You can continue the discussion in the other thread:

      https://sourceforge.net/p/cmusphinx/discussion/help/thread/51e2979b

       
