Hi,
I am currently playing around with the pocketsphinx API using live input data. The results get really good after a while, but it always takes a long warm-up phase, I guess to optimize the CMN values. To shorten this warm-up phase I tried adjusting the cmninit value, but as I understand it CMN compensates for device-specific and environmental characteristics, so a tuned value would only help on my own device (and I could not get any improvement even there). I also tried to re-decode the first frame of speech, so that pocketsphinx decodes the frame once more after tuning its CMN (like proposed here), but I guess one frame is just not enough. Furthermore, I'd like to understand the meaning of the values: what would e.g. 44,3,-6 stand for? Do you have any suggestions concerning this problem? Thanks in advance...
To understand speech recognition theory it is helpful to read the textbook:
http://www.amazon.com/Spoken-Language-Processing-Algorithm-Development/dp/0130226165
There is a whole chapter about feature extraction.
Our plan for robust CMN is to implement a buffer that stores the first 2 seconds of audio before decoding (not a single frame) and to estimate the initial values from it. I wrote about this in the original node-pocketsphinx discussion.
This feature would require quite a big rework of the pocketsphinx framework, though, so it's delayed. It is still planned for the next release.
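To make the buffering idea concrete, here is a minimal sketch in plain Python, not pocketsphinx code. As far as I understand, the cmninit values (e.g. 44,3,-6) are the starting estimates of the mean of the first cepstral coefficients, one number per coefficient, and CMN subtracts that mean from every frame. The function names, the 100 frames per second rate, and the frame layout (a list of cepstral vectors) are assumptions made for illustration only.

```python
def estimate_initial_cmn(frames, frame_rate=100, seconds=2.0):
    """Average each cepstral coefficient over the first `seconds` of audio.

    `frames` is a list of cepstral vectors (lists of floats).
    `frame_rate` is an assumed number of frames per second.
    """
    n = min(len(frames), int(frame_rate * seconds))
    if n == 0:
        raise ValueError("need at least one buffered frame")
    dims = len(frames[0])
    sums = [0.0] * dims
    for frame in frames[:n]:
        for i, coeff in enumerate(frame):
            sums[i] += coeff
    return [s / n for s in sums]


def apply_cmn(frames, mean):
    """Subtract the estimated mean from every frame (the normalization step)."""
    return [[c - m for c, m in zip(frame, mean)] for frame in frames]


def format_cmninit(mean, digits=2):
    """Render the mean as a comma-separated string, like a cmninit value."""
    return ",".join(f"{m:.{digits}f}" for m in mean)


if __name__ == "__main__":
    # Two made-up frames with three coefficients each, for illustration.
    buffered = [[44.0, 3.0, -6.0], [46.0, 5.0, -4.0]]
    mean = estimate_initial_cmn(buffered)
    print(format_cmninit(mean))
    print(apply_cmn(buffered, mean))
```

With a mean estimated from a couple of seconds of buffered audio rather than a single frame, the first utterance is already normalized with values that match the current device and room, which is the warm-up the live estimate otherwise needs.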
OK, thanks Nickolay. That's what I expected. I guess I'll just wait for the next release then.