Hello.

After a year of development and model training, we have achieved very good results (86% true positives and 1% false positives for keyword recognition on a 3,500-phrase dataset, and 90% accuracy for grammar search), so thank you for all your work on the Sphinx family of projects.
However, I've noticed that our device's accuracy drops after hours of running. Here is how we implemented it:
We have a wrapper (generated with the SWIG configurations from your repository) that allows us to use pocketsphinx in C#. We have two instances of the Decoder class: one configured for keyword spotting, the other for grammar-based recognition.
The device is a Raspberry Pi with a microphone array, running as an assistant 24/7 in our workspace. All input is first processed by our own voice activity detection (VAD) algorithm to detect a potential single-word keyword; a single fragment containing that word is then passed to the decoder (procedure: start_utt, process_raw, end_utt, hyp, run on the whole fragment). After activation, the same procedure is used with the command-recognition decoder.
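Sketched in Python for brevity, the per-fragment procedure looks roughly like this. The Decoder here is a stand-in stub that only records the call sequence; the real object comes from the SWIG-generated pocketsphinx wrapper, and only the method names start_utt/process_raw/end_utt/hyp follow that API — everything else is illustrative:

```python
class StubDecoder:
    """Stand-in for the SWIG-generated pocketsphinx Decoder, used here
    only to illustrate the call sequence; it records calls and returns
    a fixed hypothesis."""
    def __init__(self):
        self.calls = []

    def start_utt(self):
        self.calls.append("start_utt")

    def process_raw(self, data, no_search, full_utt):
        self.calls.append("process_raw")

    def end_utt(self):
        self.calls.append("end_utt")

    def hyp(self):
        self.calls.append("hyp")
        return "activate"

def decode_fragment(decoder, fragment):
    """Run one VAD-detected fragment through the decoder as a single
    utterance; full_utt=True means the whole fragment is handed over
    in one process_raw call."""
    decoder.start_utt()
    decoder.process_raw(fragment, False, True)  # no_search=False, full_utt=True
    decoder.end_utt()
    return decoder.hyp()

dec = StubDecoder()
result = decode_fragment(dec, b"\x00\x01" * 1600)  # ~0.1 s of fake 16 kHz PCM
```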
The problem is that right after a restart of the device the accuracy is around 100%, but over time it can drop even to zero. We have also noticed that when, after a long series of faulty recognitions, the device recognizes a given command correctly, the accuracy rises again very noticeably.
To debug this, we collected every fragment passed to the decoder and fed them to our accuracy-analyzing tool (which simply passes the fragment files through the decoder one after another and checks the results). Surprisingly, for a set of fragments that had 0% accuracy on the device, the analyzing tool returns a 100% accuracy result.
The tool and the runtime application are identical in how they use pocketsphinx. So I suspect that pocketsphinx's own environment-adjustment mechanisms are causing this trouble. In the logs I see the "history entries" output, which I am not familiar with, and the CMN adjustments that I've read about in your FAQ. Our own VAD does its job quite well and keeps CPU usage much lower, allowing us to avoid running the decoder on all of the audio data.
Are there any settings or approaches that could help us get rid of this problem?
Best regards,
Michael
Last edit: Michael Wityk 2018-10-17
You can change the code to reduce the CMN update frequency: change these two constants to 8000 and 5000:
#define CMN_WIN_HWM 800 /* #frames after which window shifted */
#define CMN_WIN 500
There could be more advanced algorithms for CMN tracking; I am not sure it makes sense to implement them. In any case, you can always check the CMN values in the log and see whether they change significantly.
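To illustrate why a windowed running mean reacts slowly to level changes, here is a minimal sketch of live-style CMN tracking in Python. The constants mirror CMN_WIN and CMN_WIN_HWM above, but the update rule is a simplification for a single cepstral coefficient, not the actual sphinxbase code:

```python
CMN_WIN_HWM = 800  # frames after which the window is shifted
CMN_WIN = 500

def update_cmn(mean, accum, nframes, frame):
    """Accumulate frames; once CMN_WIN_HWM frames are seen, fold the
    accumulated average into the running mean and shrink the window
    back to CMN_WIN frames' worth of history (simplified)."""
    accum += frame
    nframes += 1
    if nframes >= CMN_WIN_HWM:
        mean = accum / nframes
        accum = mean * CMN_WIN  # keep CMN_WIN frames of history
        nframes = CMN_WIN
    return mean, accum, nframes

# Feed frames at level 1.0, then frames at level 5.0: the estimate
# follows the new input level only with a lag of thousands of frames.
mean, accum, n = 0.0, 0.0, 0
for _ in range(800):
    mean, accum, n = update_cmn(mean, accum, n, 1.0)
first_mean = mean
for _ in range(3000):
    mean, accum, n = update_cmn(mean, accum, n, 5.0)
```

Raising the window constants makes this lag even longer, i.e. the mean changes less often and more gradually.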
Thank you for your response.

I forgot to mention that we use the full_utt parameter set to true. Today I searched this forum and found out that it estimates the CMN for each file separately. If so, the cause of our problem remains unknown.
Apart from that, I found some more parameters to experiment with, such as the CMN initial value (but does it change anything if we use batch CMN?). I also tried disabling noise/silence reduction, but that dragged the overall accuracy down.
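For context, batch (full_utt) CMN subtracts each fragment's own cepstral mean, so no state carries over between fragments. A minimal sketch, using one cepstral coefficient per frame for brevity:

```python
def batch_cmn(frames):
    """Subtract the utterance's own mean from every frame, as batch
    CMN does when each file/fragment is processed independently."""
    mean = sum(frames) / len(frames)
    return [f - mean for f in frames]

quiet = batch_cmn([1.0, 2.0, 3.0])     # low-level recording
loud = batch_cmn([11.0, 12.0, 13.0])   # same shape, 10 units higher
```

Two fragments that differ only by a constant offset come out identical, which is why batch mode should be immune to slow environment drift between utterances.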
Here are the startup logs; most of the settings are at their default values. The CMN initial value was changed today; before that it was also at the default. https://pastebin.com/P6v0wbja Also, CMN used to be set to live, but I assume full_utt overrides this setting anyway (or does it?).
Here are some example waveforms of the data we feed the decoder with: https://pastebin.com/P6v0wbja
As you can see, the volume changes a lot depending on the distance to the microphone array. Every audio fragment looks like that: it doesn't exceed three seconds, and there is always some silence around the voice.
Should we maybe place SIL at the beginning and the end of the grammar entry?
Or maybe CPU load affects it? I am consistently getting worse results with the same model on the device.
Or maybe it's the two instances of the Decoder object? I don't know whether there are any static dependencies that could be shared between them.
I don't know whether it would be a good idea to introduce our own normalization that would bring all the provided audio to the same level. Also, as you can see in the waveforms, the DC offset is sometimes above zero, and I am not sure whether this is relevant. I managed to fix it with a high-pass filter, but it didn't change the overall accuracy in the test tool. Not sure about CMN.
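For reference, the DC-offset fix I tried amounts to a first-order high-pass filter. A minimal sketch in Python; the pole coefficient 0.995 is an illustrative choice, not a tuned value:

```python
def remove_dc(samples, r=0.995):
    """One-pole high-pass filter: y[n] = x[n] - x[n-1] + r * y[n-1].
    Removes the DC component while leaving speech frequencies mostly
    untouched."""
    out = []
    prev_x, prev_y = 0.0, 0.0
    for x in samples:
        y = x - prev_x + r * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out

# A constant-offset signal decays toward zero after filtering.
filtered = remove_dc([100.0] * 2000)
```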
I also have an off-topic question: I couldn't find any changelogs; do you provide any? I considered rebuilding pocketsphinx if there have been any changes since our last build in early 2017, but it was hard to check in the commit history on GitHub.
Last edit: Michael Wityk 2018-10-19
You need to look at the CMN values in the decoding log, not in the startup log.
I don't know whether it would be a good idea to introduce our own normalization that would bring all the provided audio to the same level.
If you have a moving speech source, you need custom normalization for sure. Or you can use a longer-context neural network and disable normalization altogether; you'll get much better accuracy then.
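One simple form of such custom normalization is per-fragment RMS gain scaling, so every fragment reaches the decoder at a comparable level regardless of speaker distance. A minimal sketch in Python; the target RMS of 0.1 is an arbitrary illustrative value:

```python
import math

def rms_normalize(samples, target_rms=0.1):
    """Scale the fragment so its root-mean-square level equals
    target_rms, compensating for varying distance to the microphone."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / rms
    return [s * gain for s in samples]

near = rms_normalize([0.4, -0.4, 0.4, -0.4])      # loud, close-talking
far = rms_normalize([0.01, -0.01, 0.01, -0.01])   # quiet, distant
```

After scaling, both fragments have the same RMS level, so the decoder sees a consistent input level from utterance to utterance.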
I also have an off-topic question: I couldn't find any changelogs; do you provide any? I considered rebuilding pocketsphinx if there have been any changes since our last build in early 2017, but it was hard to check in the commit history on GitHub.
There were no significant changes.