When performing word recognition, the first utterance is often detected very poorly. After that, accuracy is great.
I have a recording of a man saying (very clearly):
When I run this recording through pocketsphinx_continuous (exact invocation and output below), the result is that the first sentence is recognized as garbage while the second sentence is recognized perfectly.
I edited the WAVE file, looping it once. So the WAVE file now contains:
Now, the output becomes:
So the same utterance that was recognized as garbage the first time was recognized perfectly later on!
My questions are:
Details
Here is the looped WAVE file.
The command line is:
pocketsphinx_continuous.exe -infile marley-looped.wav -hmm cmusphinx-en-us-5.2 -lm ..\..\..\model\en-us\en-us.lm.bin -dict ..\..\..\model\en-us\cmudict-en-us.dict
Here is the full output:
You can see in the log that the initial CMN estimation is off. You can set a better cmninit value and it will recognize words from the start.
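For example, the invocation from the Details section above could pass an explicit starting estimate via -cmninit (the numbers below are purely illustrative; the right ones are whatever the CMN lines in your own log converge to):

    pocketsphinx_continuous.exe -infile marley-looped.wav -hmm cmusphinx-en-us-5.2 -lm ..\..\..\model\en-us\en-us.lm.bin -dict ..\..\..\model\en-us\cmudict-en-us.dict -cmninit 41,-5,13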
Thanks, that makes sense!
I'd like to automatically determine initial CMN values for a given WAVE file. I found another thread where you recommend this approach:
no initial estimate -> record full utterance -> normalize only last CMN (current mode) -> decode
few decoding cycles are done -> have reliable CMN estimate -> normalize CMN (live mode)
Is there any existing code I can look at to see how this is done?
After a bit more research, my understanding is this:
pocketsphinx_continuous continually adapts the CMN values to the input by using a variation of a sliding-window approach. This allows for low latency, but can lead to poor results at the start, or immediately after the recording characteristics have changed in mid-recording.
pocketsphinx_batch, on the other hand, does not use historic cepstral values. Instead, it analyzes each utterance as a whole, determines the actual mean value for this utterance, and subtracts it.
So I assume that batch mode will always give better results and should be preferred whenever latency is not an issue. Is this correct?
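Schematically, the difference between the two schemes looks something like this. This is an illustrative sketch only, not the actual sphinxbase code (the real implementations are in cmn.c and cmn_live.c):

    #include <stddef.h>

    #define CEPLEN 13  /* cepstral vector length used by the default front end */

    /* Batch-style CMN (cf. cmn.c): mean over the whole utterance, then subtract. */
    static void cmn_batch(float cep[][CEPLEN], size_t nframes)
    {
        float mean[CEPLEN] = {0};
        for (size_t t = 0; t < nframes; t++)
            for (int i = 0; i < CEPLEN; i++)
                mean[i] += cep[t][i];
        for (int i = 0; i < CEPLEN; i++)
            mean[i] /= (float)nframes;
        for (size_t t = 0; t < nframes; t++)
            for (int i = 0; i < CEPLEN; i++)
                cep[t][i] -= mean[i];
    }

    /* Live-style CMN (cf. cmn_live.c): subtract the current running estimate as
       each frame arrives, then fold the raw frame back into the estimate, so the
       mean slides with the input instead of being exact per utterance. */
    static void cmn_live_frame(float cep[CEPLEN], float mean[CEPLEN], float alpha)
    {
        for (int i = 0; i < CEPLEN; i++) {
            float raw = cep[i];
            cep[i] -= mean[i];
            mean[i] = (1.0f - alpha) * mean[i] + alpha * raw;
        }
    }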
Sort of. Batch also has issues, for example if half of your audio is loud and half is quiet speech (a call recording with two speakers mixed, say). The best solution would be short-term normalization, which normalizes over a range of about 0.1 seconds. That should be part of new acoustic model research, though, and is a pretty complex problem.
Thank you, Nickolay; that makes sense. I'm working with dialog recordings for computer games, so there will never be multiple speakers in a single recording. The volume should be pretty stable within an utterance. So I'll look into the code for pocketsphinx_batch.
Two questions regarding that:
1. I assume that what I described is what pocketsphinx_batch calls the 'current' normalization scheme, whereas 'prior' would always use the CMN values from the previous utterance. When does it make sense to use 'prior' mode?
2. pocketsphinx_batch expects a "file listing utterances to be processed". I assume that this file must contain the names of WAVE files plus timecodes of utterances, but I couldn't find any documentation on the exact file format. Could you point me to some documentation or an example file?
There is no 'prior/current' anymore; it is now called 'live' or 'batch'. You can use 'batch' if you don't need continuous processing. But again, it needs testing. For example, if you have many very short utterances like 'yes', batch is not very efficient for them; live gives a more reliable estimation. On the other hand, batch is used for training, so in testing, batch is closer to training. If you have high-quality, volume-normalized audio like in games, both methods should be fine; there should be no difference at all.
I don't think timecodes are frequently used. http://cmusphinx.sourceforge.net/wiki/tutorialtuning explains how to run the batch tool. You can call ps_process_raw from the API with final_utt set to TRUE to use batch processing.
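For what it's worth, the control file described in that tutorial is just one utterance id per line: the audio file's path relative to the data directory, without the extension. Something like this (names made up):

    speaker1/utt001
    speaker1/utt002
    speaker2/utt001

pocketsphinx_batch then derives the actual file names from its -cepdir and -cepext options (e.g. -cepdir wav -cepext .wav, with -adcin yes for audio rather than cepstral input).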
I understand your argument concerning very short utterances. A similar argument can be made for "utterances" detected by VAD that actually contain only breathing. In batch mode, calculating CMN values based on such an utterance probably won't give ideal results.
Given that my recordings are usually very "stable", I had the following idea: I could concatenate the utterances of an entire recording into a single long utterance, then analyze the first 10 seconds or so of this combined utterance to get reliable CMN values for the entire recording. Then I could use these fixed CMN values for all utterances of the recording. That way, anomalies like very short utterances or breath utterances won't affect the CMN values.
I'd need to do two things:
1. Analyze a number of samples with minimal processing, just to get the CMN values.
2. Perform word recognition and alignment on an utterance using these fixed CMN values.
Is this possible using Pocketsphinx?
It might be possible with small code modifications.
That's good to hear. I'll look into it, then.
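One conceivable shape for that two-pass idea, using only public API calls, is sketched below. This is a rough sketch under assumptions: the fixed values would be read off the CMN lines in the log of a first pass, and -cmninit only seeds the estimate (live mode still updates it afterwards), so truly freezing the values would need the small code modifications mentioned above. seed_cmn is a hypothetical helper:

    #include <pocketsphinx.h>

    /* Hypothetical helper: re-seed the front end with fixed initial CMN
       values, e.g. "41,-5,13" taken from the CMN log lines of a first pass. */
    static void seed_cmn(ps_decoder_t *ps, const char *cmn_values)
    {
        cmd_ln_t *config = ps_get_config(ps);
        cmd_ln_set_str_r(config, "-cmninit", cmn_values);
        ps_reinit(ps, NULL); /* rebuild the front end with the new setting */
    }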
One more question before I do. You said: "You can call ps_process_raw from API with final_utt set to TRUE to use batch processing." I don't see a final_utt parameter in ps_process_raw. My understanding was that to use batch CMN mode, I simply specify -cmn batch when creating the decoder configuration.
So what is the correct way to use batch CMN in a program that's similar to pocketsphinx_continuous?
It is full_utt
http://cmusphinx.sourceforge.net/doc/pocketsphinx/pocketsphinx_8h.html#a572ad08651b4caae820d178a12c8f95f
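In code, that would be something like the following minimal sketch (audio loading and error handling omitted; it assumes the decoder configuration was created with -cmn set to batch):

    #include <pocketsphinx.h>

    /* Feed one complete utterance in a single call, so that batch CMN
       computes the cepstral mean over all of its frames at once. */
    static const char *decode_whole_utt(ps_decoder_t *ps,
                                        const int16 *samples, size_t n_samples)
    {
        ps_start_utt(ps);
        /* no_search = FALSE: run recognition as usual;
           full_utt = TRUE: this buffer is the entire utterance. */
        ps_process_raw(ps, samples, n_samples, FALSE, TRUE);
        ps_end_utt(ps);
        return ps_get_hyp(ps, NULL);
    }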
First, thank you Nickolay for the answers on the other thread and also here.
I now got batch mode to work. In my case, like Daniel, I also had to change the "-cmn" setting in the feat.params file to "batch"; using set_string in the configuration somehow didn't work.
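For anyone trying the same thing: feat.params in the acoustic model directory is a plain list of front-end flags, one per line, so the edit amounts to replacing the existing -cmn line with:

    -cmn batch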
The only thing is that the very first decoding run still gives slightly different values. After that, decoding the same file over and over (which I did to observe the behaviour of the decoder) delivers exactly the same CMN values every time, to my satisfaction, because I now have reproducible results in principle.
I wonder why this is happening?
You could at least provide the log to give more information about your problem.
Of course ;)
You can see that the first cmn.c run reports slightly different values than the second one, for the same file being decoded.
You mention setting -cmn batch. Did you also change the call to ps_process_raw so that you pass an entire utterance at once and also pass TRUE for full_utt?
I didn't check the log details, but doing so vastly increased the recognition quality for me.
Yes, I did that. It is also very important for me, because if I don't set full_utt to TRUE in the call, batch mode falls back to live mode for that decoding run. (You can see cmn_live.c updating in the log instead of cmn.c.)
Here's an extract of the log, showing just the CMN values and the corresponding WER (it's very bad quality audio, and some words are even missing from the dictionary, so the WER in this case is not representative, but it's good for seeing the difference between the first and second decoding passes):
If I run it again, the second set of CMN values and the WER stay constant.
At the moment I decode every file twice and fetch the recognized words only the second time; that makes my program produce reproducible results.
But I'm still wondering how the same file can produce different CMN values in batch mode, at least in the first calculation. The full log is in my previous post.
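That workaround could look roughly like this, reusing the decode_whole_utt sketch from above (the first hypothesis is simply thrown away):

    /* Warm-up pass: result ignored, lets the adaptive processing settle. */
    decode_whole_utt(ps, samples, n_samples);
    /* Second pass over the same data now yields stable CMN values. */
    const char *hyp = decode_whole_utt(ps, samples, n_samples);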
There is also noise and silence removal, which need some time to adapt. You can disable them with -remove_noise no and -remove_silence no if your audio has no noise and the silence is already stripped. With noise and silence removal disabled, the results must be identical.
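For example, appended to the invocation used earlier in this thread:

    pocketsphinx_continuous.exe -infile marley-looped.wav -hmm cmusphinx-en-us-5.2 -lm ..\..\..\model\en-us\en-us.lm.bin -dict ..\..\..\model\en-us\cmudict-en-us.dict -remove_noise no -remove_silence no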
Thanks, that's interesting.
Does this "-remove_noise" only effect the non speech parts, equivalent to the silence removal i guess, or is it like a general processing step of noise reduction for the whole file?
remove_noise works on the whole file, including speech parts.