sphinx_fe.exe -i noam-bug4-sil.wav -o noam-bug4-sil.mfc -nist no -raw no -mswav yes -samprate 16000 -nfilt 40 -lowerf 133.3334 -upperf 6855.4976
Hello,
I have trained a model and use it with pocketsphinx_batch with good results.
When I tried it with pocketsphinx_continuous the results were not good.
The log files (batch and continuous), audio, mfc, model and training configuration are in:
https://www.dropbox.com/s/7i0fdwwx06zr3t5/n4.zip?dl=0
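The mfc file was created by the command:
sphinx_fe.exe -i noam-bug4-sil.wav -o noam-bug4-sil.mfc -nist no -raw no -mswav yes -samprate 16000 -nfilt 40 -lowerf 133.3334 -upperf 6855.4976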
The differences are substantial, yet the sample rate in the continuous run is 16,000
and the CMN values look similar (I copied the batch CMN values into the log of the
continuous run to make comparison easy).
What could be the reason for the different outputs?
I tried with the sphinx5-prealpha binaries (from "cmusphinx/files", not the trunk)
and got the same phenomenon: batch was good but continuous was not.
The models were trained using sphinxtrain-0.8 so I used "-remove_noise no" in the
sphinx5-prealpha runs - as advised in another discussion.
What could be the reason for the difference?
Many thanks,
Yuval
Hello Yuval
Sorry, there is no dictionary or LM in your archive, so it's hard to reproduce the results, but most likely it's due to the CMN values. Try -cmninit and it should be better.
As for the same values being printed: in continuous mode the CMN is computed after the utterance, so it is used for the next utterance, while in batch mode the CMN is computed before the utterance is decoded. That is the difference. So even though the values are similar, the initial value still has more effect.
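The difference described above can be illustrated with a small sketch. This is a simplified model, not the pocketsphinx implementation: the function names and the toy numbers are made up for illustration, and the live variant simply replaces the mean after each utterance, whereas a real decoder would likely blend old and new estimates.

```python
def utterance_mean(frames):
    """Per-dimension mean of a list of cepstral frames."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


def subtract(frames, mean):
    """Subtract a fixed mean vector from every frame."""
    return [[x - m for x, m in zip(f, mean)] for f in frames]


def batch_cmn(frames):
    """Batch style: the mean comes from the same utterance it is applied to."""
    return subtract(frames, utterance_mean(frames))


def live_cmn(utterances, init_mean):
    """Live style (simplified): each utterance is normalized with the mean
    known before it started; the mean is refreshed only afterwards."""
    mean = list(init_mean)
    normalized = []
    for frames in utterances:
        normalized.append(subtract(frames, mean))
        mean = utterance_mean(frames)  # takes effect on the next utterance
    return normalized


# Toy 2-dimensional "cepstra"; the numbers are made up for illustration.
utt1 = [[13.0, 1.0], [15.0, -1.0]]
utt2 = [[14.0, 0.5], [14.0, -0.5]]

print(batch_cmn(utt1))
print(live_cmn([utt1, utt2], init_mean=[8.0, 0.0]))
```

With a poor initial mean (8.0 here, against a true mean of 14.0), the first utterance keeps an offset of 6 in its first coefficient in the live variant, while the second utterance comes out centered. A -cmninit value close to the real mean removes that offset from the first utterance.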
It worked, thanks!
The first CMN value was 13.92, and I got the correct recognition with -cmninit values from 12 to 17.
Under what conditions can I expect ps-continuous to give similar (even if not identical)
results to ps-batch?
Does it have to do with properties of the training data?
Or should I use the CMN values from the log of the sphinxtrain decode phase?
Last edit: Yuval Karon 2015-01-05
Proper CMN estimation is important, and it's not trivial to do in a continuous setting. Batch CMN actually has its own disadvantages, because it cannot properly estimate CMN from short utterances either.
In sphinx4 we implemented a combined method where we read the first few seconds of audio, process them in batch mode, and then switch to live continuous mode. In pocketsphinx this approach is not implemented yet.
The initial mean estimate affects only the first utterance. It's OK to use the initial estimate from sphinxtrain; by the second utterance it no longer matters.
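A rough sketch of the combined idea (batch over a warm-up window, then live updates) might look like the following. It is not the sphinx4 code: the function name, the `warmup` parameter and the cumulative-average update are assumptions chosen to keep the example short.

```python
def warmup_then_live_cmn(frames, warmup):
    """Normalize the first `warmup` frames with their own mean (batch style),
    then normalize each later frame with the current running mean and update
    that mean afterwards (live style)."""
    if warmup < 1:
        raise ValueError("warmup must be at least 1 frame")
    dim = len(frames[0])
    head = frames[:warmup]
    # Batch estimate over the warm-up window.
    mean = [sum(f[d] for f in head) / len(head) for d in range(dim)]
    count = len(head)
    out = [[f[d] - mean[d] for d in range(dim)] for f in head]
    # Live phase: apply the current mean, then fold the new frame into it.
    for f in frames[warmup:]:
        out.append([f[d] - mean[d] for d in range(dim)])
        count += 1
        mean = [m + (x - m) / count for m, x in zip(mean, f)]
    return out, mean


# One-dimensional toy frames; the numbers are made up for illustration.
normalized, final_mean = warmup_then_live_cmn(
    [[10.0], [14.0], [12.0], [16.0]], warmup=2
)
print(normalized, final_mean)
```

At the usual 100 frames per second, a warm-up of a few seconds would correspond to a `warmup` of a few hundred frames. The trade-off is latency: nothing can be decoded until the warm-up window has been read.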
Thank you!
Forgive my ignorance: does the computation of CMN values involve the LM and dictionary, or are they inherent properties of the audio?
I would like to use ps-continuous for keyword search. If the CMN values are properties of the audio alone, perhaps I could estimate them with a batch run, even if the audio contains
words missing from the dictionary, and then use them for KWS? Does that make sense? (Sounds too good...)
Is the continuous CMN estimation more reliable than batch for short utterances?
(Then, if the input contains several utterances, would it be better to calculate the CMN
values from the concatenation of the utterances?)
Yuval
CMN is a property of the audio. Essentially it's the volume in different frequency bands.
Like I wrote above, an intelligent CMN algorithm could be implemented. For example, you might estimate CMN from the first 5 seconds of speech and then proceed with that estimate in live mode. If you have the whole audio, you can also estimate CMN for the whole file at once.
As for short utterances and estimating from the concatenation of several utterances: this is correct.
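As a hedged illustration of estimating the mean for a whole file at once, the sketch below averages the cepstra stored in a feature file such as the one written by sphinx_fe. It assumes the usual Sphinx .mfc layout (a 4-byte integer count followed by that many float32 values), 13 coefficients per frame, and that the file holds cepstra before any mean normalization. The function names are made up for this example.

```python
import struct


def parse_sphinx_mfc(raw, ncep=13):
    """Split the bytes of a Sphinx .mfc file into frames of `ncep` floats.

    Assumed layout: a 4-byte integer holding the number of float32 values,
    followed by that many float32 values. Byte order is detected by checking
    the header against the data length.
    """
    for endian in (">", "<"):
        (n_floats,) = struct.unpack(endian + "i", raw[:4])
        if n_floats >= 0 and 4 + 4 * n_floats == len(raw):
            break
    else:
        raise ValueError("header does not match data length")
    values = struct.unpack("%s%df" % (endian, n_floats), raw[4:])
    return [list(values[i:i + ncep]) for i in range(0, n_floats - ncep + 1, ncep)]


def read_sphinx_mfc(path, ncep=13):
    """Read a feature file from disk and return its frames."""
    with open(path, "rb") as fh:
        return parse_sphinx_mfc(fh.read(), ncep)


def cepstral_mean(frames):
    """Per-coefficient mean over all frames, i.e. a whole-file CMN estimate."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


# Example (path taken from the command earlier in the thread):
# mean = cepstral_mean(read_sphinx_mfc("noam-bug4-sil.mfc"))
```

The resulting vector could then serve as a starting point for -cmninit, which accepts comma-separated initial values. Whether the whole vector or only the first coefficient matters in practice is something to verify on your own data.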
I see, thank you,
Yuval
Nickolay, you wrote above that we can estimate CMN for the whole audio file at once. How can I do this?
pocketsphinx_batch can give the CMN values for the audio files. Is this what you were talking about?
If possible, is there an example that explains the exact difference between pocketsphinx_batch and pocketsphinx_continuous?