Hi, I am comparing whole-utterance audio decoding with continuous streaming mode and I get quite different results with the same acoustic model, language model, and decoder.
In the first case I make just one call to ps_process_raw(self, data, data_length, no_search, TRUE).
In the second case I call ps_process_raw(self, shorts, nshorts, no_search, FALSE) several times, passing pieces of the whole utterance.
What could be the problem?
The first case gives me a good recognition result, while the second gives quite poor results that are not always consistent.
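For reference, a simplified sketch of the two calling patterns (the helper names and the chunk size are arbitrary, and the ps_start_utt/ps_get_hyp calls follow the current API, which may differ in older releases):

#include <stdio.h>
#include <pocketsphinx.h>

/* Sketch only: chunk size is arbitrary; ps_start_utt()/ps_get_hyp()
 * follow the current API (older releases take extra uttid/score args). */

static void decode_full(ps_decoder_t *ps, const int16 *data, size_t n_samples)
{
    ps_start_utt(ps);
    /* Whole utterance in a single call, full_utt = TRUE */
    ps_process_raw(ps, data, n_samples, FALSE, TRUE);
    ps_end_utt(ps);
    printf("full:      %s\n", ps_get_hyp(ps, NULL));
}

static void decode_streaming(ps_decoder_t *ps, const int16 *data, size_t n_samples)
{
    const size_t chunk = 2048;  /* arbitrary block size for the sketch */
    size_t pos = 0;

    ps_start_utt(ps);
    while (pos < n_samples) {
        size_t n = (n_samples - pos < chunk) ? n_samples - pos : chunk;
        /* Same audio, but pushed piece by piece with full_utt = FALSE */
        ps_process_raw(ps, data + pos, n, FALSE, FALSE);
        pos += n;
    }
    ps_end_utt(ps);
    printf("streaming: %s\n", ps_get_hyp(ps, NULL));
}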
Hello
The difference between the two ways of calling ps_process_raw is the following. With full processing it builds features from the whole utterance at once and then decodes. With partial processing it builds features from the submitted data and remembers them for further processing when the rest of the utterance arrives. It might be that, since the number of samples in a chunk is not a multiple of the window size, the features are calculated differently. I suggest you dump the features and compare them.
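One way to do that, if your build supports it, might be the -mfclogdir option, which should make the decoder write the MFCC features it computes for each utterance into a directory so you can diff the two runs. A rough sketch, where the option, the directory, and the model paths are assumptions you should verify against your version (for example with "pocketsphinx_continuous -help" or the ps_args() list):

#include <pocketsphinx.h>

/* Sketch: ask the decoder to dump per-utterance MFCC features so the two
 * modes can be compared offline.  The -mfclogdir option, the ./feat_dump
 * directory, and the model paths are placeholders/assumptions. */
ps_decoder_t *make_feature_dumping_decoder(void)
{
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm",       "/path/to/acoustic_model",
        "-lm",        "/path/to/language_model.lm",
        "-dict",      "/path/to/dictionary.dic",
        "-mfclogdir", "./feat_dump",   /* one feature file per utterance */
        NULL);
    return config ? ps_init(config) : NULL;
}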
Hello,
I have observed the same thing: feature calculation differs between whole-utterance decoding and continuous streaming mode. While tracing the code I found that the CMN calculation happens differently in the two modes. In whole-utterance decoding the default CMN mode is CMN_CURRENT, while for continuous streaming mode it is CMN_PRIOR. In CMN_CURRENT mode the mean of the entire current utterance is used for normalization, whereas in CMN_PRIOR the normalization starts from a user-defined initial value that is updated at the end of every utterance; the updated value then serves as the initial mean for the next utterance. This might be the reason, or one of the reasons, for the difference in features.
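If that is the cause, explicitly setting the same CMN mode in both configurations should bring the features (and the results) much closer. A rough sketch of how the option might be set when creating the decoder; the exact option values depend on the PocketSphinx version, so treat them as assumptions:

#include <pocketsphinx.h>

/* Sketch: pin the CMN mode explicitly so both setups normalize the same way.
 * The values are assumptions: "current"/"prior" are the historical names
 * (newer releases use "batch"/"live"), -cmninit is only consulted by the
 * prior/live mode, and the model paths and initial mean are placeholders. */
ps_decoder_t *make_decoder_with_fixed_cmn(void)
{
    cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
        "-hmm",     "/path/to/acoustic_model",
        "-lm",      "/path/to/language_model.lm",
        "-dict",    "/path/to/dictionary.dic",
        "-cmn",     "current",     /* or "prior" to match the streaming default */
        "-cmninit", "40,3,-1",     /* example initial cepstral mean for prior CMN */
        NULL);
    return config ? ps_init(config) : NULL;
}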