We are doing speech recognition with a grammar, but we get a weird acoustic score from the recognizer when we say something other than the sentence in the grammar. For example, we set the grammar to "How are you". When we say "How are you", the recognizer outputs the result with an acoustic score. However, when we say "tomorrow is good", the acoustic score is sometimes even higher than for the right sentence. I have several questions here:
The acoustic score is the frame-based likelihood, so if the duration is longer, the value should be smaller, right?
The acoustic score represents how well the input matches the model, right? If that is the case, why would we have this problem?
Thank you very much!
The acoustic score is the frame-based likelihood, so if the duration is longer, the value should be smaller, right?
This is correct
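As a rough sketch of why (the notation here is mine, not from pocketsphinx): the acoustic score of a hypothesis is the sum of per-frame log-likelihoods along the best state path,

$$ \mathrm{AScore}(W) \approx \max_{s_1,\dots,s_T} \sum_{t=1}^{T} \big( \log p(o_t \mid s_t) + \log p(s_t \mid s_{t-1}) \big), $$

and since every per-frame term is a log-probability (negative), the total gets smaller as the number of frames T grows. Scores of utterances with different durations are therefore not directly comparable.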
The acoustic score represents how well the input matches the model, right? If that is the case, why would we have this problem?
I do not see a problem here
If the acoustic score for the second result is higher, it means that the second phrase has a better match in the model than the first one. It might be that the whole second phrase was matched by the silence model, and if that model was not trained properly it can score quite well on the second phrase. You can check the word segmentation to see which units covered which parts of the "tomorrow is good" audio. You can also check the state segmentation from the forced aligner for a more precise picture.
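As a rough illustration of checking the word segmentation (this assumes the SWIG-generated Decoder bindings with decoder.seg() and the Segment getters; exact method names may differ between versions):

```java
import edu.cmu.pocketsphinx.Decoder;
import edu.cmu.pocketsphinx.Segment;

// After the utterance has been decoded (endUtt), walk the best path and
// print which word or filler covered which frame range. At the default
// 100 frames per second, frame 250 is 2.5 s into the audio.
static void dumpSegmentation(Decoder decoder) {
    for (Segment seg : decoder.seg()) {
        System.out.println(seg.getWord()             // word or filler (<sil>, <s>, </s>, ...)
                + "\tframes " + seg.getStartFrame()  // first frame of the segment
                + "-" + seg.getEndFrame()            // last frame of the segment
                + "\tascore " + seg.getAscore());    // acoustic score of this segment
    }
}
```

If most of the frames of "tomorrow is good" end up under <sil> or another filler, that points to the silence model explanation above.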
Thank you. Could you tell me how we can get the timestamps from pocketsphinx-android? Also, how can we use the forced aligner?
Thank you once again.
I recommend that you save the audio with the -rawlogdir option and analyze it on a desktop with sphinx3_align and similar tools.
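In pocketsphinx-android that roughly corresponds to setting the raw log directory when the recognizer is built; a minimal sketch based on the demo-style setup (the model and dictionary paths here are placeholders):

```java
import java.io.File;
import java.io.IOException;
import edu.cmu.pocketsphinx.SpeechRecognizer;
import edu.cmu.pocketsphinx.SpeechRecognizerSetup;

// Build the recognizer so that every utterance is also dumped as a raw PCM
// file into rawLogDir; copy those files to a desktop machine and force-align
// them against the expected transcript with sphinx3_align.
static SpeechRecognizer buildRecognizer(File modelsDir, File rawLogDir) throws IOException {
    return SpeechRecognizerSetup.defaultSetup()
            .setAcousticModel(new File(modelsDir, "en-us-ptm"))       // placeholder acoustic model
            .setDictionary(new File(modelsDir, "cmudict-en-us.dict")) // placeholder dictionary
            .setRawLogDir(rawLogDir)  // same effect as the -rawlogdir option
            .getRecognizer();
}
```

The saved files are headerless PCM (16 kHz, 16-bit mono by default), so you can also open them in an audio editor to check what was actually recorded.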