Hi everyone,
It seems that decoding accuracy varies with the quality of the audio (e.g. noise, background, volume). Can anyone recommend how to maximize decoding accuracy through audio quality? What properties of an audio file lead to better decoding accuracy, and what factors affect the recognition rate?
thanks...
Hm, you need to implement noise cancellation. There are also advanced noise subtraction algorithms based on ML. And you should probably use a different feature set if you train the model yourself.
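For anyone who wants a starting point, here is a minimal sketch of classic magnitude spectral subtraction (not one of the ML-based methods mentioned above). It assumes NumPy is available, that the audio is already loaded as a mono float array, and that the first few frames contain only background noise. The function name `spectral_subtract` and its parameters are hypothetical, chosen for illustration; nothing here is part of Sphinx.

```python
import numpy as np

def spectral_subtract(signal, noise_frames=5, frame_len=256, floor=0.02):
    """Rough magnitude spectral subtraction with 50% overlap-add.

    Assumption: the first `noise_frames` frames hold noise only.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    hop = frame_len // 2
    # Periodic Hann window: copies shifted by half a frame sum to 1.
    window = np.hanning(frame_len + 1)[:-1]
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    # Noise estimate: average magnitude spectrum of the leading frames.
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the estimate, keeping a small spectral floor so that
    # magnitudes never go negative.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    # Overlap-add the frames. The first half frame stays tapered and any
    # samples past the last full frame are left at zero.
    out = np.zeros(len(signal))
    for i, frame in enumerate(clean):
        out[i * hop:i * hop + frame_len] += frame
    return out
```

Whether this actually improves recognition depends on the acoustic model, so it is worth comparing accuracy on the same recordings before and after processing.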
Ok, I'll try implementing a noise cancellation algorithm for my audio file. Any more audio factors that would affect recognition? More on the speaker side?
Use the same microphone that was used to collect the training data? I wonder if it's possible to give any practical advice here. Once your speech is clean enough and follows the required dialect, other factors become more important, I think.
Oh ok! Thanks Nickolay! I guess the most important factor here is the background noise in the audio. I read some papers saying that gender also affects recognition accuracy (e.g. male speakers get higher accuracy than female speakers). So do age and some other factors. Is this true for Sphinx?
> I read some papers saying that gender also affects recognition accuracy (e.g. male speakers get higher accuracy than female speakers).
Probably true, but it's a very minor difference (no more than a percent of WER) compared to the effect of using a proper acoustic model and language model (10% of WER).
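For reference, WER (word error rate) is the word-level edit distance between the reference transcript and the decoder output, divided by the number of reference words. Below is a minimal sketch in plain Python; the function name `word_error_rate` is hypothetical, and it assumes whitespace-separated words with no text normalization (case, punctuation), which real scoring tools usually apply.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sit on mat" against the reference "the cat sat on the mat" counts one substitution and one deletion over six reference words, so the WER is 2/6, about 33%.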
From what you stated, can I safely conclude that physiological differences and audio quality are not very significant for recognition accuracy compared to having proper acoustic and language models?
Additionally, if that is true, would it mean that anyone, given appropriate acoustic and language models, would still get high accuracy? Would it mean that all speakers are "speech recable" (lol, speech recognizable?)
Thanks for replying.