My question is what kind of speech I should record. The general rule of thumb is that one needs to record the sounds one wants to recognize. However, most of the big speech recognition systems have a large amount of data (200+ hours); we are thinking about collecting 10-20 hours of recordings incrementally (i.e. record some, test, then record again, ...). Regional bias is also an issue, since our system is speaker independent.
How do we find an ideal recording script (the texts that we want to record) that will work well under the following two working conditions?
1. General Command and Control for PC and Mobile
2. Common search sentences on the web
We want to cover roughly the 5,000-15,000 most common words in a foreign (less common) language.
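To make the vocabulary target concrete, below is a rough sketch of how we are thinking of picking the most frequent words from a raw text corpus; the file name and the naive regex tokenizer are just placeholders for our actual setup:

```python
# Rough sketch: list the N most frequent words in a plain-text corpus.
# "corpus.txt" and the simple regex tokenizer are placeholders for our real setup.
import re
from collections import Counter

def top_words(corpus_path, n=10000):
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            # Naive tokenization; a real setup needs a language-specific tokenizer.
            counts.update(re.findall(r"\w+", line.lower()))
    return [word for word, _ in counts.most_common(n)]

vocabulary = top_words("corpus.txt", n=10000)
print(len(vocabulary), vocabulary[:20])
```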
My concerns are the following:
Where should the recording take place: a) in an isolated room with clean sound, or b) in a noisy environment?
In speech synthesis there is a notion of a phonetically balanced corpus. Does that notion hold any value for speech recognition, i.e. if I record sentences with many different phonemes rather than similar phonemes again and again, will that increase my accuracy? If so, is there an automated way or algorithm to find suitable sentences in a text corpus?
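To illustrate the kind of automated selection I am asking about, here is a rough greedy set-cover sketch; it assumes a CMU-style pronunciation dictionary (word followed by space-separated phonemes) and one candidate sentence per line, and both file names are placeholders:

```python
# Rough sketch: greedily pick sentences that add the most previously unseen phonemes.
# Assumes a CMU-style pronunciation dictionary ("word PH1 PH2 ...") and one
# candidate sentence per line; both file names are placeholders.

def load_dict(path):
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if parts:
                lexicon[parts[0].lower()] = parts[1:]
    return lexicon

def sentence_phonemes(sentence, lexicon):
    phones = set()
    for word in sentence.lower().split():
        phones.update(lexicon.get(word, []))
    return phones

def greedy_select(sentences, lexicon, max_sentences=500):
    covered, selected = set(), []
    candidates = [(s, sentence_phonemes(s, lexicon)) for s in sentences]
    while candidates and len(selected) < max_sentences:
        # Pick the sentence that contributes the most new phonemes.
        best = max(candidates, key=lambda sp: len(sp[1] - covered))
        if not best[1] - covered:
            break  # nothing new left to cover
        selected.append(best[0])
        covered |= best[1]
        candidates.remove(best)
    return selected

lexicon = load_dict("pronunciation.dict")
with open("candidate_sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]
print("\n".join(greedy_select(sentences, lexicon)))
```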
How can I generalize over regional bias (accent) so that accuracy improves?
Thank you.
Last edit: Yeasin Ar Rahman 2016-07-11
Where should the recording take place: a) in an isolated room with clean sound, or b) in a noisy environment?
The general rule of thumb is that one needs to record the sounds one wants to recognize. I doubt your recognition is going to work in an isolated room, so it is better to record noisy sound.
In speech synthesis there is a notion of a phonetically balanced corpus. Does that notion hold any value for speech recognition, i.e. if I record sentences with many different phonemes rather than similar phonemes, will that increase my accuracy? If so, is there an automated way or algorithm to find suitable sentences in a text corpus?
These days the preference is to collect more data rather than spend time on balancing the corpus. Such an approach is both more efficient and avoids the shortcomings of hand-prepared data. So you do not need any phonetic balance; you just need more data. You can get it from books, podcasts, TV shows and so on.
How can I generalize over regional bias (accent) so that accuracy improves?
It is still an open question how to support regional accents efficiently. There is no good solution; you can just use more data.
Thank you very much for your quick reply, @Nickolay V. Shmyrev.
Can we not add noise to the sound later? I think I asked you this before in another question. I mean, is such a thing possible? I cannot record in all kinds of noisy environments, so my idea is to mask the clean recordings with noise data to make the model more robust.
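To be concrete, this is roughly the mixing I have in mind; it is only a sketch, assuming mono WAV files at the same sample rate, and the file names are placeholders:

```python
# Rough sketch: mix a noise recording into clean speech at a chosen SNR (dB).
# Assumes both WAV files are mono and share the same sample rate; file names
# are placeholders.
import numpy as np
from scipy.io import wavfile

def mix_at_snr(speech, noise, snr_db):
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[:len(speech)].astype(np.float64)
    speech = speech.astype(np.float64)

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rate, speech = wavfile.read("clean_utterance.wav")
_, noise = wavfile.read("street_noise.wav")
noisy = mix_at_snr(speech, noise, snr_db=10)
wavfile.write("noisy_utterance.wav", rate, np.clip(noisy, -32768, 32767).astype(np.int16))
```

In practice we would probably sweep several SNR values and a few different noise types rather than use a single fixed setting.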
Here are some papers that show huge improvements:
https://www.researchgate.net/publication/221489763_Adding_noise_to_improve_noise_robustness_in_speech_recognition
https://www.microsoft.com/en-us/research/publication/an-investigation-of-deep-neural-networks-for-noise-robust-speech-recognition/
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=940823&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D940823
There are many papers that say similar things, but as we do not have enough experience, we have no idea if the techniques described are practically attainable.
NB: the idea came to me after I saw a paper from Google which added carefully selected noise to their image recognition engine.
Last edit: Yeasin Ar Rahman 2016-07-11
But as we do not have enough experience, we have no idea if the techniques described are practically attainable.
Those things are practically attainable, but they are preferred when you do not have the capability to record real data. If you have the option to record real data, that is better than constructing the data artificially.
Thank you very much for your reply