Hi
I know that women's voices have a higher frequency than men's, and that children's voices are higher than both. But for speech recognition with the CMUSphinx engine, does it matter which group the training voices come from?
Can I train a general dictation or command-and-control system with only men's, women's, or children's voices and have it work for everyone in the future?
In my particular case, I am trying to build a system for children (between 5 and 15 years old). It is very easy for me to record the voices of girls aged 12 to 15, and I can also record girls aged 6 to 12, but it is very hard to record boys' voices at any age.
Should I record 50% girls and 50% boys for my training data? Or does it not matter, so that I can record 100% girls' voices and my system will still work properly for both girls and boys?
I would also like to know: if a system is trained with enough good data from only one gender (male or female), will it work as well as a system trained with 50% men's and 50% women's voices?
Young children's speech does not differ much between boys and girls. However, for 5-year-olds you do have many spectral and pronunciation differences.
If you train on girls, the model will not perform well on boys out of the box. You can collect a smaller amount of boys' speech and use model and feature adaptation to improve the performance.
The answer to your second question is also no, it will not.
Thank you Arseniy,
I found this paper with this abstract:
The fundamental frequency and jitter of the voice was measured by electroglottography in 71 children between the age of 7 and 15 years. In this series of children the fundamental frequency and jitter did not depend on the gender. The median (range) fundamental frequency was 244 (182–331) Hz in girls and 250 (205–293) Hz in boys. It decreased with increasing height (r = −0.59; P < 0.0005) and age (r = −0.57; P < 0.001). The median jitter ratio was 9.7 (1.6–33.3) in girls and 10.3 (2.0–4.3) in boys. The jitter ratio was negatively related to height (r = −0.31; P < 0.05), but not to age.
So I can use only girls' voices (under 15 years old) and, if needed, do some adaptation with a smaller amount of boys' voices to improve my system. (Also, the paper says the range for the girls' voices was 182-331 Hz and for the boys' 205-293 Hz, so I can find everything I need among the girls, can't I?)
Last edit: rezaee 2016-11-14
Fundamental frequency is only part of the story. Yes, you can try building the model on girls and testing it on boys. I believe you will get pretty bad results for boys older than 10. If the degradation is too large for your task, you will need to adapt the models (see VTLN, MLLR, MAP).
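To give a feel for what MAP adaptation does with a small amount of boys' speech, here is a minimal sketch of the textbook MAP update for a single Gaussian mean. It is not CMUSphinx code: the function name `map_adapt_mean`, the prior weight `tau`, and the simplification that every adaptation frame belongs to this one Gaussian are all assumptions made for illustration.

```python
from typing import List, Sequence


def map_adapt_mean(prior_mean: Sequence[float],
                   frames: Sequence[Sequence[float]],
                   tau: float = 10.0) -> List[float]:
    """Textbook MAP update of one Gaussian mean (illustrative, not Sphinx code).

    prior_mean: mean from the model trained on girls' speech.
    frames:     feature vectors from the small boys' adaptation set,
                all assumed to belong to this Gaussian (hypothetical
                simplification; a real system weights frames by occupancy).
    tau:        prior weight; larger values trust the original model more.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    dim = len(prior_mean)
    sums = [0.0] * dim
    for frame in frames:
        for d in range(dim):
            sums[d] += frame[d]
    n = len(frames)
    # Interpolate between the prior mean and the adaptation-data mean.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


if __name__ == "__main__":
    girls_mean = [0.0, 0.0]
    boys_frames = [[2.0, 4.0]] * 10
    print(map_adapt_mean(girls_mean, boys_frames, tau=10.0))
```

With few adaptation frames the result stays close to the original mean, and with more frames it moves toward the new speakers' data, which is why a smaller amount of boys' speech can still help.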
Can I use voice-changing software (male to female or female to male), or manipulate the girls' waveforms (for example by multiplying them by some number), to use them as boys' voices for training?
Is there a good formula or toolkit for converting voices from male to female or vice versa by computer?
Last edit: rezaee 2016-11-14
Link to the paper quoted above: http://www.sciencedirect.com/science/article/pii/016558769501197J
This is actually the idea behind VTLN. You apply a warping factor to transform the features. It is better to estimate this factor from the model.
Well, just for fun you can also play with the sox pitch effect, but I do not think it will be that useful.
Thank you very much, Arseniy!
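The warping idea mentioned above can be sketched roughly as follows. This is one common piecewise-linear form of a VTLN frequency warp, not necessarily the one CMUSphinx uses. The names `vtln_warp` and `pick_warp_factor`, the `knee` breakpoint, and the `score` callback (standing in for "likelihood of the warped features under the model") are hypothetical.

```python
from typing import Callable, Iterable


def vtln_warp(freq_hz: float, alpha: float,
              f_max: float = 8000.0, knee: float = 0.85) -> float:
    """Piecewise-linear frequency warp (illustrative sketch).

    Below the breakpoint the frequency axis is scaled by alpha; above it
    a second linear segment maps f_max back onto f_max so that the warped
    axis still covers the same band.
    """
    f_break = knee * f_max
    if not 0.0 <= freq_hz <= f_max:
        raise ValueError("freq_hz must lie in [0, f_max]")
    if alpha <= 0 or alpha * f_break >= f_max:
        raise ValueError("alpha out of range for this knee")
    if freq_hz <= f_break:
        return alpha * freq_hz
    slope = (f_max - alpha * f_break) / (f_max - f_break)
    return alpha * f_break + slope * (freq_hz - f_break)


def pick_warp_factor(candidates: Iterable[float],
                     score: Callable[[float], float]) -> float:
    """Grid search: return the candidate factor with the highest score.

    `score` is an assumed callback; in practice it would evaluate how well
    the speaker's warped features fit the acoustic model.
    """
    return max(candidates, key=score)


if __name__ == "__main__":
    print(vtln_warp(1000.0, 1.1))
    # Toy score that peaks at 1.1, purely for demonstration.
    print(pick_warp_factor([0.9, 1.0, 1.1, 1.2], lambda a: -(a - 1.1) ** 2))
```

Choosing the factor by a model-based score per speaker, rather than fixing it by hand, is what "estimate from the model" refers to.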