Hi,
I am using MFCC features for a discrete HMM speech recognition system. Does
anyone have any idea how to implement some simple vocal tract length
normalization?
I was looking for a good explanation of this approach but haven't found
anything.
Thanks
Hello
VTLN is implemented in CMUSphinx. Are you looking for the particular place in
the code, like fe_warp_affine.c in sphinxbase, or for a description of the
algorithm, like
http://www.cs.cmu.edu/%7Eegouvea/paper/thesis.pdf ?
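For illustration, a minimal sketch of what an affine VTLN warp of the mel filterbank frequencies can look like; this is not the sphinxbase code from fe_warp_affine.c, and all function names and constants here are illustrative assumptions:

```python
# Minimal sketch: applying an affine VTLN warp f' = a*f + b to the mel
# filterbank edge frequencies before computing MFCCs.  This is NOT the
# sphinxbase implementation from fe_warp_affine.c; names and constants
# are illustrative assumptions.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warped_filter_edges(n_filters, sample_rate, a=1.0, b=0.0):
    """Mel filter edge frequencies (Hz) with an affine warp of the axis."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    edges = mel_to_hz(mels)
    warped = a * edges + b              # the affine warp itself
    return np.clip(warped, 0.0, sample_rate / 2.0)

# e.g. filter edges for a speaker-specific factor a = 0.95:
edges = warped_filter_edges(n_filters=26, sample_rate=16000, a=0.95)
```

Whether a > 1 stretches or compresses the spectrum depends on whether the warp is applied to the filter edges or to the signal's frequency axis, so the main thing is to keep one convention consistently for training and decoding.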
Thank you very much for your reply. I think both the code and the description
could be helpful for me. I will check them both.
Peter
I have one additional question:
Is there any simpler way to make my speech recognition system less speaker
dependent?
Sorry, I'm not sure what you mean by "less speaker dependent".
I mean, what approach should I use to get similar results (success rate) for a
wider range of speakers, i.e. to make the system speaker independent?
I have already implemented vocal tract length normalization using a bilinear
transformation of the frequency axis. I would like to ask one more question to
be sure I got it right: how do I find the warping factor (a)? I have read that
its value should be between 0.88 and 1.12. So far I record a speaker saying a
known word, then run through all possible warping factors and take the one for
which the probability is highest. For example, I say the word "internet" and
the probability of the model "internet" is highest with a=0.95, so I take the
warping factor 0.95 for this particular speaker.
Is this approach right?
Thank you for your replies.
Peter
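As a reference point, a minimal sketch of the grid search described above; extract_feats and score_model stand in for whatever warped-MFCC extraction and HMM word-model scoring the system already has, so they are placeholders, not real APIs:

```python
# Minimal sketch of the maximum-likelihood grid search over warping factors
# described above.  extract_feats and score_model are placeholders for the
# system's own warped-MFCC extraction and HMM word-model scoring.
import numpy as np

def estimate_warp_factor(audio, word_model, extract_feats, score_model,
                         factors=np.arange(0.88, 1.1201, 0.02)):
    """Return the warping factor whose features score best against the model."""
    best_factor, best_score = None, -np.inf
    for a in factors:
        feats = extract_feats(audio, warp=a)        # warped MFCCs for factor a
        score = score_model(word_model, feats)      # log-likelihood of the word
        if score > best_score:
            best_factor, best_score = a, score
    return best_factor

# e.g. (hypothetical call):
# warp = estimate_warp_factor(recording, models["internet"], my_mfcc, my_hmm_score)
```

With a step of 0.02 over 0.88 to 1.12 that is 13 candidate factors, so the cost per speaker is 13 feature extractions plus 13 scorings, which is exactly why the decoding-based search below is called slow.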
You can try searching Google:
http://www.google.ru/search?q=vtln+factor+estimation
For example:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.6249
Decoding is quite a slow method for VTLN factor estimation. I would rather use
a fast GMM-based classifier.
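One way such a GMM-based classifier could be set up (an assumption on my part, not the CMUSphinx implementation): train one GMM per warping factor on frames from training speakers whose factor was already found offline (e.g. by the slow decoding search), then pick the factor whose GMM scores a new speaker's frames highest. A minimal sketch, assuming scikit-learn:

```python
# Minimal sketch of a GMM-based VTLN factor classifier (an assumed design,
# not the CMUSphinx code).
# Training: one GMM per warping factor, fit on MFCC frames of speakers whose
# factor was determined offline.  Run time: score the new speaker's frames
# with every GMM and pick the best-scoring factor.
from sklearn.mixture import GaussianMixture

def train_factor_gmms(frames_by_factor, n_components=8):
    """frames_by_factor: dict {warp_factor: (N, n_mfcc) array of frames}."""
    gmms = {}
    for factor, frames in frames_by_factor.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(frames)
        gmms[factor] = gmm
    return gmms

def classify_factor(gmms, frames):
    """Return the factor whose GMM gives the highest mean log-likelihood."""
    return max(gmms, key=lambda f: gmms[f].score(frames))
```

This replaces a full decode per candidate factor with a handful of GMM evaluations on raw frames, which is what makes it fast.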
Interesting, but I prefer another method:
If we average the log area function over the voiced frames (at least; it can
be shown that averaging over all the training data also works), the resulting
"mean log area" has the following features:
1. A clearly visible negative slope at the (mean) location of the glottis.
2. A sound synthesized from that "mean log area" has first four (with poor source quality, at least three) "mean formants" that reliably determine the VTL. A method as simple as finding the VTL that minimizes the RMS deviation of the first 3 or 4 "mean" formants from the eigenfrequencies of the classic open-closed tube (the odd-harmonic series) is enough to determine the "mean VTL" with sufficient precision; see the sketch below. Of course, the VTL can vary with lip protrusion and so on, but this difference is relatively small.
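To make that last step concrete: an ideal tube closed at the glottis and open at the lips has eigenfrequencies F_n = (2n-1)c/(4L), so the L minimizing the squared deviation of the measured "mean formants" from that series has a closed form. A minimal sketch; the formant values in the example are made up for illustration:

```python
# Minimal sketch of fitting the "mean VTL" to measured mean formants by
# least squares against the open-closed tube series F_n = (2n-1)*c/(4*L).
# The example formant values below are made up for illustration.
import numpy as np

def fit_vtl(mean_formants_hz, c=350.0):
    """Least-squares vocal tract length (m) from the first few mean formants.

    Minimizing sum_n (F_n - a_n / L)^2 with a_n = (2n-1)*c/4 gives the
    closed form L = sum(a_n^2) / sum(F_n * a_n).
    """
    F = np.asarray(mean_formants_hz, dtype=float)
    n = np.arange(1, len(F) + 1)
    a = (2 * n - 1) * c / 4.0
    return np.sum(a * a) / np.sum(F * a)

# Example (made-up formants lying on the odd series of a ~17.5 cm tract):
print(fit_vtl([500.0, 1500.0, 2500.0, 3500.0]))  # -> 0.175 m
```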