Hi,
I am using MFCC features for a discrete HMM speech recognition system. Does
anyone have any idea how to implement some simple vocal tract length
normalization?
I was looking for a good explanation of this approach but haven't found
anything.
Thanks
Hello
VTLN is implemented in CMUSphinx. Are you looking for the particular place in
the code, like fe_warp_affine.c in sphinxbase, or for a description of the
algorithm, like
http://www.cs.cmu.edu/%7Eegouvea/paper/thesis.pdf ?
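For illustration, a minimal sketch of what an affine VTLN warp of the mel filterbank frequencies can look like; this is not the sphinxbase code from fe_warp_affine.c, and all function names and constants here are illustrative assumptions:

```python
# Minimal sketch: applying an affine VTLN warp f' = a*f + b to the mel
# filterbank edge frequencies before computing MFCCs.  This is NOT the
# sphinxbase implementation from fe_warp_affine.c; names and constants
# are illustrative assumptions.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def warped_filter_edges(n_filters, sample_rate, a=1.0, b=0.0):
    """Mel filter edge frequencies (Hz) with an affine warp of the axis."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    edges = mel_to_hz(mels)
    warped = a * edges + b              # the affine warp itself
    return np.clip(warped, 0.0, sample_rate / 2.0)

# e.g. filter edges for a speaker-specific factor a = 0.95:
edges = warped_filter_edges(n_filters=26, sample_rate=16000, a=0.95)
```

Whether a > 1 stretches or compresses the spectrum depends on whether the warp is applied to the filter edges or to the signal's frequency axis, so the main thing is to keep one convention consistently for training and decoding.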
Thank you very much for your reply. I think both the code and the description
could be helpful for me. I will check them both.
Peter
I have one additional question:
Is there any simpler way to make my speech recognition system less speaker
dependent?
Sorry, I'm not sure what you mean by "less speaker dependent".
I mean, what approach should I use to get similar results (success rate) for a
wider range of speakers, i.e. to make the system speaker independent?
I have already implemented vocal tract length normalization using a bilinear
transformation of the frequency axis. I would like to ask one more question to
be sure I got it right: how do I find the warping factor (a)? I have read that
its value should be between 0.88 and 1.12. So far I record a speaker saying a
known word, then run through all possible warping factors and take the one for
which the probability is highest. For example, I say the word "internet" and
the probability of the model "internet" is highest with a=0.95, so I take the
warping factor 0.95 for this particular speaker.
Is this approach right?
Thank you for your replies.
Peter
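As a reference point, a minimal sketch of the grid search described above; extract_feats and score_model stand in for whatever warped-MFCC extraction and HMM word-model scoring the system already has, so they are placeholders, not real APIs:

```python
# Minimal sketch of the maximum-likelihood grid search over warping factors
# described above.  extract_feats and score_model are placeholders for the
# system's own warped-MFCC extraction and HMM word-model scoring.
import numpy as np

def estimate_warp_factor(audio, word_model, extract_feats, score_model,
                         factors=np.arange(0.88, 1.1201, 0.02)):
    """Return the warping factor whose features score best against the model."""
    best_factor, best_score = None, -np.inf
    for a in factors:
        feats = extract_feats(audio, warp=a)        # warped MFCCs for factor a
        score = score_model(word_model, feats)      # log-likelihood of the word
        if score > best_score:
            best_factor, best_score = a, score
    return best_factor

# e.g. (hypothetical call):
# warp = estimate_warp_factor(recording, models["internet"], my_mfcc, my_hmm_score)
```

With a step of 0.02 over 0.88 to 1.12 that is 13 candidate factors, so the cost per speaker is 13 feature extractions plus 13 scorings, which is exactly why the decoding-based search below is called slow.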
You can try searching Google:
http://www.google.ru/search?q=vtln+factor+estimation
For example:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.26.6249
Decoding is quite a slow method for VTLN factor estimation. I would rather use
a fast GMM-based classifier.
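One way such a GMM-based classifier could be set up (an assumption on my part, not the CMUSphinx implementation): train one GMM per warping factor on frames from training speakers whose factor was already found offline (e.g. by the slow decoding search), then pick the factor whose GMM scores a new speaker's frames highest. A minimal sketch, assuming scikit-learn:

```python
# Minimal sketch of a GMM-based VTLN factor classifier (an assumed design,
# not the CMUSphinx code).
# Training: one GMM per warping factor, fit on MFCC frames of speakers whose
# factor was determined offline.  Run time: score the new speaker's frames
# with every GMM and pick the best-scoring factor.
from sklearn.mixture import GaussianMixture

def train_factor_gmms(frames_by_factor, n_components=8):
    """frames_by_factor: dict {warp_factor: (N, n_mfcc) array of frames}."""
    gmms = {}
    for factor, frames in frames_by_factor.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(frames)
        gmms[factor] = gmm
    return gmms

def classify_factor(gmms, frames):
    """Return the factor whose GMM gives the highest mean log-likelihood."""
    return max(gmms, key=lambda f: gmms[f].score(frames))
```

This replaces a full decode per candidate factor with a handful of GMM evaluations on raw frames, which is what makes it fast.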
Interesting, but I prefer another method:
If we average the log area function over the voiced frames (at least; it can
be shown that averaging over all the training data also works), the resulting
"mean log area" has the following features:
1. A clearly visible negative slope at the (mean) location of the glottis.
2. A sound synthesized from that "mean log area" has first four (with poor source quality, at least three) "mean formants" that reliably determine the VTL. A method as simple as finding the VTL that minimizes the RMS deviation of the first 3 or 4 "mean" formants from the eigenfrequencies of the classic open-closed tube (the odd-harmonic series) is enough to determine the "mean VTL" with sufficient precision; see the sketch below. Of course, the VTL can vary with lip protrusion and so on, but this difference is relatively small.
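To make that last step concrete: an ideal tube closed at the glottis and open at the lips has eigenfrequencies F_n = (2n-1)c/(4L), so the L minimizing the squared deviation of the measured "mean formants" from that series has a closed form. A minimal sketch; the formant values in the example are made up for illustration:

```python
# Minimal sketch of fitting the "mean VTL" to measured mean formants by
# least squares against the open-closed tube series F_n = (2n-1)*c/(4*L).
# The example formant values below are made up for illustration.
import numpy as np

def fit_vtl(mean_formants_hz, c=350.0):
    """Least-squares vocal tract length (m) from the first few mean formants.

    Minimizing sum_n (F_n - a_n / L)^2 with a_n = (2n-1)*c/4 gives the
    closed form L = sum(a_n^2) / sum(F_n * a_n).
    """
    F = np.asarray(mean_formants_hz, dtype=float)
    n = np.arange(1, len(F) + 1)
    a = (2 * n - 1) * c / 4.0
    return np.sum(a * a) / np.sum(F * a)

# Example (made-up formants lying on the odd series of a ~17.5 cm tract):
print(fit_vtl([500.0, 1500.0, 2500.0, 3500.0]))  # -> 0.175 m
```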