
Vocal tract length normalization

Anonymous - 2011-03-09 (updated 2012-09-22)
  • Anonymous

    Anonymous - 2011-03-09

    Hi,
    I am using MFCCs as features for a discrete-HMM speech recognition system.
    Does anyone have any idea how to implement some simple vocal tract length
    normalization? I have been looking for a good explanation of this approach
    but haven't found anything.
    Thanks

  • Nickolay V. Shmyrev

    Hello

    VTLN is implemented in CMUSphinx. Are you looking for the particular place
    in the code, like fe_warp_affine.c in sphinxbase, or for a description of
    the algorithm, like

    http://www.cs.cmu.edu/%7Eegouvea/paper/thesis.pdf
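
    For readers who just want the idea behind fe_warp_affine.c: the warp is an
    affine map of the frequency axis, applied before the mel filterbank. Below
    is a minimal sketch of that idea in Python, not a transcription of the
    CMUSphinx source; the parameter names, the clipping behaviour, and the
    filterbank example are assumptions.

    ```python
    import numpy as np

    def affine_warp(freqs_hz, a=1.0, b=0.0, f_nyquist=8000.0):
        """Affine frequency warp f' = a*f + b, clipped to [0, Nyquist].

        In a VTLN front end a single warp factor per speaker rescales
        the whole frequency axis before the mel filterbank is applied.
        """
        warped = a * np.asarray(freqs_hz, dtype=float) + b
        return np.clip(warped, 0.0, f_nyquist)

    # Example: warp the edge frequencies of a 26-filter mel bank.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 28))
    print(affine_warp(edges, a=0.95))
    ```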

  • Anonymous

    Anonymous - 2011-03-10

    Hello

    VTLN is implemented in CMUSphinx. Are you looking for the particular place
    in the code, like fe_warp_affine.c in sphinxbase, or for a description of
    the algorithm, like

    http://www.cs.cmu.edu/%7Eegouvea/paper/thesis.pdf

    Thank you very much for your reply. I think both the code and the
    description could be helpful for me. I will check them both.
    Peter

  • Anonymous

    Anonymous - 2011-03-10

    I have one additional question:
    Is there any simpler way to make my speech recognition system less speaker
    dependent?

  • Nickolay V. Shmyrev

    Is there any simpler way to make my speech recognition system less speaker
    dependent?

    Sorry, I'm not sure what you mean by "less speaker dependent".

  • Anonymous

    Anonymous - 2011-03-22

    Sorry, I'm not sure what you mean by "less speaker dependent".

    I mean, what approach should I use to get similar results (success rates)
    for a wider range of speakers, i.e. to make the system speaker independent?

    I have already implemented vocal tract length normalization using a
    bilinear transformation of the frequency axis. I would like to ask one
    more question, to be sure I got it right: how do I find the warping factor
    (a)? I have read that its value should be between 0.88 and 1.12. So far I
    record a speaker saying a known word and then run through all possible
    warping factors, taking the one for which the probability is highest. For
    example, I say the word "internet", and the probability of the model
    "internet" is highest with a = 0.95, so I take the warping factor 0.95 for
    this particular speaker.
    Is this approach right?

    Thank you for your replies.
    Peter
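
    What Peter describes is essentially a maximum-likelihood grid search over
    warp factors. A minimal sketch of that loop follows; extract_mfcc and
    score are hypothetical placeholders for his own front end and discrete-HMM
    scorer, and the 0.02 step size is an assumption.

    ```python
    import numpy as np

    def pick_warp_factor(wave, model, extract_mfcc, score,
                         alphas=np.arange(0.88, 1.13, 0.02)):
        """Return the warp factor whose warped features score highest.

        extract_mfcc(wave, alpha) should compute MFCCs with the
        frequency axis warped by alpha; score(model, feats) should
        return the (log-)likelihood of the known enrolment word.
        """
        best_alpha, best_score = None, -np.inf
        for alpha in alphas:
            feats = extract_mfcc(wave, alpha)   # warped front end
            s = score(model, feats)             # likelihood of known word
            if s > best_score:
                best_alpha, best_score = alpha, s
        return best_alpha
    ```

    The chosen factor is then reused for all further utterances from that
    speaker, so the search cost is paid only once per speaker.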

  • Rmkf

    Rmkf - 2011-06-03

    Interesting, but I prefer another method:
    if we average the log area over the voiced frames (at least; it can be
    shown that averaging over all the training data works too), the resulting
    "mean log area" has the following properties:
    1. A clearly visible negative slope at the (mean) position of the glottis.
    2. Sound synthesized from that "mean log area" has its first 4 "mean
    formants" (at least 3 if the source quality is poor), and these reliably
    determine the VTL. A method as simple as finding the VTL that minimizes
    the RMS deviation of the first 3 or 4 "mean" formants from the
    eigenfrequencies of the classic open-closed tube (the odd-harmonic series)
    is enough to determine the "mean VTL" with sufficient precision. Of
    course, the VTL can vary with lip protrusion and the like, but this
    difference is relatively small.
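
    For concreteness, here is a minimal sketch of the tube fit described in
    point 2, assuming the mean formant values have already been measured; the
    function name, the speed of sound, and the example formants are
    illustrative. A uniform open-closed tube of length L resonates at
    F_n = (2n - 1) * c / (4 * L), and the least-squares fit over 1/L has a
    closed form.

    ```python
    import numpy as np

    def vtl_from_mean_formants(formants_hz, c=35000.0):
        """Least-squares vocal tract length (cm) from mean formants.

        Fits the measured formants to the odd-harmonic series
        F_n = (2n - 1) * c / (4 * L) of an open-closed tube by
        minimizing the RMS deviation; c is the speed of sound in cm/s.
        """
        f = np.asarray(formants_hz, dtype=float)
        n = np.arange(1, len(f) + 1)
        x = (2 * n - 1) * c / 4.0    # model: F_n = x_n / L
        k = (x @ f) / (x @ x)        # best-fit 1/L (linear least squares)
        return 1.0 / k

    # Example: mean formants near 500/1500/2500/3500 Hz give L = 17.5 cm,
    # a typical adult male vocal tract length.
    print(vtl_from_mean_formants([500.0, 1500.0, 2500.0, 3500.0]))
    ```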

