
Interpretation of combining MFCC coefficients

Ray
2014-03-18
2014-05-13
  • Ray

    Ray - 2014-03-18

    Hello,

    I was wondering if someone more knowledgeable in speech processing could
    offer some advice or wisdom regarding the following scenario. Consider
    recording a word by saying it once and then repeating it. When the word is
    said the first time, the frontend processes the audio and outputs a
    configurable number of MFCC coefficients for each speech frame. These are
    stored in the backend. The same word is then repeated, and the backend
    uses dynamic time warping (DTW) to perform recognition. Because the two
    recordings are of the same word, I thought that by combining the two
    resulting sequences of MFCC vectors (one sequence per recording) you could
    obtain a better model for the recorded word. By "combine" I mean something
    like an average or a linear combination of the MFCC vectors. In theory,
    could this lead to a more robust (or more distinguishable) model for that
    command? I've read that when there are multiple recordings of the same
    word they are usually all stored and said to form a class, but I could not
    find cases where the vectors are combined in some way. My main motivation
    is that storing them all takes a lot of memory, and I would like to find a
    better alternative.
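
    As an aside, the first step above can be reproduced offline for
    experimentation. This is only an illustration (using librosa, which is not
    the frontend discussed in this thread, and hypothetical file names) of
    getting a configurable number of MFCC coefficients per frame:

    ```python
    import librosa

    def extract_mfcc(path, n_mfcc=13):
        """Load a recording and return its MFCC sequence, one vector per frame."""
        y, sr = librosa.load(path, sr=16000)
        # librosa returns (n_mfcc, frames); transpose to (frames, n_mfcc)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T

    a = extract_mfcc("one_take1.wav")  # hypothetical file names
    b = extract_mfcc("one_take2.wav")
    ```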

    Thank you.

  • bic-user

    bic-user - 2014-03-18

    If you are using DTW for recognition, you can collect several sequences, find the one whose total distance to the others is minimal, and use it as the reference. You can also calculate the mean and variance of the distances between the reference and the others, which gives you an idea of what distance threshold to set. If you have exactly two instances, it comes down to guessing which one should be used as the reference. Hope this helps.
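
    A minimal sketch of that reference-selection idea (my own illustration in
    Python/NumPy, not code from any decoder mentioned here), assuming each
    template is an array of shape (frames, coefficients):

    ```python
    import numpy as np

    def dtw_distance(a, b):
        """Standard DTW cost between two MFCC sequences (frames x coeffs)."""
        m, n = len(a), len(b)
        D = np.full((m + 1, n + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[m, n]

    def pick_reference(templates):
        """Pick the template whose total DTW distance to the others is minimal,
        and derive a rough acceptance threshold from the remaining distances."""
        totals = [sum(dtw_distance(t, u) for u in templates if u is not t)
                  for t in templates]
        ref = templates[int(np.argmin(totals))]
        dists = [dtw_distance(ref, u) for u in templates if u is not ref]
        threshold = float(np.mean(dists) + 2 * np.std(dists))  # e.g. mean + 2*sigma
        return ref, threshold
    ```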

  • Ray

    Ray - 2014-03-19

    Thank you for the response, but (if I understand correctly) I am not interested in finding suitable thresholds; rather, I want to obtain a "better" feature vector sequence out of the two feature vector sequences of the same word.
    What I was referring to is this: a recording of the word "one", after the frontend processes it, gives MFCC vectors a[0], a[1], ..., a[m]. When we repeat the word "one" we get b[0], b[1], ..., b[n]. Each a[i] and b[j] is a vector of MFCC coefficients (a[0][0] is the first coefficient, a[0][1] the second, and so on).
    What I was asking is whether there is a way to create a resulting sequence c[0], c[1], ..., c[k], with max(m, n) < k < m + n, in which c[p] = alpha * a[i] + beta * b[j], with i in [0, m), j in [0, n), p in [0, k), and alpha, beta in [0, 1). This is done for all coefficients in a[i] and b[j]. I hope the resulting c[0], ..., c[k] will contain "better" features (more robust, or more distinguishable). From a theoretical point of view, should this "improve" the features, or does it simply not make sense? If there is a way of doing something along these lines, how could it be done?
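
    For what it's worth, one concrete way to realize this (my own sketch, not
    something prescribed in this thread) is to let the DTW alignment path
    define the (i, j) pairs, so k is the path length and alpha = beta = 0.5.
    The DTW recursion below is the same as in the earlier sketch, extended
    with backtracking of the path:

    ```python
    import numpy as np

    def dtw_path(a, b):
        """Return the DTW alignment path [(i, j), ...] between two
        MFCC sequences a (m x d) and b (n x d)."""
        m, n = len(a), len(b)
        D = np.full((m + 1, n + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = np.linalg.norm(a[i - 1] - b[j - 1])
                D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        # backtrack from (m, n) to (1, 1)
        path, i, j = [], m, n
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

    def merge_templates(a, b, alpha=0.5):
        """c[p] = alpha * a[i] + (1 - alpha) * b[j] along the warped path."""
        return np.array([alpha * a[i] + (1 - alpha) * b[j]
                         for i, j in dtw_path(a, b)])
    ```

    The merged sequence has between max(m, n) and m + n - 1 frames, roughly
    the range described above, and only one template per word needs to be
    stored afterwards.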


    Last edit: Ray 2014-03-19
  • bic-user

    bic-user - 2014-03-19

    Yes, you need to dig into the DTW algorithm to find out which a[i] corresponds to which b[j]. But it is hard to define those alpha and beta. You want to have some "average" feature vector. The approach I describe above requires at least three instances to choose one. Please post here if you find out how to deal with two.

  • Ray

    Ray - 2014-03-19

    Sorry for not understanding the first time, but I still can't see how to combine the coefficients even if I have 3 instances. Assuming we have already selected a reference from the 3, can you please go into more detail on how you would combine the coefficients?
    Thank you.

  • bic-user

    bic-user - 2014-03-19

    No, not combine. Just select the most "average" feature vector among those three and then match against it.

  • Pranav Jawale

    Pranav Jawale - 2014-04-01

    @Ray What you are talking about sounds similar to a GMM. Instead of 2 files, you have lots of wav files from which you extract features, and you combine the features that belong to, say, a particular phone. The mean and standard deviation of the feature values are the properties considered for the "combination".
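
    A rough illustration of that direction (my own sketch, assuming
    scikit-learn is available; it is not tied to any decoder mentioned here):
    pool the MFCC frames from all recordings of a word and fit a small
    diagonal-covariance GMM per word, then score an incoming utterance against
    each word model.

    ```python
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_word_models(features_per_word, n_components=4):
        """features_per_word: {word: [mfcc_array, ...]}, arrays of (frames x coeffs).
        Pools all frames of a word and fits one diagonal-covariance GMM per word."""
        models = {}
        for word, sequences in features_per_word.items():
            frames = np.vstack(sequences)
            gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
            models[word] = gmm.fit(frames)
        return models

    def recognize(mfcc, models):
        """Return the word whose GMM gives the highest average log-likelihood."""
        return max(models, key=lambda w: models[w].score(mfcc))
    ```

    Note that pooling frames like this throws away temporal order, and the
    mixtures need a reasonable number of frames to estimate, which is why it
    usually wants more than two repetitions per word.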

  • Ray

    Ray - 2014-05-13

    @Li3 Thank you for that presentation, it was really helpful. There was an idea to merge the feature vectors of the two recordings (since they are known to represent the same word) along the warped path (DTW is used as the backend), but I wasn't sure whether it was correct. That presentation confirmed it. I also found another paper that explained a different way of merging, but it required 3 templates. The main idea there was to merge feature vectors for which the transition slope in the DTW path fulfilled certain conditions. Should I find it again, I will post a link to it.
    @Pranav Jawale I am not knowledgeable enough about GMMs to say this for sure, but I believe we would need lots of sample words in order to extract the statistics. Also, we don't use phoneme modeling. If I have misunderstood something, or if there is a way to leverage GMMs for a small vocabulary (10-20 words with only 2 repetitions of each word), please provide some guidance on doing so. Thank you.

