Hello,
I was wondering if someone more knowledgeable in speech processing could offer some advice or wisdom regarding the following scenario.
Consider recording a word by saying it once and then repeating it. The first time it is said, the frontend processes the audio and outputs a configurable number of MFCC coefficients for each speech frame; these are stored in the backend. After that, the same word is repeated. The backend uses dynamic time warping (DTW) to perform recognition. Because the two recordings are of the same word, I thought that by combining the two resulting sequences of MFCC vectors (one sequence per recording) you could obtain a better model for the recorded word. By "combine" I mean something like an average or a linear combination of the MFCC vectors. In theory, could this lead to a more robust (or more distinguishable) model for that command? I've read that when there are multiple recordings of the same word they are usually all stored and said to form a class, but I could not find cases where these vectors are combined in some way. My main motivation is that storing them all takes a lot of memory, and I would like to find a better alternative.
Thank you.
If you are using DTW for recognition, you can collect several sequences, find the one whose total distance to the others is minimal, and use it as the reference. You can also calculate the mean and variance of the distances between the reference and the others, which gives you an idea of what distance threshold to set. If you have exactly two instances, it comes down to guessing which one should be used as the reference. Hope this helps.
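For illustration, here is a minimal sketch in Python/NumPy of that selection idea; the function names (dtw_distance, pick_reference) and the template variables in the usage comment are made up for this example, and the DTW here is the plain, unconstrained variant:

```python
import numpy as np

def dtw_distance(a, b):
    """Plain DTW distance between two MFCC sequences a (m x d) and b (n x d)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    m, n = len(a), len(b)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j - 1],  # both advance
                                 cost[i - 1, j],      # only a advances
                                 cost[i, j - 1])      # only b advances
    return cost[m, n]

def pick_reference(templates):
    """Pick the template whose total DTW distance to the others is minimal;
    also return the mean and variance of its distances to the others, which
    can serve as a hint for setting a rejection threshold."""
    best = None
    for i, t in enumerate(templates):
        dists = [dtw_distance(t, u) for j, u in enumerate(templates) if j != i]
        if best is None or sum(dists) < best[0]:
            best = (sum(dists), i, np.mean(dists), np.var(dists))
    _, ref_index, mean_dist, var_dist = best
    return ref_index, mean_dist, var_dist

# e.g. ref, mu, var = pick_reference([mfcc_take1, mfcc_take2, mfcc_take3])
```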
Thank you for the response, however (if I understand correctly) I am not interested in finding suitable thresholds but rather in obtaining a "better" feature vector out of the two feature vectors of the same word.
What I was referring to was:
a recording of the word "one", after the frontend processes it, will give a[0], a[1], ..., a[m] vectors of MFCC coefficients;
when we repeat the word "one" we will get b[0], b[1], ..., b[n]. Each of a[0], b[0], and so on is a vector of MFCC coefficients (a[0][0] is the first MFCC coefficient, a[0][1] the second, and so on). What I was asking is whether there is a way to create a resulting sequence c[0], c[1], ..., c[k], with max(m, n) < k < m + n, in which c[p] = alfa * a[i] + beta * b[j], for i in [0, m), j in [0, n), p in [0, k), and alfa, beta in [0, 1). This is done for all coefficients in a[i] and b[j]. I hope that the resulting c[0], ..., c[k] will contain "better" features (more robust, or more distinguishable). From a theoretical point of view, should this "improve" the features, or does it simply not make sense? If there is a way of doing something along these lines, how could it be done?
Last edit: Ray 2014-03-19
Yes, you need to dig into the DTW algorithm to find out which a[i] corresponds to which b[j]. But it's hard to define those alfa and beta. You want to have some "average" feature vector. The approach I describe above requires at least three instances to choose the one. Please post here if you find out how to deal with two.
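For what it's worth, a minimal sketch of that correspondence idea in Python/NumPy is below: the warping path of a plain DTW gives the (i, j) pairs between a and b, and a fixed alfa = beta = 0.5 average along that path is one simple, not necessarily optimal, choice. The names dtw_path and merge_along_path are invented for the example:

```python
import numpy as np

def dtw_path(a, b):
    """DTW alignment of MFCC sequences a (m x d) and b (n x d); returns the
    warping path as a list of (i, j) frame correspondences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    m, n = len(a), len(b)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # backtrack from the end of both sequences to the start
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def merge_along_path(a, b, alpha=0.5):
    """Average the two recordings frame by frame along the warping path; the
    merged sequence c has max(m, n) <= len(c) <= m + n - 1 frames."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.array([alpha * a[i] + (1.0 - alpha) * b[j] for i, j in dtw_path(a, b)])
```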
Sorry for not understanding the first time, but I still can't see how to combine the coefficients even if I have 3 instances. Assuming that from the 3 we have already selected a reference, can you please go into more detail about how you would combine the coefficients?
Thank you.
No, not combine. Just select the most "average" feature vector among those three and then match against it.
Ray,
Check this out. The DTW section of this presentation should have the answer you are looking for.
http://www.cs.cmu.edu/~bhiksha/courses/yahoo2009/01-01.speechrecfordummies.pdf
@Ray what you are talking about sounds similar to a GMM. Instead of 2 files, you have lots of wav files from which you extract features, and you combine the features that belong to, say, a particular phone. The mean and standard deviation of the feature values are the properties that are considered for this "combination".
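As a rough illustration of that idea (not something from this project), the sketch below pools MFCC frames from several recordings of the same unit (a whole word here, since no phone labels are assumed) and lets the mixture's means and variances act as the "combined" statistics. It uses scikit-learn's GaussianMixture, the function names are invented, and with only a couple of repetitions per word the data is probably too sparse for more than one component:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_word_gmm(frames, n_components=1):
    """frames: (N x d) array of MFCC frames pooled from all recordings of one
    word; the mixture's means and diagonal variances are the "combination"."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(frames)
    return gmm

def score_word(gmm, mfcc_sequence):
    """Average per-frame log-likelihood of a new recording under the model."""
    return float(np.mean(gmm.score_samples(mfcc_sequence)))

# e.g. model = train_word_gmm(np.vstack([mfcc_take1, mfcc_take2]))
#      print(score_word(model, mfcc_new_take))
```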
@Li3 thank you for that presentation, it was really helpful. There was an idea to merge the feature vectors of the two recordings (since it is known they represent the same word) along the warped path (DTW is used as the backend), but I wasn't sure whether it was correct. That presentation confirmed it. I also found another paper that explained a different way of merging, but it required 3 templates. The main idea there was to merge feature vectors for which the transition slope in the DTW fulfilled certain conditions. Should I find it, I will post a link to it.
@Pranav Jawale I am not knowledgeable enough about GMMs to say this for sure, but I believe we would need lots of sample words in order to extract the statistics. Also, we don't use phoneme modeling. If I have misunderstood something, or if there is a way to leverage them for a small vocabulary (10-20 words with only 2 repetitions of each word), please provide some guidance on doing so. Thank you.