Hi all,
I have a general question about the calculation of MFCCs.
I would like to summarize my understanding and ask you to judge whether it is
correct.
For example, audio samples are captured at 16 kHz.
All captured samples are pre-emphasized with a simple FIR filter to amplify the high-frequency parts of the signal.
The audio stream is split into buffers such that the first 160 samples (10 ms) of each buffer are the last 160 samples of the previous buffer.
A Hamming window is applied to each buffer.
The samples are transformed to the frequency domain with a 512-point FFT. Because the spectrum of a real signal is symmetric, only the first half is needed for further processing (256 bins once the DC bin is discarded).
The squared magnitude of the FFT output is calculated.
Mel filtering: e.g. 40 Mel filters are available. Each filter represents a band-pass with a specific bandwidth and position. Because we are in the frequency domain, filtering is done by multiplying the power spectrum by the filter weights and summing the products, which yields just one value per filter. In total, 40 values are calculated by the Mel filter bank.
The Mel filter outputs are compressed with the natural logarithm.
The values are transformed with a discrete cosine transform, but only the first 13 coefficients of the output are used in further steps.
1st and 2nd derivatives are calculated from previous and “future” coefficients.
The MFCC vector consists of 39 values in total, of which the last 26 are the results of the derivatives.
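To make the steps above concrete, here is a minimal NumPy sketch of the pipeline as I understand it. Several details are my own assumptions and are not stated in the post: the pre-emphasis coefficient of 0.97, a 480-sample buffer with a 320-sample hop (so that 160 samples overlap), the mel formula 2595 * log10(1 + f / 700), triangular filters spanning 0 Hz to Nyquist, an unnormalized DCT-II, and a regression window of two frames on each side for the derivatives. The function names are hypothetical, and this is not necessarily how any particular front end implements it.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters whose edges are equally spaced on the mel scale.
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2))
    freqs = np.arange(n_fft // 2 + 1) * sr / n_fft  # centre frequency of each FFT bin
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, mid, hi = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def dct2(x, n_out):
    # Unnormalized DCT-II along the last axis, keeping the first n_out coefficients.
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    m = np.arange(n)[None, :]
    return x @ np.cos(np.pi * k * (m + 0.5) / n).T

def static_mfcc(signal, sr=16000, frame_len=480, hop=320, n_fft=512,
                n_filters=40, n_ceps=13, preemph=0.97):
    # Pre-emphasis: first-order FIR, y[n] = x[n] - preemph * x[n-1].
    emphasized = np.append(signal[0], signal[1:] - preemph * signal[:-1])
    # Overlapping buffers: frame_len - hop samples are shared with the previous one.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # Power spectrum; rfft zero-pads each frame to n_fft and returns n_fft/2 + 1 bins.
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2
    # Weighted sum per filter gives one value per filter; the DC bin has zero weight.
    energies = power @ mel_filterbank(n_filters, n_fft, sr).T
    log_energies = np.log(np.maximum(energies, 1e-10))  # floor avoids log(0)
    return dct2(log_energies, n_ceps)

def deltas(feat, width=2):
    # Regression over `width` previous and `width` following frames (edges repeated).
    n_frames = len(feat)
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    out = np.zeros_like(feat)
    for n in range(1, width + 1):
        ahead = padded[width + n: width + n + n_frames]
        behind = padded[width - n: width - n + n_frames]
        out += n * (ahead - behind)
    return out / (2 * sum(n * n for n in range(1, width + 1)))

def mfcc_39(signal):
    # 13 static coefficients followed by 13 first and 13 second derivatives.
    static = static_mfcc(signal)
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])
```

Calling `mfcc_39` on one second of 16 kHz audio returns one 39-value row per buffer under these assumptions. Note that the derivatives need frames on both sides, which is why the "future" coefficients mentioned above add a few frames of latency in a streaming setup.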
This MFCC vector represents a compressed and transformed version of the
480 audio samples, and it makes no sense to play it through a digital-to-analogue
converter.
Is this correct?
Thank you for your effort.
> This (MFCC) coefficient represents a compressed and transformed version of
> the 480 audio samples and it makes no sense to play it via digital to
> analogue converter.
You are correct about the process. As for playing the cepstrum, you can convert it
back to audio with an MLSA filter, but the quality degrades.
http://onlinelibrary.wiley.com/doi/10.1002/ecja.4400660203/abstract
Thank you very much for the reply and the interesting link!