CMU Sphinx / Forums / Help: MFCC extraction using sphinx

Balaji - 2020-09-16

Hello,

I extracted MFCC features from WAV files using Sphinx 4, then converted them to a text format to view them.
To learn the extraction method, I wrote a Python script that performs framing, windowing, FFT, etc., following two MFCC tutorials (http://practicalcryptography.com/miscellaneous/machine-learning/guide-mel-frequency-cepstral-coefficients-mfccs/ and https://www.kaggle.com/ilyamich/mfcc-implementation-and-tutorial).

These are the parameters I am using: a 10 ms frame shift and a 25.6 ms window length.
My question is about the number of feature vectors generated:

My first .wav file has length 1.9934375 s = 1993.4375 ms and generated 104 feature vectors (13 coefficients in each row).

My second .wav file has length 2.12125 s = 2121.25 ms and generated 124 feature vectors (13 coefficients in each row).

My doubt is this: if there are 100 frames per second and each frame is converted into one 13-element feature vector, then each second should yield 100 vectors (each comprising 13 coefficients). The examples above do not match this.
Can you explain the arithmetic, please? I am not sure whether I have misconfigured a parameter.

Thank you.

Balaji.

Nickolay V. Shmyrev - 2020-09-16

There is voice activity detection, which removes frames. You can add -remove_silence no to see the remaining ones.
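
As a rough check of the arithmetic, the frame count without silence removal can be sketched as below. This is only an illustration, not the Sphinx code: it assumes a 16 kHz sample rate (not stated in the first post), the default frate of 100 and wlen of 0.025625 mentioned later in the thread, and that only full windows are counted. The helper name expected_frames is made up for this sketch, and the real tool may differ by a frame or so depending on how it handles the end of the file.

```python
def expected_frames(duration_s, samprate=16000, frate=100, wlen=0.025625):
    """Count the full analysis windows that fit in the signal (no padding, no VAD).

    samprate=16000 is an assumption; frate and wlen are the defaults quoted
    later in this thread.
    """
    n_samples = round(duration_s * samprate)
    hop = round(samprate / frate)   # samples between frame starts
    win = round(wlen * samprate)    # samples per analysis window
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


# Durations and observed vector counts are the ones reported in the question.
for duration, observed in [(1.9934375, 104), (2.12125, 124)]:
    total = expected_frames(duration)
    print(f"{duration} s: about {total} frames before silence removal, "
          f"{observed} kept, about {total - observed} removed")
```

Under these assumptions the two files would give about 197 and 210 frames, so roughly 90 frames in each file would have been dropped as silence, which is consistent with the explanation above.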
  

Balaji - 2020-09-17

Thank you. I got it.


Balaji - 2020-09-17

How are silent frames identified? Is there a threshold?
I inspected the frames but could not work it out from the values.


Balaji - 2020-11-30

Hello,

I am performing MFCC extraction using sphinx_fe. I am separately performing each step of the feature extraction procedure and comparing the results with the output of sphinx_fe.

I have questions on these parameters for feature extraction:
1. frate: the default is 100. This means the hop_length is 10 milliseconds, so frames start at 0, 10, 20, ..., 990 milliseconds, right? With -samprate = 16000, the hop_length is 160 samples. Is this correct?
2. How many samples are in each frame? The parameter wlen = 0.025625. I interpret this as framesize = 0.025625 seconds, that is, 25.625 milliseconds = 410 samples (at a 16 kHz sampling rate). Is this correct?
Or is it the nfft=512 parameter that defines the framesize as 512 samples?
3. lowerf = 133.33334 and upperf = 6855.4976. Why these values? Can we set the range to 0 to 8000 (the Nyquist frequency for a 16000 Hz sampling rate)?

Thanks for your help.

Last edit: Balaji 2020-11-30


Nickolay V. Shmyrev - 2020-12-13

frate: the default is 100. This means the hop_length is 10 milliseconds, so frames start at 0, 10, 20, ..., 990 milliseconds, right? With -samprate = 16000, the hop_length is 160 samples. Is this correct?

Yes

How many samples are in each frame? The parameter wlen = 0.025625. I interpret this as framesize = 0.025625 seconds, that is, 25.625 milliseconds = 410 samples (at a 16 kHz sampling rate). Is this correct?

Yes

Or is it the nfft=512 parameter that defines the framesize as 512 samples?

No
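
To make the three answers concrete, here is a small sketch of how the sizes relate. It is an illustration under assumptions, not the sphinx_fe source: the constant and function names are made up, the signal is a placeholder, and it assumes the usual practice of zero-padding each 410-sample frame up to nfft = 512 before the FFT, which is why nfft can be larger than the frame without defining the frame size.

```python
SAMPRATE = 16000   # -samprate
FRATE = 100        # frames per second
WLEN = 0.025625    # window length in seconds
NFFT = 512         # FFT size

HOP = round(SAMPRATE / FRATE)        # samples between frame starts
FRAME_LEN = round(WLEN * SAMPRATE)   # samples per analysis frame


def padded_frames(signal):
    """Yield (start_sample, frame) with each frame zero-padded to NFFT samples."""
    for start in range(0, len(signal) - FRAME_LEN + 1, HOP):
        frame = list(signal[start:start + FRAME_LEN])
        # The FFT would be applied to this padded buffer; the extra zeros
        # only fill the buffer and add no new audio samples.
        yield start, frame + [0.0] * (NFFT - len(frame))


signal = [0.0] * SAMPRATE  # one second of placeholder samples
frames = list(padded_frames(signal))
starts_ms = [1000 * start / SAMPRATE for start, _ in frames[:3]]
print(HOP, FRAME_LEN, len(frames[0][1]), starts_ms)
```

Windowing and pre-emphasis are left out to keep the sketch short.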

lowerf = 133.33334 and upperf = 6855.4976. Why these values? Can we set the range to 0 to 8000 (the Nyquist frequency for a 16000 Hz sampling rate)?

Most of the training audio doesn't contain very high or very low frequencies anyway, so those bands would only add useless noise for recognition and training. These values showed good results in experiments, though in modern ASR the range is usually 20 to 7600 Hz.
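
To see what lowerf and upperf actually control, the sketch below spaces filter edges evenly on the mel scale between the two limits. It is only an illustration: it uses the common 2595 * log10(1 + f / 700) mel formula, which may not match the Sphinx implementation exactly, and nfilt=40 and the function names are assumptions made for the example.

```python
import math


def hz_to_mel(f_hz):
    """Common mel-scale formula (assumed here, not taken from the Sphinx source)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_edges(lowerf, upperf, nfilt):
    """Return nfilt + 2 edge frequencies in Hz, evenly spaced on the mel scale."""
    lo, hi = hz_to_mel(lowerf), hz_to_mel(upperf)
    step = (hi - lo) / (nfilt + 1)
    return [mel_to_hz(lo + i * step) for i in range(nfilt + 2)]


# nfilt=40 is an assumed filter count for illustration only.
default_edges = filter_edges(133.33334, 6855.4976, nfilt=40)
full_band_edges = filter_edges(0.0, 8000.0, nfilt=40)
print([round(f, 1) for f in default_edges[:3]], round(default_edges[-1], 1))
print([round(f, 1) for f in full_band_edges[:3]], round(full_band_edges[-1], 1))
```

Widening the range to 0 to 8000 is possible in this sketch, but it spends filters on the bands that the reply above describes as mostly noise.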

Balaji - 2020-12-14

Thank you very much, Nickolay.
Is there any reason for these odd sizes (framesize: 410 samples and hop_length: 160 samples)?

The examples I worked through used only power-of-two sizes, such as framesize: 1024 and hop_length: 512.
