These are the sphinx3_align results. I am now trying to convert the frame numbers to the corresponding onset and offset times of the words.
SFrm EFrm SegAScr Word
0 58 -518766 <s>
59 106 -288027 <sil>
107 123 -42713 <s>
124 138 -129898 THE
139 171 -299927 QUICK
172 206 -231186 BROWN
207 263 -346435 FOX
264 298 -313259 JUMPS
299 321 -121317 OVER
322 331 -114349 THE
332 373 -390465 LAZY
374 424 -634808 DOG
425 433 -154447 <sil>
434 459 -105426 </s>
460 480 -138397 </s>
Total score: -3829420
I have not changed any configuration settings. By default, these are the parameters PocketSphinx uses for feature vector computation:
Frame rate = 100 frames/s
Window length = 0.0256 s
From these I calculated the number of samples per frame: 0.0256 × 16000 = 409.6 ≈ 410 samples/frame
Window shift = Fs / frame rate = 16000 / 100 = 160 samples
So the window shifts by 160 samples (10 ms) for every frame. For the last word, </s> (frames 460 to 480), this gives an onset and offset time of 4.6 s and 4.8 s. But the total length of the audio signal is around 6.195 s, so there is a mismatch. Please tell me if I am wrong. Thank you.
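The frame-to-time conversion described above can be sketched in a few lines of Python. This is only an illustration: the function name `segment_times` is hypothetical, and it assumes that EFrm is inclusive, so the offset is taken at the start of the following frame. Dropping the `+ 1` gives the 4.8 s figure quoted above instead.

```python
FRAME_RATE = 100  # frames per second, the default stated in the post


def segment_times(start_frame, end_frame, frame_rate=FRAME_RATE):
    """Convert an (SFrm, EFrm) pair to (onset, offset) in seconds.

    Assumption: EFrm is inclusive, so the segment ends where frame
    EFrm + 1 begins.
    """
    onset = start_frame / frame_rate
    offset = (end_frame + 1) / frame_rate
    return onset, offset


# Last segment of the alignment above: frames 460 to 480.
onset, offset = segment_times(460, 480)
print(f"onset={onset:.2f} s, offset={offset:.2f} s")
```

These times are relative to the feature frames that were actually produced, so they only line up with the original audio if no frames were dropped during feature extraction.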
Last edit: Diwakar.G 2017-01-10
There seems to be a problem with feature extraction from WAV files. When I extract features from .sph files with a NIST header, the file length and the number of feature vectors match, but for .wav files with a RIFF header they do not.
The WAV file is actually 5.12 s long, but instead of 512 or 511 frames I am getting only 401 frames. The same problem occurred for all WAV files. Please help me.
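A quick way to check the expected frame count for a RIFF WAV file is to read its header with Python's standard `wave` module and multiply the duration by the frame rate. This is a rough sketch: the helper names are hypothetical, and the count is only an approximation, since the real extractor may produce a frame or two fewer depending on how it handles the last window.

```python
import wave


def expected_frames(n_samples, sample_rate, frame_rate=100):
    """Approximate number of feature frames: duration times frame rate."""
    return (n_samples * frame_rate) // sample_rate


def wav_expected_frames(path, frame_rate=100):
    """Read the sample count and sample rate from a WAV header."""
    with wave.open(path, "rb") as wav:
        return expected_frames(wav.getnframes(), wav.getframerate(), frame_rate)


# A 5.12 s file at 16 kHz has 81920 samples.
print(expected_frames(81920, 16000))
```

Comparing this number with the number of frames in the extracted feature file shows how many frames are missing.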
Last edit: Diwakar.G 2017-01-10
By default, the feature extractor removes silence. You can add -remove_silence no to sphinx_fe to disable that, but long stretches of silence in the audio are harmful for other reasons.