Hi, I am trying to use sphinx2-align to do phone alignment
on a set of audio sentences I've recorded. If I am not
mistaken, the output of the aligner is in units of "frames";
in the example below, the middle SIL lasts from frame 478 to 511...
My question: is it possible to get output in msec rather
than frames? If so, how? If not, is there a simple way to
convert the frame-based output to timings in msec
(i.e., 256 samples per frame, etc.)?
Thanks in advance
--tony
Phone          Beg    End   Acoustic Score
SIL              0    197        -42977523
SIL            198    466        -47576466
L(SIL,IY)b     467    470         -1560823
IY(L,F)        471    473          -812794
F(IY,SIL)e     474    477         -1207767
SIL            478    511         -6958443
M(SIL,EY)b     512    517         -1698059
EY(M,T)        518    520         -1356272
T(EY,SIL)e     521    524          -943199
SIL            525   1211       -149509886
SIL           1212   1363         -3405662
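The alignment output above is plain whitespace-separated text, so it is straightforward to read back programmatically. A minimal sketch (assuming the four-column layout shown: phone label, begin frame, end frame, acoustic score; the function name is my own, not part of sphinx2):

```python
def parse_alignment(text):
    """Parse sphinx2-align output lines into (phone, beg, end, score) tuples.

    Assumes each line has exactly four whitespace-separated fields,
    as in the example output above.
    """
    segments = []
    for line in text.strip().splitlines():
        phone, beg, end, score = line.split()
        segments.append((phone, int(beg), int(end), int(score)))
    return segments

# Example: the middle SIL row from the output above.
print(parse_alignment("SIL 478 511 -6958443"))
# [('SIL', 478, 511, -6958443)]
```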
Each frame number corresponds to a certain number of ms. I am not positive, but from observing the results I think it is 1000 samples per frame, so at 16 kHz that's 6.25 ms per frame. See if that jibes with your observations.
Yes, it was giving results at 0.00625 seconds per increment. Actually the frames are about 410 samples padded to 512 at 16 kHz, but they are overlapped, which gives that time.
However, that was a bug. I just checked a fix into uttproc.c that makes it the (proper) 0.01 seconds (ten milliseconds) per increment. This makes things faster and more accurate :)
kevin
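The conversion described above can be sketched in a few lines. This assumes the post-fix rate of 10 ms per frame (100 frames per second); if you are on an older build with the 6.25 ms bug, substitute that value. The function names are illustrative, not part of sphinx2:

```python
MS_PER_FRAME = 10  # post-fix rate: 0.01 s per frame (pre-fix builds used 6.25)

def frame_to_ms(frame, ms_per_frame=MS_PER_FRAME):
    """Convert a frame index to its offset in milliseconds."""
    return frame * ms_per_frame

def segment_duration_ms(beg, end, ms_per_frame=MS_PER_FRAME):
    """Duration in ms of a segment spanning frames beg..end inclusive,
    as in the aligner's Beg/End columns."""
    return (end - beg + 1) * ms_per_frame

# The middle SIL in the example output spans frames 478-511:
print(frame_to_ms(478))               # start offset: 4780 ms
print(segment_duration_ms(478, 511))  # duration: 340 ms
```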
Anonymous - 2002-03-05
Can I ask a stupid question? No, don't answer that; I'm going to ask it anyway.
I too have a set of voice recordings and have created a transcript file from them. The stupid question is: do I need to force-align these files before I use SphinxTrain?
I ask because the CI training is throwing up so many errors, and I wonder whether force-alignment would help.
Well, you need a model in order to force-align. Then, according to the docs, you should force-align iteratively. I'm not sure that force-alignment helps consistently; I've had it make things worse -- it seems you just take your chances and cross your fingers.
So the first time around there is no model to use, which means you can't force-align...