Hi, I am trying to use sphinx2-align to do phone alignment
on a set of audio sentences I've recorded. If I am not
mistaken, the output of the aligner is in units of "frames";
in the example below, the middle SIL lasts from frame 478 to 511...
My question: is it possible to get output in msec rather
than frames? If so, how? If not, is there a simple way to
convert the frame-based output to timings in msec
(i.e., 256 samples per frame, etc.)?
Thanks in advance
--tony
Phone          Beg    End   Acoustic Score
SIL              0    197        -42977523
SIL            198    466        -47576466
L(SIL,IY)b     467    470         -1560823
IY(L,F)        471    473          -812794
F(IY,SIL)e     474    477         -1207767
SIL            478    511         -6958443
M(SIL,EY)b     512    517         -1698059
EY(M,T)        518    520         -1356272
T(EY,SIL)e     521    524          -943199
SIL            525   1211       -149509886
SIL           1212   1363         -3405662
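The alignment output above is plain whitespace-separated text, so it is straightforward to read back programmatically. A minimal sketch (assuming the four-column layout shown: phone label, begin frame, end frame, acoustic score; the function name is my own, not part of sphinx2):

```python
def parse_alignment(text):
    """Parse sphinx2-align output lines into (phone, beg, end, score) tuples.

    Assumes each line has exactly four whitespace-separated fields,
    as in the example output above.
    """
    segments = []
    for line in text.strip().splitlines():
        phone, beg, end, score = line.split()
        segments.append((phone, int(beg), int(end), int(score)))
    return segments

# Example: the middle SIL row from the output above.
print(parse_alignment("SIL 478 511 -6958443"))
# [('SIL', 478, 511, -6958443)]
```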
Each frame number corresponds to a certain number of ms. I am not positive, but from observing the results I think it is 1000 samples per frame, so at 16 kHz that's 6.25 ms per frame. See if that jibes with your observations.
Yes, it was giving results at 0.00625 seconds per increment. Actually the frames are about 410 samples padded to 512 at 16 kHz, but they are overlapped, which gives that time.
However, that was a bug. I just checked a fix into uttproc.c that makes it the (proper) 0.01 seconds (ten milliseconds) per increment. This makes things faster and more accurate :)
kevin
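The conversion described above can be sketched in a few lines. This assumes the post-fix rate of 10 ms per frame (100 frames per second); if you are on an older build with the 6.25 ms bug, substitute that value. The function names are illustrative, not part of sphinx2:

```python
MS_PER_FRAME = 10  # post-fix rate: 0.01 s per frame (pre-fix builds used 6.25)

def frame_to_ms(frame, ms_per_frame=MS_PER_FRAME):
    """Convert a frame index to its offset in milliseconds."""
    return frame * ms_per_frame

def segment_duration_ms(beg, end, ms_per_frame=MS_PER_FRAME):
    """Duration in ms of a segment spanning frames beg..end inclusive,
    as in the aligner's Beg/End columns."""
    return (end - beg + 1) * ms_per_frame

# The middle SIL in the example output spans frames 478-511:
print(frame_to_ms(478))               # start offset: 4780 ms
print(segment_duration_ms(478, 511))  # duration: 340 ms
```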
Anonymous - 2002-03-05
Can I ask a stupid question? No, don't answer that; I'm going to ask it anyway.
I too have a set of voice recordings and have created a transcript file from them. The stupid question is: do I need to force-align these files before I use SphinxTrain?
I ask because the CI training is throwing up so many errors, and I wonder whether force-alignment would help.
Well, you need a model in order to force-align. Then, according to the docs, you should force-align iteratively. I'm not sure that force-alignment helps consistently; I've had it make things worse -- it seems you just take your chances and cross your fingers.
So the first time around there is no model to use, which means you can't force-align...