
Converting frame numbers to time stamps of words or phones

Diwakar.G
2017-01-10
  • Diwakar.G

    Diwakar.G - 2017-01-10

    These are the sphinx3_align results. Now I am trying to convert the frame numbers to the corresponding onset and offset times of the words.

      SFrm  EFrm  SegAScr  Word
         0    58  -518766  <s>
        59   106  -288027  <sil>
       107   123   -42713  <s>
       124   138  -129898  THE
       139   171  -299927  QUICK
       172   206  -231186  BROWN
       207   263  -346435  FOX
       264   298  -313259  JUMPS
       299   321  -121317  OVER
       322   331  -114349  THE
       332   373  -390465  LAZY
       374   424  -634808  DOG
       425   433  -154447  <sil>
       434   459  -105426  </s>
       460   480  -138397  </s>
      Total score: -3829420
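A minimal Python sketch of the frame-to-time conversion for a table like the one above. It assumes 100 frames per second (the 10 ms shift used in this thread) and that EFrm is inclusive, so a segment's offset is taken as (EFrm + 1) / 100 s. The inclusive-EFrm reading is my assumption, not something stated in the alignment output, and the function name frames_to_times is hypothetical.

```python
# Sketch: convert a sphinx3_align segment table (SFrm EFrm SegAScr Word)
# into (word, onset, offset) times in seconds.
# Assumptions: 100 frames per second, and EFrm is inclusive, so a segment
# ends where frame EFrm + 1 would start.

FRAME_RATE = 100  # frames per second (10 ms frame shift), assumed


def frames_to_times(table_text, frame_rate=FRAME_RATE):
    """Return a list of (word, onset_s, offset_s) tuples."""
    segments = []
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows have four fields starting with two integer frame numbers;
        # this skips the header row and the "Total score" line.
        if len(fields) != 4 or not fields[0].isdigit() or not fields[1].isdigit():
            continue
        start, end, word = int(fields[0]), int(fields[1]), fields[3]
        onset = start / frame_rate
        offset = (end + 1) / frame_rate
        segments.append((word, round(onset, 3), round(offset, 3)))
    return segments


if __name__ == "__main__":
    sample = """\
  SFrm  EFrm  SegAScr  Word
   124   138  -129898  THE
   139   171  -299927  QUICK
   460   480  -138397  </s>
 Total score: -3829420
"""
    for word, onset, offset in frames_to_times(sample):
        print(f"{word:8s} {onset:6.2f} {offset:6.2f}")
```

These times are relative to the feature stream that sphinx3_align received, so they only line up with the original recording if no frames were dropped during feature extraction.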
    

    I have not changed any configuration settings. By default, these are the PocketSphinx parameters for feature-vector computation (sampling rate Fs = 16000 Hz):
    Frame rate = 100 frames/s
    Window length = 0.0256 s
    From these I calculated the number of samples per frame = 0.0256 * 16000 ≈ 410 samples/frame
    Window shift = Fs / frame rate = 16000 / 100 = 160 samples
    So the window is shifted every 160 samples (10 ms), and frame n starts at n / 100 s. For the last word, </s> (frames 460 to 480), that gives an onset of 4.6 s and an offset of about 4.8 s. But the total length of the audio signal is about 6.195 s, so there is a mismatch. Please correct me if I am wrong. Thank you.
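The arithmetic above can be checked with a short Python sketch. It assumes 16 kHz audio and the defaults quoted in the post, and approximates the frame count as one frame per 160-sample shift; the real extractor may produce a frame or two more or fewer at the end of a file. The helper names are hypothetical.

```python
# Sketch of the frame arithmetic, assuming 16 kHz audio, 100 frames/s and a
# 0.0256 s window. The frame count is the simple "one frame per shift"
# approximation, not the exact sphinxbase end-of-file handling.

SAMPLE_RATE = 16000     # samples per second
FRAME_RATE = 100        # frames per second
WINDOW_LENGTH = 0.0256  # seconds


def window_and_shift(sample_rate=SAMPLE_RATE, frame_rate=FRAME_RATE,
                     window_length=WINDOW_LENGTH):
    """Return (samples per analysis window, samples per frame shift)."""
    window = round(window_length * sample_rate)  # 409.6 -> 410 samples
    shift = sample_rate // frame_rate            # 160 samples = 10 ms
    return window, shift


def approx_frame_count(n_samples, sample_rate=SAMPLE_RATE, frame_rate=FRAME_RATE):
    """Approximate number of frames produced from n_samples of audio."""
    _, shift = window_and_shift(sample_rate, frame_rate)
    return n_samples // shift


def frame_to_seconds(frame, frame_rate=FRAME_RATE):
    """Start time of a frame: frame index times the 10 ms shift."""
    return frame / frame_rate


if __name__ == "__main__":
    window, shift = window_and_shift()
    print("window:", window, "samples; shift:", shift, "samples")
    audio_frames = approx_frame_count(99120)  # 6.195 s * 16000 Hz
    aligned_frames = 480 + 1                  # last EFrm, counted inclusively
    print("frames expected from audio:", audio_frames)
    print("frames covered by alignment:", aligned_frames)
    print("unaccounted audio: %.2f s" % frame_to_seconds(audio_frames - aligned_frames))
```

For 6.195 s of audio (99120 samples) this gives about 619 frames, while the alignment ends at frame 480, so roughly 1.38 s of audio has no corresponding frames. The arithmetic itself is consistent; the gap has to come from frames that never reached the aligner.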

     

    Last edit: Diwakar.G 2017-01-10
  • Diwakar.G

    Diwakar.G - 2017-01-10

    There seems to be a problem with feature extraction from WAV files. When I extract features from .sph files with a NIST header, the audio length and the number of feature vectors match, but for .wav files with a RIFF header they do not.

    sitecsp@acl-pg-06:~$ sphinx_cepview -b 82 -e 98 -d 13 -f FC01-Session1-0054.mfc
    Current configuration:
    [NAME]      [DEFLT]     [VALUE]
    -b      0       82
    -d      10      13
    -describe   0       0
    -e      2147483647  98
    -f              FC01-Session1-0054.mfc
    -header     0       0
    -i      13      13
    
    INFO: main_cepview.c(152): Displaying 13 out of 13 columns per frame
    INFO: main_cepview.c(153): Total 401 frames
    
     53.251  -6.944   1.407  -2.457  -6.865  -5.867   9.838   9.280   2.394   5.270  -0.121  -4.562  10.742 
     52.650  -6.407  -1.116  -3.184   4.144  -1.115   3.089  -1.796   1.651   2.165   1.681  -6.425   4.559 
     52.858  -9.507  -4.892 -13.696  -9.678  -4.131   3.844   4.681   6.215   0.275  -7.676   1.772   6.018 
     49.998  -9.793  -0.598 -12.458 -10.120  -5.591   8.318   4.080  -2.615   0.523  -7.819  -2.630   0.533 
     50.406 -12.133  -5.096 -10.152 -13.763  -9.402   3.709   4.309  -5.395  -2.200   1.035   4.833   3.242 
     51.560  -9.216  -4.687  -4.951 -12.551  -6.305  11.447   5.209   2.412   0.401  -2.296   4.876   7.482 
     49.922  -7.210   3.928   2.236  -0.227  -1.214   3.342   4.979   2.664   4.103  -0.035   1.079   1.714 
     49.982  -8.810   2.402  -3.998 -11.644  -5.471   9.500   9.433  -4.328   6.917  -1.290   2.723   7.185 
     48.995 -11.293   5.384  -1.381 -15.051  -4.903  12.733   2.482   1.283  -0.518   2.501   9.957   5.690 
     49.450  -7.544   4.241   0.479 -13.377  -4.857   2.167   2.199  -3.727  -1.452   0.471  -1.558   5.217 
     50.368  -8.751  -3.676  -9.867 -16.020   3.206  13.360   1.570  -4.307   5.506   9.127   2.034   1.171 
     49.394  -5.125   7.343  -8.553  -2.801   5.864   9.743   1.970   0.877   6.508   3.909   5.188   6.046 
     50.854  -9.557  -1.881  -4.577 -17.259  -4.722  13.825  14.667   7.021   4.386  -2.905  -0.568  14.095 
     50.453 -10.525   5.201   3.677 -11.436   0.758  21.489   6.983   0.205  17.617  -6.318   1.005  18.804 
     54.942  -6.155  14.309  21.192  -3.079   9.075  25.065  15.158   7.923  18.296   5.295   5.404  16.285 
     58.618   4.933  21.806  30.604   1.804   7.033  23.187  14.198   5.428   9.530   1.043   3.065  13.867
    

    The WAV file is actually 5.12 s long, so I expected 511 or 512 frames, but I get only 401. The same problem occurs for all WAV files. Please help me.
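To compare the audio length with the feature file for many files at once, something like the following Python sketch could replace the manual sphinx_cepview check. It assumes the Sphinx .mfc layout is a 4-byte integer giving the total number of float32 values, followed by those values, with the byte order detected by checking which reading matches the file size. That layout is my understanding of the format, not something stated in this thread, and the function names are hypothetical.

```python
import os
import struct
import wave


def mfc_frame_count(path, ncep=13):
    """Number of frames in a Sphinx .mfc file (assumed layout: int32 count
    of float32 values, then the values; byte order inferred from file size)."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.read(4)
    for fmt in (">i", "<i"):
        (n_floats,) = struct.unpack(fmt, header)
        if n_floats * 4 + 4 == size:
            return n_floats // ncep
    raise ValueError(f"{path}: header does not match file size")


def wav_duration(path):
    """Duration of a RIFF WAV file in seconds."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()


def compare_lengths(wav_path, mfc_path, frame_rate=100, ncep=13):
    """Return (duration_s, frames expected at frame_rate, frames in .mfc)."""
    duration = wav_duration(wav_path)
    expected = round(duration * frame_rate)
    return duration, expected, mfc_frame_count(mfc_path, ncep)
```

For example, compare_lengths("FC01-Session1-0054.wav", "FC01-Session1-0054.mfc") should report 401 frames for the file shown above (the .wav name here is a hypothetical counterpart of the .mfc file); for a 5.12 s recording it would report about 512 expected frames.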

     

    Last edit: Diwakar.G 2017-01-10
    • Nickolay V. Shmyrev

      By default, the feature extractor removes silence. You can add -remove_silence no to sphinx_fe to disable that, but long stretches of silence in the audio are harmful for other reasons.
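A small Python sketch of re-running feature extraction with silence removal disabled, so that frame numbers from the alignment map directly onto time in the original WAV file. Only -remove_silence comes from the reply; -i, -o, -mswav and -samprate are sphinx_fe options as I understand them and should be checked against the help output of your sphinx_fe version. The helper names and file names are hypothetical.

```python
import shlex
import subprocess


def sphinx_fe_command(wav_path, mfc_path, keep_silence=True, sample_rate=16000):
    """Build a sphinx_fe command line for one RIFF WAV file (assumed flags)."""
    cmd = ["sphinx_fe", "-i", wav_path, "-o", mfc_path,
           "-mswav", "yes", "-samprate", str(sample_rate)]
    if keep_silence:
        # Keep every frame so frame n still starts at n / 100 s in the audio.
        cmd += ["-remove_silence", "no"]
    return cmd


def run_sphinx_fe(wav_path, mfc_path, **kwargs):
    """Print and run the command; raises CalledProcessError on failure."""
    cmd = sphinx_fe_command(wav_path, mfc_path, **kwargs)
    print("running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

keep_silence defaults to True here because the goal in this thread is accurate time stamps; as the reply notes, keeping long silences can hurt recognition in other ways, so it may be better to trim long silences from the audio itself than to rely on the extractor dropping frames.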

       
