Hi all,
is there a maximum size for my audio input file which I want to have a transcript of?
sphinx3_livepretend ends with the message
lt-sphinx3_livepretend: lextree.c:1262: lextree_hmm_eval: Assertion `((hmm_t *)(ln))->frame == frm' failed.
when I use bigger input files (>150 MB), so I wonder if this could be the reason for the error.
Any ideas?
Thanks
Hi, you probably want to use sphinx3_continuous instead - it will segment your input into individual utterances and do recognition on these independently.
There is an actual hard-coded limit, and I'm a bit surprised that sphinx3 isn't complaining about it before it reaches that assertion. I believe the limit is 150 seconds, which is well beyond the amount of speech any human can produce without pausing to breathe :-)
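To put the two numbers in relation: the 150-second cap applies per utterance, not per file. A quick back-of-the-envelope check (assuming raw 16 kHz, 16-bit mono PCM, which is only my guess at the format):

```python
# Rough duration estimate for a headerless PCM file (assumed 16 kHz, 16-bit mono).
SAMPLE_RATE = 16000      # samples per second (assumption)
BYTES_PER_SAMPLE = 2     # 16-bit samples

def raw_duration_seconds(file_size_bytes: int) -> float:
    """Return the playing time of a raw PCM file of the given size."""
    return file_size_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE)

# A 150 MB file is well over an hour of audio...
print(raw_duration_seconds(150 * 1024 * 1024))   # ~4915 seconds
# ...so without pauses it blows far past a 150-second utterance limit.
```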
Thanks for the hint!
I assume I have to execute sphinx3_continuous using the following parameters?
ctrlfile - file with the list of input files (batch)
rawdir - directory where the files in the above list are to be found
cfgfile - file with config parameters
Unfortunately, when I do so, the system stops without starting the recognition. My last output is:
[...]
INFO: Operation Mode = 4, Operation Name = fwdtree
INFO:
INFO: s3_decode.c(267): Input data will NOT be byte swapped
INFO: s3_decode.c(272): Partial hypothesis WILL be dumped
ERROR: "cont_ad_base.c", line 718: cont_ad_read requires buffer of at least 83845 samples
INFO: corpus.c(647): 05-11-07: -0.0 sec CPU, 0.0 sec Clk; TOT: -0.0 sec CPU, 0.0 sec Clk
INFO: stat.c(223): SUMMARY: 0 fr , No report
Any ideas what could be the problem?
Hm, reproducible for me; it really looks like a bug, at least it used to work before.
I've just committed a fix to sphinxbase, please update from svn and try again, now everything should be fine.
Thanks, it's running now without any error message. But what exactly does it do now: does it automatically segment my file into smaller utterances, or does it just stop once it reaches the maximum utterance size and ignore the rest?
Right now my recognition is too poor to judge from the output... Just looking at the number of recognised words, I doubt that it goes through the whole audio file!?
For me it automatically segments the file at pauses in speech and decodes each chunk. If it's not doing that for you, please provide the audio file and model parameters you are using.
Just noticed that I was actually wrong. The system now automatically divides my input into the lattice files
30-10-07_0.256.lat.gz
30-10-07_10.912.lat.gz
30-10-07_195.984.lat.gz
30-10-07_290.064.lat.gz
30-10-07_31.984.lat.gz
30-10-07_78.000.lat.gz
but still crashes with
..lt-sphinx3_continuous: lextree.c:1262: lextree_hmm_eval: Assertion `((hmm_t *)(ln))->frame == frm' failed.
(Find my output file including my settings here: http://us.share.geocities.com/ww.ranger/30-10-07.txt)
I don't know what I am doing wrong :( Do you mind trying with my settings and my audio file (http://tinyurl.com/39vsfv right mouse click and download, approx. 60 MB) on your system?
Mh, the output file link is broken now, try this one please http://tinyurl.com/2naqxo
Oh, your file has a lot of music and noise. Decoding such files is a very challenging task. Perhaps someone else will suggest something, but I need some time to think :)
About splitting: clean speech usually has pauses with very significant energy drops. sphinx3_continuous splits the file into utterances at such silence regions and decodes each one. If your file contains music, the chunks will be too large for a single decoder pass, so you might try segmenting the speech manually and passing small chunks to the decoder. For example, try the adcin tool from the Julius speech recognizer; it may split the speech into chunks better.
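A manual energy-based split like the one described above can be sketched in a few lines of Python. All the numbers below (16 kHz sample rate, 10 ms frames, amplitude threshold, minimum silence length) are assumptions you would tune for your own recordings:

```python
# Naive energy-based segmenter: split a stream of PCM samples at low-energy
# regions. All thresholds are assumptions; tune them for your audio.

FRAME = 160              # 10 ms frames at an assumed 16 kHz
SILENCE_THRESHOLD = 100  # mean absolute amplitude below this counts as silence
MIN_SILENCE_FRAMES = 30  # require >= 300 ms of silence before cutting

def split_on_silence(samples):
    """Return a list of (start, end) sample ranges holding non-silent chunks.

    A chunk may carry a short tail of silence; that is harmless for decoding.
    """
    chunks, chunk_start, silent_run = [], None, 0
    for i in range(0, len(samples), FRAME):
        frame = samples[i:i + FRAME]
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy < SILENCE_THRESHOLD:
            silent_run += 1
            if chunk_start is not None and silent_run >= MIN_SILENCE_FRAMES:
                chunks.append((chunk_start, i))   # cut: long enough pause
                chunk_start = None
        else:
            silent_run = 0
            if chunk_start is None:
                chunk_start = i                   # speech resumes here
    if chunk_start is not None:
        chunks.append((chunk_start, len(samples)))
    return chunks
```

Each (start, end) range could then be written out as a separate raw file and fed to the decoder as its own utterance.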
Hi guys,
first of all, thanks a lot for your continued feedback!!
Now, to my question ;-)
I decided to make a simple test run and divide the audio in a "dumb" way every 30 seconds. Do I therefore have to split my wav file physically, or can I just specify in my control file where it should be divided?
On http://cmusphinx.sourceforge.net/sphinx3/doc/s3_description.html#sec_ctl , it says I can use the format
AudioFile [ StartFrame EndFrame UttID ]
However, if I do so, StartFrame, EndFrame and UttID are completely ignored and the tool takes the whole audio file as input (and then divides it based on silence).
What am I doing wrong there now?
Thanks!!
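For reference, here is how I generate the 30-second entries I was trying. My understanding (an assumption based on the format above) is that StartFrame/EndFrame are frame indices at the usual 100 frames per second, i.e. a 10 ms frame shift, so 30 seconds = 3000 frames:

```python
# Generate control-file lines "AudioFile StartFrame EndFrame UttID" that cut a
# recording into fixed 30-second pieces. 100 frames/s (10 ms frame shift) is
# the usual sphinx3 frame rate; treat it as an assumption for your setup.

FRAMES_PER_SECOND = 100

def make_ctl_lines(audio_name, total_seconds, chunk_seconds=30):
    lines = []
    total_frames = int(total_seconds * FRAMES_PER_SECOND)
    step = chunk_seconds * FRAMES_PER_SECOND
    for n, start in enumerate(range(0, total_frames, step)):
        end = min(start + step, total_frames)   # last chunk may be shorter
        lines.append(f"{audio_name} {start} {end} {audio_name}_{n:03d}")
    return lines

for line in make_ctl_lines("30-10-07", 95):
    print(line)
# 30-10-07 0 3000 30-10-07_000
# 30-10-07 3000 6000 30-10-07_001
# 30-10-07 6000 9000 30-10-07_002
# 30-10-07 9000 9500 30-10-07_003
```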
Ahh. Yes, you need something more sophisticated than the simple speech/silence based endpointer used by sphinx3_continuous.
LIUM in France has contributed a segmenter which they have used successfully in French broadcast news transcription; you can find it at:
http://www-lium.univ-lemans.fr/tools/index.php?option=com_content&task=blogcategory&id=29&Itemid=56
However I haven't actually tried to use it on anything yet so I can't answer any questions about it at the moment.