Hi all,
is there a maximum size for my audio input file which I want to have a transcript of?
sphinx3_livepretend ends with the message
lt-sphinx3_livepretend: lextree.c:1262: lextree_hmm_eval: Assertion `((hmm_t *)(ln))->frame == frm' failed.
when I use bigger input files (>150 MB), so I wonder if this could be the reason for the error.
Any ideas?
Thanks
Hi, you probably want to use sphinx3_continuous instead - it will segment your input into individual utterances and do recognition on these independently.
There is an actual hard-coded limit, and I'm a bit surprised that sphinx3 isn't complaining about it before it reaches that assertion. I believe the limit is 150 seconds, which is well beyond the amount of speech any human can produce without pausing to breathe :-)
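To put the two numbers in relation: the 150-second cap applies per utterance, not per file. A quick back-of-the-envelope check (assuming raw 16 kHz, 16-bit mono PCM, which is only my guess at the format):

```python
# Rough duration estimate for a headerless PCM file (assumed 16 kHz, 16-bit mono).
SAMPLE_RATE = 16000      # samples per second (assumption)
BYTES_PER_SAMPLE = 2     # 16-bit samples

def raw_duration_seconds(file_size_bytes: int) -> float:
    """Return the playing time of a raw PCM file of the given size."""
    return file_size_bytes / (SAMPLE_RATE * BYTES_PER_SAMPLE)

# A 150 MB file is well over an hour of audio...
print(raw_duration_seconds(150 * 1024 * 1024))   # ~4915 seconds
# ...so without pauses it blows far past a 150-second utterance limit.
```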
Thanks for the hint!
I assume I have to execute sphinx3_continuous using the following parameters?
ctrlfile - file with the list of input files (batch)
rawdir - directory where the files in the above list are to be found
cfgfile - file with config parameters
Unfortunately, when I do so, the system stops without starting the recognition. My last output is:
[...]
INFO: Operation Mode = 4, Operation Name = fwdtree
INFO:
INFO: s3_decode.c(267): Input data will NOT be byte swapped
INFO: s3_decode.c(272): Partial hypothesis WILL be dumped
ERROR: "cont_ad_base.c", line 718: cont_ad_read requires buffer of at least 83845 samples
INFO: corpus.c(647): 05-11-07: -0.0 sec CPU, 0.0 sec Clk; TOT: -0.0 sec CPU, 0.0 sec Clk
INFO: stat.c(223): SUMMARY: 0 fr , No report
Any ideas what could be the problem?
Hm, reproducible for me; it really looks like a bug, at least it used to work before.
I've just committed a fix to sphinxbase, please update from svn and try again, now everything should be fine.
Thanks, it's running now without any error message. But what exactly does it do now: does it automatically segment my file into smaller utterances, or does it just stop once it reaches the maximum utterance size and ignore the rest?
Right now my recognition is too poor to judge from the output... Just looking at the number of recognised words, I doubt that it goes through the whole audio file!?
For me it automatically segments the file at pauses in speech and decodes each chunk. If it's not doing that for you, please provide the audio file and model parameters you are using.
Just noticed that I was actually wrong. The system now automatically divides my input into the lattice files
30-10-07_0.256.lat.gz
30-10-07_10.912.lat.gz
30-10-07_195.984.lat.gz
30-10-07_290.064.lat.gz
30-10-07_31.984.lat.gz
30-10-07_78.000.lat.gz
but still crashes with
..lt-sphinx3_continuous: lextree.c:1262: lextree_hmm_eval: Assertion `((hmm_t *)(ln))->frame == frm' failed.
(Find my output file including my settings here: http://us.share.geocities.com/ww.ranger/30-10-07.txt)
I don't know what I am doing wrong :( Do you mind trying with my settings and my audio file (http://tinyurl.com/39vsfv right mouse click and download, approx. 60 MB) on your system?
Mh, the output file link is broken now, try this one please http://tinyurl.com/2naqxo
Oh, your file has a lot of music and noise. Decoding such files is a very challenging task. Perhaps someone else will suggest something, but I need some time to think :)
About splitting: clean speech usually has pauses with very significant energy drops. sphinx3_continuous splits the file into utterances at such silence regions and decodes each one. If your file contains music, the chunks will be too large for a single decoder pass, so you might try segmenting the speech manually and passing small chunks to the decoder. For example, try the adcin tool from the Julius speech recognizer; it may split the speech into chunks better.
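A manual energy-based split like the one described above can be sketched in a few lines of Python. All the numbers below (16 kHz sample rate, 10 ms frames, amplitude threshold, minimum silence length) are assumptions you would tune for your own recordings:

```python
# Naive energy-based segmenter: split a stream of PCM samples at low-energy
# regions. All thresholds are assumptions; tune them for your audio.

FRAME = 160              # 10 ms frames at an assumed 16 kHz
SILENCE_THRESHOLD = 100  # mean absolute amplitude below this counts as silence
MIN_SILENCE_FRAMES = 30  # require >= 300 ms of silence before cutting

def split_on_silence(samples):
    """Return a list of (start, end) sample ranges holding non-silent chunks.

    A chunk may carry a short tail of silence; that is harmless for decoding.
    """
    chunks, chunk_start, silent_run = [], None, 0
    for i in range(0, len(samples), FRAME):
        frame = samples[i:i + FRAME]
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy < SILENCE_THRESHOLD:
            silent_run += 1
            if chunk_start is not None and silent_run >= MIN_SILENCE_FRAMES:
                chunks.append((chunk_start, i))   # cut: long enough pause
                chunk_start = None
        else:
            silent_run = 0
            if chunk_start is None:
                chunk_start = i                   # speech resumes here
    if chunk_start is not None:
        chunks.append((chunk_start, len(samples)))
    return chunks
```

Each (start, end) range could then be written out as a separate raw file and fed to the decoder as its own utterance.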
Hi guys,
first of all, thanks a lot for your continued feedback!!
Now, to my question ;-)
I decided to make a simple test run and divide the audio in a "dumb" way every 30 seconds. Do I therefore have to split my wav file physically, or can I just specify in my control file where it should be divided?
On http://cmusphinx.sourceforge.net/sphinx3/doc/s3_description.html#sec_ctl , it says I can use the format
AudioFile [ StartFrame EndFrame UttID ]
However, if I do so, StartFrame, EndFrame and UttID are completely ignored and the tool takes the whole audio file as input (and then divides it based on silence).
What am I doing wrong there now?
Thanks!!
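For reference, here is how I generate the 30-second entries I was trying. My understanding (an assumption based on the format above) is that StartFrame/EndFrame are frame indices at the usual 100 frames per second, i.e. a 10 ms frame shift, so 30 seconds = 3000 frames:

```python
# Generate control-file lines "AudioFile StartFrame EndFrame UttID" that cut a
# recording into fixed 30-second pieces. 100 frames/s (10 ms frame shift) is
# the usual sphinx3 frame rate; treat it as an assumption for your setup.

FRAMES_PER_SECOND = 100

def make_ctl_lines(audio_name, total_seconds, chunk_seconds=30):
    lines = []
    total_frames = int(total_seconds * FRAMES_PER_SECOND)
    step = chunk_seconds * FRAMES_PER_SECOND
    for n, start in enumerate(range(0, total_frames, step)):
        end = min(start + step, total_frames)   # last chunk may be shorter
        lines.append(f"{audio_name} {start} {end} {audio_name}_{n:03d}")
    return lines

for line in make_ctl_lines("30-10-07", 95):
    print(line)
# 30-10-07 0 3000 30-10-07_000
# 30-10-07 3000 6000 30-10-07_001
# 30-10-07 6000 9000 30-10-07_002
# 30-10-07 9000 9500 30-10-07_003
```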
Ahh. Yes, you need something more sophisticated than the simple speech/silence based endpointer used by sphinx3_continuous.
LIUM in France has contributed a segmenter which they have used successfully in French broadcast news transcription; you can find it at:
http://www-lium.univ-lemans.fr/tools/index.php?option=com_content&task=blogcategory&id=29&Itemid=56
However I haven't actually tried to use it on anything yet so I can't answer any questions about it at the moment.