Hi,
I have some long utterances for training (some more than an hour), and I also have some alignment information that I could use if needed. I would like to use them to create an acoustic model with SphinxTrain. Should I use forced alignment? Which commands should I use? Thanks.
TP
Anonymous - 2005-04-01
As I understand it, the acoustic model training paradigm assumes a set of discrete "utterances", each with a transcription in terms of words (and "filler" words, if appropriate). For each utterance, the word sequence is converted into a phone sequence, and thence into a sequence of triphone HMMs. Then, starting from some initial model, the Baum-Welch training program optimally aligns each utterance's signal feature frames against its HMMs and accumulates feature statistics for each HMM state; this yields a new, improved acoustic model, and the process is iterated.
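To make that pipeline concrete, here is a toy sketch of the word-to-phone-to-triphone expansion (not SphinxTrain's actual code; the dictionary entries and the triphone notation are only illustrative):

# Toy illustration: words -> phones -> context-dependent triphones.
# The pronunciations and the "left-center+right" notation are just for display.
DICT = {
    "HELLO": ["HH", "AH", "L", "OW"],
    "WORLD": ["W", "ER", "L", "D"],
}

def words_to_phones(words):
    # Look up each word's pronunciation and bracket the utterance with silence.
    phones = ["SIL"]
    for w in words:
        phones.extend(DICT[w])
    phones.append("SIL")
    return phones

def phones_to_triphones(phones):
    # Each interior phone is modelled in its left and right phonetic context.
    return ["%s-%s+%s" % (phones[i - 1], phones[i], phones[i + 1])
            for i in range(1, len(phones) - 1)]

print(phones_to_triphones(words_to_phones(["HELLO", "WORLD"])))
# ['SIL-HH+AH', 'HH-AH+L', 'AH-L+OW', 'L-OW+W', 'OW-W+ER', 'W-ER+L', 'ER-L+D', 'L-D+SIL']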
How long can these utterances be? The assumption seems to be that they can be "long", but perhaps not "too long". As a practical matter, I think this depends on how storage is allocated in the Baum-Welch program; this program appears to allocate feature storage as needed, so perhaps it can accommodate an hour's speech. But I think there's a theoretical consideration as well -- what assumptions does the Baum-Welch program make about the duration of data that can be aligned at once, and what does this imply about an ideal maximum utterance duration? I raised this question a few months ago (http://sourceforge.net/forum/forum.php?thread_id=1205506&forum_id=5470), but I never got a satisfactory answer. One respondent (Ivan Uemlianin), however, told of splitting ~60-second training utterances into ~10-second ones and achieving better training and recognition performance as a result, which suggests that 60 seconds may be longer than is optimal.
So I've not been able to get any hard answers to these kinds of questions. I suspect that in your case you should develop some means of splitting your long utterance data into pieces of 10-20 seconds at most, but that's only a guess.
Please let us know what you do!
cheers,
jerry
SphinxTrain will not let you build models with utterances more than 60s long. Somewhere in the code I found a line saying something like:
MAX_FILE_LENGTH = 60;
Unfortunately, changing this line (wherever it is) will not help (I tried). To put it technically, after a certain amount of time (less than 60s), the probability differentials just run into the sand.
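One way to picture this: the forward/backward quantities are products of thousands of per-frame likelihoods, and such products vanish in double precision long before an hour of audio. A toy illustration (not SphinxTrain code; the per-frame value is made up):

# Toy numbers only: why products of many per-frame likelihoods hit zero.
import math

frame_prob = 1e-3      # a made-up per-frame likelihood
n_frames = 6000        # roughly 60 seconds at a 10 ms frame rate

product = 1.0
for _ in range(n_frames):
    product *= frame_prob
print(product)                          # 0.0 -- underflowed in double precision

print(n_frames * math.log(frame_prob))  # about -41447: still fine in log space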
You need to cut up your speech data (and associated transcriptions) into pieces as small as possible. If your alignment information allows you to segment the files by silence (alternatively, try using CMUSeg), that is the best option. Otherwise I would recommend segmenting a few words at a time -- try to get your files under 20s; under 10s is even better.
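A rough sketch of this kind of cutting, assuming the alignment can be read as (word, start_sec, end_sec) tuples; the 20s cap is just the figure above, and the file naming and transcript layout are only assumptions:

# Sketch only: cut a long recording into <20s training utterances, given an
# alignment as (word, start_sec, end_sec) tuples. File names, the alignment
# format and the transcript layout are assumptions, not SphinxTrain fixtures.
import wave

MAX_SEG_SEC = 20.0

def split_utterance(wav_path, alignment, out_prefix):
    with wave.open(wav_path, "rb") as w:
        params = w.getparams()
        audio = w.readframes(w.getnframes())
    rate = params.framerate
    bytes_per_frame = params.sampwidth * params.nchannels

    # Group consecutive aligned words until a segment reaches MAX_SEG_SEC.
    segments, words, seg_start = [], [], None
    for word, start, end in alignment:
        if seg_start is None:
            seg_start = start
        words.append(word)
        if end - seg_start >= MAX_SEG_SEC:
            segments.append((seg_start, end, words))
            words, seg_start = [], None
    if words:
        segments.append((seg_start, alignment[-1][2], words))

    # Write one wav file and one transcript line per segment.
    transcript = []
    for i, (start, end, seg_words) in enumerate(segments):
        seg_id = "%s_%03d" % (out_prefix, i)
        lo = int(start * rate) * bytes_per_frame
        hi = int(end * rate) * bytes_per_frame
        with wave.open(seg_id + ".wav", "wb") as out:
            out.setparams(params)
            out.writeframes(audio[lo:hi])
        transcript.append("<s> %s </s> (%s)" % (" ".join(seg_words), seg_id))
    return transcript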
If you can't segment by silence, pad your segments with digital silence (a rough sketch follows the list below). If you do this remember to:
(a) have your segments overlap, to counter the distortion at the boundaries;
(b) use '-dither yes' when making your feature files.
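Here is a rough sketch of the silence padding together with the overlap from (a), assuming 16-bit PCM wav input; the padding and overlap lengths are arbitrary, and apart from '-dither yes' the wave2feat line in the final comment is only indicative:

# Sketch only: pad a cut with digital silence and overlap into the neighbouring
# segments, assuming 16-bit PCM wav input. PAD_SEC and OVERLAP_SEC are
# arbitrary values, not recommendations from SphinxTrain.
import wave

PAD_SEC = 0.2        # digital silence added at each end
OVERLAP_SEC = 0.5    # extend the cut into the neighbouring segments

def write_padded_segment(src_path, dst_path, start_sec, end_sec):
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        audio = src.readframes(src.getnframes())
    rate = params.framerate
    bytes_per_frame = params.sampwidth * params.nchannels

    start = max(0.0, start_sec - OVERLAP_SEC)
    end = end_sec + OVERLAP_SEC
    chunk = audio[int(start * rate) * bytes_per_frame:
                  int(end * rate) * bytes_per_frame]
    silence = b"\x00" * (int(PAD_SEC * rate) * bytes_per_frame)

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(silence + chunk + silence)

# Feature extraction then uses dithering, e.g. something along the lines of:
#   wave2feat -dither yes -i seg_000.wav -o seg_000.mfc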
Finally, if you have any success with any of these methods, please let us know!
Good luck,
Ivan
CMUSeg - http://www.nist.gov/speech/tools/CMUseg_05targz.htm
Anonymous - 2005-04-04
Ivan, I must dispute your assertion that "SphinxTrain will not let you build models with utterances more than 60s long." I have (since 1 Dec 04) used SphinxTrain with datasets where some of the input audio files exceeded 60 sec, and I encountered no error messages as a result. (I also grepped the source files to find all examples of "MAX_" and didn't notice any that would have the effect of an absolute limit on file length. Caveat: just because I didn't find such a limit in that quick look doesn't mean there is none. Furthermore, there may once have been such a limitation that has since been removed; the wave2feat program, for example, has been updated in the last several months.)
The question of the effectiveness/usefulness of Baum-Welch alignment with "longer" utterances (or, alternatively, whether there is an upper limit on duration beyond which the training alignment becomes degraded or less effective) is a separate and important one, which I raised in the Open Discussion thread referenced above. Your reply there gave us evidence that 60 sec is "too long", but I was disappointed that there was no substantive answer from the training gurus at CMU (or in the SphinxTrain documentation, either).
cheers,
jerry
Mea culpa, the 60s limit was in an old version, and I hadn't checked it was still there (I can't find it either now).
Ivan
Hi,
Thank you very much for your replies. I plan to segment the wav files and perform training after that, since I can use the alignment information for that purpose. I will update you all if I get some results. Thanks again.
TP