Dear Sir,
I am working on speech recognition from videos. I have trained models on 75 speakers, and the training completed successfully: there were only 93 errors in the final Baum-Welch iteration, i.e. out of 1400 wav files, 93 were ignored.
I then tried to add about 1 hour of data from one more speaker for training. This speaker's voice is very feeble. There are now 1540 files, but in the Baum-Welch iterations many of them are ignored, about 1400 files, and the final iteration has 1330 errors. What could be the reason?
The transcription for this speaker was created carefully; it is 99% accurate. Could the problem be the voice characteristics, since the audio is very, very feeble? I have provided sample audio files:
http://www.4shared.com/archive/rOMzMnMj/exampletar.html
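To quantify how feeble the audio actually is, the level of each wav file can be measured and compared against the other speakers. A minimal sketch, assuming 16-bit mono PCM wav files and that numpy is available (the glob pattern is a placeholder):

    import glob
    import wave

    import numpy as np

    def level_dbfs(path):
        # Report RMS and peak level of a 16-bit mono PCM wav file in dBFS.
        with wave.open(path, "rb") as w:
            assert w.getsampwidth() == 2, "expects 16-bit PCM"
            samples = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        x = samples.astype(np.float64) / 32768.0
        rms = np.sqrt(np.mean(x ** 2) + 1e-12)
        peak = np.max(np.abs(x)) + 1e-12
        return 20 * np.log10(rms), 20 * np.log10(peak)

    for path in sorted(glob.glob("wav/*.wav")):
        rms_db, peak_db = level_dbfs(path)
        print(f"{path}: rms {rms_db:6.1f} dBFS, peak {peak_db:6.1f} dBFS")

If the new speaker's files sit 20-30 dB below the rest, that alone is worth knowing before blaming the transcripts.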
But when you use CMN, amplitude shouldn't matter much, I guess.
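That intuition can be checked numerically: a constant gain becomes a constant additive offset in log-domain features, and subtracting the per-utterance mean removes it. A minimal sketch with a crude stand-in for a real MFCC front end (assuming numpy; this is not the actual Sphinx feature extraction):

    import numpy as np

    def log_spec_features(signal, frame_len=400, hop=160, n_bins=13):
        # Very crude stand-in for an MFCC front end: framed log power spectrum.
        frames = [signal[i:i + frame_len]
                  for i in range(0, len(signal) - frame_len, hop)]
        feats = []
        for f in frames:
            spec = np.abs(np.fft.rfft(f * np.hanning(frame_len))) ** 2 + 1e-10
            feats.append(np.log(spec[:n_bins]))
        return np.array(feats)

    def cmn(feats):
        # Cepstral mean normalization: subtract the per-utterance mean
        # of each coefficient.
        return feats - feats.mean(axis=0)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(16000)   # 1 s of fake 16 kHz audio
    quiet = 0.05 * x                 # same signal, 26 dB quieter

    a = cmn(log_spec_features(x))
    b = cmn(log_spec_features(quiet))
    print(np.max(np.abs(a - b)))     # ~0: the gain difference is gone

So a uniformly feeble recording should survive CMN; what CMN cannot fix is a poor signal-to-noise ratio, since the noise floor does not scale down with the voice.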
What do you mean by ignored wav files? A file may be ignored in one iteration but still used in later iterations. You can try training with force-alignment to see if there is a problem with the audio.
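For a quick per-file check outside SphinxTrain, something like the alignment mode of pocketsphinx 5 could be used to flag files that fail to align against their transcripts. A minimal sketch, assuming pocketsphinx 5 is installed; the model, dictionary, and file paths are placeholders, and this is only a rough stand-in for the real force-aligned training:

    import wave

    from pocketsphinx import Decoder

    # Placeholder paths: point these at your own acoustic model and dictionary.
    decoder = Decoder(hmm="model_parameters/my_model", dict="etc/my.dic",
                      samprate=16000)

    def aligns_ok(path, transcript):
        # Force-align one utterance; hyp() is None when alignment fails.
        with wave.open(path, "rb") as w:
            data = w.readframes(w.getnframes())
        decoder.set_align_text(transcript)
        decoder.start_utt()
        decoder.process_raw(data, full_utt=True)
        decoder.end_utt()
        return decoder.hyp() is not None

    print(aligns_ok("wav/spk76_0001.wav", "transcript of this utterance"))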
It might be that the silences inside your files are too long for Baum-Welch to converge properly. The initial estimation goes wrong, and the whole training process is wrong after that.
You need to use shorter, single-utterance audio files for model bootstrapping, or you can simply resplit the whole of your data into utterances. No utterance should contain a significant amount of silence, and the silences at the boundaries shouldn't be longer than 0.5 seconds. You can use the long audio aligner branch for that.
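To check whether a split meets that 0.5-second guideline, the leading and trailing silence can be estimated with a simple frame-energy threshold. A minimal sketch, assuming 16-bit mono wav files and numpy; the threshold and glob pattern are placeholders to tune:

    import glob
    import wave

    import numpy as np

    def boundary_silence(path, frame_ms=10, thresh_db=-45.0):
        # Estimate leading/trailing silence (seconds) via frame energy.
        with wave.open(path, "rb") as w:
            rate = w.getframerate()
            x = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
        x = x.astype(np.float64) / 32768.0
        n = int(rate * frame_ms / 1000)
        frames = x[: len(x) // n * n].reshape(-1, n)
        db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
        speech = np.where(db > thresh_db)[0]
        if len(speech) == 0:
            return None  # no frame above the threshold: treat as all silence
        lead = speech[0] * n / rate
        trail = (len(frames) - 1 - speech[-1]) * n / rate
        return lead, trail

    for path in sorted(glob.glob("wav/*.wav")):
        res = boundary_silence(path)
        if res is None or max(res) > 0.5:
            print(path, res)   # flag files that break the 0.5 s guideline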
See a recent, very similar discussion on this subject:
https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4970030