Hello,

we used SphinxTrain to adapt a language model to a single speaker (in German) and got pretty good results (up to 75% correct sentences). But noticeably, one third of all mistakes happen in the very first or the very last word of a sentence. Is there any known reason for this effect, and is there a way to reduce these mistakes?

Correction: I mixed up the words above; we trained the acoustic model.
Hello,

Please use the help forum to ask for help.
As for the boundaries: it is mandatory to have silence around each utterance during training and during adaptation, and to use forced alignment to detect silences inside the utterances. Missing silence can cause bad effects.
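For illustration, the usual CMUSphinx transcription format marks the boundary silences with <s> and </s>; the sentences and utterance IDs below are just placeholders:

    <s> guten morgen liebe sorgen </s> (utt_0001)
    <s> heute scheint die sonne </s> (utt_0002)

The recordings themselves should also contain a short stretch of actual silence before and after the speech, not start right at the first word.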
The nature of speech recognition errors can be very different, and the debugging process is complex and largely undocumented. First you need to separate the various aspects to support your hypothesis that the utterance boundary is the problem. Start by decoding with very wide beams in order to understand whether pruning is the reason for the failures, as sketched below.
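A minimal sketch with the pocketsphinx Python bindings (5prealpha-style API); the model paths and the raw audio file are placeholders for your adapted German setup:

    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'model/de-adapted')  # adapted acoustic model (placeholder)
    config.set_string('-lm', 'model/de.lm')        # language model (placeholder)
    config.set_string('-dict', 'model/de.dict')    # pronunciation dictionary (placeholder)
    # Much smaller thresholds than the defaults, i.e. much wider beams,
    # so that pruning is effectively ruled out (decoding gets slow).
    config.set_float('-beam', 1e-80)
    config.set_float('-wbeam', 1e-60)
    config.set_float('-pbeam', 1e-80)

    decoder = Decoder(config)
    decoder.start_utt()
    with open('test_utt.raw', 'rb') as f:  # 16 kHz, 16-bit mono PCM
        decoder.process_raw(f.read(), False, True)
    decoder.end_utt()
    print(decoder.hyp().hypstr)

If the first and last words come out correctly with wide beams but not with the defaults, pruning is the culprit.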
Then try a unigram language model in order to find out whether the language model has any effect on the speech recognition errors.
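A unigram model is easy to produce: either build one from your training transcripts with the cmuclmtk tools (text2idngram/idngram2lm), or write a small ARPA file by hand. A minimal sketch with placeholder words and log10 probabilities; a real model would list your entire vocabulary:

    \data\
    ngram 1=4

    \1-grams:
    -99.0000 <s>
    -0.4771 </s>
    -0.4771 GUTEN
    -0.4771 MORGEN

    \end\

With no word context left, errors that persist under this model point away from the language model.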
If the issue is in the acoustic model, it might be worth checking phonetic recognizer accuracy to find out which senones were not trained correctly.
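To look at the phonetic level, pocketsphinx has an allphone mode that decodes a phone sequence against a phonetic language model. A sketch, again with placeholder paths (you would need a phone LM matching your German model, which can be trained from phonetic transcripts):

    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'model/de-adapted')        # placeholder
    config.set_string('-allphone', 'model/de-phone.lm')  # phonetic LM (placeholder)

    decoder = Decoder(config)
    decoder.start_utt()
    with open('test_utt.raw', 'rb') as f:
        decoder.process_raw(f.read(), False, True)
    decoder.end_utt()

    # Phone sequence with frame boundaries; compare it against a forced
    # alignment of the reference transcript to spot badly trained senones.
    for seg in decoder.seg():
        print(seg.word, seg.start_frame, seg.end_frame)

Phones that are consistently misrecognized, especially at the utterance edges, indicate which senones need more adaptation data.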