Hello everyone
I am new to Sphinx and to speech recognition in general, but I need to create a small model for digit recognition (in English) that is robust to background noise. I currently have thousands of audio files (from different speakers, with different levels of noise) along with their text transcriptions, but I don't really know how I should format them for model training.
Each audio file contains between 5 and 12 spoken digits. Do I need to split them into individual digit audio files (one digit per file), or is that unnecessary?
Also, do you think it is better to adapt the default Sphinx English model to improve its accuracy at recognising my digits in noise (I already tested the default English model without any adaptation, but the performance was poor due to the background noise), or should I train a completely new model?
And if I train a new model, are there different methods or types of models that can be used? If so, which one is preferable for my case (a small vocabulary of only 10 digits, but with noise)?
Finally, could someone explain in detail the steps I should follow to train my model? I have looked at the tutorial, but I am unsure what my phonetic dictionary, phoneset file, language model and filler list files should contain...
Thanks
Data preparation is covered in our tutorial: http://cmusphinx.sourceforge.net/wiki/tutorialam
You do not need to split the files.
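As a rough illustration of that data preparation, here is a minimal sketch that builds the lines of the control (.fileids) and .transcription files directly from whole multi-digit recordings, with no splitting. The helper name `build_training_lists` and the example paths are hypothetical, and the line layouts are my reading of the tutorial, so check them against the tutorial before relying on this.

```python
from pathlib import Path


def build_training_lists(utterances):
    """Return (fileids, transcripts) for a list of (audio_path, text) pairs.

    Assumed layout (verify against the tutorial):
      fileids line:        path relative to the wav directory, no extension
      transcription line:  <s> words </s> (utterance_id)
    """
    fileids, transcripts = [], []
    for audio_path, text in utterances:
        rel = Path(audio_path).with_suffix("")   # drop ".wav"
        utt_id = rel.name                         # base name identifies the utterance
        words = " ".join(text.lower().split())    # normalise case and spacing
        fileids.append(rel.as_posix())
        transcripts.append(f"<s> {words} </s> ({utt_id})")
    return fileids, transcripts


# Hypothetical recordings: each one holds several digits and stays unsplit.
example = [
    ("speaker_1/file_1.wav", "Five one  nine two zero"),
    ("speaker_2/file_2.wav", "three three seven oh eight four"),
]
fileids, transcripts = build_training_lists(example)
print("\n".join(fileids))
print("\n".join(transcripts))
```

Writing each list to a text file, one entry per line, gives the two training files; the words used in the transcripts need to match the dictionary entries exactly, including case.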
If you have thousands of files it is better to train a new model.
For examples of what those files should contain, you can check the tidigits template here:
https://github.com/cmusphinx/sphinxtrain/tree/master/templates/tidigits/etc
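To give an idea of what the dictionary, phoneset and filler files might look like for a digits task, here is a minimal sketch. It is not copied from the tidigits template: the pronunciations are CMUdict-style phones I am assuming for illustration, the "oh" entry is an assumption about how speakers say zero, and the filler entries follow the layout shown in the tutorial. The point it illustrates is that the phone list should contain exactly the phones used in the dictionary and filler files.

```python
# Assumed CMUdict-style pronunciations; the template may use a different phone set.
DIGIT_PRONUNCIATIONS = {
    "zero": "Z IH R OW",
    "oh": "OW",
    "one": "W AH N",
    "two": "T UW",
    "three": "TH R IY",
    "four": "F AO R",
    "five": "F AY V",
    "six": "S IH K S",
    "seven": "S EH V AH N",
    "eight": "EY T",
    "nine": "N AY N",
}

# Filler entries map silence markers to the SIL phone, as in the tutorial.
FILLERS = {"<s>": "SIL", "</s>": "SIL", "<sil>": "SIL"}


def dictionary_lines(entries):
    """One 'word PHONE PHONE ...' line per entry."""
    return [f"{word} {phones}" for word, phones in sorted(entries.items())]


def phone_list(*entry_tables):
    """Collect every phone used in the given tables, one per line in the file."""
    phones = set()
    for table in entry_tables:
        for pronunciation in table.values():
            phones.update(pronunciation.split())
    return sorted(phones)


dict_lines = dictionary_lines(DIGIT_PRONUNCIATIONS)
filler_lines = dictionary_lines(FILLERS)
phones = phone_list(DIGIT_PRONUNCIATIONS, FILLERS)
print("\n".join(dict_lines))
print("\n".join(filler_lines))
print("\n".join(phones))
```

Deriving the phone list from the dictionary, rather than typing it by hand, avoids the mismatch errors that training reports when a phone appears in one file but not the other.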
OK, thanks for the reply.
I have a question regarding the phonetic dictionary. Is it possible to specify different pronunciations for the same word? For example, the digit "two" can have this phonetic transcription:
T_two OO_two
but it can also be:
T_two UH_two
So is it possible to specify that both pronunciations are acceptable and, if so, how should I format my .dict file to include both? (In the tutorial there is always only one phonetic transcription per word.)
It is possible to specify alternative pronunciations, but for digits it does not make sense to do that.
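For reference, alternative pronunciations in a CMUSphinx dictionary are written as extra entries with a numbered suffix, such as `two(2)`. The sketch below only shows that formatting, using the two transcriptions from the question; the helper `format_dictionary` is hypothetical, and as noted above the alternatives are probably not worth adding for a digits model.

```python
def format_dictionary(pronunciations):
    """Turn {word: [pron1, pron2, ...]} into dictionary lines.

    The first pronunciation uses the bare word; later ones get a
    numbered suffix: word(2), word(3), ...
    """
    lines = []
    for word in sorted(pronunciations):
        for index, phones in enumerate(pronunciations[word], start=1):
            entry = word if index == 1 else f"{word}({index})"
            lines.append(f"{entry} {phones}")
    return lines


lines = format_dictionary({"two": ["T_two OO_two", "T_two UH_two"]})
print("\n".join(lines))
# two T_two OO_two
# two(2) T_two UH_two
```

Every phone used in an alternative (here `UH_two`) would also have to be listed in the phoneset file.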