I have a large collection of audio files with their transcripts in a foreign language.
I want to be able to recognize whether the user recites the right words from the text.
How do I start approaching this using CMU Sphinx? Do I need a language model, acoustic model?
I would like some guidance, please, on where to start.
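(In case it helps frame the problem: whatever recognizer you end up using, verification typically reduces to comparing the recognized word sequence against the reference transcript. Below is a minimal, self-contained sketch of that comparison step as a word error rate, computed with Levenshtein edit distance over words. It assumes you already have a hypothesis string from the recognizer; the function name and threshold are illustrative, not part of any Sphinx API.)

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance (substitutions + insertions + deletions),
    normalized by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Dynamic-programming table: d[i][j] = edit distance between
    # the first i reference words and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: accept the recitation if the error rate is below some threshold.
wer = word_error_rate("the quick brown fox", "the quick brown box")
recited_correctly = wer < 0.2  # 1 substitution out of 4 words -> 0.25, rejected
```

You could then feed the recognizer's output (e.g., from pocketsphinx) into this check, tuning the threshold on your own recordings.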
You already asked this question at http://stackoverflow.com/questions/43967550/detecting-speech-based-on-a-collection-of-audio#43967550