Stijn Frishert - 2019-01-22

Hi all,

I just started looking into speech recognition software for a small side project. Sphinx seems to be a really good solution, but I'd like to know whether the following is feasible before I set false expectations (I realize speech recognition is not an easy task):

Recognizing the 7 basic musical syllables (do, re, mi, fa, so, la, ti) in real time, with the occasional "uhm" or silence in between. Users would utter sentences that consist of only those 7 syllables. Is that something (pocket)sphinx is capable of doing with a high level of accuracy?

I've been playing around with some example projects by adjusting the factory dictionary and such, but without much success so far. If this is at all feasible (others seem to reach similarly high accuracy with digit recognition), what would the steps be to get this working with Sphinx (which language model to use, whether I need to train an acoustic model, etc.)? Or are there other solutions that would most likely work better for this task (neural nets come to mind)?

(I hope this does not come across as looking for someone to finish this task for me. I'm just a bit overwhelmed by the options in Sphinx and need some pointers on whether this is doable.)

Cheers,
Stijn

 

Last edit: Stijn Frishert 2019-01-22