Kristoffer - 2018-04-12

I have read your tutorial.

Consider a mobile game where you can move in different directions. Let's say that you want to recognize different types of moves spoken as short sentences (not single-word commands), built from around 50 words in different combinations. E.g. "Go left, then right, followed by right."

Thus, a very limited vocabulary but with many different combinations. I want to train an acoustic model (for PocketSphinx obviously) to get the best recognition and CPU/RAM performance possible. Now, let's say I have around 100 speakers. What should I aim to have in the sentences to be spoken?

What sentence structure should I choose?
1a) Thousands of arbitrary sentences where the words appear here and there, together with thousands of arbitrary words?
1b) Fewer sentences where all word pairs are guaranteed to appear at least once? No need to have multiple occurrences of "left, right"?
1c) Fewer sentences where all word triples appear at least once? No need to have multiple occurrences of "left, right, up"?

2a) Short spoken sentences (much less than the mentioned "5 seconds"). E.g. "Go left, right, up."
2b) Longer spoken sentences (5-30 seconds). E.g. "Go left, then northwest, then north. Go up, right, up, right, up, left. Continue to the left, ... blah blah blah..."
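
To make option 1b concrete, here is a rough Python sketch of what I mean by "all word pairs appear at least once". It is only my own illustration, not anything taken from PocketSphinx or its training tools: the function name `pair_covering_sentences`, the `max_len` limit and the greedy strategy are all assumptions. It treats "left, right" and "right, left" as different pairs and also counts repeats such as "right, right".

```python
import itertools
import random


def pair_covering_sentences(words, max_len=6, seed=0):
    """Greedily build word sequences so that every ordered pair of
    adjacent words (including a word followed by itself) occurs at
    least once. Hypothetical helper, for illustration only."""
    rng = random.Random(seed)
    uncovered = set(itertools.product(words, repeat=2))
    sentences = []
    while uncovered:
        # Start each sentence with some pair that is still missing.
        first, second = min(uncovered)
        sentence = [first, second]
        uncovered.discard((first, second))
        # Extend the sentence while the last word still has a missing pair.
        while len(sentence) < max_len:
            last = sentence[-1]
            candidates = sorted(w for w in words if (last, w) in uncovered)
            if not candidates:
                break
            nxt = rng.choice(candidates)
            sentence.append(nxt)
            uncovered.discard((last, nxt))
        sentences.append(sentence)
    return sentences


if __name__ == "__main__":
    words = ["left", "right", "up", "down"]
    for sentence in pair_covering_sentences(words):
        print("go " + " ".join(sentence))
```

With 50 words there are 50 * 50 = 2500 ordered pairs, and a 6-word sentence contains only 5 adjacent pairs, so even this "fewer sentences" option would need at least 500 prompts. Whether that kind of coverage is what actually matters for acoustic training is exactly what I am asking.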

All in all, I think the documentation is lacking in this matter. Certainly, there ought to be some guidelines for the text to be spoken?

PS. Special thanks to you Nickolay, for all your help!

 

Last edit: Kristoffer 2018-04-12