CMU Sphinx / Forums / Help: Corpus size for acoustic model training

floboc - 2012-09-26

Hello,
I have to train a speaker-independent acoustic model for children aged 7 to 13.
The number of words to recognize is very small (around 10), but the results with common (adult) acoustic models are really bad with children.
I wanted to know how to estimate the amount of data needed for training (number of speakers, total number of samples per phone/senone) for such a task, according to the size of the vocabulary.

We only want to recognize the words in the small dictionary, nothing more.

I would also like to know how much data is needed for digit recognition.

We also thought of training one acoustic model per task (a task being to recognize one word among 10) instead of using a single acoustic model for all tasks, so that each model is specific to its task. What do you think about that?

EDIT:
example of task: to recognize one of the first 10 numbers: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
In my language this vocabulary is made of 10 words, 17 phones and around 50 senones (I am not really sure how to count them, but for each word I grouped the phones in threes, then added the first two and the last two phones as groups, plus the first one and the last one, so that the final number of senones is the number of triphones + 4).
So each phone appears once or twice in the vocabulary, and each senone as well.
How should I train my acoustic model? Should I record only words in my vocabulary? Should each speaker say each word once or twice? Is it better to have separate words or several words in the same recording file (even if the grammar will only recognize isolated words, no sentences)? How many speakers are needed for speaker independence?
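
The kind of count described in the EDIT can be sketched in Python. This is only an illustration under stated assumptions: the dictionary below uses hypothetical English digit pronunciations with ARPAbet-style phone names (not the poster's language, so the totals will not match 17 phones and 50 senones), each word is padded with a `SIL` boundary symbol instead of the poster's "+4" rule, and `count_units` is a made-up helper name. Senones in Sphinx are tied states chosen during training, so a triphone count is at best a loose upper-level proxy for them.

```python
# Hypothetical pronunciation dictionary (English digits, ARPAbet-style phones).
DICTIONARY = {
    "zero": ["Z", "IH", "R", "OW"],
    "one": ["W", "AH", "N"],
    "two": ["T", "UW"],
    "three": ["TH", "R", "IY"],
    "four": ["F", "AO", "R"],
    "five": ["F", "AY", "V"],
    "six": ["S", "IH", "K", "S"],
    "seven": ["S", "EH", "V", "AH", "N"],
    "eight": ["EY", "T"],
    "nine": ["N", "AY", "N"],
}


def count_units(dictionary, boundary="SIL"):
    """Collect the distinct phones and distinct triphones in a dictionary.

    Each word is treated as isolated: it is padded with `boundary` on both
    sides, and every phone is recorded with its left and right neighbour.
    """
    phones = set()
    triphones = set()
    for pronunciation in dictionary.values():
        padded = [boundary] + pronunciation + [boundary]
        phones.update(pronunciation)
        for i in range(1, len(padded) - 1):
            triphones.add((padded[i - 1], padded[i], padded[i + 1]))
    return phones, triphones


phones, triphones = count_units(DICTIONARY)
print(len(phones), "phones,", len(triphones), "distinct triphones")
```

With a vocabulary this small, most triphones occur in exactly one word, which is why each unit is seen only once or twice per pass through the word list.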

Thank you,

Florent

Last edit: floboc 2012-09-26


Nickolay V. Shmyrev - 2012-09-26

> I wanted to know how to estimate the amount of data needed for training (number of speakers, total number of samples per phone/senone) for such a task, according to the size of the vocabulary.

Please read the tutorial first

http://cmusphinx.sourceforge.net/wiki/tutorialam


floboc - 2012-09-27

I had already read it in full several times. It says:

The approximate number of senones and number of densities is provided in the table below:

Vocabulary  Hours in db  Senones  Densities  Example
        20            5      200          8  Tidigits Digits Recognition
       100           20     2000          8  RM1 Command and Control
      5000           30     4000         16  WSJ1 5k Small Dictation
     20000           80     4000         32  WSJ1 20k Big Dictation
     60000          200     6000         16  HUB4 Broadcast News
     60000         2000    12000         64  Fisher Rich Telephone Transcription

Of course you also need to understand that only senones present in transcription could be trained. It means that if your transcription isn't generic enough, for example it's the same single word spoken by 10000 speakers 10000 times you still have just a few senones no matter how many hours of speech did you record. In that case you just need a few senones in the model, not few thousands of them.
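
As a rough illustration of how the quoted table might be used, the sketch below stores its rows and returns the first row whose vocabulary is at least as large as the task's. The lookup rule, the `Row` type and the `closest_row` name are assumptions made for this example; the tutorial only lists the values and does not prescribe any selection or interpolation rule.

```python
from typing import NamedTuple


class Row(NamedTuple):
    vocabulary: int
    hours: int
    senones: int
    densities: int
    example: str


# Rows copied from the tutorial table quoted above.
TUTORIAL_TABLE = [
    Row(20, 5, 200, 8, "Tidigits Digits Recognition"),
    Row(100, 20, 2000, 8, "RM1 Command and Control"),
    Row(5000, 30, 4000, 16, "WSJ1 5k Small Dictation"),
    Row(20000, 80, 4000, 32, "WSJ1 20k Big Dictation"),
    Row(60000, 200, 6000, 16, "HUB4 Broadcast News"),
    Row(60000, 2000, 12000, 64, "Fisher Rich Telephone Transcription"),
]


def closest_row(vocabulary_size):
    """Return the first row whose vocabulary covers `vocabulary_size`.

    Assumed rule for illustration only; falls back to the last row when
    the vocabulary is larger than anything in the table.
    """
    for row in TUTORIAL_TABLE:
        if row.vocabulary >= vocabulary_size:
            return row
    return TUTORIAL_TABLE[-1]


row = closest_row(10)
print(row.hours, row.senones, row.densities)  # 5 200 8
```

For a 10-word task this simply lands on the smallest row (the Tidigits one), which is the row the rest of the thread reasons from.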

My question is: how do you derive those numbers from the complexity of the task?

Thank you


Nickolay V. Shmyrev - 2012-09-27

> My question is: how do you derive those numbers from the complexity of the task?

The values were selected through experiments that people have been running for decades.


floboc - 2012-09-27

ok.
What about the number of speakers needed for speaker independence?

The tutorial says:

- 1 hour of recordings for command and control for a single speaker
- 5 hours of recordings of 200 speakers for command and control for many speakers
- 10 hours of recordings for single-speaker dictation
- 50 hours of recordings of 200 speakers for many-speaker dictation

Obviously, what I want to do is command and control over a small vocabulary (about 10 words). Do I still need 200 speakers? Or could 50 be enough?
The thing is that my company wants to run a small test on children to see whether we can obtain results as good as with adults and, if so, then record many children saying many words.
In that case, would 50 kids be enough?

And what about my other questions:
- Is it better to have separate words or several words in the same recording file (even if the grammar will only recognize isolated words, no sentences)?
- Should each speaker say the word once or more?

Thank you for your help


floboc - 2012-09-27

Going by the 5 hours in the database: if each word takes about 2 s (I think that is a good approximation) and I have 10 words in my vocabulary, then 5 hours is 18,000 s, or 9,000 utterances, which means I need 900 samples per word. So if I can record 100 kids, each one will have to say the same word 9 times. Am I right?

If the vocabulary has 10 words and not 20, the number of senones will be smaller too, so maybe 2 hours of data will be enough?
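
This kind of estimate is easy to check with a few lines of Python. The sketch below uses the figures from the post (5 hours of audio, about 2 s per word, 10 words, 100 children); the function name is made up for the example, and it assumes the database duration is pure speech with no silence between utterances.

```python
def repetitions_needed(total_hours, seconds_per_utterance, n_words, n_speakers):
    """Return (utterances per word, repetitions per speaker per word).

    Assumes every second of the database is speech and that utterances
    are spread evenly over words and speakers.
    """
    total_utterances = total_hours * 3600 / seconds_per_utterance
    per_word = total_utterances / n_words
    per_speaker = per_word / n_speakers
    return per_word, per_speaker


per_word, per_speaker = repetitions_needed(5, 2.0, 10, 100)
print(per_word, per_speaker)  # 900.0 9.0

# With 1 s per utterance the same formula gives 1800 per word and 18 per child.
print(repetitions_needed(5, 1.0, 10, 100))  # (1800.0, 18.0)
```

The result is quite sensitive to the assumed utterance length, so it is worth measuring a few real recordings before planning the sessions.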


Nickolay V. Shmyrev - 2012-09-27

> Obviously, what I want to do is command and control over a small vocabulary (about 10 words). Do I still need 200 speakers? Or could 50 be enough?
> The thing is that my company wants to run a small test on children to see whether we can obtain results as good as with adults and, if so, then record many children saying many words.
> In that case, would 50 kids be enough?

Yes, you still need 200 speakers or even more. If 50 were enough, the tutorial would say 50, but it says 200. 200 is required. If you run tests with 50 speakers, the results are not guaranteed.

> Is it better to have separate words or several words in the same recording file (even if the grammar will only recognize isolated words, no sentences)?

If the grammar will recognize only isolated words, the recordings should contain isolated words. The tutorial says on this subject:

A database should be a good representation of the speech you are going to recognize. For example, if you are going to recognize telephone speech, it's preferred to use telephone recordings. If you want mobile speech, you should find mobile recordings. Speech is significantly different across various recording channels. Broadcast news is different from telephone. Speech decoded from mp3 is significantly different from a microphone recording. However, if you do not have enough speech recorded in the required condition, you should definitely use other speech you have. For example, you can use broadcast recordings. Sometimes it makes sense to stream the audio through the telephone codec to make it similar. It's often possible to add noise to training data too.

So you should just follow it.
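
To make the last point of the quoted passage concrete, here is a minimal sketch of adding noise to a training signal at a chosen signal-to-noise ratio. It is an assumption-laden illustration, not a Sphinx tool: it uses synthetic white Gaussian noise (real augmentation would more likely mix in recorded background noise), plain Python lists of float samples, and a hypothetical function name `add_white_noise`.

```python
import math
import random


def add_white_noise(samples, snr_db, seed=0):
    """Return a copy of `samples` with white Gaussian noise mixed in.

    The noise variance is chosen so that the ratio of signal power to
    noise power is approximately `snr_db` decibels.
    """
    rng = random.Random(seed)
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]


# One second of a 440 Hz tone at 16 kHz stands in for a recorded utterance.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
noisy = add_white_noise(clean, snr_db=20)
```

Lower `snr_db` values give noisier copies; the right range depends on the conditions the recognizer will actually face.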

> Should each speaker say the word once or more?

Yes

> If the vocabulary has 10 words and not 20, the number of senones will be smaller too, so maybe 2 hours of data will be enough?

No. If you do not have enough data, just don't do the training. Perform adaptation instead.

It's nice that you question every item in the tutorial, but believe it or not, it was written for a reason, not just because we wanted to startle you.

Last edit: Nickolay V. Shmyrev 2012-09-27


floboc - 2012-09-28

Thank you for your answers.
If I am asking so many questions, it's because the tutorial does not explain how you arrived at these numbers. I just wanted to know where they come from rather than follow one more tutorial on the internet without knowing how it was put together.

Moreover, the training data you give covers only 5 different cases. I thought there could be a big difference in corpus size between a vocabulary of 10 words and one of 20 words, which would save us a lot of time.

Thank you again for your help

Last edit: floboc 2012-09-28


ayda - 2018-11-16

I read the discussion above and my case is much the same. What is different is that I don't have an acoustic model to adapt, and this is going to be an experiment in recognizing digits. There is not much data to use either. I am doing this on my own, as a project conducted away from where the language is spoken, so I couldn't find any native speakers to collect data from. What would be the solution?
Second, if I record digits as samples for an hour, repeating the same digit 10 or more times, can I still use Sphinx?

Nickolay V. Shmyrev - 2019-02-12

You can download samples of the TIDIGITS database to check how a digits database is designed; the speakers repeat the same digits again and again.