Hello,
The tutorial regarding training a new acoustic model for dictation does not say how many different words or tri-phones the audio files should contain. For example, two databases may have the same number of hours and speakers, but the first one has 1000 different words, and the second has 10,000 different words.
I have a database of 130 hours, 1100 speakers. How many different words/tri-phones do I need?
Is there a formula which I can use in order to calculate it for other databases?
Will more different words give better accuracy?
How do tri-phones compare to words in this respect?
The same question applies to adapting the default English model with 5 minutes of audio.
Thanks
In case I'm not clear about "different words":
A database may contain 5 wav files. In each of them you hear the sentence "it is here". This database has 3 different words.
Another database contains 2 wav files. In the first one you hear "it is here". In the second one you hear "How are you". This database has 6 different words.
Last edit: Oren G. 2016-11-13
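To make the "different words" count above concrete, here is a minimal Python sketch. It assumes each wav file has a plain-text transcript, that words are separated by whitespace, and that case is ignored; the function name count_distinct_words is only illustrative and not part of any CMUSphinx tool.

```python
from typing import Iterable


def count_distinct_words(transcripts: Iterable[str]) -> int:
    """Count distinct words across all transcripts, ignoring case."""
    vocabulary = set()
    for line in transcripts:
        # Assumption: whitespace tokenization is good enough for this count.
        vocabulary.update(word.lower() for word in line.split())
    return len(vocabulary)


# The two example databases described above.
db_a = ["it is here"] * 5              # 5 wav files, same sentence
db_b = ["it is here", "How are you"]   # 2 wav files, different sentences

print(count_distinct_words(db_a))  # 3
print(count_distinct_words(db_b))  # 6
```

Running the same function over a real transcription file gives the vocabulary size of the corpus.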
Didn't you see the table in the tutorial?
http://cmusphinx.sourceforge.net/wiki/tutorialam#configure_model_type_and_model_parameters
Vocabulary  Hours in db  Senones  Densities  Example
20          5            200      8          Tidigits Digits Recognition
100         20           2000     8          RM1 Command and Control
5000        30           4000     16         WSJ1 5k Small Dictation
20000       80           4000     32         WSJ1 20k Big Dictation
60000       200          6000     16         HUB4 Broadcast News
60000       2000         12000    64         Fisher Rich Telephone Transcription
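One way to read the table is as a lookup from vocabulary size to suggested settings. The Python sketch below copies the rows above and returns the row(s) with the smallest vocabulary that still covers a given size. The selection rule and the names Row, TABLE and suggest_rows are assumptions made for illustration; the tutorial does not define a formula.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    vocabulary: int
    hours: int
    senones: int
    densities: int
    example: str


# Rows copied from the tutorial table quoted above.
TABLE = [
    Row(20, 5, 200, 8, "Tidigits Digits Recognition"),
    Row(100, 20, 2000, 8, "RM1 Command and Control"),
    Row(5000, 30, 4000, 16, "WSJ1 5k Small Dictation"),
    Row(20000, 80, 4000, 32, "WSJ1 20k Big Dictation"),
    Row(60000, 200, 6000, 16, "HUB4 Broadcast News"),
    Row(60000, 2000, 12000, 64, "Fisher Rich Telephone Transcription"),
]


def suggest_rows(vocabulary_size: int) -> List[Row]:
    """Return the table rows with the smallest vocabulary >= vocabulary_size.

    Assumption: if the size exceeds every row, fall back to the largest rows.
    """
    covering = [r.vocabulary for r in TABLE if r.vocabulary >= vocabulary_size]
    target = min(covering) if covering else max(r.vocabulary for r in TABLE)
    return [r for r in TABLE if r.vocabulary == target]


for row in suggest_rows(5000):
    print(row.hours, row.senones, row.densities, row.example)
# 30 4000 16 WSJ1 5k Small Dictation
```

For a 60000-word vocabulary the function returns both the HUB4 and the Fisher rows, since the table lists two database sizes for that vocabulary.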
OK. Since my corpus has 130 hours and only about 5000 words, should I use only part of the corpus (i.e., the table says 5000 words = 30 hours)? Also, I want to use PTM, not continuous.
Yes