After compiling a corpus of 10 minutes' duration from English literature (novels), the Baum-Welch script at the second step of acoustic model creation gives the following dump:
-------------------------------------------------
INFO: main.c(188): Reading /usr/local/share/time/SphinxTrain/SCRIPTURES/model_architecture/SCRIPTURES.ci.mdef
INFO: model_def_io.c(593): Model definition info:
INFO: model_def_io.c(594): 41 total models defined (41 base, 0 tri)
INFO: model_def_io.c(595): 246 total states
INFO: model_def_io.c(596): 205 total tied states
INFO: model_def_io.c(597): 205 total tied CI states
INFO: model_def_io.c(598): 41 total tied transition matrices
INFO: model_def_io.c(599): 6 max state/model
INFO: model_def_io.c(600): 20 min state/model
INFO: s3mixw_io.c(122): Read /usr/local/share/time/SphinxTrain/SCRIPTURES/model_parameters/SCRIPTURES.ci_semi_flatinitial/mixture_weights [205x4x256 array]
WARNING: "mod_inv.c", line 358: Model inventory n_density not set; setting to value in mixw file, 256.
INFO: s3tmat_io.c(121): Read /usr/local/share/time/SphinxTrain/SCRIPTURES/model_parameters/SCRIPTURES.ci_semi_flatinitial/transition_matrices [41x5x6 array]
INFO: mod_inv.c(286): inserting tprob floor 1.000000e-04 and renormalizing
INFO: s3gau_io.c(158): Read /usr/local/share/time/SphinxTrain/SCRIPTURES/model_parameters/SCRIPTURES.ci_semi_flatinitial/means [1x4x256 array]
INFO: s3gau_io.c(158): Read /usr/local/share/time/SphinxTrain/SCRIPTURES/model_parameters/SCRIPTURES.ci_semi_flatinitial/variances [1x4x256 array]
INFO: gauden.c(173): 1 total mgau
INFO: gauden.c(145): 4 feature streams (|0|=12 |1|=24 |2|=3 |3|=12 )
INFO: gauden.c(184): 256 total densities
INFO: gauden.c(92): min_var=1.000000e-04
INFO: gauden.c(162): compute 4 densities/frame
INFO: main.c(284): Will reestimate mixing weights
INFO: main.c(286): Will reestimate means
INFO: main.c(288): Will reestimate variances
INFO: main.c(290): Will NOT reestimate MLLR multiplicative term
INFO: main.c(292): Will NOT reestimate MLLR additive term
INFO: main.c(300): Will reestimate transition matrices
INFO: main.c(315): Reading main lexicon: /usr/local/share/time/SphinxTrain/SCRIPTURES/etc/SCRIPTURES.dic
INFO: lexicon.c(237): 913 entries added from /usr/local/share/time/SphinxTrain/SCRIPTURES/etc/SCRIPTURES.dic
INFO: main.c(326): Reading filler lexicon: /usr/local/share/time/SphinxTrain/SCRIPTURES/etc/SCRIPTURES.filler
INFO: lexicon.c(237): 3 entries added from /usr/local/share/time/SphinxTrain/SCRIPTURES/etc/SCRIPTURES.filler
INFO: corpus.c(1236): Will process all remaining utts starting at 0
INFO: main.c(529): Reestimation: Baum-Welch
column defns
    <seq>
    <id>
    <n_frame_in>
    <n_frame_del>
    <n_state_shmm>
    <avg_states_alpha>
    <avg_states_beta>
    <avg_states_reest>
    <avg_posterior_prune>
    <frame_log_lik>
    <utt_log_lik>
    ... timing info ...
utt> 0 BT1 874 0
WARNING: "mk_phone_list.c", line 179: Unable to lookup (BT1) in the lexicon
WARNING: "next_utt_states.c", line 83: Unable to produce CI phones for utt
bw: baum_welch.c:172: baum_welch_update: Assertion `n_state > 0' failed.
Any remedies, ideas, or opinions would be welcome.
Regards,
BILAL AHMED
Bilal,
First of all, you have already told us that you want to train a set of digit models. We also told you that it is better to carry this out by first collecting data or buying a data set.
AGAIN, the script included in SphinxTrain has a specific purpose. It is most suitable for acoustic model training with a large vocabulary and needs tweaking to work with a small vocabulary.
There is no quick remedy for what you did, because it started off in the wrong direction. I do have a few opinions, though.
1) Training a speech recognizer with a large vocabulary can be difficult. Think about whether you really want to do it. Keep it simple and stupid.
2) I also don't know how you transcribed your audio samples (translated them to written text) for your first training run. If you do something that is difficult to understand, it will be difficult for us to support you.
3) Please learn what you are doing, rather than doing something without knowing what is going on.
Regards,
Arthur
Well, thank you for the directions, but what I cannot perceive is what is meant by a large vocabulary.
Here is how I proceeded:
1) I collected text pieces from many different and diverse locations.
2) I had these scripts recorded in raw format.
3) Since I had the scripts, I simply made a program that inserted the <s>, </s>, and file names (see the sketch after this list).
4) Then I made the dictionary using Festival.
And finally I proceeded with the scripts as directed by the SphinxTrain software.
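For concreteness, here is a minimal sketch of the kind of program I mean for step 3, assuming one plain-text prompt file per recording (the prompts/ directory, the file naming, and the output filename are all hypothetical placeholders for my actual setup):

# Build a SphinxTrain-style transcription file from per-utterance prompt
# files, wrapping each prompt in <s> ... </s> and appending the file id.
for f in prompts/*.txt; do
    id=$(basename "$f" .txt)
    # Uppercase the prompt so it matches the dictionary entries; the
    # unquoted expansion below collapses the text onto one line.
    text=$(tr 'a-z' 'A-Z' < "$f")
    printf '<s> %s </s> (%s)\n' "$(echo $text)" "$id"
done > etc/SCRIPTURES.transcription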
Where have I gone wrong? What is the basic mistake, and what are the remedies? If you could elaborate, please do so. As far as buying TIDIGITS is concerned, it costs $1900, which is about 58*1900 in my native currency, and that is too much; so if you can, please tell me the mistake I have made.
With regards,
BILAL AHMED
I see. Bilal, I am sorry. Alright, we will try, but I am not sure what will happen if you do training this way. First, it seems to me that bw is looking for something called BT1 and tries to look it up in the dictionary. BT1 doesn't sound like a word to me. What is BT1, and why is it there in the first place?
If I were you, I would try
grep -r BT1 .
from your current directory. Chances are you specified something incorrectly on your command line. Could you look at that part first?
You also mentioned that you used Festival to generate the dictionary. Could Festival have produced some artifacts in your dictionary that keep bw from working? Festival and Sphinx are not necessarily compatible.
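Beyond grepping for BT1, a rough sanity check is to list every transcript token that has no entry in the dictionary. A sketch, assuming your transcription file sits next to the .dic under etc/ (the transcription filename here is an assumption; substitute your own):

DIC=etc/SCRIPTURES.dic
TRN=etc/SCRIPTURES.transcription

# Transcript tokens, minus the <s>/</s> markers and the (fileid) field.
tr ' ' '\n' < "$TRN" | grep -v -e '^<s>$' -e '^</s>$' -e '^(' -e '^$' \
    | sort -u > /tmp/trn_words

# Dictionary headwords (first field of each entry).
awk '{print $1}' "$DIC" | sort -u > /tmp/dic_words

# Anything printed here will make mk_phone_list fail, just as BT1 did.
comm -23 /tmp/trn_words /tmp/dic_words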
Now let us turn back to what a large vocabulary is. Well, if you train a recognizer on an audio novel, your dictionary will have at least 20000 words; in that case we can already call it large vocabulary. The term "large vocabulary" describes a speech recognizer that assumes the number of words it can recognize is larger than some N. This N changes as technology advances. Around 15 years ago this number was 1000; nowadays 1000 words is usually called medium vocabulary, and something like 10000 words counts as large vocabulary.
Now, when you try to build a system like that, it actually takes a lot of time just to tune the system and make it kind of work. Even if you succeed in passing bw, you will find that training the language model is a big headache. Internally at CMU, we need to dedicate a staff member or a student to each of the AM and LM training tasks. You will also find that if you use one audio novel, the performance can be very poor, because there is only one speaker in the database. Another difficulty is that it may be very hard to align the text material with the audio data.
My point is that there are real difficulties in processing data like that. That's why I strongly recommended that you not start your first project this way.
From the LDC web page, I found that TIDIGITS is just $250 U.S. I also understand that that might still be too expensive. If you want a free database, you could try CMU's AN4 database:
http://fife.speech.cs.cmu.edu/databases/
It is free, and you can download it at no cost. It is around 64MB, so it may take you some time to transfer. This is what I recommend you try first.
Could you take a look? I hope this will give you an easier time. :-) It is still your choice whether you insist on using an audio novel as your first task; however, my help will be limited from this point on, because I really don't know how to make that work.
Regards,
Arthur
Thank you very much for the response; your help and advice have proven very fruitful.
I have resolved the matter, and it was quite trivial: there was a problem with the transcription file. :)
Now I would like to ask a question. My bw has started working fine, and the questions are successfully made by the tree builder. But when it comes to making the unpruned trees, fatal errors are generated for only a subset of the phones (HH, JH, AW, NG, OY, P, SH). What is the cause of this? In the case of HH especially, errors are produced only for states 1-4, not for the 0th state (and no .dtree file is produced for the 0th state).
What could be the main cause of this phenomenon?
With regards,
BILAL AHMED
As I explained in the previous few mails, you may not have enough data for particular phonemes if you choose an arbitrary training corpus; a quick way to check the coverage is sketched below. -Arthur
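One rough way to check, assuming the same hypothetical etc/ paths as in the earlier sketch: expand every transcript word through the dictionary and count how often each phone occurs. Phones at the bottom of the list, or missing entirely, are the likely culprits for the failing trees.

DIC=etc/SCRIPTURES.dic
TRN=etc/SCRIPTURES.transcription

# Map each transcript token to its pronunciation via the dictionary,
# then split the pronunciations into phones and count each phone.
tr ' ' '\n' < "$TRN" | grep -v -e '^<s>$' -e '^</s>$' -e '^(' -e '^$' \
    | awk 'NR==FNR { w=$1; sub(/^[^ \t]+[ \t]*/, ""); pron[w]=$0; next }
           ($1 in pron) { print pron[$1] }' "$DIC" - \
    | tr -s ' \t' '\n' | grep -v '^$' | sort | uniq -c | sort -n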