Hi,
I'm back. I have the same problem with script 02 (the Perl script). When I run it, the BW step runs fine, but the NORM script does not, because one variable in the Perl code is empty when it should contain the path /root/modeloacusticomex/bwaccumdir/modeloacusticomex_buff_1.
I fixed this by hard-coding the path, but I don't know whether that can cause problems in the future. This is the piece of code that fails:
die "USAGE: $0 <iter>" if ($#ARGV != $index);
$iter = $ARGV[$index];
$modelname="${CFG_EXPTNAME}.ci_${CFG_DIRLABEL}";
$processpart="02.ci_schmm";
$bwaccumdir = "";
for (<"${CFG_BASE_DIR}/bwaccumdir/${CFG_EXPTNAME}buff*">) {
    $bwaccumdir .= " \"$_\"";
}
$hmmdir = "${CFG_BASE_DIR}/model_parameters/$modelname";
and this is my correction:
die "USAGE: $0 <iter>" if ($#ARGV != $index);
$iter = $ARGV[$index];
$modelname="${CFG_EXPTNAME}.ci_${CFG_DIRLABEL}";
$processpart="02.ci_schmm";
$bwaccumdir = "";
for (<"${CFG_BASE_DIR}/bwaccumdir/${CFG_EXPTNAME}buff*">) {
    $bwaccumdir .= " \"$_\"";
}
$bwaccumdir = "/root/modeloacusticomex/bwaccumdir/modeloacusticomex_buff_1";
$hmmdir = "${CFG_BASE_DIR}/model_parameters/$modelname";
Can you tell me whether this is a bug, or whether it could be wrong settings in the configuration file? All the paths in the configuration file are correct; I checked them beforehand.
Has anybody had the same problem?
These are my specifications:
I am trying to train a model for Mexican Spanish.
I am trying to train a continuous model for Sphinx-4.
I use Linux Red Hat 9.
I have 103 utterances.
If you need any other data, please tell me. I need your help.
Omar
1, Before I go on to answer you, I want you to know that the amount of data you have is far, far from enough to train a model for a language.
Please kindly read the "10 common pitfalls" document about the data requirements of SphinxTrain:
http://www.cs.cmu.edu/~archan/10CommonPitfalls_ST.html
2, I noticed that you are using root access to run the process. I want to remind you that this is very dangerous from a security standpoint.
3, Now, back to your problem. If you hack the script in this way, you will have problems in step 4, because it gets the accumulator files from a different place.
Regards,
Arthur
I don't know what the amount of my training data is; can you tell me how to calculate it?
And I only want to create a speech recognition application for a small set of Mexican Spanish words, not for the whole language.
I have 102 waveforms and 1072 different words in my dictionary; I hope this describes the amount of data.
Omar
I am glad that at last you are asking the right question.
If you want to build context-dependent phone models, a rough estimate is that you need 10-20 hours of native speech.
If you want to train whole-word models, then for each word, 10-50 occurrences must be present in the training set.
Let's assume you have an isolated-word system; given your number of words, that means you need at least 10*1072 utterances in your training set to get good results.
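To make the arithmetic concrete, here is the same rule of thumb as a small Python sketch (Python rather than Perl, purely for illustration; the 10 and 50 figures are the rule-of-thumb bounds above, and 1072 is the dictionary size mentioned in this thread):

```python
# Rule of thumb for whole-word models: each vocabulary word should
# occur roughly 10-50 times in the training utterances.
def min_utterances(vocab_size, occurrences_per_word=10):
    # For an isolated-word system (one word per utterance), the utterance
    # count equals the total number of word occurrences needed.
    return vocab_size * occurrences_per_word

print(min_utterances(1072))      # lower bound: 10720 utterances
print(min_utterances(1072, 50))  # upper bound: 53600 utterances
```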
That's why I said the amount of data you have is far from enough. The first thing you should consider when you train is whether you have enough training data. The data can be obtained:
1, From the Linguistic Data Consortium.
2, From your own recordings. (It can take months to gather and transcribe this data.)
Now, if you don't have the data:
1, If you still want to do some training, try a simple database like AN4 or RM; they are essentially free.
2, Think of ways to gather data from friends.
Arthur
And do you think that all the problems I have had with the training scripts were because I don't have a sufficient amount of data?
If I only have to recognize 50 words, then I need at least 50*10 utterances (where 10 is the number of occurrences of each word in the utterances)?
And if I only train word models, do my utterances need to contain those words? For example, if I want to recognize HOLA ADIOS MAMA CORRER ...,
my utterances must be:
ADIOS MAMA CORRER MAMA ...
MAMA ADIOS MAMA CORRER CORRER ...
....
???
Omar
The problem in your script is another thing. What I tried to say is that you could probably complete the training with that amount of data; however, the models will be totally unusable (say, giving you gibberish every time). That's why I wanted to stop you before you started.
Arthur
Omar,
Recently (i.e., in the last 3 weeks) I made some changes to the SphinxTrain scripts. The particular problem you're referring to is, I think, solved by those changes. Specifically, instead of the loop you have above,
$bwaccumdir = "";
for (<"${CFG_BASE_DIR}/bwaccumdir/${CFG_EXPTNAME}buff*">) {
    $bwaccumdir .= " \"$_\"";
}
the script now has:
$bwaccumdir = "";
opendir(ACCUMDIR, "${CFG_BASE_DIR}/bwaccumdir")
    or die "Could not open ${CFG_BASE_DIR}/bwaccumdir\n";
@bwaccumdirs = grep /${CFG_EXPTNAME}buff/, readdir(ACCUMDIR);
closedir(ACCUMDIR);
for (@bwaccumdirs) {
    $bwaccumdir .= " \"${CFG_BASE_DIR}/bwaccumdir/$_\"";
}
The change you made essentially ignores the loop.
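For readers more comfortable outside Perl, the logic of the new loop can be sketched in Python: list the bwaccumdir directory, keep the entries whose names contain the "<exptname>buff" pattern, and join their full paths into one quoted, space-separated string (the function name and sample directory names here are hypothetical; the real script is the Perl above):

```python
import os
import re

def build_bwaccumdir(base_dir, exptname):
    # Mirror of the Perl readdir/grep loop in norm.pl: collect every entry
    # under <base_dir>/bwaccumdir whose name contains "<exptname>buff",
    # quote each full path, and concatenate them separated by spaces.
    accum_root = os.path.join(base_dir, "bwaccumdir")
    pattern = re.compile(re.escape(exptname) + "buff")
    # readdir() returns entries in no particular order; sort for determinism.
    entries = sorted(e for e in os.listdir(accum_root) if pattern.search(e))
    return "".join(' "%s"' % os.path.join(accum_root, e) for e in entries)
```

If this returns an empty string, nothing under bwaccumdir matched the pattern, which is exactly the symptom reported in this thread.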
Regarding your question about the amount of data, the more data you have, the better your system will be. Arthur provided some rules of thumb, which are useful to give you an idea of where to start, but nothing really replaces experimenting with data.
If you want to decode from a vocabulary of 1000 words, you will need a fairly large amount of data for context dependent phone models. As a comparison, the AN4 database has a vocab of 100 words and 1000 utterances of training data (about 35 minutes of audio); RM1 has a vocab of about 1000 words, but something around 4000 utterances. I don't know your task, but you may want to consider these simplifying assumptions:
-Can you use word models? If so, you'll probably create context independent models (step 2 in the scripts). You can probably use word models if you have a small vocab (such as digits, for example).
-Can you use sentence models? If the number of sentences that you want to recognize is small, then you can just pretend the whole sentence is a single model. It's useful only in very special circumstances, but I thought I'd mention it.
-Do you need to train speaker independent models? If you can train speaker dependent models (and record the utterances yourself), then maybe 10 or fewer tokens of each word may be enough. Experiment with it, so that you start getting a sense of how much data you need.
Hope this helps,
--Evandro
Anonymous - 2005-06-02
With respect to the script problem encountered in 02.ci_schmm/norm.pl, the script as originally written in Omar's slightly older version should work on Linux (it has for me). Therefore, IMHO he needs to insert debugging statements to find out why it isn't working.
In either version, the script expects one or more directories named ${CFG_BASE_DIR}/bwaccumdir/${CFG_EXPTNAME}buff* to exist, since they are created (and data written into them) in the baum_welch.pl script (since he's doing the baum_welch computation in only one part, there will be only one such directory). The variable $bwaccumdir is supposed to be set to a space-separated string of those directories (or in Omar's case, just one directory). Similar code exists in script modules 04 and 07, so finding the problem here in 02 should help in those subsequent steps as well.
Either (1) that $bwaccumdir directory doesn't exist, (2) it exists but has the wrong name, or (3) something is wrong with the values of $CFG_BASE_DIR or $CFG_EXPTNAME -- we can't tell from here, so Omar will have to find out which.
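A quick way to tell those three cases apart is a small check, sketched here in Python (the function name is hypothetical; substitute the real CFG_BASE_DIR and CFG_EXPTNAME values from the configuration file):

```python
import glob
import os

def diagnose_bwaccumdir(base_dir, exptname):
    # Distinguish the three failure causes for an empty $bwaccumdir.
    accum_root = os.path.join(base_dir, "bwaccumdir")
    if not os.path.isdir(accum_root):
        # Case (1)/(3): the directory is missing, or CFG_BASE_DIR is wrong.
        return "missing: %s (check CFG_BASE_DIR)" % accum_root
    matches = glob.glob(os.path.join(accum_root, exptname + "buff*"))
    if not matches:
        # Case (2)/(3): directory exists but no entry matches the expected name.
        return "no match for %sbuff* (check CFG_EXPTNAME and dir names)" % exptname
    return "ok: %d accumulator director%s found" % (
        len(matches), "y" if len(matches) == 1 else "ies")
```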
cheers,
jerry
My task is to recognize at least 50 words, speaker independent. How many utterances do I need?
Sorry, but I don't know what type of utterances I need. I have utterances such as:
OMAR ES UNA BUENA PERSONA
GABRIELA ES UNA BUENA PERSONA
UNA BUENA PERSONA ES GABRIELA
UNA BUENA PERSONA ES OMAR
.....
Only with the words that I need.
Is this type of utterance fine?
Can you tell me if I need all the scripts for the continuous model: 00, 02, 03, 04, 05, 06, 07?
Omar
Hi,
Well, I only need a recognition system very similar to TIDIGITS. Please tell me, what type of utterances does such a system use? Do I need a list of phones? And is it necessary to use all the scripts for continuous models (00, 02, 03, 04, 05, 06, 07)?
I need this help very much. I have been trying to train a continuous acoustic model for 3 months, and I have read the whole manual, but I am very confused.
Omar
thanks for all,
Well, I only need a recognition system very similar to TIDIGITS (small vocabulary, in Mexican Spanish). Please tell me, what type of utterances does such a system use? Do I need a list of phones? And is it necessary to use all the scripts for continuous models (00, 02, 03, 04, 05, 06, 07)?
I think the utterances would be similar to:
HOLA COMO ESTAS
ESTAS COMO HOLA
COMO HOLA ESTAS
..
Is the last example fine?
I need this help very much. I have been trying to train a continuous acoustic model for 3 months, and I have read all the manuals, but I am very confused, because I get many errors.
Can somebody tell me how I can build a system like TIDIGITS (with the same amount of data), please ...
Omar
Hi,
I hope my last post will be answered.
I tried to train the acoustic model and I get this type of error:
ERROR: "gauden.c", line 1418: var (mgau=0, feat=0, density=3,component=38)<0
Is it a serious error? Please tell me it is not, because I ran 00, 02, 03, 04, 05, 06, and 07 for the continuous model and finished today; but if this is a serious error, I have to start again, and I am desperate. These were the only errors, and the training finished "successfully".
I read the post "Mdef(senones) versus mgau(senones)", but I don't know whether this is a serious error. I recorded my training data 3 times, and the last set is medium size (102 utterances).
Please help me with this problem, or answer my last message. Thanks a lot.
Omar
Anonymous - 2005-06-17
Omar -- I believe that your question about "ERROR: "gauden.c", line 1418" is answered in the SphinxTrain FAQ document at http://www-2.cs.cmu.edu/~rsingh/sphinxman/FAQ.html#10 (the source line is different, but the error message is the same otherwise). The answer there is that this is probably not a serious error.
Indeed, what Jerry pointed out is correct. The fix was made because the original script didn't work on Windows. The current version should work on both Linux and Windows, while the previous one worked only on Linux.
Omar, first off: building a speech recognition system isn't trivial. There are so many variables, and so many ways of doing things, that you can't expect to learn everything you need in just a couple of months.
For example, does your system have to be speaker independent, or would a speaker dependent one suffice? Who's going to use it? Under which conditions? Do you know the microphone and the environment in which it will be used? How many words do you want to use? Keep in mind that the more you constrain your task, the easier it is to build the system.
tidigits is small vocab, but it's speaker independent. The training set has about 8500 utterances from adult speakers (plus a couple thousand from children), with a vocabulary of 11 words, and it works fine with a flat (uniform) unigram language model.
Now, in this thread you started by saying you had a vocabulary of 1000+ words, and now you're saying it's similar to tidigits. It may seem like a small detail, but the vocabulary size implies completely different possibilities.
Tidigits may not be the best example, because your system has non-digit words, and you probably want your sentences to be grammatical: "123" and "321" are both acceptable, whereas "hola como estas" is acceptable but "como hola estas" is not. This affects training in that grammatical sentences sound more natural: speakers sound artificial when uttering fake, ungrammatical utterances.
Small vocab (especially something like 10 words) opens up the possibility of using word models, in which case you can use context independent models (up to step 02). If your vocabulary is larger, then you're better off using phone models, in which case you'll need to go through all the steps.
A speaker dependent system normally requires less data for the same accuracy, but the user has to record however many words they'll need, with however many repetitions.
Sequences of words (as opposed to isolated digits) have implications for the decoder: you'll probably need a language model, which you don't really need for tidigits.
What's best for your system? Well, you need to define what you really need, and then experiment with it. The people in this forum can tell you things to try, but there are no fail-safe answers.
--Evandro
Hi,
Thanks for all the answers. I can now tell you, more clearly I hope, about the system that I need to build:
I need to recognize about 20 words.
I need a recognition system with Sphinx-4, speaker independent (continuous model).
I work in Linux (Red Hat 9.2).
The words must be in Mexican Spanish.
The words are: HOLA COMO ESTAS YO ESTOY MUY BIEN QUISIERAS SER MI AMIGO POR FAVOR SOY MEXICANO COMPAÑERO HUESO DEL LLANO AZUL
Can you tell me the minimum amount of data for this system (number of utterances, number of words per utterance, minimal length of an utterance in minutes, etc.)?
Do I need a language model?
Do I need to run all the scripts, or only 00 and 02?
You are really good helpers, thanks a lot. For the moment I only need this system, but I will build a bigger system (more vocabulary) in the future; please help me with this one.
Omar