Hello.
I am Dae Lim Choi, from Korea.
I am very interested in the Bavieca speech recognition toolkit, and I would like to build a Korean acoustic model with it.
I am testing the training procedure, but unfortunately I encountered the following error message during feature extraction.
~~~ mlfAll.txt at line 1, unexpected lexical unit found or wrong format of feature file name.
This is my master label file:
"/female/fcb1jkh00s200/set200001.fea"
gU
gjvl_gwa
da_Um_gwa
gat_Un
gjvl_ron_Ul
vd_Ul
su
iS_vS_da
"/female/fcb1jkh00s200/set200002.fea"
gU
sa_ram_Un
i_ze
jv_gi
wa_sv
gU_rvn
gvl
sal
il_Un
vbs_Ul
gv_je_jo
"/female/fcb1jkh00s200/set200003.fea"
gU_ga
na_rUl
dol_a_bo_mjv
zo_joN_hi
ib_Ul
jvl_vS_da
....
Could you please tell me why this error happens?
I also wonder what the difference is between the master label file and the MLF segments.
I would be very pleased if you could describe the training recipe (for example, the directory structure, the MLF file, the MLF segment files, etc.).
Thanks in advance for your help.
Hello Dae Lim Choi,
Your MLF seems fine. Please check that there are no whitespace characters after the .fea" in the first line, just the end of line; the code that parses the MLF is not very tolerant of formatting variations. If that does not solve the issue, please send me your MLF and I will take a closer look.
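Invisible trailing characters (a stray space, a tab, or a Windows carriage return) are a common cause of this kind of strict-parser failure. As an illustration only, a small check along these lines could flag the offending lines (the helper name is my own, not part of Bavieca):

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Return the 1-based numbers of lines that end in a space, a tab, or a
// carriage return (CRLF line endings), which a strict MLF parser may reject.
std::vector<int> findSuspectLines(std::istream &mlf) {
    std::vector<int> bad;
    std::string line;
    int lineNumber = 0;
    while (std::getline(mlf, line)) {
        ++lineNumber;
        if (!line.empty()) {
            char last = line[line.size() - 1];
            if (last == ' ' || last == '\t' || last == '\r')
                bad.push_back(lineNumber);
        }
    }
    return bad;
}
```

Running it over mlfAll.txt and printing the returned line numbers would quickly reveal whether the file carries trailing blanks or Windows line endings.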
The MLF contains all the data used for training; the MLF segments are just the parts into which the original MLF is divided. The segments are needed for parallel processing, since typically each core processes a different MLF segment, so the segments should be about the same size. For example, if you have 1000 utterances for training, the "global" MLF should contain them all, and each MLF segment should contain 250 if you are training on a 4-core machine.
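The splitting itself is mechanical. As a sketch (assuming the MLF layout shown above, where each utterance starts with a quoted .fea path; the function name is my own), the global MLF can be divided into equal-count segments like this:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Split an MLF into nSegments pieces of roughly equal utterance count.
// Each utterance starts at a quoted feature-file line such as "/a/b.fea".
std::vector<std::string> splitMLF(std::istream &mlf, int nSegments) {
    std::vector<std::string> utterances;  // one complete utterance per entry
    std::string line;
    while (std::getline(mlf, line)) {
        if (!line.empty() && line[0] == '"')
            utterances.push_back(std::string());  // a new utterance begins
        if (!utterances.empty())
            utterances.back() += line + "\n";
    }
    std::vector<std::string> segments(nSegments);
    if (utterances.empty())
        return segments;
    std::size_t perSegment =
        (utterances.size() + nSegments - 1) / nSegments;  // ceiling division
    for (std::size_t i = 0; i < utterances.size(); ++i)
        segments[i / perSegment] += utterances[i];
    return segments;
}
```

Writing each element of the returned vector to its own file would give the per-core MLF segments.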
Please take a look at the WSJ training scripts in the repository; they should help you a lot.
Dani
Dear Daniel,
Thank you very much for your kind explanations.
Although there are no whitespace characters after the .fea" in my MLF, the same error still occurs.
Please find my MLF attached.
The global MLF was split into 4 MLF segments according to the number of CPU cores.
I have another question.
Should each MLF segment contain exactly the same number of utterances?
Thank you so much again for your all help.
Dae Lim.
Last edit: Dae Lim Choi 2013-07-10
Hello Dae Lim,
I took a closer look at your MLF and it is fine. That made me reread your original mail and realize that you are having this issue during feature extraction, which is the "param" tool. That tool does not take an MLF as input, but a batch file. Please make sure you have the latest version of the toolkit from the git repository. Also, please take a look at the generic training and feature extraction scripts that come with it.
Regarding your question about the MLF segments: ideally, all the segments should contain the same amount of speech, rather than the same number of utterances. So if you have four hours of speech, you can create four master label files of one hour each. The idea behind this is that, for each reestimation iteration during training, we want all cores to finish processing their MLF at roughly the same time; this maximizes CPU usage.
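One simple way to approximate this balancing (a sketch of my own, not a Bavieca utility) is a greedy assignment: walk the utterances and always give the next one to the segment that currently holds the least speech. The durations could come, for instance, from the frame counts of the .fea files:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy load balancing: assign each utterance to the currently lightest
// segment so every core ends up with about the same amount of speech.
// durations[i] is the length of utterance i (e.g. seconds or frame count).
std::vector<int> assignToSegments(const std::vector<double> &durations,
                                  int nSegments) {
    std::vector<int> segmentOf(durations.size());
    std::vector<double> load(nSegments, 0.0);  // speech assigned so far
    for (std::size_t i = 0; i < durations.size(); ++i) {
        int lightest = 0;
        for (int s = 1; s < nSegments; ++s)
            if (load[s] < load[lightest])
                lightest = s;
        segmentOf[i] = lightest;
        load[lightest] += durations[i];
    }
    return segmentOf;
}
```

With durations {3, 1, 1, 1} and two segments, for example, the long utterance ends up alone while the three short ones share the other segment, so both cores process three units of speech.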
Hope this helps,
Dani
Thank you so much, Daniel.
I'm so sorry to bother you again.
Feature extraction completed successfully, but the next step, "initialize the HMM parameters to the global distribution of the data (flat-start)", gave an error.
This is the error I got:
hmminitializer (version: 0014, author: Daniel Bolanos)
hmminitializer -fea /home/dlchoi/bavieca-code/tasks/dict01/scripts/train/fea
-cfg /home/dlchoi/bavieca-code/tasks/dict01/scripts/train/config/features.cfg
-pho /home/dlchoi/bavieca-code/tasks/dict01/scripts/train/config/lexicon/phoneset.txt
-lex /home/dlchoi/bavieca-code/tasks/dict01/scripts/train/config/lexicon/lexicon.txt
-mlf /home/dlchoi/bavieca-code/tasks/dict01/scripts/train/mlf/mlfAll.txt -met flatStart
-mod /home/dlchoi/bavieca-code/tasks/dict01/scripts/train/AM/init/models00.bin
Error: load ../common/estimation/MLFFile.cpp 74 loading MLF at line /home/dlchoi/bavieca-code/tasks/dict01/scripts/train/mlf/mlfAll.txt at line 1, unexpected lexical unit found or wrong format of feature file name
load ../common/estimation/MLFFile.cpp 74 loading MLF at line /home/dlchoi/bavieca-code/tasks/dict01/scripts/train/mlf/mlfAll.txt at line 1, unexpected lexical unit found or wrong format of feature file name
My lexicon file is also attached so you can take a look at this problem.
I would appreciate it if you could let me know of any problems you find.
Dae Lim
Last edit: Dae Lim Choi 2013-07-11
Dae Lim,
I created a mini app to try to reproduce the problem you are having. For that purpose I created a phoneset from your lexicon file (it is enclosed).
This is what the app looks like:
PhoneSet phoneset1("phoneset.txt");
phoneset1.load();
LexiconManager lexiconManager1("lexicon.txt",&phoneset1);
lexiconManager1.load();
MLFFile mlf(&lexiconManager1,"mlfAll.txt",MODE_READ);
mlf.load();
The app works well and the MLF is loaded; using the debugger I can see that it loads 41666 utterances. I cannot reproduce the issue you are having. Please take a look at what is going on inside the mlf.load() method (MLFFile.cpp); that is where the error you see comes from. Put a couple of printf calls in that method so we can see what is going on.
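For what it's worth, the diagnostics could look something like the sketch below: classify each line the way a strict parser might (either a quoted .fea filename or a bare lexical unit) and print a verdict per line, so the first unexpected line is obvious. This follows my reading of the format shown above; it is not the actual MLFFile.cpp code:

```cpp
#include <cassert>
#include <cstdio>
#include <istream>
#include <sstream>
#include <string>

// Print a diagnostic for every MLF line and return the 1-based number of the
// first line that is neither a quoted ".fea" filename nor a bare lexical
// unit, or 0 if the whole file looks well formed.
int diagnoseMLF(std::istream &mlf) {
    std::string line;
    int lineNumber = 0;
    while (std::getline(mlf, line)) {
        ++lineNumber;
        bool isFeaName = line.size() > 6 && line[0] == '"' &&
                         line.compare(line.size() - 5, 5, ".fea\"") == 0;
        bool isLexUnit = !line.empty() &&
                         line.find_first_of(" \t\r\"") == std::string::npos;
        std::printf("line %d: [%s] -> %s\n", lineNumber, line.c_str(),
                    isFeaName ? "feature file"
                              : isLexUnit ? "lexical unit" : "UNEXPECTED");
        if (!isFeaName && !isLexUnit)
            return lineNumber;
    }
    return 0;
}
```

A line carrying a stray carriage return, for instance, is neither a valid filename nor a bare lexical unit, so it would be reported immediately.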
Dani
Hello, Daniel.
I'm still experiencing the MLF & lexicon loading error.
First of all,
I would like to know which Linux environment you use to compile and install Bavieca (distribution and version, gcc compiler version, etc.).
Could you please let me know this information in detail?
I think it would be helpful in finding the cause of the problem.
Thank you very much for your help.
Dae Lim
Hello Dae Lim,
I'm sorry you are still dealing with that problem. Did you try putting in a couple of printf calls to see what is going on?
The toolkit should run on pretty much any version of Linux. I have successfully used it with:
gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
and
gcc version 4.7.2 20121109 (Red Hat 4.7.2-8) (GCC)
and also on CentOS.
What version of Linux/gcc are you using?
Daniel
Last edit: Daniel Bolanos 2013-07-16