I am trying to build a small acoustic model.
When I run sphinxtrain run I receive the following error:
MODULE: DECODE Decoding using models previously trained
Decoding 8 segments starting at 0 (part 1 of 1)
0% ERROR: FATAL: "batch.c", line 822: PocketSphinx decoder init failed
ERROR: This step had 1 ERROR messages and 0 WARNING messages. Please check the log file for details.
ERROR: Failed to start pocketsphinx_batch
Aligning results to find error rate
The log in the logDir/Decode folder shows the following error:
INFO: feat.c(715): Initializing feature stream to type: '1s_c_d_dd', ceplen=13, CMN='batch', VARNORM='no', AGC='none'
ERROR: "acmod.c", line 79: Folder '[...]sphinxtrain/scripts/an4/model_parameters/an4.cd_cont_256' does not contain acoustic model definition 'mdef'
FATAL: "batch.c", line 822: PocketSphinx decoder init failed
I tried checking similar messages in the forum but nothing seemed to work for my case.
I need help (!)
Thank you very much in advance!
P.S. The link to my logDir folder is https://drive.google.com/open?id=0B01ecFOycElEaWJkYklNUHFoUUE
P.P.S. Here is the link to my an4 folder: https://drive.google.com/open?id=0B01ecFOycElENTlHZnlpeDROSFk
Last edit: Ayelet Goldstein 2016-12-23
Duplicate of https://sourceforge.net/p/cmusphinx/discussion/help/thread/c53a3944/
Thank you!!
Dear Nickolay, thank you for referring me to the source where I could solve my previous bug.
I have now succeeded in training my acoustic model, but I have a new problem: I get 0% accuracy.
Following the tutorials, I built a phoneme-based language model, since I want to recognize a given phrase (to check the user's fluency in reading Hebrew).
I am converting the Hebrew text to phonemes to train the language model.
I also trained my acoustic model, which for now only needs to recognize my own voice, using only words that I used in the training data.
What am I doing wrong?
Again, a link to my an4 folder: https://drive.google.com/open?id=0B01ecFOycElEUjU2azJrTWFOVjA
When I run:
$ pocketsphinx_continuous -hmm model_parameters/an4.ci_cont -lm etc/an4.lm.bin -dict etc/an4.dic -infile wav/Ayelet/file_2.wav
I don't see the expected phrase converted into phonemes in the output.
and when I run:
$ sphinxtrain -s decode run
I get a 100% error rate for both sentences and words.
I must be missing something in the process.
Thank you very much for your patience!
I found the following error in the file an4.g2p.evaluate.log:
IOError: [Errno 2] No such file or directory: '/.../cmusphinx-code/sphinxtrain/scripts/an4/g2p/an4.test'
Who or what was supposed to generate this file?
I tried doing everything from the beginning, but I am still getting 0% accuracy and I am not sure what I am doing wrong - or maybe I just need much more data?
But I want to build a small model using just predefined words, meaning I want to recognize only a relatively small number of words (~200).
My objective is to recognize whether a person is reading correctly a phrase that is displayed to him/her (in Hebrew), so the program knows what it is expected to recognize.
So what I did is the following:
1) I created a phonetic language model, as described in http://cmusphinx.sourceforge.net/wiki/phonemerecognition
I used SRILM to build my model using the command:
ngram-count -text phoneticTranslationText.txt -lm myPhoneticModel.lm
where phoneticTranslationText.txt is the Hebrew text written as phonemes; for example, the first sentence is "SIL O RE N TA MI Y D RA TZA H KE LE V SIL", which is the phonemic transcription of the phrase ארן תמיד רצה כלב.
I have 86 sentences like this.
My phonetic language model looks like this (a fuller sketch of the ngram-count call follows the excerpt):
\data\
ngram 1=81
ngram 2=582
ngram 3=199

\1-grams:
-1.223562 </s>
-99 <s> -1.884235
-1.442057 A -0.2517919
-2.857031 B -0.626858
-1.534811 BA -0.07855582
-2.680939 BE -0.2802781
-2.857031 BI -0.3330438
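For completeness, here is the same SRILM call written out as a sketch. The -order and -wbdiscount flags are my additions here, not something from the tutorial: with only 86 short sentences the default Good-Turing discounting can produce warnings, and Witten-Bell discounting is often suggested for very small corpora.
$ ngram-count -order 3 -wbdiscount -text phoneticTranslationText.txt -lm myPhoneticModel.lm
phoneticTranslationText.txt holds one phoneme-transcribed sentence per line, as in the example above.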
2) Then I built the acoustic model:
Following the tutorial in http://cmusphinx.sourceforge.net/wiki/tutorialam I built the following files:
- transcription file (for train and test)
- dictionary - with the Hebrew words and their phonemes (e.g., אֹרֶן O RE N)
- the filler dictionary
- the fileids (both for train and test)
- the file listing all the phonemes (A, BA, ...)
And I edited the "sphinx_train.cfg" as described.
Because it is such a small model, I understand I have to train it context-independent (is that indeed a requirement here?). So I changed $CFG_CD_TRAIN to 'no', and I learned from your previous post that in this case I also need to change $DEC_CFG_MODEL_NAME to "$CFG_EXPTNAME.ci_cont".
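For reference, a quick way to double-check the two settings mentioned above (a sketch; the variable names are the ones used in the generated sphinx_train.cfg):
$ grep -E 'CFG_CD_TRAIN|DEC_CFG_MODEL_NAME' etc/sphinx_train.cfg
For a context-independent setup the output should show something like $CFG_CD_TRAIN = 'no'; and $DEC_CFG_MODEL_NAME = "$CFG_EXPTNAME.ci_cont";.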
I trained the model successfully, without any error messages, but I get 0% accuracy.
I tried having 8 sentences in the test set, then only 1 sentence, but nothing improved.
I am not sure what I am missing, and your help, or direction to sources where I can learn more and understand what I am doing wrong, would be greatly appreciated.
Thank you,
Ayelet
Last edit: Ayelet Goldstein 2016-12-27
You do not have enough data; data requirements are listed at the beginning of the acoustic model training tutorial.
Thank you for the fast reply!
But is that much data really necessary even if my requirements are so low?
I always want to recognize the same speaker I used for training, the phrases I am going to recognize are exactly the same as the ones I am training on, I know which phrase is being displayed, and in Hebrew there are no different ways of pronouncing the same letters (for example, א with a certain mark below it will always be pronounced like "A", unlike English, where the same letter is sometimes A and sometimes O, etc.).
Maybe there is an easier way to check whether a phrase is being read correctly without having to build a robust acoustic model?
Last edit: Ayelet Goldstein 2016-12-27
Yes
No
Thank you again for the fast reply!
Seems I will have to continue fighting my way towards a robust model then :)
OK, I now have 0.5 h of speech according to sphinxtrain. I have only 9 different phrases; I recorded each phrase multiple times. I just want to make sure I am on the right track, and see whether the accuracy improves at all.
If my vocabulary is relatively small, then I might need less data - am I right? (Just as the manual states that for "commands, single speaker" we need about 1 hour of data.)
I trained the model on about 20-40 recordings of each sentence, and tested on one sentence of each type (a total of 9 sentences tested).
I still get a round zero for accuracy.
I tried transliterating all the text, still no improvement in accuracy.
I tried training with context-dependent set to YES, still no improvement.
When training, I get a message saying I have 2 errors, but I can't see them in the log:
Phase 3: Forward-Backward
Baum welch starting for 1 Gaussian(s), iteration: 1 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
ERROR: This step had 2 ERROR messages and 0 WARNING messages. Please check the log file for details.
A few questions:
1) Is my zero-percent accuracy still due to not enough training data, or am I missing something else here?
2) Where can I see the 2 errors mentioned during training?
3) In general, should I perform "alignment" during training?
4) Should I run with context-dependent set to yes?
5) Is it necessary to transliterate the Hebrew text (meaning, converting all the Hebrew words to a corresponding English form), or is it not required?
6) To be able to recognize X predefined sentences (single speaker), how much training data do I need, or more specifically, how many times do I have to record each sentence? And what if I want the sentences to be recognized by multiple speakers?
7) In the language model (using SRILM), should I use word-to-phoneme or sentence-to-phoneme? That is, ngram-count -text (a file containing the sentences written as phonemes) or ngram-count -vocab (a file containing words mapped to their phonemes)?
And should the text used to train the language model contain all the duplicated sentences or only the distinct ones? (If I want to recognize only, say, 5 sentences, should the model be trained on duplicates of these sentences, or only on the 5 distinct sentences?)
Thank you very much,
Ayelet
P.S. A link to my DB: https://drive.google.com/open?id=0B01ecFOycElEUWxZZktDRHBVUW8
Last edit: Ayelet Goldstein 2017-01-02
1) The default training process is not supposed to evaluate phonetic decoding accuracy; it evaluates word decoding accuracy. You have to modify the test reference to list phonemes and the decoding script to use allphone decoding mode. Alternatively, you can modify the scoring script to score phonemes and the decoding script to use allphone decoding mode.
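A minimal sketch of what allphone decoding looks like from the command line, following the phoneme recognition wiki page (the model and LM paths are the ones from this experiment; the beam settings and language weight are illustrative values, not tuned):
$ pocketsphinx_continuous -infile wav/Ayelet/file_2.wav -hmm model_parameters/an4.ci_cont -allphone myPhoneticModel.lm -backtrace yes -beam 1e-20 -pbeam 1e-20 -lw 2.0
The -allphone option replaces the word language model with the phonetic one, so the output is a phoneme string that can be scored against a phoneme-level test reference.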
2) In the logs in the logdir folder.
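For example, a quick way to surface them (assuming the standard experiment layout, where each training stage writes its logs under logdir/):
$ grep -rn "ERROR" logdir/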
3) It depends on the quality of your data.
4) Context-dependent decoding is usually slow; it depends on whether you want to tolerate that.
5) No.
6) This question is answered in the first line of the acoustic model training tutorial: http://cmusphinx.sourceforge.net/wiki/tutorialam
7) Phonetic model training is covered in detail on the wiki:
http://cmusphinx.sourceforge.net/wiki/phonemerecognition#training_phonetic_language_model_for_decoding
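For what it is worth, a minimal sketch of the transcript-to-phoneme conversion that page describes, using awk and the pronunciation dictionary (file names follow this experiment's usual layout; the script is illustrative, not part of the toolkit, and it silently skips the <s>, </s> and (file_id) fields as well as any word missing from the dictionary):
$ awk 'NR == FNR { w = $1; $1 = ""; sub(/^ +/, ""); pron[w] = $0; next }
      { out = "SIL"; for (i = 1; i <= NF; i++) if ($i in pron) out = out " " pron[$i]; print out " SIL" }' \
      etc/an4.dic etc/an4_train.transcription > phoneticTranslationText.txt
The resulting file can be fed to ngram-count, and converting the test transcription the same way gives a phoneme-level reference that the allphone output can be scored against.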
As for duplicated sentences: this question is easy to answer if you try to understand how language models and machine learning work. There are many tutorials you can find on the web, and you need to read them first. Without that knowledge there is not much sense in continuing; you will wander blindly.
"Default training process is not supposed to evaluate phonetic decoding accuracy, it evaluates word decoding accuracy. You have to modify the test reference to list phonemes and decoding script to use allphone decoding mode. Or you can modify scoring script to score phonemes and modify decoding script to use allphone decoding mode."
"should I perform "alignment" during training ?" - It depends on the quality of your data.
"To be able to recognize X predefined sentences,(single speaker), what's the size of my training data, or more specifically, how many times I have to record each sentence ? And what if I want it to >be recognized by multiple speakers ?" - This question is answered in the first line of the acoustic model training tutorial http://cmusphinx.sourceforge.net/wiki/tutorialam
> I trained the acoustic model based on this tutorial, and I found there that for single speaker I need 1 hour of recording for "command and control", 10 hours for dictation. But I didn't find anything written regarding a few sentences. Like 5-10 sentences, how many hours are required ?
I read the acoustic model training tutorial and also the language model building tutorial, and followed all their steps.
I am still having problems and would be very happy if you could help me or direct me to additional sources where I can learn how to make progress here. Are there additional tutorials that explain how to use the CMUSphinx library?
Thank you,
Ayelet