I want to build a pocketsphinx-based speech recognition system for continuous Arabic speech. I have a corpus of more than 10 hours, and I am not sure whether it is better to use an adaptation technique or to train the system from scratch.
You need to train from scratch, and you need much more than 10 hours of data. Ideally you need 200-300 hours.
Thank you.
After starting training, it gave me the following warnings, and the training did not complete.
WARNING: Utterance ID mismatch on line 6143: 202/202-96 vs
WARNING: Bad line in transcript:
[Arabic transcript line, garbled in the forum display] ...
WARNING: This phone (Z) occurs in the phonelist (/home/.../trial1/etc/trial1.phone), but not in any word in the transcription (/home/.../trial1/etc/trial1_train.transcription)
Any help?
You need to prepare the data in the required format as described in the tutorial. If the format has errors, training will not proceed.
If you need further help, you can share your database.
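As a quick first check, both warnings above can be caught from the shell. This is only a sketch and assumes the usual sphinxtrain layout; trial1.phone and trial1_train.transcription appear in your warnings, while trial1_train.fileids and trial1.dic are guesses at the matching file names:

$ # utterance ids in (...) at the end of each transcription line must
$ # match the basenames in the fileids file, line by line
$ sed 's/.*(\(.*\))$/\1/' etc/trial1_train.transcription > /tmp/trans.ids
$ sed 's:.*/::' etc/trial1_train.fileids > /tmp/file.ids
$ diff /tmp/trans.ids /tmp/file.ids

$ # every phone in the phone list should be used by at least one
$ # dictionary word (a rough check; lines starting with < are unused)
$ awk '{for (i = 2; i <= NF; i++) print $i}' etc/trial1.dic | sort -u > /tmp/used.phones
$ diff <(sort -u etc/trial1.phone) /tmp/used.phones | grep '^<'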
Just finished training. It also gives the results, which are extremely low. However, there are some errors, as shown below:
What is the reason for this error, which frequently appeared during training?
Is this result reasonable?
ERROR: This step had 11206 ERROR messages and 0 WARNING messages. Please check the log file for details.
Normalization for iteration: 6
Current Overall Likelihood Per Frame = -145.917409811125
Training for 8 Gaussian(s) completed after 6 iterations
MODULE: 60 Lattice Generation
Skipped: $ST::CFG_MMIE set to 'no' in sphinx_train.cfg
MODULE: 61 Lattice Pruning
Skipped: $ST::CFG_MMIE set to 'no' in sphinx_train.cfg
MODULE: 62 Lattice Format Conversion
Skipped: $ST::CFG_MMIE set to 'no' in sphinx_train.cfg
MODULE: 65 MMIE Training
Skipped: $ST::CFG_MMIE set to 'no' in sphinx_train.cfg
MODULE: 90 deleted interpolation
Skipped for continuous models
MODULE: DECODE Decoding using models previously trained
Decoding 400 segments starting at 0 (part 1 of 1)
0%
Aligning results to find error rate
SENTENCE ERROR: 90.2% (361/400) WORD ERROR RATE: 52.4% (1877/3585)
Answered in the troubleshooting section of the tutorial.
Yes
I noticed that the system trained using Sphinx 3 gives better accuracy than Pocketsphinx. Is there any reason for such a difference?
Our models are all trained with sphinxtrain; you probably mean the models trained for sphinx4/sphinx3 (continuous) and pocketsphinx (semi-continuous and PTM). Continuous models are expected to be more accurate, but they are also slower to decode. You can learn more on the wiki:
http://cmusphinx.sourceforge.net/wiki/acousticmodeltypes
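If you want to compare the model types on your own data, the decoding command looks roughly like this; the flags are standard pocketsphinx_continuous options, and the model, LM and dictionary paths are placeholders for your own:

$ # swap -hmm between the continuous, semi-continuous and ptm models
$ # to compare speed and accuracy on the same test file
$ pocketsphinx_continuous -hmm model_parameters/trial1.cd_cont_2000 -lm etc/trial1.lm.DMP -dict etc/trial1.dic -infile test.wav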
I used the following command to check a speech file. Could you please let me know how to interpret the results? What does the 92.9% below mean?
$ for i in *.wav; do play $i; done
arctic_0001.wav:
File Size: 176k Bit Rate: 256k
Encoding: Signed PCM
Channels: 1 @ 16-bit
Samplerate: 16000Hz
Replaygain: off
Duration: 00:00:05.51
In:92.9% 00:00:05.12 [00:00:00.39] Out:81.9k [ =====|===== ] Hd:5.6 Clip:0 Segmentation fault (core dumped)
The play command crashed due to a bug, maybe in the driver, maybe something else. It is not really related to pocketsphinx.
How do I determine the best choice of CFG_N_TIED_STATES and CFG_FINAL_NUM_DENSITIES for a particular speech collection? Could you please send me a tutorial link for preparing language models for CMU Sphinx?
Covered in a table in http://cmusphinx.sourceforge.net/wiki/tutorialam#configure_model_type_and_model_parameters
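For reference, both parameters live in etc/sphinx_train.cfg. The values below are purely illustrative; read the actual numbers off the table at that link, based on your vocabulary size and hours of audio:

$CFG_N_TIED_STATES = 2000;      # tied states (senones); illustrative value
$CFG_FINAL_NUM_DENSITIES = 8;   # Gaussians per state (continuous models); illustrative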
http://cmusphinx.sourceforge.net/wiki/tutoriallm
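In short, the usual cmuclmtk pipeline looks like the following, where corpus.txt stands for your training text, one sentence per line:

$ # count word frequencies and build the vocabulary
$ text2wfreq < corpus.txt | wfreq2vocab > corpus.vocab
$ # convert the text to id n-grams, then estimate the ARPA model
$ text2idngram -vocab corpus.vocab -idngram corpus.idngram < corpus.txt
$ idngram2lm -vocab_type 0 -idngram corpus.idngram -vocab corpus.vocab -arpa corpus.arpa

The resulting ARPA file can then be converted to the binary DMP format if your decoder needs it, for example with lm3g2dmp (mentioned below).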
Is it OK for the CMU Sphinx training dictionary to have only lowercase letters, or can it contain both lowercase and capital letters for phoneme representation?
It is OK but not recommended.
I have a problem in preparing the language model; it gives the following error message:
hash_add: Error: [AistiqTaAbi] hash conflict
There are two entries in the dictionary for [AistiqTaAbi]
Please change or remove one of them and re-run.
However, this entry belongs to two different words as shown in the dictionary:
AistiqTAabi A i s t i q T A a b i
AistiqTaAbi A i s t i q T aA b i
It seems that this problem is related to case sensitivity. Any help?
It is not quite clear what software you run.
cmuclmtk and lm3g2dmp.
Neither of them requires a dictionary. You need to be more precise in the description of your problems. The more details you provide, the faster you get an answer.
http://catb.org/~esr/faqs/smart-questions.html
Is there any method to find the execution time of the training and decoding?
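One simple way, assuming you launch training with the sphinxtrain wrapper from the tutorial, is the shell's time builtin:

$ # prints wall-clock, user and system time for the whole run
$ time sphinxtrain run

If your sphinxtrain version supports running a single stage, the same works for decoding alone, e.g. time sphinxtrain -s decode run.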
I performed some experiments and found that PTM and semi-continuous models have the same WER, while the continuous acoustic model has a higher WER. Is this reasonable?
How can I find the number of triphones used in my pocketsphinx system?
Open the mdef file in a text editor.
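The counts sit in the mdef header, so you do not even have to scroll; something like this works (the model path is a guess at the usual sphinxtrain output layout):

$ # n_tri is the number of triphones, n_tied_state the number of senones
$ head model_parameters/trial1.cd_cont_2000/mdef | grep -E 'n_tri|n_tied_state'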
Could you please let me know if $CFG_N_TIED_STATES means the number of triphones?
The idea of tied states is explained in our tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorialconcepts
For computational purposes it is helpful to detect parts of triphones instead of triphones as a whole; for example, to create a detector for the beginning of a triphone and share it across many triphones. The whole variety of sound detectors can be represented by a small number of distinct short sound detectors. Usually we use 4000 distinct short sound detectors to compose detectors for triphones. We call those detectors senones. A senone's dependence on context can be more complex than just the left and right context. It can be a rather complex function defined by a decision tree, or in some other way.
Is it possible to run CMU PocketSphinx using an "any-word" language model such as:
$WORD = (X | Y | Z );
(SENT-START <$WORD> SENT-END)
That is, I want to evaluate the performance using this language model instead of the probabilistic N-grams.
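That should be possible with a JSGF grammar, which pocketsphinx accepts in place of an N-gram model. A minimal sketch, saved as words.gram, with X, Y and Z standing for words that must also exist in your dictionary:

#JSGF V1.0;
grammar anyword;
public <utt> = ( X | Y | Z )+;

$ # -jsgf replaces -lm; model and dictionary paths are placeholders
$ pocketsphinx_continuous -hmm model_parameters/trial1.cd_cont_2000 -dict etc/trial1.dic -jsgf words.gram -infile test.wav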