I'm training a new acoustic model using the latest sphinxbase-0.8, pocketsphinx-0.8, and sphinxtrain-1.0.8. My recognition task is isolated English words, and I have roughly 325 words in the dictionary. As suggested on the "Building an Acoustic Model" page (http://cmusphinx.sourceforge.net/wiki/tutorialam), I am building a word-dependent phone dictionary for the isolated-word recognition task, i.e.:
Dictionary looks like:
WORDA WORDA_1 WORDA_2
WORDB WORDB_1 WORDB_2
WORDC WORDC_1 WORDC_2
... and so on.
Phoneset looks like:
WORDA_1
WORDA_2
WORDB_1
WORDB_2
WORDC_1
WORDC_2
.. and so on.
When I run the training with "sphinxtrain run", it completes everything up to and including the Context-Independent module without any error or warning, and then gets stuck in the Context-Dependent module at the "Initialization stage". At this point there are no errors; it just does not proceed. Would you know why this is happening?
As an alternative, I tried decoding with pocketsphinx_batch using the output of CI training in the "model_parameters" folder, but that failed with the following error:
INFO: acmod.c(246): Parsed model-specific feature parameters from model_parameters/wordModel.ci_semi/feat.params
INFO: feat.c(713): Initializing feature stream to type: 's2_4x', ceplen=13, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..12]= 0.0
INFO: mdef.c(517): Reading model definition: model_parameters/wordModel.ci_semi/mdef
ERROR: "bin_mdef.c", line 91: Number of phones exceeds limit: 699 > 255
INFO: bin_mdef.c(336): Reading binary model definition: model_parameters/wordModel.ci_semi/mdef
ERROR: "bin_mdef.c", line 359: File format version 1634887022 for model_parameters/wordModel.ci_semi/mdef is newer than library
ERROR: "acmod.c", line 93: Failed to read acoustic model definition from model_parameters/wordModel.ci_semi/mdef
FATAL_ERROR: "batch.c", line 819: PocketSphinx decoder init failed
Any insights into how to resolve either of the above?
All my data and configuration files from the etc and wav folder can be accessed at: http://db.tt/qA1yEjcM (~110 MB)
Thanks very much.
Why are you using just two phonemes when a word has > 2 phones?
e.g. ACCIDENT ACCIDENT_1 ACCIDENT_2
The tutorial asks you to build a word-dependent phone dictionary. So it could be:
ACCIDENT AE_ACCIDENT K_ACCIDENT S_ACCIDENT AH_ACCIDENT D_ACCIDENT AH_ACCIDENT N_ACCIDENT T_ACCIDENT
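For readers who want to generate such entries automatically, here is a minimal Python sketch. It assumes you already have a base pronunciation for each word (the phone list for ACCIDENT is the one used in the example above); the function names `word_dependent_entry` and `build_dictionary` are hypothetical helpers for illustration and are not part of sphinxtrain.

```python
def word_dependent_entry(word, base_phones):
    """Return (dictionary line, units) with every phone tagged by its word."""
    units = [f"{phone}_{word}" for phone in base_phones]
    return f"{word} {' '.join(units)}", units


def build_dictionary(pronunciations):
    """Build dictionary lines and the matching phoneset from base pronunciations."""
    lines, phoneset = [], set()
    for word, phones in sorted(pronunciations.items()):
        line, units = word_dependent_entry(word, phones)
        lines.append(line)
        phoneset.update(units)  # repeated phones within a word collapse to one unit
    return lines, sorted(phoneset)


# Base pronunciation taken from the ACCIDENT example in this thread.
pronunciations = {"ACCIDENT": ["AE", "K", "S", "AH", "D", "AH", "N", "T"]}
lines, phoneset = build_dictionary(pronunciations)
print(lines[0])
print(len(phoneset), "distinct word-dependent phones")
```

Note that the phoneset grows with every word added, which matters for the phone limit discussed later in the thread.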
Yes, I should use as many phones as each word needs; however, if they are word-dependent phones, the total number of phones for 325 words exceeds the maximum allowed, i.e. 255.
Is there a theoretical limit on the number of phones that can be used? Why is that?
In this case, how should I cut the number of phones? I have 325 words in the dictionary, so for a word-based model I will have at least 325 phones. That is still more than the limit of 255 phones.
Does this imply that a word model is not appropriate in this case? The reasons I was building a word model are: (A) I have a small amount of data (~2 hrs); (B) my data comes from users and a usage context completely different from most "off-the-shelf" models, i.e. children's speech, non-native speakers, 8 kHz data, noisy background, etc., so adapting an existing acoustic model was ruled out; and (C) I have training recordings for all the words that are in the test set.
Thanks once again.
Please read the messages the software outputs for you. The message you posted says:
ERROR: "bin_mdef.c", line 91: Number of phones exceeds limit: 699 > 255
You need to reduce the number of phones in the phoneset.
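As a quick sanity check before training, something like the following Python sketch can count the distinct entries in a phone list and compare the total against the limit. The value 255 is taken from the bin_mdef.c error message quoted in this thread; the helper names and the 325-word, two-units-per-word example are illustrative assumptions, and the sketch makes no claim about exactly which phones (fillers, silence) the decoder counts.

```python
PHONE_LIMIT = 255  # from the "Number of phones exceeds limit" error quoted in this thread


def count_phones(phone_lines):
    """Count distinct, non-empty phone names in an iterable of lines."""
    return len({line.strip() for line in phone_lines if line.strip()})


def within_limit(phone_lines, limit=PHONE_LIMIT):
    """Return True when the phoneset is small enough for the given limit."""
    return count_phones(phone_lines) <= limit


# Hypothetical phoneset: 325 words with two word-dependent units each, plus SIL.
words = [f"WORD{i}" for i in range(325)]
two_unit = [f"{w}_{k}" for w in words for k in (1, 2)] + ["SIL"]
print(count_phones(two_unit), within_limit(two_unit))
```

To check a real phone list, the same functions can be fed the lines of the file, for example `within_limit(open(path))`, where `path` points at your own phoneset file.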
There is a practical limit, not a theoretical one.
Word-dependent models only make sense for ten words or fewer. For 325 words you should use simple phone-based models.
Yes, a word model is not appropriate in this case.
You need to have about 50 samples for each word. So your database must be about 30 hours, not 2 hours.
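The arithmetic behind that estimate can be sketched in Python as follows. The 50 samples per word and the 325 words come from this thread; the average recording length is a hypothetical value chosen only for illustration (roughly 30 hours corresponds to recordings of about 6.6 seconds each, including surrounding silence), so substitute the figure from your own data.

```python
def required_utterances(num_words, samples_per_word=50):
    """Total recordings needed when every word has the given number of samples."""
    return num_words * samples_per_word


def estimated_hours(num_utterances, seconds_per_utterance):
    """Convert a recording count and an average length into hours of audio."""
    return num_utterances * seconds_per_utterance / 3600.0


n = required_utterances(325)          # 325 words, 50 samples each
hours = estimated_hours(n, 6.65)      # 6.65 s per recording is an assumed average
print(n, "utterances, about", round(hours, 1), "hours")
```

With shorter recordings the total duration drops accordingly, but the number of utterances needed stays the same.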