If the following information is there, we can directly do forced alignment, i.e., samples 2210 to 5080 correspond to 'she' and samples 0 to 2209 correspond to 'SIL'.
2210 5080 she
5080 9370 had
9370 10760 your
10760 15840 dark
15840 19258 suit
19258 21360 in
21360 27864 greasy
27864 34464 wash
34464 38642 water
39477 43180 all
43180 48569 year
After training and testing, the results we get look like this:
she had your dark suit in greasy wash water all year (FAKS0-FAKS0-SA1)
she had your dark suit in greasy wash water all year (FAKS0-FAKS0-SA1)
Words: 11 Correct: 11 Errors: 0 Percent correct = 100.00% Error = 0.00% Accuracy = 100.00%
How can this be a speech-to-text alignment? Is there any way to get sample-number or timing information?
Please clarify this for me.
I have seen that the configuration file has an option for forced alignment; initially it was set to 'no', but I have now set it to 'yes'. However, I am getting the following error. Please help me.
```
sitecsp@acl-pg-06:~/DYSARTHRIC/an4$ sphinxtrain run
Sphinxtrain path: /usr/local/lib/sphinxtrain
Sphinxtrain binaries path: /usr/local/libexec/sphinxtrain
Running the training
MODULE: 000 Computing feature from audio files
Extracting features from segments starting at (part 1 of 1)
Extracting features from segments starting at (part 1 of 1)
Feature extraction is done
MODULE: 00 verify training files
Phase 1: Checking to see if the dict and filler dict agrees with the phonelist file.
Found 30 words using 25 phones
Phase 2: Checking to make sure there are not duplicate entries in the dictionary
Phase 3: Check general format for the fileids file; utterance length (must be positive); files exist
Phase 4: Checking number of lines in the transcript file should match lines in fileids file
Phase 5: Determine amount of training data, see if n_tied_states seems reasonable.
Estimated Total Hours Training: 0.647533333333333
This is a small amount of data, no comment at this time
Phase 6: Checking that all the words in the transcript are in the dictionary
Words in dictionary: 27
Words in filler dictionary: 3
Phase 7: Checking that all the phones in the transcript are in the phonelist, and all phones in the phonelist appear at least once
MODULE: 0000 train grapheme-to-phoneme model
Skipped (set $CFG_G2P_MODEL = 'yes' to enable)
MODULE: 01 Train LDA transformation
Skipped (set $CFG_LDA_MLLT = 'yes' to enable)
MODULE: 02 Train MLLT transformation
Skipped (set $CFG_LDA_MLLT = 'yes' to enable)
MODULE: 05 Vector Quantization
Skipped for continuous models
MODULE: 10 Training Context Independent models for forced alignment and VTLN
Phase 1: Cleaning up directories:
accumulator...logs...qmanager...models...
Phase 2: Flat initialize
Phase 3: Forward-Backward
Baum welch starting for 1 Gaussian(s), iteration: 1 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 1
Current Overall Likelihood Per Frame = -161.308512646282
Baum welch starting for 1 Gaussian(s), iteration: 2 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 2
Current Overall Likelihood Per Frame = -158.718598785133
Convergence Ratio = 2.58991386114866
Baum welch starting for 1 Gaussian(s), iteration: 3 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 3
Current Overall Likelihood Per Frame = -155.527943649405
Convergence Ratio = 3.19065513572841
Baum welch starting for 1 Gaussian(s), iteration: 4 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 4
Current Overall Likelihood Per Frame = -153.912110916641
Convergence Ratio = 1.61583273276406
Baum welch starting for 1 Gaussian(s), iteration: 5 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 5
Current Overall Likelihood Per Frame = -153.477470057312
Convergence Ratio = 0.434640859329505
Baum welch starting for 1 Gaussian(s), iteration: 6 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 6
Current Overall Likelihood Per Frame = -153.290349703147
Convergence Ratio = 0.187120354165017
Baum welch starting for 1 Gaussian(s), iteration: 7 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 7
Current Overall Likelihood Per Frame = -153.185850578263
Convergence Ratio = 0.1044991248842
Baum welch starting for 1 Gaussian(s), iteration: 8 (1 of 1)
0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Normalization for iteration: 8
Current Overall Likelihood Per Frame = -153.134587666015
Training completed after 8 iterations
MODULE: 11 Force-aligning transcripts
Skipped: No sphinx3_align(.exe) found in /usr/local/libexec/sphinxtrain
If you wish to do force-alignment, please copy or link the
sphinx3_align binary from Sphinx 3 to /usr/local/libexec/sphinxtrain
and either define $CFG_MODEL_DIR in sphinx_train.cfg or
run context-independent training first.
```
After that, I made the following changes in the configuration file:
# (yes/no) Train multiple-gaussian context-independent models (useful
# for alignment, use 'no' otherwise) in the models created
# specifically for forced alignment
$CFG_FALIGN_CI_MGAU = 'yes';
# (yes/no) Train multiple-gaussian context-independent models (useful
# for alignment, use 'no' otherwise)
$CFG_CI_MGAU = 'yes';
# (yes/no) Train context-dependent models
$CFG_CD_TRAIN = 'yes';
# Number of tied states (senones) to create in decision-tree clustering
$CFG_N_TIED_STATES = 200;
# How many parts to run Forward-Backward estimatinon in
$CFG_NPART = 1;
# (yes/no) Train a single decision tree for all phones (actually one
# per state) (useful for grapheme-based models, use 'no' otherwise)
$CFG_CROSS_PHONE_TREES = 'no';
# Use force-aligned transcripts (if available) as input to training
$CFG_FORCEDALIGN = 'yes';
Even then, I am getting the same error. Please help me. Also, what would the output of forced alignment look like?
It seems self-explanatory: you are missing the sphinx3_align tool that you want to use for forced alignment.
You should try to download and install https://github.com/skerit/cmusphinx/tree/master/sphinx3
Sir, how can I download it?
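For reference, one way to fetch and build it from that repository, assuming the usual autogen.sh/configure layout there and that sphinxbase is already installed (the link step is the one the sphinxtrain error message above asks for):

```
# Clone the mirror linked above and build sphinx3
git clone https://github.com/skerit/cmusphinx.git
cd cmusphinx/sphinx3
./autogen.sh && ./configure && make && sudo make install

# sphinxtrain looks for the binary here (path taken from the error message above)
sudo ln -s /usr/local/bin/sphinx3_align /usr/local/libexec/sphinxtrain/sphinx3_align
```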
Sir, from the forced alignment, should I get sample-number information?
In fact, I do not understand your question. Forced alignment is used when you do not have time information (you only have the sentence transcript).
In your example the times are explicitly written, so why do you need the alignment at all? Try to reformulate your question.
To infer speech-to-text alignment results, the sample numbers of the word boundaries (onsets and offsets) are required. But CMU Sphinx is not giving the alignment results in the form mentioned above; instead, it gives the following result, which is not very intuitive:
she had your dark suit in greasy wash water all year (FAKS0-FAKS0-SA1)
she had your dark suit in greasy wash water all year (FAKS0-FAKS0-SA1)
I am unable to interpret this result.
Is there any way to get timing information from the alignment result?
OK, now it's clear: you seem to be speaking about the result/an4.align file. This is not the speech-to-audio alignment but the reference-to-hypothesis string alignment used to compute the error rate.
What you seem to need is just the time information from the decoder. In pocketsphinx this is achieved with the -ctm option; the CTM format gives you a time for each word.
Note that forced alignment is a totally different procedure. Forced alignment means you have a ground-truth transcript and your goal is to get the time information. You do not use the decoder for that, but the aligner.
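For the decoder route, a sketch with pocketsphinx_batch, assuming the usual an4 layout with placeholder model and file names (-ctm is the option named above; CTM output has roughly one line per word: file, channel, start time, duration, word):

```
# Decode a control list of utterances and dump per-word times in CTM format
pocketsphinx_batch \
    -hmm en-us \
    -lm an4.lm \
    -dict an4.dic \
    -ctl an4_test.fileids \
    -adcin yes -cepdir wav -cepext .wav \
    -ctm an4_test.ctm
```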
Yes sir, that is exactly what I require: for a given transcription, knowing which part of the audio corresponds to 'she', and so on. If I get timing information from the forced alignment, I can easily do this. Sir, can you please explain in detail how I can do this? Where should I set the ctm option? Please tell me.
The sphinx3_align tool has a -wdsegdir option, not a ctm option, to dump word times.
The command line is:
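A representative invocation (directory layout and model are placeholders; the flags are the ones sphinx3_align echoes in its own configuration dump, as in the log further down this thread):

```
# Align each control-file utterance against its transcript; per-utterance
# .wdseg files (word segmentations in frames) are written to -wdsegdir
sphinx3_align \
    -hmm model/hub4_cd_continuous_8gau_1s_c_d_dd \
    -dict an4.dic \
    -fdict an4.filler \
    -ctl an4_train.fileids \
    -insent an4_train.transcription \
    -cepdir feat \
    -phsegdir phseg \
    -wdsegdir wdseg
```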
Sir, I am currently using pocketsphinx, and sphinx3_align(.exe) is not there. But I have now used Sphinx 3 for the first time, and inside its build directory there is a sphinx3_align(.exe) file. Can I use that, or can you please provide a link from which I can download it?
Thanks in advance
Sir, I finally got it working.
But I am getting frame-wise information, i.e., RUBOUT corresponds to frames 26 to 95. Is there any way to get sample-number or timing information instead, e.g., that RUBOUT corresponds to samples 16000 to 36000, or 1.5 to 2.5 s, in the audio signal?
Thank you.
Sir, when I tried to run it on some other mfc files, I got the following error. Please help me.
sphinx3_align gives frame numbers only for the an4 database. When I try rm1 or some other database, it throws an error. Can somebody please help me?
Sure, as soon as you provide error details.
For the rm1 database, I am getting the following error:
sitecsp@acl-pg-06:~/Documents/ALIGN/an4$ sphinx3_align \
    -hmm /home/sitecsp/Documents/FORCE/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd \
    -dict an4.dic \
    -fdict an4.filler \
    -ctl an4_train.fileids \
    -insent an4_train.transcription \
    -cepdir /home/sitecsp/Documents/ALIGN/an4/feat \
    -phsegdir /home/sitecsp/Documents/ALIGN/an4/pdsegd \
    -wdsegdir /home/sitecsp/Documents/ALIGN/an4/wdsegd
INFO: info.c(65): Host: 'acl-pg-06'
INFO: info.c(69): Directory: '/home/sitecsp/Documents/ALIGN/an4'
INFO: info.c(73): sphinx3_align Compiled on: Dec 22 2013, AT: 15:13:45
[...]
ERROR: "cont_mgau.c", line 653: Weight normalization failed for 3 senones
WARNING: "cont_mgau.c", line 767: 24 densities removed (3 mixtures removed entirely)
WARNING: "tmat.c", line 242: Normalization failed for tmat 2 from state 0
WARNING: "tmat.c", line 242: Normalization failed for tmat 2 from state 1
WARNING: "tmat.c", line 242: Normalization failed for tmat 2 from state 2
INFO: dict.c(475): Reading main dictionary: an4.dic
ERROR: "dict.c", line 263: Line 2: Bad ciphone: AX; word A(2) ignored
ERROR: "dict.c", line 263: Line 5: Bad ciphone: AX; word AAW ignored
ERROR: "dict.c", line 263: Line 6: Bad ciphone: AXR; word ABERDEEN ignored
ERROR: "dict.c", line 263: Line 7: Bad ciphone: AX; word ABOARD ignored
ERROR: "dict.c", line 263: Line 8: Bad ciphone: AX; word ABOVE ignored
ERROR: "dict.c", line 263: Line 10: Bad ciphone: DX; word ADDED ignored
ERROR: "dict.c", line 263: Line 11: Bad ciphone: DX; word ADDING ignored
ERROR: "dict.c", line 263: Line 12: Bad ciphone: AX; word AFFECT ignored
ERROR: "dict.c", line 263: Line 13: Bad ciphone: AXR; word AFTER ignored
ERROR: "dict.c", line 263: Line 14: Bad ciphone: AX; word AGAIN ignored
ERROR: "dict.c", line 263: Line 16: Bad ciphone: IX; word AJAX'S ignored
ERROR: "dict.c", line 263: Line 17: Bad ciphone: AX; word ALASKA ignored
ERROR: "dict.c", line 263: Line 18: Bad ciphone: AX; word ALERT ignored
ERROR: "dict.c", line 263: Line 19: Bad ciphone: AX; word ALERTS ignored
ERROR: "dict.c", line 263: Line 20: Bad ciphone: IX; word ALEXANDRIA ignored
ERROR: "dict.c", line 263: Line 25: Bad ciphone: AX; word AN(4) ignored
ERROR: "dict.c", line 263: Line 26: Bad ciphone: AXR; word ANCHORAGE ignored
ERROR: "dict.c", line 263: Line 27: Bad ciphone: AX; word AND ignored
ERROR: "dict.c", line 263: Line 29: Bad ciphone: DX; word ANYBODY ignored
ERROR: "dict.c", line 263: Line 30: Bad ciphone: DX; word ANYBODY(2) ignored
ERROR: "dict.c", line 263: Line 31: Bad ciphone: AX; word APALACHICOLA ignored
ERROR: "dict.c", line 263: Line 32: Bad ciphone: AX; word APALACHICOLA'S ignored
ERROR: "dict.c", line 263: Line 33: Bad ciphone: AX; word APRIL ignored
ERROR: "dict.c", line 263: Line 34: Bad ciphone: AXR; word ARABIAN ignored
ERROR: "dict.c", line 263: Line 35: Bad ciphone: DX; word ARCTIC ignored
ERROR: "dict.c", line 263: Line 36: Bad ciphone: IX; word ARCTIC(2) ignored
ERROR: "dict.c", line 263: Line 38: Bad ciphone: AXR; word ARE(2) ignored
ERROR: "dict.c", line 263: Line 39: Bad ciphone: AX; word AREA ignored
ERROR: "dict.c", line 263: Line 40: Bad ciphone: AX; word AREAS ignored
ERROR: "dict.c", line 263: Line 41: Bad ciphone: AX; word AREN'T ignored
FATAL_ERROR: "dict.c", line 208: Missing base word for: AREN'T(2)
The dictionary should match the acoustic model and contain all reference words. Your dictionary mismatches the model's phone set, and the word AREN'T is missing.
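One rough way to see the mismatch, assuming the usual file formats (dictionary: one word per line followed by its phones; mdef: phone rows whose first column is the base phone) and hypothetical output file names:

```
# Phones used by the dictionary
awk '{for (i = 2; i <= NF; i++) print $i}' an4.dic | sort -u > dict_phones

# Base phones the acoustic model knows (mdef phone rows have many columns;
# the exact column layout may vary between model formats)
awk '$1 !~ /^#/ && NF >= 9 {print $1}' \
    hub4_cd_continuous_8gau_1s_c_d_dd/mdef | sort -u > model_phones

# Dictionary phones the model lacks (here: AX, AXR, DX, IX)
comm -23 dict_phones model_phones
```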
When I tried the TIMIT database (the DR2 folder), I got the following error.
Please help me.
This means you made a mistake extracting the features.
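If the features were extracted with the wrong parameters, recomputing them with sphinx_fe usually fixes this. A sketch under the assumption of 16 kHz TIMIT data (TIMIT ships NIST SPHERE files, hence -nist yes; file names and directories are placeholders):

```
# Recompute 13-dimensional MFCC features at 100 frames/s (sphinx_fe defaults)
# to match what the acoustic model expects
sphinx_fe \
    -c timit_train.fileids \
    -nist yes \
    -di wav -ei wav \
    -do feat -eo mfc \
    -samprate 16000
```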
Sir, what I have done is to give the same hmm directory path (i.e., sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd) for every database. Is this the right path?
Sir, how is it actually giving frame numbers without any training? In that HMM model, how can the same means and variances be used to segment any words? How does it actually produce the frame information? Is there any material to understand this clearly?
With this, I have to convert manually from frame numbers to sample numbers or time information. Is there any option in sphinx3_align to get time information directly?
Thank you
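For reference, the conversion itself is simple: at sphinx3_align's default frame rate of 100 frames per second (-frate 100), time in seconds is frame/100, and the sample index is frame * (sample_rate/100). A minimal sketch over a .wdseg file, assuming its usual four columns (start frame, end frame, acoustic score, word), 16 kHz audio, and a hypothetical file name:

```
# Frames 26-95 for RUBOUT become 0.26-0.95 s, i.e. samples 4160-15200 at 16 kHz
awk 'NR > 1 && NF == 4 {
    printf "%-12s %6.2f s %6.2f s %8d %8d\n", $4, $1/100, $2/100, $1*160, $2*160
}' wdsegd/an4_utt01.wdseg    # file name is a placeholder
```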