Hi,
I am a speech technology enthusiast.
I obtained the pocketsphinx keyword spotting software and built it in MSVC. It is quite a nice application. Before I dig deep into the KWS code, I would like to do some background reading on keyword spotting, particularly the techniques used in pocketsphinx. Can someone point me to good papers (journal articles etc.) or docs which describe such techniques? Or any doc which describes the pocketsphinx keyword spotting logic?
regards
asm
You can check:
"Acoustic Keyword Spotting in Speech with Applications to Data Mining" (Albert Thambiratnam's thesis)
http://eprints.qut.edu.au/37254/1/Albert_Thambiratnam_Thesis.pdf
The implementation is in pocketsphinx/src/libpocketsphinx/kws_search.c
Thanks for the very nice document. Really helpful.
I was wondering if I can process one frame's worth of speech at a time for the KWS task, that is, buffer_size = frame_size in every call to ps_process_raw(..). Will it impact VAD performance (hangover parameters etc.) in any way, and/or the performance of keyword spotting (Viterbi search etc.)?
regards
asm
Yes, you can. The decoder buffers audio internally and processes it frame by frame, so the chunk size per call does not affect the search results.
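For illustration, here is a minimal frame-at-a-time loop, a sketch assuming the pocketsphinx 5prealpha C API; the model paths, keyphrase, and audio file name are placeholders:

    /* Sketch: feed one 10 ms frame shift (160 samples at 16 kHz)
     * per ps_process_raw() call. Paths and keyphrase are placeholders. */
    #include <stdio.h>
    #include <pocketsphinx.h>

    int main(void)
    {
        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
            "-hmm", "/path/to/en-us",          /* acoustic model directory */
            "-dict", "/path/to/cmudict.dict",  /* pronunciation dictionary */
            "-keyphrase", "good day",
            "-kws_threshold", "1e-20",
            NULL);
        ps_decoder_t *ps = ps_init(config);
        FILE *fh = fopen("audio.raw", "rb");   /* 16 kHz, 16-bit mono raw */
        int16 buf[160];                        /* one 10 ms frame shift */
        size_t nread;

        ps_start_utt(ps);
        while ((nread = fread(buf, sizeof(int16), 160, fh)) > 0) {
            ps_process_raw(ps, buf, nread, FALSE, FALSE);
            if (ps_get_hyp(ps, NULL) != NULL) {  /* keyphrase was spotted */
                printf("keyphrase detected\n");
                ps_end_utt(ps);                  /* reset the search */
                ps_start_utt(ps);                /* and keep listening */
            }
        }
        ps_end_utt(ps);
        fclose(fh);
        ps_free(ps);
        cmd_ln_free_r(config);
        return 0;
    }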
Hi,
I am trying to understand some concepts in pocketsphinx kws.
Are phone-loop HMMs similar to filler HMMs, that is, are they used as the non-keyword model?
What role do non-keyword triphone models play in keyword spotting? It seems that pocketsphinx loads all the triphones.
thanks
amit
In pocketsphinx the non-keyword model is a phone loop that contains all the filler HMMs and the CI (context-independent) phones.
They do nothing, but they need to be around in case one would like to switch to another search.
Hi,
I am currently experimenting with the en-us-semi models available from the sphinx website. These models use 6138 senones and 512 mixtures per senone. I would like to know if models of lower complexity are available, for example with 32/64/128 mixtures.
thanks
asm
You can train such models yourself from an existing speech database like TEDLIUM.
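For reference, a sketch of the settings that control model complexity in etc/sphinx_train.cfg (variable names as in stock sphinxtrain; the values are illustrative, not tuned):

    # semi-continuous model with a smaller codebook and fewer senones
    $CFG_HMM_TYPE = '.semi.';
    $CFG_INITIAL_NUM_DENSITIES = 128;  # for semi-continuous, equal to final
    $CFG_FINAL_NUM_DENSITIES = 128;    # e.g. 128 mixtures instead of 512
    $CFG_N_TIED_STATES = 1000;         # fewer senones than the 6138 above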
Hi,
Can I use feature type "1s_12c_12d_3p_12dd" in the pocketsphinx_kws project?
asm
Yes, add -feat 1s_12c_12d_3p_12dd to the command line.
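For instance (a sketch; the model path, keyphrase, and audio file are placeholders, and the -feat value has to match the feature type the acoustic model was trained with):

    pocketsphinx_continuous -hmm /path/to/model -feat 1s_12c_12d_3p_12dd \
        -keyphrase "hello world" -kws_threshold 1e-20 -infile test.wav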
Hi,
What happens if a triphone required for a keyword is not present in the model (mdef)? Does pocketsphinx find the nearest triphone or use CI phones?
Where in the code can I get to see this behavior?
regards
asm
There is a backoff to different word positions and to the SIL context too. You can find the details in the function
bin_mdef_phone_id_nearest(bin_mdef_t * m, int32 b, int32 l, int32 r, int32 pos)
in bin_mdef.c
Hi,
I am observing that audio spoken as "good day sunshine" is being detected as "good day". Is this normal?
regards
asm
What do you expect? If you told it to detect "good day" and it detects "good day", that is correct. If you don't want to detect "good day", don't add it.
Is there anything to be done to properly detect the right keyword?
The current implementation won't let you deal with a keyphrase and a sub-keyphrase: once "good day" is detected, all keyphrase propagations are reset. If you want to detect both "good day" and "good day sunshine" and discriminate between them, you might want to modify the algorithm to introduce a one-second delay in detection, to decide which phrase was actually detected.
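If you do want to watch for both phrases at once, they can be listed in a keyphrase file passed with the -kws option, one phrase per line with its detection threshold between slashes (a sketch assuming a pocketsphinx version whose -kws file supports per-phrase thresholds; the threshold values are illustrative). Note that this alone does not remove the sub-keyphrase reset described above:

    good day /1e-20/
    good day sunshine /1e-30/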
Hi,
I am trying to train a model with veclen=12 using the sphinxtrain program. It works fine with veclen=13 (the default). When I change it to 12, I get the following messages from the VQ k-means clustering module.
INFO: feat.c(713): Initializing feature stream to type: '1s_c_d', ceplen=12, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..11]= 0.0
INFO: main.c(520): No mdef files. Assuming 1-class init
INFO: main.c(1345): 1-class dump file
INFO: main.c(1383): Corpus 0: sz==1 frames
INFO: main.c(1392): Convergence ratios are abs(cur - prior) / abs(prior)
INFO: main.c(236): alloc'ing 0Mb obs buf
SYSTEM_ERROR: "main.c", line 263: Can't read dump file
It seems that the variable sz is set to 1 rather than 1396024, as in the case of veclen=13. The buffer size is also being set to 0 here.
Is there anything else I need to do other than setting veclen = 12 in the sphinx_train.cfg file?
regards
asm
There was an error in the previous agg_seg step, which must create the dump file; probably it just crashed. You can check the agg_seg log for details.
Yes, here is what I get in the agg_seg log:
Current configuration:
[NAME] [DEFLT] [VALUE]
-agc none none
-agcthresh 2.0 2.000000e+00
-cachesz 200 200
-cb2mllrfn .1cls. .1cls.
-cepdir /mnt/hgfs/Voice_trigger/speechcorpus/timit/feat
-cepext mfc mfc
-ceplen 13 12
-cmn current current
-cmninit 8.0 8.0
-cntfn
-ctlfn /mnt/hgfs/Voice_trigger/speechcorpus/timit/etc/timit_train.fileids
-dictfn
-example no no
-fdictfn
-feat 1s_c_d_dd 1s_c_d
-help no no
-lda
-ldadim 0 0
-lsnfn
-mllrctlfn
-mllrdir
-moddeffn
-npart 0
-nskip 0 0
-part 0
-runlen -1 -1
-segdir
-segdmpdirs /mnt/hgfs/Voice_trigger/speechcorpus/timit/bwaccumdir/timit_buff_1,
-segdmpfn /mnt/hgfs/Voice_trigger/speechcorpus/timit/bwaccumdir/timit_buff_1/timit.dmp
-segext v8_seg v8_seg
-segidxfn
-segtype st all
-sentdir
-sentext
-stride 1 1
-svspec
-ts2cbfn
-varnorm no no
INFO: main.c(169): No lexical transcripts provided
INFO: feat.c(713): Initializing feature stream to type: '1s_c_d', ceplen=12, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..11]= 0.0
INFO: corpus.c(1086): Will process all remaining utts starting at 0
INFO: main.c(288): Will produce feature dump
INFO: main.c(427): Writing frames to one file
FATAL_ERROR: "corpus.c", line 1368: Expected mfcc vector len of 12, got 3 (3783)
It seems the value 3783 is a multiple of 13 (291 × 13), yet I have set a vector length of 12.
asm
So it looks like it missed some parameter when extracting the features, probably -ncep 12. We recently added a fix for that in sphinxtrain, but you can just check whether the feature extraction log contains -ncep 12.
By the way, for semi-continuous models it's better to train cepstra and deltas as separate streams. For that you need to add CFG_SVSPEC='0-11/12-23' to sphinx_train.cfg.
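In sphinx_train.cfg terms, the combination discussed here would look something like this (a sketch; with recent sphinxtrain the -ncep value passed to feature extraction is derived from the vector length, which is what the fix mentioned above addresses):

    $CFG_VECTOR_LENGTH = 12;      # 12 cepstra instead of the default 13
    $CFG_SVSPEC = '0-11/12-23';   # cepstra and deltas as separate streams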
Thanks Nikolay,
One more question on feature extraction.
Can I change the feature extraction parameters, like FFT size, window size, shift, etc., for training the acoustic model? Where do I have to make such changes?
regards
asm
Yes, you can: in the make_feats.pl script, which invokes sphinx_fe to compute the features.
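For reference, the sphinx_fe command that make_feats.pl generates looks roughly like this (a sketch with placeholder paths; -wlen, -nfft, and -ncep are standard sphinx_fe options):

    sphinx_fe -c etc/train.fileids -di wav -ei wav -do feat -eo mfc \
        -samprate 16000 -wlen 0.015 -nfft 256 -ncep 13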
Hi,
I made a change to make_feats.pl on line 117: I changed -wlen => 0.015 and -nfft => 256. After I ran sphinxtrain, the model mean and variance parameters were identical to the original model (i.e. the one without -nfft 256). The original model has an FFT size of 512 and a wlen of 0.025625.
Are there any possible mistakes on my side?
regards
asm
Sorry, I forgot to ask: how can I run only the feature extraction stage with sphinxtrain? I assume I need to run sphinxtrain -s <stage1>, but what should "stage1" be replaced with?
regards
asm
If you modified the script in the sources, you probably forgot to run make install; sphinxtrain uses the scripts from the installed location. You can find details on which feature extraction parameters were used in the logdir/001.comp_feat logs.
sphinxtrain -s comp_feat run
You can find the list of stages in the tutorial as well as in the sphinxtrain/scripts folder.
Hi,
If my keyword consists of multiple words, e.g. "hello all world", can it be used in pocketsphinx?
Should my dictionary contain one entry for "hello all world" => (phonemes), or three separate words: hello => (phonemes), all => (phonemes), world => (phonemes)?
How should silences between the words be handled in such a case?
regards
asm
Yes, it can.
Three separate words.
Inner silences shouldn't be a problem.
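Putting it together, a sketch of what the dictionary would contain (CMUdict-style pronunciations, shown here without stress marks):

    hello HH AH L OW
    all AO L
    world W ER L D

The keyphrase is then given to the decoder as the full string, e.g. -keyphrase "hello all world".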