Hi,
I am a speech technology enthusiast.
I obtained the pocketsphinx keyword spotting software and built it in MSVC. It is quite a nice application. Before I dig deep into the KWS code, I would like to do some background reading on keyword spotting, particularly the techniques used in pocketsphinx. Can someone point me to good papers (journal articles etc.) or docs which describe such techniques? Or any doc which describes the pocketsphinx keyword spotting logic?
regards
asm
You can check:
"Acoustic Keyword Spotting in Speech with Applications to Data Mining" (Albert Thambiratnam's thesis)
http://eprints.qut.edu.au/37254/1/Albert_Thambiratnam_Thesis.pdf
The implementation is in pocketsphinx/src/libpocketsphinx/kws_search.c
Thanks for the very nice document. Really helpful.
I was wondering if I can process one frame's worth of speech at a time for the KWS task, that is, buffer_size = frame_size in every call to ps_process_raw(..). Will it impact VAD performance (hangover parameters etc.) in any way, and/or the performance of keyword spotting (Viterbi search etc.)?
regards
asm
Yes, you can. The decoder buffers audio internally and processes it frame by frame, so the chunk size per call does not affect the search results.
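For illustration, here is a minimal frame-at-a-time loop, a sketch assuming the pocketsphinx 5prealpha C API; the model paths, keyphrase, and audio file name are placeholders:

    /* Sketch: feed one 10 ms frame shift (160 samples at 16 kHz)
     * per ps_process_raw() call. Paths and keyphrase are placeholders. */
    #include <stdio.h>
    #include <pocketsphinx.h>

    int main(void)
    {
        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
            "-hmm", "/path/to/en-us",          /* acoustic model directory */
            "-dict", "/path/to/cmudict.dict",  /* pronunciation dictionary */
            "-keyphrase", "good day",
            "-kws_threshold", "1e-20",
            NULL);
        ps_decoder_t *ps = ps_init(config);
        FILE *fh = fopen("audio.raw", "rb");   /* 16 kHz, 16-bit mono raw */
        int16 buf[160];                        /* one 10 ms frame shift */
        size_t nread;

        ps_start_utt(ps);
        while ((nread = fread(buf, sizeof(int16), 160, fh)) > 0) {
            ps_process_raw(ps, buf, nread, FALSE, FALSE);
            if (ps_get_hyp(ps, NULL) != NULL) {  /* keyphrase was spotted */
                printf("keyphrase detected\n");
                ps_end_utt(ps);                  /* reset the search */
                ps_start_utt(ps);                /* and keep listening */
            }
        }
        ps_end_utt(ps);
        fclose(fh);
        ps_free(ps);
        cmd_ln_free_r(config);
        return 0;
    }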
Hi,
I am trying to understand some concepts in pocketsphinx kws.
Are phone-loop HMMs similar to filler HMMs, that is, are they used as the non-keyword model?
What role do non-keyword triphone models play in keyword spotting? It seems that pocketsphinx loads all the triphones.
thanks
amit
In pocketsphinx the non-keyword model is a phone loop that contains all the filler HMMs and the CI (context-independent) phones.
They do nothing, but they need to be around in case one would like to switch to another search.
Hi,
I am currently experimenting with the en-us-semi models available from the sphinx website. These models use 6138 senones and 512 mixtures per senone. I would like to know if models of lower complexity are available, for example with 32/64/128 mixtures.
thanks
asm
You can train such models yourself from an existing speech database like TEDLIUM.
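For reference, a sketch of the settings that control model complexity in etc/sphinx_train.cfg (variable names as in stock sphinxtrain; the values are illustrative, not tuned):

    # semi-continuous model with a smaller codebook and fewer senones
    $CFG_HMM_TYPE = '.semi.';
    $CFG_INITIAL_NUM_DENSITIES = 128;  # for semi-continuous, equal to final
    $CFG_FINAL_NUM_DENSITIES = 128;    # e.g. 128 mixtures instead of 512
    $CFG_N_TIED_STATES = 1000;         # fewer senones than the 6138 above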
Hi,
Can I use feature type "1s_12c_12d_3p_12dd" in the pocketsphinx_kws project?
asm
Yes, add -feat 1s_12c_12d_3p_12dd to the command line.
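For instance (a sketch; the model path, keyphrase, and audio file are placeholders, and the -feat value has to match the feature type the acoustic model was trained with):

    pocketsphinx_continuous -hmm /path/to/model -feat 1s_12c_12d_3p_12dd \
        -keyphrase "hello world" -kws_threshold 1e-20 -infile test.wav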
Hi,
What happens if a triphone required for a keyword is not present in the model (mdef)? Does pocketsphinx find the nearest triphone or use CI phones?
Where in the code can I get to see this behavior?
regards
asm
There is a backoff to different word positions and to the SIL context too. You can find the details in the function
bin_mdef_phone_id_nearest(bin_mdef_t * m, int32 b, int32 l, int32 r, int32 pos)
in bin_mdef.c
Hi,
I am observing that audio spoken as "good day sunshine" is being detected as "good day". Is this normal?
regards
asm
What do you expect? If you told it to detect "good day" and it detects "good day", that is correct. If you don't want to detect "good day", don't add it.
Is there anything to be done to properly detect the right keyword?
The current implementation won't let you deal with a keyphrase and a sub-keyphrase: once "good day" is detected, all keyphrase propagations are reset. If you want to detect both "good day" and "good day sunshine" and discriminate between them, you might want to modify the algorithm to introduce a one-second delay in detection, to decide which phrase was actually detected.
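If you do want to watch for both phrases at once, they can be listed in a keyphrase file passed with the -kws option, one phrase per line with its detection threshold between slashes (a sketch assuming a pocketsphinx version whose -kws file supports per-phrase thresholds; the threshold values are illustrative). Note that this alone does not remove the sub-keyphrase reset described above:

    good day /1e-20/
    good day sunshine /1e-30/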
Hi,
I am trying to train a model with veclen=12 using the sphinxtrain program. It works fine with veclen=13 (the default). When I change it to 12, I get the following messages from the VQ k-means clustering module.
INFO: feat.c(713): Initializing feature stream to type: '1s_c_d', ceplen=12, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..11]= 0.0
INFO: main.c(520): No mdef files. Assuming 1-class init
INFO: main.c(1345): 1-class dump file
INFO: main.c(1383): Corpus 0: sz==1 frames
INFO: main.c(1392): Convergence ratios are abs(cur - prior) / abs(prior)
INFO: main.c(236): alloc'ing 0Mb obs buf
SYSTEM_ERROR: "main.c", line 263: Can't read dump file
It seems that the variable sz is set to 1 rather than 1396024, as in the case of veclen=13. The buffer size is also being set to 0 here.
Is there anything else I need to do other than setting veclen = 12 in the sphinx_train.cfg file?
regards
asm
There was an error in the previous agg_seg step, which must create the dump file; probably it just crashed. You can check the agg_seg log for details.
Yes, here is what I get in the agg_seg log:
Current configuration:
[NAME] [DEFLT] [VALUE]
-agc none none
-agcthresh 2.0 2.000000e+00
-cachesz 200 200
-cb2mllrfn .1cls. .1cls.
-cepdir /mnt/hgfs/Voice_trigger/speechcorpus/timit/feat
-cepext mfc mfc
-ceplen 13 12
-cmn current current
-cmninit 8.0 8.0
-cntfn
-ctlfn /mnt/hgfs/Voice_trigger/speechcorpus/timit/etc/timit_train.fileids
-dictfn
-example no no
-fdictfn
-feat 1s_c_d_dd 1s_c_d
-help no no
-lda
-ldadim 0 0
-lsnfn
-mllrctlfn
-mllrdir
-moddeffn
-npart 0
-nskip 0 0
-part 0
-runlen -1 -1
-segdir
-segdmpdirs /mnt/hgfs/Voice_trigger/speechcorpus/timit/bwaccumdir/timit_buff_1,
-segdmpfn /mnt/hgfs/Voice_trigger/speechcorpus/timit/bwaccumdir/timit_buff_1/timit.dmp
-segext v8_seg v8_seg
-segidxfn
-segtype st all
-sentdir
-sentext
-stride 1 1
-svspec
-ts2cbfn
-varnorm no no
INFO: main.c(169): No lexical transcripts provided
INFO: feat.c(713): Initializing feature stream to type: '1s_c_d', ceplen=12, CMN='current', VARNORM='no', AGC='none'
INFO: cmn.c(142): mean[0]= 12.00, mean[1..11]= 0.0
INFO: corpus.c(1086): Will process all remaining utts starting at 0
INFO: main.c(288): Will produce feature dump
INFO: main.c(427): Writing frames to one file
FATAL_ERROR: "corpus.c", line 1368: Expected mfcc vector len of 12, got 3 (3783)
It seems the value 3783 is a multiple of 13 (291 × 13), yet I have set a vector length of 12.
asm
So it looks like it missed some parameter when extracting the features, probably -ncep 12. We recently added a fix for that in sphinxtrain, but you can just check whether the feature extraction log contains -ncep 12.
By the way, for semi-continuous models it's better to train cepstra and deltas as separate streams. For that you need to add CFG_SVSPEC='0-11/12-23' to sphinx_train.cfg.
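In sphinx_train.cfg terms, the combination discussed here would look something like this (a sketch; with recent sphinxtrain the -ncep value passed to feature extraction is derived from the vector length, which is what the fix mentioned above addresses):

    $CFG_VECTOR_LENGTH = 12;      # 12 cepstra instead of the default 13
    $CFG_SVSPEC = '0-11/12-23';   # cepstra and deltas as separate streams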
Thanks Nikolay,
One more question on feature extraction.
Can I change the feature extraction parameters, like FFT size, window size, shift, etc., for training the acoustic model? Where do I have to make such changes?
regards
asm
Yes, you can: in the make_feats.pl script, which invokes sphinx_fe to compute the features.
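For reference, the sphinx_fe command that make_feats.pl generates looks roughly like this (a sketch with placeholder paths; -wlen, -nfft, and -ncep are standard sphinx_fe options):

    sphinx_fe -c etc/train.fileids -di wav -ei wav -do feat -eo mfc \
        -samprate 16000 -wlen 0.015 -nfft 256 -ncep 13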
Hi,
I made a change to make_feats.pl on line 117: I changed -wlen => 0.015 and -nfft => 256. After I ran sphinxtrain, the model mean and variance parameters were identical to the original model (i.e. the one without -nfft 256). The original model has an FFT size of 512 and a wlen of 0.025625.
Are there any possible mistakes on my side?
regards
asm
Sorry, I forgot to ask: how can I run only the feature extraction stage with sphinxtrain? I assume I need to run sphinxtrain -s <stage1>, but what should "stage1" be replaced with?
regards
asm
If you modified the script in the sources, you probably forgot to run make install; sphinxtrain uses the scripts from the installed location. You can find details on which feature extraction parameters were used in the logdir/001.comp_feat logs.
sphinxtrain -s comp_feat run
You can find the list of stages in the tutorial as well as in the sphinxtrain/scripts folder.
Hi,
If my keyword consists of multiple words, e.g. "hello all world", can it be used in pocketsphinx?
Should my dictionary contain one entry for "hello all world" => (phonemes), or three separate words: hello => (phonemes), all => (phonemes), world => (phonemes)?
How should silences between the words be handled in such a case?
regards
asm
Yes, it can.
Three separate words.
Inner silences shouldn't be a problem.
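Putting it together, a sketch of what the dictionary would contain (CMUdict-style pronunciations, shown here without stress marks):

    hello HH AH L OW
    all AO L
    world W ER L D

The keyphrase is then given to the decoder as the full string, e.g. -keyphrase "hello all world".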