docs on kws with pocketsphinx

asm
2014-07-29
2014-12-16
(Page 1 of 3)
  • asm

    asm - 2014-07-29

    Hi,

    I am a speech technology enthusiast.
    I obtained the pocketsphinx keyword spotting software and built it in MSVC. It is quite a nice application. Before I dig deep into the kws software, I would like to do some background reading on keyword spotting, particularly on the techniques used in pocketsphinx. Can someone point me to good papers (journal articles etc.) or docs which describe such techniques? Or any doc which describes the pocketsphinx keyword spotting logic?

    regards
    asm

     
    • Nickolay V. Shmyrev

      You can check

      ACOUSTIC KEYWORD SPOTTING IN SPEECH WITH APPLICATIONS TO DATA MINING
      http://eprints.qut.edu.au/37254/1/Albert_Thambiratnam_Thesis.pdf

      The implementation is in pocketsphinx/src/libpocketsphinx/kws_search.c

       
      • asm

        asm - 2014-07-31

        Thanks for the very nice document. Really helpful.

        I was wondering if I can process one frame's worth of speech at a time for the kws task, that is, buffer_size = frame_size in every call to ps_process_raw(..). Will it impact VAD performance (hangover parameters etc.) in any way, and/or the performance of keyword spotting (Viterbi search etc.)?

        regards
        asm

         
        • Nickolay V. Shmyrev

          I was wondering if I can process one frame's worth of speech at a time for the kws task.

          Yes, you can.
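
A minimal sketch of such a frame-at-a-time loop. Only the buffer splitting is real here; the ps_process_raw call is shown in a comment (its exact arguments are an assumption from the pocketsphinx API, not from this thread) so the sketch stays self-contained:

```c
#include <stddef.h>
#include <stdint.h>

/* Feed audio to the decoder one frame at a time.  With a 16 kHz
 * sample rate and the default 10 ms frame shift, one frame is 160
 * samples.  Returns the number of ps_process_raw() calls that would
 * be made. */
size_t process_by_frame(const int16_t *buf, size_t n_samples, size_t frame_size)
{
    size_t calls = 0;
    for (size_t off = 0; off < n_samples; off += frame_size) {
        size_t n = n_samples - off;
        if (n > frame_size)
            n = frame_size;
        /* In a real program (assumed pocketsphinx call):
         *   ps_process_raw(ps, buf + off, n, FALSE, FALSE); */
        (void)buf;
        (void)n;
        calls++;
    }
    return calls;
}
```

The last chunk may be shorter than a frame; the decoder buffers leftover samples internally, so that is fine.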

           
  • asm

    asm - 2014-08-08

    Hi,

    I am trying to understand some concepts in pocketsphinx kws.

    Are phone loop HMMs similar to filler HMMs, that is, are they used as the non-keyword model?
    What role do non-keyword triphone models play in keyword spotting? It seems that pocketsphinx loads all the triphones.

    thanks
    amit

     
  • bic-user

    bic-user - 2014-08-08

    Are phone loop HMMs similar to filler HMMs, that is, are they used as the non-keyword model?

    In pocketsphinx the non-keyword model is a phone loop that contains all the filler HMMs and the CI (context-independent) phones.

    What role do non-keyword triphone models play in keyword spotting? It seems that pocketsphinx loads all the triphones.

    They do nothing, but they need to be present in case one would like to switch to another search.

     
  • asm

    asm - 2014-08-21

    Hi,

    I am currently experimenting with the en-us-semi models available from the sphinx website. These models use 6138 senones and 512 mixtures per senone. I would like to know if models of lower complexity are available, for example with 32/64/128 mixtures.

    thanks
    asm

     
  • Nickolay V. Shmyrev

    I am currently experimenting with the en-us-semi models available from the sphinx website. These models use 6138 senones and 512 mixtures per senone. I would like to know if models of lower complexity are available, for example with 32/64/128 mixtures.

    You can train such models yourself from an existing speech database like TEDLIUM.

     
  • asm

    asm - 2014-08-25

    hi,

    Can I use feature type "1s_12c_12d_3p_12dd" in pocketsphinx_kws project?

    asm

     
    • Nickolay V. Shmyrev

      Yes, add -feat 1s_12c_12d_3p_12dd to the command line.
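
For example, with a keyphrase search (the model path and file names below are placeholders, not from this thread):

```shell
pocketsphinx_continuous \
    -feat 1s_12c_12d_3p_12dd \
    -hmm /path/to/acoustic/model \
    -kws keyphrase.list \
    -infile test.wav
```

Note that the feature type has to match the one the acoustic model was trained with.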

       
  • asm

    asm - 2014-09-02

    hi,

    What happens if a triphone required for a keyword is not present in the model (mdef)? Does pocketsphinx find the nearest triphone or use CI phones?
    Where in the code can I see this behavior?

    regards
    asm

     
  • Nickolay V. Shmyrev

    What happens if a triphone required for a keyword is not present in the model (mdef)? Does pocketsphinx find the nearest triphone or use CI phones?

    There is a backoff to different word positions and to the SIL context too. You can find the details in the function

    bin_mdef_phone_id_nearest(bin_mdef_t * m, int32 b, int32 l, int32 r, int32 pos)

    in bin_mdef.c
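
The backoff idea can be illustrated with a simplified sketch. This is an illustration of the lookup-with-fallback pattern only, not the actual bin_mdef code (the real function also tries different word positions); the phone table and names below are invented:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical triphone table: "l-b+r" entries plus CI phones. */
static const char *known[] = { "s-ih+k", "ih-k+s", "SIL-s+ih", "s", "ih", "k" };
#define N_KNOWN (sizeof(known) / sizeof(known[0]))

static int find_phone(const char *name)
{
    for (size_t i = 0; i < N_KNOWN; i++)
        if (strcmp(known[i], name) == 0)
            return (int)i;
    return -1;
}

/* Look up a triphone l-b+r; if it is absent, back off to the SIL
 * left context, and finally to the context-independent base phone. */
int phone_id_nearest(const char *l, const char *b, const char *r)
{
    char buf[64];
    int id;

    snprintf(buf, sizeof(buf), "%s-%s+%s", l, b, r);
    if ((id = find_phone(buf)) >= 0)
        return id;
    snprintf(buf, sizeof(buf), "SIL-%s+%s", b, r);
    if ((id = find_phone(buf)) >= 0)
        return id;
    return find_phone(b);   /* CI fallback */
}
```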

     
    • bic-user

      bic-user - 2014-11-18

      The current implementation won't let you deal with a keyphrase and a sub-keyphrase. Once 'good day' is detected, all keyphrase propagations are reset.

       
    • Nickolay V. Shmyrev

      I am observing that audio spoken as "good day sunshine" is being detected as "good day". Is this normal?

      What do you think? You told it to detect "good day" and it detects "good day". If you don't want to detect "good day", don't add it.

      Any thing to be done to properly detect the right keyword?

      If you want to detect both "good day" and "good day sunshine" and discriminate between them, you might want to modify the algorithm to introduce a one-second delay in detection to decide which phrase was actually detected.
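
A sketch of that delayed decision (the function name and the frame-index interface are invented for illustration; it only shows the "hold the short hit, prefer the long one if it arrives in time" idea):

```c
#include <stddef.h>

/* short_at / long_at: the frame at which each keyphrase fired, or -1
 * if it never fired.  If the longer phrase fires within `delay`
 * frames of the shorter one, prefer the longer phrase. */
const char *resolve(int short_at, int long_at, int delay,
                    const char *short_phrase, const char *long_phrase)
{
    if (short_at < 0)
        return long_at >= 0 ? long_phrase : NULL;
    if (long_at >= 0 && long_at - short_at <= delay)
        return long_phrase;
    return short_phrase;
}
```

At a 10 ms frame shift, a one-second delay corresponds to delay = 100 frames.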

       
  • asm

    asm - 2014-09-05

    Hi,

    I am trying to train a model with veclen=12 using the sphinxtrain program. It works fine for veclen=13 (the default). When I change it to 12, I get the following message in the VQ k-means clustering module.

    INFO: feat.c(713): Initializing feature stream to type: '1s_c_d', ceplen=12, CMN='current', VARNORM='no', AGC='none'
    INFO: cmn.c(142): mean[0]= 12.00, mean[1..11]= 0.0
    INFO: main.c(520): No mdef files. Assuming 1-class init
    INFO: main.c(1345): 1-class dump file
    INFO: main.c(1383): Corpus 0: sz==1 frames
    INFO: main.c(1392): Convergence ratios are abs(cur - prior) / abs(prior)
    INFO: main.c(236): alloc'ing 0Mb obs buf
    SYSTEM_ERROR: "main.c", line 263: Can't read dump file

    It seems that the variable sz is set to 1 rather than 1396024, as in the case of veclen=13. The buffer size is also being set to 0 here.

    Is there anything else I need to do other than setting veclen = 12 in the sphinx_train.cfg file?

    regards
    asm

     
    • Nickolay V. Shmyrev

      There was an error in the previous agg_seg step, which must create the dump file; probably it just crashed. You can check the agg_seg log for details.

       
  • asm

    asm - 2014-09-09

    Yes, here is what I get in the agg_seg log:

    Current configuration:
    [NAME] [DEFLT] [VALUE]
    -agc none none
    -agcthresh 2.0 2.000000e+00
    -cachesz 200 200
    -cb2mllrfn .1cls. .1cls.
    -cepdir /mnt/hgfs/Voice_trigger/speechcorpus/timit/feat
    -cepext mfc mfc
    -ceplen 13 12
    -cmn current current
    -cmninit 8.0 8.0
    -cntfn
    -ctlfn /mnt/hgfs/Voice_trigger/speechcorpus/timit/etc/timit_train.fileids
    -dictfn
    -example no no
    -fdictfn
    -feat 1s_c_d_dd 1s_c_d
    -help no no
    -lda
    -ldadim 0 0
    -lsnfn
    -mllrctlfn
    -mllrdir
    -moddeffn
    -npart 0
    -nskip 0 0
    -part 0
    -runlen -1 -1
    -segdir
    -segdmpdirs /mnt/hgfs/Voice_trigger/speechcorpus/timit/bwaccumdir/timit_buff_1,
    -segdmpfn /mnt/hgfs/Voice_trigger/speechcorpus/timit/bwaccumdir/timit_buff_1/timit.dmp
    -segext v8_seg v8_seg
    -segidxfn
    -segtype st all
    -sentdir
    -sentext
    -stride 1 1
    -svspec
    -ts2cbfn
    -varnorm no no

    INFO: main.c(169): No lexical transcripts provided
    INFO: feat.c(713): Initializing feature stream to type: '1s_c_d', ceplen=12, CMN='current', VARNORM='no', AGC='none'
    INFO: cmn.c(142): mean[0]= 12.00, mean[1..11]= 0.0
    INFO: corpus.c(1086): Will process all remaining utts starting at 0
    INFO: main.c(288): Will produce feature dump
    INFO: main.c(427): Writing frames to one file
    FATAL_ERROR: "corpus.c", line 1368: Expected mfcc vector len of 12, got 3 (3783)

    It seems the value 3783 is a multiple of 13, yet I have set a veclen of 12.

    asm

     
    • Nickolay V. Shmyrev

      So it looks like it missed some parameter when extracting features, probably -ncep 12. We recently added a fix for that in sphinxtrain, but you can just check the feature extraction log to see if it contains -ncep 12.

      Btw, for semi-continuous models it's better to train cepstra and deltas as separate streams. For that you need to add CFG_SVSPEC='0-11/12-23' to sphinx_train.cfg.
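
In sphinx_train.cfg (which is Perl syntax) that would look something like this; the comment is illustrative:

```perl
# Put cepstra (0-11) and deltas (12-23) into separate streams
# for the semi-continuous model.
$CFG_SVSPEC = '0-11/12-23';
```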

       
  • asm

    asm - 2014-09-09

    Thanks Nickolay,

    One more question on feature extraction.
    Can I change the feature extraction parameters, like FFT size, window size, shift, etc. for training the acoustic model? Where do I have to make such changes?

    regards
    asm

     
    • Nickolay V. Shmyrev

      Can I change the feature extraction parameters, like FFT size, window size, shift, etc. for training the acoustic model?

      Yes you can

      Where do i have to make such changes?

      In the make_feats.pl script, which invokes sphinx_fe to compute the features.
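
An illustrative sphinx_fe invocation with a shorter window and a smaller FFT (paths and file names are placeholders):

```shell
sphinx_fe \
    -wlen 0.015 \
    -nfft 256 \
    -samprate 16000 \
    -c etc/train.fileids \
    -di wav -ei wav \
    -do feat -eo mfc
```

The -nfft value must be at least the window length in samples (0.015 s at 16 kHz is 240 samples, so 256 works).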

       
  • asm

    asm - 2014-09-10

    hi,

    I made a change to make_feats.pl on line 117: I changed -wlen => 0.015 and -nfft => 256. After I ran sphinxtrain, the model mean and variance parameters were identical to those of the original model (i.e. without nfft 256). The original model has an fftsize of 512 and a wlen of 0.025625.

    Any possible mistakes from my side?

    regards
    asm

     
  • asm

    asm - 2014-09-10

    Sorry, I forgot to ask: how can I run only the feature extraction stage with sphinxtrain? I assume I will need to run sphinxtrain -s <stage1>. But what is to be replaced with "stage1"?

    regards
    asm

     
    • Nickolay V. Shmyrev

      I made a change to make_feats.pl on line 117: I changed -wlen => 0.015 and -nfft => 256. After I ran sphinxtrain, the model mean and variance parameters were identical to those of the original model (i.e. without nfft 256). The original model has an fftsize of 512 and a wlen of 0.025625.

      If you modified the script in the sources, you probably forgot to run make install; sphinxtrain uses the scripts from the installed location. You can find details on which feature extraction parameters were used in the logdir/001.comp_feat logs.

      Sorry, I forgot to ask: how can I run only the feature extraction stage with sphinxtrain? I assume I will need to run sphinxtrain -s <stage1>. But what is to be replaced with "stage1"?

      sphinxtrain -s comp_feat run

      You can find the list of stages in the tutorial as well as in the sphinxtrain/scripts folder.

       
  • asm

    asm - 2014-09-12

    Hi,

    If my keyword consists of multiple words, e.g. "hello all world", can it be used in pocketsphinx?

    Should my dictionary contain one word as "hello all world" => (phonemes), or three separate words as hello => (phonemes), all => (phonemes), world => (phonemes)?

    How do I handle silences between the words in such a case?

    regards
    asm

     
    • bic-user

      bic-user - 2014-09-12

      can it be used in pocketsphinx?

      yes

      Should my dictionary contain one word or three separate words?

      three separate words

      How to handle silences between the words in such case?

      Inner silences shouldn't be a problem.
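
To make that concrete, the dictionary and the keyphrase file might look like this (the phone strings follow CMUdict conventions; the detection threshold is a placeholder):

```
hello   HH AH L OW
all     AO L
world   W ER L D
```

and in the keyphrase list passed with -kws:

```
hello all world /1e-20/
```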

       
