docs on kws with pocketsphinx

asm
2014-07-29
2014-12-16
<< < 1 2 3 > >> (Page 2 of 3)
  • asm

    asm - 2014-09-24

    Hi,

    I would like to get some information about the data on which the en-us-semi model is trained. How many hours of data were used? Was it collected in a clean lab environment or a mobile environment? Was it trained on any standard dataset(s)?

    thanks
    asm

     
  • Nickolay V. Shmyrev

    How many hours of data is used?

    400 hours

    is it collected from clean lab environment, mobile environment?

    Mixed environments

    is it trained on any standard dataset(s)?

    As a close approximation you can use the TED-LIUM database.

     
  • asm

    asm - 2014-10-14

    How can I convert the sendump file generated by sphinxtrain-1.0.8 to text format? I am trying to use the printp executable as "printp -mixwfn sendump", but it throws an error:

    INFO: main.c(419): Reading sendumpand normalizing.
    ERROR: "s3io.c", line 164: No SPHINX-III file ID at beginning of file
    ERROR: "s3io.c", line 265: Error reading header for sendump

    What should I use to get the text version of sendump file?

    regards
    asm

     
    • Nickolay V. Shmyrev

      There is no such tool, unfortunately.

      There is sendump.py in sphinxtrain, which converts sendump to a mixture_weights file.

       
  • asm

    asm - 2014-10-28

    hi,

    There are two files for the generic English acoustic model available for download on the Sphinx website: en-us-semi and en-us-semi-full. I would like to know what the difference between the two models is.

    regards
    asm

     
    • Nickolay V. Shmyrev

      The full version includes the mixture_weights file, which is the uncompressed sendump file. It is useful for adaptation.

       
  • asm

    asm - 2014-11-03

    hi,

    I am doing some tests using the en-us-semi model and finding that its footprint is larger than my requirement allows. Is there any way I can trim down the number of mixtures (to, say, 64) in that model? Is it possible for the Sphinx team to produce en-us-semi models of various complexities?

    regards
    asm

     
    • Nickolay V. Shmyrev

      What is your requirement then?

       
  • asm

    asm - 2014-11-03

    Hi

    I would like to have en-us-semi model with 64 mixtures (instead of 512).

    thanks
    asm

     
    • Nickolay V. Shmyrev

      That would be very inaccurate. If you want to create a model of a specific size, you should provide the target size, along with information about the application you want to implement.

       
  • asm

    asm - 2014-11-03

    The project is low-memory word recognition, i.e. recognizing a couple of one- or two-word menu/navigation commands.
    The memory limit in my framework for means and variances is around 20 kB. My problem is that the means and variances are 80 kB each for the en-us-semi model. I will need a model of the same quality as en-us-semi with a lower number of mixtures to fit in the allowed memory space.

    What would my options be in such a scenario? Can I trim the existing en-us-semi model to 64 mixtures without affecting the accuracy? I do not have the data to train a model of en-us-semi quality myself.

    regards
    asm

     
    • Nickolay V. Shmyrev

      Means are a tiny part of the model: sendump / mixture weights are 3 MB compared to 40 kB of means. It makes much more sense to reduce the size of the mixture weights than the size of the means, so I don't quite follow you.

       
      • malei

        malei - 2017-11-14

        hi,

        I have run into a troublesome problem: zh_broadcastnews_16k_ptm256_8000.tar.bz2 is too large for my device, which has only 6 MB of flash. After the system image (about 4 MB), only 2 MB is left for the word recognition application. Can you give some advice on trimming the zh_broadcastnews_16k_ptm256_8000 acoustic model? Thank you very much!

         
  • asm

    asm - 2014-11-03

    Yes, you are right about the sendump; I will also have to look into that in the future. When deciding on each component of the model (means, variances, mixtures, etc.), I have a memory limit placed on it. Currently I can afford up to 10 kB for means and variances and roughly 1.5 MB for mixtures. I thought working on the means/variances would be more important, as the ratio of reduction is much larger (from 80 kB to 10 kB) compared to sendump (3 MB to 1 MB). I also figured out that 64 Gaussians would fit the requirement for means/variances.

    regards

     
    • Nickolay V. Shmyrev

      From 3 MB to 1 MB you save 2 MB. From 80 kB to 10 kB you save only 70 kB. You should probably think twice about that ;)

      I can train a PTM model for you which fits in 2 MB total; let me know if that works for you.
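      The byte accounting in this exchange can be checked with some rough arithmetic. The sketch below is a back-of-the-envelope estimate, not the actual model file layout: the 512-Gaussian codebook, 39-dimensional features, 6000 senones, 4-byte floats and 1-byte quantized mixture weights are assumptions chosen to match the sizes quoted in this thread.

      ```python
      def means_bytes(n_gauss, dim=39, float_bytes=4):
          """Means (or variances) file: one dim-dimensional vector per Gaussian."""
          return n_gauss * dim * float_bytes

      def mixw_bytes(n_senones, n_gauss, weight_bytes=1):
          """Quantized mixture weights (sendump): one weight per senone per Gaussian."""
          return n_senones * n_gauss * weight_bytes

      print(means_bytes(512))                    # 79872   -> ~80 kB, as reported above
      print(means_bytes(64))                     # 9984    -> ~10 kB, the requested budget
      print(means_bytes(512) - means_bytes(64))  # 69888   -> only ~70 kB saved
      print(mixw_bytes(6000, 512))               # 3072000 -> ~3 MB, the dominant component
      ```

      Shrinking the codebook from 512 to 64 Gaussians saves under 70 kB, while the mixture weights dominate at about 3 MB, which is why trimming the sendump side first gives far larger savings.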

       
  • asm

    asm - 2014-11-03

    Yes, that's quite right about the savings.

    I am considering sendump a configurable parameter in the proto, and the means/variances a fixed component in my design.

    The configurable information will be taken care of by the app layer, whereas the fixed parameters (means/variances) will be coded into the lib section.

    I am focusing on the fixed section, which includes the Gaussian parameters, so I need the en-us-semi model with 64 Gaussian parameters. The sendump can stay at the same complexity (6000 senones) it currently has, since it is configurable in my design anyway.

    thanks for offering to train.

    asm

     
  • asm

    asm - 2014-11-07

    Hi

    How do I use multiple pronunciations in a dictionary? Is the following correct for the word hello?

    hello HH AH L OW
    hello(2) HH AE L OW

    Are the triphone HMMs for all these variants connected in parallel when they are present in the dictionary?

    regards
    asm

     
  • Nickolay V. Shmyrev

    How do I use multiple pronunciations in a dictionary? Is the following correct for the word hello?

    Yes

    Are the triphone HMMs for all these variants connected in parallel when they are present in the dictionary?

    Yes

     
  • asm

    asm - 2014-11-10

    Hi,

    My keyword is "hello world". As I said before, I have two pronunciations of the word "hello" in the dictionary:

    hello HH AH L OW
    hello(2) HH AE L OW

    The dictionary stores these words as dict->word[0] ("hello") and dict->word[1] ("hello(2)") respectively, with the associated ciphone entries.

    I run the test with -kws "hello world".

    I am observing that, in the function kws_search_reinit(..), the id for the word "hello" is picked up from the dictionary as follows:

    wid = dict_wordid(dict, wrdptr[i]);

    and then all the phones related to the first entry of "hello" are picked up and linked. The question is: what happens to the alternative entry of hello, i.e. hello(2)?

    Am I specifying the keyword correctly? Is there any setting to make PocketSphinx consider the alternative pronunciation?

    regards
    asm

     
    • bic-user

      bic-user - 2014-11-10

      Hi, asm.
      It seems you're using a slightly outdated PocketSphinx. In the current version you can provide a file with a list of keyphrases via "-kws", or a single keyphrase with "-keyphrase".
      Alternative pronunciations are currently ignored in kws_search. It would be greatly appreciated if you could provide a patch to fix this omission. Join the #cmusphinx IRC channel to find help on implementation issues.
      The other way is to specify two keyphrases, each having the special pronunciation case as a separate word.
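      That workaround could look like the following sketch, which is a guess at concrete file contents rather than a tested setup: each pronunciation variant gets its own base dictionary word (the name "hello2" is hypothetical), and the "-kws" file lists one keyphrase per variant, each line optionally followed by a detection threshold between slashes.

      ```text
      dictionary (e.g. my.dict):
      hello   HH AH L OW
      hello2  HH AE L OW
      world   W ER L D

      keyphrase list passed via -kws (thresholds are placeholders to tune):
      hello world /1e-20/
      hello2 world /1e-20/
      ```

      Either keyphrase then fires on the spoken phrase, covering both pronunciations; the application treats a detection of "hello2 world" the same as "hello world".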

       
  • asm

    asm - 2014-11-11

    Oh, I didn't realize that. Thanks for the pointers.

    One more issue: when I use the keyword "good day" and the recordings actually contain "mood day", it is detected as the keyword. I am using the en-us-semi model for testing. Is it normal to confuse "good" with "mood"? What can I do to prevent/minimize such false alarms, other than playing with the threshold?

    regards
    asm

     
    • bic-user

      bic-user - 2014-11-11

      Is it normal to confuse "good" with "mood"?

      yes

      What can I do to prevent/minimize such false alarms, other than playing with the threshold?

      Not that much; the threshold should help.
      Though some papers suggest using words that are similar to the keywords as part of the garbage model. I.e., you can specify both "good day" and "mood day" for spotting and set appropriate thresholds for them to gain additional discrimination.
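      As a sketch, that decoy trick would just be a second entry in the keyphrase list; the threshold values here are hypothetical placeholders, not tuned numbers:

      ```text
      keyphrases.list: the decoy phrase competes with the real keyword
      good day /1e-20/
      mood day /1e-30/
      ```

      The application then acts only on detections of "good day" and discards hits on "mood day", which would otherwise have surfaced as false alarms on the real keyword.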

       
  • asm

    asm - 2014-11-11

    I am also observing that I can detect "good day" accurately when I speak it in a normal style. When there is a short (say, 1 s) silence between the words "good" and "day", detection fails.
    So how do I incorporate silences between the words for detection?

    One thing I am trying is adding SIL to words in the dictionary, as in SIL G UH D SIL. I tried adding SIL to either or both ends of the words, but it didn't improve the accuracy.

     
    • bic-user

      bic-user - 2014-11-11

      Share the audio you're trying to recognize (add "-rawlogdir /some/directory") and information on the models you use.

       
  • asm

    asm - 2014-11-12

    Hi,

    What is the significance of the word position (b, e, i, s) of a triphone for keyword spotting? Does PocketSphinx actually use this information while linking the triphones? Where in the code can I see that?

    If I need a triphone "x y z",i and my acoustic model only has "x y z",b and "x y z",e, what happens? Does PocketSphinx use the triphone at word position b or e instead of i? Is it OK in practice to make such a change in the mdef file directly (i.e. changing position b to i, if i is required and only b and e are present)?

    thanks
    asm

     
