
ps_continuous: inconsistent search times

margomaps
2012-02-28
2012-09-22
  • margomaps

    margomaps - 2012-02-28

    I'm using pocketsphinx_continuous from svn (r11331) on Ubuntu 11.10 as follows:

    % pocketsphinx_continuous -input mic -lm mymodel.lm -dict mydict.dic

    mymodel.lm and mydict.dic were generated from
    http://www.speech.cs.cmu.edu/tools/lmtool-new.html using a corpus file that
    has about 30 words. The intended use is as a command-and-control interface for
    a Qt GUI application. I've already adapted continuous.c into my Qt code and am
    able to use it successfully there. But the issue I will describe occurs in the
    pocketsphinx_continuous program as well, so I'll keep that the focus of the
    conversation for clarity.

    What I've noticed is that sometimes the response time is slow -- about 1.5
    seconds. Other times it is quite fast, returning a (correct) hypothesis almost
    instantly. There doesn't seem to be any in-between: it's either fast or
    (relatively) slow.

    Furthermore, I've noticed that if I speak a word into the mic immediately
    after the previous word is recognized and printed to the screen, the new word
    is almost always recognized quickly. Thus I can successfully have many words
    recognized quickly in succession if I immediately say the next word after
    pocketsphinx_continuous says "READY....", and before it says "Listening...."
    If I wait a moment (until "Listening..." appears or later) to speak, then most
    of the time -- but not always -- the recognition of the word takes ~ 1.5
    seconds.

    I discovered that the first call to ps_process_raw(...) in
    recognize_from_microphone() (right after ps_start_utt(...)) is the one
    that is taking up all the time when the recognition goes slowly. I've tested
    the speed perhaps hundreds of times, and ps_process_raw(...) either takes 5ms,
    1290ms +/- 5ms, or 1510ms +/- 5ms on my system. Those three values are
    encountered consistently unless the utterance is 2-3 words, in which case the
    speed is often a multiple of one of those, such as 2580ms or 3020ms. It's very
    consistent, which would indicate a deterministic issue rather than a random
    one.
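
For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of timing a single call with a monotonic clock. It is not taken from continuous.c: time_call_ms and elapsed_ms are hypothetical helper names, and in practice the callback would wrap the first ps_process_raw(...) call after ps_start_utt(...).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime under strict -std modes */
#endif
#include <time.h>

/* Milliseconds elapsed between two CLOCK_MONOTONIC readings. */
double elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    return (double)(end->tv_sec - start->tv_sec) * 1000.0
         + (double)(end->tv_nsec - start->tv_nsec) / 1e6;
}

/* Hypothetical helper: runs fn(arg) once and returns the wall-clock time
 * it took, in milliseconds. The callback is assumed to wrap the call under
 * test, for example the first ps_process_raw(...) of an utterance. */
double time_call_ms(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn(arg);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ms(&t0, &t1);
}
```

A monotonic clock is used because the interesting quantity here is wall-clock delay, which includes any time spent blocked, not just CPU time.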

    When recognizing the word "LINK" on two subsequent attempts, here are the INFO
    messages:

    INFO: cmn_prior.c(121): cmn_prior_update: from < 38.01  2.96  1.75 -0.04 -0.89 -0.18  0.45  0.12  0.22  0.24 -0.00 -0.46 -0.30 >
    INFO: cmn_prior.c(139): cmn_prior_update: to   < 38.11  3.05  1.79 -0.03 -0.91 -0.15  0.48  0.12  0.21  0.18 -0.02 -0.47 -0.27 >
    INFO: ngram_search_fwdtree.c(1549):      580 words recognized (8/fr)
    INFO: ngram_search_fwdtree.c(1551):    18932 senones evaluated (274/fr)
    INFO: ngram_search_fwdtree.c(1553):    10416 channels searched (150/fr), 1967 1st, 5960 last
    INFO: ngram_search_fwdtree.c(1557):      861 words for which last channels evaluated (12/fr)
    INFO: ngram_search_fwdtree.c(1560):      523 candidate words for entering last phone (7/fr)
    INFO: ngram_search_fwdtree.c(1562): fwdtree 0.02 CPU 0.029 xRT
    INFO: ngram_search_fwdtree.c(1565): fwdtree 1.35 wall 1.957 xRT
    INFO: ngram_search_fwdflat.c(305): Utterance vocabulary contains 9 words
    INFO: ngram_search_fwdflat.c(940):      138 words recognized (2/fr)
    INFO: ngram_search_fwdflat.c(942):    12064 senones evaluated (175/fr)
    INFO: ngram_search_fwdflat.c(944):     9371 channels searched (135/fr)
    INFO: ngram_search_fwdflat.c(946):      615 words searched (8/fr)
    INFO: ngram_search_fwdflat.c(948):      423 word transitions (6/fr)
    INFO: ngram_search_fwdflat.c(951): fwdflat 0.01 CPU 0.017 xRT
    INFO: ngram_search_fwdflat.c(954): fwdflat 0.01 wall 0.016 xRT
    INFO: ngram_search.c(1214): </s> not found in last frame, using LINK.67 instead
    INFO: ngram_search.c(1266): lattice start node <s>.0 end node LINK.25
    INFO: ngram_search.c(1294): Eliminated 12 nodes before end node
    INFO: ngram_search.c(1399): Lattice has 33 nodes, 17 links
    INFO: ps_lattice.c(1365): Normalizer P(O) = alpha(LINK:25:67) = -378589
    INFO: ps_lattice.c(1403): Joint P(O,S) = -378735 P(S|O) = -146
    INFO: ngram_search.c(888): bestpath 0.00 CPU 0.006 xRT
    INFO: ngram_search.c(891): bestpath 0.00 wall 0.001 xRT
    000000004: LINK
    READY....
    
    
    
    
    
    INFO: cmn_prior.c(121): cmn_prior_update: from < 38.11  3.05  1.79 -0.03 -0.91 -0.15  0.48  0.12  0.21  0.18 -0.02 -0.47 -0.27 >
    INFO: cmn_prior.c(139): cmn_prior_update: to   < 38.20  3.07  1.75 -0.06 -0.93 -0.13  0.52  0.06  0.20  0.10 -0.04 -0.46 -0.24 >
    INFO: ngram_search_fwdtree.c(1549):      556 words recognized (8/fr)
    INFO: ngram_search_fwdtree.c(1551):    18806 senones evaluated (261/fr)
    INFO: ngram_search_fwdtree.c(1553):     9800 channels searched (136/fr), 1927 1st, 5432 last
    INFO: ngram_search_fwdtree.c(1557):      813 words for which last channels evaluated (11/fr)
    INFO: ngram_search_fwdtree.c(1560):      523 candidate words for entering last phone (7/fr)
    INFO: ngram_search_fwdtree.c(1562): fwdtree 0.02 CPU 0.028 xRT
    INFO: ngram_search_fwdtree.c(1565): fwdtree 0.02 wall 0.029 xRT
    INFO: ngram_search_fwdflat.c(305): Utterance vocabulary contains 9 words
    INFO: ngram_search_fwdflat.c(940):      156 words recognized (2/fr)
    INFO: ngram_search_fwdflat.c(942):    11994 senones evaluated (167/fr)
    INFO: ngram_search_fwdflat.c(944):     7890 channels searched (109/fr)
    INFO: ngram_search_fwdflat.c(946):      572 words searched (7/fr)
    INFO: ngram_search_fwdflat.c(948):      405 word transitions (5/fr)
    INFO: ngram_search_fwdflat.c(951): fwdflat 0.01 CPU 0.011 xRT
    INFO: ngram_search_fwdflat.c(954): fwdflat 0.01 wall 0.015 xRT
    INFO: ngram_search.c(1266): lattice start node <s>.0 end node </s>.64
    INFO: ngram_search.c(1294): Eliminated 0 nodes before end node
    INFO: ngram_search.c(1399): Lattice has 37 nodes, 16 links
    INFO: ps_lattice.c(1365): Normalizer P(O) = alpha(</s>:64:70) = -386640
    INFO: ps_lattice.c(1403): Joint P(O,S) = -390168 P(S|O) = -3528
    INFO: ngram_search.c(888): bestpath 0.00 CPU 0.000 xRT
    INFO: ngram_search.c(891): bestpath 0.00 wall 0.000 xRT
    000000005: LINK
    READY....
    

    I noticed some differences in the first (slow) and second (fast) case,
    especially the fwdtree 1.35 wall vs 0.02 wall, which I guess indicates the
    area where things went slowly the first time. I also saw the bit about "</s>
    not found in last frame" on the slower attempt, but I haven't yet figured out
    the significance of that or whether it is related to the slower search.
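
One way to read the two logs: in the slow attempt fwdtree reports 0.02 s of CPU time but 1.35 s of wall time, which would be consistent with the decoder waiting on something (for example audio input) rather than computing. The sketch below pulls those figures out of the INFO lines so the gap can be checked automatically. parse_timing and blocked_seconds are hypothetical names, and the line layout is assumed to match the log excerpts above.

```c
#include <stdio.h>
#include <string.h>

/* Parses a timing line such as
 *   "INFO: ngram_search_fwdtree.c(1565): fwdtree 1.35 wall 1.957 xRT"
 * into pass name ("fwdtree"), seconds (1.35) and kind ("CPU" or "wall").
 * pass must hold at least 16 chars and kind at least 8.
 * Returns 1 on success, 0 if the line is not a timing line. */
int parse_timing(const char *line, char *pass, double *seconds, char *kind)
{
    const char *msg = strstr(line, "): ");  /* skip "INFO: file.c(NNN): " */

    if (msg == NULL)
        return 0;
    if (sscanf(msg + 3, "%15s %lf %7s", pass, seconds, kind) != 3)
        return 0;
    /* Only accept the two kinds that appear in the logs above. */
    return strcmp(kind, "CPU") == 0 || strcmp(kind, "wall") == 0;
}

/* Wall-clock seconds not accounted for by CPU time, clamped at zero. */
double blocked_seconds(double cpu_seconds, double wall_seconds)
{
    return wall_seconds > cpu_seconds ? wall_seconds - cpu_seconds : 0.0;
}
```

Applied to the two attempts above, the slow one leaves about 1.33 s unaccounted for in fwdtree, while the fast one leaves essentially none.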

    The accuracy of the recognition is outstanding, and it was easy to integrate
    into my Qt app -- it took maybe 1 hour. If I can get it to consistently
    recognize the commands quickly rather than a frustrating 1.5s delay, this will
    significantly enhance the user experience for my GUI. I would appreciate any
    hints on what might be causing the issues I'm seeing, or ideas on what I could
    do differently in order to achieve faster results for a command & control
    application.

    Thanks!

     
  • margomaps

    margomaps - 2012-02-28

    I should also mention that I'm using a smaller value than
    DEFAULT_SAMPLES_PER_SEC to determine the end of the utterance. I believe the
    default is 16000, which results in a required 1s of silence before the
    utterance is ended. I'm using 1000, which allows the utterance to end after
    1/16th of a second of silence (I think).
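
The arithmetic here can be checked with a small sketch, assuming the end-of-utterance threshold is a sample count compared against audio captured at 16000 samples per second, as described above. The function names are hypothetical and not part of continuous.c.

```c
/* Seconds of silence represented by a threshold expressed in samples. */
double silence_seconds(long threshold_samples, long sample_rate_hz)
{
    return (double)threshold_samples / (double)sample_rate_hz;
}

/* Inverse: the sample count needed for a desired silence duration in ms. */
long samples_for_silence_ms(long silence_ms, long sample_rate_hz)
{
    return silence_ms * sample_rate_hz / 1000;
}
```

Under that assumption, silence_seconds(16000, 16000) gives 1.0 and silence_seconds(1000, 16000) gives 0.0625, which is 1/16th of a second.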

     
  • margomaps

    margomaps - 2012-02-29

    A followup, just to save anyone from wasting any time on this. Somehow I
    missed the following message that appeared just before "READY...":

    Warning: Could not find Mic element

    I must have seen it before but forgotten about it/ignored it since the mic did
    appear to work properly. At any rate, when I was playing around with
    pocketsphinx_continuous yesterday, I made an accidental discovery: the
    recognition in my Qt app (based on continuous.c) was very fast when I already
    had an instance of the python/gstreamer demoapp.py (using the gstreamer
    pocketsphinx plugin) running.

    I haven't quite pieced together why two instances running at the same time
    would speed up the ps_process_raw function, but I suspected it had something
    to do with the state of the audio buffer for my mic. So I looked more closely
    at the error statements and that's when I noticed the "Could not find Mic"
    warning.

    A little more googling led me to try "-adcdev plughw:2,0" instead of "-input
    mic", and that was the key. Using -adcdev results in fast recognition with no
    more long delays. Fantastic!

    In case anyone is wondering, I did:

    % cat /proc/asound/cards

    to determine that my USB device was device #2. That's where the 2 comes from
    in plughw:2,0.
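
The lookup can also be done in code. The sketch below scans /proc/asound/cards for a card whose header line contains a given string and builds the matching plughw name. It assumes header lines look like " 2 [Device         ]: USB-Audio - USB PnP Sound Device" (an illustrative line, not copied from the poster's machine) and that device 0 on the card is the one wanted, as in the post. All function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Returns the card index if line is a card header that contains needle,
 * otherwise -1. Continuation lines do not start with a number and are
 * rejected by the sscanf check. */
int card_index_if_match(const char *line, const char *needle)
{
    int index;

    if (sscanf(line, " %d [", &index) != 1)
        return -1;
    if (strstr(line, needle) == NULL)
        return -1;
    return index;
}

/* Scans a cards file (normally /proc/asound/cards) and returns the index
 * of the first matching card, or -1 if none is found or the file cannot
 * be opened. */
int find_card(const char *path, const char *needle)
{
    char line[256];
    int found = -1;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    while (found < 0 && fgets(line, sizeof line, fp) != NULL)
        found = card_index_if_match(line, needle);
    fclose(fp);
    return found;
}

/* Writes "plughw:<card>,<device>" into out; returns 1 if it fit, else 0. */
int format_plughw(char *out, size_t out_size, int card, int device)
{
    int n = snprintf(out, out_size, "plughw:%d,%d", card, device);

    return n > 0 && (size_t)n < out_size;
}
```

The resulting string is what would be passed as the -adcdev argument, for example after calling find_card("/proc/asound/cards", "USB-Audio").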

    I hope this information is useful to someone who might run into the same
    issues I did.

    A big thanks to the authors of the software. It works flawlessly for my
    application. I might investigate the possibility of using partial results to
    anticipate and recognize a valid spoken command even before the user is done
    saying it, which would speed things up even more.

     
  • Nickolay V. Shmyrev

    > A big thanks to the authors of the software. It works flawlessly for my
    > application. I might investigate the possibility of using partial results to
    > anticipate and recognize a valid spoken command even before the user is done
    > saying it, which would speed things up even more.

    You are welcome. It's nice that your problem got resolved so fast, wish you
    even more adventures in the future.

     
  • margomaps

    margomaps - 2012-02-29

    Your wish came true, re: more adventures in the future.

    After getting sphinxbase and pocketsphinx built on my Mac, I had trouble using
    pocketsphinx_continuous with -input mic. Searching led me to
    https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/4005379,
    which I assume is still the case: no live input on OSX.

    Since OSX is one of the supported platforms for my Qt application, I'm
    motivated to figure out the feasibility/effort required to support live mic
    input on OSX. Do you have any suggestions on who I could talk to, where I
    could look in the code, and any other resources I might consider in attempting
    to add this feature? I'm happy to work on this myself, but I could use some
    advice on how to get started.

     
  • Nickolay V. Shmyrev

    I had code for it, but never really integrated it. The issue is that ad is
    not really flexible enough to support Mac CoreAudio. Check it here:

    http://dl.dropbox.com/u/26073448/mac-ad.tar.gz

     
