Continuous: nonstop talking OK, silence not

Halle
2010-04-03
2012-09-22
1 2 > >> (Page 1 of 2)
  • Halle

    Halle - 2010-04-03

    Hello,

    This is an odd one - I'm again trying to write an ad driver so I can do
    continuous recognition, and I have it working a little so that continuous mode
    recognizes words as long as the speaker never stops talking. If they stop
    talking, or if they aren't talking at the moment that continuous initializes,
    only silence is detected and this block of cont_ad_read() starts to repeat,
    long past the point that the speaker has started speaking again:

    if (seg == NULL) {
        assert(r->tail_state == CONT_AD_STATE_SIL);

        flen =
            (r->eof) ? r->n_frm : r->n_frm - (r->winsize + r->leader - 1);
        if (flen < 0)
            flen = 0;
    }
    

    Once this starts happening, it becomes progressively less likely that any
    speech will be detected, until after 20 to 40 loops of utterance_loop() it
    becomes completely unable to detect speech. If there is no noise while continuous is
    initializing, it will never become able to detect speech during the session.
    Every once in a while, it will emit this instead until memory runs out and it
    crashes:

    INFO: ngram_search.c(407): Resized backpointer table to 10000 entries
    INFO: ngram_search.c(407): Resized backpointer table to 20000 entries
    INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 3003, best score -534784990
    INFO: ngram_search.c(407): Resized backpointer table to 40000 entries
    INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 6067, best score -534667231
    INFO: ngram_search.c(407): Resized backpointer table to 80000 entries
    INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 9152, best score -534799192
    INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 12234, best score -534766424
    INFO: ngram_search.c(407): Resized backpointer table to 160000 entries
    INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 15316, best score -534743896
    INFO: ngram_search.c(415): Resized score stack to 200000 entries
    INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 18392, best score -534858449
    INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 21486, best score -534783832
    INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 24555, best score -534703150
    INFO: ngram_search.c(407): Resized backpointer table to 320000 entries
    INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 27647, best score -534717407
    INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 30718, best score -534769361
    

    Any hints about what to look at? I've changed the buffer size of my device to
    a few different values and futzed with the constants in cont_ad_base.c
    (changing CONT_AD_ADFRMSIZE to 290 fixes this issue about 2 runs out of 3, but
    it seems very random whether it helps or not -- also, when CONT_AD_ADFRMSIZE
    is 290 and it's one of the working runs, it detects non-silence every ~12
    loops whether there is really non-silence or not), so I'm starting to run out
    of ideas. I've changed several of the silence detection values without any
    good results yet, although I can easily believe that I missed something. Any
    advice is appreciated, and thank you. Relatedly, what buffer size for the
    audio device once it has begun recording is optimal for continuous?

     
  • Halle

    Halle - 2010-04-04

    Here is my configuration:

    Current configuration:
    [NAME]          [DEFLT]         [VALUE]
    -agc            none            none
    -agcthresh      2.0             2.000000e+00
    -alpha          0.97            9.700000e-01
    -ceplen         13              13
    -cmn            current         current
    -cmninit        8.0             8.0
    -dither         no              no
    -doublebw       no              no
    -feat           1s_c_d_dd       s2_4x
    -frate          100             100
    -input_endian   little          little
    -lda
    -ldadim         0               0
    -lifter         0               0
    -logspec        no              no
    -lowerf         133.33334       1.000000e+00
    -ncep           13              13
    -nfft           512             512
    -nfilt          40              20
    -remove_dc      no              yes
    -round_filters  yes             no
    -samprate       16000           1.600000e+04
    -seed           -1              -1
    -smoothspec     no              no
    -svspec
    -transform      legacy          dct
    -unit_area      yes             yes
    -upperf         6855.4976       4.000000e+03
    -varnorm        no              no
    -verbose        no              no
    -warp_params
    -warp_type      inverse_linear  inverse_linear
    -wlen           0.025625        2.562500e-02
    
     
  • Halle

    Halle - 2010-04-04

    When recognition is working, the value of max at the start of ad_read is
    always exactly 65536, and when the app is stuck and unable to recognize speech
    or crashing while "Renormalizing Scores", the value of max at the start of
    ad_read is always less than 65536.

     
  • Nickolay V. Shmyrev

    Hi

    I could see two reasons for this behavior:

    1. Wrong input format from your audio device; it might be the sampling rate or the byte order.
    2. Some changes you've made in cont_ad. It's not recommended to change anything there because the values
      depend on each other.

    To test the endpointer, please try the sphinx_ad_fileseg program with the test.wav data
    from sphinxbase/src/sphinx_adtools. You can probably compare the values from your
    input with this file and get an idea of what's wrong. Also please note that
    the calibration time should be rather long, about 5 seconds or so.

     
  • Nickolay V. Shmyrev

    Yes, and please try to get the cont_ad output with

    cont_ad_set_logfp(cont, stdout);

    It would be helpful to look at it.

     
  • Nickolay V. Shmyrev

    is always exactly 65536

    This indeed looks like a byte order issue.
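
    One way to test a byte-order theory like this is to swap the two bytes of every 16-bit sample before the buffer reaches cont_ad and compare the signal levels. This is a minimal sketch only; swap_samples_16 is a hypothetical helper, not a sphinxbase function:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of sphinxbase): swap the two bytes of each
 * 16-bit sample in place, converting between big- and little-endian. */
static void swap_samples_16(int16_t *buf, size_t n_samples)
{
    assert(buf != NULL || n_samples == 0);
    for (size_t i = 0; i < n_samples; i++) {
        uint16_t u = (uint16_t)buf[i];
        buf[i] = (int16_t)(uint16_t)((u >> 8) | (u << 8));
    }
}
```

    Silence stored in the wrong byte order tends to show up as large, noisy values (a sample of 0x0003 becomes 0x0300), so comparing the peak level of a quiet recording before and after the swap is usually enough to tell which order is right.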

     
  • Halle

    Halle - 2010-04-04

    Hi Nickolay,

    Thanks very much for your assistance. Here are links to the three logs showing
    the cont_ad_set_logfp output for the three different potential outcomes (they
    are long, of course):

    This one is for the success state, which is when there is uninterrupted
    speech:
    http://www.robot-commando.com/constant_speech_and_success.log.zip

    This one is for the first failure state, for when it just never gets to the
    point of "Listening...." but doesn't crash:
    http://www.robot-commando.com/failure_never_gets_to_listening_doesn't_crash.log.zip

    This is the second failure state, for when it gets to "Listening...." while
    there isn't constant speaking and then starts resizing scores until it runs
    out of memory and crashes:
    http://www.robot-commando.com/failure_silence_resizing_scores_then_crash.log.zip

    I have reset everything in cont_ad to its original values to remove any
    potential sources of confusion. Do you think there could be a byte order or
    sample rate issue with the audio format, but one that wouldn't prevent it from
    being able to recognize speech well when there is constant speaking? I tried
    setting a flag on the WAVE format that is being recorded that it should be
    bigendian, but that was an invalid format, so I set it to native endianness
    for the format. Or are you talking about byte order for a different area of
    the code?

     
  • Halle

    Halle - 2010-04-04

    Better URL for the middle log:

    This one is for the first failure state, for when it just never gets to the
    point of "Listening...." but doesn't crash:
    http://www.robot-commando.com/failure_never_gets_to_listening_doesnt_crash.log.zip

     
  • Halle

    Halle - 2010-04-04

    Also, just to rule this out - in my audio input, I am using a single buffer
    and it is about a half second in length - I've changed this around in every
    conceivable way (one buffer, three buffers, eight buffers, all kinds of
    different buffer sizes) and it's had almost no effect at all, but just to rule
    out one more thing, does one buffer of a half second sound like an OK value to
    you? I can also set the buffer size to be a byte size, but the reference
    example I used preferred a calculated timespan, so I stayed with that
    approach.

    My format is as follows:

    Linear PCM
    WAVE type
    Format is signed integer
    Format is packed
    Native format endianness
    Channels per frame = 1
    Bits per channel = 16
    Sample rate = 16000
    Bytes per packet = (bits per channel / 8) * channels per frame
    Bytes per frame = bytes per packet
    Frames per packet = 1

     
  • Halle

    Halle - 2010-04-04

    OK, I see now that WAVE shouldn't be bigendian, so that isn't the issue.

     
  • Nickolay V. Shmyrev

    Format is packed

    This looks suspicious. What kind of packing is it? You need to try to dump
    the recorded audio into a file. It would be easy to check then. You can also try
    the -rawlogdir option to dump audio.

     
  • Halle

    Halle - 2010-04-04

    I can submit WAV files recorded with the same settings into this function you
    posted, so I don't think the format is too far off if it has an issue:

    https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/3514523?message=7994617

    About the packedness, I suspect you are on to something. There are three
    options: packed means that the sample bits occupy the entire available bits
    for the channel, or there is the option of setting align high which will place
    the sample bits into the high bits of the channel, or align low which will
    place the sample bits into the low bits of the channel. That sounds related to
    the weirdness I'm experiencing if you say it sounds like it's related to
    ordering. What is kind of maddening is that if I change the packedness flag, I
    get an error every time. But I will keep looking at it. Do you think it should
    be align low or align high?

     
  • Halle

    Halle - 2010-04-04

    OK, here are a couple of raw audio files:

    http://robot-commando.com/Rawfiles.zip

    I don't actually have anything to listen to them with - what do you use to
    analyze them?
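
    Raw dumps like these have no header, so many players will not open them directly. One option is to prepend a 44-byte canonical WAVE header. This is a sketch assuming the raw data is 16-bit little-endian PCM in the format used in this thread; write_wav_header is a hypothetical helper:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store values in little-endian order, as the RIFF/WAVE format requires. */
static void put_le16(uint8_t *p, uint16_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)(v >> 8);
}

static void put_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v & 0xff);
    p[1] = (uint8_t)((v >> 8) & 0xff);
    p[2] = (uint8_t)((v >> 16) & 0xff);
    p[3] = (uint8_t)((v >> 24) & 0xff);
}

/* Hypothetical helper: fill hdr[44] with a canonical PCM WAVE header
 * describing data_bytes bytes of 16-bit audio. */
static void write_wav_header(uint8_t hdr[44], uint32_t data_bytes,
                             uint32_t sample_rate, uint16_t channels)
{
    const uint16_t bits = 16;
    const uint16_t block_align = (uint16_t)(channels * (bits / 8));

    assert(hdr != NULL);
    memcpy(hdr, "RIFF", 4);
    put_le32(hdr + 4, 36 + data_bytes);            /* RIFF chunk size */
    memcpy(hdr + 8, "WAVEfmt ", 8);
    put_le32(hdr + 16, 16);                        /* fmt chunk size */
    put_le16(hdr + 20, 1);                         /* format 1 = linear PCM */
    put_le16(hdr + 22, channels);
    put_le32(hdr + 24, sample_rate);
    put_le32(hdr + 28, sample_rate * block_align); /* bytes per second */
    put_le16(hdr + 32, block_align);
    put_le16(hdr + 34, bits);
    memcpy(hdr + 36, "data", 4);
    put_le32(hdr + 40, data_bytes);
}
```

    Writing these 44 bytes to a new file and appending the raw data unchanged should give a file that ordinary audio editors can open, which makes problems such as echo or a wrong sample rate easy to hear and see.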

     
  • Halle

    Halle - 2010-04-04

    (The files were output with the -rawlogdir argument).

     
  • Nickolay V. Shmyrev

    Looking inside, the format is good; the issue is actually that there is an
    echo about 120 ms long. It looks like you implemented sound input
    incorrectly. It would probably help to show your code.

     
  • Halle

    Halle - 2010-04-04

    Will that echo have also been there in the original recording, or is the sound
    at all processed in the course of the recognition routine before being output
    as raw?

     
  • Halle

    Halle - 2010-04-04

    OK, there is actually a lot of code for this and I need to check up on a
    number of things relating to the echo, but here is probably the most
    interesting thing, which is the contents of ad_read():

        UInt32 length = max; // I haven't had good results setting this to max * sizeof(int16)

        OSStatus status = AudioFileReadBytes(
            r->recorder->AudioFileID(),
            false,   // don't cache
            0,       // starting position - is zero correct here? starting at the current rec position doesn't work
            &length, // in is bytes to read, out is bytes actually read
            buf      // output
        );

        if (status == -39 && r->recording == 0) { // status -39 is EOF
            return AD_EOF;
        } else if (status != 0) { // status 0 is success, other possibilities are an EOF, a parameter error or something else
            if (status == -39 && r->recording == 1) {
                return 0;
            } else if (status == -50) {
                // rarely, a -50 (bad parameter) error is being returned here
            } else { // an unknown error, this isn't happening
                printf("status is %d", (int)status);
                return -1;
            }
        } else {
            return length;
        }

        return 0;
    
     
  • Halle

    Halle - 2010-04-04

    Ugh, sorry that code tag keeps not working.

     
  • Nickolay V. Shmyrev

    The issue I see here is that cont_ad passes you a number of samples (2 bytes
    each) and expects to get a number of samples back (not the number of bytes,
    but the number of bytes / 2). Please check that.
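
    A sketch of that unit conversion, assuming 16-bit samples and a byte-oriented reader underneath. The byte_reader_fn type, read_samples, and the in-memory stand-in source are all hypothetical names used for illustration, not sphinxbase or Core Audio APIs:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical byte-oriented reader: copies up to n_bytes into dst and
 * returns the number of bytes copied, or a negative value on error. */
typedef long (*byte_reader_fn)(void *ctx, void *dst, size_t n_bytes);

/* Read up to max_samples 16-bit samples through a byte-oriented reader and
 * return the count in samples, the unit the caller asked in. */
static int32_t read_samples(byte_reader_fn read_bytes, void *ctx,
                            int16_t *buf, int32_t max_samples)
{
    assert(read_bytes != NULL && buf != NULL && max_samples >= 0);
    long got = read_bytes(ctx, buf, (size_t)max_samples * sizeof(int16_t));
    if (got < 0)
        return -1;                                  /* pass the error up */
    return (int32_t)(got / (long)sizeof(int16_t));  /* bytes -> samples */
}

/* Stand-in source for demonstration: serves bytes from a memory block. */
struct mem_src { const uint8_t *data; size_t len, pos; };

static long mem_read_bytes(void *ctx, void *dst, size_t n_bytes)
{
    struct mem_src *s = ctx;
    size_t left = s->len - s->pos;
    size_t n = n_bytes < left ? n_bytes : left;
    memcpy(dst, s->data + s->pos, n);
    s->pos += n;
    return (long)n;
}
```

    The request is multiplied by the sample size on the way down and the result divided by it on the way back, so the caller never sees a byte count.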

     
  • Halle

    Halle - 2010-04-05

    Whoops, that's a big issue, thanks Nickolay. OK, I have gone back to the various
    ad implementation examples that use byte-reading functions and simplified my
    ad_read contents:

    UInt32 length = max * r->bps;

    printf("length before read (max * 2) is %d\n", (int)length);

    OSStatus status = AudioFileReadBytes(
        r->recorder->GetAudioFileID(), //queue
        false,   //cache?
        0,       // starting position is the record packet
        &length, //input is how many bytes to read, output is how many read
        buf
    );
    printf("length after read (actual bytes read) %d\n", (int)length);

    if (length > 0) {
        length /= r->bps;
    } else if (length < 0) { // length is a UInt32 (unsigned), so this branch can never run
        NSLog(@"status: %d", (int)status);
        return AD_ERR_GEN;
    } else {
        length = 0;
    }
    if ((length == 0) && (r->recording == 0)) {
        return AD_EOF;
    }
    printf("length on return (half bytes read) is %d\n\n\n", (int)length);
    return length;

    Now it is more stable but still not working. Do you see any other errors?

    I've investigated the "packing" issue and the format flag seems to just be
    there to describe the canonical packing of linear pcm to lower-level
    functions, not to set it:

    http://wiki.multimedia.cx/index.php?title=PCM#16-bit_PCM

    I've also created a routine so that a copy is made of the WAVE when "Non-zero
    amount of data received; start recognition of new utterance.Listening..."
    happens (I can get this to happen once in a session if I talk while
    cont_ad_calib is happening, though it doesn't return a hyp now) so that I can
    check out the file format, and I've played it back and looked at its settings
    and it looks and sounds good as far as I can tell (I fixed the echoing issue).
    An example of one of these recordings is here:

    http://www.robot-commando.com/recordedFile.wav.zip

    So, I think my sound format and recording is OK, but I'm still unsure about
    the byte ordering thing. I've put some more logging on cont_ad_read_internal
    and ad_read and this is what I'm seeing, maybe it is informative:

    The first thing that happens is that cont_ad_calib is started, in the course
    of which there are approximately 1400 iterations of ad_read. First there are
    ~1200 iterations in which max is 256, length is therefore set to receive as
    many as 512 bytes, actually reads zero bytes, and returns zero bytes. This is
    followed by approximately 200 more iterations of ad_read (still during
    cont_ad_calib) in which max is 256, length is set to 512, has a full 512 bytes
    read into it, and returns 256. I don't know if this is as expected. Again, my
    sound input is set to have a single buffer of a second in a half in duration,
    in case that is interesting.

    It is at this point that cont_ad_calib completes and "READY..." appears and
    then ad_read is called for an endless number of iterations and the ad_read
    values are like this, with no utterance ever being detected:

    2010-04-05 17:31:49.663 Continuous In cont_ad_read_internal, about to do the
    first ad_read
    In ad_read, length before read (max * 2) is 131072
    In ad_read, length after read (actual bytes read) 44596
    In ad_read, length on return (half bytes read) is 22298

    2010-04-05 17:31:49.764 Continuous In cont_ad_read_internal, about to do the
    first ad_read
    In ad_read, length before read (max * 2) is 86476
    In ad_read, length after read (actual bytes read) 44596
    In ad_read, length on return (half bytes read) is 22298

    2010-04-05 17:31:49.866 Continuous In cont_ad_read_internal, about to do the
    first ad_read
    In ad_read, length before read (max * 2) is 41880
    In ad_read, length after read (actual bytes read) 41880
    In ad_read, length on return (half bytes read) is 20940

    2010-04-05 17:31:49.866 Continuous In cont_ad_read_internal, about to do the
    second ad_read
    In ad_read, length before read (max * 2) is 76288
    In ad_read, length after read (actual bytes read) 44596
    In ad_read, length on return (half bytes read) is 22298

    2010-04-05 17:31:49.968 Continuous In cont_ad_read_internal, about to do the
    first ad_read
    In ad_read, length before read (max * 2) is 86476
    In ad_read, length after read (actual bytes read) 44596
    In ad_read, length on return (half bytes read) is 22298

    2010-04-05 17:31:50.069 Continuous In cont_ad_read_internal, about to do the
    first ad_read
    In ad_read, length before read (max * 2) is 41880
    In ad_read, length after read (actual bytes read) 41880
    In ad_read, length on return (half bytes read) is 20940

    2010-04-05 17:31:50.070 Continuous In cont_ad_read_internal, about to do the
    second ad_read
    In ad_read, length before read (max * 2) is 76288
    In ad_read, length after read (actual bytes read) 44596
    In ad_read, length on return (half bytes read) is 22298

    2010-04-05 17:31:50.171 Continuous In cont_ad_read_internal, about to do the
    first ad_read
    In ad_read, length before read (max * 2) is 86476
    In ad_read, length after read (actual bytes read) 44596
    In ad_read, length on return (half bytes read) is 22298

    2010-04-05 17:31:50.273 Continuous In cont_ad_read_internal, about to do the
    first ad_read
    In ad_read, length before read (max * 2) is 41880
    In ad_read, length after read (actual bytes read) 41880
    In ad_read, length on return (half bytes read) is 20940

    2010-04-05 17:31:50.274 Continuous In cont_ad_read_internal, about to do the
    second ad_read
    In ad_read, length before read (max * 2) is 76288
    In ad_read, length after read (actual bytes read) 44596
    In ad_read, length on return (half bytes read) is 22298

    2010-04-05 17:31:50.375 Continuous In cont_ad_read_internal, about to do the
    first ad_read
    In ad_read, length before read (max * 2) is 86476
    In ad_read, length after read (actual bytes read) 44596
    In ad_read, length on return (half bytes read) is 22298

    2010-04-05 17:31:50.476 Continuous In cont_ad_read_internal, about to do the
    first ad_read
    In ad_read, length before read (max * 2) is 41880
    In ad_read, length after read (actual bytes read) 41880
    In ad_read, length on return (half bytes read) is 20940

    2010-04-05 17:31:50.477 Continuous In cont_ad_read_internal, about to do the
    second ad_read
    In ad_read, length before read (max * 2) is 76288
    In ad_read, length after read (actual bytes read) 44596
    In ad_read, length on return (half bytes read) is 22298

    2010-04-05 17:31:50.579 Continuous In cont_ad_read_internal, about to do the
    first ad_read
    In ad_read, length before read (max * 2) is 86476
    In ad_read, length after read (actual bytes read) 44596
    In ad_read, length on return (half bytes read) is 22298

    Ad nauseam. Not sure how to proceed. Should I try some byte-reordering? Where
    would be the most efficient place to try it?

    What is weird is that it was working better back when I had the wrong length
    being returned by ad_read. This really does seem to point to a bitrate issue
    but I can't see one in the audio file.

     
  • Halle

    Halle - 2010-04-05

    "my sound input is set to have a single buffer of a second in a half in
    duration" should read "my sound input is set to have a single buffer of a
    second and a half in duration".

     
  • Halle

    Halle - 2010-04-05

    To clarify, ps_start_utt() is never being reached which is why I don't have
    logging of what is happening during a recognition attempt, it's just stuck
    returning zero from cont_ad_read().

     
  • Halle

    Halle - 2010-04-05

    And, in case you haven't gotten tired of my self-replies yet :) , there has
    been one other thing nagging at me throughout, which is that the
    AudioFileReadBytes() function above has an argument for the byte offset to
    start reading at, which none of the driver examples have. I have tried setting
    this to the following settings: 0, NULL, 44 (just to see if it was related to
    the WAVE header) and the current recording packet. 0, NULL and 44 all worked
    slightly, the current recording packet seemed wrong. Do you think there is
    something else I should be doing with this argument?

     
  • Nickolay V. Shmyrev

    Hi

    The offset should be 0, as I see in other examples found on Google Code Search.
    Let's try to fix one issue first:

    actually reads zero bytes, and returns zero bytes.

    It shouldn't be so. I'm not sure how it is on the Mac, but common sense is that if you
    are reading from the microphone you can read as many bytes as you request. You
    probably need to add a little delay on device initialization to fill the audio
    buffer. Is there any sample code that shows how to input audio on Mac OS? I
    probably need to look at both your version and this sample because I don't
    quite understand how your code differs from the standard one.

     
  • Halle

    Halle - 2010-04-06

    OK, I put a usleep() after I start recording for the same duration as my
    buffer duration and now there are no more returns of zero bytes during
    calibration and the calibration is much shorter, thanks. I had tried this
    earlier assuming the read was starting too early, but I had it in the
    utterance loop rather than the ad_start_rec and I think it wasn't occurring at
    the right moment to be useful. My code doesn't really diverge significantly
    from the Apple example code (I don't think official example code for Core
    Audio is online anywhere, unfortunately) but I will attempt to see if I can
    find something to point you to.
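
    An alternative to a single fixed sleep is to emulate a blocking read: poll the source and pause briefly whenever it has nothing buffered yet. This is a rough sketch only; the callback types, read_blocking, and the fake device are hypothetical names, and the pause is passed in as a callback so the sketch stays portable:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callbacks: a non-blocking byte reader and a short pause. */
typedef long (*byte_reader_fn)(void *ctx, void *dst, size_t n_bytes);
typedef void (*idle_fn)(void *ctx);

/* Emulate a blocking read: poll until `want` bytes have arrived or
 * max_polls attempts have been made. Returns bytes read, or -1 on error. */
static long read_blocking(byte_reader_fn read_bytes, idle_fn idle, void *ctx,
                          uint8_t *dst, size_t want, int max_polls)
{
    size_t have = 0;

    assert(read_bytes != NULL && idle != NULL && dst != NULL);
    for (int polls = 0; have < want && polls < max_polls; polls++) {
        long got = read_bytes(ctx, dst + have, want - have);
        if (got < 0)
            return -1;
        if (got == 0)
            idle(ctx);            /* nothing buffered yet: wait, then retry */
        have += (size_t)got;
    }
    return (long)have;
}

/* Stand-in device for demonstration: empty for the first `empty_polls`
 * reads, then delivers whatever is requested, filled with zeros. */
struct fake_dev { int empty_polls; int pauses; };

static long fake_read(void *ctx, void *dst, size_t n_bytes)
{
    struct fake_dev *d = ctx;
    if (d->empty_polls > 0) {
        d->empty_polls--;
        return 0;
    }
    memset(dst, 0, n_bytes);
    return (long)n_bytes;
}

static void fake_idle(void *ctx)
{
    ((struct fake_dev *)ctx)->pauses++;
}
```

    In a real driver the idle callback could simply call usleep() for a few milliseconds, and max_polls bounds how long a read may stall before the caller gets a short count back.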

     