CMU Sphinx / Forums / Help: Continuous: nonstop talking OK, silence not

Hello,

This is an odd one - I'm again trying to write an ad driver so I can do
continuous recognition, and I have it working a little so that continuous mode
recognizes words as long as the speaker never stops talking. If they stop
talking, or if they aren't talking at the moment that continuous initializes,
only silence is detected and this block of cont_ad_read() starts to repeat,
long past the point that the speaker has started speaking again:

    if (seg == NULL) {
        assert(r->tail_state == CONT_AD_STATE_SIL);

        flen = (r->eof) ? r->n_frm : r->n_frm - (r->winsize + r->leader - 1);
        if (flen < 0)
            flen = 0;
    }

Once this starts happening, it becomes progressively less likely that any
speech will be detected, until after 20 to 40 loops of utterance_loop() it is
completely unable to detect speech. If there is no noise while continuous is
initializing, it never becomes able to detect speech during the session.
Every once in a while, it emits the following instead, until memory runs out
and it crashes:

INFO: ngram_search.c(407): Resized backpointer table to 10000 entries
INFO: ngram_search.c(407): Resized backpointer table to 20000 entries
INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 3003, best score -534784990
INFO: ngram_search.c(407): Resized backpointer table to 40000 entries
INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 6067, best score -534667231
INFO: ngram_search.c(407): Resized backpointer table to 80000 entries
INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 9152, best score -534799192
INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 12234, best score -534766424
INFO: ngram_search.c(407): Resized backpointer table to 160000 entries
INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 15316, best score -534743896
INFO: ngram_search.c(415): Resized score stack to 200000 entries
INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 18392, best score -534858449
INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 21486, best score -534783832
INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 24555, best score -534703150
INFO: ngram_search.c(407): Resized backpointer table to 320000 entries
INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 27647, best score -534717407
INFO: ngram_search_fwdtree.c(1433): Renormalizing Scores at frame 30718, best score -534769361

Any hints about what to look at? I've changed the buffer size of my device to
a few different values and futzed with the constants in cont_ad_base.c.
Changing CONT_AD_ADFRMSIZE to 290 fixes this issue about 2 runs out of 3, but
it seems very random whether it helps or not. Also, when CONT_AD_ADFRMSIZE
is 290 and it's one of the working runs, it detects non-silence every ~12
loops whether there is really non-silence or not. So I'm starting to run out
of ideas. I've changed several of the silence detection values without any
good results yet, although I can easily believe that I missed something. Any
advice is appreciated, and thank you. Relatedly, what buffer size is optimal
for continuous once the audio device has begun recording?

Here is my configuration:

Current configuration:
[NAME]          [DEFLT]         [VALUE]
-agc            none            none
-agcthresh      2.0             2.000000e+00
-alpha          0.97            9.700000e-01
-ceplen         13              13
-cmn            current         current
-cmninit        8.0             8.0
-dither         no              no
-doublebw       no              no
-feat           1s_c_d_dd       s2_4x
-frate          100             100
-input_endian   little          little
-lda
-ldadim         0               0
-lifter         0               0
-logspec        no              no
-lowerf         133.33334       1.000000e+00
-ncep           13              13
-nfft           512             512
-nfilt          40              20
-remove_dc      no              yes
-round_filters  yes             no
-samprate       16000           1.600000e+04
-seed           -1              -1
-smoothspec     no              no
-svspec
-transform      legacy          dct
-unit_area      yes             yes
-upperf         6855.4976       4.000000e+03
-varnorm        no              no
-verbose        no              no
-warp_params
-warp_type      inverse_linear  inverse_linear
-wlen           0.025625        2.562500e-02

Halle - 2010-04-04

When recognition is working, the value of max at the start of ad_read is
always exactly 65536, and when the app is stuck and unable to recognize speech
or crashing while "Renormalizing Scores", the value of max at the start of
ad_read is always less than 65536.

Nickolay V. Shmyrev - 2010-04-04

Hi

I could see two reasons for this behavior:

Wrong input format from your audio device; it might be the sampling rate or
the byte order.

Some changes you've made in cont_ad. It's not recommended to change anything
there because the values actually depend on each other.

To test the endpointer, please try the sphinx_ad_fileseg program with the
test.wav data from sphinxbase/src/sphinx_adtools. You can probably compare the
values from your input with this file and get an idea of what's wrong. Also
please note that the calibration time should be rather long, like 5 seconds
or so.

Nickolay V. Shmyrev - 2010-04-04

Yes, and please try to get the cont_ad output with

cont_ad_set_logfp(cont, stdout);

It would be helpful to look at it.

Nickolay V. Shmyrev - 2010-04-04

> is always exactly 65536

This indeed looks like a byte order issue.
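
One way to test the byte order hypothesis is to swap the two bytes of every
16-bit sample before handing the buffer on and see whether detection changes.
The following is only a minimal sketch of that experiment; the function name
swap_sample_bytes is hypothetical and is not part of sphinxbase.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Swaps the two bytes of every 16-bit sample in place.
   Hypothetical helper for experimenting with byte order only. */
void swap_sample_bytes(int16_t *buf, size_t n_samples)
{
    assert(buf != NULL || n_samples == 0);
    for (size_t i = 0; i < n_samples; i++) {
        uint16_t u = (uint16_t)buf[i];
        /* Move the low byte up and the high byte down. */
        buf[i] = (int16_t)(uint16_t)((u << 8) | (u >> 8));
    }
}
```

If the input really had the wrong byte order, audio that sounds like loud
noise before the swap should sound like speech after it. Applying the swap
twice restores the original buffer.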

Halle - 2010-04-04

Hi Nickolay,

Thanks very much for your assistance. Here are links to the three logs showing
the cont_ad_set_logfp output for the three different potential outcomes (they
are long, of course):

This one is for the success state, which is when there is uninterrupted
speech:
http://www.robot-commando.com/constant_speech_and_success.log.zip

This one is for the first failure state, for when it just never gets to the
point of "Listening...." but doesn't crash:
http://www.robot-commando.com/failure_never_gets_to_listening_doesn't_crash.log.zip

This is the second failure state, for when it gets to "Listening...." while
there isn't constant speaking and then starts resizing scores until it runs
out of memory and crashes:
http://www.robot-commando.com/failure_silence_resizing_scores_then_crash.log.zip

I have reset everything in cont_ad to its original values to remove any
potential sources of confusion. Do you think there could be a byte order or
sample rate issue with the audio format, but one that wouldn't prevent it from
being able to recognize speech well when there is constant speaking? I tried
setting a flag on the WAVE format that is being recorded that it should be
bigendian, but that was an invalid format, so I set it to native endianness
for the format. Or are you talking about byte order for a different area of
the code?

Halle - 2010-04-04

Better URL for the middle log:

This one is for the first failure state, for when it just never gets to the
point of "Listening...." but doesn't crash:
http://www.robot-commando.com/failure_never_gets_to_listening_doesnt_crash.log.zip

Halle - 2010-04-04

Also, just to rule this out - in my audio input, I am using a single buffer
and it is about a half second in length - I've changed this around in every
conceivable way (one buffer, three buffers, eight buffers, all kinds of
different buffer sizes) and it's had almost no effect at all, but just to rule
out one more thing, does one buffer of a half second sound like an OK value to
you? I can also set the buffer size to be a byte size, but the reference
example I used preferred a calculated timespan, so I stayed with that
approach.

My format is as follows:

Linear PCM
WAVE type
Format is signed integer
Format is packed
Native format endianness
Channels per frame = 1
Bits per channel = 16
Sample rate = 16000
Bytes per packet = (bits per channel / 8) * channels per frame
Bytes per frame = bytes per packet
Frames per packet = 1

Halle - 2010-04-04

OK, I see now that WAVE shouldn't be bigendian, so that isn't the issue.

Nickolay V. Shmyrev - 2010-04-04

> Format is packed

This looks suspicious. What kind of packing is it? You need to try to dump
recorded audio into a file. It would be easy to check then. Also you can try
-rawlogdir option to dump audio.
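
If dumping from inside the driver is easier than using -rawlogdir, a minimal
sketch of such a dump could look like this. The function name dump_raw and
the choice to append to a headerless file are assumptions made for
illustration; they are not part of sphinxbase.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Appends n_samples 16-bit samples to a headerless raw file.
   Returns the number of samples written, or -1 if the file cannot be opened.
   Hypothetical debugging helper. */
long dump_raw(const char *path, const int16_t *buf, size_t n_samples)
{
    assert(path != NULL && buf != NULL);
    FILE *fp = fopen(path, "ab");   /* append, binary */
    if (fp == NULL)
        return -1;
    size_t written = fwrite(buf, sizeof(int16_t), n_samples, fp);
    fclose(fp);
    return (long)written;
}
```

Calling this with each buffer just before it is returned to the recognizer
gives a file that can be opened in any audio editor able to import raw data,
described as 16-bit signed, mono, 16000 Hz.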

Halle - 2010-04-04

I can submit WAV files recorded with the same settings into this function you
posted, so I don't think the format is too far off if it has an issue:

https://sourceforge.net/projects/cmusphinx/forums/forum/5471/topic/3514523?message=7994617

About the packedness, I suspect you are on to something. There are three
options: packed means that the sample bits occupy the entire available bits
for the channel, or there is the option of setting align high which will place
the sample bits into the high bits of the channel, or align low which will
place the sample bits into the low bits of the channel. That sounds related to
the weirdness I'm experiencing if you say it sounds like it's related to
ordering. What is kind of maddening is that if I change the packedness flag, I
get an error every time. But I will keep looking at it. Do you think it should
be align low or align high?

Halle - 2010-04-04

OK, here are a couple of raw audio files:

http://robot-commando.com/Rawfiles.zip

I don't actually have anything to listen to them with - what do you use to
analyze them?

Halle - 2010-04-04

(The files were output with the -rawlogdir argument).

Nickolay V. Shmyrev - 2010-04-04

Looking inside, the format is good, the issue is actually that there is an
echo of about 120ms length. It looks like you implemented sound input
incorrectly. Probably you want to show your code.

Halle - 2010-04-04

Will that echo have also been there in the original recording, or is the sound
at all processed in the course of the recognition routine before being output
as raw?

OK, there is actually a lot of code for this and I need to check up on a
number of things relating to the echo, but here is probably the most
interesting thing, which is the contents of ad_read():

    UInt32 length = max; // I haven't had good results setting this to max * sizeof(int16)

    OSStatus status = AudioFileReadBytes(
        r->recorder->AudioFileID(),
        false,   // don't cache
        0,       // starting position - is zero correct here? starting at the current rec position doesn't work
        &length, // in is bytes to read, out is bytes actually read
        buf      // output
    );

    if (status == -39 && r->recording == 0) { // status -39 is EOF
        return AD_EOF;
    } else if (status != 0) { // status 0 is success, other possibilities are an EOF, a parameter error or something else
        if (status == -39 && r->recording == 1) {
            return 0;
        } else if (status == -50) {
            // rarely, a -50 (bad parameter) error is being returned here
        } else { // an unknown error, this isn't happening
            printf("status is %d", (int)status);
            return -1;
        }
    } else {
        return length;
    }

    return 0;

Halle - 2010-04-04

Ugh, sorry that code tag keeps not working.

Nickolay V. Shmyrev - 2010-04-04

The issue I see here is that cont_ad passes you number of samples (each 2
byte) and expects to get number of samples back (not number of bytes, but
number of bytes / 2). Please check that.
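
The conversion described above can be sketched as follows. This is not the
real driver: byte_source_t is a hypothetical in-memory stand-in for whatever
byte-oriented read call the audio API provides, and read_samples is a
made-up name. The point is only the two conversions, samples to bytes on the
way in and bytes to samples on the way out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the recording: a byte buffer plus a read cursor.
   In a real driver the bytes would come from the audio API instead. */
typedef struct {
    const uint8_t *data;
    size_t size;   /* total bytes available */
    size_t pos;    /* next unread byte */
} byte_source_t;

/* Reads up to max_samples 16-bit samples into buf and returns the number of
   SAMPLES copied (not bytes). */
int32_t read_samples(byte_source_t *src, int16_t *buf, int32_t max_samples)
{
    assert(src != NULL && buf != NULL && max_samples >= 0);

    size_t want = (size_t)max_samples * sizeof(int16_t);  /* samples -> bytes */
    size_t left = src->size - src->pos;
    size_t got = want < left ? want : left;
    got -= got % sizeof(int16_t);      /* never hand back half a sample */

    memcpy(buf, src->data + src->pos, got);
    src->pos += got;
    return (int32_t)(got / sizeof(int16_t));              /* bytes -> samples */
}
```

With 10 bytes available, a request for 3 samples returns 3 and consumes 6
bytes, and the next request for 8 samples returns the remaining 2.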

Halle - 2010-04-05

Whoops, that's a big issue, thanks Nickolay. OK, I have gone back to the
various ad implementation examples that use byte-reading functions and
simplified my ad_read contents:

    UInt32 length = max * r->bps;

    printf("length before read (max * 2) is %d\n", (int)length);

    OSStatus status = AudioFileReadBytes(
        r->recorder->GetAudioFileID(), // queue
        false,   // cache?
        0,       // starting position is the record packet
        &length, // input is how many bytes to read, output is how many read
        buf
    );
    printf("length after read (actual bytes read) %d\n", (int)length);

    if (length > 0) {
        length /= r->bps; // bytes to samples
    } else if (length < 0) { // never true: length is unsigned (UInt32)
        NSLog(@"status: %d", (int)status);
        return AD_ERR_GEN;
    } else {
        length = 0;
    }
    if ((length == 0) && (r->recording == 0)) {
        return AD_EOF;
    }
    printf("length on return (half bytes read) is %d\n\n\n", (int)length);
    return length;

Now it is more stable but still not working. Do you see any other errors?

I've investigated the "packing" issue and the format flag seems to just be
there to describe the canonical packing of linear pcm to lower-level
functions, not to set it:

http://wiki.multimedia.cx/index.php?title=PCM#16-bit_PCM

I've also created a routine so that a copy is made of the WAVE when "Non-zero
amount of data received; start recognition of new utterance.Listening..."
happens (I can get this to happen once in a session if I talk while
cont_ad_calib is happening, though it doesn't return a hyp now) so that I can
check out the file format, and I've played it back and looked at its settings
and it looks and sounds good as far as I can tell (I fixed the echoing issue).
An example of one of these recordings is here:

http://www.robot-commando.com/recordedFile.wav.zip

So, I think my sound format and recording is OK, but I'm still unsure about
the byte ordering thing. I've put some more logging on cont_ad_read_internal
and ad_read and this is what I'm seeing, maybe it is informative:

The first thing that happens is that cont_ad_calib is started, in the course
of which there are approximately 1400 iterations of ad_read. First there are
~1200 iterations in which max is 256, length is therefore set to receive as
many as 512 bytes, actually reads zero bytes, and returns zero bytes. This is
followed by approximately 200 more iterations of ad_read (still during
cont_ad_calib) in which max is 256, length is set to 512, has a full 512 bytes
read into it, and returns 256. I don't know if this is as expected. Again, my
sound input is set to have a single buffer of a second in a half in duration,
in case that is interesting.

It is at this point that cont_ad_calib completes and "READY..." appears and
then ad_read is called for an endless number of iterations and the ad_read
values are like this, with no utterance ever being detected:

2010-04-05 17:31:49.663 Continuous In cont_ad_read_internal, about to do the
first ad_read
In ad_read, length before read (max * 2) is 131072
In ad_read, length after read (actual bytes read) 44596
In ad_read, length on return (half bytes read) is 22298

2010-04-05 17:31:49.764 Continuous In cont_ad_read_internal, about to do the
first ad_read
In ad_read, length before read (max * 2) is 86476
In ad_read, length after read (actual bytes read) 44596
In ad_read, length on return (half bytes read) is 22298

2010-04-05 17:31:49.866 Continuous In cont_ad_read_internal, about to do the
first ad_read
In ad_read, length before read (max * 2) is 41880
In ad_read, length after read (actual bytes read) 41880
In ad_read, length on return (half bytes read) is 20940

2010-04-05 17:31:49.866 Continuous In cont_ad_read_internal, about to do the
second ad_read
In ad_read, length before read (max * 2) is 76288
In ad_read, length after read (actual bytes read) 44596
In ad_read, length on return (half bytes read) is 22298

2010-04-05 17:31:49.968 Continuous In cont_ad_read_internal, about to do the
first ad_read
In ad_read, length before read (max * 2) is 86476
In ad_read, length after read (actual bytes read) 44596
In ad_read, length on return (half bytes read) is 22298

2010-04-05 17:31:50.069 Continuous In cont_ad_read_internal, about to do the
first ad_read
In ad_read, length before read (max * 2) is 41880
In ad_read, length after read (actual bytes read) 41880
In ad_read, length on return (half bytes read) is 20940

2010-04-05 17:31:50.070 Continuous In cont_ad_read_internal, about to do the
second ad_read
In ad_read, length before read (max * 2) is 76288
In ad_read, length after read (actual bytes read) 44596
In ad_read, length on return (half bytes read) is 22298

2010-04-05 17:31:50.171 Continuous In cont_ad_read_internal, about to do the
first ad_read
In ad_read, length before read (max * 2) is 86476
In ad_read, length after read (actual bytes read) 44596
In ad_read, length on return (half bytes read) is 22298

2010-04-05 17:31:50.273 Continuous In cont_ad_read_internal, about to do the
first ad_read
In ad_read, length before read (max * 2) is 41880
In ad_read, length after read (actual bytes read) 41880
In ad_read, length on return (half bytes read) is 20940

2010-04-05 17:31:50.274 Continuous In cont_ad_read_internal, about to do the
second ad_read
In ad_read, length before read (max * 2) is 76288
In ad_read, length after read (actual bytes read) 44596
In ad_read, length on return (half bytes read) is 22298

2010-04-05 17:31:50.375 Continuous In cont_ad_read_internal, about to do the
first ad_read
In ad_read, length before read (max * 2) is 86476
In ad_read, length after read (actual bytes read) 44596
In ad_read, length on return (half bytes read) is 22298

2010-04-05 17:31:50.476 Continuous In cont_ad_read_internal, about to do the
first ad_read
In ad_read, length before read (max * 2) is 41880
In ad_read, length after read (actual bytes read) 41880
In ad_read, length on return (half bytes read) is 20940

2010-04-05 17:31:50.477 Continuous In cont_ad_read_internal, about to do the
second ad_read
In ad_read, length before read (max * 2) is 76288
In ad_read, length after read (actual bytes read) 44596
In ad_read, length on return (half bytes read) is 22298

2010-04-05 17:31:50.579 Continuous In cont_ad_read_internal, about to do the
first ad_read
In ad_read, length before read (max * 2) is 86476
In ad_read, length after read (actual bytes read) 44596
In ad_read, length on return (half bytes read) is 22298

Ad nauseam. Not sure how to proceed. Should I try some byte-reordering? Where
would be the most efficient place to try it?

What is weird is that it was working better back when I had the wrong length
being returned by ad_read. This really does seem to point to a bitrate issue
but I can't see one in the audio file.

Halle - 2010-04-05

"my sound input is set to have a single buffer of a second in a half in
duration" should read "my sound input is set to have a single buffer of a
second and a half in duration".

Halle - 2010-04-05

To clarify, ps_start_utt() is never being reached which is why I don't have
logging of what is happening during a recognition attempt, it's just stuck
returning zero from cont_ad_read().

Halle - 2010-04-05

And, in case you haven't gotten tired of my self-replies yet :) , there has
been one other thing nagging at me throughout, which is that the
AudioFileReadBytes() function above has an argument for the byte offset to
start reading at, which none of the driver examples have. I have tried setting
this to the following values: 0, NULL, 44 (just to see if it was related to
the WAVE header) and the current recording packet. 0, NULL and 44 all worked
slightly; the current recording packet seemed wrong. Do you think there is
something else I should be doing with this argument?

Nickolay V. Shmyrev - 2010-04-06

Hi

The offset should be 0, as I see in other examples found on Google Code
Search. Let's try to fix one issue first:

> actually reads zero bytes, and returns zero bytes.

It shouldn't be so. I'm not sure how it is on Mac, but common sense says that
if you are reading from the microphone you can read as many bytes as you
request. You probably need to add a small delay on device initialization to
fill the audio buffer. Is there any sample code that shows how to input audio
on Mac OS? I probably need to look at both your version and this sample,
because I don't quite understand how your code differs from the standard one.

Halle - 2010-04-06

OK, I put a usleep() after I start recording for the same duration as my
buffer duration and now there are no more returns of zero bytes during
calibration and the calibration is much shorter, thanks. I had tried this
earlier assuming the read was starting too early, but I had it in the
utterance loop rather than the ad_start_rec and I think it wasn't occurring at
the right moment to be useful. My code doesn't really diverge significantly
from the Apple example code (I don't think official example code for Core
Audio is online anywhere, unfortunately) but I will attempt to see if I can
find something to point you to.
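
The delay described in this post can be sketched as below. This is an
illustration under assumptions, not the code from the thread: the function
names are hypothetical, and nanosleep() is used in place of usleep() because
POSIX allows usleep() to reject arguments of one second or more, which a
buffer of a second and a half would exceed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Microseconds needed to record n_frames at sample_rate frames per second. */
uint64_t fill_time_us(uint32_t n_frames, uint32_t sample_rate)
{
    assert(sample_rate > 0);
    return (uint64_t)n_frames * 1000000u / sample_rate;
}

/* Sleeps long enough for one buffer of n_frames to fill.
   Hypothetical helper, intended to be called once right after recording
   starts (for example at the end of the driver's ad_start_rec). */
void wait_for_first_buffer(uint32_t n_frames, uint32_t sample_rate)
{
    uint64_t us = fill_time_us(n_frames, sample_rate);
    struct timespec ts;
    ts.tv_sec = (time_t)(us / 1000000u);
    ts.tv_nsec = (long)(us % 1000000u) * 1000L;
    nanosleep(&ts, NULL);
}
```

At 16000 Hz, a half-second buffer is 8000 frames and a buffer of a second and
a half is 24000 frames, so the waits come out to 500000 and 1500000
microseconds respectively.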
