fe->vad_data->global_state

  • Halle
    2014-08-12

    Hi Nickolay,

    I'm integrating the new VAD/noise robustness code and I wanted to check something with you. Is it correct that the buffer on which fe->vad_data->global_state switches to 1 is actually the buffer after the first buffer in which speech begins? I've noticed that if I write out a WAV using all of the buffers for which ps_get_in_speech returns 1, all of the buffers which contained speech are present in the WAV except for the first one. Is that expected?
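
    For reference, this is roughly the loop I'm using to collect the in-speech buffers (a simplified sketch: the WAV header and error handling are left out, the model paths and file names are placeholders, and it assumes the current API where ps_start_utt takes only the decoder):

    ```c
    #include <pocketsphinx.h>
    #include <stdio.h>

    int main(void)
    {
        /* Placeholder paths -- substitute your own model, LM and dictionary. */
        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
            "-hmm", "/path/to/acoustic-model",
            "-lm", "/path/to/model.lm",
            "-dict", "/path/to/model.dic",
            NULL);
        ps_decoder_t *ps = ps_init(config);

        FILE *audio = fopen("test-recording.raw", "rb"); /* 16 kHz, 16-bit mono */
        FILE *out = fopen("in-speech.raw", "wb");        /* buffers flagged as speech */
        int16 buf[2048];
        size_t nread;

        ps_start_utt(ps);
        while ((nread = fread(buf, sizeof(int16), 2048, audio)) > 0) {
            ps_process_raw(ps, buf, nread, FALSE, FALSE);
            /* Keep only the buffers for which the decoder reports speech. */
            if (ps_get_in_speech(ps))
                fwrite(buf, sizeof(int16), nread, out);
        }
        ps_end_utt(ps);

        fclose(out);
        fclose(audio);
        ps_free(ps);
        cmd_ln_free_r(config);
        return 0;
    }
    ```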

     
  • Is it correct that the buffer on which fe->vad_data->global_state switches to 1 is actually the buffer after the first buffer in which speech begins?

    Yes, we are not doing great here; we actually need to buffer slightly more data than vad_prespeech, otherwise we can leave the beginning of speech unhandled.

     
  • Halle
    2014-08-14

    Is the prespeech buffer guaranteed to be the buffer in which the silence/speech threshold was crossed?

     
    • bic-user
      2014-08-14

      Yes. Every frame in the prespeech buffer crosses the threshold; otherwise they are all discarded.

       
  • Halle
    2014-08-14

    Thank you. To be more specific, my question was whether prespeech is guaranteed to be the buffer in which some frame first crossed the threshold and frames of speech began, not whether all frames in prespeech are speech. If all frames in prespeech are speech, it is suggestive that it is the buffer before prespeech in which speech begins.

     
  • bic-user
    2014-08-14

    it is suggestive that it is the buffer before prespeech in which speech begins.

    The border between speech and silence is somewhat smoothed, so it is hard to say.
    You can try the cont_seg utility from sphinxbase to check what audio is passed by the VAD for recognition.

     
  • Halle
    2014-08-14

    To ask it from a different perspective, what do you see as the potential downside of prepending the buffer before the one in which all frames are evaluated as speech? Isn't it either a silence buffer or a mixed silence/speech buffer containing the start of speech?
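
    Concretely, I mean something along these lines on the application side, where the previous buffer is kept around and submitted first once speech is flagged (just a sketch of the idea; process_buffer is a hypothetical stand-in for whatever consumes the audio, and buffers are assumed to be at most 2048 samples):

    ```c
    #include <string.h>
    #include <pocketsphinx.h>

    /* Hypothetical consumer of the audio that is treated as part of the utterance. */
    void process_buffer(const int16 *buf, size_t n);

    static int16 prev_buf[2048];
    static size_t prev_n = 0;
    static int was_in_speech = 0;

    /* Sketch: when the decoder first reports speech, also hand over the buffer
     * that came just before it, since that is either silence or the mixed
     * buffer in which speech actually started. */
    void on_audio(ps_decoder_t *ps, const int16 *buf, size_t n)
    {
        ps_process_raw(ps, buf, n, FALSE, FALSE);

        if (ps_get_in_speech(ps)) {
            if (!was_in_speech && prev_n > 0)
                process_buffer(prev_buf, prev_n);
            process_buffer(buf, n);
            was_in_speech = 1;
        } else {
            was_in_speech = 0;
        }

        /* Remember this buffer in case the next one is flagged as speech. */
        memcpy(prev_buf, buf, n * sizeof(int16));
        prev_n = n;
    }
    ```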

     
  • Halle
    2014-08-14

    I'm asking because I have the impression that there has been an accuracy decrease in my testbed since integrating the new code. The experiment of writing out the buffers led me from curiosity about the underlying reason to this question about the difference between buffers known to contain speech and buffers for which ps_get_in_speech returns 1, so I'm trying to understand how much that also affects what is submitted to be decoded as part of an utterance.

     
    • I'm asking because I have the impression that there has been an accuracy decrease in my testbed since integrating the new code

      Most likely the reason is noise removal being enabled, not the VAD (you can disable it with -remove_noise no).
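
      For example, roughly like this when creating the decoder (a sketch; the model paths are placeholders):

      ```c
      #include <pocketsphinx.h>

      /* Sketch: the usual decoder setup, but with noise removal turned off. */
      ps_decoder_t *init_decoder_without_noise_removal(void)
      {
          cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
              "-hmm", "/path/to/acoustic-model",
              "-lm", "/path/to/model.lm",
              "-dict", "/path/to/model.dic",
              "-remove_noise", "no",
              NULL);
          return ps_init(config);
      }
      ```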

       
      • Halle
        2014-09-19

        Ah, interesting, it was actually the other way around – my WER test (simple enough that it should always pass) started passing when I set the feat.params of wsj to -remove_noise yes rather than no.

         
        • Halle
          2014-09-19

          Sorry, I spoke too soon; it was a fluke. My test uses a sentence that is about six words long, with an LM containing only those words. The test is that during continuous recognition, the very first recognition is able to recognize this sentence with no errors.

          There is a tendency in this test for a null hyp to be incorrectly recognized just as soon as recognition starts (there isn't really a sound at that point in the test recording).

          What I've noticed is that every time recognition starts, if there is first a null hyp recognized, then the sentence recognition is poor – it specifically has a bunch of incorrect insertions of the word "A". If the null hyp isn't somehow recognized first, the sentence recognition is sometimes correct. remove_noise settings don't seem to affect it enough that I can attribute the results to it one way or another.

          I am testing with two different driver versions. One does its own noise suppression (so it will provide a lot of packets with no power) and the other doesn't (so it probably shouldn't ever provide any packets with no power). Is there anything I need to set with respect to either of these situations while I evaluate them? Thanks!

           
  • bic-user
    2014-08-14

    potential downside of prepending the buffer before the one in which all frames are evaluated as speech?

    No downsides, actually.

    accuracy decrease in my testbed since integrating the new code

    That's possible, though I'd rather play with the threshold and check what kinds of errors appear once the VAD is involved: look through the alignment of a batch recognition test.
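
    For example, to experiment with the threshold you can change it on an existing configuration and reinitialize the decoder (a sketch assuming the cmd_ln_set_float_r and ps_reinit calls; 3.0 is only an illustrative value, and the default may differ between versions):

    ```c
    #include <pocketsphinx.h>

    /* Sketch: raise the VAD threshold on an existing configuration and
     * reinitialize the decoder so the new value takes effect.
     * 3.0 is only an example value to experiment with. */
    void raise_vad_threshold(ps_decoder_t *ps, cmd_ln_t *config)
    {
        cmd_ln_set_float_r(config, "-vad_threshold", 3.0);
        ps_reinit(ps, config);
    }
    ```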

     
  • Halle
    2014-08-14

    Cool, thank you for considering.

     
    • Hello Halle

      Just to let you know that we started to buffer a slight amount of silence at the start of speech just a few days ago. It improves accuracy significantly.