
i-vector in online decoding

peng-lee
2015-06-15
2015-07-13
  • peng-lee

    peng-lee - 2015-06-15

    Hi Dan,
    First, thank you for your work. I have used the DNN-based online-decoding setup and obtained the expected results. So I know it works, but I do not know why it works, especially the effect of the i-vector. I have some questions about it; they may be simple, but they are really important to me. Could you help me?
    My questions are as follows:
    1. Why is the i-vector used in the DNN-based online-decoding setup? What is its main effect?
    2. When we use online2-wav-nnet2-latgen-faster to decode a wav file, how is the i-vector extracted online? Does every utterance use the same i-vector? If not, is an i-vector extracted every 10 frames, or at some other interval?

     
    • Daniel Povey

      Daniel Povey - 2015-06-15

      would be good if someone else can answer this.
      dan

       

      Last edit: Nickolay V. Shmyrev 2015-06-16
      • peng-lee

        peng-lee - 2015-06-15

        Thanks for your quick reply. Can anyone answer my questions or point me to some references?

         
        • Jan "yenda" Trmal

          The DNN models (from Dan's nnet2) use the i-vectors to provide the
          neural network with the speaker identity. The input features are not
          speaker-normalized -- it's left to the network to figure this out.
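
          As a toy illustration of what "providing the speaker identity" means here (plain NumPy, not Kaldi code; all dimensions are invented for the example), the i-vector is simply appended to every frame of acoustic features before they enter the network:

```python
import numpy as np

# Toy dimensions; real setups often use higher-dimensional features
# and i-vectors, but these numbers are illustrative only.
num_frames, feat_dim, ivec_dim = 5, 4, 3

feats = np.random.randn(num_frames, feat_dim)   # per-frame acoustic features
ivector = np.random.randn(ivec_dim)             # one i-vector for the speaker

# Append the same i-vector to every frame: the network sees
# [acoustic features ; speaker code] at each input step, and it is
# left to the network to use the speaker code for normalization.
augmented = np.hstack([feats, np.tile(ivector, (num_frames, 1))])
print(augmented.shape)  # (5, 7)
```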

          During decoding, the trained i-vector extractor is used to estimate the
          i-vectors. They are extracted based on the spk2utt map parameter of
          online2-wav-nnet2-latgen-faster.
          You can create various mappings (for example, you can make each utterance
          belong to a unique speaker, or just carry over the mapping from the data
          dir)...
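
          A toy sketch of the two mapping choices (plain Python with hypothetical utterance/speaker ids; in a real Kaldi data dir these mappings live in the utt2spk/spk2utt files):

```python
# Hypothetical utt2spk map, as read from a Kaldi data dir.
utt2spk = {
    "utt1": "spkA",
    "utt2": "spkA",
    "utt3": "spkB",
}

# Option 1: treat every utterance as its own "speaker", so each
# utterance gets an independent i-vector.
per_utt_spk2utt = {utt: [utt] for utt in utt2spk}

# Option 2: carry over the speaker mapping from the data dir, so all
# utterances of a speaker contribute to one i-vector estimate.
per_spk_spk2utt = {}
for utt, spk in sorted(utt2spk.items()):
    per_spk_spk2utt.setdefault(spk, []).append(utt)

print(per_utt_spk2utt)
print(per_spk_spk2utt)
```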

          The scripts steps/online/decode.sh and egs/rm/s5/local/online/run_nnet2.sh
          (for example) will hopefully answer your questions about how it is done.

          y.

           

          Last edit: Nickolay V. Shmyrev 2015-06-16
          • Daniel Povey

            Daniel Povey - 2015-06-15

            BTW, the iVector is extracted every 10 frames during training, but the
            input to the computation is all frames of the same speaker that are
            prior to the current frame. This is to emulate the online test
            condition.
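
            A conceptual sketch of that scheme (plain NumPy, with a running mean standing in for the real i-vector statistics; this is not Kaldi's actual estimator): the estimate is refreshed every 10 frames, but each refresh uses all frames of the speaker seen so far, so it improves over time just as in the online test condition.

```python
import numpy as np

np.random.seed(0)
frames = np.random.randn(35, 4)   # 35 frames of 4-dim features (toy data)
chunk = 10                        # refresh interval, in frames

estimates = []
for end in range(chunk, len(frames) + chunk, chunk):
    # Everything up to the current point -- only prior frames are used.
    history = frames[:min(end, len(frames))]
    estimates.append(history.mean(axis=0))  # stand-in for i-vector stats

# One estimate per 10-frame chunk; later estimates see more history.
print(len(estimates))  # 4
```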

             

            Last edit: Nickolay V. Shmyrev 2015-06-16
            • peng-lee

              peng-lee - 2015-07-13

              Thanks a lot! I want to know whether this approach suits a dialogue condition. When an utterance includes two or more speakers, is an i-vector extracted per speaker or per utterance? In other words, is speaker detection performed when extracting the i-vector?

               
              • Daniel Povey

                Daniel Povey - 2015-07-13

                For dialogue, what you need is speaker diarization, not just speaker
                identification. Vimal and David (cc'd) are working on a speaker
                diarization setup for Kaldi, but it will be a few months, most likely,
                before it's ready.
                Dan


                 
          • peng-lee

            peng-lee - 2015-06-16

            Thanks for your reply. Is there a good paper that explains the effect of i-vectors?