Hi Dan,
First, thank you for your work. I have used the DNN-based online-decoding setup and obtained the expected results. Now I know that it works, but I do not know why it works, especially the i-vector's effect. So I have some questions about it. Maybe my questions are very basic, but they are really important to me. Could you help me?
My questions are as follows:
1. Why use an i-vector in the DNN-based online-decoding setup? What is the main effect of the i-vector?
2. When we use online2-wav-nnet2-latgen-faster to decode a wav file, how is the i-vector extracted online? Does every utterance use the same i-vector? If not, is an i-vector extracted every 10 frames, or on some other schedule?
It would be good if someone else could answer this.
Dan
Thanks for your quick reply. Can anyone answer my questions or give me some references?
The DNN models (from Dan's nnet2 setup) use the i-vectors to provide the neural network with the speaker identity. The input features are not speaker-normalized; it is left to the network to figure this out.
During decoding, the trained i-vector extractor is used to estimate the i-vectors. They are extracted based on the spk2utt map parameter of online2-wav-nnet2-latgen-faster.
You can create various mappings (for example, you can make each utterance be uttered by a unique speaker, or just carry over the mapping from the data dir).
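As a minimal sketch of the first option (the data-dir paths here are hypothetical, and Kaldi's utils/utt2spk_to_spk2utt.pl can do the same conversion), this is how you could build an utt2spk/spk2utt pair in which every utterance is its own "speaker":

    # Write utt2spk/spk2utt files mapping each utterance id to itself.
    # Formats follow Kaldi's convention: utt2spk is "<utt-id> <spk-id>"
    # per line; spk2utt is "<spk-id> <utt-id-1> <utt-id-2> ..." per line.
    # The data-dir paths below are made-up examples.
    with open("data/test/utt2spk") as f:
        utt_ids = [line.split()[0] for line in f if line.strip()]

    with open("data/test_utt_spk/utt2spk", "w") as u2s, \
         open("data/test_utt_spk/spk2utt", "w") as s2u:
        for utt in utt_ids:
            u2s.write("%s %s\n" % (utt, utt))
            s2u.write("%s %s\n" % (utt, utt))

With a per-utterance mapping like this, the i-vector estimation starts fresh for each utterance; if you keep the real speaker mapping instead, utterances of the same speaker share the accumulated adaptation statistics.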
The scripts steps/online/decode.sh and egs/rm/s5/local/online/run_nnet2.sh (for example) will hopefully answer your questions about how it is done.
y.
BTW, the i-vector is extracted every 10 frames during training, but the input to the computation is all frames of the same speaker that are prior to the current frame. This is to emulate the online test condition.
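To make that schedule concrete, here is a toy Python sketch (not Kaldi's actual implementation; a running mean stands in for the real i-vector extractor) of an estimate that is refreshed every 10 frames from all of the speaker's frames seen so far:

    import numpy as np

    IVECTOR_PERIOD = 10  # frames between refreshes, as described above

    def toy_online_ivectors(frames):
        """frames: (T, D) features of one speaker, in time order."""
        estimate = np.zeros(frames.shape[1])  # nothing seen yet
        per_frame = []
        for t in range(len(frames)):
            # Every IVECTOR_PERIOD frames, recompute the estimate from
            # all frames strictly before the current one, which is what
            # the online test condition allows.
            if t > 0 and t % IVECTOR_PERIOD == 0:
                estimate = frames[:t].mean(axis=0)
            per_frame.append(estimate.copy())
        return np.stack(per_frame)

    feats = np.random.randn(35, 4)           # 35 fake 4-dim frames
    print(toy_online_ivectors(feats).shape)  # -> (35, 4)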
Thanks a lot! I want to know whether this approach is suitable for the dialogue condition. Is the i-vector extracted per speaker or per utterance when an utterance includes two or more speakers? In other words, is speaker detection performed when the i-vector is extracted?
For dialogue, what you need is speaker diarization, not just speaker
identification. Vimal and David (cc'd) are working on a speaker
diarization setup for Kaldi, but it will be a few months, most likely,
before it's ready.
Dan
Thanks for your reply. Is there a good paper that explains the effects of i-vectors?
I find this presentation useful:
http://people.csail.mit.edu/sshum/talks/ivector_tutorial_interspeech_27Aug2011.pdf
Thanks a lot!