[Kaldi-users] nnet2-online i-vector sensibility with short utterances

Hello,

We're using the nnet2-online setup in a CTS (conversational telephone
speech) task.  We have had good experience with the same setup for a
BN (broadcast news) task.  However, in the CTS task utterances can be
very short ("yes", "mmm", etc.), and we observe a very strong
dependence of the ivector length (norm) on duration (which makes
sense) and a very strong dependence of ASR performance on ivector
length (which also makes sense).

It seems that in the nnet2-online setup the ivectors are not
normalized to length as is customary in speaker recognition.  The nnet
doesn't seem to like the duration dependence---what would be an
approach to deal with this?  Would it make sense to train the nnet
with length-normalized ivectors?
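
To be concrete about what length normalization means here, below is a
minimal sketch in plain Python.  It is not taken from the Kaldi
recipes: the function name length_normalize, the eps guard, and the
choice of sqrt(dim) as the target norm are assumptions for
illustration (unit norm is the other common choice).

```python
import math


def length_normalize(ivector, target_norm=None, eps=1e-10):
    """Scale an i-vector so that its Euclidean norm equals target_norm.

    Hypothetical helper for illustration only. If target_norm is None,
    sqrt(dim) is used, which is an assumed convention.
    """
    dim = len(ivector)
    if target_norm is None:
        target_norm = math.sqrt(dim)
    norm = math.sqrt(sum(x * x for x in ivector))
    if norm < eps:
        # An all-zero i-vector has no direction to preserve; return a copy.
        return list(ivector)
    scale = target_norm / norm
    return [x * scale for x in ivector]


# Toy 4-dimensional vectors: same direction, very different norms, as
# one might see for a short and a long utterance of the same speaker.
short_utt = [0.1, -0.2, 0.05, 0.0]
long_utt = [2.0, -4.0, 1.0, 0.0]

a = length_normalize(short_utt)
b = length_normalize(long_utt)
```

After normalization the two toy vectors coincide, so only the
direction of the ivector is left and the duration-dependent norm is
removed.  Whether the nnet benefits from that is exactly the open
question.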

Cheers,

---david

-- 
David van Leeuwen