From: Amit B. <ami...@gm...> - 2015-07-22 09:59:13
Hi,

I've been using online_nnet2_decoder for quite some time now for ASR in a dialogue system, where some users are returning users. Naturally, we use online i-vector extraction to better recognize each user's speech. Unfortunately, we have found some cases where the extracted i-vector decreases the performance of the decoder, usually producing 0 or 1 words (something like "a", "i", or "yea") instead of recognizing the whole utterance. Usually, the degraded performance lasts for 5-6 utterances (each 1-3 seconds) until a good i-vector is "recovered".

I would be grateful if anyone on the list could help with some of the following questions:

1. Is this a bug, or can i-vectors behave this way (for no apparent reason, when listening to the audio)?

2. Is there a reliable way of telling when the i-vector is problematic (other than checking the lengths of the utterance and the transcription)? What would be a good method for updating the adaptation state (based on confidence, utterance length)?

3. Is it possible to separate the i-vector into features that are user-specific (like tone) and features that are environment-specific (like noise)? If so, I would probably want to "forget" the environment-specific features and keep only the user-specific ones when the utterances are not consecutive.

I was wondering if there is a way to "understand" the changes in the adaptation state, for a non-expert in signal processing like me :)

Thanks,
Beka