From: Amit B. <ami...@gm...> - 2015-07-22 09:59:13
Hi,

I've been using online_nnet2_decoder for quite some time now for ASR in a dialogue system, where some users are returning users. Naturally, we use online i-vector extraction to better recognize each user's speech. Unfortunately, we have found some cases where the extracted i-vector decreases the performance of the decoder, usually producing 0 or 1 words (something like "a", "i", or "yea") instead of recognizing the whole utterance. Usually, the degraded performance lasts for 5-6 utterances (each 1-3 seconds) until a good i-vector is "recovered".

I would be grateful if anyone on the list could help with some of the following questions:

1. Is this a bug, or can i-vectors behave this way (for no apparent reason, when listening to the audio)?

2. Is there a reliable way of telling when the i-vector is problematic (other than checking the lengths of the utterance and the transcription)? What would be a good method for updating the adaptation state (based on confidence, utterance length)?

3. Is it possible to separate the i-vector into features that are user-specific (like tone) and features that are environment-specific (like noise)? If so, I would probably want to "forget" the environment-specific features and keep only the user-specific ones when the utterances are not consecutive.

I was wondering if there is a way to "understand" the changes in the adaptation state, for a non-expert in signal processing like me :)

Thanks,
Beka