From: Nagendra G. <nag...@go...> - 2015-07-22 12:32:18
I can only tell from experience that i-vector adaptation affects a word or two
(significantly) at most, and adapts reasonably by that time. So if 5-6
utterances are affected, the problem may be somewhere else. Try shuffling the
decoding order (offline, of course) and see if you find a pattern.

Nagendra

On Wed, Jul 22, 2015 at 8:26 AM, Amit Beka <ami...@gm...> wrote:
> I have listened to the recordings themselves (after VAD), and they all
> sound good: they were recorded with the same background noise (almost
> none), with the same speaker, and at the same volume.
>
> I use the nnet2-online-latgen-faster decoder, and although my LM doesn't
> really suit the input, I expect it to give me at least *some* words as
> output.
>
> On Wed, Jul 22, 2015 at 1:20 PM, Nagendra Goel <nag...@go...> wrote:
>
>> From your description this does not sound like a faulty i-vector. The
>> i-vector might play a small role, but you should first look for problems
>> elsewhere. Maybe the recording itself goes bad?
>>
>> Nagendra Kumar Goel
>> On Jul 22, 2015 6:00 AM, "Amit Beka" <ami...@gm...> wrote:
>>
>>> Hi,
>>>
>>> I've been using online_nnet2_decoder for quite some time now for ASR
>>> in a dialogue system where some users are returning users. Naturally,
>>> we use online i-vector extraction to better recognize each user's
>>> speech.
>>>
>>> Unfortunately, we have found some cases where the extracted i-vector
>>> degrades the decoder's performance, usually by recognizing zero or one
>>> word (something like 'a', 'i', or 'yea') instead of the whole
>>> utterance. Usually, the degraded performance lasts for 5-6 utterances
>>> (each 1-3 seconds) until a good i-vector is "recovered".
>>>
>>> I would be grateful if anyone on the list could help with some of the
>>> following questions:
>>>
>>> 1. Is this a bug, or can i-vectors behave this way (for no apparent
>>> reason, even when the audio sounds fine)?
>>>
>>> 2. Is there a reliable way to tell when the i-vector is problematic
>>> (other than checking the lengths of the utterance and the
>>> transcription)? What would be a good policy for updating the
>>> adaptation state (based on confidence, utterance length)?
>>>
>>> 3. Is it possible to separate the i-vector into features that are
>>> user-specific (like tone) and features that are environment-specific
>>> (like noise)? If so, I would probably want to "forget" the
>>> environment-specific features and keep only the user-specific ones
>>> when the utterances are not consecutive.
>>>
>>> I was wondering if there is a way to "understand" the changes in the
>>> adaptation state, for a non-expert in signal processing like me :)
>>>
>>> Thanks,
>>> Beka
>>>
>>> _______________________________________________
>>> Kaldi-users mailing list
>>> Kal...@li...
>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users
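The length-based check mentioned in question 2 can be sketched as a simple heuristic outside of Kaldi: compare the number of recognized words to the utterance duration, and only fold an utterance into the speaker's adaptation state when the decode looks plausible. This is a minimal illustration, not part of the Kaldi API; the function names and the threshold are hypothetical guesses.

```python
# Hypothetical sketch of a length-based sanity check for online decoding:
# an utterance of several seconds that decodes to a single filler word
# (the failure mode described in the thread) is flagged and NOT used to
# update the i-vector adaptation state. Names and threshold are made up.

def words_per_second(transcript: str, duration_sec: float) -> float:
    """Rate of recognized words; a value near zero suggests a degenerate
    decode for normal conversational speech."""
    if duration_sec <= 0:
        return 0.0
    return len(transcript.split()) / duration_sec

def keep_adaptation_state(transcript: str, duration_sec: float,
                          min_wps: float = 0.5) -> bool:
    """Return True if the utterance looks trustworthy enough to carry its
    statistics into the next utterance's adaptation state. The 0.5
    words-per-second threshold is an arbitrary illustrative choice."""
    return words_per_second(transcript, duration_sec) >= min_wps

# A 3-second utterance decoded as one filler word is suspicious:
print(keep_adaptation_state("yea", 3.0))                        # False
# A plausible transcript for the same duration passes the check:
print(keep_adaptation_state("turn on the lights please", 3.0))  # True
```

In practice one would combine this with lattice-based confidence scores rather than rely on word counts alone, and reset (rather than freeze) the adaptation state after several consecutive suspicious utterances.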