From: Amit B. <ami...@gm...> - 2015-07-22 14:58:37
Interestingly, disabling the i-vector eliminates the problem of short
transcriptions (these 'yea', 'a', etc.) and produces better ones. I will
investigate it further. Thanks for your help, Nagendra!

On Wed, Jul 22, 2015 at 3:32 PM, Nagendra Goel <nag...@go...> wrote:

> I can only tell from experience that iVector adaptation affects a word or
> two (significantly) at most, and adapts reasonably by that time. So if
> 5-6 utterances are affected, the problem may be somewhere else. Try
> shuffling the order of decoding (offline, of course) and see if you find
> a pattern.
>
> Nagendra
>
> On Wed, Jul 22, 2015 at 8:26 AM, Amit Beka <ami...@gm...> wrote:
>
>> I have listened to the recordings themselves (after VAD), and they all
>> sound good; they were recorded with the same speaker, at the same
>> volume, and with the same background noise (almost none).
>>
>> I use the nnet2-online-latgen-faster decoder, and although my LM doesn't
>> really suit the input, I expect it to give me at least *some* words as
>> output.
>>
>> On Wed, Jul 22, 2015 at 1:20 PM, Nagendra Goel <nag...@go...> wrote:
>>
>>> From your description this does not sound like a faulty i-vector. The
>>> i-vector might play a small role, but you should first look for
>>> problems elsewhere. Maybe the recording itself goes bad?
>>>
>>> Nagendra Kumar Goel
>>>
>>> On Jul 22, 2015 6:00 AM, "Amit Beka" <ami...@gm...> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've been using online_nnet2_decoder for quite some time now for ASR
>>>> in a dialogue system where some users are returning users. Naturally,
>>>> we use online i-vector extraction to better recognize each user's
>>>> speech.
>>>>
>>>> Unfortunately, we have found some cases where the extracted i-vector
>>>> degrades the decoder's performance, usually yielding zero or one words
>>>> (something like 'a', 'i', or 'yea') instead of recognizing the whole
>>>> utterance.
>>>> Usually, the degraded performance lasts for 5-6 utterances (each is
>>>> 1-3 seconds) until a good i-vector is "recovered".
>>>>
>>>> I would be grateful if anyone on the list could help with some of the
>>>> following questions:
>>>>
>>>> 1. Is this a bug, or can i-vectors behave this way (for no apparent
>>>> reason, when listening to the audio)?
>>>>
>>>> 2. Is there a reliable way to tell when the i-vector is problematic
>>>> (other than comparing the lengths of the utterance and the
>>>> transcription)? What would be a good policy for updating the
>>>> adaptation state (based on confidence, length of utterance)?
>>>>
>>>> 3. Is it possible to separate the i-vector into features that are
>>>> user-specific (like tone) and features that are environment-specific
>>>> (like noise)? If so, I would probably want to "forget" the
>>>> environment-specific features and keep only the user-specific ones
>>>> when the utterances are not consecutive.
>>>>
>>>> I was wondering if there is a way to "understand" the changes in the
>>>> adaptation state, for a non-expert in signal processing like me :)
>>>>
>>>> Thanks,
>>>> Beka
>>>>
>>>> _______________________________________________
>>>> Kaldi-users mailing list
>>>> Kal...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users
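
A follow-up note for the archive: Amit's question 2 (detecting a problematic i-vector from the mismatch between utterance length and transcription length) can be sketched as a simple word-rate check. This is a hypothetical heuristic, not part of Kaldi; the function name and the 0.5 words-per-second threshold are illustrative assumptions you would tune on your own data.

```python
# Hypothetical heuristic: an utterance several seconds long that decodes
# to almost no words is suspicious, so one could skip updating the
# i-vector adaptation state for such a decode.
def looks_degraded(duration_sec, hyp_words, min_words_per_sec=0.5):
    """Flag decodes whose word rate is implausibly low for real speech."""
    if duration_sec <= 0:
        return False  # nothing to judge on an empty segment
    return len(hyp_words) / duration_sec < min_words_per_sec

# A 3-second utterance decoded as just 'yea' trips the check;
# a normal-rate decode does not.
assert looks_degraded(3.0, ["yea"])
assert not looks_degraded(2.0, ["this", "is", "a", "test"])
```

A confidence-weighted version (e.g. gating on lattice posterior as well as word rate) would be more robust, but even this crude ratio catches the 'yea'/'a' failure mode described above.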
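
Nagendra's suggestion to shuffle the offline decode order can be sketched as a small script that permutes a Kaldi-style scp file deterministically. The file names and helper are hypothetical; the point is only that a fixed seed makes the experiment repeatable, so you can check whether the same utterances fail regardless of their position in the decode order.

```python
import random

# Hypothetical sketch: reorder the lines of a Kaldi-style wav.scp so an
# offline decode processes utterances in a random but reproducible order.
# If the same 5-6 utterances still fail after shuffling, the online
# i-vector adaptation state is probably not the cause.
def shuffle_scp(lines, seed=0):
    """Return the scp lines in a new random order, deterministic per seed."""
    shuffled = list(lines)          # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled

lines = ["utt1 a.wav", "utt2 b.wav", "utt3 c.wav"]
reordered = shuffle_scp(lines, seed=7)
```

Decoding the shuffled list with the same model and comparing which utterances come out short is exactly the "look for a pattern" experiment described above.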