From: Amit B. <ami...@gm...> - 2015-07-22 14:58:37
Interestingly, disabling the i-vector eliminates the problem of short
transcriptions (these 'yea', 'a', etc.) and produces better ones. I will
investigate it further. Thanks for your help, Nagendra!

On Wed, Jul 22, 2015 at 3:32 PM, Nagendra Goel <nag...@go...> wrote:

> I can only tell from experience that iVector adaptation affects a word or
> two (significantly) at most, and adapts reasonably by that time. So if
> 5-6 utterances are affected, the problem may be somewhere else. Try
> shuffling the order of decoding (offline, of course) and see if you find
> a pattern.
>
> Nagendra
>
> On Wed, Jul 22, 2015 at 8:26 AM, Amit Beka <ami...@gm...> wrote:
>
>> I have listened to the recordings themselves (after VAD), and they all
>> sound good; they were recorded with the same speaker, at the same
>> volume, and with the same background noise (almost none).
>>
>> I use the nnet2-online-latgen-faster decoder, and although my LM doesn't
>> really suit the input, I expect it to give me at least *some* words as
>> output.
>>
>> On Wed, Jul 22, 2015 at 1:20 PM, Nagendra Goel <nag...@go...> wrote:
>>
>>> From your description this does not sound like a faulty i-vector. The
>>> i-vector might play a small role, but you should first look for
>>> problems elsewhere. Maybe the recording itself goes bad?
>>>
>>> Nagendra Kumar Goel
>>>
>>> On Jul 22, 2015 6:00 AM, "Amit Beka" <ami...@gm...> wrote:
>>>
>>>> Hi,
>>>>
>>>> I've been using online_nnet2_decoder for quite some time now for ASR
>>>> in a dialogue system where some users are returning users. Naturally,
>>>> we use online i-vector extraction to better recognize each user's
>>>> speech.
>>>>
>>>> Unfortunately, we have found some cases where the extracted i-vector
>>>> degrades the decoder's performance, usually yielding zero or one words
>>>> (something like 'a', 'i', or 'yea') instead of recognizing the whole
>>>> utterance.
>>>> Usually, the degraded performance lasts for 5-6 utterances (each is
>>>> 1-3 seconds) until a good i-vector is "recovered".
>>>>
>>>> I would be grateful if anyone on the list could help with some of the
>>>> following questions:
>>>>
>>>> 1. Is this a bug, or can i-vectors behave this way (for no apparent
>>>> reason, when listening to the audio)?
>>>>
>>>> 2. Is there a reliable way to tell when the i-vector is problematic
>>>> (other than comparing the lengths of the utterance and the
>>>> transcription)? What would be a good policy for updating the
>>>> adaptation state (based on confidence, length of utterance)?
>>>>
>>>> 3. Is it possible to separate the i-vector into features that are
>>>> user-specific (like tone) and features that are environment-specific
>>>> (like noise)? If so, I would probably want to "forget" the
>>>> environment-specific features and keep only the user-specific ones
>>>> when the utterances are not consecutive.
>>>>
>>>> I was wondering if there is a way to "understand" the changes in the
>>>> adaptation state, for a non-expert in signal processing like me :)
>>>>
>>>> Thanks,
>>>> Beka
>>>>
>>>> _______________________________________________
>>>> Kaldi-users mailing list
>>>> Kal...@li...
>>>> https://lists.sourceforge.net/lists/listinfo/kaldi-users
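
A follow-up note for the archive: Amit's question 2 (detecting a problematic i-vector from the mismatch between utterance length and transcription length) can be sketched as a simple word-rate check. This is a hypothetical heuristic, not part of Kaldi; the function name and the 0.5 words-per-second threshold are illustrative assumptions you would tune on your own data.

```python
# Hypothetical heuristic: an utterance several seconds long that decodes
# to almost no words is suspicious, so one could skip updating the
# i-vector adaptation state for such a decode.
def looks_degraded(duration_sec, hyp_words, min_words_per_sec=0.5):
    """Flag decodes whose word rate is implausibly low for real speech."""
    if duration_sec <= 0:
        return False  # nothing to judge on an empty segment
    return len(hyp_words) / duration_sec < min_words_per_sec

# A 3-second utterance decoded as just 'yea' trips the check;
# a normal-rate decode does not.
assert looks_degraded(3.0, ["yea"])
assert not looks_degraded(2.0, ["this", "is", "a", "test"])
```

A confidence-weighted version (e.g. gating on lattice posterior as well as word rate) would be more robust, but even this crude ratio catches the 'yea'/'a' failure mode described above.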
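
Nagendra's suggestion to shuffle the offline decode order can be sketched as a small script that permutes a Kaldi-style scp file deterministically. The file names and helper are hypothetical; the point is only that a fixed seed makes the experiment repeatable, so you can check whether the same utterances fail regardless of their position in the decode order.

```python
import random

# Hypothetical sketch: reorder the lines of a Kaldi-style wav.scp so an
# offline decode processes utterances in a random but reproducible order.
# If the same 5-6 utterances still fail after shuffling, the online
# i-vector adaptation state is probably not the cause.
def shuffle_scp(lines, seed=0):
    """Return the scp lines in a new random order, deterministic per seed."""
    shuffled = list(lines)          # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled

lines = ["utt1 a.wav", "utt2 b.wav", "utt3 c.wav"]
reordered = shuffle_scp(lines, seed=7)
```

Decoding the shuffled list with the same model and comparing which utterances come out short is exactly the "look for a pattern" experiment described above.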