From: Kirill K. <kir...@sm...> - 2015-03-24 21:04:21
Thank you very much Tony. A very sharp insight.

> people do not know what they are going
> to say when they open their mouths (which changes both the acoustic
> model and, err, the language model).

Ah, hear, hear!!!

 -kkm

> -----Original Message-----
> From: Tony Robinson [mailto:to...@ca...]
> Sent: 2015-03-24 1332
> To: kal...@li...
> Subject: Re: [Kaldi-users] changes for 8k-sampled speech
>
> On 24/03/15 19:33, Kirill Katsnelson wrote:
> > Thanks Dan. I would be surprised if it would only be 1%, but let's see!
>
> Bandwidth is not the issue in telephony speech. If you take 16kHz
> data, downsample the train and test and run everything again you'll
> see 10-15% relative degradation (this has held steady over 20 years).
> Given we are often at 9% WER on general stuff like TED talks then 1%
> absolute is about right.
>
> > I am concerned about another dimension: the quantization. In
> > telephony we are dealing not only with the reduced bandwidth, but also
> > the quantization noise from u-law compression (or A-law for the folk on
> > the other side of the pond). Essentially, to package 1 byte per sample
> > of speech, u- (or A-)law defines 256 logarithmically-spaced amplitude
> > values that the signal can only take. I am going to measure the
> > degradation from both effects, but can you also think of any "magic
> > numbers" (in mfcc.conf) that might need to be tweaked to deal with the
> > increased quantization noise? I doubt there are any, but I do want to
> > get the best possible result.
>
> I really doubt this is the problem. Take broadcast audio, and
> downsample it to 8kHz and you can easily hear the difference. Use
> G.711 when only a few samples saturate and you'll need a good audio
> setup to hear the difference - so I think there will be negligible WER
> difference.
>
> What really matters in telephony is the care taken to record the audio
> and the care taken to speak the audio. Telephony has noisy
> backgrounds, people using smartphones which have awful acoustics, sub $
> hardware, but most importantly, people do not know what they are going
> to say when they open their mouths (which changes both the acoustic
> model and, err, the language model).
>
> So if all you want to do is generate an 8kHz/G.711 Librivox system then
> that'll work fine, but don't expect it to work in a call centre
> environment.
>
> BTW, if you want crude telephony then the tedlium recipe is a better
> start. Not only does it have Karel's bottleneck/stacked DNN recipe but
> real soon now it'll have Cantab's LMs which give a 4-5% absolute
> improvement in performance. We built them for this task and so they
> are adapted to a TED style of presentation which isn't conversational
> telephony but is a lot closer to it than Librivox/Gutenberg.
>
>
> Tony
>
> --
> Speechmatics is a trading name of Cantab Research Limited
> We are hiring: www.speechmatics.com/careers
> Dr A J Robinson, Founder, Cantab Research Ltd
> Phone direct: 01223 778240 office: 01223 794497
> Company reg no GB 05697423, VAT reg no 925606030
> 51 Canterbury Street, Cambridge, CB4 3QG, UK
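
For anyone who wants to put rough numbers on the two points discussed in the thread above (the relative-to-absolute WER arithmetic, and how much noise 8-bit logarithmic quantization adds), here is a minimal Python sketch. It is only an illustration under assumptions: the function names are made up for this example, nothing here comes from a Kaldi recipe, and the companding uses the continuous mu-law formula with mu = 255 rather than the exact segmented G.711 encoding, so the SNR it reports is indicative only.

```python
import math

MU = 255  # mu-law parameter; assumed here, as commonly used for 8-bit telephony


def mulaw_encode(x, mu=MU):
    """Map a sample in [-1, 1] to one of 256 codes using continuous mu-law.

    Note: this is the textbook companding curve, not the segmented
    bit layout that real G.711 codecs use.
    """
    x = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    return int(round((y + 1.0) / 2.0 * 255))


def mulaw_decode(code, mu=MU):
    """Invert mulaw_encode: map a code in 0..255 back to a sample in [-1, 1]."""
    y = code / 255 * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def snr_db(signal, decoded):
    """Signal-to-quantization-noise ratio in dB."""
    power = sum(s * s for s in signal)
    noise = sum((s - d) ** 2 for s, d in zip(signal, decoded))
    return 10.0 * math.log10(power / noise)


def absolute_wer_increase(baseline_wer, relative_degradation):
    """Convert a relative WER degradation into absolute WER points."""
    return baseline_wer * relative_degradation


if __name__ == "__main__":
    # Hypothetical test tone: 440 Hz sine, half of full scale, 1 s at 8 kHz.
    rate = 8000
    tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
    roundtrip = [mulaw_decode(mulaw_encode(s)) for s in tone]
    print("quantization SNR (dB): %.1f" % snr_db(tone, roundtrip))

    # 10-15% relative degradation on a 9% WER baseline, as in the thread.
    for rel in (0.10, 0.15):
        print("relative %.0f%% -> absolute %.2f WER points"
              % (rel * 100, absolute_wer_increase(9.0, rel)))
```

A clean sine wave says nothing about WER, of course; the sketch is only a quick way to see how small the quantization noise is relative to the signal before running the real experiment on degraded train and test data.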