Re: [Kaldi-users] changes for 8k-sampled speech

Thank you very much Tony. A very sharp insight. 

> people do not know what they are going
> to say when they open their mouths (which changes both the acoustic
> model and, err, the language model).

Ah, hear, hear!!!

 -kkm

> -----Original Message-----
> From: Tony Robinson [mailto:to...@ca...]
> Sent: 2015-03-24 1332
> To: kal...@li...
> Subject: Re: [Kaldi-users] changes for 8k-sampled speech
> 
> On 24/03/15 19:33, Kirill Katsnelson wrote:
> > Thanks Dan. I would be surprised if it would only be 1%, but let's
> > see!
> 
> Bandwidth is not the issue in telephony speech.   If you take 16kHz
> data, downsample the train and test sets and run everything again,
> you'll see 10-15% relative degradation (this has held steady over 20
> years).   Given we are often at 9% WER on general stuff like TED
> talks, 1% absolute is about right.
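
A quick check of the relative-to-absolute arithmetic quoted above, as a minimal sketch. The 9% baseline and the 10-15% relative range are the figures from the message; the function name is made up for illustration.

```python
def absolute_degradation(baseline_wer, relative_degradation):
    """Absolute WER increase (in percentage points) for a given
    baseline WER (in percent) and relative degradation (as a fraction)."""
    return baseline_wer * relative_degradation


baseline = 9.0  # percent WER, the figure quoted for TED-style data
for rel in (0.10, 0.15):
    delta = absolute_degradation(baseline, rel)
    print(f"{rel:.0%} relative on {baseline:.0f}% WER: "
          f"+{delta:.2f} absolute, {baseline + delta:.2f}% total")
```

So a 10-15% relative degradation on a 9% baseline lands at roughly 0.9 to 1.35 points absolute, which is consistent with "about 1%".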
> 
> > I am concerned about another dimension: the quantization. In
> > telephony we are dealing not only with the reduced bandwidth, but
> > also the quantization noise from u-law compression (or A-law for
> > the folk on the other side of the pond). Essentially, to package 1
> > byte per sample of speech, u- (or A-)law defines 256
> > logarithmically-spaced amplitude values that the signal can only
> > take. I am going to measure the degradation from both effects, but
> > can you also think of any "magic numbers" (in mfcc.conf) that might
> > need to be tweaked to deal with the increased quantization noise? I
> > doubt there are any, but I do want to get the best possible result.
> 
> I really doubt this is the problem.   Take broadcast audio, and
> downsample it to 8kHz and you can easily hear the difference.   Use
> G.711 when only a few samples saturate and you'll need a good audio
> setup to hear the difference - so I think there will be negligible WER
> difference.
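
To get a feel for how much noise 256 logarithmically spaced levels add, here is a rough sketch that measures the round-trip SNR of a sine tone. It rests on some assumptions: it uses the continuous mu-law formula with mu = 255, not the segmented piecewise-linear tables that G.711 actually specifies, and the test tone, sample rate and function names are all made up for illustration. A plain 8-bit linear quantizer is included only for contrast.

```python
import math

MU = 255.0    # companding constant for the continuous mu-law formula
LEVELS = 256  # one byte per sample


def mulaw_encode(x):
    """Compress x in [-1, 1] logarithmically, then quantize to a code 0..255."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return round((y + 1.0) / 2.0 * (LEVELS - 1))


def mulaw_decode(code):
    """Map a code 0..255 back to an amplitude in [-1, 1]."""
    y = code / (LEVELS - 1) * 2.0 - 1.0
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def mulaw_roundtrip(x):
    return mulaw_decode(mulaw_encode(x))


def linear_roundtrip(x):
    """Uniform 8-bit quantizer over [-1, 1], for comparison only."""
    code = round((x + 1.0) / 2.0 * (LEVELS - 1))
    return code / (LEVELS - 1) * 2.0 - 1.0


def snr_db(amplitude, roundtrip, n=8000, rate=8000.0, freq=440.0):
    """Signal-to-quantization-noise ratio in dB for a sine test tone."""
    signal = noise = 0.0
    for i in range(n):
        x = amplitude * math.sin(2.0 * math.pi * freq * i / rate)
        err = x - roundtrip(x)
        signal += x * x
        noise += err * err
    return 10.0 * math.log10(signal / noise)


for amp in (0.5, 0.01):
    print(f"amplitude {amp}: mu-law {snr_db(amp, mulaw_roundtrip):.1f} dB, "
          f"linear 8-bit {snr_db(amp, linear_roundtrip):.1f} dB")
```

The point of the logarithmic spacing is that the SNR stays roughly constant across loud and quiet signals, whereas the linear quantizer degrades sharply at low amplitude. This says nothing about WER by itself; it only illustrates the size of the quantization noise being discussed.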
> 
> What really matters in telephony is the care taken to record the audio
> and the care taken to speak the audio.   Telephony has noisy
> backgrounds, people using smartphones which have awful acoustics, sub $
> hardware, but most importantly, people do not know what they are going
> to say when they open their mouths (which changes both the acoustic
> model and, err, the language model).
> 
> So if all you want to do is generate a 8kHz/G.711 Librivox system then
> that'll work fine, but don't expect it to work in a call centre
> environment.
> 
> BTW, if you want crude telephony then the tedlium recipe is a better
> start.   Not only does it have Karel's bottleneck/stacked DNN recipe,
> but real soon now it'll have Cantab's LMs, which give a 4-5% absolute
> improvement in performance.   We built them for this task and so they
> are adapted to a TED style of presentation, which isn't conversational
> telephony but is a lot closer to it than Librivox/Gutenberg.
> 
> 
> Tony
> 
> --
> Speechmatics is a trading name of Cantab Research Limited
> We are hiring: www.speechmatics.com/careers
> Dr A J Robinson, Founder, Cantab Research Ltd
> Phone direct: 01223 778240 office: 01223 794497
> Company reg no GB 05697423, VAT reg no 925606030
> 51 Canterbury Street, Cambridge, CB4 3QG, UK
> 