From: Daniel P. <dp...@gm...> - 2015-04-17 00:38:38
Thanks for the update. I had a look at your duration-model paper
https://phon.ioc.ee/dokuwiki/lib/exe/fetch.php?media=people:tanel:icassp2014-durmodel.pdf
and it's quite exciting that you were able to get so much improvement. It
would be good if you could put in the effort to "Kaldi-ify" your recipe,
i.e. get rid of the dependency on Python and external tools, and make it
compatible with the existing structure of the scripts. A good duration
model is a feature we need. I can get someone else who knows the nnet2
code (e.g. Vijay) to help with the core neural-net-training part of it, or
help myself, if you can do the other parts.

Dan

On Wed, Apr 15, 2015 at 10:25 AM, Tanel Alumäe <tan...@ph...> wrote:
> Hello everybody,
>
> Daniel asked me to give an update on the GStreamer-related work with
> Kaldi that I have been doing.
>
> GStreamer is a multimedia framework. It consists of different plugins
> (audio and video decoders and encoders, resamplers, effect modules,
> input/output modules) that can be combined into pipelines. GStreamer can
> be used via its GObject introspection bindings, so it can be driven from
> any programming language that supports GObject introspection, including
> Python, Ruby, Java and Vala.
>
> The Kaldi code base includes a GStreamer plugin that supports GMM
> models. More recently, I have also developed a similar plugin that
> supports "online DNN" models. It's available at
> https://github.com/alumae/gst-kaldi-nnet2-online. I am planning to
> maintain it as a separate project from Kaldi (I believe Daniel agrees).
> The plugin has very similar functionality to Kaldi's
> online2-wav-nnet2-latgen-faster, with some extensions. First, it can do
> on-the-fly audio segmentation based on silences in the audio. This uses
> Kaldi's endpointing code, but instead of terminating when an endpoint is
> encountered, the plugin simply starts decoding the next segment. It can
> also do language model rescoring, as in lattice-lmrescore.
>
> It's very easy to create GUI speech recognition applications using the
> plugin, or to apply it from the command line to e.g. transcribe a long
> audio file. Check the 'demo' folder on GitHub.
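> Roughly, driving the plugin from Python via GObject introspection looks
> like the sketch below. This is a minimal, untested sketch: the element,
> property and signal names are indicative only, and the full list of
> required properties (feature and i-vector configs, etc.) is documented in
> the plugin's README; the 'demo' folder has complete, working examples.
>
>     import gi
>     gi.require_version('Gst', '1.0')
>     from gi.repository import Gst
>
>     Gst.init(None)
>
>     # Decode a 16 kHz wav file and print hypotheses as they arrive.
>     # Element, property and signal names are indicative -- see the README.
>     pipeline = Gst.parse_launch(
>         "filesrc location=test.wav ! decodebin ! audioconvert "
>         "! audioresample ! kaldinnet2onlinedecoder name=asr ! fakesink")
>     asr = pipeline.get_by_name("asr")
>     asr.set_property("fst", "HCLG.fst")         # decoding graph
>     asr.set_property("model", "final.mdl")      # online nnet2 acoustic model
>     asr.set_property("word-syms", "words.txt")  # word symbol table
>
>     # Recognition results are emitted as GObject signals.
>     asr.connect("partial-result", lambda e, text: print("partial:", text))
>     asr.connect("final-result", lambda e, text: print("final:", text))
>
>     pipeline.set_state(Gst.State.PLAYING)
>     bus = pipeline.get_bus()
>     bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
>                            Gst.MessageType.EOS | Gst.MessageType.ERROR)
>     pipeline.set_state(Gst.State.NULL)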
> My other project is https://github.com/alumae/kaldi-gstreamer-server.
> It's a real-time, full-duplex speech recognition server built around
> Kaldi's GStreamer plugins. Features (copied from the README):
>
> * Full duplex communication based on websockets: speech goes in,
>   partial hypotheses come out (think of Android's voice typing); a rough
>   client sketch is given in the P.S. at the end of this message
> * Very scalable: the server consists of a master component and workers;
>   one worker is needed per concurrent recognition session, and workers
>   can be started and stopped independently of the master on remote
>   machines
> * Can do speech segmentation, i.e., a long speech signal is broken into
>   shorter segments based on silences
> * Supports arbitrarily long speech input (e.g., you can stream live
>   speech into it)
> * Supports Kaldi's GMM and "online DNN" models
> * Supports rescoring of the recognition lattice with a large language
>   model
> * Supports persisting the acoustic model adaptation state between
>   requests
> * Supports an unlimited set of audio codecs (in practice, those
>   supported by GStreamer)
> * Supports rewriting raw recognition results using external programs
>   (can be used for converting words to numbers, etc.)
> * Python, Java and Javascript clients are available
>
> We are using the server in several real-world speech recognition
> applications, mainly for the Estonian language. E.g., we have developed
> an Android application that acts as a speech-recognition-based "keyboard"
> (like Google's voice typing), and a radiology dictation application that
> achieves 5% WER in a real clinical environment.
>
> Not related to GStreamer, I also have an implementation of a novel phone
> duration model available on GitHub:
> https://github.com/alumae/kaldi-nnet-dur-model
> It's probably more interesting for researchers, but nevertheless, on the
> TEDLIUM test set it reduces WER from 11.7% to 11.0%, starting from the
> online multisplice speed-perturbed DNN system with Cantab large-LM
> rescoring.
>
> Regards,
> Tanel
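> P.S. To show roughly what a client has to do, here is a minimal Python
> sketch of the websocket protocol. The URL path, query string and JSON
> field names below are indicative only and should be checked against the
> server's README; the bundled Python client shows the full protocol.
>
>     import json
>     import threading
>     import websocket  # pip install websocket-client
>
>     # URL path, query string and JSON fields are indicative only --
>     # check the kaldi-gstreamer-server README for the exact protocol.
>     URL = ("ws://localhost:8888/client/ws/speech?content-type="
>            "audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,"
>            "+format=(string)S16LE,+channels=(int)1")
>
>     def on_message(ws, msg):
>         resp = json.loads(msg)
>         if resp.get("result"):
>             hyp = resp["result"]["hypotheses"][0]["transcript"]
>             tag = "FINAL" if resp["result"].get("final") else "partial"
>             print(tag + ":", hyp)
>
>     def on_open(ws):
>         def send_audio():
>             with open("test.raw", "rb") as f:   # 16 kHz, 16-bit mono PCM
>                 while True:
>                     block = f.read(8000)        # ~0.25 s of audio
>                     if not block:
>                         break
>                     ws.send(block, opcode=websocket.ABNF.OPCODE_BINARY)
>             ws.send("EOS")                      # signal end of stream
>         threading.Thread(target=send_audio).start()
>
>     ws = websocket.WebSocketApp(URL, on_message=on_message, on_open=on_open)
>     ws.run_forever()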