From: Daniel P. <dp...@gm...> - 2015-04-17 00:38:38
Thanks for the update. I had a look at your duration-model paper
https://phon.ioc.ee/dokuwiki/lib/exe/fetch.php?media=people:tanel:icassp2014-durmodel.pdf
and it's quite exciting that you were able to get so much improvement. It
would be good if you could put in the effort to "Kaldi-ify" your recipe,
i.e. get rid of the dependency on Python and external tools, and make it
compatible with the existing structure of the scripts. A good duration
model is a feature we need. I can get someone else who knows the nnet2
code (e.g. Vijay) to help with the core neural-net-training part of it, or
help myself, if you can do the other parts.

Dan

On Wed, Apr 15, 2015 at 10:25 AM, Tanel Alumäe <tan...@ph...> wrote:
> Hello everybody,
>
> Daniel asked me to give an update on the GStreamer-related work with
> Kaldi that I have been doing.
>
> GStreamer is a multimedia framework. It consists of different plugins
> (audio and video decoders and encoders, resamplers, effect modules,
> input/output modules) that can be combined into pipelines. GStreamer can
> be used via its GObject introspection bindings, so it can be driven from
> any programming language that supports GObject introspection, including
> Python, Ruby, Java and Vala.
>
> The Kaldi code base includes a GStreamer plugin that supports GMM
> models. More recently, I have also developed a similar plugin that
> supports "online DNN" models. It's available at
> https://github.com/alumae/gst-kaldi-nnet2-online. I am planning to
> maintain it as a separate project from Kaldi (I believe Daniel agrees).
> The plugin has very similar functionality to Kaldi's
> online2-wav-nnet2-latgen-faster, with some extensions. First, it can do
> on-the-fly audio segmentation based on silences in the audio. This uses
> Kaldi's endpointing code, but instead of terminating when an endpoint is
> encountered, the plugin simply starts decoding the next segment. It can
> also do language model rescoring, as in lattice-lmrescore.
>
> It's very easy to create GUI speech recognition applications using the
> plugin, or to apply it from the command line to e.g. transcribe a long
> audio file. Check the 'demo' folder on GitHub.
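> Roughly, driving the plugin from Python via GObject introspection looks
> like the sketch below. This is a minimal, untested sketch: the element,
> property and signal names are indicative only, and the full list of
> required properties (feature and i-vector configs, etc.) is documented in
> the plugin's README; the 'demo' folder has complete, working examples.
>
>     import gi
>     gi.require_version('Gst', '1.0')
>     from gi.repository import Gst
>
>     Gst.init(None)
>
>     # Decode a 16 kHz wav file and print hypotheses as they arrive.
>     # Element, property and signal names are indicative -- see the README.
>     pipeline = Gst.parse_launch(
>         "filesrc location=test.wav ! decodebin ! audioconvert "
>         "! audioresample ! kaldinnet2onlinedecoder name=asr ! fakesink")
>     asr = pipeline.get_by_name("asr")
>     asr.set_property("fst", "HCLG.fst")         # decoding graph
>     asr.set_property("model", "final.mdl")      # online nnet2 acoustic model
>     asr.set_property("word-syms", "words.txt")  # word symbol table
>
>     # Recognition results are emitted as GObject signals.
>     asr.connect("partial-result", lambda e, text: print("partial:", text))
>     asr.connect("final-result", lambda e, text: print("final:", text))
>
>     pipeline.set_state(Gst.State.PLAYING)
>     bus = pipeline.get_bus()
>     bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
>                            Gst.MessageType.EOS | Gst.MessageType.ERROR)
>     pipeline.set_state(Gst.State.NULL)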
> My other project is https://github.com/alumae/kaldi-gstreamer-server.
> It's a real-time, full-duplex speech recognition server built around
> Kaldi's GStreamer plugins. Features (copied from the README):
>
> * Full duplex communication based on websockets: speech goes in,
>   partial hypotheses come out (think of Android's voice typing); a rough
>   client sketch is given in the P.S. at the end of this message
> * Very scalable: the server consists of a master component and workers;
>   one worker is needed per concurrent recognition session, and workers
>   can be started and stopped independently of the master on remote
>   machines
> * Can do speech segmentation, i.e., a long speech signal is broken into
>   shorter segments based on silences
> * Supports arbitrarily long speech input (e.g., you can stream live
>   speech into it)
> * Supports Kaldi's GMM and "online DNN" models
> * Supports rescoring of the recognition lattice with a large language
>   model
> * Supports persisting the acoustic model adaptation state between
>   requests
> * Supports an unlimited set of audio codecs (in practice, those
>   supported by GStreamer)
> * Supports rewriting raw recognition results using external programs
>   (can be used for converting words to numbers, etc.)
> * Python, Java and Javascript clients are available
>
> We are using the server in several real-world speech recognition
> applications, mainly for the Estonian language. E.g., we have developed
> an Android application that acts as a speech-recognition-based "keyboard"
> (like Google's voice typing), and a radiology dictation application that
> achieves 5% WER in a real clinical environment.
>
> Not related to GStreamer, I also have an implementation of a novel phone
> duration model available on GitHub:
> https://github.com/alumae/kaldi-nnet-dur-model
> It's probably more interesting for researchers, but nevertheless, on the
> TEDLIUM test set it reduces WER from 11.7% to 11.0%, starting from the
> online multisplice speed-perturbed DNN system with Cantab large-LM
> rescoring.
>
> Regards,
> Tanel
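> P.S. To show roughly what a client has to do, here is a minimal Python
> sketch of the websocket protocol. The URL path, query string and JSON
> field names below are indicative only and should be checked against the
> server's README; the bundled Python client shows the full protocol.
>
>     import json
>     import threading
>     import websocket  # pip install websocket-client
>
>     # URL path, query string and JSON fields are indicative only --
>     # check the kaldi-gstreamer-server README for the exact protocol.
>     URL = ("ws://localhost:8888/client/ws/speech?content-type="
>            "audio/x-raw,+layout=(string)interleaved,+rate=(int)16000,"
>            "+format=(string)S16LE,+channels=(int)1")
>
>     def on_message(ws, msg):
>         resp = json.loads(msg)
>         if resp.get("result"):
>             hyp = resp["result"]["hypotheses"][0]["transcript"]
>             tag = "FINAL" if resp["result"].get("final") else "partial"
>             print(tag + ":", hyp)
>
>     def on_open(ws):
>         def send_audio():
>             with open("test.raw", "rb") as f:   # 16 kHz, 16-bit mono PCM
>                 while True:
>                     block = f.read(8000)        # ~0.25 s of audio
>                     if not block:
>                         break
>                     ws.send(block, opcode=websocket.ABNF.OPCODE_BINARY)
>             ws.send("EOS")                      # signal end of stream
>         threading.Thread(target=send_audio).start()
>
>     ws = websocket.WebSocketApp(URL, on_message=on_message, on_open=on_open)
>     ws.run_forever()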