From: Tanel A. <tan...@ph...> - 2015-04-15 14:42:27
Hello everybody,

Daniel asked me to give an update on the GStreamer-related work with Kaldi that I have been doing.

GStreamer is a multimedia framework. It consists of plugins (audio and video decoders and encoders, resamplers, effect modules, input/output modules) that can be assembled into pipelines. GStreamer can be used via its GObject introspection bindings, so it is available from any programming language that supports GObject introspection, including Python, Ruby, Java and Vala.

The Kaldi code base includes a GStreamer plugin that supports GMM models. More recently, I have also developed a similar plugin that supports "online DNN" models. It's available at https://github.com/alumae/gst-kaldi-nnet2-online. I am planning to maintain it as a separate project from Kaldi (I believe Daniel agrees).

The plugin has very similar functionality to Kaldi's online2-wav-nnet2-latgen-faster, with some extensions. First, it can do on-the-fly audio segmentation based on silences in the audio. This builds on the endpointing code in Kaldi's nnet2 code, but instead of terminating when an endpoint is encountered, it simply starts decoding the next segment. It can also do language model rescoring, as in lattice-lmrescore. It's very easy to create GUI speech recognition applications using the plugin, or to apply it from the command line, e.g. to transcribe a long audio file. Check the 'demo' folder on GitHub (and see the P.S. below for a couple of quick sketches).

My other project is https://github.com/alumae/kaldi-gstreamer-server. It's a real-time full-duplex speech recognition server, built around Kaldi's GStreamer plugins. Features (copied from the README):

* Full duplex communication based on websockets: speech goes in, partial hypotheses come out (think of Android's voice typing)
* Very scalable: the server consists of a master component and workers; one worker is needed per concurrent recognition session; workers can be started and stopped independently of the master on remote machines
* Can do speech segmentation, i.e., a long speech signal is broken into shorter segments based on silences
* Supports arbitrarily long speech input (e.g., you can stream live speech into it)
* Supports Kaldi's GMM and "online DNN" models
* Supports rescoring of the recognition lattice with a large language model
* Supports persisting the acoustic model adaptation state between requests
* Supports an unlimited set of audio codecs (in practice, those supported by GStreamer)
* Supports rewriting raw recognition results using external programs (can be used for converting words to numbers, etc.)
* Python, Java and Javascript clients are available

We are using the server in several real-world speech recognition applications, mainly for the Estonian language. For example, we have developed an Android application that can act as a speech-recognition-based "keyboard" (like Google's voice typing), and the server is also used in a radiology dictation application that achieves 5% WER in a real clinical environment.

Not related to GStreamer, I also have an implementation of a novel phone duration model available on GitHub: https://github.com/alumae/kaldi-nnet-dur-model. It's probably more interesting for researchers, but nevertheless, on the TEDLIUM test set it reduces the WER of the online multisplice speed-perturbed DNN system with Cantab large-LM rescoring from 11.7% to 11.0%.

Regards,
Tanel
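P.S. A few quick sketches for anyone who wants to experiment; treat them as illustrations rather than tested code. First, this is roughly what using GStreamer through its GObject introspection (PyGObject) bindings looks like in Python. The pipeline just transcodes an MP3 file to WAV; the file names are of course placeholders:

  import gi
  gi.require_version('Gst', '1.0')
  from gi.repository import Gst

  Gst.init(None)
  # Build a pipeline from a textual description, just like gst-launch-1.0
  pipeline = Gst.parse_launch(
      'filesrc location=input.mp3 ! decodebin ! audioconvert ! '
      'audioresample ! wavenc ! filesink location=output.wav')
  pipeline.set_state(Gst.State.PLAYING)
  # Wait until the stream ends or an error occurs, then clean up
  bus = pipeline.get_bus()
  bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                         Gst.MessageType.EOS | Gst.MessageType.ERROR)
  pipeline.set_state(Gst.State.NULL)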
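Second, a sketch of driving the new "online DNN" plugin from Python to transcribe a long file. The element is called kaldinnet2onlinedecoder; the property and signal names below follow the README at the time of writing, so double-check them against the current code, and the model/config paths are placeholders:

  import gi
  gi.require_version('Gst', '1.0')
  from gi.repository import Gst

  Gst.init(None)
  pipeline = Gst.parse_launch(
      'filesrc location=lecture.mp3 ! decodebin ! audioconvert '
      '! audioresample ! kaldinnet2onlinedecoder name=asr '
      'model=final.mdl fst=HCLG.fst word-syms=words.txt '
      'feature-type=mfcc mfcc-config=conf/mfcc.conf '
      'ivector-extraction-config=conf/ivector_extractor.conf '
      'do-endpointing=true ! fakesink')

  asr = pipeline.get_by_name('asr')
  # Partial hypotheses arrive while a segment is being decoded; a
  # final hypothesis is emitted at each detected endpoint.
  asr.connect('partial-result', lambda elem, text: print('partial: ' + text))
  asr.connect('final-result', lambda elem, text: print('final: ' + text))

  pipeline.set_state(Gst.State.PLAYING)
  bus = pipeline.get_bus()
  bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                         Gst.MessageType.EOS | Gst.MessageType.ERROR)
  pipeline.set_state(Gst.State.NULL)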
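Finally, talking to the server: a client streams binary audio chunks over a websocket, sends the string "EOS" when done, and reads back JSON messages containing partial and final hypotheses. A minimal client along the lines of the bundled Python one (using the websocket-client package; the host, port and audio file are placeholders, and the file can be anything the worker's GStreamer pipeline can decode):

  import json
  import websocket  # pip install websocket-client

  ws = websocket.create_connection('ws://localhost:8888/client/ws/speech')

  with open('test.wav', 'rb') as f:
      while True:
          chunk = f.read(8000)      # stream in small chunks, as if live
          if not chunk:
              break
          ws.send_binary(chunk)
  ws.send('EOS')                    # no more audio coming

  # Partial and final hypotheses arrive as JSON until the session ends
  while True:
      try:
          raw = ws.recv()
      except websocket.WebSocketConnectionClosedException:
          break
      if not raw:
          break
      msg = json.loads(raw)
      if 'result' in msg:
          kind = 'final' if msg['result'].get('final') else 'partial'
          print(kind + ': ' + msg['result']['hypotheses'][0]['transcript'])
  ws.close()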