From: Tanel A. <tan...@ph...> - 2015-04-15 14:42:27
Hello everybody,

Daniel asked me to give an update on the GStreamer-related work with Kaldi that I have been doing.

GStreamer is a multimedia framework. It consists of plugins (audio and video decoders and encoders, resamplers, effect modules, input/output modules) that can be assembled into pipelines. GStreamer can be used via its GObject introspection bindings, so it is available from any programming language that supports GObject introspection, including Python, Ruby, Java and Vala.

The Kaldi code base includes a GStreamer plugin that supports GMM models. More recently, I have also developed a similar plugin that supports "online DNN" models. It's available at https://github.com/alumae/gst-kaldi-nnet2-online. I am planning to maintain it as a separate project from Kaldi (I believe Daniel agrees).

The plugin has very similar functionality to Kaldi's online2-wav-nnet2-latgen-faster, with some extensions. First, it can do on-the-fly audio segmentation based on silences in the audio. This builds on the endpointing code in Kaldi's nnet2 code, but instead of terminating when an endpoint is encountered, it simply starts decoding the next segment. It can also do language model rescoring, as in lattice-lmrescore. It's very easy to create GUI speech recognition applications using the plugin, or to apply it from the command line, e.g. to transcribe a long audio file. Check the 'demo' folder on GitHub (and see the P.S. below for a couple of quick sketches).

My other project is https://github.com/alumae/kaldi-gstreamer-server. It's a real-time full-duplex speech recognition server, built around Kaldi's GStreamer plugins. Features (copied from the README):

* Full duplex communication based on websockets: speech goes in, partial hypotheses come out (think of Android's voice typing)
* Very scalable: the server consists of a master component and workers; one worker is needed per concurrent recognition session; workers can be started and stopped independently of the master on remote machines
* Can do speech segmentation, i.e., a long speech signal is broken into shorter segments based on silences
* Supports arbitrarily long speech input (e.g., you can stream live speech into it)
* Supports Kaldi's GMM and "online DNN" models
* Supports rescoring of the recognition lattice with a large language model
* Supports persisting the acoustic model adaptation state between requests
* Supports an unlimited set of audio codecs (in practice, those supported by GStreamer)
* Supports rewriting raw recognition results using external programs (can be used for converting words to numbers, etc.)
* Python, Java and Javascript clients are available

We are using the server in several real-world speech recognition applications, mainly for the Estonian language. For example, we have developed an Android application that can act as a speech-recognition-based "keyboard" (like Google's voice typing), and the server is also used in a radiology dictation application that achieves 5% WER in a real clinical environment.

Not related to GStreamer, I also have an implementation of a novel phone duration model available on GitHub: https://github.com/alumae/kaldi-nnet-dur-model. It's probably more interesting for researchers, but nevertheless, on the TEDLIUM test set it reduces the WER of the online multisplice speed-perturbed DNN system with Cantab large-LM rescoring from 11.7% to 11.0%.

Regards,
Tanel
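P.S. A few quick sketches for anyone who wants to experiment; treat them as illustrations rather than tested code. First, this is roughly what using GStreamer through its GObject introspection (PyGObject) bindings looks like in Python. The pipeline just transcodes an MP3 file to WAV; the file names are of course placeholders:

  import gi
  gi.require_version('Gst', '1.0')
  from gi.repository import Gst

  Gst.init(None)
  # Build a pipeline from a textual description, just like gst-launch-1.0
  pipeline = Gst.parse_launch(
      'filesrc location=input.mp3 ! decodebin ! audioconvert ! '
      'audioresample ! wavenc ! filesink location=output.wav')
  pipeline.set_state(Gst.State.PLAYING)
  # Wait until the stream ends or an error occurs, then clean up
  bus = pipeline.get_bus()
  bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                         Gst.MessageType.EOS | Gst.MessageType.ERROR)
  pipeline.set_state(Gst.State.NULL)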
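Second, a sketch of driving the new "online DNN" plugin from Python to transcribe a long file. The element is called kaldinnet2onlinedecoder; the property and signal names below follow the README at the time of writing, so double-check them against the current code, and the model/config paths are placeholders:

  import gi
  gi.require_version('Gst', '1.0')
  from gi.repository import Gst

  Gst.init(None)
  pipeline = Gst.parse_launch(
      'filesrc location=lecture.mp3 ! decodebin ! audioconvert '
      '! audioresample ! kaldinnet2onlinedecoder name=asr '
      'model=final.mdl fst=HCLG.fst word-syms=words.txt '
      'feature-type=mfcc mfcc-config=conf/mfcc.conf '
      'ivector-extraction-config=conf/ivector_extractor.conf '
      'do-endpointing=true ! fakesink')

  asr = pipeline.get_by_name('asr')
  # Partial hypotheses arrive while a segment is being decoded; a
  # final hypothesis is emitted at each detected endpoint.
  asr.connect('partial-result', lambda elem, text: print('partial: ' + text))
  asr.connect('final-result', lambda elem, text: print('final: ' + text))

  pipeline.set_state(Gst.State.PLAYING)
  bus = pipeline.get_bus()
  bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                         Gst.MessageType.EOS | Gst.MessageType.ERROR)
  pipeline.set_state(Gst.State.NULL)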
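Finally, talking to the server: a client streams binary audio chunks over a websocket, sends the string "EOS" when done, and reads back JSON messages containing partial and final hypotheses. A minimal client along the lines of the bundled Python one (using the websocket-client package; the host, port and audio file are placeholders, and the file can be anything the worker's GStreamer pipeline can decode):

  import json
  import websocket  # pip install websocket-client

  ws = websocket.create_connection('ws://localhost:8888/client/ws/speech')

  with open('test.wav', 'rb') as f:
      while True:
          chunk = f.read(8000)      # stream in small chunks, as if live
          if not chunk:
              break
          ws.send_binary(chunk)
  ws.send('EOS')                    # no more audio coming

  # Partial and final hypotheses arrive as JSON until the session ends
  while True:
      try:
          raw = ws.recv()
      except websocket.WebSocketConnectionClosedException:
          break
      if not raw:
          break
      msg = json.loads(raw)
      if 'result' in msg:
          kind = 'final' if msg['result'].get('final') else 'partial'
          print(kind + ': ' + msg['result']['hypotheses'][0]['transcript'])
  ws.close()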