
using Kaldi with ILA voice assistant?

Florian
2015-04-10
2015-04-13
  • Florian

    Florian - 2015-04-10

    Hello everybody,

    I've been developing a multi-platform (Java-based), user-customizable voice assistant for a while now, called ILA (Intelligent Learning Assistant).

    It's built modularly to support the best freely available speech recognition systems, and so far I've integrated Sphinx-4 and Pocketsphinx (and the Google API, but it's not really an option for the future). Both systems work pretty well, but mainly because I'm using a small vocabulary of around 600 words with a dynamic language model, and even then accuracy is still far from anything like Google, Apple, Microsoft etc. (especially when it comes to robustness in different environments).

    From all I've heard, Kaldi could make a real difference here, but I need some ideas on what's the best path to follow and how to get started. I hope you can help me with the following questions:

    1) What's the performance (server requirements for a single user) needed to get Kaldi running in (close to) real time with a large vocabulary (~50,000 words)?

    2) To be platform-independent, I think the only way to use Kaldi is to request transcription from a server. As far as I understand, kaldi-gstreamer-server is the thing to look for?

    3) What's the acoustic model to go for? Librispeech? Fisher? (It should be freely available, at best downloadable from a source like http://kaldi-asr.org - correct?)

    4) Has anyone tried to get Kaldi running with Java? ^^

    I've tried to answer some of these questions myself, but I'm still struggling with the installation of Kaldi. Currently I'm working my way up to "Running the example scripts", but so far I haven't succeeded in reaching a point where I can actually transcribe speech. I don't want to train my own model right now, but somehow the tutorial seems to end there ...
    Is there something like a quick guide where I can just use default settings and pre-compiled acoustic and language models to get started?

    Thanks in advance for all the help and ideas!
    - Florian

     

    Last edit: Florian 2015-04-10
    • Daniel Povey

      Daniel Povey - 2015-04-10

      1) What's the performance (server requirements for a single user) needed
      to get Kaldi running in (close to) real time with a large vocabulary
      (~50,000 words)?

      It can be done using 1 fairly normal modern CPU, but allocating 2 CPUs
      (using the online-nnet2-threaded setup) will give you a bit more headroom.
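
      By the way, if you just want to sanity-check a downloaded model before
      worrying about servers, a single-utterance decode looks roughly like
      this when driven from Java (only a sketch: the binary and options
      follow the online-nnet2 demo recipes, the threaded variant is called
      online2-wav-nnet2-latgen-threaded, and all file paths are placeholders
      for wherever you unpacked the model):

          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Paths;
          import java.util.Arrays;

          // Sketch: run Kaldi's online-nnet2 decoder on a single utterance.
          public class DecodeOnce {
              public static void main(String[] args)
                      throws IOException, InterruptedException {
                  // Kaldi reads its inputs from small "table" files.
                  Files.write(Paths.get("spk2utt"), Arrays.asList("utt1 utt1"));
                  Files.write(Paths.get("wav.scp"),
                          Arrays.asList("utt1 test.wav")); // 16 kHz mono WAV

                  ProcessBuilder pb = new ProcessBuilder(
                          "online2-wav-nnet2-latgen-faster",
                          "--online=true",
                          "--do-endpointing=false",
                          "--config=conf/online_nnet2_decoding.conf",
                          "--word-symbol-table=words.txt",
                          "final.mdl",      // acoustic model
                          "HCLG.fst",       // decoding graph
                          "ark:spk2utt",
                          "scp:wav.scp",
                          "ark:/dev/null"); // lattices discarded; the
                                            // recognized words are logged
                  pb.inheritIO();
                  System.exit(pb.start().waitFor());
              }
          }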

      2) To be platform-independent, I think the only way to use Kaldi is to
      request transcription from a server. As far as I understand,
      kaldi-gstreamer-server is the thing to look for?

      I don't know much about that and I'm not sure how up-to-date it is (i.e.
      whether it works with the latest Kaldi code).

      3) What's the acoustic model to go for? Librispeech? Fisher? (It should
      be freely available, at best downloadable from a source like
      http://kaldi-asr.org - correct?)

      Depends on whether your data is 8 kHz or 16 kHz. Probably 16 kHz
      (Librispeech) is better if your data is high-bandwidth. It makes sense to
      at least build the language models for your domain, though.

      4) Has anyone tried to get Kaldi running with Java? ^^

      I think the gstreamer stuff might use Java, and there must have been some
      approach to do the wrapping, but I don't recall the details.

      Dan

       
  • Florian

    Florian - 2015-04-11

    It can be done using 1 fairly normal modern CPU, but allocating 2 CPUs
    (using the online-nnet2-threaded setup) will give you a bit more headroom

    Great! :-) I was afraid all that DNN stuff would require more computing power ^^

    I don't know much about that and I'm not sure how up-to-date it is

    Basically I just need a way to send/receive data to/from a Kaldi server. The HTTP API of kaldi-gstreamer-server with PUT and POST requests seems to be a good way, but I haven't looked at the details yet. Any other suggestions?

    Depends on whether your data is 8 kHz or 16 kHz. Probably 16 kHz
    (Librispeech) is better if your data is high-bandwidth. It makes sense to
    at least build the language models for your domain, though.

    Usually 16 kHz is the preferred option. In ILA I use a dynamic 3-gram language model (dynamic because it grows with the amount of stuff the user teaches ILA). It's compatible with Sphinx and Pocketsphinx; I assume one could convert it to Kaldi format?
    The Librispeech model is the 21 GB version from here: http://kaldi-asr.org/downloads/build/6/trunk/egs/ ? (It's really huge compared to the others ^^)

    Thanks for the info!
    - Florian

     
    • Daniel Povey

      Daniel Povey - 2015-04-11

      Basically I just need a way to send/receive data to/from a Kaldi server.
      The HTTP API of kaldi-gstreamer-server with PUT and POST requests seems
      to be a good way, but I haven't looked at the details yet. Any other
      suggestions?

      No other suggestions. Tanel might be able to advise RE gstreamer.

      Usually 16 kHz is the preferred option. In ILA I use a dynamic 3-gram
      language model (dynamic because it grows with the amount of stuff the
      user teaches ILA). It's compatible with Sphinx and Pocketsphinx; I
      assume one could convert it to Kaldi format?

      Yes - sometimes inserting disambiguation symbols requires some work,
      though, for non-ARPA LMs.
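
      Roughly, for an ARPA LM, the conversion driven from Java would look
      something like this (a sketch: newer arpa2fst builds accept these
      options directly, older checkouts did the same via utils/format_lm.sh,
      and the file names are placeholders):

          // Sketch: turn an ARPA 3-gram LM into the G.fst that Kaldi's
          // decoding-graph construction expects. arpa2fst must be on the PATH.
          public class ArpaToFst {
              public static void main(String[] args) throws Exception {
                  ProcessBuilder pb = new ProcessBuilder(
                          "arpa2fst",
                          "--disambig-symbol=#0",          // LM disambiguation symbol
                          "--read-symbol-table=words.txt", // word list of the model
                          "ila_lm.arpa",                   // the LM exported from ILA
                          "G.fst");
                  pb.inheritIO();
                  System.exit(pb.start().waitFor());
              }
          }
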
      Dan

       
  • Florian

    Florian - 2015-04-11

    No other suggestions. Tanel might be able to advise RE gstreamer

    Any additional info on this would be really helpful, ty!
    I just found the paper by Tanel, "Full-duplex Speech-to-text System for Estonian", and it sounds exactly like what I'm looking for. As far as I understand, he's actually referring to kaldi-gstreamer-server in this work. He even mentions that the client is available for Java too :-D

    convert it to Kaldi format?
    Yes - sometimes inserting disambiguation symbols requires some work,
    though, for non-ARPA LMs.

    For some reason I thought it needed to be converted. Even better if it doesn't! :-)

    If you want to focus only on integration and not on the ASR engine, I can
    give you access to a service.

    That'd be great for testing! Is it using kaldi-gstreamer-server?

     
    • Tanel Alumäe

      Tanel Alumäe - 2015-04-12

      I'm the author of kaldi-gstreamer-server. Yes, you can use PUT or POST to send audio to it, or a more advanced websocket-based protocol that also lets you get intermediate recognition results while you are still sending the audio.
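
      For the simple case something like this is enough on the Java side (a
      sketch: the endpoint is the one documented in the kaldi-gstreamer-server
      README, and host and port assume a default local install):

          import java.io.InputStream;
          import java.io.OutputStream;
          import java.net.HttpURLConnection;
          import java.net.URL;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          // Sketch: send a complete WAV file, print the JSON reply, e.g.
          // {"status": 0, "hypotheses": [{"utterance": "..."}], "id": "..."}
          public class SimpleRecognize {
              public static void main(String[] args) throws Exception {
                  URL url = new URL("http://localhost:8888/client/dynamic/recognize");
                  HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                  conn.setRequestMethod("PUT"); // POST works the same way
                  conn.setDoOutput(true);
                  try (OutputStream out = conn.getOutputStream()) {
                      out.write(Files.readAllBytes(Paths.get("test.wav")));
                  }
                  try (InputStream in = conn.getInputStream()) {
                      byte[] buf = new byte[4096];
                      for (int n; (n = in.read(buf)) != -1; ) {
                          System.out.write(buf, 0, n);
                      }
                  }
              }
          }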

      Yes, there is a Java client -- actually a library that can be used to implement different clients. We have used it to create an Android client that does basically the same thing as Google's 'voice typing'. I also have a one-window Java desktop application that uses it, but I haven't released it (I'll put it on GitHub if you are interested).

      Library: https://github.com/Kaljurand/net-speech-api
      Android app: https://github.com/Kaljurand/K6nele

      There is also a JavaScript client implementation. A demo using English TEDLIUM multisplice DNN models is at http://bark.phon.ioc.ee/dictate/ but I'm a bit disappointed that its accuracy on basic desktop dictation is pretty bad (quality is great if we feed e.g. Bill Gates' latest TED talk into it, though).
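
      If you want the intermediate results, use the websocket protocol: open a
      connection, stream the raw audio in chunks, send the text message "EOS"
      when you are done, and read JSON results as they arrive. A minimal
      sketch with the standard javax.websocket API (you need an implementation
      such as Tyrus on the classpath; the URL and the content-type parameter
      follow our sample clients, and host and port assume a local install):

          import java.net.URI;
          import java.nio.ByteBuffer;
          import java.nio.file.Files;
          import java.nio.file.Paths;
          import javax.websocket.ClientEndpoint;
          import javax.websocket.ContainerProvider;
          import javax.websocket.OnMessage;
          import javax.websocket.Session;

          // Sketch: stream raw 16 kHz 16-bit mono PCM to the server and print
          // the JSON results (partials while streaming, then the final one).
          @ClientEndpoint
          public class DictateClient {

              @OnMessage
              public void onMessage(String json) {
                  System.out.println(json);
              }

              public static void main(String[] args) throws Exception {
                  URI uri = new URI("ws://localhost:8888/client/ws/speech"
                          + "?content-type=audio/x-raw,+layout=(string)interleaved,"
                          + "+rate=(int)16000,+format=(string)S16LE,+channels=(int)1");
                  Session session = ContainerProvider.getWebSocketContainer()
                          .connectToServer(DictateClient.class, uri);

                  byte[] audio = Files.readAllBytes(Paths.get("test.raw"));
                  int chunk = 8000; // 0.25 s of audio per message
                  for (int off = 0; off < audio.length; off += chunk) {
                      int len = Math.min(chunk, audio.length - off);
                      session.getBasicRemote()
                              .sendBinary(ByteBuffer.wrap(audio, off, len));
                      Thread.sleep(250); // pretend this is live microphone input
                  }
                  session.getBasicRemote().sendText("EOS"); // no more audio
                  Thread.sleep(3000); // give the final result time to arrive
                  session.close();
              }
          }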

       
  • Florian

    Florian - 2015-04-12

    Yes, there is a Java client -- actually a library that can be used to implement different clients

    Great, thanks a lot! Actually I was just about to ask you directly about this, because I saw it in your paper but couldn't find it anywhere :-)

    A demo using English TEDLIUM multisplice DNN models is at http://bark.phon.ioc.ee/dictate/

    I played with it yesterday already, but somehow it wasn't working well. Now I tried it again and I'm very happy with it :-) (I think Firefox has some problems with it; sometimes it works and many times it doesn't.)
    Did you try other models like Librispeech or Fisher?

    If you have time, can you quickly have a look at my other post? Yesterday I installed everything from scratch up to the point where I can start the gstreamer server and workers, but I can't test Bill's speech because of a "Could not open file "tmp/bb1ac9b8-cece-4844-8593-8d861d9a2945.raw" for writing" message. Maybe it's just a trivial problem on my side, but somehow I can't find it :-(

     

    Last edit: Florian 2015-04-12
    • Daniel Povey

      Daniel Povey - 2015-04-12

      BTW, the online-nnet2 models are not as invariant to volume variations as
      we would like, because the training data all has well-normalized volume.
      More recently we have been artificially changing the volume while training,
      but we haven't uploaded those models yet.
      This might be the reason for some of the problems you have been having.
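
      Until then, a possible workaround on the client side is to normalize the
      volume of the recorded audio yourself before sending it, e.g. simple
      peak normalization of the 16-bit PCM buffer (a minimal sketch, nothing
      Kaldi-specific):

          // Sketch: scale 16-bit PCM samples so the peak hits a target level,
          // e.g. normalize(samples, 0.9 * Short.MAX_VALUE).
          public class PeakNormalize {
              public static void normalize(short[] samples, double target) {
                  int peak = 0;
                  for (short s : samples) {
                      peak = Math.max(peak, Math.abs((int) s)); // int abs: -32768 safe
                  }
                  if (peak == 0) {
                      return; // pure silence, nothing to scale
                  }
                  double gain = target / peak;
                  for (int i = 0; i < samples.length; i++) {
                      long v = Math.round(samples[i] * gain);
                      samples[i] = (short) Math.max(Short.MIN_VALUE,
                              Math.min(Short.MAX_VALUE, v));
                  }
              }
          }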

      Dan


       
      • Tanel Alumäe

        Tanel Alumäe - 2015-04-13

        Regarding the accuracy of the online-nnet2 models: they work amazingly well in our Estonian real-world applications, even in the radiology domain, where radiologists speak in a very low voice. Also, the Android general-domain dictation is very accurate. Our training data consists mainly of broadcast speech, with a lot of semi-spontaneous telephone interviews and lectures/conference talks.

         
    • Tanel Alumäe

      Tanel Alumäe - 2015-04-13

      I have played a bit with the Fisher and Switchboard models from kaldi-asr.org, but not very seriously. I haven't tested Librispeech; I guess it would have the problem that it is very much geared towards carefully dictated speech.

      About your error "Could not open file "tmp/bb1ac9b8-cece-4844-8593-8d861d9a2945.raw"": just create a directory "tmp". It's used for outputting the audio that is fed to Kaldi (for debugging purposes; configurable in the YAML file). I've also committed a fix that creates this directory automatically.

       
      • Vassil Panayotov

        Tanel, I might be biased :), but I think it might be good to give the Librispeech model a go if you have time. I think that, given a suitable LM, it will probably perform well on desktop dictation.

         
  • Florian

    Florian - 2015-04-13

    About your error "Could not open file "tmp/bb1ac9b8-cece-4844-8593-8d861d9a2945.raw"": just create a directory "tmp". It's used for outputting the audio that is fed to Kaldi (for debugging purposes; configurable in the YAML file). I've also committed a fix that creates this directory automatically.

    To say it with Homer Simpson's words: d'oh! Why didn't I try that? ^^
    Thanks! It's working perfectly now, and I can start writing a client for ILA :-D