
using Kaldi with ILA voice assistant?

Florian
2015-04-10
2015-04-13
  • Florian

    Florian - 2015-04-10

    Hello everybody,

    I've been developing a multi-platform (Java-based), user-customizable voice assistant for a while now, called ILA (Intelligent Learning Assistant).

    It's built modularly to support the best freely available speech recognition systems, and so far I've integrated Sphinx-4 and Pocketsphinx (and the Google API, but it's not really an option for the future). Both systems work pretty well, but mainly because I'm using a small vocabulary of around 600 words with a dynamic language model, and even then accuracy is still far from anything like Google, Apple, Microsoft etc. (especially when it comes to robustness in different environments).

    From all I've heard, Kaldi could make a real difference here, but I need some ideas on what's the best path to follow and how to get started. I hope you can help me with the following questions:

    1) What's the performance (server requirements for a single user) needed to get Kaldi running in (close to) real time with a large vocabulary (~50,000 words)?

    2) To be platform-independent, I think the only way to use Kaldi is to request transcription from a server. As far as I understand, kaldi-gstreamer-server is the thing to look for?

    3) What's the acoustic model to go for? Librispeech? Fisher? (It should be freely available, at best downloadable from a source like http://kaldi-asr.org - correct?)

    4) Has anyone tried to get Kaldi running with Java? ^^

    I've tried to answer some of these questions myself, but I'm still struggling with the installation of Kaldi. Currently I'm working my way up to "Running the example scripts", but so far I haven't succeeded in reaching a point where I can actually transcribe speech. I don't want to train my own model right now, but somehow the tutorial seems to end there ...
    Is there something like a quick guide where I can just use default settings and pre-compiled acoustic and language models to get started?

    Thanks in advance for all the help and ideas!
    - Florian

     

    Last edit: Florian 2015-04-10
    • Daniel Povey

      Daniel Povey - 2015-04-10

      1) What's the performance (server requirements for a single user) needed
      to get Kaldi running in (close to) real time with a large vocabulary
      (~50,000 words)?

      It can be done using 1 fairly normal modern CPU, but allocating 2 CPUs
      (using the online-nnet2-threaded setup) will give you a bit more headroom.
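
      By the way, if you just want to sanity-check a downloaded model before
      worrying about servers, a single-utterance decode looks roughly like
      this when driven from Java (only a sketch: the binary and options
      follow the online-nnet2 demo recipes, the threaded variant is called
      online2-wav-nnet2-latgen-threaded, and all file paths are placeholders
      for wherever you unpacked the model):

          import java.io.IOException;
          import java.nio.file.Files;
          import java.nio.file.Paths;
          import java.util.Arrays;

          // Sketch: run Kaldi's online-nnet2 decoder on a single utterance.
          public class DecodeOnce {
              public static void main(String[] args)
                      throws IOException, InterruptedException {
                  // Kaldi reads its inputs from small "table" files.
                  Files.write(Paths.get("spk2utt"), Arrays.asList("utt1 utt1"));
                  Files.write(Paths.get("wav.scp"),
                          Arrays.asList("utt1 test.wav")); // 16 kHz mono WAV

                  ProcessBuilder pb = new ProcessBuilder(
                          "online2-wav-nnet2-latgen-faster",
                          "--online=true",
                          "--do-endpointing=false",
                          "--config=conf/online_nnet2_decoding.conf",
                          "--word-symbol-table=words.txt",
                          "final.mdl",      // acoustic model
                          "HCLG.fst",       // decoding graph
                          "ark:spk2utt",
                          "scp:wav.scp",
                          "ark:/dev/null"); // lattices discarded; the
                                            // recognized words are logged
                  pb.inheritIO();
                  System.exit(pb.start().waitFor());
              }
          }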

      2) To be platform-independent, I think the only way to use Kaldi is to
      request transcription from a server. As far as I understand,
      kaldi-gstreamer-server is the thing to look for?

      I don't know much about that and I'm not sure how up-to-date it is (i.e.
      whether it works with the latest Kaldi code).

      3) What's the acoustic model to go for? Librispeech? Fisher? (It should
      be freely available, at best downloadable from a source like
      http://kaldi-asr.org - correct?)

      Depends on whether your data is 8 kHz or 16 kHz. Probably 16 kHz
      (Librispeech) is better if your data is high-bandwidth. It makes sense to
      at least build the language models for your domain, though.

      4) Has anyone tried to get Kaldi running with Java? ^^

      I think the gstreamer stuff might use Java, and there must have been some
      approach to do the wrapping, but I don't recall the details.

      Dan

       
  • Florian

    Florian - 2015-04-11

    It can be done using 1 fairly normal modern CPU, but allocating 2 CPUs
    (using the online-nnet2-threaded setup) will give you a bit more headroom

    Great! :-) I was afraid all that DNN stuff would require more computing power ^^

    I don't know much about that and I'm not sure how up-to-date it is

    Basically I just need a way to send/receive data to/from a Kaldi server. The HTTP API of kaldi-gstreamer-server with PUT and POST requests seems to be a good way, but I haven't looked at the details yet. Any other suggestions?

    Depends on whether your data is 8 kHz or 16 kHz. Probably 16 kHz
    (Librispeech) is better if your data is high-bandwidth. It makes sense to
    at least build the language models for your domain, though.

    Usually 16 kHz is the preferred option. In ILA I use a dynamic 3-gram language model (dynamic because it grows with the amount of stuff the user teaches ILA). It's compatible with Sphinx and Pocketsphinx; I assume one could convert it to Kaldi format?
    The Librispeech model is the 21 GB version from here: http://kaldi-asr.org/downloads/build/6/trunk/egs/ ? (It's really huge compared to the others ^^)

    Thanks for the info!
    - Florian

     
    • Daniel Povey

      Daniel Povey - 2015-04-11

      Basically I just need a way to send/receive data to/from a Kaldi server.
      The HTTP API of kaldi-gstreamer-server with PUT and POST requests seems
      to be a good way, but I haven't looked at the details yet. Any other
      suggestions?

      No other suggestions. Tanel might be able to advise RE gstreamer.

      Usually 16 kHz is the preferred option. In ILA I use a dynamic 3-gram
      language model (dynamic because it grows with the amount of stuff the
      user teaches ILA). It's compatible with Sphinx and Pocketsphinx; I
      assume one could convert it to Kaldi format?

      Yes - sometimes inserting disambiguation symbols requires some work,
      though, for non-ARPA LMs.
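
      Roughly, for an ARPA LM, the conversion driven from Java would look
      something like this (a sketch: newer arpa2fst builds accept these
      options directly, older checkouts did the same via utils/format_lm.sh,
      and the file names are placeholders):

          // Sketch: turn an ARPA 3-gram LM into the G.fst that Kaldi's
          // decoding-graph construction expects. arpa2fst must be on the PATH.
          public class ArpaToFst {
              public static void main(String[] args) throws Exception {
                  ProcessBuilder pb = new ProcessBuilder(
                          "arpa2fst",
                          "--disambig-symbol=#0",          // LM disambiguation symbol
                          "--read-symbol-table=words.txt", // word list of the model
                          "ila_lm.arpa",                   // the LM exported from ILA
                          "G.fst");
                  pb.inheritIO();
                  System.exit(pb.start().waitFor());
              }
          }
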
      Dan

       
  • Florian

    Florian - 2015-04-11

    No other suggestions. Tanel might be able to advise RE gstreamer

    Any additional info on this would be really helpful, ty!
    I just found the paper by Tanel, "Full-duplex Speech-to-text System for Estonian", and it sounds exactly like what I'm looking for. As far as I understand, he's actually referring to kaldi-gstreamer-server in this work. He even mentions that the client is available for Java too :-D

    convert it to Kaldi format?
    Yes - sometimes inserting disambiguation symbols requires some work,
    though, for non-ARPA LMs.

    For some reason I thought it needed to be converted. Even better if it doesn't! :-)

    If you want to focus only on integration and not on the ASR engine, I can
    give you access to a service.

    That'd be great for testing! Is it using kaldi-gstreamer-server?

     
    • Tanel Alumäe

      Tanel Alumäe - 2015-04-12

      I'm the author of kaldi-gstreamer-server. Yes, you can use PUT or POST to send audio to it, or a more advanced websocket-based protocol that also lets you get intermediate recognition results while you are still sending the audio.
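
      For the simple case something like this is enough on the Java side (a
      sketch: the endpoint is the one documented in the kaldi-gstreamer-server
      README, and host and port assume a default local install):

          import java.io.InputStream;
          import java.io.OutputStream;
          import java.net.HttpURLConnection;
          import java.net.URL;
          import java.nio.file.Files;
          import java.nio.file.Paths;

          // Sketch: send a complete WAV file, print the JSON reply, e.g.
          // {"status": 0, "hypotheses": [{"utterance": "..."}], "id": "..."}
          public class SimpleRecognize {
              public static void main(String[] args) throws Exception {
                  URL url = new URL("http://localhost:8888/client/dynamic/recognize");
                  HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                  conn.setRequestMethod("PUT"); // POST works the same way
                  conn.setDoOutput(true);
                  try (OutputStream out = conn.getOutputStream()) {
                      out.write(Files.readAllBytes(Paths.get("test.wav")));
                  }
                  try (InputStream in = conn.getInputStream()) {
                      byte[] buf = new byte[4096];
                      for (int n; (n = in.read(buf)) != -1; ) {
                          System.out.write(buf, 0, n);
                      }
                  }
              }
          }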

      Yes, there is a Java client -- actually a library that can be used to implement different clients. We have used it to create an Android client that does basically the same thing as Google's 'voice typing'. I also have a one-window Java desktop application that uses it, but I haven't released it (I'll put it on GitHub if you are interested).

      Library: https://github.com/Kaljurand/net-speech-api
      Android app: https://github.com/Kaljurand/K6nele

      There is also a JavaScript client implementation. A demo using English TEDLIUM multisplice DNN models is at http://bark.phon.ioc.ee/dictate/ but I'm a bit disappointed that its accuracy on basic desktop dictation is pretty bad (quality is great if we feed e.g. Bill Gates' latest TED talk into it, though).
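
      If you want the intermediate results, use the websocket protocol: open a
      connection, stream the raw audio in chunks, send the text message "EOS"
      when you are done, and read JSON results as they arrive. A minimal
      sketch with the standard javax.websocket API (you need an implementation
      such as Tyrus on the classpath; the URL and the content-type parameter
      follow our sample clients, and host and port assume a local install):

          import java.net.URI;
          import java.nio.ByteBuffer;
          import java.nio.file.Files;
          import java.nio.file.Paths;
          import javax.websocket.ClientEndpoint;
          import javax.websocket.ContainerProvider;
          import javax.websocket.OnMessage;
          import javax.websocket.Session;

          // Sketch: stream raw 16 kHz 16-bit mono PCM to the server and print
          // the JSON results (partials while streaming, then the final one).
          @ClientEndpoint
          public class DictateClient {

              @OnMessage
              public void onMessage(String json) {
                  System.out.println(json);
              }

              public static void main(String[] args) throws Exception {
                  URI uri = new URI("ws://localhost:8888/client/ws/speech"
                          + "?content-type=audio/x-raw,+layout=(string)interleaved,"
                          + "+rate=(int)16000,+format=(string)S16LE,+channels=(int)1");
                  Session session = ContainerProvider.getWebSocketContainer()
                          .connectToServer(DictateClient.class, uri);

                  byte[] audio = Files.readAllBytes(Paths.get("test.raw"));
                  int chunk = 8000; // 0.25 s of audio per message
                  for (int off = 0; off < audio.length; off += chunk) {
                      int len = Math.min(chunk, audio.length - off);
                      session.getBasicRemote()
                              .sendBinary(ByteBuffer.wrap(audio, off, len));
                      Thread.sleep(250); // pretend this is live microphone input
                  }
                  session.getBasicRemote().sendText("EOS"); // no more audio
                  Thread.sleep(3000); // give the final result time to arrive
                  session.close();
              }
          }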

       
  • Florian

    Florian - 2015-04-12

    Yes, there is a Java client -- actually a library that can be used to implement different clients

    Great, thanks a lot! Actually I was just about to ask you directly about this, because I saw it in your paper but couldn't find it anywhere :-)

    A demo using English TEDLIUM multisplice DNN models is at http://bark.phon.ioc.ee/dictate/

    I played with it yesterday already, but somehow it wasn't working well. Now I tried it again and I'm very happy with it :-) (I think Firefox has some problems with it; sometimes it works and many times it doesn't.)
    Did you try other models like Librispeech or Fisher?

    If you have time, can you quickly have a look at my other post? Yesterday I installed everything from scratch up to the point where I can start the gstreamer server and workers, but I can't test Bill's speech because of a "Could not open file "tmp/bb1ac9b8-cece-4844-8593-8d861d9a2945.raw" for writing" message. Maybe it's just a trivial problem on my side, but somehow I can't find it :-(

     

    Last edit: Florian 2015-04-12
    • Daniel Povey

      Daniel Povey - 2015-04-12

      BTW, the online-nnet2 models are not as invariant to volume variations as
      we would like, because the training data all has well-normalized volume.
      More recently we have been artificially changing the volume while training,
      but we haven't uploaded those models yet.
      This might be the reason for some of the problems you have been having.
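
      Until then, a possible workaround on the client side is to normalize the
      volume of the recorded audio yourself before sending it, e.g. simple
      peak normalization of the 16-bit PCM buffer (a minimal sketch, nothing
      Kaldi-specific):

          // Sketch: scale 16-bit PCM samples so the peak hits a target level,
          // e.g. normalize(samples, 0.9 * Short.MAX_VALUE).
          public class PeakNormalize {
              public static void normalize(short[] samples, double target) {
                  int peak = 0;
                  for (short s : samples) {
                      peak = Math.max(peak, Math.abs((int) s)); // int abs: -32768 safe
                  }
                  if (peak == 0) {
                      return; // pure silence, nothing to scale
                  }
                  double gain = target / peak;
                  for (int i = 0; i < samples.length; i++) {
                      long v = Math.round(samples[i] * gain);
                      samples[i] = (short) Math.max(Short.MIN_VALUE,
                              Math.min(Short.MAX_VALUE, v));
                  }
              }
          }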

      Dan


       
      • Tanel Alumäe

        Tanel Alumäe - 2015-04-13

        Regarding the accuracy of the online-nnet2 models: they work amazingly well in our Estonian real-world applications, even in the radiology domain, where radiologists speak in a very low voice. Also, the Android general-domain dictation is very accurate. Our training data consists mainly of broadcast speech, with a lot of semi-spontaneous telephone interviews and lectures/conference talks.

         
    • Tanel Alumäe

      Tanel Alumäe - 2015-04-13

      I have played a bit with the Fisher and Switchboard models from kaldi-asr.org, but not very seriously. I haven't tested Librispeech; I guess it would have the problem that it is very much geared towards carefully dictated speech.

      About your error "Could not open file "tmp/bb1ac9b8-cece-4844-8593-8d861d9a2945.raw"": just create a directory "tmp". It's used for outputting the audio that is fed to Kaldi (for debugging purposes; configurable in the YAML file). I've also committed a fix that creates this directory automatically.

       
      • Vassil Panayotov

        Tanel, I might be biased :), but I think it might be good to give the Librispeech model a go if you have time. I think that, given a suitable LM, it will probably perform well on desktop dictation.

         
  • Florian

    Florian - 2015-04-13

    About your error "Could not open file "tmp/bb1ac9b8-cece-4844-8593-8d861d9a2945.raw"": just create a directory "tmp". It's used for outputting the audio that is fed to Kaldi (for debugging purposes; configurable in the YAML file). I've also committed a fix that creates this directory automatically.

    To say it with Homer Simpson's words: d'oh! Why didn't I try that? ^^
    Thanks! It's working perfectly now, and I can start writing a client for ILA :-D