
Can I use CMU Sphinx for offline processing?

  • Peter

    Peter - 2018-03-06

    Last year I started to test the Uberi Speech Recognition software at https://github.com/Uberi/speech_recognition . It uses CMU Sphinx for offline processing of audio files to create text output (transcriptions). The WER was high, but I realise that was because the model that comes with it wasn't suited to the test audios.
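
    For reference, the offline path through that wrapper looks something like this (a rough sketch; the file name is a placeholder, and it needs PocketSphinx and its bundled US English model installed):

        import speech_recognition as sr

        r = sr.Recognizer()
        with sr.AudioFile("test.wav") as source:  # 16 kHz mono WAV works best
            audio = r.record(source)              # read the entire file

        try:
            print(r.recognize_sphinx(audio))      # offline, via PocketSphinx
        except sr.UnknownValueError:
            print("Sphinx could not understand the audio")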

    I did some testing with that same tool against Google Speech Recognition, Wit.ai, and IBM Speech to Text. The lowest WER was from Google, where spoken words were returned as text. Wit.ai was about as accurate as the CMU Sphinx model, and IBM was not quite as accurate as Google. I do have some test results somewhere, from running those tools on a small audio file and then matching the reference transcript against the output text.
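
    For anyone repeating this: the matching is essentially a word-level edit distance between the reference transcript and the recognizer output, divided by the reference length. A minimal sketch (the example strings are made up):

        def wer(reference: str, hypothesis: str) -> float:
            """Word error rate: word-level edit distance over reference length."""
            ref, hyp = reference.split(), hypothesis.split()
            # d[i][j] = edits needed to turn the first i reference words
            # into the first j hypothesis words
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,        # deletion
                                  d[i][j - 1] + 1,        # insertion
                                  d[i - 1][j - 1] + sub)  # substitution or match
            return d[len(ref)][len(hyp)] / len(ref)

        print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167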

    Looking for a greater degree of accuracy (lower WER) for offline processing, I then tried DeepSpeech. It was easy to install, but when I gave it a small test drive, the computer froze and actually caused HDD damage. DeepSpeech is still only alpha and can't process audio files much longer than (say) 30 seconds.

    So I then tried Kaldi. It was basically easy to install, but there is a lot of work required even to test a small audio file. Installing/compiling took hours, and then testing one of the models takes a lot of time to understand the documentation, plus a lot of processing time. Having come to this stage, I realised I just don't have the 'grunt' in this computer to work with Kaldi.

    Now I want to look at using CMU Sphinx. Here is an overview of what is required.

    1. There is a laptop that I can use just for testing CMU Sphinx. It is an i5 with 4 GB RAM and a 500 GB drive. Will this computer be suitable for installing, compiling, building models, etc.?
    2. Need to do all the processing of the audio files offline.
    3. There are hundreds of MP3s, most about 40 mins, but some in excess of 1 hr.
    4. I can use ffmpeg to convert the MP3s to the required format for processing (see the sketch after this list).
    5. Require the output to have a WER of less than 10%, with the option to add time markers to the output text (transcription).
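
    On point 4, a minimal sketch of the conversion I have in mind, assuming the 16 kHz mono 16-bit PCM WAV that the default US English models expect (directory and file names are placeholders):

        import subprocess
        from pathlib import Path

        def mp3_to_wav(src: Path, dst: Path) -> None:
            """Convert an MP3 to 16 kHz mono 16-bit PCM WAV for recognition."""
            subprocess.run(
                ["ffmpeg", "-y", "-i", str(src),
                 "-ar", "16000",       # resample to 16 kHz
                 "-ac", "1",           # downmix to mono
                 "-c:a", "pcm_s16le",  # 16-bit signed PCM
                 str(dst)],
                check=True,
            )

        for mp3 in Path("recordings").glob("*.mp3"):
            mp3_to_wav(mp3, mp3.with_suffix(".wav"))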

    If you can advise, please do.
    Peter

    • Nickolay V. Shmyrev

      Last year I started to test the Uberi Speech Recognition software at https://github.com/Uberi/speech_recognition

      This package does not use our API and other APIs properly. It is usually better to use the toolkit API directly without a wrapper.
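
      For example, with the pocketsphinx python package you can decode a file directly, and the segmentation also gives you the word time markers you asked about. A rough sketch (the wav name is a placeholder; the file must be 16 kHz mono 16-bit PCM):

          import os
          from pocketsphinx import Decoder, get_model_path

          model_path = get_model_path()
          config = Decoder.default_config()
          config.set_string('-hmm', os.path.join(model_path, 'en-us'))
          config.set_string('-lm', os.path.join(model_path, 'en-us.lm.bin'))
          config.set_string('-dict', os.path.join(model_path, 'cmudict-en-us.dict'))
          decoder = Decoder(config)

          # stream the raw samples through the decoder
          decoder.start_utt()
          with open('test.wav', 'rb') as f:
              f.read(44)  # skip the RIFF header
              while True:
                  buf = f.read(4096)
                  if not buf:
                      break
                  decoder.process_raw(buf, False, False)
          decoder.end_utt()

          print(decoder.hyp().hypstr)
          # word time markers; frames are 10 ms each by default
          for seg in decoder.seg():
              print(seg.word, seg.start_frame / 100.0, seg.end_frame / 100.0)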

      It is an i5 with 4 GB RAM and a 500 GB drive. Will this computer be suitable for installing, compiling, building models, etc.?

      This spec should be enough to use cmusphinx and kaldi, but probably not enough for serious ASR development. For the best WER you need to use Kaldi.

      The WER depends a lot on various factors - accent, music in the background, noise, echo and so on. It is hard to give you WER advice from the limited information you provided.

      • Peter

        Peter - 2018-03-06

        This spec should be enough to use cmusphinx and kaldi, but probably not enough for serious ASR development. For the best WER you need to use Kaldi.

        Okay, thanks. With Kaldi, I was able to install/compile it, and then install the Librispeech model and test it. But I tried another model and the computer was not sufficient. I may work through this today - http://jrmeyer.github.io/asr/2016/01/08/Installing-CMU-Sphinx-on-Ubuntu.html

        The WER depends a lot on various factors - accent, music in the background, noise, echo and so on. It is hard to give you WER advice from the limited information you provided.

        Yes, okay, I didn't give much information. It is just one speaker in all the audios, and there is often music, but not in the background - just playing off and on. If there is low volume I can increase it with ffmpeg, and I have found that Audacity is useful for removing noise. That can be done where necessary, prior to processing.
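
        The volume step is just ffmpeg's volume filter, e.g. something like this (the gain value and file names are placeholders):

            import subprocess

            # boost a quiet recording by 6 dB before recognition; adjust to taste
            subprocess.run(
                ["ffmpeg", "-y", "-i", "quiet.wav",
                 "-filter:a", "volume=6dB", "louder.wav"],
                check=True,
            )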

        At this stage, I'm assuming that the path to follow for just one speaker is to build a speaker-specific model, using lots of small audios with just a few words each for the training.

        Thanks for your help, Peter

