Problem Converting TTS audio back to text

Help
2017-12-22
2017-12-29
  • Victor Biro

    Victor Biro - 2017-12-22

    Hi,

    I am using pocketsphinx in a way that is giving me problems. There are several parts to the setup, and any of them may be contributing to the inaccurate text output.

    The wav files are generated from a Software Defined Radio (SDR) by Trunk Recorder (https://github.com/robotastic/trunk-recorder), an application built on GNU Radio (https://www.gnuradio.org/). The files created are 16-bit, 8000 Hz, little-endian WAV files.

    The 'file' command tells me this:

    1401-1513771892_7.70856e+08.wav: RIFF (little-endian) data, WAVE audio, Microsoft PCM, 16 bit, mono 8000 Hz
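
The same header check can be scripted when there are many recordings to verify. This is only a minimal sketch using Python's standard `wave` module; the function names `describe_wav` and `matches_8k_model` are hypothetical, and the check assumes the 8k model expects 16-bit mono audio at 8000 Hz, as described in this post.

```python
import wave


def describe_wav(path):
    """Return (channels, bits_per_sample, sample_rate_hz, duration_s) for a PCM WAV file."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        bits = wf.getsampwidth() * 8          # sample width is reported in bytes
        rate = wf.getframerate()
        duration = wf.getnframes() / float(rate)
    return channels, bits, rate, duration


def matches_8k_model(path):
    """True if the file is 16-bit mono at 8000 Hz (assumed input format for the 8k model)."""
    channels, bits, rate, _ = describe_wav(path)
    return channels == 1 and bits == 16 and rate == 8000
```

Calling `matches_8k_model` on each recording before decoding would rule out a format mismatch as one of the possible causes.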
    

    I have tried feeding the files in with the 8k model, using this command:

    pocketsphinx_continuous -infile file.wav -samprate 8000 -hmm /lib/sphinx/hmm/en-8k/ 
    

    I have also tried converting to 16-bit, 16 kHz using FFmpeg with the command:

    ffmpeg  -i infile.wav -acodec pcm_s16le -ac 1 -ar 16000 outfile.wav
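
To illustrate what this conversion does and does not change, here is a rough sketch of an 8 kHz to 16 kHz upsample in plain Python. It uses naive linear interpolation, which is not the resampling filter FFmpeg applies, and the function names are hypothetical. The point it shows is that the output has twice as many samples but is built entirely from the original ones, so no speech content above the original 4 kHz bandwidth is recovered.

```python
import array
import sys
import wave


def upsample_2x(samples):
    """Double the sample count of 16-bit mono samples by linear interpolation."""
    out = array.array("h")
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)  # midpoint between neighbouring samples
    return out


def convert_8k_to_16k(src_path, dst_path):
    """Read a 16-bit mono 8000 Hz WAV and write a 16000 Hz copy."""
    with wave.open(src_path, "rb") as src:
        if (src.getnchannels(), src.getsampwidth(), src.getframerate()) != (1, 2, 8000):
            raise ValueError("expected 16-bit mono 8000 Hz input")
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()          # WAV data is little-endian
    doubled = upsample_2x(samples)
    if sys.byteorder == "big":
        doubled.byteswap()          # back to little-endian before writing
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(16000)
        dst.writeframes(doubled.tobytes())
```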
    

    Given the origin of the file (GNU Radio), the fact that it was originally a digital audio signal (as opposed to audio captured by a mic), the artificial voice, and the file format conversion, I feel as though I am missing something that may be obvious to someone else.

    BTW, I have tried adapting a language model, and that didn't seem to do anything to improve results.

    I can't seem to attach a sample of the wav file.

    Any thoughts?
    Victor

     
    • Nickolay V. Shmyrev

      You need to share the file

       
  • Victor Biro

    Victor Biro - 2017-12-26

    Nickolay,

    My apologies. I wanted to when I originally posted, but didn't see an opportunity to. I do now.

    Attached is a file that is recognised as:

    "for for for","to eat to you hear it in your shoes during eighty five good for a new piece he talks friends are fast pack corn","fool fool pool","where que forty forty forty one for forty find comfort for thirty two she forty three he had collapsed and and kept the direction of highway for twenty seven the north end and after eat well west pac warm
    

    While some of the words are correct, the overall result is not.

    Victor

     

    Last edit: Victor Biro 2017-12-26
    • Nickolay V. Shmyrev

      The default model is not supposed to decode this, but with enough adaptation data (30 minutes) and the 8 kHz acoustic model it should be 100% accurate.
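
As a rough aid for collecting that adaptation data, the sketch below totals the duration of a set of transcribed recordings and builds the two listings used by the CMUSphinx adaptation tools: a file-ID list and a transcription list in the `<s> text </s> (file_id)` form. The function names, the `transcripts` mapping, and the choice to derive each file ID from the WAV file name are all assumptions made for illustration.

```python
import os
import wave

TARGET_SECONDS = 30 * 60  # the 30 minutes of adaptation data suggested in the reply


def wav_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())


def adaptation_listing(transcripts):
    """Build (fileids, transcription_lines, total_seconds).

    transcripts: dict mapping a WAV path to its manually verified transcript
    (hypothetical input; the transcripts have to be written by hand).
    """
    fileids, lines, total = [], [], 0.0
    for path in sorted(transcripts):
        file_id = os.path.splitext(os.path.basename(path))[0]
        fileids.append(file_id)
        lines.append("<s> %s </s> (%s)" % (transcripts[path].strip(), file_id))
        total += wav_seconds(path)
    return fileids, lines, total
```

Comparing the returned total against `TARGET_SECONDS` shows how much more transcribed audio is still needed before adapting the 8 kHz model.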

       
