Very poor accuracy with Sphinx 4

2016-08-11
2023-07-06
  • New SourceForge User

    Hi. I am a newcomer trying to get Sphinx 4 to work. I have run it on my careful and clear reading of test.wav (I am an American native English speaker) for which the ground truth is:

    this is the first interval of speaking
    after the first moment of silence this is the second interval of speaking
    after the third moment of silence this is the third interval of speaking and the last

    and I got instead

    this is the first interval of speaking
    one of silence
    the second interval of speaking
    to the third moment of silence
    thirty hunter was speaking and the last

    for which the word error rate was 37%. (The word error rate was similar for the version of test.wav included with Sphinx.)
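
    For readers who want to reproduce a figure like this, here is a minimal sketch of a word error rate calculation: word-level edit distance divided by the number of reference words. The class and method names are illustrative, and the tokenization (lowercasing, splitting on whitespace) is an assumption; the post does not say which tool or normalization produced the 37%.

```java
import java.util.Locale;

public class WordErrorRate {

    // Lowercase and split on whitespace. Punctuation handling is left out of this sketch.
    static String[] tokenize(String text) {
        String t = text.trim().toLowerCase(Locale.ROOT);
        return t.isEmpty() ? new String[0] : t.split("\\s+");
    }

    // Word-level Levenshtein distance: the minimum number of substitutions,
    // deletions and insertions needed to turn the reference into the hypothesis.
    static int editDistance(String[] ref, String[] hyp) {
        int[][] d = new int[ref.length + 1][hyp.length + 1];
        for (int i = 0; i <= ref.length; i++) {
            d[i][0] = i; // delete all remaining reference words
        }
        for (int j = 0; j <= hyp.length; j++) {
            d[0][j] = j; // insert all remaining hypothesis words
        }
        for (int i = 1; i <= ref.length; i++) {
            for (int j = 1; j <= hyp.length; j++) {
                int cost = ref[i - 1].equals(hyp[j - 1]) ? 0 : 1;
                int substitute = d[i - 1][j - 1] + cost;
                int delete = d[i - 1][j] + 1;
                int insert = d[i][j - 1] + 1;
                d[i][j] = Math.min(substitute, Math.min(delete, insert));
            }
        }
        return d[ref.length][hyp.length];
    }

    // WER = edit distance / number of reference words.
    static double wer(String reference, String hypothesis) {
        String[] ref = tokenize(reference);
        String[] hyp = tokenize(hypothesis);
        if (ref.length == 0) {
            throw new IllegalArgumentException("reference is empty");
        }
        return (double) editDistance(ref, hyp) / ref.length;
    }

    public static void main(String[] args) {
        // One reference line and the matching recognizer output from the post above.
        String reference = "after the first moment of silence this is the second interval of speaking";
        String hypothesis = "one of silence the second interval of speaking";
        System.out.printf("WER: %.1f%%%n", 100.0 * wer(reference, hypothesis));
    }
}
```

    Because the denominator is the reference length, dropped words count as errors just like wrong words, which is why the truncated Sphinx lines above are penalized heavily.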

    I have also run it on my careful and clear reading of Little Red Riding Hood, which begins:

    Once upon a time there lived in a certain village a little country girl the prettiest creature who was ever seen
    Her mother was excessively fond of her and her grandmother doted on her still more
    This good woman had a little red riding hood made for her
    It suited the girl so extremely well that everybody called her Little Red Riding Hood
    One day her mother having made some cakes said to her
    Go my dear and see how your grandmother is doing for
    I hear she has been very ill
    Take her a cake and this little pot of butter
    Little Red Riding Hood set out immediately to go to her grandmother who lived in another village
    As she was going through the wood she met with a wolf who had a very great mind to eat

    and I got instead:

    little country girl
    freeze creature whose it is steve
    her mother was excessively fond of her
    in your grandmother going on are still more
    the woman had a little red riding hood makes her
    to to girls looks yummy well
    everybody called her a little red riding
    one day her mother having
    goal mightier can see how are you from others doing
    we're here she's been very ill
    in this little pot luck
    little red riding hood seventy three
    go to her grandma
    he lives in a little
    go into the woods you know with the walls
    for a great minds iraq

    (The reference to "Iraq", while amusing, seemed a bit out of place in "Little Red Riding Hood.")
    While these Sphinx transcripts are amusing, I'm tired of being amused and want instead a reasonable word error rate, rather than the 60% I got here.

    Obviously I am doing something wrong, since Sphinx can't be so bad. But what is it?

    Here are some technical details. Both WAV files were mono, with 16-bit precision and a sampling rate of 16000 samples per second, as confirmed by SoX. (The files were recorded on a Windows laptop into .m4a files and converted to .wav format with FFmpeg.) The default acoustic model and default language model, as distributed with Sphinx, were used; see the Java code below for the details. The included jars were:

    1. sphinx4-core-5prealpha-20160628.232526-10.jar
    2. sphinx4-data-5prealpha-20160628.232535-10.jar
    3. sphinx4-samples-5prealpha-20160628.232549-9.jar
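
    The format check described above (mono, 16-bit, 16000 samples per second) can also be done from Java with the standard javax.sound.sampled API, as in the sketch below. The class name and the helper isSphinxFriendly are hypothetical, and the little-endian signed PCM requirement is an assumption based on the usual WAV layout rather than something stated in this thread.

```java
import java.io.File;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

public class WavFormatCheck {

    // True for signed 16-bit little-endian mono PCM at 16 kHz,
    // the format described in the post above.
    static boolean isSphinxFriendly(AudioFormat f) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(f.getEncoding())
                && f.getChannels() == 1
                && f.getSampleSizeInBits() == 16
                && f.getSampleRate() == 16000f
                && !f.isBigEndian();
    }

    public static void main(String[] args) throws Exception {
        File wav = new File(args[0]); // path to the WAV file to inspect
        try (AudioInputStream in = AudioSystem.getAudioInputStream(wav)) {
            AudioFormat format = in.getFormat();
            System.out.println(format);
            System.out.println(isSphinxFriendly(format)
                    ? "Format matches 16 kHz, 16-bit, mono PCM"
                    : "Format differs from 16 kHz, 16-bit, mono PCM");
        }
    }
}
```

    A check like this only inspects the container format. It cannot tell whether the audio passed through a lossy codec before being converted to WAV.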

    I have appended below the actual Java code used to transcribe.

    I can't believe Sphinx would get such poor word error rates on such simple audio files. Please tell me what I am doing wrong. Thanks very much for your help.

    Here is the Java code. The transcripts I referred to above are the ones produced first by the code.

    package edu.cmu.sphinx.demo.transcriber;

    import java.io.InputStream;

    import edu.cmu.sphinx.api.Configuration;
    import edu.cmu.sphinx.api.SpeechResult;
    import edu.cmu.sphinx.api.StreamSpeechRecognizer;
    import edu.cmu.sphinx.decoder.adaptation.Stats;
    import edu.cmu.sphinx.decoder.adaptation.Transform;
    import edu.cmu.sphinx.result.WordResult;
    import java.io.FileInputStream;
    import java.io.File;

    public class Transcriber {

        public static void main(String[] args) throws Exception {
            System.out.println("Loading models...");

            // Default en-us acoustic model, dictionary and language model
            // as distributed with Sphinx.
            Configuration configuration = new Configuration();
            configuration
                    .setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
            configuration
                    .setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
            configuration
                    .setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

            StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);

            // Pass 1: plain decoding. skip(44) assumes a canonical 44-byte WAV header.
            InputStream stream = new FileInputStream(new File("C:\\workspace\\TranscribeAudio\\src\\main\\java\\input.wav"));
            stream.skip(44);

            recognizer.startRecognition(stream);
            SpeechResult result;
            while ((result = recognizer.getResult()) != null) {
                System.out.format("Hypothesis: %s\n", result.getHypothesis());

                System.out.println("List of recognized words and their times:");
                for (WordResult r : result.getWords()) {
                    System.out.println(r);
                }

                System.out.println("Best 3 hypothesis:");
                for (String s : result.getNbest(3)) {
                    System.out.println(s);
                }
            }
            recognizer.stopRecognition();

            // Pass 2: decode the same file again, collecting adaptation statistics.
            stream = new FileInputStream(new File("C:\\workspace\\TranscribeAudio\\src\\main\\java\\input.wav"));
            stream.skip(44);

            Stats stats = recognizer.createStats(1);
            recognizer.startRecognition(stream);
            while ((result = recognizer.getResult()) != null) {
                stats.collect(result);
            }
            recognizer.stopRecognition();

            // Estimate a transform from the statistics and apply it to the recognizer.
            Transform transform = stats.createTransform();
            recognizer.setTransform(transform);

            // Pass 3: decode once more with the transform applied.
            stream = new FileInputStream(new File("C:\\workspace\\TranscribeAudio\\src\\main\\java\\input.wav"));
            stream.skip(44);
            recognizer.startRecognition(stream);
            while ((result = recognizer.getResult()) != null) {
                System.out.format("Hypothesis: %s\n", result.getHypothesis());
            }
            recognizer.stopRecognition();
        }
    }
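
    One detail in the code above worth a second look is stream.skip(44), which assumes the canonical 44-byte WAV header. Converters may write extra chunks (for example a LIST metadata chunk) before the audio, in which case a fixed skip feeds a few header bytes to the recognizer as samples. That is unlikely to explain a large WER on its own, but it is easy to rule out. The sketch below walks the RIFF chunks until it reaches "data"; the class and method names are hypothetical and it uses only the Java standard library.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class WavDataOffset {

    // Reads a 4-byte chunk identifier such as "RIFF", "fmt " or "data".
    static String readTag(DataInputStream din) throws IOException {
        byte[] id = new byte[4];
        din.readFully(id);
        return new String(id, StandardCharsets.US_ASCII);
    }

    // Consumes RIFF chunks until "data" is found and returns the number of bytes
    // consumed. On return the stream is positioned at the first audio sample.
    static long skipToData(InputStream in) throws IOException {
        DataInputStream din = new DataInputStream(in);
        if (!"RIFF".equals(readTag(din))) {
            throw new IOException("not a RIFF file");
        }
        din.readInt(); // total RIFF size, not needed here
        if (!"WAVE".equals(readTag(din))) {
            throw new IOException("not a WAVE file");
        }
        long offset = 12;
        while (true) {
            String tag = readTag(din); // throws EOFException if there is no data chunk
            // Chunk sizes are little-endian; DataInputStream reads big-endian.
            long size = Integer.reverseBytes(din.readInt()) & 0xFFFFFFFFL;
            offset += 8;
            if ("data".equals(tag)) {
                return offset;
            }
            long padded = size + (size & 1); // chunks are padded to an even length
            for (long left = padded; left > 0; left--) {
                din.readByte(); // byte-wise skip is fine for small header chunks
            }
            offset += padded;
        }
    }

    // Convenience overload for in-memory data.
    static long dataOffset(byte[] wavBytes) {
        try {
            return skipToData(new ByteArrayInputStream(wavBytes));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        try (InputStream stream = new FileInputStream(args[0])) {
            System.out.println("Audio data starts at byte " + skipToData(stream));
            // At this point the stream could be passed to
            // recognizer.startRecognition(stream) instead of calling skip(44).
        }
    }
}
```

    If the reported offset is 44, the fixed skip in the original code is harmless for that file.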

     
    • Nickolay V. Shmyrev

      You need to provide the audio

       
      • New SourceForge User

        I've attached the one .wav file.

         
        • Nickolay V. Shmyrev

          It seems you are using very heavy compression, which corrupts the spectrum, and a microphone that is not very good. The default model is not going to recognize such speech accurately.

          You could try a USB or Bluetooth microphone and a recorder that can record raw WAV files.

           
          • A G T

            A G T - 2016-08-13

            Hi Nickolay,

            You said above that the default model is not going to recognize such speech (where the spectrum has been corrupted, perhaps by a compression or conversion algorithm, but the speech is still clear to a human ear). I have a few questions:

            1. I understand that Sphinx4 only works with 16bit + 16KHz or 16 bit + 8 KHz raw WAV files. Is it possible to configure it or to train an acoustic model to work with 8bit + 8KHz raw WAV recordings? If not, what method or tool would you recommend to convert 8bit + 8KHz recordings to 16bit + 8KHz?
            2. If I am only concerned with transcription accuracy and prepared to spend relatively large amounts of processing time and power, how should I configure sphinxtrain (and possibly the recognizer)? I assume that I'll need to train a "bigger" acoustic model and perhaps reconfigure the decoder to work with it.
            3. (The questions above are more important for me, but this is a natural question which popped up in my head.) In your comment above, I assume that you meant the default acoustic model. How might one go about training an acoustic model that will work well with a diverse set of recordings that may or may not have been compressed and decompressed? Is it possible to do this even in principle, or do we need separate acoustic models for every different permutation?

            Thanks for answering all the questions by everyone! (I'm a Sphinx4 novice too and I have been reading these forums for a few months and your answers have been tremendously helpful and have saved me from several weeks of headache.)

            A.

             
            • Nickolay V. Shmyrev

              I understand that Sphinx4 only works with 16bit + 16KHz or 16 bit + 8 KHz raw WAV files. Is it possible to configure it or to train an acoustic model to work with 8bit + 8KHz raw WAV recordings? If not, what method or tool would you recommend to convert 8bit + 8KHz recordings to 16bit + 8KHz?

              An 8-bit representation is rarely used for PCM data; it loses too much information. Most likely you mean some compressed format, which you can simply decompress to PCM first.
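
              As an illustration of "decompress to PCM first", the sketch below expands G.711 mu-law bytes to 16-bit linear samples. That the recordings in question are mu-law is purely an assumption (it is a common 8-bit, 8 kHz telephony encoding); the thread does not say which format they use, and the class name is hypothetical.

```java
public class MuLawDecoder {

    // Bias added by the mu-law encoder before compression (132).
    private static final int BIAS = 0x84;

    // Expands one G.711 mu-law byte to a 16-bit linear PCM sample.
    static short decodeSample(byte muLaw) {
        int u = ~muLaw & 0xFF;            // mu-law bytes are stored bit-inverted
        int sign = u & 0x80;              // top bit: sign
        int exponent = (u >> 4) & 0x07;   // next three bits: segment number
        int mantissa = u & 0x0F;          // low four bits: position within the segment
        int magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS;
        return (short) (sign != 0 ? -magnitude : magnitude);
    }

    // Decodes a whole buffer of mu-law bytes.
    static short[] decode(byte[] muLaw) {
        short[] pcm = new short[muLaw.length];
        for (int i = 0; i < muLaw.length; i++) {
            pcm[i] = decodeSample(muLaw[i]);
        }
        return pcm;
    }

    public static void main(String[] args) {
        // Silence, positive full scale and negative full scale in mu-law.
        byte[] encoded = {(byte) 0xFF, (byte) 0x80, (byte) 0x00};
        for (short sample : decode(encoded)) {
            System.out.println(sample);
        }
    }
}
```

              The output is still sampled at 8 kHz, so it would need a model trained for 8 kHz audio. The standard library can also perform this kind of conversion through AudioSystem.getAudioInputStream(AudioFormat.Encoding.PCM_SIGNED, sourceStream) when the source encoding is one it supports.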

              If I am only concerned with transcription accuracy and prepared to spend relatively large amounts of processing time and power, how should I configure sphinxtrain (and possibly the recognizer)? I assume that I'll need to train a "bigger" acoustic model and perhaps reconfigure the decoder to work with it.

              If you want accuracy, you can use Kaldi DNN models. Something like https://github.com/alumae/kaldi-offline-transcriber would work. English models are available here: https://github.com/srvk/eesen-transcriber

              (The questions above are more important for me, but this is a natural question which popped up in my head.) In your comment above, I assume that you meant the default acoustic model. How might one go about training an acoustic model that will work well with a diverse set of recordings that may or may not have been compressed and decompressed? Is it possible to do this even in principle, or do we need separate acoustic models for every different permutation?

              The state-of-the-art solution is to train the system on as large an amount of diverse data as possible; this means training on up to 10000 hours of speech data. You can track the effort here:

              https://github.com/kaldi-asr/kaldi/issues/870

               

              Last edit: Nickolay V. Shmyrev 2016-08-13
              • A G T

                A G T - 2016-08-14

                Thanks Nickolay, that was a big help.

                EESEN-transcriber works reasonably well on my personal computer with a variety of sample files (WAV, MP3, M4A, etc). However:

                1. My work requires transcription of 8bit-8KHz recordings which have two, and sometimes more than two, people talking. The quality of these recordings is not great (and much worse than the samples I tested EESEN-transcriber on) but that's the form in which they are available to me. Unfortunately, I'm unable to share these recordings outside my workplace.
                2. Relatedly, the computers on which I need to run the transcriber programs are Linux boxes that are NOT connected to the internet. Installing Sphinx4 on these machines is straightforward -- I only need to copy some Jar files -- but installing Kaldi or EESEN-transcriber is much more difficult, since they need to download some components from the internet. Hence my questions above about attempting to improve Sphinx's transcription accuracy.

                Thanks again for your help, it is much appreciated.

                A.

                 
                • Nickolay V. Shmyrev

                  Accuracy problems are much more complicated than installation problems.

                   
  • New SourceForge User

    and one more

     
  • New SourceForge User

    Thank you, Nickolay. I was using the default "Voice Recorder" under Windows 10, which does not seem to allow direct recording into .wav format. I will see what I can do.

     
  • New SourceForge User

    Hi, again, Nickolay!

    Thanks very much for the suggestions. I rerecorded test.wav and Little Red Riding Hood using Audacity directly to .wav format with 1 channel, 16kHz sampling rate, and 16-bit precision. (Actually, Audacity recorded to 32-bit FLOAT and then converted to 16-bit integers; I believe that conversion shouldn't have caused Sphinx any problems.) The WER on test.wav improved from 37% to 14%, which I was happy about. However, for Little Red Riding Hood, the WER only improved from 60% to 48%. (I've attached one Little Red Riding Hood .wav file.) A sample of the ground truth and Sphinx output appears below. All the files were still recorded using my laptop's built-in microphone, not a USB microphone, as you'd suggested. Do you think the 48% WER could simply be due to the quality of my laptop's microphone? I'm afraid I'm doing something else wrong.
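
    To make the 32-bit float to 16-bit integer step mentioned above concrete, here is a minimal sketch of such a conversion: clamp each sample to [-1, 1], scale by 32767 and round. This is an illustration under stated assumptions, not Audacity's actual export code, which may scale differently or apply dither; the class name is hypothetical.

```java
public class FloatToPcm16 {

    // Converts one float sample in [-1.0, 1.0] to a signed 16-bit integer.
    // Out-of-range input is clamped rather than allowed to wrap around.
    static short toInt16(float sample) {
        float clamped = Math.max(-1.0f, Math.min(1.0f, sample));
        return (short) Math.round(clamped * 32767.0f);
    }

    // Converts a whole buffer of float samples.
    static short[] convert(float[] samples) {
        short[] pcm = new short[samples.length];
        for (int i = 0; i < samples.length; i++) {
            pcm[i] = toInt16(samples[i]);
        }
        return pcm;
    }

    public static void main(String[] args) {
        float[] samples = {0.0f, 0.25f, -0.25f, 1.0f, -1.0f, 1.5f};
        for (short s : convert(samples)) {
            System.out.println(s);
        }
    }
}
```

    The rounding error in a conversion like this is at most half a quantization step, which supports the expectation that this step is not what hurts recognition.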

    The correct transcript begins:

    Once upon a time there lived in a certain village a little country girl the prettiest creature who was ever seen
    Her mother was excessively fond of her and her grandmother doted on her still more
    This good woman had a little red riding hood made for her
    It suited the girl so extremely well that everybody called her Little Red Riding Hood
    One day her mother having made some cakes said to her
    Go my dear and see how your grandmother is doing for
    I hear she has been very ill
    Take her a cake and this little pot of butter
    Little Red Riding Hood set out immediately to go to her grandmother who lived in another village
    As she was going through the wood she met with a wolf who had a very great mind to eat her up but he dared not because of some woodcutters working nearby in the forest

    The Sphinx transcript begins:

    two
    edit
    what's our lives by their lives in a city village a little country
    girl
    praise creature who was ever seen
    her mother was excessively fond of her
    and your grandmother build on hers
    more
    ms good woman had a little red riding hood
    need for her
    this is the girl so it's really well
    everybody called her little red riding hood
    one day her mother having a suitcase
    said to her
    going here and see how you from others doing
    for eight years she's been very ill
    take her case in this little car water
    little red riding hood seldom used to go to rome other
    who lives in the village
    ash is going through the woods
    you met with a wolf
    what a very great mind to europe
    but he did not because it's somewhat cons
    can you buy the farms

    I'm eager to hear your opinion, Nickolay! Thanks again.

     
  • Kristen Basson

    Kristen Basson - 2023-07-06

    Hi there!

    I have the same problem as New SourceForge User. I have followed the CMU Sphinx tutorial exactly, as well as this YouTube tutorial: https://www.youtube.com/watch?v=d_tYW-4i3nE&ab_channel=ForgottenLegends

    The transcriptions are inaccurate and I am unsure why. If you watch the YouTube video, even those transcriptions are clearly incorrect when compared with the audio.

    I am trying to transcribe WAV files in isiXhosa, and the transcriptions were inaccurate for them too. I had a 30% WER for my acoustic model and the accuracy was perfect when using PocketSphinx in Python, but it does not seem to work with Sphinx4 in Eclipse. That is why I tried the basic US English model, and that too transcribed inaccurately.

    Any help would be much appreciated!

     
