I have 5 hours of training audio.
I used some of the training data as test data, but the WER is 93%. What could be the problem?
I have attached my whole training folder here, and I have the following questions:
Why can't I achieve 0% WER on test data taken from the training audio itself?
Is it possible to achieve 10% WER with 5 hours of data, or is this not enough?
What are other ways to improve the WER of my system?
I have 4000 audio utterances, and sometimes I get 3000 errors in one Baum-Welch iteration. What can I do to fix this? I have read that this is a serious issue.
Do I need to use forced alignment? If yes, where can I download sphinx3?
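For reference, the WER reported by the trainer's alignment step is word-level edit distance (substitutions, insertions, and deletions) against the reference transcript, divided by the number of reference words. A minimal sketch of that computation (the function name is illustrative, not part of any Sphinx tool):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] holds the edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1] / len(ref)
```

Note that even a perfect acoustic match rarely gives 0% WER, since decoding also depends on the language model and search beam.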
In your training folder the features are not extracted properly. Some feature files are shorter than they should be: the total duration is reported as 2.1 hours while it is actually 4.8 hours. It seems the features are still being extracted from the 44 kHz files; you need to rerun training from the start, re-extracting the features.
For example, the file size of BT_204.mfc should be 31 kB, but in your archive it is 8 kB.
Once you re-extract the features, you will get a WER of 5%; that is the WER I get when I train with your folder without any other modifications.
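The expected .mfc size above can be sanity-checked by hand, assuming the default sphinx_fe output: a 4-byte count header followed by 13 float32 cepstral coefficients per frame at a 10 ms frame shift (100 frames per second). A sketch under those assumptions:

```python
# Rough expected size of a Sphinx .mfc feature file, assuming default
# sphinx_fe settings: 13 cepstra per frame, 100 frames/sec, float32
# values, and a 4-byte count header. These defaults are an assumption;
# check your feat.params if you changed the extraction configuration.
def expected_mfc_bytes(duration_sec, n_ceps=13, frames_per_sec=100):
    n_frames = int(duration_sec * frames_per_sec)
    return 4 + n_frames * n_ceps * 4

# A ~6 second utterance works out to about 31 kB of features;
# a far smaller file suggests extraction went wrong.
print(expected_mfc_bytes(6.0))  # → 31204
```

A feature file much smaller than this estimate for its audio duration is a sign that extraction used the wrong parameters or input files.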
Once you fix the feature extraction, the errors will go away.
Some of your files are still 44 kHz:
To batch-resample files, use sox.
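Before re-resampling, it can help to confirm which files still have the wrong rate. A minimal sketch using only the Python standard library (the function name and the 16 kHz target are assumptions based on this thread, not part of any Sphinx tool):

```python
import wave
from pathlib import Path

def find_wrong_rate(folder, expected_rate=16000):
    """Return (filename, rate) for wav files not at expected_rate."""
    bad = []
    for path in sorted(Path(folder).glob("*.wav")):
        with wave.open(str(path), "rb") as w:
            rate = w.getframerate()
            if rate != expected_rate:
                bad.append((path.name, rate))
    return bad
```

Each file this reports can then be converted with sox, e.g. `sox in.wav -r 16000 out.wav`, before re-running feature extraction.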
Thank you very much for patiently answering all my questions, Sir.
Thank you for the fast reply, Sir. It was indeed true: the word error rate has dropped to 6%.
But we still have a problem.
When I configured my acoustic model in my program, it does not transcribe the data files accurately, while an4.align shows really good results.
For example, I used the A_1 audio file as input; the transcribed text from my program is "abaja abajo abajada abajada abajada", but an4.align shows "aba abaaba abaja abajo abajada".
What could be my problem this time? I have already checked everything. Why can't the configured model transcribe my training audio accurately?
Last edit: Leimiaoren 2016-02-03
Sphinx4 is not going to reproduce the training results exactly. First of all, you need an accurate model; for example, your LM does not seem correct, so you had better train it properly.
If Sphinx4 cannot accurately reproduce the training results, is there any study about this? Like a research paper or article, so I can cite it in my paper? I can't find any.
There are no studies; it is common sense.
Then what is it in the training data that does not let the sphinx4 decoder transcribe it accurately? Or what does the sphinx4 decoder lack, such that it cannot transcribe the training data accurately? Please do reply. I REALLY NEED THIS.
Please help.
You need to provide all the information needed to describe your problem and reproduce your trouble. The faster you provide the information, the faster you get advice.
Then what is it in the training data that does not let the sphinx4 decoder transcribe it accurately? Or what does the sphinx4 decoder lack, such that it cannot transcribe the training data accurately? Please do reply. I REALLY NEED THIS.
No idea; I haven't seen your training data, and I also haven't seen why sphinx4 cannot transcribe accurately.
When I configured my acoustic model in my program, it does not transcribe the data files accurately, while an4.align shows really good results.
For example, I used the A_1 audio file as input; the transcribed text from my program is "abaja abajo abajada abajada abajada", but an4.align shows "aba abaaba abaja abajo abajada".
There must be a reason why the sphinx4 decoder cannot accurately reproduce the trained results. Do you by any chance know the reason, Sir?
It seems to be a bug in sphinx4 related to the .lm.bin language model format. If you use an ARPA-format language model, the result will be accurate. I need to investigate why lm.bin does not work in sphinx4; that will take some time. You can report a bug about it in our issue tracker.
Thank you very much for answering. I can use this as a reference.