I would like to know which data set was used for the Chinese (Mandarin) model provided by CMU.
When I train my own Chinese model, the recognition rate is very low.
I suspect that either the data set is of poor quality or there is a problem with my training method.
The model was trained on https://catalog.ldc.upenn.edu/LDC98S73
If you use AISHELL http://www.openslr.org/33/ or AISHELL2, the accuracy should be much better.
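If it helps, here is a rough sketch of how AISHELL-1 transcripts could be turned into the .fileids/.transcription files SphinxTrain expects. The transcript file name, the speaker-subdirectory layout, and the output file names are assumptions about a typical AISHELL-1 download, not something fixed by this thread, so adjust them to your setup.

```python
#!/usr/bin/env python3
"""Convert AISHELL-1 transcripts into SphinxTrain .fileids/.transcription files.

Sketch only: the transcript file name (aishell_transcript_v0.8.txt) and the
assumption that each wav lives under a speaker subdirectory (e.g. S0002/)
describe a typical AISHELL-1 layout and may need adjusting.
"""
import sys

def convert(transcript_path, fileids_path, transcription_path):
    with open(transcript_path, encoding="utf-8") as src, \
         open(fileids_path, "w", encoding="utf-8") as fileids, \
         open(transcription_path, "w", encoding="utf-8") as trans:
        for line in src:
            parts = line.split()
            if len(parts) < 2:
                continue
            utt_id, words = parts[0], parts[1:]
            # AISHELL-1 IDs look like BAC009S0002W0122; characters 6-11 give
            # the speaker ID (S0002), which is also the wav subdirectory.
            speaker = utt_id[6:11]
            fileids.write(f"{speaker}/{utt_id}\n")
            # SphinxTrain transcription format: <s> word word ... </s> (utt_id)
            trans.write(f"<s> {' '.join(words)} </s> ({utt_id})\n")

if __name__ == "__main__":
    convert(sys.argv[1], sys.argv[2], sys.argv[3])
```

Something like `python aishell_to_sphinx.py aishell_transcript_v0.8.txt etc/db_train.fileids etc/db_train.transcription` would then produce the two files, assuming those output names match the database name in your configuration.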
Thanks.
I would also like to know what your noise (filler) entries are, such as +LAUGH+, +LIPSMACK+, +COUGH+, ...
Would you mind sharing your "sphinx_train.cfg" file with me, so that I can compare it against mine and find the differences?
The noisedict inside the model contains:
The model was trained a long time ago, so the configuration file is lost. You can use the default one; it should give you the same results.
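You can also list the fillers yourself from the downloaded model package. A minimal sketch, assuming the unpacked acoustic-model directory contains a plain-text noisedict file as standard CMUSphinx model packages do (the directory name below is an assumption):

```python
#!/usr/bin/env python3
"""Print the filler entries from a CMUSphinx acoustic model's noisedict.

Sketch only: assumes a plain-text 'noisedict' file with one filler word
followed by its filler phone(s) per line; the model directory name is a guess.
"""
import os

MODEL_DIR = "zh_cn.cd_cont_5000"  # assumed name of the unpacked model directory

with open(os.path.join(MODEL_DIR, "noisedict"), encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line or line.startswith(";;"):  # skip blanks and any comment lines
            continue
        word, *phones = line.split()
        print(f"{word} -> {' '.join(phones)}")
```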
Ok, let me try again.
Thank you very much.