CMU Sphinx / Forums / Help: Resampling to properly train acoustic model

Ernest - 2018-11-10

Dear All,

PROBLEM:
I want to train an acoustic model for the Polish language, and for that I have to collect recordings first.
BTW, I have successfully trained a model to recognize my own speech. As I wanted to use it primarily to recognize speech over a telephone, I used recordings with an 8 kHz sample rate.

QUESTION:
Can I collect recordings with a sample rate of 16 kHz, so that I can use them to train an acoustic model for desktop applications AND, after downsampling to 8 kHz (e.g. using sox or audacity), also use them to train an acoustic model for telephone speech recognition?
I simply want to avoid collecting recordings twice, for different sample rates.
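
For illustration, the downsampling step asked about here can be sketched in plain Python. This is only a minimal sketch, not what sox or audacity do internally: the function names `lowpass_taps` and `downsample_by_2`, the 63-tap Hamming-windowed sinc filter, and the cutoff of roughly 3.6 kHz are all assumptions chosen for the example. It low-pass filters the 16 kHz signal to suppress content above the new 4 kHz Nyquist limit and then keeps every second sample.

```python
import math


def lowpass_taps(num_taps=63, cutoff=0.225):
    """Hamming-windowed sinc low-pass filter (hypothetical helper).

    cutoff is in cycles per input sample; 0.225 is about 3.6 kHz at 16 kHz.
    """
    mid = (num_taps - 1) / 2
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            sinc = 2 * cutoff
        else:
            sinc = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(sinc * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at 0 Hz


def downsample_by_2(samples, taps=None):
    """Low-pass filter, then keep every second sample (e.g. 16 kHz -> 8 kHz)."""
    taps = taps or lowpass_taps()
    half = len(taps) // 2
    out = []
    for center in range(0, len(samples), 2):
        acc = 0.0
        for k, tap in enumerate(taps):
            i = center + k - half
            if 0 <= i < len(samples):  # treat samples outside the signal as zero
                acc += tap * samples[i]
        out.append(acc)
    return out


# Example: one second of a 1 kHz tone sampled at 16 kHz becomes 8000 samples.
tone = [math.sin(2 * math.pi * 1000 * n / 16000) for n in range(16000)]
tone_8k = downsample_by_2(tone)
```

Reading and writing WAV files is left out to keep the sketch short. Dropping samples without the low-pass filter would fold everything between 4 kHz and 8 kHz back into the speech band as aliasing, which is why the filter comes first.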

REMARKS:
Here https://cmusphinx.github.io/wiki/tutorialam/#data-preparation I've found this: "Please note that you cannot upsample your audio, that means you can not train 16 kHz model with 8 kHz data." Does it mean that I can do downsampling?

Thank you in advance,
Ernest

Last edit: Ernest 2018-11-10

- Nickolay V. Shmyrev - 2018-11-10
  
  Telephone audio is usually quite different from wideband audio because of different codecs and corruptions. That is why a telephone model has to be trained on telephone data. Downsampled wideband data can be used for bootstrapping or for an initial model, but on its own it results in low quality.
  
  - Ernest - 2018-11-10
    
    Hello Nickolay,
    
    thank you for your prompt reply!
    
    I would be grateful for further suggestions from an expert like you, on how can I proceed then:
    
    I have already developed a website (https://naukait.com:7171/) to collect recordings. Somehow I skipped this part of the tutorial: "if you are going to recognize telephone speech it is preferred to use telephone recordings." Couldn't I still make recordings with my website at a 16 kHz sampling rate, downsample them to 8 kHz, and use some "trick" to make the audio similar to that obtained over a telephone?
    
    I understand that for desktop apps my 16kHz recordings should be ok?
    
    What is a bootstrap and what is an initial model? Are these some kind of models which then have to be further improved? If yes, how could this be done?
    
    REMARKS:
    My website for recording asks users to record specific sentences. They result from research (not mine) and come from the so-called "CORPORA dictionary". These specific sentences are supposed to improve the training of an acoustic model for the Polish language.
    
    Best regards,
    Ernest
    
    Last edit: Ernest 2018-11-10
    
    - Nickolay V. Shmyrev - 2018-11-11
      
      "I would be grateful for further suggestions from an expert like you, on how can I proceed then:"
      
      To give you a suggestion, I need to understand the goal of your development and your resources.
      
      Overall, recording your own data is not reasonable these days; you can simply get thousands of hours of speech from external sources, not necessarily transcribed. For example, you can check https://github.com/jimregan/wolnelektury-audio-corpus.
      
      - Ernest - 2018-11-12
        
        Hello Nickolay,
        
        thank you for your valuable remark and the link.
        My initial goal was to develop a telephone service that provides schedules for public transportation. I already have a prototype based on Asterisk and Unimrcp with my own acoustic model for Polish, but it can only recognize my own speech. I haven't managed to find a free acoustic model for Polish, so I decided to create my own. That is why I developed this website to collect more recordings, in order to build a complete acoustic model for Polish. I also decided that it would be great if I could use these recordings to create an acoustic model for desktop apps as well. So I thought it might be better to collect recordings with a 16 kHz sample rate, which I can then downsample.
        
        Best regards,
        Ernest
        