CMU Sphinx / Forums / Help: Dutch language model

jongerenchaos - 2007-11-02

Hello,

I want to use a Dutch language model, but so far I haven't found one for Sphinx(2).

Is there anyone who can create a Dutch language model? I found the following URL with a complete Dutch dictionary:
<a href=http://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFAcorpus/SLcorpus/DBMS2/tables/TwenteCorpusContextDist.txt.bz2> TwenteCorpusContextDist.txt.bz2 (from the IFA Spoken Language Corpora; GPL license)</a>

Is there anyone who can compile this file into the Sphinx2 format? I am happy to test it if somebody is willing to do this.

- Nickolay V. Shmyrev - 2007-11-02
  
  We will also need a language model, so I need at least 20 MB of Dutch text.
  
- Nickolay V. Shmyrev - 2007-11-05
  
  Hi, I've just made a Dutch model for sphinx3 from the IFA corpus. A Sphinx2 or pocketsphinx model can be made too, but I haven't had time yet. The helper files and the model itself can be downloaded from:
  
  http://www.mediafire.com/download.php?b2juwvounye
  
  A few issues still exist:
  
  We need testing data, in particular a language model. To create one I need a lot of Dutch text.
  
  I stripped around 80% of the database because of 5000 OOV (out-of-vocabulary) words; CELEX seems to miss a lot of important data. This has to be fixed.
  
  There are still some bad transcriptions; sphinx reports them as ERRORS.
  
  It would be nice to use hand-made segmentation as well; that would greatly improve WER.
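  
  The OOV filtering mentioned above can be sketched roughly as follows. This is only an illustrative Python sketch, not the script actually used on the IFA corpus: the function names, the whitespace tokenization, and the toy dictionary and transcripts (including the phone symbols) are all assumptions made up for the example.
  
  ```python
  def load_vocabulary(dictionary_lines):
      """Collect the words of a pronunciation dictionary (first field of each line)."""
      vocab = set()
      for line in dictionary_lines:
          fields = line.split()
          if fields:
              vocab.add(fields[0].lower())
      return vocab
  
  
  def split_by_oov(transcripts, vocab):
      """Separate utterances fully covered by vocab from those containing OOV words."""
      kept, dropped, oov_words = [], [], set()
      for utt_id, text in transcripts:
          missing = {w for w in text.lower().split() if w not in vocab}
          if missing:
              dropped.append(utt_id)
              oov_words |= missing
          else:
              kept.append(utt_id)
      return kept, dropped, oov_words
  
  
  # Toy data, invented for illustration only (hypothetical phone symbols).
  dictionary = ["een E N", "twee T W E", "drie D R I"]
  transcripts = [("utt1", "een twee"), ("utt2", "een vier"), ("utt3", "drie")]
  
  vocab = load_vocabulary(dictionary)
  kept, dropped, oov_words = split_by_oov(transcripts, vocab)
  print(kept, dropped, sorted(oov_words))
  ```
  
  The collected `oov_words` set is the list of words that would have to be added to the dictionary before the dropped utterances could be used for training.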
  
  - jongerenchaos - 2007-11-05
    
    Great! Thank you very much for these great files!
    
    For now I use Sphinx2, so I'll wait for the Dutch Sphinx2 version (I use Sphinx2 with Asterisk PBX).
    I think you need raw text together with sound files (in WAV, for example) from various people? Can you give me a URL where I can find a description of the sound files and texts (how to produce them in the correct sound format, etc.)?
    
    The Sphinx project is very difficult (in my opinion), and for now I have no idea how to compile new language versions. Therefore I can only help you with Dutch sound files and texts, and of course also with test results.
    
    - Nickolay V. Shmyrev - 2007-11-05
      
      >For now I use Sphinx2, so I'll wait for the Dutch Sphinx2 version (I use Sphinx2 with Asterisk PBX).
      
      Ok, I will do that soon too, although I suppose it would be better for you to move to pocketsphinx. How do you use it? Are you running it with an FSG (finite state grammar) or with a language model?
      
      >I think you need raw text together with sound files (in WAV, for example) from various people? Can you give me a URL where I can find a description of the sound files and texts (how to produce them in the correct sound format, etc.)?
      
      It should just be a reading of some classical text, a newspaper, or any other article. From a single speaker you need around 10-20 minutes of speech. The speech should be segmented into chunks of about 10 seconds and transcribed. That's all. The recording must be done at, say, 16 kHz (16000 Hz) in a WAV file. If you will work with Asterisk, you need 8 kHz instead.
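      
      As a rough illustration of the recording requirements above, the following Python sketch checks the sample rate and duration of a WAV file with the standard `wave` module and proposes chunk boundaries of at most 10 seconds. The helper names and the generated silent demo file are assumptions for the example only; the fixed-length split is a simplification, since real segmentation would cut at pauses between phrases.
      
      ```python
      import os
      import tempfile
      import wave
      
      
      def wav_info(path):
          """Return (sample_rate_hz, channels, duration_seconds) of a PCM WAV file."""
          with wave.open(path, "rb") as wav:
              rate = wav.getframerate()
              return rate, wav.getnchannels(), wav.getnframes() / rate
      
      
      def chunk_bounds(duration, chunk_seconds=10.0):
          """Split [0, duration) into consecutive (start, end) pieces of at most chunk_seconds."""
          bounds = []
          start = 0.0
          while start < duration:
              end = min(start + chunk_seconds, duration)
              bounds.append((start, end))
              start = end
          return bounds
      
      
      def write_silence(path, rate, seconds):
          """Write a mono 16-bit WAV of silence; used here only as demo input."""
          with wave.open(path, "wb") as wav:
              wav.setnchannels(1)
              wav.setsampwidth(2)
              wav.setframerate(rate)
              wav.writeframes(b"\x00\x00" * int(rate * seconds))
      
      
      # Demo: a 25-second file at 8 kHz, the rate suggested for Asterisk.
      demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
      write_silence(demo_path, 8000, 25)
      rate, channels, duration = wav_info(demo_path)
      print(rate, channels, duration)
      print(chunk_bounds(duration))
      ```
      
      Each chunk would then be saved as its own file and paired with a transcription of exactly the words spoken in it.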
      
      >The Sphinx project is very difficult (in my opinion), and for now I have no idea how to compile new language versions. Therefore I can only help you with Dutch sound files and texts, and of course also with test results.
      
      Your help is appreciated.
      
    - Nickolay V. Shmyrev - 2007-11-10
      
      Ok, sphinx2 models are trained too. You can download them at
      
      http://www.mediafire.com/?fdfdenxgjtm
      
      A simple script to test number recognition is also included. I hope they will work fine.
      
- Nickolay V. Shmyrev - 2007-11-02
  
  I can, but if you want this model to work well for you, submit your own speech to VoxForge:
  
  http://voxforge.org/home/downloads/speech/dutch
  
  - jongerenchaos - 2007-11-03
    
    Thanks for your fast reply.
    
    I don't know how I can use these files in combination with Sphinx2. Maybe it is possible to compile these sound files into a Dutch test module for Sphinx2, for example, so that I can test with this new language model.
    For now I have no useful high-quality microphone, so I can't upload any audio files yet.
    
    - Nickolay V. Shmyrev - 2007-11-03
      
      > For now I have no useful high-quality microphone, so I can't upload any audio files yet.
      
      A high-quality microphone is not required; speech must be recorded in real conditions. So start with something simple first. Record your speech; the next step will be text collection.
      
      - jongerenchaos - 2007-11-03
        
        How many different voices are needed to make this a complete Dutch language model?
        
        Nickolay V. Shmyrev - 2007-11-03
        
        Well, you can never say your model is complete, but for example you can compare it with Switchboard:
        
        http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62
        
        It has 543 speakers in total.
        
        Actually, for now we only care about your voice, not anyone else's :)
        