Dutch language model

Help
2007-11-02
2012-09-22
  • jongerenchaos

    jongerenchaos - 2007-11-02

    Hello,

    I want to use a Dutch language model, but so far I haven't found one for Sphinx(2).

    Can anyone create a Dutch language model? I found the following URL with a complete Dutch dictionary:
    http://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFAcorpus/SLcorpus/DBMS2/tables/TwenteCorpusContextDist.txt.bz2 (TwenteCorpusContextDist.txt.bz2, from the IFA Spoken Language Corpora; GPL license)

    Can anyone compile this file into the Sphinx2 format? If somebody does, I'd be glad to test it.

     
    • Nickolay V. Shmyrev

      And we will also need a language model, so I need at least 20 MB of Dutch texts.
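
To illustrate why raw text is needed: a statistical language model is estimated from word-sequence counts collected over a large text collection. The following is only a minimal sketch of that counting step, not the tooling actually used for the Sphinx models in this thread. The function names `normalize` and `bigram_counts` and the sample sentences are made up for illustration; the `<s>` and `</s>` sentence markers follow the usual n-gram convention.

```python
import re
from collections import Counter


def normalize(sentence):
    """Lowercase a sentence and keep only runs of letters as words."""
    # [^\W\d_] matches any Unicode letter, so accented Dutch letters survive.
    return re.findall(r"[^\W\d_]+", sentence.lower())


def bigram_counts(sentences):
    """Count adjacent word pairs, with <s> and </s> marking sentence boundaries."""
    counts = Counter()
    for sentence in sentences:
        words = normalize(sentence)
        if not words:
            continue  # skip empty lines or lines without any letters
        tokens = ["<s>"] + words + ["</s>"]
        counts.update(zip(tokens, tokens[1:]))
    return counts


if __name__ == "__main__":
    # Hypothetical sample text; a usable model needs many megabytes of it.
    sample = ["Ik wil een model.", "Ik wil testen."]
    for pair, n in bigram_counts(sample).most_common(3):
        print(pair, n)
```

With only a handful of sentences almost every word pair is seen once or never, which is why a large amount of text is requested.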

       
    • Nickolay V. Shmyrev

      Hi, I've just made a Dutch model for sphinx3 from the IFA corpus. A Sphinx2 or pocketsphinx model can be made too, but I haven't had time yet. The helper files and the model itself can be downloaded from:

      http://www.mediafire.com/download.php?b2juwvounye

      A few issues still exist:

      1. We need testing data, in particular a language model. To create one I need a lot of Dutch texts.

      2. I stripped around 80% of the database because of 5000 OOV (out-of-vocabulary) words; CELEX seems to be missing a lot of important data. This has to be fixed.

      3. There are still some bad transcriptions; sphinx reports them as ERRORS.

      4. It would be nice to use hand-made segmentation as well; that will greatly improve WER.
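
The OOV stripping mentioned in item 2 can be sketched roughly as follows. This is an illustration under assumptions, not the script that was actually used: it assumes a dictionary with one entry per line in the form "word phone phone ...", and transcripts given as (utterance id, text) pairs. The function names, the sample entries and the phone symbols are all made up.

```python
def load_vocabulary(dictionary_lines):
    """Collect the words (first field) from pronunciation dictionary lines."""
    vocab = set()
    for line in dictionary_lines:
        parts = line.split()
        if parts:
            vocab.add(parts[0].lower())
    return vocab


def split_by_oov(transcripts, vocab):
    """Separate utterances fully covered by the dictionary from those with OOV words.

    Returns (kept ids, dropped ids, set of OOV words).
    """
    kept, dropped, oov_words = [], [], set()
    for utt_id, text in transcripts:
        missing = {w for w in text.lower().split() if w not in vocab}
        if missing:
            dropped.append(utt_id)
            oov_words |= missing
        else:
            kept.append(utt_id)
    return kept, dropped, oov_words


if __name__ == "__main__":
    # Hypothetical dictionary entries; the phone symbols are placeholders.
    dictionary = ["een e n", "twee t w e", "drie d r i"]
    transcripts = [("utt1", "een twee"), ("utt2", "drie vier"), ("utt3", "twee drie")]
    kept, dropped, oov = split_by_oov(transcripts, load_vocabulary(dictionary))
    print("kept:", kept)
    print("dropped:", dropped)
    print("OOV words:", sorted(oov))
```

The collected OOV set is the list of words that would need pronunciations added to the dictionary so that the dropped utterances can be used for training.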

       
      • jongerenchaos

        jongerenchaos - 2007-11-05

        Great! Thank you very much for these great files!

        For now I use Sphinx2, so I'll wait for the Dutch Sphinx2 version (I use Sphinx2 with Asterisk PBX).
        I think you need raw text together with sound files in wav, for example from various speakers? Can you give me a URL where I can find a description of the sound files and texts (how I can produce them in the correct sound format, etc.)?

        The Sphinx project is very difficult (in my opinion), and for now I have no idea how to build models for new languages. Therefore I can only help you with Dutch sound files and texts, and of course also with test results.

         
        • Nickolay V. Shmyrev

          >For now I use Sphinx2, so I'll wait for the Dutch Sphinx2 version (I use Sphinx2 with Asterisk PBX).

          OK, I will do that soon too, although I suppose it would be better for you to move to pocketsphinx. How do you use it? Are you running it with an fsg or with a language model?

          >I think you need raw text together with sound files in wav, for example from various speakers? Can you give me a URL where I can find a description of the sound files and texts (how I can produce them in the correct sound format, etc.)?

          It should just be a reading of some classical text, a newspaper or any other article. From a single speaker you need around 10-20 minutes of speech. The speech should be segmented into chunks of about 10 seconds and transcribed. That's all. Recording must be done at, say, 16 kHz in a wav file. If you will work with Asterisk, you need 8 kHz instead.
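
As a rough illustration of the recording requirements above, the sketch below checks the sampling rate of a wav file and cuts it into pieces of about 10 seconds using only Python's standard `wave` module. The function names and the `chunk_NNN.wav` naming are made up. Cutting at fixed positions is a simplification: real segments should be cut at pauses between sentences, and each chunk still has to be transcribed by hand.

```python
import os
import wave


def chunk_bounds(n_frames, rate, chunk_seconds=10.0):
    """Return (start, end) frame positions for consecutive fixed-length chunks."""
    step = int(rate * chunk_seconds)
    return [(s, min(s + step, n_frames)) for s in range(0, n_frames, step)]


def split_wav(path, out_dir, expected_rate=16000, chunk_seconds=10.0):
    """Check the sampling rate of a wav file and write it out in chunks.

    Use expected_rate=8000 for telephony audio such as Asterisk recordings.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with wave.open(path, "rb") as src:
        rate = src.getframerate()
        if rate != expected_rate:
            raise ValueError(f"{path}: sampling rate is {rate} Hz, expected {expected_rate} Hz")
        params = src.getparams()
        bounds = chunk_bounds(src.getnframes(), rate, chunk_seconds)
        for i, (start, end) in enumerate(bounds):
            src.setpos(start)
            frames = src.readframes(end - start)
            out_path = os.path.join(out_dir, f"chunk_{i:03d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # same channels, sample width and rate
                dst.writeframes(frames)  # frame count in the header is fixed on close
            written.append(out_path)
    return written
```

For example, `split_wav("reading.wav", "chunks")` (with a hypothetical `reading.wav`) would write `chunks/chunk_000.wav`, `chunks/chunk_001.wav` and so on, and raise an error if the file was not recorded at 16 kHz.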

          >The Sphinx project is very difficult (in my opinion), and for now I have no idea how to build models for new languages. Therefore I can only help you with Dutch sound files and texts, and of course also with test results.

          Your help is appreciated.

           
        • Nickolay V. Shmyrev

          OK, the sphinx2 models are trained too. You can download them at

          http://www.mediafire.com/?fdfdenxgjtm

          A simple script to test number recognition is also included. I hope they will work fine.

           
    • Nickolay V. Shmyrev

      I can, but if you want these models to work well for you, submit your own speech to voxforge:

      http://voxforge.org/home/downloads/speech/dutch

       
      • jongerenchaos

        jongerenchaos - 2007-11-03

        Thanks for your fast reply.

        I don't know how I can use these files in combination with Sphinx2. Maybe it is possible to compile these sound files into a Dutch test model for Sphinx2, for example, so that I can test with this new language model.
        For now I have no usable high-quality microphone, so I can't upload any audio files yet.

         
        • Nickolay V. Shmyrev

          > For now I have no usable high-quality microphone, so I can't upload any audio files yet.

          A high-quality microphone is not required; speech must be recorded in real conditions. So start with something simple first. Record your speech; the next step will be text collection.

           
          • jongerenchaos

            jongerenchaos - 2007-11-03

            How many different voices are needed to make this a complete Dutch language model?

             
            • Nickolay V. Shmyrev

              Well, you can never say your model is complete, but for example you can compare it with Switchboard:

              http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62

              It has 543 speakers in total.

              Actually, for now we only care about your voice, not any others :)

               
