Hello,
I want to use a Dutch language model, but so far I haven't found one for Sphinx(2).
Can anyone create a Dutch language model? I found the following URL with a complete Dutch dictionary:
TwenteCorpusContextDist.txt.bz2 (from the IFA Spoken Language Corpora; GPL license): http://www.fon.hum.uva.nl/IFA-SpokenLanguageCorpora/IFAcorpus/SLcorpus/DBMS2/tables/TwenteCorpusContextDist.txt.bz2
Can anyone compile this file into a Sphinx2 language model? I am willing to test it if somebody does this.
And we will also need a language model, so I need at least 20 MB of Dutch text.
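To keep track of how much usable text has been collected, something like the following might help. It is only a rough sketch: the function names are made up for illustration, and the cleanup rules (lowercase, drop punctuation and digits) are an assumption about what language model training text should look like, not a Sphinx requirement. Real preparation would probably spell numbers out instead of dropping them.

```python
import re

TARGET_BYTES = 20 * 1024 * 1024  # the "at least 20 MB" figure mentioned above


def normalize_line(line):
    """Lowercase a line and keep only letters, apostrophes and single spaces."""
    line = line.lower()
    line = re.sub(r"[^\w\s']", " ", line)  # replace punctuation with spaces
    line = re.sub(r"[\d_]", " ", line)     # drop digits and underscores
    return " ".join(line.split())          # collapse runs of whitespace


def corpus_progress(lines):
    """Return (cleaned lines, size in bytes, fraction of the 20 MB target)."""
    cleaned = [n for n in (normalize_line(l) for l in lines) if n]
    # +1 per line accounts for the newline when the text is written to disk
    size = sum(len(l.encode("utf-8")) + 1 for l in cleaned)
    return cleaned, size, size / TARGET_BYTES
```

Feeding it the lines of every collected article gives a running total to compare against the target.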
Hi, I've just made a Dutch model for sphinx3 from the IFA corpus. A Sphinx2 or pocketsphinx model can be made too, but I haven't had time yet. The helper files and the model itself can be downloaded from:
http://www.mediafire.com/download.php?b2juwvounye
A few issues still exist:
We need testing data, in particular a language model. To create one I need a lot of Dutch text.
I stripped around 80% of the database because of 5000 OOV (out-of-vocabulary) words; CELEX seems to be missing a lot of important data. This has to be fixed.
There are still some bad transcriptions; sphinx reports them as ERRORS.
It would be nice to use the hand-made segmentation as well; that will greatly improve WER.
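The OOV filtering mentioned in the list above can be sketched roughly as follows. This is not the script that was actually used; the helper names and the toy data are hypothetical, and it assumes a dictionary with one word per line followed by its phones, without handling alternate-pronunciation entries.

```python
def load_dictionary_words(lines):
    """Collect the first column (the word) from pronunciation dictionary lines."""
    words = set()
    for line in lines:
        parts = line.split()
        if parts:
            words.add(parts[0].lower())
    return words


def split_by_oov(transcripts, vocabulary):
    """Separate utterances that are fully covered by the vocabulary from the rest.

    Returns (kept, dropped): kept holds (utt_id, text) pairs, dropped holds
    (utt_id, list of OOV words) pairs.
    """
    kept, dropped = [], []
    for utt_id, text in transcripts:
        oov = [w for w in text.lower().split() if w not in vocabulary]
        if oov:
            dropped.append((utt_id, oov))
        else:
            kept.append((utt_id, text))
    return kept, dropped


# Toy data for illustration only; the phone strings are made up.
dictionary_lines = ["een E n", "twee t w e", "drie d r i"]
transcripts = [("utt1", "een twee"), ("utt2", "een vier"), ("utt3", "drie")]

vocab = load_dictionary_words(dictionary_lines)
kept, dropped = split_by_oov(transcripts, vocab)
for utt_id, oov in dropped:
    print(utt_id, "dropped, OOV:", ", ".join(oov))
```

Listing the OOV words per utterance, instead of silently dropping them, also gives a list of entries to add to the dictionary later.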
Great! Thank you very much for these great files!
For now I use Sphinx2, so I'll wait for the Dutch Sphinx2 version (I use Sphinx2 with Asterisk PBX).
I think you need raw text together with sound files in WAV, for example from various speakers? Can you give me a URL where I can find a description of the sound files and texts (how to record them in the correct sound format, etc.)?
The Sphinx project is very difficult (in my opinion), and for now I have no idea how to build models for new languages. So I can only help you with Dutch sound files and texts, and of course with test results.
>For now I use Sphinx2, so I'll wait for the Dutch Sphinx2 version (I use Sphinx2 with Asterisk PBX).
OK, I will do that soon too, although I suppose it would be better for you to move to pocketsphinx. How do you use it? Are you running it with an FSG or with a language model?
>I think you need raw text together with sound files in WAV, for example from various speakers? Can you give me a URL where I can find a description of the sound files and texts (how to record them in the correct sound format, etc.)?
It should be just a reading of some classical text, a newspaper or any other article. From a single speaker you need around 10-20 minutes of speech. The speech should be segmented into chunks of about 10 seconds and transcribed. That's all. Recording should be done at, say, 16 kHz (16000 Hz) in a WAV file. If you will be working with Asterisk, you need 8 kHz instead.
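As a rough illustration of the segmentation step, here is a minimal sketch using Python's standard `wave` module. The function name `split_wav` and its parameters are made up for this example. It cuts at fixed 10-second boundaries, which is a simplification: in practice you would cut at pauses so that words are not split, and each chunk still has to be transcribed by hand.

```python
import wave


def split_wav(path, out_prefix, chunk_seconds=10, expected_rate=16000):
    """Split a WAV file into fixed-length chunks and return the files written.

    Raises ValueError if the recording is not at the expected sample rate
    (16000 Hz by default; pass expected_rate=8000 for telephony audio).
    """
    written = []
    with wave.open(path, "rb") as src:
        rate = src.getframerate()
        if rate != expected_rate:
            raise ValueError(f"expected {expected_rate} Hz, got {rate} Hz")
        params = src.getparams()
        frames_per_chunk = rate * chunk_seconds
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break
            out_path = f"{out_prefix}_{index:03d}.wav"
            with wave.open(out_path, "wb") as dst:
                dst.setparams(params)    # same channels, sample width and rate
                dst.writeframes(frames)  # header is patched to the real length
            written.append(out_path)
            index += 1
    return written
```

For example, `split_wav("reading.wav", "reading_chunk")` would write `reading_chunk_000.wav`, `reading_chunk_001.wav` and so on, where `reading.wav` is a placeholder for your own recording.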
>The Sphinx project is very difficult (in my opinion), and for now I have no idea how to build models for new languages. So I can only help you with Dutch sound files and texts, and of course with test results.
Your help is appreciated.
OK, the sphinx2 models are trained too. You can download them at
http://www.mediafire.com/?fdfdenxgjtm
A simple script to test number recognition is also included. I hope they will work fine.
I can, but if you want these models to work well for you, submit your own speech to VoxForge:
http://voxforge.org/home/downloads/speech/dutch
Thanks for your fast reply.
I don't know how to use these files in combination with Sphinx2. Maybe it is possible to compile these sound files into a Dutch test model for Sphinx2, for example, so that I can test with this new language model.
For now I don't have a usable high-quality microphone, so I can't upload any audio files yet.
> For now I don't have a usable high-quality microphone, so I can't upload any audio files yet.
A high-quality microphone is not required; the speech should be recorded in real conditions. So start with something simple first: record your speech. The next step will be text collection.
How many different voices are needed to turn this into a complete Dutch language model?
Well, you can never say your model is complete, but for example you can compare it with Switchboard:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC97S62
It has 543 speakers in total.
Actually, for now we only care about your voice, not anyone else's :)