CMU Sphinx / Forums / Speech Recognition Theory: German Voxforge acoustic model first release

Nickolay V. Shmyrev - 2008-02-20

Hi all. We are happy to announce the first open-source German acoustic model based on a free GPL speech corpus. You can download it at VoxForge as usual:

http://voxforge.org/home/downloads

The model was trained on 5 hours of speech, unfortunately from a small number of speakers. So please help to improve it: submit your speech to VoxForge.

- DefRay - 2008-03-31
  
  I tried the new model, mostly in the S4 HelloNGram demo.
  For digits it works great, but free speech is still pretty bad, though I could improve accuracy a lot by creating a .lm from the full transcription file (3000 sentences).
  One of the main problems I encounter is still the missing word problem I posted above, with the German umlauts.
  I guess it's a problem in the getWord() method of the FastDictionary component, though it's probably fixed by now in SVN? (I'm still using the original sphinx4-1.0beta package).
  
  I hope some other Germans and I will find time to contribute to the VoxForge project, as the model is currently very much overtrained on Ralf's files.
  
  Is there a possibility of getting all the transcripts (all the prompts.txt files) in the repository without downloading the audio files or going through every submission in the "listen" section?
  We could then at least create a probably pretty decent language model for free speech ...
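
A minimal sketch of how such a text corpus could be assembled once the prompts.txt files are on disk. It assumes that each line holds a file ID followed by the sentence and that the submissions sit in folders below one root directory. The function names, the folder layout and the `<s> ... </s>` wrapping are illustrative assumptions, not something VoxForge or Sphinx prescribes in this thread.

```python
from pathlib import Path


def collect_sentences(root):
    """Gather unique sentences from every prompts.txt found below root.

    Assumed line format: "<file-id> <word> <word> ..." (one prompt per line).
    """
    sentences = []
    seen = set()
    for prompts in sorted(Path(root).rglob("prompts.txt")):
        text = prompts.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            parts = line.split(maxsplit=1)
            if len(parts) < 2:
                continue  # skip empty lines and IDs without any text
            # lowercase and collapse repeated whitespace
            sentence = " ".join(parts[1].lower().split())
            if sentence not in seen:
                seen.add(sentence)
                sentences.append(sentence)
    return sentences


def write_corpus(sentences, out_path):
    """Write one sentence per line, wrapped in <s> ... </s> markers."""
    with open(out_path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            out.write(f"<s> {sentence} </s>\n")
```

The resulting file could then be fed to whichever language-model tool is in use; duplicates are dropped so that prompts recorded by several speakers are counted once.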
  
  - DefRay - 2008-03-31
    
    regarding the missing word error:
    All the words containing those umlauts are correct in the dictionary, transcription and language model.
    There seems to be a problem loading them from the dictionary.
    If you're not encountering this problem, I should probably upgrade my sphinx4 to a newer SVN version.
    
    The mentioned problem is severe for recognition because in Ralf's samples there are a lot of words with umlauts.
    When decoding with S3 there is apparently no such problem, as the digit "fünf", which contains an umlaut, is recognized correctly.
    
- Holger Brandl - 2008-02-20
  
  Hi Nickolay,
  
  cool! Would it be possible to provide a script along with the files which creates a s4-AcousticModel-jar?
  
  -Holger
  
  - Nickolay V. Shmyrev - 2008-02-20
    
    Sure, I'll upload a jar too. Right now I just need help from someone who knows the language in order to fix a rather significant number of BW (Baum-Welch) errors during training.
    
    - DefRay - 2008-03-13
      
      Well, I could probably help you a little.
      I'm a German student and I've been waiting for a German acoustic model for some time now.
      I'll try out the one from Voxforge and if I can be of some (not too much time consuming) help, feel free to ask me anything.
      
      - Nickolay V. Shmyrev - 2008-03-13
        
        Thanks! Your help is really appreciated. We have to share the audio first; it's available for download from VoxForge right now, but in a rather unpleasant way (you have to use wget). I would be really happy if someone could review the dictionary and the prompts. Training gives a lot of errors caused by bad alignment, so someone who knows German should check the transcriptions.
        
        About JSGF and the missing phones: it should work. The model has a test subfolder with a test script for sphinx3, so I think it should be fairly easy to reproduce. I'll try to upload a jar too so we can check each other's results.
        
        DefRay - 2008-03-14
        
        Could you post some links to the files I have to download? I can't really find it on their website? Or should I download everything in the speech corpus from the "listen" section?
        
        The phone set seems pretty good, as does the pronunciation in the dictionary, but individually checking all 3200+ entries could take some time. Could someone explain to me why the phone "qq" is used in front of every word beginning with a vowel?
        I could check the transcript for errors, but where can I find the prompts to download?
        
        If you have a link to a .jar version of the acoustic model for S4, I could test it with a few demos. Mine seems to be created wrong, at least flatLinguist complains as stated above, maybe I should create one with the cd-8gau-files?
        
        Nickolay V. Shmyrev - 2008-03-14
        
        > Could you post some links to the files I have to download? I can't really find it on their website? Or should I download everything in the speech corpus from the "listen" section?
        
        Yeah, currently they are only available in the "Listen" section. You can get them with wget, for example. See the discussion at the bottom of this page:
        
        http://voxforge.org/home/forums/other-languages/german/localizing-the-speechsubmission-app-to-german?pn=2
        
        > The Phoneset seems pretty good, also the pronunciation in the dictionary, but to individually check all 3200+ entries could take some time. Could someone explain to me, why the phone "qq" is used in front of every word beginning with a vocal ?
        > I could check the transcript for errors, but where can I find the prompts to download?
        
        I understand. The problem is that, due to the BOMP restrictions, the dictionary is created with espeak rules and a small Perl script for phone mapping. Probably qq is not really needed at the beginning of words. I suspect the dictionary is not correct, simply because there are around 300 rejected prompts, as you can see in the log.
        
        > If you have a link to a .jar version of the acoustic model for S4, I could test it with a few demos. Mine seems to be created wrong, at least flatLinguist complains as stated above, maybe I should create one with the cd-8gau-files?
        
        I'll try to prepare it tomorrow.
        
        Nickolay V. Shmyrev - 2008-03-15
        
        Ok, here is an example for you to test sphinx4:
        
        http://www.mediafire.com/?j1l9d0ujmgg
        
        DefRay - 2008-03-17
        
        Thanks.
        I did some testing with it.
        Decoding digits with the WavFile or the HelloDigits demo and a JSGF grammar works very well.
        When using a language model created from the transcript file and decoding with the HelloNGram demo, digits work pretty well. But recognition of free speech is really bad (almost completely random results).
        Was the acoustic model created from all the files in the German "Listen" section on voxforge.org ?
        Because even decoding audio files by Ralf Herzog (which should have been trained I guess) returns really bad results.
        So I don't know what's wrong, but a model created from all those sentences in the German corpus should decode better ...
        
        Also, I don't know if it's a problem on my side, but when starting the HelloNGram demo, it complains about all the words with "umlaute", like äöü, though they are represented correctly in the dictionary and the language model.
        Here's a little output:
        
        02:04.181 WARNING dictionary Missing word: bￃﾼcher
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.181 WARNING dictionary Missing word: verzￃﾶgert
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.195 WARNING dictionary Missing word: hartnￃﾤckig
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.196 WARNING dictionary Missing word: abstￃﾤnden
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.196 WARNING dictionary Missing word: fￃﾼrsorglichkeit
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.196 WARNING dictionary Missing word: gefￃﾤhrlich
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.196 WARNING dictionary Missing word: verhￃﾤltnis
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.196 WARNING dictionary Missing word: dￃﾼrfen
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.196 WARNING dictionary Missing word: kￃﾼchen
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.196 WARNING dictionary Missing word: zusￃﾤtzliche
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.196 WARNING dictionary Missing word: wￃﾤhrungspolitik
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        02:04.196 WARNING dictionary Missing word: gￃﾤngigsten
        in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
        
        Is FastDictionary having problems with those (notice the wrong display of the letters) or is it something I have set wrong ?
        The WavFile and HelloDigits demos didn't complain about the digit "fünf" for example.
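
One possible explanation for the garbled words in the log, offered as an assumption rather than something confirmed in this thread: the characters shown are what results when each UTF-8 byte of a word is widened to a 16-bit character with sign extension, which is what a plain byte-to-char cast does in Java. The sketch below reproduces that mapping in Python; `sign_extended_chars` is a hypothetical helper written for illustration, not a Sphinx-4 function.

```python
def sign_extended_chars(word):
    """Mimic casting each signed UTF-8 byte to a 16-bit char.

    Bytes below 0x80 stay plain ASCII; bytes from 0x80 upward are treated
    as negative values and land in the U+FF80..U+FFFF range.
    """
    out = []
    for b in word.encode("utf-8"):
        out.append(chr(b if b < 0x80 else 0xFF00 | b))
    return "".join(out)


garbled = sign_extended_chars("bücher")
# Show the code points so the result does not depend on the terminal font.
print([f"U+{ord(c):04X}" for c in garbled])
```

If that is what happens while the language model or dictionary is read, the lookup key no longer equals the dictionary entry, which would fit the "Missing word" warnings. Checking how the files are decoded (and with which character encoding) would be the place to start.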
        
        DefRay - 2008-03-17
        
        BTW,
        what problems in the training run do you mean?
        Looking at the HTML file, your last training run completed successfully.
        Do you mean the "final state not reached" errors in the cd-training logs?
        I don't know exactly what those mean. Is it that there's something wrong with the audio file to the corresponding utterance in the transcript?
        If every one of those utterances is ignored, it's of course a big problem ... would it be hard to force-align them, so they're used in BW training anyway?
        I listened to some of the problematic sound files, and they were perfectly ok, no cut-offs or anything. I don't know why the error occurs for them as opposed to the other files.
        
        Also, why were only a few of Ralf Herzog's submissions used for training (de91 through de120)?
        Would the model be overtrained otherwise ?
        
        Nickolay V. Shmyrev - 2008-03-17
        
        About the errors: yes, I meant the "final state not reached" errors in the CD logs. They mostly mean that the transcription is out of sync with the audio. The problem is probably not at that particular place, but it signals a problem in the transcription and/or the dictionary. That's why I'm asking for a review here. I think we need to start with a smaller set of prompts and find the problem in our dictionary. Then we can double the data and keep an eye on dropped prompts. The final state should always be reached. Forced alignment is a good idea indeed, I'll try it too.
        
        Overtraining is a problem; I told Ralf several times that we don't need so many recordings from him. But since there is no more data, I think we'll train on the existing audio. There are some other speakers, so I hope it will be OK for a beginning. Also, we'll probably use Ralf's voice for a TTS database, so his work makes sense at least. I trained on only some of his recordings simply because I hadn't downloaded the rest. We are still in the process of transferring the data to the repository.
        
        About the problems with general recognition: could you please submit a sample, as I did? I'll try to take a look and check what's wrong there.
        
        DefRay - 2008-03-26
        
        I checked the transcript and dictionary for spelling. Everything's all right here.
        I also checked some of the problematic sound files (those mentioned in the training logs) with audacity, they are completely normal, like any of the other files.
        I really don't know what's wrong here ...
        
        Also, I still don't know why I'm getting those errors with the special German characters (like posted above).
        I tried creating an LM from the transcript and it worked, but still the decoding is pretty bad, probably because of all the "missing" words from the dictionary and the skipped utterances in training.
        
        If you can give me a hint in the right direction, please do so.
        
        Nickolay V. Shmyrev - 2008-03-28
        
        Ah, I've found the problem:
        
        Uttid mismatch: ctlfile = "de100-75"; transcript = "de91-75"
        
        The .fileids file is simply not in sync with the transcript file. I'll retrain the model and upload the audio this weekend.
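
A small sketch of how this kind of mismatch could be caught before training. It assumes the usual layout in which line N of the control (.fileids) file and line N of the transcription file describe the same utterance, the utterance ID is the last path component of the control entry, and each transcription line ends with the ID in parentheses. `find_uttid_mismatches` is a hypothetical helper, not part of SphinxTrain.

```python
import re
from itertools import zip_longest

# Utterance ID in parentheses at the end of a transcription line.
UTTID_RE = re.compile(r"\(([^()\s]+)\)\s*$")


def find_uttid_mismatches(fileids_lines, transcript_lines):
    """Compare both files line by line and report differing utterance IDs.

    Returns a list of (line_number, ctl_id, transcript_id) tuples; an ID is
    None when the line is missing or carries no recognizable ID.
    """
    mismatches = []
    pairs = zip_longest(fileids_lines, transcript_lines, fillvalue="")
    for number, (ctl, trans) in enumerate(pairs, start=1):
        ctl_id = ctl.strip().split("/")[-1] or None
        match = UTTID_RE.search(trans)
        trans_id = match.group(1) if match else None
        if ctl_id != trans_id:
            mismatches.append((number, ctl_id, trans_id))
    return mismatches
```

Called with the lines of both files (for example via `open(path).read().splitlines()`), an empty result means the two files agree entry by entry under the assumptions above.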
        
        Nickolay V. Shmyrev - 2008-03-31
        
        Well, I uploaded new models; they should be available here soon:
        
        http://www.repository.voxforge1.org/downloads/de/Trunk/AcousticModels/
        
        audio is also available:
        
        http://www.repository.voxforge1.org/downloads/de/Trunk/Audio/Main/16kHz_16bit/
        
        it's more than 21 hours.
        
        The model is clearly overtrained, mostly because of Ralf's submissions. But please try it and report your results.
        
    - DefRay - 2008-03-13
      
      I just encountered a problem, I don't know if it's my mistake:
      I created a .jar S4 Model to use the German Voxforge Model for testing the HelloDigits demo.
      I also created a .lm with the LMTool.
      
      I tried running the demo, both with an lmGrammar or a jsgfGrammar, but flatLinguist always complains about missing HMM for certain phones, though the phones are in the .phone file and in the dictionary and the words in the jsgfGrammar are all in the dictionary and transcript ...
      
      I used the ci_cont models; should I use the CD ones instead?
      flatLinguist can't find the phone "qq" when using a lmGrammar for the demo and the phone "v" when using a jsgfGrammar.
      For the jsgf I only used German digits, so the phone "v" is only in the word "zwei".
      
      Have I done something wrong anywhere? It seems like my acoustic model is broken.
      
Craig - 2010-09-07

Hi people,

I came across this thread while searching for information about German speech
recognition with Sphinx. I have downloaded the German acoustic model,
dictionary and a simple demo from:

http://www.mediafire.com/?j1l9d0ujmgg

The digits example works well, but the dictionary has only a small number of
words and was missing a majority of the words in a portion of sample text. I
am about to extend the dictionary, and intend to try both a translation from
IPA to Arpabet (using Ralph's German dictionary,
http://spirit.blau.in/simon/2009/10/24/ralfs-german-dictionary-version-016/ )
and also a direct phonetic interpretation of the spelling. This is probably
more achievable in German than it would be in English, as German pronunciation
is fairly predictable.

The current German dictionary from the above link is in a slightly odd format,
however. Here is a sample:

ab qq a p
abbauen qq a b au @ n
aber qq aa: b ei
abfälle qq a p f ee l @
abgaben qq a p g aa: b @ n
abgebaut qq a p g @ b au t
abgeben qq a p g e: b @ n
abgebrochen qq a p g @ b r oo x @ n
abgedeckt qq a p g @ d ee k t
abgefallen qq a p g @ f a l @ n
abgefunden qq a p g @ f uu n d @ n
abgehalten qq a p g @ h a l t @ n

I noted that someone up above asked about the double-q phoneme that appears
before initial vowels. There was no clear answer... Does anyone know why it is
there and what it means? Would it matter if this was left out of the
dictionary? Also, the unstressed e (the schwa) is represented by an @ symbol,
which is not part of the standard Arpabet
( http://www.speech.cs.cmu.edu/cgi-bin/cmudict ). Is it necessary to keep the
existing phoneme labels to use the acoustic model?

If someone has already created a large German pronunciation dictionary and/or
acoustic model, I would like to share it. Also, I can probably help train the
model using some of the German audiobooks in my collection.

Cheers,

Craig.

Nickolay V. Shmyrev - 2010-09-08

Hello

> I am about to extend the dictionary, and intend to try both a translation from IPA to Arpabet (using Ralph's German dictionary, http://spirit.blau.in/simon/2009/10/24/ralfs-german-dictionary-version-016/ ) and also a direct phonetic interpretation of the spelling. This is probably more achievable in German than it would be in English, as German pronunciation is fairly predictable.

The existing dictionary was built with the espeak TTS and the script in the
root of the acoustic model package, so you can easily extend it to any
vocabulary you like. There is also Ralf's work, and there are commercial
alternatives (BOMP, for example) which could be better in some situations.
Anyway, you have many choices here and can pick the best one. For VoxForge
any consistent dictionary will do, and it would be very nice to retrain the
German model with it.

The existing dictionary can be extended manually or with G2P
(grapheme-to-phoneme) software. I would also recommend plugging in a TTS
engine like OpenMARY, which can generate pronunciations without being limited
to the fixed vocabulary of a dictionary. That also solves the tokenization
issue of converting numbers and abbreviations into textual form.

> I noted that someone up above asked about the double-q phoneme that appears before initial vowels. There was no clear answer... Does anyone know why it is there and what it means? Would it matter if this was left out of the dictionary?

qq means a glottal stop, which is usually present before an initial vowel,
though some German phoneticians disagree on that. See the related discussion:

http://www.voxforge.org/home/forums/message-boards/general-discussion/dictionary-format

> Also, the unstressed e (the schwa) is represented by an @ symbol, which is not part of the standard Arpabet ( http://www.speech.cs.cmu.edu/cgi-bin/cmudict ). Is it necessary to keep the existing phoneme labels to use the acoustic model?

They are not Arpabet in any sense. The phones don't have to be part of
Arpabet; that is an unrelated thing. They are just phones specific to each
model. It's preferable to have ASCII-only, case-insensitive phone names, but
otherwise you can choose whatever names you like.

> If someone has already created a large German pronunciation dictionary and/or acoustic model, I would like to share it. Also, I can probably help train the model using some of the German audiobooks in my collection.

Ralf did, why don't you use his work.

Craig - 2010-09-10

Thanks for the answers...

I'd be happy to use Ralph's work but didn't find it in a format ready to be
used in Sphinx. Instead I found a huge XML file that needs reformatting to
remove all the XML tags. I'll search through the links above and the related
discussion and see if I can find a better link than the one I originally
posted.

Thanks for the info about the glottal stop - I had since found it on wiki:
http://en.wikipedia.org/wiki/Glottal_stop

It seems to me that the glottal stop would not stand alone as a phoneme very
well, since it is short and voiceless. Isn't it more a means of transitioning
between vowel sounds (as in the hyphen in "uh-oh"), or perhaps starting a
vowel abruptly? In other words it might modify the neighbouring phonemes more
than exist as a context-insensitive phoneme in its own right. My guess is
that dropping it would make little difference.

My question about the nonstandard phone labels (@ and so on) related to the
acoustic model downloaded from the link I posted. I know you can use whatever
labels you like in making a model, but I already have the model that came with
the German digits demo ( http://www.mediafire.com/?j1l9d0ujmgg ), along with a
tiny dictionary that uses the same labels. If I use that acoustic model, am I
stuck with the labels its creators chose, or is there some way of changing
them? If Ralph already has a better acoustic model and matching dictionary
then the question is somewhat academic, but I am still curious.

Cheers,

Craig.

Nickolay V. Shmyrev - 2010-09-11

> Instead I found a huge XML file that needs reformatting to remove all the XML tags. I'll

Yes, you need a simple script to convert

> If I use that acoustic model, am I stuck with the labels its creators chose, or is there some way of changing them?

You should use the same labels and build the dictionary the same way. That
means espeak2phones.pl from the archive:

http://www.repository.voxforge1.org/downloads/de/Archive/voxforge-de.tar.gz

But the advantage of voxforge is that you can easily retrain the model with
your own dictionary. There is no problem doing that.
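
To illustrate what "same labels" means in practice, here is a minimal sketch that lists every phone used in a dictionary but absent from the model's phone set. It assumes the dictionary format shown earlier in the thread (a word followed by space-separated phones) and compares names case-insensitively. `unknown_phones` is a hypothetical helper, and the phone set in the demo is a made-up subset, not the real model's phone list.

```python
def unknown_phones(dictionary_lines, model_phones):
    """Map each phone that the model lacks to the words that use it."""
    known = {p.lower() for p in model_phones}
    missing = {}
    for line in dictionary_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and words without a pronunciation
        word, phones = fields[0], fields[1:]
        for phone in phones:
            if phone.lower() not in known:
                words = missing.setdefault(phone, [])
                if word not in words:
                    words.append(word)
    return missing


# Two entries taken from the dictionary sample quoted in this thread.
sample = ["ab qq a p", "abbauen qq a b au @ n"]
# Hypothetical phone set that deliberately leaves out "qq".
phones = {"a", "p", "b", "au", "@", "n"}
print(unknown_phones(sample, phones))
```

An empty result means every pronunciation can be looked up in the model; anything else points to entries that were built with a different label set.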

Craig - 2010-09-11

"Ralf did, why don't you use his work."

"Yes, you need a simple script to convert "

So I need to write a program to convert a massive XML file into the right
format, run a Perl script to get access to the phoneme labels, etc... Sure,
it's all possible, but it's hardly set up for easy use. I'm part-way through
converting the XML file with my own Java program but it is so large my IDE can
barely handle it. Is there really no-one who has a decent, ready-to-use Sphinx
set-up for German?
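
For a file too large to open in an editor, a streaming parser avoids loading everything at once. The sketch below uses Python's standard `xml.etree.ElementTree.iterparse`. The element names `lexeme`, `grapheme` and `phoneme` are placeholders chosen for illustration; the real names in the XML file have to be checked and passed in. The pronunciations are copied as found, so mapping them to the phone labels of the acoustic model is a separate step that this sketch does not cover.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop an XML namespace prefix such as '{uri}lexeme'."""
    return tag.rsplit("}", 1)[-1]


def convert_lexicon(xml_source, out, entry_tag="lexeme",
                    word_tag="grapheme", pron_tag="phoneme"):
    """Stream entries from xml_source and write 'word pronunciation' lines.

    xml_source is a file name or binary file object, out a text file object.
    Returns the number of entries written.
    """
    count = 0
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if local_name(elem.tag) != entry_tag:
            continue
        word = pron = None
        for child in elem:
            name = local_name(child.tag)
            if name == word_tag and word is None:
                word = (child.text or "").strip()
            elif name == pron_tag and pron is None:
                pron = (child.text or "").strip()
        if word and pron:
            out.write(f"{word.lower()} {pron}\n")
            count += 1
        elem.clear()  # release this entry's text and children
    return count
```

Entries without both a word and a pronunciation are skipped, and only the first spelling and first pronunciation of each entry are kept; whether that is the right choice depends on how the source file encodes variants.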
