German Voxforge acoustic model first release

2008-02-20
2012-09-22
  • Nickolay V. Shmyrev

    Hi all. We are happy to announce the first open-source German acoustic model, based on a free GPL speech corpus. You can download it from VoxForge as usual:

    http://voxforge.org/home/downloads

    The model was trained on 5 hours of speech, unfortunately from a small number of speakers. So please help improve it: submit your speech to VoxForge.

     
    • DefRay

      DefRay - 2008-03-31

      I tried the new model, mostly in the S4 HelloNGram demo.
      For digits it works great, but free speech is still pretty bad, though I could improve accuracy a lot by creating a .lm from the full transcription file (3000 sentences).
      One of the main problems I encounter is still the missing word problem I posted above, with the German umlauts.
      I guess it's a problem in the getWord() method of the FastDictionary component, though it's probably fixed by now in SVN? (I'm still using the original sphinx4-1.0beta package).

      I hope some other Germans and I will find time to contribute to the VoxForge project, as the model is currently heavily overtrained on Ralf's files.

      Is there a possibility of getting all the transcripts (all the prompts.txt files) in the repository without downloading the audio files or going through every submission in the "listen" section?
      We could then at least create a probably pretty decent language model for free speech ...
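
A minimal sketch of the idea above, assuming each prompts.txt line has the form "<utterance-id> <text>" (an assumption about the layout, not something confirmed in this thread). The function name is hypothetical. It strips the IDs and wraps each sentence in <s> ... </s>, a form that language-model tools commonly accept as input:

```python
def prompts_to_corpus(prompt_lines):
    """Turn 'uttid text...' prompt lines into '<s> text </s>' corpus lines.

    Assumption: the first whitespace-separated field is the utterance ID
    and the rest of the line is the prompt text.
    """
    corpus = []
    for line in prompt_lines:
        parts = line.strip().split(None, 1)
        if len(parts) < 2:
            # Skip empty lines and lines that carry an ID but no text.
            continue
        # Lowercase so the words match a lowercase pronunciation dictionary.
        corpus.append("<s> " + parts[1].lower() + " </s>")
    return corpus


if __name__ == "__main__":
    for sentence in prompts_to_corpus(["de91-75 Fünf Bücher", "de91-76"]):
        print(sentence)
```

Repeated prompts are kept on purpose here; whether to de-duplicate them before counting n-grams is a separate choice.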

       
      • DefRay

        DefRay - 2008-03-31

        regarding the missing word error:
        All the words containing those umlauts are correct in the dictionary, transcription and language model.
        There seems to be a problem loading them from the dictionary.
        If you're not encountering this problem, I should probably upgrade my sphinx4 to a newer SVN version.

        The mentioned problem is severe for recognition because in Ralf's samples there are a lot of words with umlauts.
        When decoding with S3 there is apparently no such problem: the digit "fünf", which contains an umlaut, is recognized correctly.

         
    • Holger Brandl

      Holger Brandl - 2008-02-20

      Hi Nickolay,

      cool! Would it be possible to provide a script along with the files which creates a s4-AcousticModel-jar?

      -Holger

       
      • Nickolay V. Shmyrev

        Sure, I'll upload a jar too. Right now I just need help from someone who knows the language to fix a rather significant number of BW (Baum-Welch) errors during training.

         
        • DefRay

          DefRay - 2008-03-13

          Well, I could probably help you a little.
          I'm a German student and I've been waiting for a German acoustic model for some time now.
          I'll try out the one from Voxforge, and if I can be of some (not too time-consuming) help, feel free to ask me anything.

           
          • Nickolay V. Shmyrev

            Thanks! Your help is really appreciated. We have to share the audio first: it's available for download from VoxForge right now, but in a rather unpleasant way (you have to use wget). I would be really happy if someone could review the dictionary and prompts. Training produces a lot of bad-alignment errors, so someone who knows German should check the transcriptions.

            About JSGF and the missing phones: it should work. The model has a test subfolder with a test script for sphinx3, so the results should be fairly easy to reproduce. I'll try to upload a jar too so we can compare results.

             
            • DefRay

              DefRay - 2008-03-14

              Could you post some links to the files I have to download? I can't really find it on their website? Or should I download everything in the speech corpus from the "listen" section?

              The phone set seems pretty good, as does the pronunciation in the dictionary, but checking all 3200+ entries individually could take some time. Could someone explain to me why the phone "qq" is used in front of every word beginning with a vowel?
              I could check the transcript for errors, but where can I find the prompts to download?

              If you have a link to a .jar version of the acoustic model for S4, I could test it with a few demos. Mine seems to be created wrong, at least flatLinguist complains as stated above, maybe I should create one with the cd-8gau-files?

               
              • Nickolay V. Shmyrev

                > Could you post some links to the files I have to download? I can't really find it on their website? Or should I download everything in the speech corpus from the "listen" section?

                Yeah, currently they are only available in Listen. You can get them with wget, for example. See the discussion at the bottom of the following thread:

                http://voxforge.org/home/forums/other-languages/german/localizing-the-speechsubmission-app-to-german?pn=2

                > The phone set seems pretty good, as does the pronunciation in the dictionary, but checking all 3200+ entries individually could take some time. Could someone explain to me why the phone "qq" is used in front of every word beginning with a vowel?
                > I could check the transcript for errors, but where can I find the prompts to download?

                I understand. The problem is that, because of the BOMP restrictions, the dictionary is generated from espeak rules plus a small Perl script for phone mapping. The qq at the beginning of words is probably not really needed. I suspect the dictionary is not correct, since around 300 prompts are rejected, as you can see in the log.

                > If you have a link to a .jar version of the acoustic model for S4, I could test it with a few demos. Mine seems to be created wrong, at least flatLinguist complains as stated above, maybe I should create one with the cd-8gau-files?

                I'll try to prepare it tomorrow.

                 
              • Nickolay V. Shmyrev

                Ok, here is an example for you to test sphinx4:

                http://www.mediafire.com/?j1l9d0ujmgg

                 
                • DefRay

                  DefRay - 2008-03-17

                  Thanks.
                  I did some testing with it.
                  Decoding digits with the WavFile or the HelloDigits demo and a JSGF grammar works very well.
                  With a language model created from the transcript file and decoding with the HelloNGram demo, digits work pretty well. But recognition of random speech is really bad (almost completely random results).
                  Was the acoustic model created from all the files in the German "Listen" section on voxforge.org?
                  Because even decoding audio files by Ralf Herzog (which should have been in the training data, I guess) returns really bad results.
                  So I don't know what's wrong, but a model created from all those sentences in the German corpus should decode better ...

                  Also, I don't know if it's a problem on my side, but when starting the HelloNGram demo, it complains about all the words with "umlaute", like äöü, though they are represented correctly in the dictionary and the language model.
                  Here's a little output:

                  02:04.181 WARNING dictionary Missing word: bÃ¼cher
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.181 WARNING dictionary Missing word: verzÃ¶gert
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.195 WARNING dictionary Missing word: hartnÃ¤ckig
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.196 WARNING dictionary Missing word: abstÃ¤nden
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.196 WARNING dictionary Missing word: fÃ¼rsorglichkeit
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.196 WARNING dictionary Missing word: gefÃ¤hrlich
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.196 WARNING dictionary Missing word: verhÃ¤ltnis
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.196 WARNING dictionary Missing word: dÃ¼rfen
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.196 WARNING dictionary Missing word: kÃ¼chen
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.196 WARNING dictionary Missing word: zusÃ¤tzliche
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.196 WARNING dictionary Missing word: wÃ¤hrungspolitik
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
                  02:04.196 WARNING dictionary Missing word: gÃ¤ngigsten
                  in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary

                  Is FastDictionary having problems with those (notice the wrong display of the letters) or is it something I have set wrong ?
                  The WavFile and HelloDigits demos didn't complain about the digit "fünf" for example.
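
One possible explanation for warnings like these (an assumption; nothing in this thread confirms it) is a charset mismatch: the dictionary file is stored as UTF-8 but read with a single-byte charset such as Latin-1, so the loaded keys no longer equal the words requested by the language model. The sketch below reproduces that effect in Python, with a hypothetical helper and two entries borrowed from the dictionary sample posted later in this thread:

```python
def load_words(raw_bytes, encoding):
    """Decode dictionary bytes and return the set of head words."""
    words = set()
    for line in raw_bytes.decode(encoding).splitlines():
        fields = line.split()
        if fields:
            words.add(fields[0])
    return words


# Two dictionary entries, stored on disk as UTF-8.
raw = "abfälle qq a p f ee l @\naber qq aa: b ei\n".encode("utf-8")

utf8_words = load_words(raw, "utf-8")
latin1_words = load_words(raw, "latin-1")

print("abfälle" in utf8_words)    # True
print("abfälle" in latin1_words)  # False: the key was stored as 'abfÃ¤lle'
print("aber" in latin1_words)     # True: pure-ASCII words are unaffected
```

That pattern would match the symptom described here: only words with umlauts go missing, while ASCII-only words load fine.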

                   
                • DefRay

                  DefRay - 2008-03-17

                  BTW,
                  what problems in the training run do you mean?
                  Looking at the html file, your last training run completed successfully.
                  Do you mean the "final state not reached" errors in the cd-training logs?
                  I don't know exactly what those mean. Is it that there's something wrong with the audio file to the corresponding utterance in the transcript?
                  If every one of those utterances is ignored, it's of course a big problem ... would it be hard to force-align them, so they're used in BW training anyway?
                  I listened to some of the problematic sound files, and they were perfectly ok, no cut-offs or anything. I don't know why the error occurs for them as opposed to the other files.

                  Also, why were only a few of Ralf Herzog's submissions used for training (de91 through de120)?
                  Would the model be overtrained otherwise ?

                   
                  • Nickolay V. Shmyrev

                    About the errors: yes, I meant the "final state not reached" errors in the CD logs. They mostly mean that the transcription is out of sync with the audio. The problem is probably not in this particular place, but it signals a problem in the transcription and/or dictionary. That's why I'm asking for a review here. I think we need to start with a smaller set of prompts and find the problem in our dictionary. Then we can double the data and keep an eye on dropped prompts. The final state should always be reached. Forced alignment is a good idea indeed, I'll try it too.

                    Overtraining is a problem. I told Ralf several times that we don't need so many recordings from him. But since there is no more data, I think we'll train on the existing audio. There are some other speakers, so I hope it will be OK for a start. Also, we'll probably use Ralf's voice for a TTS database, so his work is at least useful. I trained on only some of his recordings simply because I hadn't downloaded the rest. We are still in the process of transferring the data to the repository.

                    About the problems with generic recognition: could you please submit a sample, as I did? I'll take a look and check what's wrong there.

                     
                    • DefRay

                      DefRay - 2008-03-26

                      I checked the transcript and dictionary for spelling. Everything's all right there.
                      I also checked some of the problematic sound files (those mentioned in the training logs) with audacity, they are completely normal, like any of the other files.
                      I really don't know what's wrong here ...

                      Also, I still don't know why I'm getting those errors with the special German characters (like posted above).
                      I tried creating an LM from the transcript and it worked, but still the decoding is pretty bad, probably because of all the "missing" words from the dictionary and the skipped utterances in training.

                      If you can give me a hint in the right direction, please do so.

                       
                      • Nickolay V. Shmyrev

                        Ah, I've found the problem:

                        Uttid mismatch: ctlfile = "de100-75"; transcript = "de91-75"

                        The .fileids file is simply out of sync with the transcript file. I'll retrain the model and upload the audio this weekend.
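
A mismatch like this can be caught before training with a small consistency check. The sketch below is only an illustration with hypothetical function names. It assumes that each .fileids line is a path whose last component is the utterance ID, and that each transcript line ends with the utterance ID in parentheses:

```python
import re


def utt_id_from_transcript(line):
    """Return the ID in the trailing '(...)' of a transcript line, or None."""
    match = re.search(r"\(([^()\s]+)\)\s*$", line)
    return match.group(1) if match else None


def find_mismatches(fileids, transcripts):
    """Compare both lists line by line; return (line_no, ctl_id, transcript_id)."""
    mismatches = []
    for line_no, (ctl, trans) in enumerate(zip(fileids, transcripts), start=1):
        ctl_id = ctl.strip().split("/")[-1]
        trans_id = utt_id_from_transcript(trans)
        if ctl_id != trans_id:
            mismatches.append((line_no, ctl_id, trans_id))
    return mismatches


if __name__ == "__main__":
    fileids = ["de91/de91-74", "de100/de100-75"]
    transcripts = ["<s> eins </s> (de91-74)", "<s> zwei </s> (de91-75)"]
    for line_no, ctl_id, trans_id in find_mismatches(fileids, transcripts):
        print(f"line {line_no}: ctlfile = {ctl_id!r}; transcript = {trans_id!r}")
```

Note that zip() stops at the shorter list, so a difference in line counts would need a separate check.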

                         
        • DefRay

          DefRay - 2008-03-13

          I just encountered a problem, I don't know if it's my mistake:
          I created a .jar S4 Model to use the German Voxforge Model for testing the HelloDigits demo.
          I also created a .lm with the LMTool.

          I tried running the demo, both with an lmGrammar or a jsgfGrammar, but flatLinguist always complains about missing HMM for certain phones, though the phones are in the .phone file and in the dictionary and the words in the jsgfGrammar are all in the dictionary and transcript ...

          I used the ci_cont models; should I use the CD ones instead?
          flatLinguist can't find the phone "qq" when using a lmGrammar for the demo and the phone "v" when using a jsgfGrammar.
          For the jsgf I only used German digits, so the phone "v" is only in the word "zwei".

          Have I done something wrong somewhere? It seems like my acoustic model is broken.
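
One way to narrow down a "missing HMM" complaint is to list every phone the dictionary uses that the model's phone list does not contain. The following is a rough sketch, not part of sphinx4: the function name is hypothetical, the phone set is passed in as a plain Python set, and the pronunciation given for "zwei" is made up for illustration.

```python
def missing_phones(dictionary_lines, model_phones):
    """Map each phone absent from model_phones to the words that use it."""
    missing = {}
    for line in dictionary_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        word, phones = fields[0], fields[1:]
        for phone in phones:
            if phone not in model_phones:
                missing.setdefault(phone, []).append(word)
    return missing


if __name__ == "__main__":
    # Hypothetical phone list that lacks both "qq" and "v".
    phones_in_model = {"a", "p", "t", "s", "ai"}
    entries = ["ab qq a p", "zwei t s v ai"]
    print(missing_phones(entries, phones_in_model))
```

If this reports nothing while the linguist still complains, the dictionary is probably fine and the packaging of the model itself would be the next thing to check.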

           
  • Craig

    Craig - 2010-09-07

    Hi people,

    I came across this thread while searching for information about German speech recognition with Sphinx. I have downloaded the German acoustic model, dictionary and a simple demo from:

    http://www.mediafire.com/?j1l9d0ujmgg

    The digits example works well, but the dictionary has only a small number of words and was missing a majority of the words in a portion of sample text. I am about to extend the dictionary, and intend to try both a translation from IPA to Arpabet (using Ralf's German dictionary, http://spirit.blau.in/simon/2009/10/24/ralfs-german-dictionary-version-016/ ) and also a direct phonetic interpretation of the spelling. This is probably more achievable in German than it would be in English, as German pronunciation is fairly predictable.

    The current German dictionary from the above link is in a slightly odd format,
    however. Here is a sample:

    ab qq a p
    abbauen qq a b au @ n
    aber qq aa: b ei
    abfälle qq a p f ee l @
    abgaben qq a p g aa: b @ n
    abgebaut qq a p g @ b au t
    abgeben qq a p g e: b @ n
    abgebrochen qq a p g @ b r oo x @ n
    abgedeckt qq a p g @ d ee k t
    abgefallen qq a p g @ f a l @ n
    abgefunden qq a p g @ f uu n d @ n
    abgehalten qq a p g @ h a l t @ n

    I noted that someone up above asked about the double-q phoneme that appears before initial vowels. There was no clear answer... Does anyone know why it is there and what it means? Would it matter if this was left out of the dictionary? Also, the unstressed e (the schwa) is represented by an @ symbol, which is not part of the standard Arpabet ( http://www.speech.cs.cmu.edu/cgi-bin/cmudict ). Is it necessary to keep the existing phoneme labels to use the acoustic model?

    If someone has already created a large German pronunciation dictionary and/or acoustic model, I would like to share it. Also, I can probably help train the model using some of the German audiobooks in my collection.

    Cheers,

    Craig.

     
  • Nickolay V. Shmyrev

    Hello

    > I am about to extend the dictionary, and intend to try both a translation from IPA to Arpabet (using Ralf's German dictionary, http://spirit.blau.in/simon/2009/10/24/ralfs-german-dictionary-version-016/ ) and also a direct phonetic interpretation of the spelling. This is probably more achievable in German than it would be in English, as German pronunciation is fairly predictable.

    The existing dictionary was built with the espeak TTS and the script in the root of the acoustic model package, so you can easily extend it to any vocabulary you like. There is also Ralf's work, and there are commercial alternatives (BOMP, for example) which could be better in some situations. Anyway, you have many choices here and can pick the best one. For VoxForge any consistent dictionary will do, and it would be very nice to retrain the German model with it.

    The existing dictionary can be extended manually or with G2P software. I would also recommend plugging in a TTS engine like OpenMARY, which can generate pronunciations for words outside the fixed vocabulary of the dictionary. This will also solve the tokenization issue of converting numbers and abbreviations into textual form.

    > I noted that someone up above asked about the double-q phoneme that appears before initial vowels. There was no clear answer... Does anyone know why it is there and what it means? Would it matter if this was left out of the dictionary?

    qq means a glottal stop, which is usually present before an initial vowel, though some German phoneticians disagree on that. See the related discussion:

    http://www.voxforge.org/home/forums/message-boards/general-discussion/dictionary-format

    > Also, the unstressed e (the schwa) is represented by an @ symbol, which is not part of the standard Arpabet ( http://www.speech.cs.cmu.edu/cgi-bin/cmudict ). Is it necessary to keep the existing phoneme labels to use the acoustic model?

    They are not Arpabet in any sense. Phones don't have to be part of Arpabet; that's an unrelated thing. They are just phone names specific to each model. ASCII-only, case-insensitive phone names are preferred, but otherwise you can choose whatever names you like.

    > If someone has already created a large German pronunciation dictionary and/or acoustic model, I would like to share it. Also, I can probably help train the model using some of the German audiobooks in my collection.

    Ralf did, why don't you use his work.

     
  • Craig

    Craig - 2010-09-10

    Thanks for the answers...

    I'd be happy to use Ralf's work but didn't find it in a format ready to be
    used in Sphinx. Instead I found a huge XML file that needs reformatting to
    remove all the XML tags. I'll search through the links above and the related
    discussion and see if I can find a better link than the one I originally
    posted.

    Thanks for the info about the glottal stop - I had since found it on Wikipedia:
    http://en.wikipedia.org/wiki/Glottal_stop

    It seems to me that the glottal stop would not stand alone as a phoneme very
    well, since it is short and voiceless. Isn't it more a means of transitioning
    between vowel sounds (as in the hyphen in "uh-oh"), or perhaps starting a
    vowel abruptly? In other words it might modify the neighbouring phonemes more
    than exist as a context-insensitive phoneme in its own right. My guess is
    that dropping it would make little difference.

    My question about the nonstandard phone labels (@ and so on) related to the
    acoustic model downloaded from the link I posted. I know you can use whatever
    labels you like in making a model, but I already have the model that came with
    the German digits demo ( http://www.mediafire.com/?j1l9d0ujmgg ), along with a
    tiny dictionary that uses the same labels. If I use that acoustic model, am I
    stuck with the labels its creators chose, or is there some way of changing
    them? If Ralf already has a better acoustic model and
    matching dictionary then the question is somewhat academic, but I am still
    curious.

    Cheers,

    Craig.

     
  • Nickolay V. Shmyrev

    > Instead I found a huge XML file that needs reformatting to remove all the XML tags. I'll

    Yes, you need a simple script to convert it.

    > If I use that acoustic model, am I stuck with the labels its creators chose, or is there some way of changing them?

    You should use the same labels and the same method to build the dictionary. That's espeak2phones.pl from the archive:

    http://www.repository.voxforge1.org/downloads/de/Archive/voxforge-de.tar.gz

    But the advantage of voxforge is that you can easily retrain the model with
    your own dictionary. There is no problem doing that.
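
To illustrate what "use the same labels" means in practice, here is a small sketch of relabeling a dictionary entry from some other phone set into the labels an existing model expects. It is not espeak2phones.pl; the function name and the mapping table are hypothetical examples. Phones with no known mapping raise an error instead of passing through silently, since an unknown label would fail at decoding time anyway.

```python
def relabel_entry(line, phone_map):
    """Rewrite 'word p1 p2 ...' using phone_map; fail on unmapped phones."""
    fields = line.split()
    word, phones = fields[0], fields[1:]
    unknown = [phone for phone in phones if phone not in phone_map]
    if unknown:
        raise KeyError(f"no mapping for phones {unknown} in entry {word!r}")
    return " ".join([word] + [phone_map[phone] for phone in phones])


if __name__ == "__main__":
    # Hypothetical mapping from another label set to this model's labels.
    example_map = {"AA": "aa:", "B": "b", "AX": "@"}
    print(relabel_entry("aber AA B AX", example_map))
```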

     
  • Craig

    Craig - 2010-09-11

    "Ralf did, why don't you use his work."

    "Yes, you need a simple script to convert "

    So I need to write a program to convert a massive XML file into the right
    format, run a Perl script to get access to the phoneme labels, etc... Sure,
    it's all possible, but it's hardly set up for easy use. I'm part-way through
    converting the XML file with my own Java program but it is so large my IDE can
    barely handle it. Is there really no-one who has a decent, ready-to-use Sphinx
    set-up for German?
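
For a file too large to open in an IDE, a streaming parser avoids loading the whole document at once. The sketch below uses Python's xml.etree.ElementTree.iterparse. The element names (lexeme, grapheme, phoneme) follow the W3C Pronunciation Lexicon Specification and are an assumption here, since the actual layout of the XML file is not shown in this thread; they can be changed through the keyword arguments.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_entries(source, entry_tag="lexeme", word_tag="grapheme",
                 pron_tag="phoneme"):
    """Yield (word, pronunciation) pairs while streaming through the XML."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != entry_tag:
            continue
        word = pron = None
        for child in elem:
            name = local_name(child.tag)
            if name == word_tag and word is None:
                word = (child.text or "").strip()
            elif name == pron_tag and pron is None:
                pron = (child.text or "").strip()
        if word and pron:
            yield word, pron
        # Drop the entry's children so finished entries do not pile up.
        elem.clear()


if __name__ == "__main__":
    sample = io.BytesIO(
        b"<lexicon><lexeme><grapheme>ab</grapheme>"
        b"<phoneme>ap</phoneme></lexeme></lexicon>"
    )
    for word, pron in iter_entries(sample):
        print(word, pron)
```

The pronunciation strings would still need to be split into phones and mapped to the acoustic model's labels before the result is usable as a Sphinx dictionary.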

     
