WER more than 100% when training an acoustic model for Russian

2016-08-10
2016-08-11
  • Dino The Dinosaur

    Hello!

    I'm in the process of learning the CMUSphinx toolkit and have run into a problem: the WER is above 100%. I understand how it can technically exceed 100%, but what could be the cause of such bad recognition?
    I used the Russian audio corpus from VoxForge (urp.tgz). It is, basically, an audio book cut into pieces. So my guess is that read written text, especially in Russian, is recognised very poorly because 1) Russian is "too rich" in the sense that words have many different forms, which look like completely different words to a computer; 2) the vocabulary of Russian written text is "too rich" as well, which compounds the previous point; 3) the audio book itself uses really complex and varied language; 4) the corpus is not big enough to cover all those difficulties.

    I ask because when sphinxtrain finishes its work, the results look quite impressive from my point of view, but my machine doesn't agree with me.
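
    As a side note on the technical reason a WER can exceed 100%: WER is usually computed as (substitutions + deletions + insertions) divided by the number of words in the reference, so a hypothesis with many inserted words can produce more errors than there are reference words. The sketch below is a minimal illustration of that formula using a word-level edit distance. The function name `word_error_rate` and the example sentences are made up for illustration; this is not the scoring script that sphinxtrain itself runs.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length.

    Illustrative helper, not part of CMUSphinx. Assumes a non-empty
    reference and whitespace-separated words.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] = minimum edits needed to turn the first i-1 reference
    # words into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]  # turning i reference words into nothing = i deletions
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Two reference words, five wrong hypothesis words:
# 2 substitutions + 3 insertions = 5 errors over 2 words.
wer = word_error_rate("hello world", "hell low were old now")
print(f"WER = {wer:.0%}")
```

    With a two-word reference and a five-word hypothesis that shares no words with it, the count is 5 errors over 2 reference words, so the ratio is above 1.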

    If you have come across a similar problem (preferably with Russian), could you please check my reasoning?

    Thank you in advance
    Olya

     
    • Nickolay V. Shmyrev

      To get help with accuracy, you need to provide the data you are trying to decode, along with all the information needed to reproduce your problem.

      It is not yet clear what the problem is.

       
      • Dino The Dinosaur

        Hello, Nickolay!

        Yeah, sure

        http://www.megafileupload.com/o6mx/urp.rar

         

        Last edit: Dino The Dinosaur 2016-08-10
        • Nickolay V. Shmyrev

          It is, basically, an audio book cut into pieces.

          urp is not an audiobook but a phonetically balanced speech database for TTS. It was designed to include "unusual" words.

          So my guess is that read written text, especially in Russian, is recognised very poorly because 1) Russian is "too rich" in the sense that words have many different forms, which look like completely different words to a computer; 2) the vocabulary of Russian written text is "too rich" as well, which compounds the previous point; 3) the audio book itself uses really complex and varied language; 4) the corpus is not big enough to cover all those difficulties.

          This is correct: you do not have enough data for either the acoustic model or the language model. And it is not specific to Russian; you have much less data than our tutorial requires. Of the 900 words in the test set, 460 are missing from the language model.
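
          The "460 of 900" figure is an out-of-vocabulary (OOV) count, and a similar check can be sketched in a few lines. The helper below is a hypothetical illustration, not a CMUSphinx tool: the name `oov_rate` and the toy word lists are assumptions, and in practice the vocabulary would be read from the language model and the test words from the test transcription. A word the decoder does not know cannot be output correctly, so the OOV rate gives a rough lower bound on the achievable WER.

```python
def oov_rate(test_words, vocabulary):
    """Return (fraction of test words not in the vocabulary, those words).

    Illustrative helper, not part of CMUSphinx. Assumes test_words is a
    non-empty list and that both inputs use the same normalisation
    (case, spelling).
    """
    vocab = set(vocabulary)
    missing = [word for word in test_words if word not in vocab]
    return len(missing) / len(test_words), missing


# Toy data, made up for illustration.
rate, missing = oov_rate(["a", "b", "c", "d"], ["a", "c"])
print(f"OOV rate: {rate:.1%}, missing: {missing}")

# The figures quoted in the thread: 460 of 900 test words are missing.
print(f"Thread figures: {460 / 900:.1%} of test words are out of vocabulary")
```

          With roughly half of the test words unknown to the language model, a very high WER is expected regardless of how well the acoustic model trained.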

           
          • Dino The Dinosaur

            Thank you for your answers!

             
