Hello!
I'm in the process of discovering the CMUSphinx toolkit and have come across a problem. I know why the word error rate can technically exceed 100%, but what could possibly be the cause of such bad recognition?
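For context, WER is (substitutions + deletions + insertions) divided by the number of words in the reference, so a decoder that inserts many extra words can score above 100%. Here is a minimal self-contained sketch of that arithmetic; it is only an illustration, not sphinxtrain's own scoring code:

```python
# Minimal WER sketch: Levenshtein distance over words.
# WER = (subs + dels + ins) / reference length, so insertions
# alone can push the rate past 100%.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# Two reference words, four wrong hypothesis words: 2 subs + 2 ins = 4 errors.
print(wer("привет мир", "один два три четыре"))  # prints 2.0, i.e. 200%
```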
I used the Russian audio corpus from VoxForge (urp.tgz). It is basically an audio book cut into pieces. So my guess is that read-aloud written text, especially in Russian, is recognised very poorly because:
1) Russian is "too rich" in the sense that words take many inflected forms, which look like totally different words to a computer;
2) the vocabulary of written Russian is "too rich" as well, which multiplies the previous point;
3) the audio book itself uses really complex and varied language;
4) the corpus is not big enough to cover all those difficulties.
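One rough way to see points 1) and 2) in action is to count distinct word forms against total words in a transcript: for a heavily inflected language the ratio stays high, so far more data is needed to cover the same vocabulary. This is just a sketch, and corpus.txt is a placeholder for any text file with one utterance per line:

```python
# Sketch: type/token ratio of a transcript as a crude measure of
# vocabulary richness. "corpus.txt" is a hypothetical input file.
from collections import Counter

tokens = []
with open("corpus.txt", encoding="utf-8") as f:
    for line in f:
        tokens.extend(line.lower().split())

types = Counter(tokens)
print(f"{len(tokens)} tokens, {len(types)} distinct word forms "
      f"(type/token ratio {len(types) / len(tokens):.2f})")
```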
I ask because when sphinxtrain finishes its work, the results look quite impressive from my point of view, but my machine doesn't agree with me.
If you have come across a similar problem (preferably with Russian), could you please check my reasoning?
Thank you in advance
Olya
In order to get help on accuracy, you need to provide the data you are trying to decode, along with all the information necessary to reproduce your problem.
It is not quite clear yet what the problem is.
Hello, Nickolay!
Yeah, sure
http://www.megafileupload.com/o6mx/urp.rar
urp is not an audio book but a phonetically balanced speech database for TTS; it was designed to include "unusual" words.
Your guess about the corpus size is correct: you do not have enough data for either the acoustic model or the language model. And this is not specific to Russian; you have much less data than our tutorial requires. Of the 900 words in the test set, 460 are missing from the language model.
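For anyone who wants to reproduce this kind of out-of-vocabulary check on their own setup, a rough sketch follows. The file names are placeholders: lm_vocab.txt would hold one word per line (e.g. extracted from the ARPA model's 1-gram section), and test.transcription is assumed to be in sphinxtrain's transcript format, where each line ends with an (utterance_id) tag:

```python
# Sketch of an OOV check: count distinct test-set words absent from
# the language model vocabulary. File names are placeholders.

with open("lm_vocab.txt", encoding="utf-8") as f:
    vocab = {line.strip().lower() for line in f if line.strip()}

test_words = set()
with open("test.transcription", encoding="utf-8") as f:
    for line in f:
        # Drop the trailing (utterance_id) tag and any <s>/</s> markers.
        for w in line.lower().split():
            if not w.startswith("(") and not w.startswith("<"):
                test_words.add(w)

oov = test_words - vocab
print(f"{len(oov)} of {len(test_words)} distinct test words "
      f"are missing from the LM")
```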
Thank you for your answers!