Hi. I am a newcomer trying to get Sphinx 4 to work. I have run it on my careful and clear reading of test.wav (I am an American native English speaker) for which the ground truth is:
this is the first interval of speaking
after the first moment of silence this is the second interval of speaking
after the third moment of silence this is the third interval of speaking and the last
and I got instead
this is the first interval of speaking
one of silence
the second interval of speaking
to the third moment of silence
thirty hunter was speaking and the last
for which the word error rate was 37%. (The word error rate was similar for the version of test.wav included with Sphinx.)
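For reference, word error rate is usually defined as the word-level edit distance between reference and hypothesis (substitutions + deletions + insertions), divided by the number of reference words. The post does not say which tool produced the 37% figure, so the following is only a minimal sketch under that standard definition. The class and method names are made up for illustration, and a different tokenization could give a slightly different number.

```java
final class WordErrorRate {
    /** Lower-cases and splits on whitespace; punctuation handling is left out of this sketch. */
    static String[] tokenize(String text) {
        String trimmed = text.trim().toLowerCase();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** Word-level Levenshtein distance; substitution, deletion and insertion each cost 1. */
    static int editDistance(String[] ref, String[] hyp) {
        int[] prev = new int[hyp.length + 1];
        int[] curr = new int[hyp.length + 1];
        for (int j = 0; j <= hyp.length; j++) {
            prev[j] = j; // turning an empty reference into j hypothesis words takes j insertions
        }
        for (int i = 1; i <= ref.length; i++) {
            curr[0] = i; // i deletions against an empty hypothesis
            for (int j = 1; j <= hyp.length; j++) {
                int sub = prev[j - 1] + (ref[i - 1].equals(hyp[j - 1]) ? 0 : 1);
                int del = prev[j] + 1;
                int ins = curr[j - 1] + 1;
                curr[j] = Math.min(sub, Math.min(del, ins));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[hyp.length];
    }

    /** WER = edit distance / number of reference words. */
    static double wer(String reference, String hypothesis) {
        String[] ref = tokenize(reference);
        String[] hyp = tokenize(hypothesis);
        if (ref.length == 0) {
            throw new IllegalArgumentException("reference must contain at least one word");
        }
        return (double) editDistance(ref, hyp) / ref.length;
    }

    public static void main(String[] args) {
        String reference = "after the first moment of silence this is the second interval of speaking";
        String hypothesis = "one of silence the second interval of speaking";
        System.out.printf("WER = %.1f%%%n", 100 * wer(reference, hypothesis));
    }
}
```

Because the denominator is the reference length, a hypothesis with many insertions can score above 100%.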
I have also run it on my careful and clear reading of Little Red Riding Hood, which begins:
Once upon a time there lived in a certain village a little country girl the prettiest creature who was ever seen
Her mother was excessively fond of her and her grandmother doted on her still more
This good woman had a little red riding hood made for her
It suited the girl so extremely well that everybody called her Little Red Riding Hood
One day her mother having made some cakes said to her
Go my dear and see how your grandmother is doing for
I hear she has been very ill
Take her a cake and this little pot of butter
Little Red Riding Hood set out immediately to go to her grandmother who lived in another village
As she was going through the wood she met with a wolf who had a very great mind to eat
and I got instead:
little country girl
freeze creature whose it is steve
her mother was excessively fond of her
in your grandmother going on are still more
the woman had a little red riding hood makes her
to to girls looks yummy well
everybody called her a little red riding
one day her mother having
goal mightier can see how are you from others doing
we're here she's been very ill
in this little pot luck
little red riding hood seventy three
go to her grandma
he lives in a little
go into the woods you know with the walls
for a great minds iraq
(The reference to "Iraq", while amusing, seemed a bit out of place in "Little Red Riding Hood.")
While these Sphinx transcripts are amusing, I'm tired of being amused and want instead a reasonable word error rate, rather than the 60% I got here.
Obviously I am doing something wrong, since Sphinx can't be so bad. But what is it?
Here are some technical details. Both WAV files were indeed mono, with 16-bit precision and a sampling rate of 16000 samples per second, as confirmed by SoX. (The files were recorded on a Windows laptop into .m4a files and converted to .wav format with FFmpeg.) The default acoustic model and default language model, as distributed with Sphinx, were used; see the actual Java code below for the details. The included jars were:
sphinx4-core-5prealpha-20160628.232526-10.jar
sphinx4-data-5prealpha-20160628.232535-10.jar
sphinx4-samples-5prealpha-20160628.232549-9.jar
I have appended below the actual Java code used to transcribe.
I can't believe Sphinx would get such poor word error rates on such
simple audio files. Please tell me what I am doing wrong. Thanks very much for your help.
It seems you are using very heavy compression, which corrupts the spectrum, and a not very good microphone. The default model is not going to recognize such speech accurately.
You could try something like a USB or Bluetooth microphone, together with a recorder that can record raw WAV files.
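Before decoding, it can help to confirm that a recording really has the format described in the original post (mono, 16-bit, 16000 samples per second). The following is a minimal sketch using the standard Java Sound API; the class name and the little-endian requirement are assumptions made for illustration, not something stated in the thread. Note that a matching header says nothing about whether the audio was lossily compressed before conversion.

```java
import java.io.File;

import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;

final class WavFormatCheck {
    /**
     * True if the format is signed 16-bit, mono, 16 kHz PCM.
     * Little-endian is assumed here because that is what WAV files normally use.
     */
    static boolean matchesDefaultModel(AudioFormat format) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
                && format.getChannels() == 1
                && format.getSampleSizeInBits() == 16
                && Math.round(format.getSampleRate()) == 16000
                && !format.isBigEndian();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java WavFormatCheck <file.wav>");
            return;
        }
        // Reads only the header; throws UnsupportedAudioFileException for unknown formats.
        AudioFormat format = AudioSystem.getAudioFileFormat(new File(args[0])).getFormat();
        System.out.println(format + " -> "
                + (matchesDefaultModel(format) ? "format matches" : "needs conversion"));
    }
}
```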
You said above that the default model is not going to recognize such speech (where the spectrum has been corrupted, perhaps by a compression or conversion algorithm, but the speech is still clear to a human ear). I have a few questions:
I understand that Sphinx4 only works with 16-bit, 16 kHz or 16-bit, 8 kHz raw WAV files. Is it possible to configure it, or to train an acoustic model, to work with 8-bit, 8 kHz raw WAV recordings? If not, what method or tool would you recommend to convert 8-bit, 8 kHz recordings to 16-bit, 8 kHz?
If I am only concerned with transcription accuracy and prepared to spend relatively large amounts of processing time and power, how should I configure sphinxtrain (and possibly the recognizer)? I assume that I'll need to train a "bigger" acoustic model and perhaps reconfigure the decoder to work with it.
(The questions above are more important for me, but this is a natural question which popped up in my head.) In your comment above, I assume that you meant the default acoustic model. How might one go about training an acoustic model that will work well with a diverse set of recordings that may or may not have been compressed and decompressed? Is it possible to do this even in principle, or do we need separate acoustic models for every different permutation?
Thanks for answering all the questions by everyone! (I'm a Sphinx4 novice too and I have been reading these forums for a few months and your answers have been tremendously helpful and have saved me from several weeks of headache.)
A.
EESEN-transcriber works reasonably well on my personal computer with a variety of sample files (WAV, MP3, M4A, etc). However:
My work requires transcription of 8-bit, 8 kHz recordings which have two, and sometimes more than two, people talking. The quality of these recordings is not great (and much worse than the samples I tested EESEN-transcriber on) but that's the form in which they are available to me. Unfortunately, I'm unable to share these recordings outside my workplace.
Relatedly, the computers on which I need to run the transcriber programs are Linux boxes that are NOT connected to the internet. Installing Sphinx4 on these machines is straightforward -- I only need to copy some Jar files -- but installing Kaldi or EESEN-transcriber is much more difficult, since they need to download some components from the internet. Hence my questions above about attempting to improve Sphinx's transcription accuracy.
Thanks again for your help, it is much appreciated.
A.
Here is the Java code. The transcripts I referred to above are the ones produced first by the code.
package edu.cmu.sphinx.demo.transcriber;

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;
import edu.cmu.sphinx.decoder.adaptation.Stats;
import edu.cmu.sphinx.decoder.adaptation.Transform;
import edu.cmu.sphinx.result.WordResult;

public class Transcriber {
    // ...
        while ((result = recognizer.getResult()) != null) {
            stats.collect(result);
        }
        recognizer.stopRecognition();
    }
You need to provide the audio
I've attached the one .wav file.
It seems you are using very heavy compression, which corrupts the spectrum, and a not very good microphone. The default model is not going to recognize such speech accurately.
You could try something like a USB or Bluetooth microphone, together with a recorder that can record raw WAV files.
Hi Nickolay,
You said above that the default model is not going to recognize such speech (where the spectrum has been corrupted, perhaps by a compression or conversion algorithm, but the speech is still clear to a human ear). I have a few questions:
Thanks for answering all the questions by everyone! (I'm a Sphinx4 novice too and I have been reading these forums for a few months and your answers have been tremendously helpful and have saved me from several weeks of headache.)
A.
An 8-bit representation is rarely used for PCM data; it loses too much information. Most likely you meant to decode some compressed format, which you can simply decompress to PCM first.
If you want accuracy, you can use Kaldi DNN models, with something like https://github.com/alumae/kaldi-offline-transcriber
English models are available here: https://github.com/srvk/eesen-transcriber
The state-of-the-art solution is to train the system on as large and diverse a data set as possible; in practice this means training on up to 10000 hours of speech data. You can track the effort here:
https://github.com/kaldi-asr/kaldi/issues/870
Last edit: Nickolay V. Shmyrev 2016-08-13
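On decompressing to PCM first: if the 8-bit, 8 kHz recordings turn out to be G.711 μ-law, each byte expands to one 16-bit linear sample. That is an assumption, since the thread never says which codec is involved, and A-law or linear 8-bit PCM would need different handling. The following is a minimal sketch of the standard μ-law expansion; the class and method names are made up for illustration.

```java
final class MuLawDecoder {
    // Bias added by the mu-law encoder before taking the logarithmic segment.
    private static final int BIAS = 0x84;

    /** Expands one G.711 mu-law byte to a signed 16-bit linear PCM sample. */
    static short decode(byte muLaw) {
        int u = ~muLaw & 0xFF;            // mu-law bytes are stored bit-inverted
        int exponent = (u >> 4) & 0x07;   // 3-bit segment number
        int mantissa = u & 0x0F;          // 4-bit position within the segment
        int magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS;
        return (short) ((u & 0x80) != 0 ? -magnitude : magnitude);
    }

    /** Decodes a whole buffer of mu-law bytes. */
    static short[] decodeAll(byte[] muLaw) {
        short[] pcm = new short[muLaw.length];
        for (int i = 0; i < muLaw.length; i++) {
            pcm[i] = decode(muLaw[i]);
        }
        return pcm;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xFF, (byte) 0x80, (byte) 0x00};
        for (short s : decodeAll(sample)) {
            System.out.println(s);
        }
    }
}
```

The expansion only changes the representation. It does not restore anything the codec or the 8 kHz sampling rate removed, so a model trained on 8 kHz audio would still be the one to pair with such data.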
Thanks Nickolay, that was a big help.
EESEN-transcriber works reasonably well on my personal computer with a variety of sample files (WAV, MP3, M4A, etc). However:
Thanks again for your help, it is much appreciated.
A.
Accuracy problems are much more complicated than installation problems.
and one more
Thank you, Nickolay. I was using the default "Voice Recorder" under Windows 10, which does not seem to allow direct recording into .wav format. I will see what I can do.
Hi, again, Nickolay!
Thanks very much for the suggestions. I rerecorded test.wav and
Little Red Riding Hood using Audacity directly to .wav format with 1
channel, 16kHz sampling rate, and 16-bit precision. (Actually, Audacity recorded to 32-bit FLOAT and then converted to 16-bit integers; I believe that conversion shouldn't have caused Sphinx any problems.) The performance on test.wav
improved from 37% to 14%, which I was happy about. However, for
Little Red Riding Hood, the WER only improved from 60% to 48%. (I've
attached one Little Red Riding Hood .wav file.) A sample of the ground truth and Sphinx output appears below. All the files were
still recorded using my laptop's built-in microphone, not a USB
microphone, as you'd suggested. Do you think the 48% WER could simply
be due to the quality of my laptop's microphone? I'm afraid I'm doing
something else wrong.
The correct transcript begins:
Once upon a time there lived in a certain village a little country
girl the prettiest creature who was ever seen
Her mother was excessively fond of her and her grandmother doted on
her still more
This good woman had a little red riding hood made for her
It suited the girl so extremely well that everybody called her Little
Red Riding Hood
One day her mother having made some cakes said to her
Go my dear and see how your grandmother is doing for
I hear she has been very ill
Take her a cake and this little pot of butter
Little Red Riding Hood set out immediately to go to her grandmother
who lived in another village
As she was going through the wood she met with a wolf who had a very
great mind to eat her up but he dared not because of some woodcutters
working nearby in the forest
The Sphinx transcript begins:
two
edit
what's our lives by their lives in a city village a little country
girl
praise creature who was ever seen
her mother was excessively fond of her
and your grandmother build on hers
more
ms good woman had a little red riding hood
need for her
this is the girl so it's really well
everybody called her little red riding hood
one day her mother having a suitcase
said to her
going here and see how you from others doing
for eight years she's been very ill
take her case in this little car water
little red riding hood seldom used to go to rome other
who lives in the village
ash is going through the woods
you met with a wolf
what a very great mind to europe
but he did not because it's somewhat cons
can you buy the farms
I'm eager to hear your opinion, Nickolay! Thanks again.
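On the 32-bit float to 16-bit conversion mentioned in this post: assuming the exporter simply clips to [-1, 1], scales and rounds, the arithmetic is small enough to write out. This is an assumption; Audacity's actual export may scale differently or add dither. The class name below is made up for illustration.

```java
final class FloatToPcm16 {
    /** Clips to [-1, 1], scales to the 16-bit range and rounds to the nearest integer. */
    static short toPcm16(float sample) {
        float clipped = Math.max(-1.0f, Math.min(1.0f, sample));
        return (short) Math.round(clipped * 32767.0f);
    }

    /** Converts a whole buffer of float samples. */
    static short[] toPcm16(float[] samples) {
        short[] pcm = new short[samples.length];
        for (int i = 0; i < samples.length; i++) {
            pcm[i] = toPcm16(samples[i]);
        }
        return pcm;
    }

    public static void main(String[] args) {
        float[] samples = {0.0f, 0.25f, -0.25f, 1.5f};
        for (short s : toPcm16(samples)) {
            System.out.println(s);
        }
    }
}
```

Under that assumption the rounding error is at most half a quantization step, which is tiny next to microphone noise, so this step is unlikely to explain a 48% WER.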
Hello New SourceForge User
I checked your file. There is nothing suspicious. Of course your microphone could be better, but overall it is the quality I would expect. Your microphone cuts frequencies above 4 kHz, so the 8 kHz model
https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English/cmusphinx-en-us-ptm-8khz-5.2.tar.gz/download
should be slightly better for you.
You can run acoustic model adaptation as described in our tutorial to get better accuracy in your case:
http://cmusphinx.sourceforge.net/wiki/tutorialadapt
As I wrote above, modern DNN recognizers like Kaldi should give you much better accuracy. You can try the offline transcriber I linked above.
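One rough way to check a remark like "the microphone cuts frequencies above 4 kHz" on your own recordings is to measure how much of a frame's energy lies above the cutoff. The following is only an illustrative sketch, not the method used in the reply above: it runs a naive DFT over one short frame, and the class name, frame length and test tones are all assumptions.

```java
final class HighBandEnergy {
    /**
     * Fraction of the frame's energy in DFT bins at or above cutoffHz.
     * Uses a naive O(n^2) DFT, which is fine for short frames; the DC bin is skipped.
     */
    static double fractionAbove(double[] samples, double sampleRate, double cutoffHz) {
        int n = samples.length;
        double total = 0.0;
        double high = 0.0;
        for (int k = 1; k <= n / 2; k++) {
            double re = 0.0;
            double im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += samples[t] * Math.cos(angle);
                im -= samples[t] * Math.sin(angle);
            }
            double power = re * re + im * im;
            total += power;
            if (k * sampleRate / n >= cutoffHz) {
                high += power;
            }
        }
        return total == 0.0 ? 0.0 : high / total;
    }

    /** Generates a pure tone, used here only as test input. */
    static double[] sine(double freqHz, double sampleRate, int n) {
        double[] out = new double[n];
        for (int t = 0; t < n; t++) {
            out[t] = Math.sin(2.0 * Math.PI * freqHz * t / sampleRate);
        }
        return out;
    }

    public static void main(String[] args) {
        double rate = 16000.0;
        System.out.println(fractionAbove(sine(1000.0, rate, 512), rate, 4000.0));
        System.out.println(fractionAbove(sine(6000.0, rate, 512), rate, 4000.0));
    }
}
```

If real speech frames consistently show a fraction near zero above 4 kHz, the upper half of a 16 kHz recording carries little information, which is consistent with the suggestion to try the 8 kHz model.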
Hi there!
I have the same problem as New SourceForge User. I have followed the CMU Sphinx tutorial exactly, as well as this YouTube tutorial: https://www.youtube.com/watch?v=d_tYW-4i3nE&ab_channel=ForgottenLegends
The transcriptions are inaccurate and I am unsure why. Even in the YouTube video, the transcriptions do not match what can be heard in the audio.
I am trying to transcribe WAV files in isiXhosa, and the transcriptions were inaccurate for them too. My acoustic model had a 30% WER, and the accuracy was perfect when using pocketsphinx from Python, but it does not seem to work with Sphinx4 in Eclipse. That is why I tried the basic US English model, and that too transcribed inaccurately.
Any help would be much appreciated!