hi,
I listened to the sample dialog at
http://www.speech.cs.cmu.edu/letsgo/example.html
which is from the "LetsGo" project, based on Sphinx.
The recognition rate in that dialog is great, and I would like to train a system that can do a similar job.
However, when I tried open-source acoustic models like WSJ1 and HUB4, the results I got were very bad, with a WER of almost 100%. I'm not sure what I'm doing wrong here.
For audio input I use the standard microphone that comes with a headset.
Any guesses or advice are greatly appreciated.
Does anyone know a place where I can download such corpora (for call centers, etc.)?
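A quick note on the metric, since "almost 100% WER" sounds extreme: word error rate is computed by aligning the decoder's hypothesis against the reference transcript,

WER = (S + D + I) / N

where S, D, and I count the substituted, deleted, and inserted words in the alignment and N is the number of words in the reference. Insertions are counted too, so a badly mis-decoded file can even score above 100%.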
Upload a speech sample you want to recognize and we'll show you the options you need to use.
Were you able to recognize the simple commands we discussed before?
hi,
I tried the files you uploaded (dictionary, language model, etc.) with the audio files I sent you, but the result was very poor (100% WER). For this I used the "WSJ1 (dictation) acoustic models - for wideband (16kHz) microphone speech" acoustic model.
Please let me know if you got any good results with the audio files I already sent you. If those files can be recognized well, then I can go ahead.
thanks
hi,
Thank you for the help.
The following link has some samples I would like to recognize correctly:
http://rapidshare.com/files/103263009/samples.zip.html
(They are in WAV format.)
It would be best if the system could handle a large vocabulary, because my application needs to recognize a medium-to-large vocabulary.
thanks
Well, again they decode mostly fine; check my result here:
http://www.mediafire.com/?zkp03h9temd
I used the sphinx3 trunk and the new WSJ model available at:
http://www.speech.cs.cmu.edu/sphinx/models/wsj_jan2008/wsj_all_mllt_4000_20080104.tar.gz
The files are decoded mostly correctly, but there is one small problem: there are too many garbage "A" letters. The reason is simple: your files have been preprocessed somehow and contain long stretches of complete silence, which the decoder fails to handle. You have to add dither with -dither yes, but even dither doesn't help with subvector quantization.
Another problem with your files is that they have no initial silence. A file should start with around half a second of silence to be recognized correctly.
Summarizing the above: don't preprocess the files if you'd like to get good quality. Use the recordings as-is.
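If you do need to add that leading silence yourself, here is a minimal Python sketch that prepends about half a second of lightly dithered near-silence, so the decoder sees some leading silence and no frames of exactly zero energy. It assumes a 16-bit mono WAV; the file names are just placeholders.

import random
import wave

# Prepend ~0.5 s of near-silent, lightly dithered audio to a mono
# 16-bit WAV. File names below are placeholders.
IN_WAV, OUT_WAV = "utterance.wav", "utterance_padded.wav"

with wave.open(IN_WAV, "rb") as src:
    params = src.getparams()
    assert params.nchannels == 1 and params.sampwidth == 2
    audio = src.readframes(params.nframes)

pad = bytearray()
for _ in range(params.framerate // 2):   # 0.5 s worth of samples
    s = random.randint(-2, 2)            # +/- 2 LSB of dither noise
    pad += s.to_bytes(2, "little", signed=True)

with wave.open(OUT_WAV, "wb") as dst:
    dst.setparams(params)                # nframes is fixed up on close
    dst.writeframes(bytes(pad) + audio)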
When I ran it, it gave the following error:
INFO: Word Insertion Penalty =0.700000
INFO: Silence probability =0.100000
INFO: Filler probability =0.100000
INFO:
INFO: dict2pid.c(577): Building PID tables for dictionary
INFO: Initialization of dict2pid_t, report:
INFO: Dict2pid is in composite triphone mode
INFO: 267 composite states; 106 composite sseq
INFO:
INFO: kbcore.c(623): Inside kbcore: Verifying models consistency ......
FATAL_ERROR: "kbcore.c", line 628: Feature streamlen(1) != mgau streamlen(30)
I'm using the trunk.
Any idea what this could be?
I finally managed to run it and got the same result you did.
Thank you very much for the support.
I used the same setup to decode the following speech (just one word: "attention"), but it fails.
I recorded it with a fairly good-quality microphone.
Please have a look at the sample and let me know what you think.
Here is the link to the sample:
http://rapidshare.com/files/104087869/test.zip.html
Thank you.
Your file is 44.1 kHz stereo; convert it to 16 kHz mono and it will work fine.
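Any resampling tool can do this; if you have sox installed, something like "sox test.wav -r 16000 -c 1 test_16k.wav" should work. Alternatively, here is a minimal Python sketch using the standard-library wave and audioop modules (audioop is available up to Python 3.12; file names are placeholders).

import audioop
import wave

# Downmix a 44.1 kHz stereo 16-bit WAV to the 16 kHz mono format the
# wideband acoustic models expect. File names are placeholders.
IN_WAV, OUT_WAV = "test.wav", "test_16k_mono.wav"

with wave.open(IN_WAV, "rb") as src:
    assert src.getsampwidth() == 2
    rate = src.getframerate()
    frames = src.readframes(src.getnframes())
    if src.getnchannels() == 2:
        # Average the two channels into one.
        frames = audioop.tomono(frames, 2, 0.5, 0.5)

# Resample from the source rate (44100 Hz here) down to 16000 Hz.
frames, _ = audioop.ratecv(frames, 2, 1, rate, 16000, None)

with wave.open(OUT_WAV, "wb") as dst:
    dst.setnchannels(1)
    dst.setsampwidth(2)
    dst.setframerate(16000)
    dst.writeframes(frames)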
hi again,
I got good results with the language models you sent. However, with more general models that have a large number of words, the results I'm getting are quite poor, with almost 100% word error rates. Yet I've seen statistics for Sphinx-3 that show around 70% accuracy even with large vocabularies.
My plan is to use this in call centers, which have fairly large vocabularies, and the input audio will come from a telephone line. Since the results I'm getting are poor even with good-quality microphones, I'm not sure whether I can achieve what I want. It would be great if someone could point me to a case study or project where Sphinx-3 is used in a similar application.
> which have fairly large vocabularies, and the input audio will come from a telephone line.
Well, everything depends on the real vocabulary size; that's why we expect exact numbers from you. For a telephone line you can get around 95% accuracy with 2,000 words, which is actually enough for simple speech. Of course, if your vocabulary is 40,000 words, you can't expect more than 70% even from advanced commercial systems.
So first of all you must design your complete system: the vocabulary, the interaction flow, and so on. Later, if you run into recognition-rate problems, share your recordings and we'll try to optimize the decoder parameters.
Hello,
I can only say that 70% is very much an achievable target with Sphinx. I don't know enough about your system to tell what's wrong.
cheers
Nagendra