Hi all,
We have adapted the WavFile demo to transcribe WAV files that contain more than just digits, but our recognition accuracy seems poor, and we thought maybe you folks might see something obvious we could be doing to improve things.
Some Background:
We are doing a small experiment to see if we can transcribe audio-recorded interviews. We created a short survey and then got a couple dozen volunteers to take it. We recorded each question to a WAV file sampled at 16 kHz, 16-bit mono, as suggested in the documentation.
We then took the interview script, added a few words we knew would appear in responses, and used it as the knowledge base for the LM tool to create new .LM and .DIC files.
We actually replaced the WavFile config.xml with another we found on this forum, and modified it to reference the new LM and DIC files.
Our Results:
For a representative WAV file (see link below), the spoken words were
"To categorize the voice characteristics of the audio recordings we need to collect some data about me as the interviewer and you as the respondent. For example, my gender is male."
The transcription returns
RESULT: eight and write voice characteristics the audio record when need collect need of of need
While it does seem to have recognized a fair number of the words, it's still pretty far from being a useful transcription.
I've included what I think are the relevant files here as links, for ease in reading this post. Let me know if you'd prefer to have them included inline, or if there's anything else you'd like to see.
Config.xml:
http://www.mediafire.com/?nbtzzhxh039
Quex.dic:
http://www.mediafire.com/?0ttydtajedx
Quex.lm:
http://www.mediafire.com/?fxrzj44yv3n
Quex.sent:
http://www.mediafire.com/?bne5zmgmgxe
Sample WAV file produced in the interview:
http://www.mediafire.com/?xunghzxt37w
Same file with leading and trailing junk trimmed off:
http://www.mediafire.com/?dfx4mc1m02d
Any help would be appreciated!
Carl
Nickolay,
On your suggestion I tried modifying the language model we were using. It had used very long sentences that you implied were causing problems (or at least that's how I interpreted it). So I took the corpus and split it at what seemed like logical phrasing breaks (natural pauses). Then I repeated some words that appeared more frequently in the recordings than in the script (in an attempt to "favor" them in the output), and I also added a few other words that weren't in the script but appeared frequently in people's free-form responses.
I used lmtool to build a new LM and dict and re-ran my trial. Some transcriptions were better, but some were worse. I don't understand why they would be worse, and I'd really like to get the feeling that my work is taking me somewhere, that is, that I am converging toward maximum accuracy.
Can you provide me any insight into why these included transcriptions would be worse than in the previous trial? In fairness, I've included only a few cases where the transcription was worse; there were also quite a good number that showed improvement.
If you have any questions, please ask. I put all the files in a ZIP and uploaded it to:
http://www.mediafire.com/?bambm2s5wev
Thanks!
Carl
Hm, I checked this, sorry for the delay. Actually I'd recommend you collect more testing data, around 50 files at least, to get more or less significant statistics.
About quality, it recognizes everything quite well. You probably just need to add more sentences to the LM (there is no "for example" in a proper position there) and use more variants in the dictionary. Try to add:
EXAMPLE(2) IH G Z AE M P L
in the proper place in the dictionary. Then it will decode the first sentence better:
WE NEED TO CATEGORIZE THE LOUDNESS OF OUR VOICES FOR EXAMPLE I I AM SAYING THIS ITEM IN A TONE OF VOICE THAT IS MEDIUM
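The "(2)" suffix is how the dictionary marks an alternate pronunciation; the variant is just an extra line placed next to the base entry. An illustrative sketch of how that sits in the dictionary (the surrounding entries and the base pronunciation are examples only, not the actual contents of quex.dic):

    EXAMPLE       IH G Z AE M P AH L
    EXAMPLE(2)    IH G Z AE M P L
    GENDER        JH EH N D ER
    MALE          M EY L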
Hm, in your case the beams are too narrow, you need wider ones (around 1e-120). Also, with such bad quality it will be very hard to recognize accurately. Check your file, it has an enormous DC offset. Could you record with a better microphone?
Heh, and to be honest I don't know what to do with sphinx4 to make it work. WSJ unfortunately doesn't perform well. But sphinx3 with hub4 works quite well:
sphinx3_decode \
    -adcin yes \
    -cepext .wav \
    -cepdir . \
    -ctl test.ctl \
    -dict quex.dic \
    -fdict filler.dict \
    -remove_dc no \
    -lm quex.lm \
    -hmm /home/shmyrev/local/share/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd
FWDVIT: TO CATEGORIZE THE VOICE CHARACTERISTICS OF THE AUDIO RECORDINGS
WE NEED TO COLLECT SOME DATA ABOUT ME AS THE INTERVIEWER AND YOU AS THE
RESPONDENT FOR EXAMPLE MY GENDER IS MALE (test)
WSJ results are a little worse:
sphinx3_decode \
    -adcin yes \
    -cepext .wav \
    -cepdir . \
    -ctl test.ctl \
    -dict quex.dic \
    -fdict filler.dict \
    -remove_dc no \
    -lm quex.lm \
    -hmm wsj_all_cd30.mllt_cd_cont_4000 \
    -subvq wsj_all_cd30.mllt_cd_cont_4000/subvq \
    -beam 1e-80 \
    -wbeam 1e-60 \
    -subvqbeam 1e-2 \
    -maxhmmpf 2500 \
    -maxcdsenpf 1500 \
    -maxwpf 20
FWDVIT: A TO CATEGORIZE THE VOICE CHARACTERISTICS OF THE AUDIO RECORD THE WE NEED TO COLLECT SOME DATA ABOUT ME AS THE INTERVIEWER AND YOU AS THE REGION A FOR EXAMPLE MY GENDER IS A MALE A (test)
But I don't know how to recognize this properly with sphinx4.
Wow, thanks for the fast and great response, Nickolay!
The audio was picked up with the built-in mic on a standard-issue (Windows) IBM laptop, and our hope is to be able to do the interviews with no other special equipment. Once we have the files off the laptop, though, can you recommend any tools (and, helpfully, what adjustments to make to the files using those tools) that we could apply post-collection and pre-recognition? Uh, and what does DC mean, the thing that's so huge in the WAV file (sorry!)? I will make the adjustments to the beam you mention, and I'll try using HUB4 instead of WSJ. I am very impressed with the accuracy you got with Sphinx-3! Unfortunately, I don't think we have any *nix boxes around here (but I will check), so we may be stuck with Sphinx-4...
Can you suggest any resources that might help with improving Sphinx-4 accuracy? My naive impression was that 3 and 4 were basically the same program, so it should at least be possible to duplicate your impressive results?
Thanks again,
Carl
Well, I looked closer at the sphinx4 case and now I understand the reasons. As usual they are rather complicated, but let me explain.
First of all, about DC. Google for "DC offset", you'll find a lot of pictures. Basically a waveform is a function moving around zero. In the perfect case silence has zero values. But due to hardware problems this function is sometimes shifted by a constant value. Silences are then not zero regions but regions of some positive or negative value. Check your file with a wav editor, you'll see that. Usually DC is not a problem since you can easily remove it by subtracting the average or with a one-pole filter, but in your case it causes problems.
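As a rough command-line sketch of that (using SoX as an example tool; the file names are placeholders, and Audacity's Normalize effect can also remove a DC offset):

    # "Mean amplitude" far from zero in the stats output indicates a DC offset
    sox interview.wav -n stat

    # a gentle high-pass well below the speech band strips the constant offset
    sox interview.wav interview_nodc.wav highpass 10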
Now let's look at the frontend configuration. The first part of the pipeline (speechClassifier, speechMarker and nonSpeechDataFilter) builds the so-called endpointer. Speech is a continuous stream with pauses. The decoder can't decode big chunks at once; it usually splits big chunks into smaller ones at the big pauses. In your case the endpointer is not correct because of the DC: it just can't detect the silence properly. Once you remove the DC it will be more stable, but another problem will appear: your language model. It's too small to cover all the variants of small chunks. For example, if we split your recording according to pauses, we get something like:
TO CATEGORIZE THE VOICE CHARACTERISTICS OF THE AUDIO RECORDINGS
WE NEED TO COLLECT SOME DATA ABOUT ME AS THE INTERVIEWER AND YOU AS THE RESPONDENT
FOR EXAMPLE MY GENDER
MALE
But the problem is that your language model doesn't cover such chunks; it wasn't trained on such text. Instead it is only suitable for recognizing one very big chunk. You need a better language model.
Now, let's try to remove the endpointer so that we can reuse the LM (sphinx3_decode also has no endpointer; it decodes the phrase at once) by using the following frontend. Alternatively you can try to set the property <property name="mergeSpeechSegments" value="true"/> on NonSpeechDataFilter, but since the speech detector doesn't work it won't help.
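As a rough sketch of such a frontend (component names here follow the stock Sphinx-4 demo configs; each item is assumed to be a component defined elsewhere in the config file):

    <component name="frontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
        <propertylist name="pipeline">
            <item>audioFileDataSource</item>
            <item>preemphasizer</item>
            <item>windower</item>
            <item>fft</item>
            <item>melFilterBank</item>
            <item>dct</item>
            <item>liveCMN</item>
            <item>featureExtraction</item>
        </propertylist>
    </component>

The top-level "frontend" property would then point at this component instead of epFrontEnd, and speechClassifier, speechMarker and nonSpeechDataFilter simply don't appear in the pipeline.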
OK, now we have the whole utterance in one big chunk that is suitable for decoding with our LM, but there is another problem: there are big stretches of silence inside the utterance. To let the decoder find them, you have to build a special dictionary that sphinx4 won't build by default (sphinx3 will):
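As a rough sketch of such a dictionary component (the paths and the unitManager reference are assumptions):

    <component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
        <property name="dictionaryPath" value="quex.dic"/>
        <property name="fillerPath" value="filler.dict"/>
        <property name="addSilEndingPronunciation" value="true"/>
        <property name="unitManager" value="unitManager"/>
    </component>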
addSilEndingPronunciation=true is important here. Once you do the above, it will give you acceptable results:
RESULT: to categorize the voice characteristics of the audio recordings
we need to collect some data of need the an years own for am will my
gender is male in
Still, sphinx4 doesn't perform well on utterances with a big silence part like yours; sphinx3 is better. For more info, read the Javadoc on all of this. Basically I'd rather suggest you:
1) Get a better microphone.
2) Enable the endpointer but get a better language model (sphinx3 will also require that once you decode very big utterances).
3) Use sphinx3; it's available on Windows either as a native binary or under cygwin.
Thanks for your help, Nickolay. I haven't figured out how to use the HUB4 model yet, but I made the changes you suggested to my config file (pasted below), and used Audacity (another SourceForge project) to make the following edits to the WAV file:
- trimmed leading and trailing silence
- applied a low-pass filter at 5 kHz
- applied a high-pass filter at 800 Hz
- normalized the resulting file to -3 dB peak level
The transcription Sphinx4 came back with was:
RESULT: categorize voice characteristics during record continue collect need to need being please own for can my gender is male
I've tried various combinations of mods to the WAV file (including cutting out long pauses within the file), and played around with the beams a bit in the config.xml, but I have not stumbled upon a combination that yields better results than what's shown above.
Nickolay, can I ask what you did differently from me to produce so much better results, even with Sphinx4? If part of it was using HUB4, could you provide an example of a config that uses it? I'd like to get some useful results, but I'm feeling a bit frustrated, and what documentation I can find does not seem to provide the answers I'm looking for.
Thanks again for the help!
Carl
Modified WAV file:
http://www.mediafire.com/?mlmrr2dxxxe
Config file with mods to beam, plus front end and dictionary section changes:
<?xml version="1.0" encoding="UTF-8"?>
<!--
Sphinx-4 Configuration file
-->
<!-- ******** -->
<!-- an4 configuration file -->
<!-- ******** -->
<config>
    <!-- ******** -->
    <!-- frequently tuned properties -->
    <!-- ******** -->
    <property name="absoluteBeamWidth" value="1500"/>
    <property name="relativeBeamWidth" value="1E-120"/>
    <property name="absoluteWordBeamWidth" value="20"/>
    <property name="relativeWordBeamWidth" value="1E-60"/>
    <property name="wordInsertionProbability" value="1E-16"/>
    <property name="languageWeight" value="7.0"/>
    <property name="silenceInsertionProbability" value=".1"/>
    <property name="frontend" value="epFrontEnd"/>
    <property name="recognizer" value="recognizer"/>
    <property name="showCreations" value="false"/>
</config>
> make the following edits to the WAV file:
> - trimmed leading and trailing silence
> - applied a low-pass filter at 5 kHz
> - applied a high-pass filter at 800 Hz
> - normalized the resulting file to -3 dB peak level
This is a step in completely the wrong direction. The model extracts the cepstrum from 130 to 6800 Hz; filtering will make recognition much more unstable.
Check my files instead:
http://www.mediafire.com/?lxj52ncnyad
Thanks Nickolay, I've been working on something else and missed your post. So I should not have done the filtering, but normalizing and removing the leading and trailing silence would be good steps, would they not?
I'm also interested in trying Sphinx3 out for this, but I didn't see a binary distribution anywhere under http://cmusphinx.sourceforge.net/html/download.php. Can you provide me a link? (Sorry for the stupid question...)
Thanks!
Carl
> So I should not have done the filtering, but normalizing and removing the leading and trailing silence would be good steps.
Neither of them is a good step.
> I'm also interested in trying Sphinx3 out for this
Hm, sorry, you have to compile it yourself. It's a matter of 5 minutes.
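As a rough sketch of the usual build steps (assuming the sphinxbase and sphinx3 source tarballs from the download page and a cygwin or other unix-like shell; package names and version numbers are placeholders):

    tar xzf sphinxbase-X.Y.tar.gz
    cd sphinxbase-X.Y
    ./configure && make && make install
    cd ..
    tar xzf sphinx3-X.Y.tar.gz
    cd sphinx3-X.Y
    ./configure && make && make install

sphinxbase has to be built and installed first; if sphinx3's configure doesn't pick it up automatically, see the README in each package.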
Hmm, I thought this was fixed in Sphinx 3.7 but apparently not.
Sphinx3 has a stupid "feature" which for no good reason makes it refuse to load text-format language models unless you also add '-lminmemory 1' to your command-line or configuration file.
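For instance, the flag just goes on the same command line as the other options (a sketch, with the rest of the options as in the earlier sphinx3_decode example):

    sphinx3_decode \
        -lminmemory 1 \
        -lm quex.lm \
        ... (remaining options unchanged)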
Aha, adding '-lminmemory 1' to the command line did it. Thanks!
Now I have (I hope) a very simple Sphinx3 question. When I run sphinx3_decode on a single WAV file, using the params that Nickolay volunteered above, I get about 1000 lines of output, of which only 2 are the transcription. Is there a switch to reduce the number of info messages produced, or alternatively to output the result (only) to a separate destination?
Thanks again,
Carl
-hyp test.result ?
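That is, appended to the decode command (a sketch; the file name is just an example):

    sphinx3_decode ... -hyp test.result

The recognized hypothesis for each utterance listed in the .ctl file is then written to test.result, while the INFO messages still go to the console.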
That (-hyp <filename>) did the trick, thanks Nickolay! BTW, recognition with Sphinx 3 is much better than the best we'd achieved with Sphinx 4 so far, at least for our application. Next steps, I guess, will be:
1) Fiddle with tuning params looking for overall improvement.
2) Try to build a better language model, based on all of our recordings.
Thanks again,
Carl
We really need a proper comparison of the sphinxes to make them work. It seems that they all have problems sometimes, but the problems aren't easily detectable.
About your tasks: 2) is much more important.
Hi Nickolay,
I know you're in major demand here, but did you get a chance to look at my post of 7/9 in this thread? Can you tell if I did something wrong, or if not, what I should do next?
Let me know if you have any problems accessing the files I provided.
Thanks much!
Carl