I'm trying to get Sphinx4 to recognize digits using the Asterisk PBX system. I'm sending PCM-signed little-endian 16-bit 8kHz audio to Sphinx but the accuracy is pretty bad. Any hints on how to configure Sphinx to handle the relatively poor fidelity of my VoIP app?
Zac,
In addition to configuring S4 to read 8kHz, also try configuring it to read little-endian data. I suspect that you're using the StreamDataSource, which expects big-endian by default. So change your config file to use:
<component name="streamDataSource" type="edu.cmu.sphinx.frontend.util.StreamDataSource">
    <property name="sampleRate" value="8000"/>
    <property name="bigEndianData" value="false"/>
</component>
Hope this helps. Note that it might be possible to convert your 8kHz data to 16kHz using the facilities in Java Sound (look at javax.sound.sampled.AudioSystem.getAudioInputStream() methods). We have this built into the edu.cmu.sphinx.frontend.util.Microphone class, but not the StreamDataSource class. Something we should add in the future.
philip
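For anyone who wants to try the Java Sound route mentioned above, here is a minimal sketch. It assumes 16-bit signed mono PCM, and the class and helper names (ResampleCheck, pcm, tryConvert) are made up for illustration. Whether a JVM can actually change the sample rate depends on the converters it ships with, so the sketch asks AudioSystem.isConversionSupported() first instead of assuming the conversion exists.

```java
import java.io.ByteArrayInputStream;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;

class ResampleCheck {
    // Hypothetical helper: 16-bit signed mono PCM at the given rate and byte order.
    static AudioFormat pcm(float sampleRate, boolean bigEndian) {
        return new AudioFormat(sampleRate, 16, 1, true, bigEndian);
    }

    // Returns a converted stream, or null if this JVM has no converter
    // for the requested source/target format pair.
    static AudioInputStream tryConvert(AudioInputStream source, AudioFormat target) {
        if (!AudioSystem.isConversionSupported(target, source.getFormat())) {
            return null;
        }
        return AudioSystem.getAudioInputStream(target, source);
    }

    public static void main(String[] args) {
        AudioFormat source = pcm(8000f, false);   // 8kHz little-endian, as in the question
        AudioFormat target = pcm(16000f, true);   // 16kHz big-endian

        // One second of silence stands in for real telephone audio.
        byte[] silence = new byte[8000 * source.getFrameSize()];
        AudioInputStream in = new AudioInputStream(
                new ByteArrayInputStream(silence), source,
                silence.length / source.getFrameSize());

        AudioInputStream out = tryConvert(in, target);
        System.out.println(out == null
                ? "No converter available for 8kHz -> 16kHz on this JVM"
                : "Converted stream format: " + out.getFormat());
    }
}
```

If tryConvert() returns null, the audio has to be resampled some other way before it reaches Sphinx-4.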
Thanks Philip, I'll try converting to 16kHz and see if that helps. I had already suspected, as you did, that StreamDataSource expects big-endian by default and I did change that in my config file. I actually ended up modifying the Microphone front end code to suit my needs instead of the StreamDataSource and that piece now seems to be in place and working OK.
If we can solve this accuracy issue I know there are lots of people in the Asterisk community who would love to be able to use a Sphinx tie-in (for marketing, please press or say "one").
Zac
Zac,
If after converting to 16kHz you still don't see an increase in accuracy, feel free to tar-gz up a few of your 8kHz test files (the ones that have problems), and send them to me to try out, if you would like to.
philip
Unfortunately it seems there's no way to convert "up" to 16kHz using the AudioSystem--you can only downsample, which makes sense I guess.
I think I'll take you up on your offer to try out some of my files. Can I send them to your ppk96 address?
I appreciate it.
Zac
The files I had trouble with seem to work fine in batch mode so there's something about my custom StreamDataSource that's the issue.
Hi Zac,
I tried feeding 8kHz audio data to Sphinx-4 via the live demo. It simply doesn't recognize very well. Trying to upsample it to 16kHz probably won't give you very good results either. The problem is that our acoustic model data is 16kHz, so 8kHz audio data just won't work for now. We're looking into the possibility of training some 8kHz models. So please hold this off for now. We'll let you know what happens.
philip
Thanks Philip,
Do you know why there would be a discrepancy between the live vs. batch accuracy? Sphinx seems to handle the 8kHz data fine in batch mode but has trouble in live situations.
Zac
Hi Zac,
I think the discrepancy you're seeing is due to an endian conversion problem. I actually replied to the same question you posted on the other thread. You'll find the fix at:
http://sourceforge.net/forum/forum.php?thread_id=1089720&forum_id=5471
On the other hand, since the acoustic models are trained on 16kHz data, and your data is 8kHz, even if you fix the endian issue it's bound to not work very well (meaning you won't get accuracy in the high 90s, which is what it should reach). So I hope you won't give up on Sphinx-4 based on this :-) If you actually feed it 16kHz data, decoding digits works very well (accuracy in the high 90s). We are looking into training some 8kHz models, since you're not the first person who wants to be able to handle audio data from the telephone line. Please bear with us for the moment, and we'll let you know as soon as we can.
philip
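To make the endian problem concrete, here is a small sketch of how the same two bytes decode to different 16-bit samples depending on the byte order the reader assumes. The class and method names (EndianDemo, decode) are invented for this illustration and are not part of Sphinx-4.

```java
class EndianDemo {
    // Combine two bytes into one signed 16-bit sample.
    // bigEndian = true means the first byte is the most significant one.
    static short decode(byte first, byte second, boolean bigEndian) {
        int high = bigEndian ? first : second;
        int low = bigEndian ? second : first;
        // Mask the low byte so its sign bit does not leak into the high byte.
        return (short) ((high << 8) | (low & 0xFF));
    }

    public static void main(String[] args) {
        byte first = 0x01;
        byte second = 0x00;
        // Little-endian sender, little-endian reader: the intended sample.
        System.out.println(decode(first, second, false)); // prints 1
        // Little-endian sender, big-endian reader: a very different sample.
        System.out.println(decode(first, second, true));  // prints 256
    }
}
```

A stream read with the wrong byte order still has the right length and sample rate, so nothing fails outright; the recognizer just receives what amounts to noise.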
I am trying to do the same thing using a Cisco router. Is there any more news on the 8kHz models yet?
The good news is that I'm now up to speed using SphinxTrain and have been able to create some 16kHz models based on the WSJ corpus. Many thanks to Bhiksha Raj for getting me started and helping me work through the learning curve.
I tested the resulting models against our WSJ5K regression test and it seems as though things worked fine. Of course, this doesn't mean I'm a SphinxTrain expert, but I'm at least able to use it to create models for Sphinx-4. :-)
I'm now working on converting the WSJ training data from 16kHz to 8kHz and will spin up some training sessions on my poor little Linux box at home. If I get something working, I'll figure out a way to get the resulting models into the open source.
Will
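As a rough illustration of what converting 16kHz samples to 8kHz involves, here is a deliberately crude sketch that halves the sample rate by averaging each pair of adjacent samples. This is an assumption made for illustration only, not the procedure used for the WSJ data; a real conversion would use a proper anti-aliasing filter from an audio tool or library. The names (Halver, halveRate) are invented.

```java
class Halver {
    // Halve the sample rate by averaging each pair of neighbouring samples.
    // The averaging acts as a very rough low-pass filter before decimation.
    // A trailing unpaired sample, if any, is dropped.
    static short[] halveRate(short[] input) {
        short[] output = new short[input.length / 2];
        for (int i = 0; i < output.length; i++) {
            int sum = input[2 * i] + input[2 * i + 1]; // int arithmetic, no overflow
            output[i] = (short) (sum / 2);
        }
        return output;
    }

    public static void main(String[] args) {
        short[] sixteenK = {0, 2, 4, 6};
        short[] eightK = halveRate(sixteenK);
        System.out.println(eightK[0] + " " + eightK[1]); // prints 1 5
    }
}
```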
That would be great! I haven't heard anything lately from the Sphinx4 developers on the 8kHz data they promised, so if you have any success with this, please share! My email is zacw@comcast.net.
Thanks in advance,
Zac
Yesterday, my little Linux box stopped whirring and out popped some 8kHz models trained from the clean channel of the WSJ0 training data. I did some testing today, and they seem to give OK results for 8kHz data.
They're now in the CVS repository under the sphinx4 module. If you do an update followed by an "ant clean all", you should end up with a jar file containing the new model: lib/WSJ_8gau_13dCep_8kHz_31mel_200Hz_3500Hz.jar
Since this model was trained by merely downsampling the 16kHz data to 8kHz, it doesn't include any telephony channel characteristics. So, I'm not sure how well it will work with your VoIP app. But give it a shot and let me know how it works. If the model gives you better accuracy, but still isn't good enough for digits, I might try training up some 8kHz TIDIGITS models.
Signed,
Will, who's happy he got this far, but still doesn't feel up to the task of being able to answer many questions about SphinxTrain. :-)