Hello
I am just getting to grips with Sphinx 4 and it is looking quite promising for my needs.
This works:
I record an audio clip from my machine's webcam using Windows Sound Recorder at 22 kHz. I upload it and pass it through FFmpeg to downsample it to 16 kHz. I then run it through Sphinx 4 and it gives me back some words, great!
Doesn't work:
I record my sound using the same webcam, but it is now streamed to the server and saved as an FLV file at 22 kHz. I do the same as before: pass it through FFmpeg and downsample it to 16 kHz. I then pass it through Sphinx 4 and there are no matches.
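For reference, the conversion step described above could be scripted roughly like this. It is only a sketch: the class name FfmpegResample and the exact set of options are assumptions rather than the command actually used, and it presumes an ffmpeg binary is on the PATH. Note that resampling only changes the sample rate and container; it cannot restore anything an earlier lossy codec removed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds and runs an ffmpeg command that extracts the
// audio track and writes 16 kHz, 16-bit, mono PCM in a WAV container.
class FfmpegResample {

    // Build the argument list; input and output are file paths.
    static List<String> command(String input, String output) {
        return Arrays.asList(
            "ffmpeg",
            "-y",                    // overwrite the output file if it exists
            "-i", input,             // input file (e.g. the FLV or the 22 kHz WAV)
            "-vn",                   // drop any video stream
            "-ac", "1",              // mono
            "-ar", "16000",          // resample to 16 kHz
            "-acodec", "pcm_s16le",  // signed 16-bit little-endian PCM
            output);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java FfmpegResample <input> <output.wav>");
            return;
        }
        // Run ffmpeg and forward its console output to this process.
        Process p = new ProcessBuilder(command(args[0], args[1])).inheritIO().start();
        System.exit(p.waitFor());
    }
}
```

Running it as `java FfmpegResample clip.flv clip.wav` (file names are placeholders) should produce a file with the same header details in both the working and the non-working case.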
The clips look the same when I view the audio details:
Works:
WAVE (.wav) file, byte length: 3260104, data format: PCM_SIGNED 16000.0 Hz, 16 bit, mono, 2 bytes/frame, little-endian, frame length: 1630030
Doesn't Work:
WAVE (.wav) file, byte length: 723744, data format: PCM_SIGNED 16000.0 Hz, 16 bit, mono, 2 bytes/frame, little-endian, frame length: 361850
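Details like the two lines above can be read with the standard javax.sound.sampled API. The sketch below is only illustrative (the class name WavCheck and its helper methods are made up for this example); it prints the file format, checks for 16 kHz 16-bit mono signed PCM, and derives the clip duration from the frame length. It only inspects the WAV header, so two files can pass this check identically even if one of them was lossily compressed earlier in the chain.

```java
import java.io.File;
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;

// Hypothetical helper for comparing the header details of two WAV files.
class WavCheck {

    // Duration in seconds = number of frames / frames per second.
    static double durationSeconds(long frameLength, float frameRate) {
        return frameLength / (double) frameRate;
    }

    // True if the header describes 16 kHz, 16-bit, mono, signed little-endian PCM.
    static boolean looksLike16kMonoPcm(AudioFormat f) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(f.getEncoding())
            && f.getSampleRate() == 16000.0f
            && f.getSampleSizeInBits() == 16
            && f.getChannels() == 1
            && !f.isBigEndian();
    }

    public static void main(String[] args) throws Exception {
        for (String path : args) {
            AudioFileFormat aff = AudioSystem.getAudioFileFormat(new File(path));
            AudioFormat f = aff.getFormat();
            System.out.println(path + ": " + aff);
            System.out.println("  16 kHz mono PCM: " + looksLike16kMonoPcm(f));
            System.out.println("  duration (s):    "
                + durationSeconds(aff.getFrameLength(), f.getFrameRate()));
        }
    }
}
```

With the frame lengths listed above, the working clip comes out at about 101.9 seconds (1630030 / 16000) and the non-working one at about 22.6 seconds (361850 / 16000), so the headers differ only in length.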
BTW when I say it doesn't work, the file is accepted but nothing is matched.
I am running 1.0Beta3 on Linux.
Any help would be really appreciated.
Thank you
Ben
If it helps, here is part of my configuration. I can post all of it if needed...
<property name="absoluteBeamWidth" value="5000"/>
<property name="relativeBeamWidth" value="1E-120"/>
<property name="absoluteWordBeamWidth" value="200"/>
<property name="relativeWordBeamWidth" value="1E-80"/>
<property name="wordInsertionProbability" value="0.2"/>
<property name="languageWeight" value="10.5"/>
<property name="silenceInsertionProbability" value=".1"/>
<property name="frontend" value="epFrontEnd"/>
<property name="recognizer" value="recognizer"/>
<property name="showCreations" value="false"/>
Thanks for the info Nickolay.
This is not going to work because of the audio compression used. It's often ADPCM or CELP at 8 kHz. You would need to take telephone models and adapt them to audio that has gone through the same channel.
It would be better to avoid using FLV for transmission/storage.