Hi all,
I am working on decoding news videos about four minutes long. Currently I am using this one: http://www.youtube.com/watch?v=GrxzWWkZlr0. First I retrieve the video, then extract the sound and convert it to a suitable WAV format.
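For reference, the extract-and-convert step can be sketched as follows (a Python sketch assuming ffmpeg is installed; the filenames are placeholders, and the 16 kHz / mono / 16-bit target is what 16 kHz broadcast-news models such as HUB4 typically expect — check your model's documentation):

```python
# Sketch: build an ffmpeg command line that strips the video stream and
# resamples the audio track to 16 kHz mono 16-bit PCM WAV.
def build_ffmpeg_cmd(src, dst):
    return [
        "ffmpeg", "-i", src,
        "-vn",                   # drop the video stream
        "-ac", "1",              # mono
        "-ar", "16000",          # 16 kHz sample rate
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        dst,
    ]

cmd = build_ffmpeg_cmd("news_video.flv", "news_audio.wav")
print(" ".join(cmd))
# then run it with, e.g., subprocess.run(cmd, check=True)
```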
Then I decode with the HUB4 acoustic model and a Gigaword language model (http://www.inference.phy.cam.ac.uk/kv227/lm_giga/). So far my accuracy is really low, not to say close to 0%. I have tried tuning the general properties in different ways, as well as the frontend, but still without success.
Could it be explained by the duration of the video and its content?
Any tip would be really welcome. Thank you in advance.
I enclose the main parts of my configuration file:
<config>
<!-- ******** -->
<!-- frequently tuned properties -->
<!-- ******** -->
<property name="absoluteBeamWidth" value="20000"/>
<property name="relativeBeamWidth" value="1E-60"/>
<property name="absoluteWordBeamWidth" value="25"/>
<property name="relativeWordBeamWidth" value="1E-30"/>
<property name="wordInsertionProbability" value="0.01"/>
<property name="languageWeight" value="7"/>
<property name="silenceInsertionProbability" value=".01"/>
<property name="frontend" value="wavFrontEnd"/>
<property name="recognizer" value="recognizer"/>
<property name="showCreations" value="false"/>
<property name="logLevel" value="INFO"/>
<!-- ******** -->
<!-- word recognizer configuration -->
<!-- ******** -->
<component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
<property name="decoder" value="decoder"/>
<propertylist name="monitors">
<item>accuracyTracker</item>
<item>speedTracker</item>
<item>memoryTracker</item>
<item>recognizerMonitor</item>
</propertylist>
</component>
It's been a while since I've looked at Sphinx, but I seem to remember that your input audio has to match the format and quality of the audio the AM was trained on. Although you are converting to the correct WAV format, the actual spectral features which Sphinx relies on to identify speech components are altered by the lossy compression applied when the original recording was transcoded to FLV. Through this process, vital information is lost and cannot simply be recreated by transcoding or upsampling back to the expected format.
In other words, if I record a WAV containing speech, transcode it to MP3 or FLV or anything else involving lossy compression, and then transcode back to the original WAV format, the result is not the same file: it has been reduced to the quality of the compressed format, and is therefore not going to work well with Sphinx, as you have found.
See if you can get recordings from other sources which have not compressed the audio track. FM radio might be a good bet (record directly to WAV from an FM radio source).
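A toy illustration of the "lost information cannot be recreated" point (a pure-stdlib Python sketch using naive resampling as a crude stand-in for codec loss — real lossy codecs are more complex, but the principle is the same): a tone above the Nyquist limit of a lower intermediate sample rate aliases, and naive upsampling back does not restore it.

```python
import math

SR = 16000      # original sample rate
FREQ = 6000.0   # tone above the 4 kHz Nyquist limit of an 8 kHz intermediate
N = SR          # one second of samples

orig = [math.sin(2 * math.pi * FREQ * n / SR) for n in range(N)]
low = orig[::2]                            # naive decimation to 8 kHz: the 6 kHz tone aliases
rebuilt = [low[n // 2] for n in range(N)]  # naive upsampling back to 16 kHz

def correlation(a, b):
    # normalized cross-correlation of two equal-length signals
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

print(correlation(orig, rebuilt))  # well below 1.0: the original tone is gone
```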
Regards,
Chris
Thanks for the reply. That's a part I wasn't really aware of. I will try to get some new data from the sources you recommend.
So no matter what configuration I manage to put together, I won't be able to get any interesting results with HUB4 on audio extracted from converted YouTube videos.
If anyone has succeeded in getting useful output decoding this way, please let me know.
Regards,
Boris.
For that you would need an AM trained on YouTube (FLV-encoded) samples, as far as I know.
The accuracy should be low, but I'd expect it to be around 50%, certainly not 0%. Are you sure you didn't make a mistake somewhere? For example, what is the sample rate of the samples you are decoding?
Also make sure you are using the latest trunk.
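A quick way to sanity-check the sample rate is Python's stdlib `wave` module (a sketch; the 16000/1/2 target below matches the 16 kHz / mono / 16-bit input that 16 kHz HUB4-style models expect — verify against your model's documentation, and the `probe.wav` demo file is just a placeholder):

```python
import wave

EXPECTED = (16000, 1, 2)  # sample rate (Hz), channels, bytes per sample

def wav_format(path):
    """Return (sample_rate, channels, sample_width_bytes) of a WAV file."""
    with wave.open(path, "rb") as w:
        return (w.getframerate(), w.getnchannels(), w.getsampwidth())

# demo: write a short silent 16 kHz mono 16-bit file and verify it
with wave.open("probe.wav", "wb") as w:
    w.setframerate(16000)
    w.setnchannels(1)
    w.setsampwidth(2)
    w.writeframes(b"\x00\x00" * 1600)  # 0.1 s of silence

print(wav_format("probe.wav") == EXPECTED)  # True for this probe file
```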