Decoding News video - Low accuracy

Help
boriguen
2009-05-21
2012-09-22
  • boriguen

    boriguen - 2009-05-21

    Hi all,

    I am working on decoding news videos that are about 4 minutes long. Currently I am using this one: http://www.youtube.com/watch?v=GrxzWWkZlr0. First I retrieve the video, then I extract the audio and convert it to the suitable WAV format.

    Then I use the HUB4 acoustic model and the Gigaword language model (http://www.inference.phy.cam.ac.uk/kv227/lm_giga/). So far I have gotten a really low accuracy rate, close to 0%. I have tried tuning the general properties in different ways, as well as the frontend, but without any success.
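    For readers who want to put a number on "accuracy" outside of Sphinx, here is a minimal sketch of a generic word error rate computed with a word-level edit distance. It is not the scoring code used by the accuracyTracker in the configuration; the function name `word_error_rate` is made up for illustration, and the inputs are assumed to be plain whitespace-separated transcripts.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the dog sat"))
```

    Note that this ratio can exceed 1.0 when the hypothesis contains many inserted words, so "accuracy" defined as one minus this value can be negative.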

    Could it be explained by the duration of the video and its content?

    Any tip would be really welcome. Thank you in advance.

    I enclose the main parts of my configuration file:

    <config>
    <!-- ******** -->
    <!-- frequently tuned properties -->
    <!-- ******** -->
    <property name="absoluteBeamWidth" value="20000"/>
    <property name="relativeBeamWidth" value="1E-60"/>
    <property name="absoluteWordBeamWidth" value="25"/>
    <property name="relativeWordBeamWidth" value="1E-30"/>
    <property name="wordInsertionProbability" value="0.01"/>
    <property name="languageWeight" value="7"/>
    <property name="silenceInsertionProbability" value=".01"/>
    <property name="frontend" value="wavFrontEnd"/>
    <property name="recognizer" value="recognizer"/>
    <property name="showCreations" value="false"/>
    <config>
    <property name="logLevel" value="INFO"/>
    </config>

    <!-- ******** -->
    <!-- word recognizer configuration -->
    <!-- ******** -->
    <component name="recognizer" type="edu.cmu.sphinx.recognizer.Recognizer">
    <property name="decoder" value="decoder"/>
    <propertylist name="monitors">
    <item>accuracyTracker </item>
    <item>speedTracker </item>
    <item>memoryTracker </item>
    <item>recognizerMonitor </item>
    </propertylist>
    </component>

    <!-- ******************************************************** -->
    <!-- The Decoder configuration -->
    <!-- ******************************************************** -->
    <component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
        <property name="searchManager" value="wordPruningSearchManager"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The Search Manager -->
    <!-- ******************************************************** -->
    <component name="wordPruningSearchManager" type="edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager">
        <property name="logMath" value="logMath"/>
        <property name="linguist" value="lexTreeLinguist"/>
        <property name="pruner" value="trivialPruner"/>
        <property name="scorer" value="threadedScorer"/>
        <property name="activeListManager" value="activeListManager"/>
        <property name="growSkipInterval" value="0"/>
        <property name="keepAllTokens" value="true"/>
        <property name="checkStateOrder" value="false"/>
        <property name="buildWordLattice" value="false"/>
        <property name="maxLatticeEdges" value="3"/>
        <property name="acousticLookaheadFrames" value="2.0"/>
        <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The Active Lists -->
    <!-- ******************************************************** -->
    <component name="activeListManager" type="edu.cmu.sphinx.decoder.search.SimpleActiveListManager">
        <propertylist name="activeListFactories">
            <item>standardActiveListFactory</item>
            <item>wordActiveListFactory</item>
            <item>wordActiveListFactory</item>
            <item>standardActiveListFactory</item>
            <item>standardActiveListFactory</item>
            <item>standardActiveListFactory</item>
        </propertylist>
    </component>

    <component name="standardActiveListFactory" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
        <property name="logMath" value="logMath"/>
        <property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
        <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
    </component>

    <component name="wordActiveListFactory" type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
        <property name="logMath" value="logMath"/>
        <property name="absoluteBeamWidth" value="${absoluteWordBeamWidth}"/>
        <property name="relativeBeamWidth" value="${relativeWordBeamWidth}"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The Pruner -->
    <!-- ******************************************************** -->
    <component name="trivialPruner" type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>

    <!-- ******************************************************** -->
    <!-- The Scorer -->
    <!-- ******************************************************** -->
    <component name="threadedScorer" type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
        <property name="frontend" value="${frontend}"/>
        <property name="isCpuRelative" value="false"/>
        <property name="numThreads" value="1"/>
        <property name="minScoreablesPerThread" value="10"/>
        <property name="scoreablesKeepFeature" value="false"/>
    </component>
    <!-- ******************************************************** -->
    <!-- The linguist configuration -->
    <!-- ******************************************************** -->
    <component name="lexTreeLinguist" type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
        <property name="logMath" value="logMath"/>
        <property name="acousticModel" value="hub4"/>
        <property name="languageModel" value="gigaModel"/>
        <property name="dictionary" value="dictionaryHUB4"/>
        <!-- <property name="addFillerWords" value="false"/>
        <property name="fillerInsertionProbability" value="1E-10"/> -->
        <property name="generateUnitStates" value="true"/>
        <property name="wantUnigramSmear" value="true"/>
        <property name="unigramSmearWeight" value="1"/>
        <property name="wordInsertionProbability" value="${wordInsertionProbability}"/>
        <property name="silenceInsertionProbability" value="${silenceInsertionProbability}"/>
        <property name="languageWeight" value="${languageWeight}"/>
        <property name="unitManager" value="unitManager"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The Dictionary configuration HUB4                        -->
    <!-- ******************************************************** -->
    <component name="dictionaryHUB4"
        type="edu.cmu.sphinx.linguist.dictionary.FullDictionary">
        <property name="dictionaryPath"
                  value="resource:/edu.cmu.sphinx.model.acoustic.HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.Model!/edu/cmu/sphinx/model/acoustic/HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz/cmudict.06d"/>
        <property name="fillerPath"
              value="resource:/edu.cmu.sphinx.model.acoustic.HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.Model!/edu/cmu/sphinx/model/acoustic/HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz/fillerdict"/>
        <property name="addSilEndingPronunciation" value="false"/>
        <property name="wordReplacement" value="&lt;sil&gt;"/>
        <property name="allowMissingWords" value="false"/>
        <property name="unitManager" value="unitManager"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The Language Model configuration HUB4 -->
    <!-- ******************************************************** -->
    <component name="hub4Model"
        type="edu.cmu.sphinx.linguist.language.ngram.large.LargeTrigramModel">
        <property name="logMath" value="logMath"/>
        <property name="maxDepth" value="3"/>
        <property name="unigramWeight" value=".5"/>
        <property name="dictionary" value="dictionaryHUB4"/>
        <property name="location"
            value="D:/lectures/Master_Soft_Eng/Thesis_Work(exd950)/ExperimentalSystem/AutoSubGen/models/language/HUB4_trigram_lm/language_model.arpaformat.DMP"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The Language Model configuration GIGA -->
    <!-- ******************************************************** -->
    <component name="gigaModel"
        type="edu.cmu.sphinx.linguist.language.ngram.large.LargeTrigramModel">
        <property name="logMath" value="logMath"/>
        <property name="maxDepth" value="3"/>
        <property name="unigramWeight" value=".5"/>
        <property name="dictionary" value="dictionaryHUB4"/>
        <property name="location"
            value="D:/lectures/Master_Soft_Eng/Thesis_Work(exd950)/ExperimentalSystem/AutoSubGen/models/language/lm_giga_64k_vp_3gram.DMP"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The acoustic model configuration HUB4                    -->
    <!-- ******************************************************** -->
    <component name="hub4"
               type="edu.cmu.sphinx.model.acoustic.HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.Model">
        <property name="loader" value="hub4Loader"/>
        <property name="unitManager" value="unitManager"/>
    </component>

    <component name="hub4Loader" type="edu.cmu.sphinx.model.acoustic.HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.ModelLoader">
        <property name="logMath" value="logMath"/>
        <property name="unitManager" value="unitManager"/>
    </component>

    <!-- ******************************************************** -->
    <!-- The unit manager configuration -->
    <!-- ******************************************************** -->
    <component name="unitManager" type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>

    <!-- ******************************************************** -->
    <!-- The frontend configuration -->
    <!-- ******************************************************** -->
    <component name="wavFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
        <propertylist name="pipeline">
            <item>streamDataSource</item>
            <item>speechClassifier</item>
            <item>speechMarker</item>
            <item>nonSpeechDataFilter</item>
            <item>premphasizer</item>
            <item>windower</item>
            <item>fft</item>
            <item>melFilterBank</item>
            <item>dct</item>
            <item>liveCMN</item> <!-- batchCMN for batch mode -->
            <item>featureExtraction</item>
        </propertylist>
    </component>

    <component name="streamDataSource" type="edu.cmu.sphinx.frontend.util.StreamDataSource">
        <property name="sampleRate" value="16000"/>
        <property name="bitsPerSample" value="16"/>
        <property name="bigEndianData" value="false"/>
        <property name="signedData" value="true"/>
        <property name="bytesPerRead" value="320"/>
    </component>

    <component name="speechClassifier" type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier">
        <property name="threshold" value="12"/>
    </component>

    <component name="nonSpeechDataFilter" type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter">
    </component>

    <component name="speechMarker" type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker">
        <property name="speechTrailer" value="50"/>
    </component>

    <component name="premphasizer" type="edu.cmu.sphinx.frontend.filter.Preemphasizer">
        <property name="factor" value="0.9"/>
    </component>

    <component name="windower" type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower">
        <!-- <property name="windowSizeInMs" value="25"/> -->
    </component>

    <component name="fft" type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform"/>

    <component name="melFilterBank" type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
        <!-- <property name="numberFilters" value="40"/>
        <property name="minimumFrequency" value="130.0"/>
        <property name="maximumFrequency" value="6800.0"/> -->
    </component>

    <component name="dct" type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>

    <component name="liveCMN" type="edu.cmu.sphinx.frontend.feature.LiveCMN"/>

    <component name="batchCMN" type="edu.cmu.sphinx.frontend.feature.BatchCMN"/>

    <component name="featureExtraction" type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>
    
     
    • Chris Deering

      Chris Deering - 2009-05-21

      Been a while since I've looked at Sphinx, but I seem to remember that the audio used as your input has to be in the same format and quality as the audio with which the AM was trained. Although you are converting to the correct WAV format, the actual spectral audio features, which Sphinx relies on to identify speech components, were modified by the compression applied when the original recording was transcoded to FLV. Through this process, vital information is lost and cannot simply be recreated by transcoding or upsampling to the expected format.

      So in other words, if I record a WAV containing speech data, transcode it to MP3 or FLV or anything else that involves lossy compression, and then transcode back to the original WAV format, it is not exactly the same file. It has basically been reduced to the quality of the compressed format, and is therefore not going to work well with Sphinx, as you have found.

      See if you can get recordings from other sources which have not compressed the audio track. FM radio might be a good bet (record directly to WAV from an FM radio source).

      Regards,
      Chris
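      A rough way to check whether a converted file has lost high-frequency content is to measure how much of its spectral energy sits above some cutoff. The sketch below does this with a naive DFT over one short frame of samples. The function names, the 4 kHz cutoff and the synthetic test tones are illustrative assumptions and are not part of Sphinx; the acoustic model name in the configuration only suggests that the model expects features covering roughly 133 Hz to 6855 Hz.

```python
import cmath
import math


def high_band_energy_fraction(samples, sample_rate, cutoff_hz):
    """Share of spectral energy at or above cutoff_hz, using a naive DFT."""
    n = len(samples)
    total = 0.0
    high = 0.0
    # For a real signal only bins 1..n/2 are needed; bin 0 (DC) is skipped.
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        total += energy
        if k * sample_rate / n >= cutoff_hz:
            high += energy
    return high / total if total > 0 else 0.0


def tone(freq_hz, sample_rate, n):
    """Synthetic sine wave, used here only to exercise the function."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


if __name__ == "__main__":
    rate, n = 16000, 256
    low_only = tone(1000, rate, n)
    mixed = [a + b for a, b in zip(low_only, tone(6000, rate, n))]
    print(high_band_energy_fraction(low_only, rate, 4000))
    print(high_band_energy_fraction(mixed, rate, 4000))
```

      On real audio one would average this over many frames. A value near zero for a file that is nominally 16 kHz would hint that the upper band was removed somewhere in the conversion chain, although it does not prove that compression is the cause.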

       
    • boriguen

      boriguen - 2009-05-21

      Thanks for the reply. That's a part I wasn't really aware of. I will try to get some new data from sources you recommend.

      So whatever configuration I manage to have, I won't be able to get any interesting results with HUB4 for audio converted from videos retrieved from YouTube.

      If anyone has succeeded in getting interesting results decoding this way, please let me know.

      Regards,

      Boris.

       
      • Chris Deering

        Chris Deering - 2009-05-21

        For this you would need an AM trained using YouTube (FLV encoded) samples as far as I know.

         
    • Nickolay V. Shmyrev

      The accuracy should be low, but it should be around 50% I think, certainly not 0%. Are you sure you didn't make a mistake? For example, what is the sample rate of the samples you are decoding?

      Also make sure you are using the latest trunk.
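      To check the sample rate question, a small script can compare a WAV header against the values declared in the streamDataSource component above (16000 Hz, 16 bits per sample). This is a minimal sketch using Python's standard `wave` module; the function name `wav_format_mismatches` is made up for illustration, and the mono assumption is not stated in the configuration.

```python
import wave


def wav_format_mismatches(path, expected_rate=16000, expected_bits=16,
                          expected_channels=1):
    """Return a list of human-readable differences between the WAV header
    and the expected format. An empty list means the header matches."""
    with wave.open(path, "rb") as wav:
        actual_rate = wav.getframerate()
        actual_bits = wav.getsampwidth() * 8   # sample width is in bytes
        actual_channels = wav.getnchannels()

    mismatches = []
    if actual_rate != expected_rate:
        mismatches.append(
            "sample rate is %d Hz, expected %d Hz" % (actual_rate, expected_rate))
    if actual_bits != expected_bits:
        mismatches.append(
            "sample size is %d bits, expected %d bits" % (actual_bits, expected_bits))
    # Mono is an assumption here; the posted configuration does not say so.
    if actual_channels != expected_channels:
        mismatches.append(
            "%d channel(s), expected %d" % (actual_channels, expected_channels))
    return mismatches
```

      Calling `wav_format_mismatches("news.wav")` on the converted file (the file name is a placeholder) returns an empty list when the header matches. A matching header says nothing about the quality of the audio itself.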

       
