Pointers on how to improve accuracy?

Forum: Help
Created: 2008-05-06
Last updated: 2012-09-22
  • Carl Fisher

    Carl Fisher - 2008-05-06

    Hi all,

    We have adapted the WavFile demo to transcribe WAV files that contain more than just digits, but our recognition accuracy seems poor, and we thought maybe you folks might see something obvious we could be doing to improve things.

    Some Background:
    We are doing a small experiment to see if we can transcribe recorded audio interviews. We created a short survey and then got a couple dozen volunteers to take it. We recorded each question to a WAV file sampled at 16 kHz, 16-bit mono, as suggested in the documentation.

    We then took the interview script, added a few words we knew would appear in responses, and used it as the knowledge base for the LM tool to create new .LM and .DIC files.

    We actually replaced the WavFile config.xml with another we found on this forum, and modified it to reference the new LM and DIC files.

    Our Results:
    For a representative WAV file (see link below), the spoken words were

    "To categorize the voice characteristics of the audio recordings we need to collect some data about me as the interviewer and you as the respondent. For example, my gender is male."

    The transcription returns

    RESULT: eight and write voice characteristics the audio record when need collect need of of need

    While it does seem to have recognized a fair number of the words, it's still pretty far from being a useful transcription.

    I've included what I think are the relevant files here as links, for ease in reading this post. Let me know if you'd prefer to have them included inline, or if there's anything else you'd like to see.

    Config.xml:
    http://www.mediafire.com/?nbtzzhxh039

    Quex.dic:
    http://www.mediafire.com/?0ttydtajedx

    Quex.lm:
    http://www.mediafire.com/?fxrzj44yv3n

    Quex.sent:
    http://www.mediafire.com/?bne5zmgmgxe

    Sample WAV file produced in the interview:
    http://www.mediafire.com/?xunghzxt37w

    Same file with leading and trailing junk trimmed off:
    http://www.mediafire.com/?dfx4mc1m02d

    Any help would be appreciated!
    Carl

     
    • Carl Fisher

      Carl Fisher - 2008-07-09

      Nickolay,

      On your suggestion I tried modifying the language model we were using. It had used very long sentences, which you implied were causing problems (or at least that's how I interpreted it). So I took the corpus and split it at what seemed like logical phrasing breaks (natural pauses). Then I repeated some words that appeared more frequently in the recordings than in the script (in an attempt to "favor" them as a translation), and I also added a few other words that weren't in the script but appeared frequently in people's free-form responses. A sketch of the resulting corpus is below.
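
      (For concreteness, a hypothetical sketch of what the split corpus fed to lmtool might look like; these lines are illustrative only, based on the script excerpt quoted elsewhere in this thread, with one line repeated to up-weight its words:)

          to categorize the voice characteristics of the audio recordings
          we need to collect some data
          about me as the interviewer and you as the respondent
          for example my gender is male
          for example my gender is male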

      I used lmtool to build a new LM and dict and re-ran my trial. Some transcriptions were better, but some were worse. I don't understand why they would be worse, and I'd really like to get the feeling that my work is taking me somewhere; that is, that I am converging toward maximum accuracy.

      Can you provide me any insight into why these included transcriptions would be worse than in the previous trial? In fairness, I've included only a few cases where the transcription was worse; there were also quite a few that showed improvement.

      If you have any questions, please ask. I put all the files in a ZIP and uploaded it to:

      http://www.mediafire.com/?bambm2s5wev

      Thanks!
      Carl

       
      • Nickolay V. Shmyrev

        Hm, I checked this; sorry for the delay. Actually, I'd recommend collecting more test data, at least around 50 files, to get more or less significant statistics.

        As for quality, it recognizes everything quite well. You probably just need to add more sentences to the LM (there is no "for example" in a proper position there) and use more pronunciation variants in the dictionary. Try adding:

        EXAMPLE(2) IH G Z AE M P L

        in the proper place in the dictionary. Then it will decode the first sentence better:

        WE NEED TO CATEGORIZE THE LOUDNESS OF OUR VOICES FOR EXAMPLE I I AM SAYING THIS ITEM IN A TONE OF VOICE THAT IS MEDIUM
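
        (A variant entry like that sits right below the base entry in the .dic file; assuming the stock CMUdict pronunciation for the base word, that part of quex.dic would look something like this:)

            EXAMPLE      IH G Z AE M P AH L
            EXAMPLE(2)   IH G Z AE M P L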

         
    • Nickolay V. Shmyrev

      Hm, in your case the beams are too narrow; you need wider ones (around 1e-120). Also, with such bad audio quality it will be very hard to recognize accurately. Check your file: it has an enormous DC offset. Could you record with a better microphone?
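
      (In config.xml terms that would be something like the following properties; these exact names and values appear in the config Carl posts later in this thread:)

          <property name="relativeBeamWidth" value="1E-120"/>
          <property name="relativeWordBeamWidth" value="1E-60"/>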

      Heh, and to be honest I don't know what to do with sphinx4 to make it work. WSJ unfortunately doesn't perform well. But sphinx3 with hub4 works quite well:

      sphinx3_decode \
          -adcin yes \
          -cepext .wav \
          -cepdir . \
          -ctl test.ctl \
          -dict quex.dic \
          -fdict filler.dict \
          -remove_dc no \
          -lm quex.lm \
          -hmm /home/shmyrev/local/share/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd

      FWDVIT: TO CATEGORIZE THE VOICE CHARACTERISTICS OF THE AUDIO RECORDINGS
      WE NEED TO COLLECT SOME DATA ABOUT ME AS THE INTERVIEWER AND YOU AS THE
      RESPONDENT FOR EXAMPLE MY GENDER IS MALE (test)

      WSJ results are a little worse:

      sphinx3_decode \
          -adcin yes \
          -cepext .wav \
          -cepdir . \
          -ctl test.ctl \
          -dict quex.dic \
          -fdict filler.dict \
          -remove_dc no \
          -lm quex.lm \
          -hmm wsj_all_cd30.mllt_cd_cont_4000 \
          -subvq wsj_all_cd30.mllt_cd_cont_4000/subvq \
          -beam 1e-80 \
          -wbeam 1e-60 \
          -subvqbeam 1e-2 \
          -maxhmmpf 2500 \
          -maxcdsenpf 1500 \
          -maxwpf 20

      FWDVIT: A TO CATEGORIZE THE VOICE CHARACTERISTICS OF THE AUDIO RECORD THE WE NEED TO COLLECT SOME DATA ABOUT ME AS THE INTERVIEWER AND YOU AS THE REGION A FOR EXAMPLE MY GENDER IS A MALE A (test)

      But I don't know how to recognize this properly with sphinx4.

       
      • Carl Fisher

        Carl Fisher - 2008-05-06

        Wow- thanks for the fast and great response, Nickolay!

        The audio was picked up with the built-in mic on a standard-issue (Windows) IBM laptop, and it is our hope to be able to do the interviews with no other special equipment. Once we have the files off the laptop, though, can you recommend any tools (and, helpfully, the adjustments to make to the files with those tools) that we could apply post-collection and pre-recognition? Uh, and what does DC mean, the thing that's so huge in the WAV file (sorry!)? I will make the adjustments to the beams you mention, and I'll try using HUB4 instead of WSJ. I am very impressed with the accuracy you got with Sphinx-3! Unfortunately, I don't think we have any *nix boxes around here (but I will check), so we may be stuck with Sphinx-4...

        Can you suggest any resources that might help with improving Sphinx-4 accuracy? My naive impression was that 3 and 4 were basically the same program, so it should at least be possible to duplicate your impressive results?

        Thanks again,
        Carl

         
    • Nickolay V. Shmyrev

      Well, I looked closer at the sphinx4 case and now I understand the
      reasons. As usual they are rather complicated, but let me explain.

      First of all, about DC. Google for "DC offset"; you'll find a lot of
      pictures. Basically, a waveform is a function moving around zero. In the
      perfect case, silence has zero values. But due to hardware problems this
      function is sometimes shifted by a constant value: silences are not zero
      regions but regions of some positive or negative value. Check your file
      with a wav editor and you'll see that. Usually DC is not a problem, since
      you can easily remove it by subtracting the average or with a one-pole
      filter, but in your case it causes problems.
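
      (For illustration, a minimal standalone Java sketch of the
      subtract-the-average approach; this is not part of Sphinx, and it
      assumes a 16-bit signed little-endian mono WAV like the ones in this
      thread:)

          // Hypothetical helper, not part of Sphinx: removes the DC offset
          // from a 16-bit signed little-endian mono WAV by subtracting the
          // mean sample value from every sample.
          import javax.sound.sampled.*;
          import java.io.*;

          public class RemoveDc {
              public static void main(String[] args) throws Exception {
                  AudioInputStream in =
                          AudioSystem.getAudioInputStream(new File(args[0]));
                  AudioFormat fmt = in.getFormat();
                  byte[] b = in.readAllBytes();
                  int n = b.length / 2;                   // 16-bit mono samples
                  long sum = 0;
                  for (int i = 0; i < n; i++) {
                      sum += (short) ((b[2 * i] & 0xff) | (b[2 * i + 1] << 8));
                  }
                  int mean = (int) (sum / n);             // the DC component
                  for (int i = 0; i < n; i++) {
                      int v = (short) ((b[2 * i] & 0xff) | (b[2 * i + 1] << 8)) - mean;
                      v = Math.max(Short.MIN_VALUE, Math.min(Short.MAX_VALUE, v));
                      b[2 * i] = (byte) v;                // low byte
                      b[2 * i + 1] = (byte) (v >> 8);     // high byte
                  }
                  AudioSystem.write(
                          new AudioInputStream(new ByteArrayInputStream(b), fmt, n),
                          AudioFileFormat.Type.WAVE, new File(args[1]));
              }
          }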

      Now let's look at the frontend configuration:

          <propertylist name="pipeline">
              <item>streamDataSource </item>
              <item>speechClassifier </item>
              <item>speechMarker </item>
              <item>nonSpeechDataFilter </item>
              <item>premphasizer </item>
              <item>windower </item>
              <item>fft </item>
              <item>melFilterBank </item>
              <item>dct </item>
              <item>BatchCMN </item>
              <item>featureExtraction </item>
          </propertylist>
      

      The first part (speechClassifier, speechMarker, and nonSpeechDataFilter)
      builds the so-called endpointer. Speech is a continuous stream with
      pauses. The decoder can't decode big chunks at once; it usually splits
      big chunks into smaller ones at the long pauses. In your case the
      endpointer is not correct because of the DC offset: it just can't detect
      the silence properly. Once you remove the DC it will be more stable, but
      another problem will appear: your language model. It's too small to cover
      all the variants of small chunks. For example, if we split your recording
      at the pauses, we get something like:

      TO CATEGORIZE THE VOICE CHARACTERISTICS OF THE AUDIO RECORDINGS
      WE NEED TO COLLECT SOME DATA ABOUT ME AS THE INTERVIEWER AND YOU AS THE RESPONDENT
      FOR EXAMPLE MY GENDER
      MALE

      But the problem is that your language model doesn't cover such chunks;
      it wasn't trained on such text. Instead it is only suitable for
      recognizing one very big chunk. You need a better language model.

      Now, let's try to remove the endpointer so we can reuse the LM
      (sphinx3_decode also has no endpointer; it decodes the phrase at once)
      by using the following frontend. Alternatively, you can try to set the
      property <property name="mergeSpeechSegments" value="true"/> on
      NonSpeechDataFilter, but since the speech detector doesn't work it won't
      help.

          <propertylist name="pipeline">
              <item>streamDataSource </item>
              <item>premphasizer </item>
              <item>windower </item>
              <item>fft </item>
              <item>melFilterBank </item>
              <item>dct </item>
              <item>BatchCMN </item>
              <item>featureExtraction </item>
          </propertylist>
      

      Ok, now we have the better utterance in one big chunk that is suitable
      for decoding with our LM, but there is another problem: there are big
      stretches of silence in the utterance. To let the decoder find them, you
      have to build a special dictionary that sphinx4 won't build by default
      (sphinx3 will):

      <component name="dictionary" type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
          <property name="dictionaryPath" value="resource:/demo.sphinx.wavfile.WavFile!/demo/sphinx/wavfile/quex.dic"/>
          <property name="fillerPath" value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/fillerdict"/>
          <property name="addSilEndingPronunciation" value="true"/>
          <property name="unitManager" value="unitManager"/>
      </component>
      

      addSilEndingPronunciation=true is important here. Once you do the above,
      it will give you acceptable results:

      RESULT: to categorize the voice characteristics of the audio recordings
      we need to collect some data of need the an years own for am will my
      gender is male in

      Still, sphinx4 doesn't perform well on utterances with a big silence
      part like yours; sphinx3 is better. For more info, read the Javadoc on
      all of this. Basically, I'd suggest you:

      1) Get a better microphone.
      2) Enable the endpointer but get a better language model (sphinx3 will
      also require that once you decode very big utterances).
      3) Use sphinx3; it's available on Windows either as a native binary or
      under cygwin.

       
    • Carl Fisher

      Carl Fisher - 2008-05-12

      Thanks for your help, Nickolay. I haven't figured out how to use the HUB4 model yet, but I made the changes you suggested to my config file (pasted below) and used Audacity (another SourceForge project) to make the following edits to the WAV file:
      - trimmed leading and trailing silence
      - applied low pass filter at 5kHz
      - applied hi pass filter at 800Hz
      - normalized resulting file to -3dB peak level

      The translation Sphinx4 came back with was:

      RESULT: categorize voice characteristics during record continue collect need to need being please own for can my gender is male

      I've tried various combinations of mods to the WAV file (including cutting out long pauses within the file) and played around with the beams a bit in config.xml, but I have not stumbled upon a combination that yields better results than what's shown above.

      Nickolay, can I ask what you did differently from me to produce such better results, even with Sphinx4? If part of it was using HUB4, could you provide an example of a config that uses it? I'd like to get some useful results, but I'm feeling a bit frustrated, and what documentation I can find does not seem to provide the answers I'm looking for.

      Thanks again for the help!
      Carl

      Modified WAV file:
      http://www.mediafire.com/?mlmrr2dxxxe

      Config file with mods to beam, plus front end and dictionary section changes:
      <?xml version="1.0" encoding="UTF-8"?>

      <!--
      Sphinx-4 Configuration file
      -->

      <!-- ******** -->
      <!-- an4 configuration file -->
      <!-- ******** -->

      <config>
      <!-- ******** -->
      <!-- frequently tuned properties -->
      <!-- ******** -->
      <property name="absoluteBeamWidth" value="1500"/>
      <property name="relativeBeamWidth" value="1E-120"/>
      <property name="absoluteWordBeamWidth" value="20"/>
      <property name="relativeWordBeamWidth" value="1E-60"/>
      <property name="wordInsertionProbability" value="1E-16"/>
      <property name="languageWeight" value="7.0"/>
      <property name="silenceInsertionProbability" value=".1"/>
      <property name="frontend" value="epFrontEnd"/>
      <property name="recognizer" value="recognizer"/>
      <property name="showCreations" value="false"/>

      <!-- ******************************************************** -->
      <!-- word recognizer configuration                            -->
      <!-- ******************************************************** -->

      <component name="recognizer"
                 type="edu.cmu.sphinx.recognizer.Recognizer">
          <property name="decoder" value="decoder"/>
          <propertylist name="monitors">
              <item>speedTracker </item>
              <item>memoryTracker </item>
          </propertylist>
      </component>

      <!-- ******************************************************** -->
      <!-- The Decoder configuration                                -->
      <!-- ******************************************************** -->

      <component name="decoder" type="edu.cmu.sphinx.decoder.Decoder">
          <property name="searchManager" value="wordPruningSearchManager"/>
          <property name="featureBlockSize" value="50"/>
      </component>

      <!-- ******************************************************** -->
      <!-- The Search Manager                                       -->
      <!-- ******************************************************** -->

      <component name="wordPruningSearchManager"
                 type="edu.cmu.sphinx.decoder.search.WordPruningBreadthFirstSearchManager">
          <property name="logMath" value="logMath"/>
          <property name="linguist" value="lexTreeLinguist"/>
          <property name="pruner" value="trivialPruner"/>
          <property name="scorer" value="threadedScorer"/>
          <property name="activeListManager" value="activeListManager"/>
          <property name="growSkipInterval" value="0"/>
          <property name="checkStateOrder" value="false"/>
          <property name="buildWordLattice" value="false"/>
          <property name="acousticLookaheadFrames" value="1.7"/>
          <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
      </component>

      <!-- ******************************************************** -->
      <!-- The Active Lists                                         -->
      <!-- ******************************************************** -->

      <component name="activeListManager"
                 type="edu.cmu.sphinx.decoder.search.SimpleActiveListManager">
          <propertylist name="activeListFactories">
              <item>standardActiveListFactory</item>
              <item>wordActiveListFactory</item>
              <item>wordActiveListFactory</item>
              <item>standardActiveListFactory</item>
              <item>standardActiveListFactory</item>
              <item>standardActiveListFactory</item>
          </propertylist>
      </component>

      <component name="standardActiveListFactory"
                 type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
          <property name="logMath" value="logMath"/>
          <property name="absoluteBeamWidth" value="${absoluteBeamWidth}"/>
          <property name="relativeBeamWidth" value="${relativeBeamWidth}"/>
      </component>

      <component name="wordActiveListFactory"
                 type="edu.cmu.sphinx.decoder.search.PartitionActiveListFactory">
          <property name="logMath" value="logMath"/>
          <property name="absoluteBeamWidth" value="${absoluteWordBeamWidth}"/>
          <property name="relativeBeamWidth" value="${relativeWordBeamWidth}"/>
      </component>

      <!-- ******************************************************** -->
      <!-- The Pruner                                               -->
      <!-- ******************************************************** -->
      <component name="trivialPruner"
                 type="edu.cmu.sphinx.decoder.pruner.SimplePruner"/>

      <!-- ******************************************************** -->
      <!-- The Scorer                                               -->
      <!-- ******************************************************** -->
      <component name="threadedScorer"
                 type="edu.cmu.sphinx.decoder.scorer.ThreadedAcousticScorer">
          <property name="frontend" value="${frontend}"/>
          <property name="isCpuRelative" value="true"/>
          <property name="numThreads" value="0"/>
          <property name="minScoreablesPerThread" value="10"/>
          <property name="scoreablesKeepFeature" value="true"/>
      </component>

      <!-- ******************************************************** -->
      <!-- The linguist configuration                               -->
      <!-- ******************************************************** -->

      <component name="lexTreeLinguist"
                 type="edu.cmu.sphinx.linguist.lextree.LexTreeLinguist">
          <property name="logMath" value="logMath"/>
          <property name="acousticModel" value="wsj"/>
          <property name="languageModel" value="trigramModel"/>
          <property name="dictionary" value="dictionary"/>
          <property name="addFillerWords" value="false"/>
          <property name="fillerInsertionProbability" value="1E-10"/>
          <property name="generateUnitStates" value="false"/>
          <property name="wantUnigramSmear" value="true"/>
          <property name="unigramSmearWeight" value="1"/>
          <property name="wordInsertionProbability"
                    value="${wordInsertionProbability}"/>
          <property name="silenceInsertionProbability"
                    value="${silenceInsertionProbability}"/>
          <property name="languageWeight" value="${languageWeight}"/>
          <property name="unitManager" value="unitManager"/>
      </component>

      <!-- ******************************************************** -->
      <!-- The Dictionary configuration                             -->
      <!-- ******************************************************** -->
      <!--component name="dictionary"
          type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
          <property name="dictionaryPath"
                    value="resource:/demo.sphinx.wavfile.WavFile!/demo/sphinx/wavfile/quex.dic"/>
          <property name="fillerPath"
                    value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/fillerdict"/>
          <property name="addSilEndingPronunciation" value="false"/>
          <property name="wordReplacement" value="&lt;sil&gt;"/>
          <property name="unitManager" value="unitManager"/>
      </component-->
      <component name="dictionary"
          type="edu.cmu.sphinx.linguist.dictionary.FastDictionary">
          <property name="dictionaryPath"
                    value="resource:/demo.sphinx.wavfile.WavFile!/demo/sphinx/wavfile/quex.dic"/>
          <property name="fillerPath"
                    value="resource:/edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model!/edu/cmu/sphinx/model/acoustic/WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz/dict/fillerdict"/>
          <property name="addSilEndingPronunciation" value="true"/>
          <property name="unitManager" value="unitManager"/>
      </component>

      <!-- ******************************************************** -->
      <!-- The Language Model configuration                         -->
      <!-- ******************************************************** -->
      <component name="trigramModel"
          type="edu.cmu.sphinx.linguist.language.ngram.SimpleNGramModel">
          <property name="location"
              value="resource:/demo.sphinx.wavfile.WavFile!/demo/sphinx/wavfile/quex.lm"/>
          <property name="logMath" value="logMath"/>
          <property name="dictionary" value="dictionary"/>
          <property name="maxDepth" value="3"/>
          <property name="unigramWeight" value=".7"/>
      </component>

      <!-- ******************************************************** -->
      <!-- The acoustic model configuration                         -->
      <!-- ******************************************************** -->
      <component name="wsj"
                 type="edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model">
          <property name="loader" value="wsjLoader"/>
          <property name="unitManager" value="unitManager"/>
      </component>

      <component name="wsjLoader" type="edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.ModelLoader">
          <property name="logMath" value="logMath"/>
          <property name="unitManager" value="unitManager"/>
      </component>

      <!-- ******************************************************** -->
      <!-- The unit manager configuration                           -->
      <!-- ******************************************************** -->

      <component name="unitManager"
          type="edu.cmu.sphinx.linguist.acoustic.UnitManager"/>

      <!-- ******************************************************** -->
      <!-- The live frontend configuration                          -->
      <!-- ******************************************************** -->
      <component name="epFrontEnd" type="edu.cmu.sphinx.frontend.FrontEnd">
          <propertylist name="pipeline">
              <item>streamDataSource </item>
              <!--item>speechClassifier </item-->
              <!--item>speechMarker </item-->
              <!--item>nonSpeechDataFilter </item-->
              <item>premphasizer </item>
              <item>windower </item>
              <item>fft </item>
              <item>melFilterBank </item>
              <item>dct </item>
              <item>BatchCMN </item>
              <item>featureExtraction </item>
          </propertylist>
      </component>

      <!-- ******************************************************** -->
      <!-- The frontend pipelines                                   -->
      <!-- ******************************************************** -->

      <component name="streamDataSource"
                 type="edu.cmu.sphinx.frontend.util.StreamDataSource">
          <property name="sampleRate" value="16000"/>
          <property name="bitsPerSample" value="16"/>
          <property name="bigEndianData" value="false"/>
          <property name="signedData" value="true"/>
          <property name="bytesPerRead" value="320"/>
      </component>

      <component name="speechClassifier"
                 type="edu.cmu.sphinx.frontend.endpoint.SpeechClassifier">
          <property name="threshold" value="13"/>
      </component>

      <component name="nonSpeechDataFilter"
                 type="edu.cmu.sphinx.frontend.endpoint.NonSpeechDataFilter"/>

      <component name="speechMarker"
                 type="edu.cmu.sphinx.frontend.endpoint.SpeechMarker">
          <property name="speechTrailer" value="50"/>
      </component>

      <component name="premphasizer"
                 type="edu.cmu.sphinx.frontend.filter.Preemphasizer"/>

      <component name="windower"
                 type="edu.cmu.sphinx.frontend.window.RaisedCosineWindower">
      </component>

      <component name="fft"
                 type="edu.cmu.sphinx.frontend.transform.DiscreteFourierTransform">
      </component>

      <component name="melFilterBank"
                 type="edu.cmu.sphinx.frontend.frequencywarp.MelFrequencyFilterBank">
      </component>

      <component name="dct"
                 type="edu.cmu.sphinx.frontend.transform.DiscreteCosineTransform"/>

      <component name="BatchCMN"
                 type="edu.cmu.sphinx.frontend.feature.BatchCMN"/>

      <component name="featureExtraction"
                 type="edu.cmu.sphinx.frontend.feature.DeltasFeatureExtractor"/>

      <!-- ******************************************************* -->
      <!--  monitors                                               -->
      <!-- ******************************************************* -->

      <component name="memoryTracker"
                 type="edu.cmu.sphinx.instrumentation.MemoryTracker">
          <property name="recognizer" value="${recognizer}"/>
          <property name="showSummary" value="false"/>
          <property name="showDetails" value="false"/>
      </component>

      <component name="speedTracker"
                 type="edu.cmu.sphinx.instrumentation.SpeedTracker">
          <property name="recognizer" value="${recognizer}"/>
          <property name="frontend" value="${frontend}"/>
          <property name="showSummary" value="true"/>
          <property name="showDetails" value="false"/>
      </component>

      <!-- ******************************************************* -->
      <!--  Miscellaneous components                               --> 
      <!-- ******************************************************* -->

      <component name="logMath" type="edu.cmu.sphinx.util.LogMath">
          <property name="logBase" value="1.0001"/>
          <property name="useAddTable" value="true"/>
      </component>

      </config>

       
      • Nickolay V. Shmyrev

        > made the following edits to the WAV file:
        > - trimmed leading and trailing silence
        > - applied low pass filter at 5kHz
        > - applied hi pass filter at 800Hz
        > - normalized resulting file to -3dB peak level

        This is a step in completely the opposite direction. The model extracts cepstra over the 130 to 6800 Hz band (it's right in the acoustic model's name: WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz); filtering inside that band will make recognition much more unstable.

        Check my files instead:

        http://www.mediafire.com/?lxj52ncnyad

         
    • Carl Fisher

      Carl Fisher - 2008-06-11

      Thanks, Nickolay; I've been working on something else and missed your post. So I should not have done the filtering, but normalizing and removing the leading and trailing silence would be good steps, would they not?

      I'm also interested in trying Sphinx3 out for this, but I didn't see a binary distribution anywhere under http://cmusphinx.sourceforge.net/html/download.php. Can you provide me a link? (Sorry for the stupid question...)

      Thanks!
      Carl

       
      • Nickolay V. Shmyrev

        > So I should not have done the filtering, but normalizing and removing the leading and trailing silence would be good steps.

        Neither of them is a good step.

        > I'm also interested in trying Sphinx3 out for this

        Hm, sorry, you have to compile it yourself. It's a matter of 5 minutes.

         
    • David Huggins-Daines

      Hmm, I thought this was fixed in Sphinx 3.7, but apparently not.

      Sphinx3 has a stupid "feature" which, for no good reason, makes it refuse to load text-format language models unless you also add '-lminmemory 1' to your command line or configuration file.
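
      (For example, added to the hub4 command Nickolay gave earlier in the thread; everything except the new flag is unchanged:)

          sphinx3_decode \
              -lminmemory 1 \
              -adcin yes \
              -cepext .wav \
              -cepdir . \
              -ctl test.ctl \
              -dict quex.dic \
              -fdict filler.dict \
              -remove_dc no \
              -lm quex.lm \
              -hmm /home/shmyrev/local/share/sphinx3/model/hmm/hub4_cd_continuous_8gau_1s_c_d_dd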

       
      • Carl Fisher

        Carl Fisher - 2008-06-17

        Aha, adding '-lminmemory 1' to the command line did it. Thanks!

        Now I have (I hope) a very simple Sphinx3 question. When I run sphinx3_decode on a single WAV file, using the params that Nickolay volunteered above, I get about 1000 lines of output, of which only 2 are the transcription. Is there a switch to reduce the number of info messages produced, or alternatively to output the result (and only the result) to a separate destination?

        Thanks again,
        Carl

         
        • Nickolay V. Shmyrev

          -hyp test.result ?

           
          • Carl Fisher

            Carl Fisher - 2008-06-18

            That (-hyp <filename>) did the trick, thanks Nickolay! BTW, recognition with Sphinx 3 is much better than the best we'd achieved with Sphinx 4 so far, at least for our application. Next steps, I guess, will be:

            1. Fiddle with tuning params looking for overall improvement

            2. Try to build a better language model, based on all of our recordings.

            Thanks again,
            Carl

             
            • Nickolay V. Shmyrev

              We really need a proper comparison of the sphinxes to make them work. It seems that they all have problems sometimes, but the problems aren't easily detectable.

              As for your two tasks, 2) is much more important.

               
    • Carl Fisher

      Carl Fisher - 2008-07-15

      Hi Nickolay,

      I know you're in major demand here, but did you get a chance to look at my post of 7/9 in this thread? Can you tell if I did something wrong, or if not, what I should do next?

      Let me know if you have any problems accessing the files I provided.

      Thanks much!
      Carl

       
