
Hub4 missing words

Help
2004-06-09
2012-09-22
  • Keith Ponting

    Keith Ponting - 2004-06-09

    Hi,

    I am trying to run the Hub4 system (tests/performance/hub4) using the trigram LM and find that some words in the LM are missing from the dictionary. I have checked "abscond" and "<unk>" against the dictionary in HUB4_8gau_13dCep_16k_40mel_133Hz_6855Hz.jar, and those two at least are indeed missing.

    The system then attempts to run recognition but, possibly as a consequence of the missing words, I get empty HYP output in a very short time!

         [java] 04:45.300 WARNING dictionary        Missing word: <unk>
         [java]                    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
         [java] 04:45.340 WARNING dictionary        Missing word: abidjan
         [java]                    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
         [java] 04:45.343 WARNING dictionary        Missing word: abimael
         [java]                    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
         [java] 04:45.344 WARNING dictionary        Missing word: abiquiu
         [java]                    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
         [java] 04:45.364 WARNING dictionary        Missing word: abridging
         [java]                    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
         [java] 04:45.366 WARNING dictionary        Missing word: abscond
         [java]                    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
         [java] 04:45.367 WARNING dictionary        Missing word: absconded
         [java]                    in edu.cmu.sphinx.linguist.dictionary.FastDictionary:getWord-dictionary
         [java] 04:45.367 WARNING dictionary        Missing word: absconding
    ...
         [java] 04:45.995 WARNING trigramModel      Dictionary is missing 711 words that are contained in the language model.

     
    • Paul Lamere

      Paul Lamere - 2004-06-09

      Keith:

      Thanks for using Sphinx-4! What you are seeing, the report of 711 missing words, is normal behavior for the hub4 test: the test is expected to report the missing words and then proceed with recognition. We see about an 18% WER with hub4. You can view the latest test results here:

      http://cmusphinx.sourceforge.net/LargeVocabResults.html

      The fact that you get an empty HYP in a very short time does indicate a problem. Are you running the test against the hub4 data? Are you running in live mode, or against some other data set?

       
    • Keith Ponting

      Keith Ponting - 2004-06-09

      Paul:

      Thanks very much -- I was using an internal data set, which I had massaged into big-endian 16 kHz form. I have just tried with the AN4 data and am getting sensible recognition results, so it's back to the data prep drawing board!
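For readers doing similar data prep, the byte-order part of that conversion can be sketched as below. This is a minimal illustration and not the script used in the post: it assumes headerless 16-bit PCM that starts out little-endian, and the class and method names are hypothetical. Resampling to 16 kHz is not covered.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class EndianSwap {
    /** Swaps each byte pair so 16-bit little-endian PCM becomes big-endian. */
    static byte[] littleToBigEndian16(byte[] le) {
        if (le.length % 2 != 0) {
            throw new IllegalArgumentException("16-bit PCM needs an even byte count");
        }
        byte[] be = new byte[le.length];
        for (int i = 0; i < le.length; i += 2) {
            be[i] = le[i + 1];     // high byte first
            be[i + 1] = le[i];     // low byte second
        }
        return be;
    }

    /** Decodes big-endian 16-bit PCM bytes into samples, for a sanity check. */
    static short[] decodeBigEndian16(byte[] be) {
        short[] samples = new short[be.length / 2];
        ByteBuffer.wrap(be).order(ByteOrder.BIG_ENDIAN).asShortBuffer().get(samples);
        return samples;
    }
}
```

Decoding a few samples after the swap and checking that they look like plausible audio values is a quick way to catch a wrong byte order.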

      P.S. I am impressed by the flexible Java/XML setup to put together your choice of recognizer "on the fly".

       
      • Paul Lamere

        Paul Lamere - 2004-06-09

        Thanks for the kudos ... have fun, and let us know how it works out for you.

        Paul

         
        • Keith Ponting

          Keith Ponting - 2004-06-09

          Curiouser and curiouser. I think I have finally pinned down my empty output. If I run using as input the file an4/an4_clstk/fash/an251-fash-b.raw, then I get the correct recognition of "yes". If I append one tenth of a second of zero waveform to that file, it behaves exactly as my data is behaving -- comes back in very quick time with empty recognition:
               [java] REF:       yes
               [java] HYP:       yes

               [java]    Accuracy: 100.000%    Errors: 0  (Sub: 0  Ins: 0  Del: 0)
               [java]    Words: 1   Matches: 1    WER: 0.000%
               [java]    Sentences: 1   Matches: 1   SentenceAcc: 100.000%
               [java]    This  Time Audio: 1.00s  Proc: 9.68s  Speed: 9.68 X real time
               [java]    Total Time Audio: 1.00s  Proc: 9.68s  Speed: 9.68 X real time
               [java]    Mem  Total: 379.75 Mb  Free: 164.26 Mb
               [java]    Used: This: 215.49 Mb  Avg: 215.49 Mb  Max: 215.49 Mb

               [java] REF:       yes
               [java] HYP:

               [java]    Accuracy: 50.000%    Errors: 1  (Sub: 0  Ins: 0  Del: 1)
               [java]    Words: 2   Matches: 1    WER: 50.000%
               [java]    Sentences: 2   Matches: 1   SentenceAcc: 50.000%
               [java]    This  Time Audio: 1.10s  Proc: 0.04s  Speed: 0.04 X real time
               [java]    Total Time Audio: 2.10s  Proc: 9.72s  Speed: 4.63 X real time
               [java]    Mem  Total: 379.75 Mb  Free: 152.62 Mb
               [java]    Used: This: 227.13 Mb  Avg: 221.31 Mb  Max: 227.13 Mb
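The experiment above (appending one tenth of a second of zeros to a raw file) could be reproduced with a sketch like the following. It assumes headerless 16-bit mono PCM at 16 kHz, so 0.1 s is 1600 samples or 3200 bytes; the class name, method name, and command-line arguments are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

final class AppendSilence {
    static final int SAMPLE_RATE_HZ = 16000;  // assumed sample rate
    static final int BYTES_PER_SAMPLE = 2;    // assumed 16-bit mono

    /** Returns a copy of pcm followed by the given duration of zero-valued samples. */
    static byte[] appendZeros(byte[] pcm, double seconds) {
        int extraSamples = (int) Math.round(seconds * SAMPLE_RATE_HZ);
        // Arrays.copyOf pads the new tail with zero bytes.
        return Arrays.copyOf(pcm, pcm.length + extraSamples * BYTES_PER_SAMPLE);
    }

    public static void main(String[] args) throws IOException {
        Path in = Paths.get(args[0]);   // e.g. the original .raw file
        Path out = Paths.get(args[1]);  // padded copy to feed to the test
        Files.write(out, appendZeros(Files.readAllBytes(in), 0.1));
    }
}
```

Because zero bytes are the same in either byte order, this padding step does not depend on the endianness of the file.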

          Keith

           
          • Anonymous

            Anonymous - 2004-06-09

            I suggest that the "one tenth of a second of zero waveform" may be the problem. In Sphinx2 and Sphinx3, the features are cepstra, whose computation involves taking the log of the power spectrum. If you feed it frames that are all zero, it causes overflow errors in the log computation. This not only messes up those frames, but it also writes very large numbers into the cepstral mean used for normalization, which will mess up subsequent speech frames as well.
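The general mechanism can be illustrated with a small sketch. It is not Sphinx code and makes no claim about how any Sphinx front end behaves: it uses the log of a frame's energy as a stand-in for a cepstral value, and a plain mean subtraction as a stand-in for cepstral mean normalization. In Java's IEEE arithmetic the log of zero is negative infinity rather than an overflow error, but the knock-on effect on the mean is analogous.

```java
final class LogZeroDemo {
    /** Log of the summed squared samples; -Infinity for an all-zero frame. */
    static double logEnergy(short[] frame) {
        double energy = 0.0;
        for (short s : frame) {
            energy += (double) s * s;
        }
        return Math.log(energy);
    }

    /** Subtracts the mean of all values from each value (toy normalization). */
    static double[] subtractMean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        double mean = sum / values.length;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] - mean;
        }
        return out;
    }

    public static void main(String[] args) {
        double speech = logEnergy(new short[] {100, -200, 300});
        double zeros = logEnergy(new short[] {0, 0, 0});
        double[] normalized = subtractMean(new double[] {speech, zeros});
        // The single all-zero frame drives the mean to -Infinity, so the
        // speech frame becomes +Infinity and the zero frame becomes NaN.
        System.out.println(normalized[0] + " " + normalized[1]);
    }
}
```

One degenerate frame is enough to make every normalized value unusable, which is consistent with the point that the damage is not confined to the zero frames themselves.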

            *I do not know* whether the Sphinx4 front end has a similar vulnerability to all-zero signals, but it may.  Try splicing some "actual silence" in the front instead of artificial silence.

            cheers,
              jerry wolf
              soliloquy learning, inc.
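Before splicing in recorded silence, it may help to confirm that a file actually contains runs of exactly zero samples. The sketch below is one hypothetical way to check; the class and method names are made up for illustration, and the frame length to test against (for example 400 samples, which is 25 ms at 16 kHz) is an assumption, not a value taken from any Sphinx configuration.

```java
final class ZeroRunCheck {
    /** Returns the length of the longest run of consecutive zero-valued samples. */
    static int longestZeroRun(short[] samples) {
        int longest = 0;
        int current = 0;
        for (short s : samples) {
            if (s == 0) {
                current++;
                if (current > longest) {
                    longest = current;
                }
            } else {
                current = 0;
            }
        }
        return longest;
    }

    /** True if some zero run is long enough to fill an entire analysis frame. */
    static boolean hasAllZeroFrame(short[] samples, int frameLength) {
        return longestZeroRun(samples) >= frameLength;
    }
}
```

Real recorded silence normally contains low-level noise, so a long run of exact zeros is a reasonable hint that the silence was synthesized.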

             
