
question about Sphinx4 benchmarks

2004-06-30
2012-09-21
  • Val Veattie

    Val Veattie - 2004-06-30

    Hi,
The Sphinx4 benchmarking tables are very thorough - but they lead to this question: why is Sphinx4 so much slower than Sphinx 3.3 on a Pentium, as opposed to an UltraSparc? Is that simply a Java issue?  How would it change using Windows rather than Linux on a Pentium?

    thanks very much!

    Val Beattie

     
    • Paul Lamere

      Paul Lamere - 2004-06-30

      Val:

      Thanks for your interest in S4 and especially for your interesting question.  There are a number of reasons why the numbers look as they do:

1) We've recently completed a major performance push. In this push we've concentrated on improving the performance of certain tests (such as tidigits_wordlist, rm1_trigram and hub4_trigram).  A number of tests just haven't been tuned properly. It would not take much to go back, revisit all of the tests, and re-tune and optimize them based upon our better knowledge of how to tune the system; we just have not had time. So, depending upon the test that you look at, you may not be seeing S4 at its best.

2) We show results for a Pentium system called 'mickey'. Mickey was retired 6 months ago, before much of our performance work was complete; as such, it shows much slower numbers than current S4 performance.

Let's take a look at rm1 performance on 'george'. rm1 is a 1,000-word-vocabulary n-gram recognition task. This test has been properly tuned and has been run recently on all test systems.

If we take a look at the medium vocabulary results at:

http://cmusphinx.sourceforge.net/MediumVocabResults.html

and scroll down to the latest results for 'george' (a 2.2 GHz P4 with 900 MB of RAM), we see the following for the rm1_trigram test:

      RT
s3    0.64
s3.3  0.52
s4    0.34

This shows S4 running quite a bit faster than S3.3 on a 1,000-word vocabulary.
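For reference, the RT column is the real-time factor: decoding time divided by the duration of the audio, so values below 1.0 mean faster than real time. A minimal sketch (the timings here are made up to match the table, not taken from the benchmark logs):

```python
def real_time_factor(decode_seconds, audio_seconds):
    """xRT: how long decoding took relative to the length of the audio."""
    return decode_seconds / audio_seconds

# Hypothetical example: 34 s to decode 100 s of audio -> 0.34 xRT,
# i.e. about 3x faster than real time.
print(real_time_factor(34.0, 100.0))  # 0.34
```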

My experience is that for tasks of up to 5,000 to 10,000 words, S4 should give as good or better results than S3.3.

      If you are interested in the results for a particular test, let me know which test and I'll try to reconcile the numbers.

And yes ... we should clean this up a bit. It would be nice to have all of the tests tuned to show S4 at its best, but we have such a large number of tests that we'd spend all our time doing that. Sigh ...

      Hope this helps ...

      Paul

       
      • Val Veattie

        Val Veattie - 2004-06-30

        Paul -
          That was quick!  Thanks for the clarification, that does help - I was wondering if it might be a tuning issue.  One follow-on question - is there somewhere where the benchmark/regression tests are described in more detail?  I can figure out the task and ngram part, but what is meant by fst, quick, flat . . .
           thanks,
        Val

         
        • Paul Lamere

          Paul Lamere - 2004-06-30

          Val:

          There are some details on the S4 twiki. Some places to start:

          http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/RegressionTests
          and
          http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/RegressionTestReview

          This may not have all the info you need.  So here's a quick glossary:

xxx_quick:   a quick version of the test. Sometimes we have more test data than we want to run; for instance, tidigits has 8700 utterances, so the quick version only runs every 5th utterance, leaving 1740 utterances. The overall regression test then takes one fifth of the time to run.
xxx_fst:  uses a finite state transducer to define the grammar. You can read more about FSTs here:

          http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/FstGrammar

flat_unigram: all words have an equal probability of occurring
unigram: each word can have a custom probability of occurring
bigram: the probability of word Y depends on the previous word X
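The unigram/bigram distinction can be made concrete with maximum-likelihood estimates from counts. A generic sketch on a toy corpus (this is the standard textbook formulation, not Sphinx4's actual language-model code):

```python
from collections import Counter

def unigram_probs(words):
    """P(w) = count(w) / total words. (A flat unigram would instead
    assign every vocabulary word the same probability, 1 / vocab size.)"""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def bigram_prob(words, prev, word):
    """P(word | prev) = count(prev, word) / count(prev as a predecessor)."""
    pairs = Counter(zip(words, words[1:]))
    prev_count = Counter(words[:-1])[prev]
    return pairs[(prev, word)] / prev_count

corpus = "one two one two one".split()
print(unigram_probs(corpus)["one"])       # 0.6 (3 of the 5 words)
print(bigram_prob(corpus, "one", "two"))  # 1.0 (every "one" is followed by "two")
```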

          xxx_jsgf: Uses a JSGF grammar to define the language model
          xxx_wordlist: Uses a wordlist grammar to define the language model
          xxx_rejection: a test for rejection accuracy

          ti46 - isolated digits
          tidigits - connected digits
          an4_full - 100 words
          an4_spelling - just the spelling (a through z) words of an4_full
          an4_words - just the non-spelling words of an4_full
          rm1 - 1000 word vocabulary
          wsj5k - 5,000 word vocabulary
wsj20k - 20,000 word vocabulary
          hub4 - 60K vocabulary

So, for instance, rm1_unigram_fst_quick is the rm1 data set where the language model is a unigram defined by an FST, and only every 5th utterance is decoded.
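In other words, a test name reads as dataset + language model + optional modifiers. A small illustrative parser of that convention (the scheme is inferred from the examples above, not an official spec, and it doesn't handle multi-part dataset names like an4_full):

```python
def parse_test_name(name):
    """Split a regression-test name like 'rm1_unigram_fst_quick'
    into (dataset, language-model parts, modifiers)."""
    parts = name.split("_")
    dataset = parts[0]
    # Suffixes that modify how the test runs rather than what LM it uses
    modifiers = [p for p in parts[1:] if p in ("fst", "quick", "rejection")]
    lm = [p for p in parts[1:] if p not in modifiers]
    return dataset, lm, modifiers

print(parse_test_name("rm1_unigram_fst_quick"))
# ('rm1', ['unigram'], ['fst', 'quick'])
```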

          HTH

          Paul

           
          • Val Veattie

            Val Veattie - 2004-07-01

            Thanks for the key to the benchmarks, that helps a lot.  I have one more question, don't know if you'll be able to help me with this but my other lead fizzled out.

            I'm currently working on software that uses Sphinx2 and we are looking at a possible move to Sphinx4.   The software requires real-time recognition performance, on fairly low-end PCs.  In terms of benchmarks, the missing link for us is determining how Sphinx4 (or Sphinx3, or Sphinx3.3 for that matter) compares to Sphinx2 on small to medium vocabulary tasks.  Do you have any insight into that?

            Thanks again,

            Val

             
    • Bhiksha Raj

      Bhiksha Raj - 2004-07-01

We don't have Sphinx4 vs. Sphinx2 comparisons, but we do have Sphinx3.3 vs. Sphinx2 comparisons (although not on the twiki pages), and from there one can extrapolate.

Sphinx2 is a semi-continuous HMM-based system which uses a shared set of 256 Gaussians for all tied states. It has a parameter called "-top", which determines how many of those Gaussians are scored for any tied state.
When this is set to 1, on a vocabulary of about 10,000 words, it runs in less than half the time that Sphinx3.3 needs, in less than half the memory. The accuracy, however, is also about 10-30% worse (in an absolute sense).
If you set -top to 4, the accuracy catches up (it remains about 3-5% worse), but so do the xRT and the memory - it takes about as much time and memory as Sphinx3.3.

So if your current setup uses -top 1 (which is the default in the open source Sphinx package), you can expect to see a significant hit in xRT, but a significant improvement in recognition performance, in going to S4.
If you're using -top 4, then you shouldn't see much difference in xRT and some improvement in WER.
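The effect of -top can be sketched as follows: each frame is scored against the shared Gaussian codebook once, and only the N best-scoring Gaussians contribute to any state's mixture score. This is a simplified 1-D illustration with made-up numbers, not Sphinx2's actual implementation:

```python
import math

def top_n_score(frame, means, variances, weights, top):
    """Score a frame against a shared Gaussian codebook, keeping only the
    'top' best log-likelihood Gaussians (as Sphinx2's -top option does)."""
    # 1-D Gaussian log-likelihoods for the shared codebook
    lls = [
        -0.5 * (math.log(2 * math.pi * var) + (frame - mu) ** 2 / var)
        for mu, var in zip(means, variances)
    ]
    # Indices of the 'top' highest-scoring Gaussians for this frame
    best = sorted(range(len(lls)), key=lambda i: lls[i], reverse=True)[:top]
    # Mixture score using only those Gaussians
    return math.log(sum(weights[i] * math.exp(lls[i]) for i in best))

# Hypothetical 3-Gaussian codebook (Sphinx2's shared set has 256)
means, variances = [0.0, 1.0, 5.0], [1.0, 1.0, 1.0]
weights = [0.5, 0.3, 0.2]
print(top_n_score(0.1, means, variances, weights, top=1))
print(top_n_score(0.1, means, variances, weights, top=4))  # uses all 3 here
```

With top=1 the score is cheaper but a lower bound on the full mixture score, which is the speed/accuracy trade-off described above.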

       
      • Val Veattie

        Val Veattie - 2004-07-01

        Thanks for the info.  We actually use -top 4 currently.  Are the Sphinx2 v. Sphinx3.3 benchmarks somewhere public, and if so could you point me to them?

        Again, thanks for all the help.

        Val Beattie

         
      • Paul Lamere

        Paul Lamere - 2004-07-01

I know that you are particularly interested in performance on Pentium class chips.  Just this morning, Evandro and I were comparing results for tests run on 'george', our P4 system, for the WSJ 5k tests (5,000 word vocabulary).  We were seeing nearly identical speed and accuracy results for S3.3 and S4.

WSJ 5K Trigram

       WER    RT
S3.3   7.18   0.72
S4     6.98   0.71

        Paul

         

