
question about Sphinx4 benchmarks

2004-06-30
2012-09-21
  • Val Veattie

    Val Veattie - 2004-06-30

    Hi,
The Sphinx4 benchmarking tables are very thorough - but they lead to this question: why is Sphinx4 so much slower than Sphinx 3.3 on a Pentium, as opposed to an UltraSparc? Is that simply a Java issue?  How would it change using Windows rather than Linux on a Pentium?

    thanks very much!

    Val Beattie

     
    • Paul Lamere

      Paul Lamere - 2004-06-30

      Val:

      Thanks for your interest in S4 and especially for your interesting question.  There are a number of reasons why the numbers look as they do:

1) We've recently completed a major performance push. In this push we've concentrated on improving the performance of certain tests (such as tidigits_wordlist, rm1_trigram and hub4_trigram).  A number of tests just haven't been tuned properly. It would not take much to go back, revisit all of the tests, and re-tune and optimize them based upon our better knowledge of how to tune the system; we just have not had time. So, depending upon the test that you look at, you may not be seeing S4 at its best.

2) We show results for a Pentium system called 'mickey'. Mickey was retired 6 months ago, before much of our performance work was complete; as such, it shows much slower numbers than current S4 performance.

Let's take a look at rm1 performance on 'george'. rm1 is a 1,000-word-vocabulary n-gram recognition task. This test has been properly tuned and has been run recently on all test systems.

If we take a look at the medium vocabulary results at:

http://cmusphinx.sourceforge.net/MediumVocabResults.html

and scroll down to the latest results for 'george' (a 2.2 GHz P4 with 900 MB of RAM), we see the following for the rm1_trigram test:

      RT
s3    0.64
s3.3  0.52
s4    0.34

This shows S4 running quite a bit faster than S3.3 on a 1,000-word vocabulary.
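For reference, the RT column is the real-time factor: decoding time divided by the duration of the audio, so values below 1.0 mean faster than real time. A minimal sketch (the timings here are made up to match the table, not taken from the benchmark logs):

```python
def real_time_factor(decode_seconds, audio_seconds):
    """xRT: how long decoding took relative to the length of the audio."""
    return decode_seconds / audio_seconds

# Hypothetical example: 34 s to decode 100 s of audio -> 0.34 xRT,
# i.e. about 3x faster than real time.
print(real_time_factor(34.0, 100.0))  # 0.34
```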

My experience is that for tasks of up to 5,000 to 10,000 words, S4 should give as good or better results than S3.3.

      If you are interested in the results for a particular test, let me know which test and I'll try to reconcile the numbers.

And yes ... we should clean this up a bit. It would be nice to have all of the tests tuned to show S4 at its best, but we have such a large number of tests that we'd spend all our time doing that. Sigh ...

      Hope this helps ...

      Paul

       
      • Val Veattie

        Val Veattie - 2004-06-30

        Paul -
          That was quick!  Thanks for the clarification, that does help - I was wondering if it might be a tuning issue.  One follow-on question - is there somewhere where the benchmark/regression tests are described in more detail?  I can figure out the task and ngram part, but what is meant by fst, quick, flat . . .
           thanks,
        Val

         
        • Paul Lamere

          Paul Lamere - 2004-06-30

          Val:

          There are some details on the S4 twiki. Some places to start:

          http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/RegressionTests
          and
          http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/RegressionTestReview

          This may not have all the info you need.  So here's a quick glossary:

xxx_quick:   a quick version of the test. Sometimes we have more test data than we want to run; for instance, tidigits has 8700 utterances, so the quick version only runs every 5th utterance, leaving 1740 utterances. The overall regression test then takes one fifth of the time to run.
xxx_fst:  uses a finite state transducer to define the grammar. You can read more about FSTs here:

          http://www.speech.cs.cmu.edu/cgi-bin/cmusphinx/twiki/view/Sphinx4/FstGrammar

flat_unigram: all words have an equal probability of occurring
unigram: each word can have a custom probability of occurring
bigram: the probability of word Y depends on the previous word X
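The unigram/bigram distinction can be made concrete with maximum-likelihood estimates from counts. A generic sketch on a toy corpus (this is the standard textbook formulation, not Sphinx4's actual language-model code):

```python
from collections import Counter

def unigram_probs(words):
    """P(w) = count(w) / total words. (A flat unigram would instead
    assign every vocabulary word the same probability, 1 / vocab size.)"""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def bigram_prob(words, prev, word):
    """P(word | prev) = count(prev, word) / count(prev as a predecessor)."""
    pairs = Counter(zip(words, words[1:]))
    prev_count = Counter(words[:-1])[prev]
    return pairs[(prev, word)] / prev_count

corpus = "one two one two one".split()
print(unigram_probs(corpus)["one"])       # 0.6 (3 of the 5 words)
print(bigram_prob(corpus, "one", "two"))  # 1.0 (every "one" is followed by "two")
```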

          xxx_jsgf: Uses a JSGF grammar to define the language model
          xxx_wordlist: Uses a wordlist grammar to define the language model
          xxx_rejection: a test for rejection accuracy

          ti46 - isolated digits
          tidigits - connected digits
          an4_full - 100 words
          an4_spelling - just the spelling (a through z) words of an4_full
          an4_words - just the non-spelling words of an4_full
          rm1 - 1000 word vocabulary
          wsj5k - 5,000 word vocabulary
wsj20k - 20,000 word vocabulary
          hub4 - 60K vocabulary

So, for instance, rm1_unigram_fst_quick is the rm1 data set where the language model is a unigram defined by an FST, and only every 5th utterance is decoded.
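In other words, a test name reads as dataset + language model + optional modifiers. A small illustrative parser of that convention (the scheme is inferred from the examples above, not an official spec, and it doesn't handle multi-part dataset names like an4_full):

```python
def parse_test_name(name):
    """Split a regression-test name like 'rm1_unigram_fst_quick'
    into (dataset, language-model parts, modifiers)."""
    parts = name.split("_")
    dataset = parts[0]
    # Suffixes that modify how the test runs rather than what LM it uses
    modifiers = [p for p in parts[1:] if p in ("fst", "quick", "rejection")]
    lm = [p for p in parts[1:] if p not in modifiers]
    return dataset, lm, modifiers

print(parse_test_name("rm1_unigram_fst_quick"))
# ('rm1', ['unigram'], ['fst', 'quick'])
```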

          HTH

          Paul

           
          • Val Veattie

            Val Veattie - 2004-07-01

            Thanks for the key to the benchmarks, that helps a lot.  I have one more question, don't know if you'll be able to help me with this but my other lead fizzled out.

            I'm currently working on software that uses Sphinx2 and we are looking at a possible move to Sphinx4.   The software requires real-time recognition performance, on fairly low-end PCs.  In terms of benchmarks, the missing link for us is determining how Sphinx4 (or Sphinx3, or Sphinx3.3 for that matter) compares to Sphinx2 on small to medium vocabulary tasks.  Do you have any insight into that?

            Thanks again,

            Val

             
    • Bhiksha Raj

      Bhiksha Raj - 2004-07-01

We don't have Sphinx4 vs. Sphinx2 comparisons, but we do have Sphinx3.3 vs. Sphinx2 comparisons (although not on the twiki pages), and from there one can extrapolate.

Sphinx2 is a semi-continuous HMM-based system which uses a shared set of 256 Gaussians for all tied states. It has a parameter called "-top", which determines how many of those Gaussians are scored for any tied state.
When this is set to 1, on a vocabulary of about 10,000 words, it runs in less than half the time that Sphinx3.3 needs, in less than half the memory. The accuracy, however, is also about 10-30% worse (in an absolute sense).
If you set -top to 4, the accuracy catches up (it remains about 3-5% worse), but so do the xRT and the memory - it takes about as much time and memory as Sphinx3.3.

So if your current setup uses -top 1 (which is the default in the open source Sphinx package), you can expect to see a significant hit in xRT, but a significant improvement in recognition performance, in going to S4.
If you're using -top 4, then you shouldn't see much difference in xRT and some improvement in WER.
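The effect of -top can be sketched as follows: each frame is scored against the shared Gaussian codebook once, and only the N best-scoring Gaussians contribute to any state's mixture score. This is a simplified 1-D illustration with made-up numbers, not Sphinx2's actual implementation:

```python
import math

def top_n_score(frame, means, variances, weights, top):
    """Score a frame against a shared Gaussian codebook, keeping only the
    'top' best log-likelihood Gaussians (as Sphinx2's -top option does)."""
    # 1-D Gaussian log-likelihoods for the shared codebook
    lls = [
        -0.5 * (math.log(2 * math.pi * var) + (frame - mu) ** 2 / var)
        for mu, var in zip(means, variances)
    ]
    # Indices of the 'top' highest-scoring Gaussians for this frame
    best = sorted(range(len(lls)), key=lambda i: lls[i], reverse=True)[:top]
    # Mixture score using only those Gaussians
    return math.log(sum(weights[i] * math.exp(lls[i]) for i in best))

# Hypothetical 3-Gaussian codebook (Sphinx2's shared set has 256)
means, variances = [0.0, 1.0, 5.0], [1.0, 1.0, 1.0]
weights = [0.5, 0.3, 0.2]
print(top_n_score(0.1, means, variances, weights, top=1))
print(top_n_score(0.1, means, variances, weights, top=4))  # uses all 3 here
```

With top=1 the score is cheaper but a lower bound on the full mixture score, which is the speed/accuracy trade-off described above.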

       
      • Val Veattie

        Val Veattie - 2004-07-01

        Thanks for the info.  We actually use -top 4 currently.  Are the Sphinx2 v. Sphinx3.3 benchmarks somewhere public, and if so could you point me to them?

        Again, thanks for all the help.

        Val Beattie

         
      • Paul Lamere

        Paul Lamere - 2004-07-01

I know that you are particularly interested in performance on Pentium class chips.  Just this morning, Evandro and I were comparing results for tests run on 'george', our P4 system, for the WSJ 5k tests (5,000 word vocabulary).  We were seeing nearly identical speed and accuracy results for S3.3 and S4.

WSJ 5K Trigram

       WER    RT
S3.3   7.18   0.72
S4     6.98   0.71

        Paul

         

