
Sphinx vs Siri

  • Li3

    Li3 - 2012-01-09

    What does it take to get Siri-like speech recognition performance
    from Sphinx (say Pocketsphinx)? Is it just a bigger vocabulary, a good
    language model, and a good acoustic model, or something more?

    Li

     
  • Nickolay V. Shmyrev

    What does it take to get Siri-like speech recognition performance from Sphinx
    (say Pocketsphinx)? Is it just a bigger vocabulary, a good language model, and
    a good acoustic model, or something more?

    Commercial ASR services like the one from Nuance use both better databases and
    more advanced algorithms. It's a significant effort to reproduce them with
    Pocketsphinx, though the CMUSphinx project is a good foundation for such an
    effort.

     
  • Jon S

    Jon S - 2012-04-13

    Sorry to revive this thread, but when you say significant effort, do you mean
    anything other than the gathering of a training set and implementation of the
    algorithms?

    I somehow doubt that these few companies have come up with significantly
    better algorithms than all of academia that they are keeping secret.

     
  • Nickolay V. Shmyrev

    I somehow doubt that these few companies have come up with significantly
    better algorithms than all of academia that they are keeping secret.

    The algorithms are not secret; they are published. The problem is that they
    are not implemented in CMUSphinx.

     
  • Nathan Glenn

    Nathan Glenn - 2012-04-30

    Some of the things on the TODO list are on the GSOC ideas page
    (http://sourceforge.net/projects/cmusphinx/forums).

    It would be nice if there were a wiki-like place to list wanted algorithms or
    other components (which would probably make a much longer list than the GSOC
    one).

     
  • Joseph S. Wisniewski

    I somehow doubt that these few companies have come up with significantly
    better algorithms than all of academia

    I don't doubt it, at all. There are four things going on...

    1. The drunkard's walk. Like any academic project, Sphinx needs to provide
    new opportunities for theses and dissertations. That means it can't really
    progress from a well-defined starting point to a well-defined goal on
    anything that resembles a straight line. Instead, it does the "drunkard's
    walk": student number one says "X will improve recognition," implements X,
    does a paper, and goes away. Student two can't get X++ approved as a thesis,
    so he forgets X and does Y, and he actually has to make sure that Y bears no
    resemblance to X, that it goes off in an entirely new direction. So academic
    projects can burn a ton of people-years without actually getting close to
    any sort of goal. Well, except for SONIC, and we all know what happened
    there.
    2. Crowd sourcing. Nuance, Google, and Microsoft take advantage of "crowd
    sourcing" data sets on a scale that academics can only dream about. The
    "Talk to Google" approach got them thousands of hours of speech and tens of
    millions of utterances, free and basically automated.
    3. Cults of personality. In an academic situation, it's hard to get a
    project going, or take an existing one in a new direction, if you don't have
    buy-in from your adviser. And if that adviser is the person in charge of a
    particular project, well, some highly promising paths are not going to get
    investigated. You'll see this in just about every major academic project, in
    any field of study.
    4. Scavenging. Not only did Nuance put a ton of people-hours into their
    current tech, but they also bought and integrated billions of dollars
    (literally) of other people's tech that they bought cheap. They bought
    everything L&H had, including Dragon, for something like $30M. I was at Ford
    when we were bidding against L&H for Dragon; we both valued Dragon alone at
    over $500M. Do you know how many Ph.D. students you can sponsor for $500M?
    About 2,500-5,000. A heck of a lot more than has gone into the sum total of
    Sphinx, Julius, ISIP, iATROS, SPRACH, RWTH, etc.

     
  • Joseph S. Wisniewski

    And that's what happens when you use BBCode on a forum that has neither a
    preview function nor an edit function. Thanks, SourceForge.

    Sorry.

     
  • Nickolay V. Shmyrev

    There are four things going on...

    That's true. We are trying to change that, but quite a few problems remain.

     
  • agv123

    agv123 - 2012-07-13

    Which algorithms exactly give Dragon the upper hand?

     
  • vJaivox

    vJaivox - 2013-01-15

    Sorry about replying to an old thread. I do not know what algorithms are being used by Nuance, but perhaps we can come up with an alternate method. We can already make reasonably accurate recognizers on small vocabularies using Sphinx4. What if we make large-vocabulary recognizers by building a community of small recognizers (on multiple computers if necessary) that are all given the same input to process? Then we can (collaboratively) create open algorithms to pick between the results of each recognizer, without getting into the harder problem of making a better large recognizer.
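A toy sketch of the pick-between-results idea (all names here are hypothetical; real combination schemes such as ROVER align the hypotheses before voting, but a per-position majority vote over equal-length outputs illustrates the principle):

```python
from collections import Counter

def vote_hypotheses(hypotheses):
    """Pick, position by position, the word most recognizers agree on.

    `hypotheses` is a list of strings, one per small recognizer; the
    combined output takes the majority word at each position.
    """
    words = [h.split() for h in hypotheses]
    length = max(len(w) for w in words)
    result = []
    for i in range(length):
        column = [w[i] for w in words if i < len(w)]
        result.append(Counter(column).most_common(1)[0][0])
    return " ".join(result)

# Three hypothetical small recognizers disagree on one word:
print(vote_hypotheses([
    "turn on the light",
    "turn on the night",
    "turn on the light",
]))  # -> turn on the light
```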

     
    • Alex Rudnicky

      Alex Rudnicky - 2013-01-15

      This is reasonable.

      Actually, there's a version of this that uses multiple Sphinxes that's a
      part of RavenClaw/Olympus. It provides for n simultaneous decodings of a
      speech input. Normally this is used to run male + female (+ maybe child)
      acoustic models at the same time: you then pick the decoding with the best
      score (comparable because the acoustics are exactly the same). The same
      idea could be extended to using parallel (say topic-based) language models.

      But two comments:

      1) Places like Nuance have way more training data than we could ever hope
      to acquire. Ultimately that's what makes the difference in general-purpose
      recognition. All contemporary decoders are fundamentally the same. On the
      other hand if you're working in a particular domain (like most of us)
      you're able to specialize models and get really good performance as a
      result.

      2) Siri is not just recognition, it needs two other things: form-based
      understanding plus associated dialog to support filling. Also, back-end
      integration: Siri knows about exactly 14 domains (or is it 15? Their
      documentation is a bit hazy on that). Understanding for each domain has
      been hand-crafted (and continues to be maintained). Anything else you say
      to it that doesn't fit into a form gets thrown to a web search engine so it
      always seems to respond in a vaguely appropriate way. Oh, and Siri also has
      this really neat implementation of Eliza that keeps you distracted from
      its shortcomings.
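The pick-the-best-decoding scheme described above can be sketched as follows. `StubDecoder` is a hypothetical stand-in, not a real Sphinx API; a real setup would wrap n Pocketsphinx or Sphinx4 instances configured with different acoustic (or language) models:

```python
class StubDecoder:
    """Hypothetical stand-in for one configured Sphinx decoder."""
    def __init__(self, text, score):
        self._text, self._score = text, score

    def decode(self, audio):
        # A real decoder would score `audio` against its own models.
        return self._text, self._score

def best_decoding(decoders, audio):
    # Every decoder sees exactly the same input, so the acoustic
    # scores are comparable; keep the highest-scoring hypothesis.
    return max((d.decode(audio) for d in decoders), key=lambda r: r[1])

# Male, female, and child models decoding the same utterance:
male = StubDecoder("call home", -2100.0)
female = StubDecoder("call home", -1850.0)
child = StubDecoder("tall dome", -2600.0)
text, score = best_decoding([male, female, child], audio=b"...")
print(text)  # -> call home
```

The same selection loop extends to parallel topic-based language models: only the set of decoders changes, not the comparison.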


       
      • tb123

        tb123 - 2013-05-11

        Interesting thread that I'm just coming into as a new subscriber, while looking into possibly using CMU ASR for domain-specific ASR. I've not yet had any experience with it at all, though I do with Nuance's ASR.

        As has been said, Nuance has been building their language models with huge amounts of data over many years and probably have more than anyone else at this point. So, for general contexts they are pretty good. However...

        My reason for looking at alternatives to Nuance is that, unless you have a very big budget to engage their professional services, there are no facilities to tailor a language model to your particular domain, resulting in much poorer, and very variable, recognition accuracy for the domain you're interested in. Such facilities are also not (yet) available from other major ASR contenders like Google and Microsoft. Hence I'm looking at alternatives like CMUSphinx for ASR, since I gather from the documentation that you can build your own language model.

        So I have a few questions which I hope a Sphinx expert might answer, and which will help me decide whether I should look further at Sphinx ASR...

        -I'm looking for a server-based ASR solution - which Sphinx version would be best for this?

        -for domain-specific ASR, what accuracy rates might I expect (assuming appropriate language modelling)?

        -is the ASR capable of real-time response for short (1 sentence, <10 words) phrases?

        -how are regional accents handled - is there a way of training the models for them?

        Thanks in advance

         
  • Nickolay V. Shmyrev

    So I have a few questions which I hope a Sphinx expert might respond to which will help me decide whether I should look further at Sphinx ASR...

    To ask for help, please start a new thread.

    I'm looking for a server-based ASR solution - which Sphinx version would be best for this?

    Sphinx4

    for domain-specific ASR, what accuracy rates might I expect (assuming appropriate language modelling)?

    The expected accuracy depends on vocabulary size and is listed in the tutorial. For a 100-word vocabulary, accuracy can be 99%; for a 10,000-word vocabulary, about 95%.

    is the ASR capable of real-time response for short (1 sentence, <10 words) phrases?

    Yes

    how are regional accents handled - is there a way of training the models for them?

    You have to collect regional data and train your own acoustic model.

     
