What does it take to get Siri-like speech recognition performance from Sphinx (say, Pocketsphinx)? Is it just a bigger vocabulary, a good language model, and a good acoustic model, or is it something more?
Li
Commercial ASR services like the one from Nuance use both better databases and more advanced algorithms. It takes significant effort to reproduce them with pocketsphinx, though the CMUSphinx project is a good foundation for such an effort.
Sorry to revive this thread, but when you say significant effort, do you mean anything other than gathering a training set and implementing the algorithms?
I somehow doubt that these few companies have come up with significantly
better algorithms than all of academia that they are keeping secret.
The algorithms are not secret; they are published. They are just not implemented in CMUSphinx, and that's the problem.
Some of the things on the TODO list are on the GSOC ideas page (http://sourceforge.net/projects/cmusphinx/forums).
It would be nice if there were a wiki-like place to list wanted algorithms or
other components (which would probably make a much longer list than the GSOC
one).
If you have a feature request, please add it to the tracker:
https://sourceforge.net/tracker/?group_id=1904&atid=351904
I somehow doubt that these few companies have come up with significantly better algorithms than all of academia

I don't doubt it at all. There are four things going on:

1) Like any academic project, Sphinx needs to provide new opportunities for theses and dissertations. That means it can't really progress from a well-defined starting point to a well-defined goal on anything that resembles a straight line. Instead it does the "drunkard's walk": student number one says "X will improve recognition", implements X, writes a paper, and goes away. Student two can't get X++ approved as a thesis, so he forgets X and does Y, and he actually has to make sure that Y bears no resemblance to X, that it goes off in an entirely new direction. So academic projects can burn a ton of people-years without actually getting close to any sort of goal. Well, except for SONIC, and we all know what happened there.

2) Nuance, Google, and Microsoft take advantage of "crowd-sourced" data sets on a scale that academics can only dream about. The "Talk to Google" approach got them thousands of hours of speech and tens of millions of utterances, free and basically automated.

3) Cults of personality. In an academic setting it's hard to get a project going, or to take an existing one in a new direction, if you don't have buy-in from your adviser. And if that adviser is the person in charge of a particular project, well, some highly promising paths are not going to get investigated. You'll see this in just about every major academic project, in any field of study.

4) Scavenging. Not only did Nuance put a ton of people-hours into their current tech, they also bought and integrated billions of dollars (literally) of other people's tech that they got cheap. They bought everything L&H had, including Dragon, for something like $30M. I was at Ford when we were bidding against L&H for Dragon; we both valued Dragon alone at over $500M. Do you know how many Ph.D. students you can sponsor for $500M? About 2,500-5,000. A heck of a lot more than has gone into the sum total of Sphinx, Julius, ISIP, iATROS, SPRACH, RWTH, etc.
And that's what happens when you use BBCode on a forum that has neither a
preview function nor an edit function. Thanks, SourceForge.
Sorry.
That's true. Though we are trying to change it, quite a few problems remain.
Which algorithms exactly give Dragon the upper hand?
Sorry about replying to an old thread. I do not know what algorithms are being used by Nuance, but perhaps we can come up with an alternative method. We can already make reasonably accurate recognizers on small vocabularies using Sphinx4. What if we make large-vocabulary recognizers by building a community of small recognizers (on multiple computers if necessary) that are all given the same input to process? Then we can (collaboratively) create open algorithms to pick between the results of each recognizer, without getting into the harder problem of making a better large recognizer.
This is reasonable.
Actually, there's a version of this, using multiple Sphinxes, that's part of RavenClaw/Olympus. It provides for n simultaneous decodings of a speech input.
acoustic models at the same time: you then pick the decoding with the best
score (comparable because the acoustics are exactly the same). The same
idea could be extended to using parallel (say topic-based) language models.
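To make that concrete, here is a rough sketch of the pick-the-best-score approach (not the actual RavenClaw/Olympus code), written against the pocketsphinx Python bindings. The model directories, language model, and audio file names are placeholders, and a real system would stream audio rather than read a whole file:

# Sketch: decode the same utterance with several acoustic models and
# keep the hypothesis with the best score.  All paths are placeholders.
from pocketsphinx import Decoder

MODEL_DIRS = ["models/en-us-male", "models/en-us-female"]

def make_decoder(hmm_dir):
    config = Decoder.default_config()
    config.set_string("-hmm", hmm_dir)                # acoustic model
    config.set_string("-lm", "models/domain.lm")      # shared language model
    config.set_string("-dict", "models/domain.dict")  # shared dictionary
    return Decoder(config)

decoders = [make_decoder(d) for d in MODEL_DIRS]

with open("utterance.raw", "rb") as f:                # 16 kHz, 16-bit mono PCM
    audio = f.read()

best = None
for hmm_dir, dec in zip(MODEL_DIRS, decoders):
    dec.start_utt()
    dec.process_raw(audio, False, True)               # whole utterance at once
    dec.end_utt()
    hyp = dec.hyp()
    if hyp is not None and (best is None or hyp.best_score > best[2]):
        best = (hmm_dir, hyp.hypstr, hyp.best_score)

if best is not None:
    print("model=%s  text=%r  score=%d" % best)

The scores are only comparable across decoders because every decoder sees exactly the same audio, as noted above.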
But two comments:
1) Places like Nuance have way more training data than we could ever hope
to acquire. Ultimately that's what makes the difference in general-purpose
recognition. All contemporary decoders are fundamentally the same. On the
other hand if you're working in a particular domain (like most of us)
you're able to specialize models and get really good performance as a
result.
2) Siri is not just recognition; it needs two other things: form-based understanding plus the associated dialog to support form filling, and back-end integration. Siri knows about exactly 14 domains (or is it 15? Their documentation is a bit hazy on that), and understanding for each domain has been hand-crafted (and continues to be maintained). Anything else you say to it that doesn't fit into a form gets thrown to a web search engine, so it always seems to respond in a vaguely appropriate way. Oh, and Siri also has this really neat implementation of Eliza that keeps you distracted from its shortcomings.
On Mon, Jan 14, 2013 at 7:29 PM, vJaivox vjaivox@users.sf.net wrote:
Interesting thread that I'm just coming into as a new subscriber; I'm looking into possibly using CMU ASR for domain-specific ASR. I've not yet had any experience with it at all, though I do with Nuance's ASR.
As has been said, Nuance has been building their language models with huge amounts of data over many years and probably has more than anyone else at this point. So for general contexts they are pretty good. However...
My reason for looking at alternatives to Nuance is that, unless you have a very big budget to engage their professional services, there are no facilities to tailor a language model to suit your particular domain, which results in much poorer, and very variable, recognition accuracy for the domain you're interested in. Such facilities are also not (yet) available from other major ASR contenders like Google and Microsoft. Hence I'm looking at alternatives like CMUSphinx, since I gather from the documentation that you can build your own language model.
So I have a few questions, which I hope a Sphinx expert might answer, that will help me decide whether I should look further at Sphinx ASR:
- I'm looking for a server-based ASR solution - which Sphinx version would be best for this?
- For domain-specific ASR, what accuracy rates might I expect (assuming appropriate language modelling)?
- Is the ASR capable of real-time response for short (1 sentence, <10 words) phrases?
- How are regional accents handled - is there a way of training the models for them?
Thanks in advance.
To ask for help, please start a new thread.
- Sphinx4.
- The expected accuracy depends on the vocabulary size and is listed in the tutorial: for a 100-word vocabulary accuracy can be 99%; for 10,000 words, about 95%.
- Yes.
- You have to collect regional data and train your own acoustic model.
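For concreteness, here is a minimal sketch of such a domain-specific setup, shown with the pocketsphinx Python bindings for brevity (a Sphinx4 configuration carries the same three pieces: acoustic model, dictionary, language model). All file names are placeholders; the language model and dictionary would be built from your own domain text with the CMUSphinx language-modelling tools:

# Sketch: a small domain-specific recognizer.  The .lm and .dict files
# are placeholders built from your own domain corpus; for a very
# constrained command set a JSGF grammar (-jsgf) is an alternative.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string("-hmm", "models/en-us")         # stock acoustic model
config.set_string("-lm", "domain/orders.lm")      # your domain language model
config.set_string("-dict", "domain/orders.dict")  # only the words you need
decoder = Decoder(config)

def recognize(raw_audio):
    """Decode one short utterance (16 kHz, 16-bit mono PCM)."""
    decoder.start_utt()
    decoder.process_raw(raw_audio, False, True)
    decoder.end_utt()
    hyp = decoder.hyp()
    return hyp.hypstr if hyp is not None else None

with open("utterance.raw", "rb") as f:
    print(recognize(f.read()))

Keeping the dictionary and language model restricted to the domain vocabulary is what makes the accuracy figures above realistic; handling regional accents, as noted, additionally means collecting regional speech and adapting or retraining the acoustic model.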