I have a JSGF model working, and the control it provides is great compared to
the lm model (which I imagine is more suited to dictation). However, I have an
issue regarding a large (~1000) set of commands that I want recognised.
My grammar is currently as simple as can be, a single category with around
1000 utterances I'd like to be distinguished between. My dictionary file is
similarly large (larger even, as there are alternate word pronunciations).
This means that at startup, I am getting a minute or two delay while I see the
following happen in the console:
...
INFO: fsg_model.c(358): Added 2 alternate word transitions
INFO: fsg_model.c(325): Adding alternate word transitions (DENTISTS(4),DENTISTS(3)) to FSG
...
Secondly, recognition takes a great deal longer (a couple of seconds, compared
to almost instantaneous for an lm model).
And finally, the word returned includes the alternate pronunciation number in
the recognised string, e.g. "RESTAURANT(2)", whereas the lm model did not do
this.
What I want is to be able to restrict a model to only recognise 500-1000
phrases, and not be able to mix and match words from each phrase in the corpus
(which lowers the accuracy for me).
I am using PocketSphinx (wrapped by VocalKit), and my config file looks like
this:
-fwdflat no
-bestpath no
-nfft 512
-lowerf 1
-upperf 4000
-samprate 8000
-nfilt 20
-transform dct
-round_filters no
-remove_dc yes
Any assistance on achieving this, or otherwise overcoming the performance
issues I see with the JSGF model, would be very gratefully received,
Thanks!
P
the control it provides is great compared to the lm model
This is the case where control is dangerous :)
My grammar is currently as simple as can be, a single category with around
1000 utterances I'd like to be distinguished between.
That doesn't sound like a proper application design; it should be reviewed.
What application exactly are you trying to build, and why do you need to
distinguish between 1000 variants?
This means that at startup, I am getting a minute or two delay while I see
the following happen in the console
There was a performance issue fixed in pocketsphinx trunk. If you are using
pocketsphinx-0.6, it definitely makes sense to upgrade to the latest version.
Secondly, recognition takes a great deal longer
There are ways to speed up the recognizer described in the wiki, though they
trade off accuracy.
-fwdflat no -bestpath no
If your vocabulary is rather big, that's not a good thing to do. Both fwdflat
and bestpath improve accuracy.
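For example, the config shown above could re-enable both passes (at some speed cost; these are standard pocketsphinx flags, the rest of the config stays as it is):

```
-fwdflat yes
-bestpath yes
```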
the word returned includes the alternative pronunciation number e.g.
"RESTAURANT(2)"
This is a bug if lm indeed returns just words. It needs to be fixed.
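Until that is fixed, the variant marker can be stripped from the hypothesis string on the application side. A minimal sketch, assuming the markers always look like "(N)" glued onto a word:

```python
import re

def strip_pronunciation_variants(hypothesis):
    """Remove alternate-pronunciation markers such as 'RESTAURANT(2)' -> 'RESTAURANT'."""
    # A variant marker is an integer in parentheses appended to a word.
    return re.sub(r'\(\d+\)', '', hypothesis)

print(strip_pronunciation_variants("RESTAURANT(2)"))  # RESTAURANT
print(strip_pronunciation_variants("PET(3) SHOPS"))   # PET SHOPS
```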
Thanks for your input, it's already very helpful.
To explain my requirements further, I have a number of category titles that
I'd like recognised - for example 'Restaurant', 'Bar', 'Pet Shop'. There are
between 200 and 500 of these that I'd like recognised. My list balloons
further, however, because I have added plurals to the corpus (e.g.
'Restaurants', 'Bars', 'Pet Shops'), since I'm not sure which form the user
will choose when saying a category; this inflates the corpus, and subsequently
the grammar.
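One way to keep a list like that manageable is to generate the grammar from the base category list rather than hand-maintain every plural. A sketch of that idea (the naive pluralisation rule and the grammar/rule names here are illustrative placeholders, not anything PocketSphinx or VocalKit requires):

```python
def pluralise(phrase):
    """Naive English plural of a phrase; real data may need an exception list."""
    return phrase + ("es" if phrase.endswith(("s", "sh", "ch", "x")) else "s")

def build_jsgf(categories, grammar_name="categories", rule_name="category"):
    """Emit a single-rule JSGF grammar listing each phrase and its plural as alternatives."""
    alternatives = []
    for phrase in categories:
        alternatives.append(phrase)
        alternatives.append(pluralise(phrase))
    body = "\n  | ".join(alternatives)
    return ("#JSGF V1.0;\n"
            f"grammar {grammar_name};\n"
            f"public <{rule_name}> =\n    {body}\n  ;\n")

print(build_jsgf(["restaurant", "bar", "pet shop"]))
```

The dictionary still needs entries for the plural word forms, but the phrase list itself only has to be maintained once.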
Essentially the list is a relatively full list of things you might expect to
find on a high-street.
I don't actually have a requirement for a grammar at all, these category
phrases should be self contained, and won't be joined in a longer grammar of
other phrases.
I started off with an lm model, which actually had very good accuracy, except
that it would feel free to mix and match words from different phrases. So with
two phrases such as 'Cat Groomers' and 'Italian Restaurants', it was possible
to recognise 'Cat Restaurants' or 'Italian Groomers', which is something I
want to prevent.
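A flat JSGF rule does give exactly that property: each alternative is a complete phrase, so the decoder can only return one of the listed phrases verbatim and cannot recombine their words the way an n-gram lm can. A sketch (names are illustrative):

```jsgf
#JSGF V1.0;
grammar categories;

// Each alternative is a whole phrase; words cannot be
// mixed across alternatives.
public <category> = cat groomers
                  | italian restaurants
                  ;
```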
In order to achieve this, it was suggested I try a JSGF grammar, which is
where I am at the moment.
As for your comments on performance, I'll definitely take action based on
that, thanks for the advice.
P
OK, let it be this way then. It's also beneficial to analyse n-best lists from
the decoder to get more accurate recognition results.
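Assuming you can pull the n-best hypothesis strings out of the decoder (the exact API depends on the pocketsphinx version and the VocalKit wrapper), the rescoring step can be as simple as taking the highest-ranked hypothesis that is actually one of the known phrases. A sketch:

```python
def best_valid_hypothesis(nbest, valid_phrases):
    """Return the highest-ranked n-best hypothesis that is a known phrase.

    nbest: hypothesis strings, best first, as produced by the decoder.
    valid_phrases: the full set of category phrases the app accepts.
    """
    valid = {p.upper() for p in valid_phrases}
    for hyp in nbest:
        if hyp.upper() in valid:
            return hyp
    return None  # nothing in the list was an exact phrase match

print(best_valid_hypothesis(
    ["CAT RESTAURANTS", "CAT GROOMERS"],
    ["cat groomers", "italian restaurants"]))  # CAT GROOMERS
```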
Please do. Latest snapshot should be faster.