Hi Nickolay,
I'm wondering whether the -fdict behavior in OpenEars is working correctly, and
I wanted to see if you could clarify the correct behavior of a working filler
dictionary for me.
I am trying to use the noisedict from hub4wsj_sc_8k as my filler dictionary. I
am running Pocketsphinx with the argument -fdict and pointing to the location
of hub4wsj_sc_8k/noisedict. I see that it is received as a command line
argument and it appears in the Current Configuration log as:
-fdict /correctpathto/noisedict
without producing any errors.
If I run Pocketsphinx at the normal level of verbosity, I never see anything
from the filler dictionary being recognized in the logs. This might be the
correct behavior (returning (null) for a filler noise), but I'm not sure. I'm also
getting a lot of reports of noises that are in the noisedict being recognized
as words that are in the language model or grammar, so I've been wondering if
this is the correct behavior (not ever seeing anything from the filler
dictionary being recognized in the Pocketsphinx logging) or if something isn't
working correctly with the filler dictionary as I've configured it. Is there
any way I can verify for sure (besides coughing at it, which sometimes results
in a null or sometimes results in a word being recognized in a large language
model) that the filler dictionary is being used sometimes? Should I be
expecting to see ++COUGH++ in the logging or in the hypotheses sometimes?
As a follow-up question, are there more elaborate filler dictionaries that can
be used with hub4wsj_sc_8k? Perhaps the entries in the filler dictionary are
being heard and disregarded (returning null), but a larger filler dictionary
would do a better job of absorbing more varied background noises. Are there
instructions or examples for what kinds of noises can be added to a larger
hub4wsj_sc_8k filler dictionary, or is it already optimal?
Thanks,
Halle
Is there any way I can verify for sure (besides coughing at it, which
sometimes results in a null or sometimes results in a word being recognized in
a large language model) that the filler dictionary is being used sometimes?
Run with "-fwdflat no -bestpath no -backtrace yes -fillprob 1.0" and see
something like
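These are ordinary decoder options, so the same settings can also be passed
when the decoder is created through the pocketsphinx C API. A rough sketch
(the paths are placeholders, and exact function signatures vary slightly
between pocketsphinx versions):

    #include <pocketsphinx.h>

    int main(void)
    {
        /* Placeholders: point these at your actual model, LM and dictionary. */
        cmd_ln_t *config = cmd_ln_init(NULL, ps_args(), TRUE,
            "-hmm",       "/path/to/hub4wsj_sc_8k",
            "-lm",        "/path/to/your.lm",
            "-dict",      "/path/to/your.dic",
            "-fdict",     "/path/to/hub4wsj_sc_8k/noisedict",
            /* verification settings suggested above */
            "-fwdflat",   "no",
            "-bestpath",  "no",
            "-backtrace", "yes",
            "-fillprob",  "1.0",
            NULL);
        if (config == NULL)
            return 1;
        ps_decoder_t *ps = ps_init(config);
        if (ps == NULL)
            return 1;
        /* ... run recognition with ps_start_utt()/ps_process_raw()/ps_end_utt();
           with -backtrace yes the word backtrace, fillers included, is printed
           to the log ... */
        ps_free(ps);
        return 0;
    }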
so I've been wondering if this is the correct behavior (not ever seeing
anything from the filler dictionary being recognized in the Pocketsphinx
logging)
This is expected behavior rather than correct behavior: it's expected because
you aren't optimizing rejection of out-of-grammar results.
As for fillers, they shouldn't be visible in the decoder output, since they are
internal to the decoder; an API user shouldn't need to care about fillers at
all. The decoder should return NULL if only a filler is present in the
utterance.
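On the API side the only visible effect is that null hypothesis; a minimal
sketch of the check with the C API (the three-argument ps_get_hyp() is the
0.7/0.8-era signature; newer versions drop the uttid parameter):

    #include <stdio.h>
    #include <pocketsphinx.h>

    /* Report the result after an utterance has been ended with ps_end_utt().
     * A NULL hypothesis means only silence/fillers (e.g. a cough) matched. */
    static void report_hyp(ps_decoder_t *ps)
    {
        int32 score;
        char const *uttid;
        char const *hyp = ps_get_hyp(ps, &score, &uttid);

        if (hyp == NULL)
            printf("no word hypothesis (filler/silence only)\n");
        else
            printf("heard: %s (score %d)\n", hyp, score);
    }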
are there more elaborate filler dictionaries that can be used with
hub4wsj_sc_8k?
The filler dictionary just describes the filler phonemes in the acoustic
model. You can't change the dictionary unless you change the acoustic model.
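For illustration only (not the literal hub4wsj_sc_8k file), a filler
dictionary is just a list of filler "words" mapped to the filler phones the
acoustic model was trained with, along the lines of:

    <s> SIL
    </s> SIL
    <sil> SIL
    ++COUGH++ +COUGH+
    ++BREATH++ +BREATH+

Only phones that actually exist in the acoustic model can appear on the right,
which is why the filler dictionary can't be extended without retraining the
model.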
Hiya,
Thanks for the verification method -- that showed that the filler dictionary
is working as expected. A couple more questions.
This is expected behavior rather than correct behavior: it's expected because
you aren't optimizing rejection of out-of-grammar results.
This is true. However, I don't know anything about what language models
OpenEars users are running, so it isn't clear to me how I would offer better
OOV rejection as a library feature. Since the last version they can
programmatically create ARPA models on the fly during their app session and
switch between models in the middle of the continuous loop. So, it's difficult
to even say in the docs "if you're using this kind of model, flip this switch
for better OOV rejection, if you're using this kind of model, flip this one"
since the odds are pretty good that they are going to have their app start
with a more generalized model and then create contextually-useful smaller ones
and swap them in and out.
I've looked into improving OOV rejection and I saw this FAQ entry:
Q: Can pocketsphinx reject out-of-grammar words
There are a few ways to deal with OOV rejection; for more details see
Rejecting Out-of-Grammar Utterances. The situation with the implementation of
those approaches is:
Garbage Models - requires you to train a special model. There is no public
model with garbage phones which can reject OOV words now. There are models
with fillers, but they reject only specific sounds (breath, laugh, um); they
can't reject OOV words.
Generic Word Model - same as above, requires you to train a special model.
There are no public models yet.
Confidence Scores - the confidence score (ps_get_prob) can be reliably
calculated only for a large vocabulary (> 100 words); it doesn't work with a
small grammar. There are approaches based on phone-level confidence, and one
of them was implemented in sphinx2, but pocketsphinx doesn't support them.
Confidence scoring also requires three-pass recognition (enable both fwdflat
and bestpath).
So for now the recommendation for rejection with a small grammar is: train
your own model (and make it public). For a large language model (> 100 words),
use the confidence score.
So, regarding the options listed:
1. I don't have a garbage model to offer, but there is the filler dictionary
included with the hmm, which is apparently working as expected.
2. This looks too specific to any particular language model to be applicable
to OpenEars.
3. I am returning confidence scores in the OpenEars hypothesis-received
callback, and I've started trying to give the developers who use OpenEars some
advice on the use of the scores.
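For what it's worth, reading that score out of the decoder looks roughly like
this with the C API (ps_get_prob() needs both -bestpath and -fwdflat enabled,
returns a log-domain posterior, and its exact signature differs between
pocketsphinx versions; this sketch uses the 0.7/0.8-era form):

    #include <stdio.h>
    #include <pocketsphinx.h>

    /* Sketch: utterance-level confidence after ps_end_utt(). */
    static void report_confidence(ps_decoder_t *ps)
    {
        char const *uttid;
        int32 logprob = ps_get_prob(ps, &uttid);                 /* log-domain */
        double conf = logmath_exp(ps_get_logmath(ps), logprob);  /* back to 0..1 */
        printf("utterance confidence: %f\n", conf);
    }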
Is there another opportunity to improve OOV rejection that is generalized
enough for a framework that I'm missing? Can you make any suggestions here?
My last question is about your argument -fillprob 1.0. I see in the help that
this is "Filler word transition probability, defaults to 1e-8". Looking at my
Pocketsphinx logging I see that I'm not overriding this tiny number anywhere.
For the users who are reporting that noises which are present in the filler
dictionary are being recognized as words present in their language model, will
it be beneficial for them to increase -fillprob, and if so can you recommend a
number for them to start with (with the understanding that they will probably
have to tweak the particular value in accordance with their needs and test
results)?
Thanks again,
Halle
Is there another opportunity to improve OOV rejection that is generalized
enough for a framework that I'm missing? Can you make any suggestions here?
I wrote this answer on a wiki page, so I don't think I have anything to add
to it ;) For your type of application you need to implement a specific
algorithm to calculate a proper confidence score. It's some serious work.
For the users who are reporting that noises which are present in the filler
dictionary are being recognized as words present in their language model, will
it be beneficial for them to increase -fillprob, and if so can you recommend a
number for them to start with (with the understanding that they will probably
have to tweak the particular value in accordance with their needs and test
results)?
I don't think users will be able to tweak this probability properly. Without a
speech database it's hard to find out what is the best value for any of the
decoder parameters.
This page:
http://cmusphinx.sourceforge.net/wiki/sphinx4:rejectionhandling
?
When I read that I assumed it was just about Sphinx 4 since it is in the
Sphinx 4 section of the wiki. Are we talking about this part:
"Use confidence scores which are calculated post-recognition. This usually
makes use of the word lattice from the decoding. For each word in the best
hypothesis, we form a set of feature vectors by concatenating one or more
basic features related to word confidence. Examples of such features include
(but are not limited to): average acoustic score, average language score, word
length in frames, word length in phones, the number of occurrences of the
same word at the same location in the 10-best results, etc. Such a feature
vector
is then scored against a trained vector to determine whether the word is out-
of-vocabulary."
Would such an algorithm actually be language-model independent (i.e. is that
something that can be developed as a framework feature without any idea of
what kind of language model the user is going to generate or add)?
Would such an algorithm actually be language-model independent (i.e. is that
something that can be developed as a framework feature without any idea of
what kind of language model the user is going to generate or add)?
yes
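To make that concrete, the per-word inputs such a scorer would need can be
read off the decoder's best path. A rough sketch with the C API (0.7/0.8-era
signatures; ps_seg_prob() only gives meaningful posteriors when -bestpath is
enabled):

    #include <stdio.h>
    #include <pocketsphinx.h>

    /* Walk the best hypothesis and collect per-word features (posterior,
     * acoustic score, LM score, length in frames) that a confidence
     * classifier could be trained on. */
    static void dump_word_features(ps_decoder_t *ps)
    {
        logmath_t *lmath = ps_get_logmath(ps);
        int32 best_score;
        ps_seg_t *seg;

        for (seg = ps_seg_iter(ps, &best_score); seg; seg = ps_seg_next(seg)) {
            int sf, ef;               /* start/end frame of this word */
            int32 ascr, lscr, lback;  /* acoustic score, LM score, backoff */
            int32 post = ps_seg_prob(seg, &ascr, &lscr, &lback);

            ps_seg_frames(seg, &sf, &ef);
            printf("%-20s frames %d-%d  post=%.3f  ascr=%d  lscr=%d\n",
                   ps_seg_word(seg), sf, ef,
                   logmath_exp(lmath, post), ascr, lscr);
        }
    }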