Hi,
I would like to use Pocketsphinx for keyword spotting in a human-robot interaction as part of my master's thesis. After doing some research on related work and on the different methods that can be applied to keyword spotting, I am very interested in the method Pocketsphinx uses for its KWS function. I could only find a short paragraph in one paper saying that Pocketsphinx first runs LVCSR and afterwards does a text-based search for the keywords. Is that correct? Could anyone tell me some more details about the method? Or is there an official reference that explains how Pocketsphinx's keyword spotting works?
I would also be interested in how the confidences one can get for the keywords are computed.
I hope someone knows something about it!
No, Pocketsphinx keyword spotting does not use LVCSR.
Pocketsphinx uses HMM keyword spotting, also called acoustic keyword spotting. The original citation should probably be:
Rose and Paul (1992), "A Hidden Markov Model Based Keyword Recognition System"
https://sci-hub.tw/https://ieeexplore.ieee.org/document/115555/
You can probably find a more compact description in:
Igor Szoke, Petr Schwarz, Pavel Matejka, Lukas Burget, Michal Fapso, Martin Karafiat, Jan Cernocky, "Comparison of Keyword Spotting Approaches for Informal Continuous Speech"
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.544.3130
Confidence in HMM keyword spotting is the difference between the word path score and the garbage path score.
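If you want to see this end to end, here is a minimal sketch of the kws mode using the SWIG-based Python bindings (pip install pocketsphinx). The model directory, dictionary, keyword list, and audio file names below are placeholders you would replace with your own:

    # Minimal keyword-spotting sketch for the pocketsphinx Python bindings.
    # All file paths are placeholders; substitute your own acoustic model,
    # dictionary, keyword list, and 16 kHz 16-bit mono PCM audio.
    from pocketsphinx import Decoder

    config = Decoder.default_config()
    config.set_string('-hmm', 'model/en-us')          # acoustic model directory
    config.set_string('-dict', 'model/cmudict.dict')  # pronunciation dictionary
    config.set_string('-kws', 'keywords.list')        # keyphrases, one per line
    decoder = Decoder(config)

    decoder.start_utt()
    with open('utterance.raw', 'rb') as f:
        while True:
            buf = f.read(1024)
            if not buf:
                break
            decoder.process_raw(buf, False, False)
    decoder.end_utt()

    # Each detected keyphrase shows up as a segment; seg.prob holds a
    # log-domain confidence score for the detection.
    for seg in decoder.seg():
        print(seg.word, seg.start_frame, seg.end_frame, seg.prob)

The keywords.list file has one keyphrase per line followed by its detection threshold, e.g. hello /1e-20/.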
Thank you very much, Nickolay!
Is the confidence difference you mentioned the cumulative log likelihood ratio?
Exactly!
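Written out (the notation here is mine, not copied from the papers), the confidence for a keyword w hypothesized between frames t_s and t_e is:

    \[
      \mathrm{conf}(w) \;=\; \sum_{t=t_s}^{t_e}
        \Bigl( \log p\bigl(o_t \mid \text{keyword path for } w\bigr)
             - \log p\bigl(o_t \mid \text{garbage path}\bigr) \Bigr)
    \]

i.e. the per-frame log-likelihood ratio between the keyword HMM path and the garbage (filler) path, accumulated over the duration of the hypothesized keyword; the keyword is accepted when this sum exceeds its threshold.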
Thank you!
Another question concerning the reference you mentioned (Rose & Paul, 1992): they mention a background model in addition to the keyword and filler models. I think the filler models are equivalent to the garbage models you mentioned. You said that the confidence is the likelihood ratio between the keyword path score and the garbage path score. But in Rose & Paul the likelihood ratio is computed between the keyword-and-filler model and the background model (see, e.g., Figure 3). This is also the case in Szöke, 2005, the second reference you mentioned (see Figure 1). I found a good explanation of this background model in chapter 3.4.3 of this reference: https://pdfs.semanticscholar.org/a6e1/5bdd38110a0e650c3465c7e8fbb48e3cbd12.pdf.
According to this work, the background model serves as an additional check of whether a keyword that scores higher than the filler models really is a keyword, and the likelihood ratio score is used for this.
Now I am a bit confused: you said the likelihood ratio is computed between the keyword and filler models, but the references say it is computed between the keyword and the background model. What am I missing here? And does Pocketsphinx also use this background model?
The garbage model is the same as the filler model and the same as the background model in Rose and Paul: a model of the alternative decoding.
After re-reading it, I got it now. Thank you!
Chapter 2 of this paper also helped a lot (if anyone else is confused like I was): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.551.3676&rep=rep1&type=pdf
Another question for you, Nickolay, about the confidence values I asked about already: is there a possibility to get the confidences for all keywords in the keyword list for an utterance? I'm thinking of a program that returns the probabilities (or confidences) that a keyword was uttered, but for all prespecified keywords (like 'hello' - 0.8, 'house' - 0.1, 'yes' - 0.1, if these are the 3 keywords in my keyword list). I hope there is a way! Right now, I only get the confidence for each spotted keyword that is part of the hypothesis.
(Maybe this question should go in a new thread? I am not that experienced with this kind of forum.)
If you set the thresholds large enough, you should get confidence scores for all the keyphrases in the list.
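For completeness, a sketch of what that looks like with the Python bindings, reusing the decoder setup from the sketch above. The thresholds in this keywords.list are deliberately permissive placeholder values so that every keyphrase yields a detection whose score can be read out; real thresholds would need tuning:

    hello /1e-50/
    house /1e-50/
    yes /1e-50/

    # Assuming 'decoder' was configured with '-kws keywords.list' and the
    # utterance was processed as in the earlier sketch, collect one
    # log-domain score per detected keyphrase (keeping the best score if
    # a keyphrase was spotted more than once).
    scores = {}
    for seg in decoder.seg():
        scores[seg.word] = max(scores.get(seg.word, float('-inf')), seg.prob)
    print(scores)

Note that these are log-domain scores, not normalized probabilities that sum to one, so mapping them to a distribution like the 0.8/0.1/0.1 example would be an extra step on your side.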