Hi all, I am working with pocketsphinx. I have an ultra-small vocabulary
(20-30 words) and my end goal is word spotting. I don't know whether these
questions have already been asked (or, more importantly, answered), but I
couldn't find them anywhere, so I am posting them here:
** I am able to train with word models, where the dictionary maps each word to itself and the phone file contains the list of words. But when I run pocketsphinx_continuous, it reports unknown phones; it refuses to treat words as phones. This problem did not occur with the sphinx3 decoder. Does pocketsphinx not support word models, or am I doing something wrong? Here is the exact form of the error:
ERROR: "dict.c", line 556: 'THE': Unknown phone 'THE'
ERROR: "dict.c", line 440: Failed to add THE to dictionary
ERROR: "dict.c", line 556: 'THERE': Unknown phone 'THERE'
ERROR: "dict.c", line 440: Failed to add THERE to dictionary
** Regarding non-speech events: I have added them to the filler dictionary and I want them to be recognized as filler words rather than as words from the grammar. Do I need to train with these non-speech events (e.g. tap, breath, clap), or is support for recognizing fillers built into pocketsphinx? Also, -fillprob isn't helping my cause: whenever such a non-speech event occurs, it is recognized as a speech word with a high confidence score (i.e. pprob).
Here is my filler dictionary (a sketch of the decoder invocation follows it):
++BADRECORDING++ +GARBAGE+
++BEEP++ +NOISE+
++CHAIRSQUEAK++ +NOISE+
++CLICK++ +SMACK+
++COUGH++ +NOISE+
++CROSSTALK++ +GARBAGE+
++DISKNOISE++ +NOISE+
++DOOROPEN++ +NOISE+
++DOORSLAM++ +NOISE+
++EXHALATION++ +BREATH+
++FOOTSTEPS++ +NOISE+
++KNOCKING++ +NOISE+
++LAUGHING++ +NOISE+
++LAUGHTER++ +NOISE+
++LIPSMACK++ +SMACK+
++LOUDBREATH++ +BREATH+
++MICROPHONEMVT++ +NOISE+
++MISCNOISE++ +NOISE+
++MOVEMENT++ +NOISE+
++OTHERMOUTHSOUND++ +NOISE+
++PAPERRUSTLE++ +NOISE+
++PHONERING++ +NOISE+
++POORMICPOSITION++ +GARBAGE+
++SIGH++ +BREATH+
++SINGING++ +GARBAGE+
++SLOW++ +GARBAGE+
++SNIFF++ +BREATH+
++SOFT++ +GARBAGE+
++SQUEAK++ +NOISE+
++TAP++ +NOISE+
++THROATCLEAR++ +BREATH+
++THUMP++ +NOISE+
++TONGUECLICK++ +SMACK+
++TYPING++ +NOISE+
++UNINTELLIGIBLE++ +GARBAGE+
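For context, the decoder invocation is along these lines (the paths and the 1e-2 value are placeholders rather than my exact settings, and the grammar is shown here as JSGF):
pocketsphinx_continuous \
    -hmm word_model_dir \
    -dict words.dict \
    -fdict fillers.dict \
    -jsgf commands.gram \
    -fillprob 1e-2 \
    -infile test.wav \
    -logfn decode.log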
Thanks in advance.
> Does pocketsphinx not support word models, or am I doing something wrong?
You are doing something wrong. Most likely you are using the wrong acoustic model, one that contains none of the phones in your dictionary. You should learn to provide the full log instead of a tiny excerpt.
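One way to check the mismatch is to compare the units your decoding dictionary references against the phone list the model was actually trained with, for example (file names are placeholders):
# every unit referenced by the dictionary (all fields after the word itself)
awk '{for (i = 2; i <= NF; i++) print $i}' words.dict | sort -u > dict_units.txt
# the phone list the model was trained with
sort -u words.phone > model_units.txt
# anything present only in dict_units.txt is a phone the model does not know
diff dict_units.txt model_units.txt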
Another thing you are definitely doing wrong is using a single phone per word, which gives words of different lengths the same number of states. That is not an optimal approach to model training.
> Do I need to train with these non-speech events (e.g. tap, breath, clap),
or is support for recognizing fillers built into pocketsphinx?
The model should have such phones trained.
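In practice that means the filler phones from your filler dictionary (+NOISE+, +BREATH+, +SMACK+, +GARBAGE+) have to be in the phone list used for training, and the training transcripts have to mark where those events occur, roughly like this (utterance ID and words are illustrative):
<s> ++COUGH++ THE THERE ++LIPSMACK++ </s> (utt_0001)
The trainer then estimates an HMM for each filler phone, and the decoder can hypothesize them through the filler dictionary.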
Thanks for the prompt response.
> Another thing you are definitely doing wrong is using a single phone per
word, which gives words of different lengths the same number of states.
That is not an optimal approach to model training.
What is the remedy? I mean, how can I make the number of states per word vary
in proportion to the length of the word?
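Just to check my understanding: is the remedy to move from whole-word units to phone units, so that a longer word naturally gets more states? For example (the pronunciations below are my rough cmudict-style guesses, not entries I actually have):
THE DH AH
THERE DH EH R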
Edit: Regarding non-speech events, what is the correct procedure for training
fillers? I am at a loss to figure out the right way to do this, and it is
hurting my recognition accuracy badly.