Hi, I'm working with pocketsphinx. I have configured it as per the tutorial and I am getting decent accuracy on the test file, though as you can see it's not a perfect fit (as expected). I'm looking to focus on phoneme recognition, but to limit it to particular phonemes, or just to the initial phonemes of words (in the example below that would be G, F, etc.).
Is it possible to train a model to focus on particular phonemes, or just phonemes at the beginning of words? Or is there a particular configuration that would help me?
Also, the confidence is 1.0 for every phoneme; does pocketsphinx not deliver confidence scores for phonemes? I was calculating the confidence using the word-confidence code from pocketsphinx_continuous.c.
Phonemes
~~~~
config = cmd_ln_init(NULL, ps_args(), TRUE, "-hmm", MODELDIR "/en-us/en-us", "-allphone", MODELDIR "/en-us/en-us-phone.lm.dmp", "-backtrace", "yes", "-beam", "1e-20", "-pbeam", "1e-20", "-lw", "2.0", NULL)
Recognized: SIL G OW F AO R W ER D T AE NG IY IH ZH ER Z S V SIL
SIL 0.000 0.450 1.000000
G 0.460 0.530 1.000000
OW 0.540 0.630 1.000000
F 0.640 0.770 1.000000
AO 0.780 0.850 1.000000
R 0.860 0.930 1.000000
W 0.940 1.000 1.000000
ER 1.010 1.110 1.000000
D 1.120 1.160 1.000000
T 1.170 1.300 1.000000
AE 1.310 1.390 1.000000
NG 1.400 1.560 1.000000
IY 1.570 1.660 1.000000
IH 1.670 1.700 1.000000
ZH 1.710 1.750 1.000000
ER 1.760 1.890 1.000000
Z 1.900 1.950 1.000000
S 1.960 2.100 1.000000
V 2.110 2.150 1.000000
SIL 2.160 2.600 1.000000
config = cmd_ln_init(NULL, ps_args(), TRUE, "-hmm", MODELDIR "/en-us/en-us", "-lm", MODELDIR "/en-us/en-us.lm.dmp", "-dict", MODELDIR "/en-us/cmudict-en-us.dict", NULL);
<s> 0.000 0.450 0.999900
go 0.460 0.630 0.999600
forward 0.640 1.160 0.999900
ten 1.170 1.520 0.102605
meters 1.530 2.110 0.297887
</s> 2.120 2.600 1.000000
~~~~~~
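For reference, this is roughly the harness I'm using to produce the listings above, modelled on the word-confidence code in pocketsphinx_continuous.c. It's a minimal sketch assuming the 5prealpha API (exact signatures differ a little between pocketsphinx versions) and a raw 16 kHz, 16-bit mono test file:

~~~~
/* Minimal sketch, assuming the 5prealpha API and that MODELDIR is set
 * on the compiler command line.  Decodes a raw 16 kHz, 16-bit mono file
 * in allphone mode and prints each phone segment with its times and
 * posterior, the way pocketsphinx_continuous.c prints word segments. */
#include <stdio.h>
#include <pocketsphinx.h>

int main(int argc, char *argv[])
{
    cmd_ln_t *config;
    ps_decoder_t *ps;
    ps_seg_t *seg;
    const char *hyp;
    FILE *fh;

    config = cmd_ln_init(NULL, ps_args(), TRUE,
                         "-hmm", MODELDIR "/en-us/en-us",
                         "-allphone", MODELDIR "/en-us/en-us-phone.lm.dmp",
                         "-backtrace", "yes",
                         "-beam", "1e-20",
                         "-pbeam", "1e-20",
                         "-lw", "2.0",
                         NULL);
    ps = ps_init(config);

    if (argc < 2 || (fh = fopen(argv[1], "rb")) == NULL)
        return 1;
    ps_decode_raw(ps, fh, -1);   /* -1 = read the whole file */
    fclose(fh);

    hyp = ps_get_hyp(ps, NULL);
    printf("Recognized: %s\n", hyp ? hyp : "(none)");

    /* Walk the segmentation.  ps_seg_prob() returns a log posterior,
     * which logmath_exp() turns into a linear probability -- this is
     * the value that comes back as 1.0 for every phone above. */
    for (seg = ps_seg_iter(ps); seg; seg = ps_seg_next(seg)) {
        int sf, ef;
        int32 post;

        ps_seg_frames(seg, &sf, &ef);
        post = ps_seg_prob(seg, NULL, NULL, NULL);
        printf("%s %.3f %.3f %f\n", ps_seg_word(seg),
               sf / 100.0, ef / 100.0,   /* default 100 frames/sec */
               logmath_exp(ps_get_logmath(ps), post));
    }

    ps_free(ps);
    cmd_ln_free_r(config);
    return 0;
}
~~~~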
Not sure what you mean by "focus" here. Accurate phoneme recognition is a hard problem; some phonemes are easier to recognize, while others are more easily confused.
Unfortunately, phone confidence is not supported yet.
Could you train a model so that it was better at, say, consonant phonemes than vowel phonemes?
Is it possible to trace phonemes back from recognised words instead? Could the word confidence be used to estimate the phoneme confidence?
No, there is no such thing. Phoneme confusion is mainly between acoustically similar pairs like Z/S, AH/IH, B/P or D/T; it is not a matter of vowels versus consonants.
No, the word recognizer does not track phonemes.
I do not think that is possible, sorry.
If you want to implement phoneme confidence, you can compute it from a phoneme lattice. You would have to write another search that keeps track of the phoneme lattice, though.
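To be concrete about what that would involve: the usual confidence measure in the literature is the posterior probability of each phone in the lattice, roughly

~~~~
                sum of P(O, path) over lattice paths that contain the phone
P(phone | O) = -------------------------------------------------------------
                sum of P(O, path) over all paths in the lattice
~~~~

computed with a forward-backward pass over the lattice with scaled-down acoustic scores. That is essentially what the word-level bestpath/posterior code already does to produce the word confidences in your second listing; the missing piece is an equivalent lattice over phones.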
Implementing phoneme confidence could be interesting. Do you happen to know of any additional info that would get me started on this?
You need to understand the theory of confidence scoring, at least from the following overview:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.93.6890
And you need to understand the code for lattice (DAG) construction in the decoder search, which is available in sphinx3 in the srch_allphone.c file.
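As a very rough starting point, the computation you would eventually run over such a phone lattice is ordinary forward-backward posterior estimation. Below is a toy sketch; the arc and lattice structures in it are invented purely for illustration (building a real phone lattice out of the allphone search is exactly the part that does not exist yet), so treat it as the shape of the algorithm rather than working pocketsphinx code.

~~~~
/* Toy forward-backward posterior computation over a hypothetical phone
 * lattice.  The arc_t type and the example lattice are made up for
 * illustration only. */
#include <math.h>
#include <stdio.h>

#define MAX_NODES 64
#define MAX_ARCS  256

typedef struct {                /* hypothetical phone arc */
    int from, to;               /* lattice node indices */
    const char *phone;          /* phone label, e.g. "G" */
    double score;               /* scaled acoustic+LM log score */
} arc_t;

static double logadd(double a, double b)    /* log(exp(a) + exp(b)) */
{
    if (a < b) { double t = a; a = b; b = t; }
    return a + log1p(exp(b - a));
}

/* alpha/beta hold the total log score of all partial paths reaching or
 * leaving each node; an arc's posterior is the mass of complete paths
 * through it divided by the mass of all paths. */
static void arc_posteriors(const arc_t *arcs, int n_arcs, int n_nodes,
                           int start, int end, double *post)
{
    double alpha[MAX_NODES], beta[MAX_NODES];
    int i, n;

    for (n = 0; n < n_nodes; n++)
        alpha[n] = beta[n] = -INFINITY;
    alpha[start] = 0.0;
    beta[end] = 0.0;

    /* forward pass: assumes arcs are listed in topological order */
    for (i = 0; i < n_arcs; i++)
        alpha[arcs[i].to] = logadd(alpha[arcs[i].to],
                                   alpha[arcs[i].from] + arcs[i].score);
    /* backward pass */
    for (i = n_arcs - 1; i >= 0; i--)
        beta[arcs[i].from] = logadd(beta[arcs[i].from],
                                    arcs[i].score + beta[arcs[i].to]);

    for (i = 0; i < n_arcs; i++)        /* posterior of each phone arc */
        post[i] = exp(alpha[arcs[i].from] + arcs[i].score
                      + beta[arcs[i].to] - alpha[end]);
}

int main(void)
{
    /* toy lattice: two competing phones (Z vs S) between nodes 1 and 2 */
    arc_t arcs[] = {
        { 0, 1, "ER",  -1.0 },
        { 1, 2, "Z",   -2.0 },
        { 1, 2, "S",   -2.5 },
        { 2, 3, "SIL", -1.0 },
    };
    double post[MAX_ARCS];
    int i, n_arcs = (int)(sizeof(arcs) / sizeof(arcs[0]));

    arc_posteriors(arcs, n_arcs, 4, 0, 3, post);
    for (i = 0; i < n_arcs; i++)
        printf("%-3s %.3f\n", arcs[i].phone, post[i]);
    return 0;
}
~~~~

Run over a real phone lattice with properly scaled acoustic and language scores, the per-arc posteriors are the phone confidences being asked about here.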