I would like to use pocketsphinx as a hotword detector for a very small set of commands (e.g. "hello world", "hey world", or "hi world"), to be detected in continuous speech, targeting telephony applications.
I started with an off-the-shelf acoustic model (standard US English) and kws mode for keyword spotting. The major problem I see is that detection accuracy varies from person to person, and no single keyword-threshold setting works for all users.
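For reference, the keyphrase list that pocketsphinx kws mode consumes is one phrase per line with an optional per-phrase threshold. A minimal sketch of generating one (the threshold values are illustrative starting points, not tuned numbers):

```python
# Sketch: write a keyphrase list for pocketsphinx kws mode.
# Thresholds here are illustrative and need per-deployment tuning.
keyphrases = {
    "hello world": "1e-20",
    "hey world": "1e-20",
    "hi world": "1e-20",
}

with open("keyphrase.list", "w") as f:
    for phrase, threshold in keyphrases.items():
        # pocketsphinx expects: "<phrase> /<threshold>/"
        f.write(f"{phrase} /{threshold}/\n")
```

The list is then passed to the decoder with `-kws keyphrase.list` in place of a language model.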
The first problem I noticed is a lot of false detections in noisy environments, even when the user is not speaking. To cope with this I increased the VAD threshold and saw a good improvement.
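The VAD threshold can be raised from the command line. A hedged example invocation (the file names and the value 3.0 are placeholders; the pocketsphinx default for `-vad_threshold` is 2.0, and higher values are stricter about what counts as speech):

```
pocketsphinx_continuous -inmic yes \
    -kws keyphrase.list \
    -vad_threshold 3.0
```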
The second problem: when I test with my own accent, which is not a perfect US accent but somewhat close, accuracy is around 60-70%; with a proper US accent it is around 75%. Decreasing the keyword thresholds improves accuracy but increases false detections as well; the other way round, detection stops happening at all. In summary, accuracy differs from speaker to speaker for the same keyword utterances with the same keyword thresholds.
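One way to make the per-speaker tuning less ad hoc is to sweep the threshold over held-out labelled recordings and read off the detection/false-alarm trade-off for each candidate value. A minimal sketch with made-up scores (in real use the scores would come from the decoder's kws output):

```python
# Sketch: sweep a detection threshold over labelled trials to pick an
# operating point per speaker. Scores and labels below are made up.
def sweep(trials, thresholds):
    """Return (threshold, recall, false_alarms) for each candidate threshold."""
    n_pos = sum(1 for _, is_keyword in trials if is_keyword)
    results = []
    for t in thresholds:
        hits = sum(1 for s, is_keyword in trials if is_keyword and s >= t)
        false_alarms = sum(1 for s, is_keyword in trials if not is_keyword and s >= t)
        results.append((t, hits / n_pos, false_alarms))
    return results

# (score, was-this-actually-the-keyword) pairs -- illustrative only
trials = [(0.9, True), (0.7, True), (0.4, True), (0.8, False), (0.3, False)]
for t, recall, fa in sweep(trials, thresholds=[0.5, 0.75]):
    print(f"threshold={t}: recall={recall:.2f}, false alarms={fa}")
```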
As suggested in the tutorial, I tried to adapt the acoustic model, augmenting it with utterances of the keywords from various speakers. This improved accuracy by 5-10%, but false detections went up as well.
As a last resort I am trying to train an acoustic model from scratch with only the command set (hello world / hey world / hi world) from various speakers in multiple accents. I am not sure this is a wise approach. I need some insight on how to achieve good accuracy on a small vocabulary in keyword spotting.
I would like to use pocketsphinx as a hotword detector for a very small set of commands (e.g. "hello world", "hey world", or "hi world"), to be detected in continuous speech, targeting telephony applications.
It is better to use something neural network based like honk: https://github.com/castorini/honk
As a last resort I am trying to train an acoustic model from scratch with only the command set (hello world / hey world / hi world) from various speakers in multiple accents. I am not sure this is a wise approach.
Keyword spotting training requires you to provide background data too. It is actually critical to have background data in the model.
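To make this concrete: for a honk-style trainer the data is typically organized as keyword clips plus explicit negatives, with everything that is not a target keyword collapsed into one background class. A hypothetical sketch (the directory-per-label layout and names are assumptions, loosely following the Speech Commands convention):

```python
import os
import tempfile

# Sketch: a KWS training set must mix keyword clips with background/negative
# clips, or the model has nothing to learn to reject.
def build_manifest(root, keywords):
    """Map each wav under root to its label, folding non-keywords into one class."""
    manifest = []
    for label in sorted(os.listdir(root)):
        cls = label if label in keywords else "_background_"
        for wav in sorted(os.listdir(os.path.join(root, label))):
            manifest.append((os.path.join(root, label, wav), cls))
    return manifest

# tiny demo layout with one keyword folder and two negative folders
root = tempfile.mkdtemp()
for label in ["hello_world", "noise", "other_speech"]:
    os.makedirs(os.path.join(root, label))
    open(os.path.join(root, label, "u1.wav"), "w").close()

for path, cls in build_manifest(root, keywords={"hello_world"}):
    print(cls)
```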
I need some insight on how to achieve good accuracy on a small vocabulary in keyword spotting.
You can check the link above and the paper cited there.
I was able to train with a WER of 0%; however, detection is not happening after I integrated the model into the live application.
Please note I have a very small vocabulary (3 sentences of 2 words each) and I am training on around 0.01 hours of audio uttering exactly those sentences and nothing else.
Although the CMU training initially reported insufficient data, I modified verify_all.pl to accept the current data size as valid. With this change training ran, and the decode phase reported a WER of 0, i.e. it was able to recognize the sentences.
However, I see one error in the logs very often:
ERROR: "gauden.c", line 1682: Variance (mgau= 6, feat= 0, density=7, component=15) is less then 0. Most probably the number of senones is too high for such a small training database. Use smaller $CFG_N_TIED_STATES.
The initial value of CFG_N_TIED_STATES was 200; I tried reducing it from 200 to 50, 20, 10, and all the way down to 1, but the error persists.
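Besides the senone count, the number of Gaussian densities per state also has to come down with this little data; the `density=7` in the error suggests 8-density mixtures are being trained. A hedged sphinx_train.cfg fragment (the values are guesses for a tiny corpus, not recommendations):

```
# sphinx_train.cfg fragment -- illustrative values for a very small corpus
$CFG_HMM_TYPE = '.cont.';         # continuous acoustic models
$CFG_N_TIED_STATES = 10;          # few senones for a 3-phrase vocabulary
$CFG_FINAL_NUM_DENSITIES = 1;     # single Gaussian per state instead of 8
```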
Keyword detection is not happening at all with the trained model.
I am sharing the logs. Please let me know if there is something evident that I am doing wrong or inappropriate.