I am currently working on my bachelor's thesis, trying to improve the accuracy of hotword recognition in PocketSphinx on Android.
At first I considered implementing a noise-reduction algorithm, but after reading the FAQ my first attempt was adapting the acoustic model instead.
I followed the steps described here: I recorded more than 150 .wav files in a noisy environment (a car) and around 20 'without' noise.
I used pocketsphinx_batch.exe on Windows 10 Pro N 64-bit, following this tutorial.
The first problem: in every sentence in the .transcription file (two words per sentence/hotword activation), each word was recognized on its own. For example, with "My Hotword", sometimes only 'My' and sometimes only 'Hotword' was recognized. Is there a way to recognize both together as a single unit? If yes, how?
Nevertheless, I tried to improve accuracy by adapting the model with the noiseless data.
I used most of the parameters as described, but replaced -ts2cbfn .ptm. with -ts2cbfn .cont. and replaced -svspec 0-12/13-25/26-38 with -lda en-us/feature-transform.
Is adapting the acoustic model the appropriate way to improve keyword listening? If not, please give me another hint.
After adapting, accuracy was calculated with pocketsphinx_batch.exe and the word_align.pl Perl script.
Before adaptation the accuracy was around 30%; after adaptation, 100%.
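For reference, word_align.pl derives word accuracy from an edit-distance alignment of the hypothesis against the reference. A minimal sketch of the same computation (plain word-level Levenshtein, not the actual script):

```python
def word_accuracy(reference, hypothesis):
    """Word accuracy = 1 - WER, from a word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return 1.0 - dp[len(ref)][len(hyp)] / len(ref)

# e.g. only one of the two hotword words recognized:
print(word_accuracy("my hotword", "my"))  # 0.5
```

Since insertions also count as errors, this figure can drop below zero on bad output, which is what the negative accuracy on the noisy data means.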
After that I measured accuracy with pocketsphinx_batch.exe on the noisy data, which came out negative. Then I adapted the model from the first adaptation with the 150 noisy .wav files and ran the calculation again, now getting 81% accuracy.
Finally, I ran the calculation on the second (noisy) adaptation with the noiseless data again => 33% accuracy. I assume the 150 noisy files are weighted more heavily than the 20 noiseless files.
Is that correct?
If this is the appropriate way to improve accuracy, is it better to have the same amount of noiseless and noisy data (for example, 150 of each) for the best overall accuracy?
I haven't tested the accuracy in the Android app yet, but I hope it will be better than before.
Thank you in advance for your answers!
The first problem: in every sentence in the .transcription file (two words per sentence/hotword activation), each word was recognized on its own. For example, with "My Hotword", sometimes only 'My' and sometimes only 'Hotword' was recognized. Is there a way to recognize both together as a single unit? If yes, how?
It depends on the language model you are using for decoding.
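For a fixed hotword it is usually better to run PocketSphinx in keyword-spotting mode rather than full language-model decoding. A sketch of the two usual options (the threshold value and the phone sequence for 'hotword' are illustrative, not taken from cmudict):

```
# keyphrase.list -- one phrase per line, with a per-phrase detection threshold
my hotword /1e-20/

# or add a compound entry to the dictionary so the phrase decodes as one token
# (phones concatenated from the individual words, for illustration only)
my-hotword  M AY HH AA T W ER D
```

On Android the first option corresponds to setting up a keyphrase/keyword search on the recognizer instead of an LM search.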
Is adapting the acoustic model the appropriate way to improve keyword listening? If not, please give me another hint.
It is an acceptable approach; in the end it all comes down to the accuracy of the decoder. The most productive way, though, is to implement a more advanced DNN-based acoustic model, not adaptation.
is it better to have the same amount of noiseless and noisy data (for example, 150 of each) for the best overall accuracy?
The data for adaptation should match the actual data you will process with your application. If it is all going to be noisy, then you need only noisy data.
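On the weighting question: MAP adaptation interpolates the prior model parameters with statistics gathered from the adaptation data, so more adaptation frames pull the model harder toward that data. A toy illustration of the idea for a single Gaussian mean (not the actual sphinxtrain math; tau is a hypothetical prior weight):

```python
def map_mean(prior_mean, data, tau=10.0):
    """MAP-style update: prior with weight tau blended with n observed samples."""
    n = len(data)
    return (tau * prior_mean + sum(data)) / (tau + n)

prior = 0.0           # mean from the original acoustic model
noisy = [1.0] * 150   # 150 noisy adaptation samples around 1.0
clean = [0.2] * 20    # 20 clean samples around 0.2

# more data dominates the prior: 150 samples pull far harder than 20 would
print(map_mean(prior, noisy))  # 0.9375
print(map_mean(prior, clean))  # ~0.133
```

This is why adapting on 150 noisy files after 20 clean ones leaves the model biased toward the noisy condition.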
It depends on the language model you are using for decoding.
I use en-us.lm.bin from PocketSphinx; both words are in cmudict-en-us.dict by default.
Let's say I use 'my hotword' as the hotword.
In the .transcription file I would write it as <s> my hotword </s> (wavfile01), but I don't want it to be counted as two words (one correct, one error), but as one unit (correct or false).
How does the language model affect this?
Last edit: Jean Chung 2017-05-30
In the .transcription file I would write it as <s> my hotword </s> (wavfile01), but I don't want it to be counted as two words (one correct, one error), but as one unit (correct or false).
You can write your own code for that. Overall, keyword-spotting performance is measured with different tools; you need the F-score, not word error rate.
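As a sketch of that metric for keyword spotting (it counts detections, not words; the detection counts below are made up for illustration):

```python
def keyword_f_score(true_positives, false_positives, false_negatives):
    """F1 score: harmonic mean of precision and recall over detections."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# e.g. 18 hotword activations detected correctly, 4 false alarms, 2 missed:
print(keyword_f_score(18, 4, 2))  # ~0.857
```

Each hotword utterance then scores as a whole (detected or not), which is exactly the per-phrase counting you were asking for.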
Okay, I'll try that. Thank you!