
Adapting model for recognition when whispering

  • zacos

    zacos - 2012-07-30

I'm trying to do a simple voice-to-text mapping using pocketsphinx. The
grammar is very simple, for example:

public <grammar> = (Matt | Anna | Tom | Christine)+ (One | Two | Three | Four |
Five | Six | Seven | Eight | Nine | Zero)+ ;

e.g.:
    Tom Anna Three Three
    yields
    Tom Anna 33

I adapted the acoustic model (to take into account my foreign accent) and
after that I got decent performance (~94% accuracy). I used a training
dataset of ~3 minutes.
Right now I'm trying to do the same but whispering into the microphone. The
accuracy dropped significantly, to ~50% without training. With training for my
accent I got ~60%. I tried other things, including denoising and boosting the
volume. I read the whole documentation, but was wondering if anyone could
answer some questions so I can better understand which direction to take to
improve performance.

1) In the tutorial you adapt the hub4wsj_sc_8k acoustic model. I guess "8k" is
a sampling parameter. When using sphinx_fe you use "-samprate 16000".
Was it deliberate to train the 8k model using data with a 16k sampling rate?
Why wasn't data with 8k sampling used? Does it influence performance?
2) In sphinx 4.1 (in comparison to pocketsphinx) there are different acoustic
models, e.g. WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar.
Can those models be used with pocketsphinx? Will an acoustic model with 16k
sampling typically perform better on data with a 16k sampling rate?
3) When using data for training, should I use recordings in normal speaking
mode (to adapt only to my accent) or in whisper mode (to adapt to whispering
and my accent)?
I think I tried both scenarios and didn't notice enough difference to draw any
conclusion, but I don't know the pocketsphinx internals, so I might be doing
something wrong.
4) I used the following script to record the adaptation training and testing
data from the tutorial:

# Read one prompt sentence per recording from arctic20.txt, echo it,
# then record 16 kHz, 16-bit, mono audio until interrupted with Ctrl-C.
for i in `seq 1 20`; do
    fn=`printf arctic_%04d $i`
    read sent; echo $sent
    rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null
done < arctic20.txt
    

I noticed that each time I hit Control-C, the keypress is audible in the
recorded audio, which led to errors. Trimming the audio sometimes corrected
them and sometimes led to other errors instead. Is there any requirement that
each recording have a few seconds of quiet before and after speaking?

5) When accumulating observation counts, are there any settings I can tinker
with to improve performance?
6) What's the difference between semi-continuous and continuous models? Can
pocketsphinx use a continuous model?
7) I noticed that the 'mixture_weights' file from sphinx4 is much smaller
compared to the one in pocketsphinx-extra. Does it make any difference?
8) I tried different combinations of removing white noise (using the 'sox'
toolkit, e.g. sox noisy.wav filtered.wav noisered profile.nfo 0.1). Depending
on the last parameter, it sometimes improved things a little (~3%) and
sometimes made them worse. Is it good to remove noise, or is that something
pocketsphinx does as well? My environment is quiet; there is only white noise,
which I guess can have more impact when the audio is recorded whispering.
9) I noticed that boosting the volume (gain) alone most of the time made the
performance a little bit worse, even though it was easier for humans to
distinguish the words. Should I avoid it?
10) Overall, I tried different combinations, and the best result I got is
~65%, with noise removal only, so only a slight (5%) improvement. Below are
some stats:

//ORIGINAL UNPROCESSED TESTING FILES
    TOTAL Words: 111 Correct: 72 Errors: 43
    TOTAL Percent correct = 64.86% Error = 38.74% Accuracy = 61.26%
    TOTAL Insertions: 4 Deletions: 13 Substitutions: 26
    
    
    //DENOISED + VOLUME UP
    TOTAL Words: 111 Correct: 76 Errors: 42
    TOTAL Percent correct = 68.47% Error = 37.84% Accuracy = 62.16%
    TOTAL Insertions: 7 Deletions: 4 Substitutions: 31
    
    
    //VOLUME UP
    TOTAL Words: 111 Correct: 69 Errors: 47
    TOTAL Percent correct = 62.16% Error = 42.34% Accuracy = 57.66%
    TOTAL Insertions: 5 Deletions: 12 Substitutions: 30
    
    //DENOISE, threshold 0.1
    TOTAL Words: 111 Correct: 77 Errors: 41
    TOTAL Percent correct = 69.37% Error = 36.94% Accuracy = 63.06%
    TOTAL Insertions: 7 Deletions: 3 Substitutions: 31
    
    
    //DENOISE, threshold 0.21
    TOTAL Words: 111 Correct: 80 Errors: 38
    TOTAL Percent correct = 72.07% Error = 34.23% Accuracy = 65.77%
    TOTAL Insertions: 7 Deletions: 3 Substitutions: 28
    

I applied this processing only to the testing data. Should the training data
be processed in the same way? I think I tried that, but there was barely any
difference.
11) In all this testing I used an ARPA language model. When using JSGF,
results were usually much worse (I have the latest pocketsphinx branch). Why
is that?
12) Because in each sentence the maximum number would be '999' and there would
be no more than 3 names, I modified the JSGF and replaced the repetition sign
'+' by manually repeating the content in the parentheses. This time the
results were much closer to ARPA. Is there any way in a grammar to give a
maximum number of repetitions, like in a regular expression?
13) When using the ARPA model, I generated it from all possible combinations
(since the dictionary is fixed and really small: ~15 words), but during
testing I was still sometimes receiving illegal results, e.g. Tom Anna
(without any required number). Is there any way to enforce some structure
using an ARPA model?
14) Should the dictionary be limited to only those ~15 words, or will a full
dictionary only affect speed, not performance?
15) Is modifying the dictionary (phonemes) the way to go to improve
recognition when whispering? (I'm not an expert, but when we whisper I guess
some words might sound different?)
16) Any other tips on how to improve accuracy would be really helpful!

     
  • Nickolay V. Shmyrev

Was it deliberate to train the 8k model using data with a 16k sampling
rate?

    Yes

Why wasn't data with 8k sampling used?

It's a common denominator across the sample rates that are in use.

Does it influence performance?

    Yes

    Can those models be used with pocketsphinx?

    Yes

Will an acoustic model with 16k sampling typically perform better on data
with a 16k sampling rate?

No, sample rate has less effect than other possible distortions.
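
What matters in practice is that the -samprate you pass to sphinx_fe matches
your recordings. A minimal sketch of the feature-extraction step, assuming the
tutorial's hub4wsj_sc_8k model directory and the arctic file names:

sphinx_fe -argfile hub4wsj_sc_8k/feat.params \
    -samprate 16000 -c arctic20.fileids \
    -di . -do . -ei wav -eo mfc -mswav yes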

When using data for training, should I use recordings in normal speaking
mode (to adapt only to my accent) or in whisper mode (to adapt to whispering
and my accent)?

Whisper requires a totally different acoustic model; it's not reasonable to
use adaptation for whisper.

I noticed that each time I hit Control-C, the keypress is audible in the
recorded audio, which led to errors. Trimming the audio sometimes corrected
them and sometimes led to other errors instead. Is there any requirement that
each recording have a few seconds of quiet before and after speaking?

The length of the silence at the ends must be 0.5 seconds.
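
If your takes are trimmed too tightly, one way to add that silence afterwards
is sox's pad effect; a sketch, assuming the arctic file names from the
recording script above:

for f in arctic_*.wav; do
    # prepend and append 0.5 s of silence
    sox "$f" "padded_$f" pad 0.5 0.5
done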

When accumulating observation counts, are there any settings I can tinker
with to improve performance?

    No

What's the difference between semi-continuous and continuous models?

Semi-continuous models use a different senone scoring method.

Can pocketsphinx use a continuous model?

    Yes
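
Switching models is just a matter of pointing -hmm at the model directory; a
minimal sketch, where the model path and the grammar, dictionary, and audio
file names are placeholders:

pocketsphinx_continuous -hmm /path/to/continuous_model \
    -jsgf names.gram -dict names.dic -infile test.wav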

I noticed that the 'mixture_weights' file from sphinx4 is much smaller
compared to the one in pocketsphinx-extra. Does it make any difference?

For continuous models the mixture weights file is smaller.

Is it good to remove noise, or is that something pocketsphinx does as well?

Pocketsphinx doesn't remove noise. The effect of a noise removal algorithm
depends on the type of the noise and on the algorithm itself. It doesn't
necessarily improve accuracy.
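
For reference, the two-step sox workflow being discussed looks roughly like
this; the assumption that the first 0.5 s of the recording is noise-only is
illustrative:

# build a noise profile from a noise-only stretch of audio
sox noisy.wav -n trim 0 0.5 noiseprof profile.nfo
# apply noise reduction; the last argument is the amount (0.0-1.0)
sox noisy.wav filtered.wav noisered profile.nfo 0.21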

I noticed that boosting the volume (gain) alone most of the time made the
performance a little bit worse, even though it was easier for humans to
distinguish the words. Should I avoid it?

It shouldn't matter unless you clip the audio.
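
If you do boost the gain, normalizing to just below full scale avoids
clipping; a sketch using sox's gain effect (the -3 dBFS headroom is an
arbitrary choice):

# normalize so the peak sits at -3 dBFS; the waveform cannot clip
sox in.wav louder.wav gain -n -3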

In all this testing I used an ARPA language model. When using JSGF, results
were usually much worse (I have the latest pocketsphinx branch). Why is that?

Without the audio it's hard to say.

Is there any way in a grammar to give a maximum number of repetitions, like
in a regular expression?

    No
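
JSGF has no regex-style {m,n} quantifier, but repetitions can be bounded
manually with nested optional groups, as the poster did. A sketch with
illustrative rule names:

#JSGF V1.0;
grammar names;

// up to 3 names followed by 1 to 3 digits
public <command> = <name> [<name> [<name>]] <digit> [<digit> [<digit>]];
<name> = Matt | Anna | Tom | Christine;
<digit> = One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Zero;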

When using the ARPA model, I generated it from all possible combinations
(since the dictionary is fixed and really small: ~15 words), but during
testing I was still sometimes receiving illegal results, e.g. Tom Anna
(without any required number). Is there any way to enforce some structure
using an ARPA model?

    No

Should the dictionary be limited to only those ~15 words, or will a full
dictionary only affect speed, not performance?

Dictionary size only affects the memory required for decoding.

Is modifying the dictionary (phonemes) the way to go to improve recognition
when whispering? (I'm not an expert, but when we whisper I guess some words
might sound different?)

    Yes
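
A sketch of what that could look like in a CMUdict-style dictionary, adding an
alternate pronunciation as a numbered variant; the phoneme choices here are
illustrative guesses, not tested values:

TOM      T AA M
TOM(2)   T AO M
ANNA     AE N AH
ANNA(2)  AA N AH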

     
