
Adapting model for recognition when whispering

  • zacos

    zacos - 2012-07-30

I'm trying to do a simple voice-to-text mapping using pocketsphinx. The
grammar is very simple, for example:

public <grammar> = (Matt | Anna | Tom | Christine)+ (One | Two | Three | Four |
Five | Six | Seven | Eight | Nine | Zero)+ ;

e.g.:
    Tom Anna Three Three
    yields
    Tom Anna 33

I adapted the acoustic model (to take into account my foreign accent) and
after that I got decent performance (~94% accuracy). I used a training
dataset of ~3 minutes.
Right now I'm trying to do the same but whispering into the microphone. The
accuracy dropped significantly, to ~50% without training. With training for my
accent I got ~60%. I tried other things, including denoising and boosting the
volume. I read the whole documentation, but was wondering if anyone could
answer some questions so I can better understand which direction to take to
improve performance.

1) In the tutorial you adapt the hub4wsj_sc_8k acoustic model. I guess "8k" is
a sampling parameter. When using sphinx_fe you use "-samprate 16000".
Was it deliberate to train the 8k model using data with a 16k sampling rate?
Why wasn't data with 8k sampling used? Does it influence performance?
2) In sphinx 4.1 (in comparison to pocketsphinx) there are different acoustic
models, e.g. WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.jar.
Can those models be used with pocketsphinx? Will an acoustic model with 16k
sampling typically perform better on data with a 16k sampling rate?
3) When using data for training, should I use recordings in normal speaking
mode (to adapt only to my accent) or in whisper mode (to adapt to whispering
and my accent)?
I think I tried both scenarios and didn't notice enough difference to draw any
conclusion, but I don't know the pocketsphinx internals, so I might be doing
something wrong.
4) I used the following script to record the adaptation training and testing
data from the tutorial:

# Read one prompt sentence per recording from arctic20.txt, echo it,
# then record 16 kHz, 16-bit, mono audio until interrupted with Ctrl-C.
for i in `seq 1 20`; do
    fn=`printf arctic_%04d $i`
    read sent; echo $sent
    rec -r 16000 -e signed-integer -b 16 -c 1 $fn.wav 2>/dev/null
done < arctic20.txt
    

I noticed that each time I hit Control-C, the keypress is audible in the
recorded audio, which led to errors. Trimming the audio sometimes corrected
them and sometimes led to other errors instead. Is there any requirement that
each recording have a few seconds of quiet before and after speaking?

5) When accumulating observation counts, are there any settings I can tinker
with to improve performance?
6) What's the difference between semi-continuous and continuous models? Can
pocketsphinx use a continuous model?
7) I noticed that the 'mixture_weights' file from sphinx4 is much smaller
compared to the one in pocketsphinx-extra. Does it make any difference?
8) I tried different combinations of removing white noise (using the 'sox'
toolkit, e.g. sox noisy.wav filtered.wav noisered profile.nfo 0.1). Depending
on the last parameter, it sometimes improved things a little (~3%) and
sometimes made them worse. Is it good to remove noise, or is that something
pocketsphinx does as well? My environment is quiet; there is only white noise,
which I guess can have more impact when the audio is recorded whispering.
9) I noticed that boosting the volume (gain) alone most of the time made the
performance a little bit worse, even though it was easier for humans to
distinguish the words. Should I avoid it?
10) Overall, I tried different combinations, and the best result I got is
~65%, with noise removal only, so only a slight (5%) improvement. Below are
some stats:

//ORIGINAL UNPROCESSED TESTING FILES
    TOTAL Words: 111 Correct: 72 Errors: 43
    TOTAL Percent correct = 64.86% Error = 38.74% Accuracy = 61.26%
    TOTAL Insertions: 4 Deletions: 13 Substitutions: 26
    
    
    //DENOISED + VOLUME UP
    TOTAL Words: 111 Correct: 76 Errors: 42
    TOTAL Percent correct = 68.47% Error = 37.84% Accuracy = 62.16%
    TOTAL Insertions: 7 Deletions: 4 Substitutions: 31
    
    
    //VOLUME UP
    TOTAL Words: 111 Correct: 69 Errors: 47
    TOTAL Percent correct = 62.16% Error = 42.34% Accuracy = 57.66%
    TOTAL Insertions: 5 Deletions: 12 Substitutions: 30
    
    //DENOISE, threshold 0.1
    TOTAL Words: 111 Correct: 77 Errors: 41
    TOTAL Percent correct = 69.37% Error = 36.94% Accuracy = 63.06%
    TOTAL Insertions: 7 Deletions: 3 Substitutions: 31
    
    
    //DENOISE, threshold 0.21
    TOTAL Words: 111 Correct: 80 Errors: 38
    TOTAL Percent correct = 72.07% Error = 34.23% Accuracy = 65.77%
    TOTAL Insertions: 7 Deletions: 3 Substitutions: 28
    

I applied this processing only to the testing data. Should the training data
be processed in the same way? I think I tried that, but there was barely any
difference.
11) In all this testing I used an ARPA language model. When using JSGF,
results were usually much worse (I have the latest pocketsphinx branch). Why
is that?
12) Because in each sentence the maximum number would be '999' and there would
be no more than 3 names, I modified the JSGF and replaced the repetition sign
'+' by manually repeating the content in the parentheses. This time the
results were much closer to ARPA. Is there any way in a grammar to give a
maximum number of repetitions, like in a regular expression?
13) When using the ARPA model, I generated it from all possible combinations
(since the dictionary is fixed and really small: ~15 words), but during
testing I was still sometimes receiving illegal results, e.g. Tom Anna
(without any required number). Is there any way to enforce some structure
using an ARPA model?
14) Should the dictionary be limited to only those ~15 words, or will a full
dictionary only affect speed, not performance?
15) Is modifying the dictionary (phonemes) the way to go to improve
recognition when whispering? (I'm not an expert, but when we whisper I guess
some words might sound different?)
16) Any other tips on how to improve accuracy would be really helpful!

     
  • Nickolay V. Shmyrev

Was it deliberate to train the 8k model using data with a 16k sampling
rate?

    Yes

Why wasn't data with 8k sampling used?

It's a common denominator across the sample rates that are in use.

Does it influence performance?

    Yes

    Can those models be used with pocketsphinx?

    Yes

Will an acoustic model with 16k sampling typically perform better on data
with a 16k sampling rate?

No, sample rate has less effect than other possible distortions.
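
What matters in practice is that the -samprate you pass to sphinx_fe matches
your recordings. A minimal sketch of the feature-extraction step, assuming the
tutorial's hub4wsj_sc_8k model directory and the arctic file names:

sphinx_fe -argfile hub4wsj_sc_8k/feat.params \
    -samprate 16000 -c arctic20.fileids \
    -di . -do . -ei wav -eo mfc -mswav yes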

When using data for training, should I use recordings in normal speaking
mode (to adapt only to my accent) or in whisper mode (to adapt to whispering
and my accent)?

Whisper requires a totally different acoustic model; it's not reasonable to
use adaptation for whisper.

I noticed that each time I hit Control-C, the keypress is audible in the
recorded audio, which led to errors. Trimming the audio sometimes corrected
them and sometimes led to other errors instead. Is there any requirement that
each recording have a few seconds of quiet before and after speaking?

The length of the silence at the ends must be 0.5 seconds.
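
If your takes are trimmed too tightly, one way to add that silence afterwards
is sox's pad effect; a sketch, assuming the arctic file names from the
recording script above:

for f in arctic_*.wav; do
    # prepend and append 0.5 s of silence
    sox "$f" "padded_$f" pad 0.5 0.5
done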

When accumulating observation counts, are there any settings I can tinker
with to improve performance?

    No

What's the difference between semi-continuous and continuous models?

Semi-continuous models use a different senone scoring method.

Can pocketsphinx use a continuous model?

    Yes
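
Switching models is just a matter of pointing -hmm at the model directory; a
minimal sketch, where the model path and the grammar, dictionary, and audio
file names are placeholders:

pocketsphinx_continuous -hmm /path/to/continuous_model \
    -jsgf names.gram -dict names.dic -infile test.wav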

I noticed that the 'mixture_weights' file from sphinx4 is much smaller
compared to the one in pocketsphinx-extra. Does it make any difference?

For continuous models the mixture weights file is smaller.

Is it good to remove noise, or is that something pocketsphinx does as well?

Pocketsphinx doesn't remove noise. The effect of a noise removal algorithm
depends on the type of the noise and on the algorithm itself. It doesn't
necessarily improve accuracy.
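
For reference, the two-step sox workflow being discussed looks roughly like
this; the assumption that the first 0.5 s of the recording is noise-only is
illustrative:

# build a noise profile from a noise-only stretch of audio
sox noisy.wav -n trim 0 0.5 noiseprof profile.nfo
# apply noise reduction; the last argument is the amount (0.0-1.0)
sox noisy.wav filtered.wav noisered profile.nfo 0.21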

I noticed that boosting the volume (gain) alone most of the time made the
performance a little bit worse, even though it was easier for humans to
distinguish the words. Should I avoid it?

It shouldn't matter unless you clip the audio.
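
If you do boost the gain, normalizing to just below full scale avoids
clipping; a sketch using sox's gain effect (the -3 dBFS headroom is an
arbitrary choice):

# normalize so the peak sits at -3 dBFS; the waveform cannot clip
sox in.wav louder.wav gain -n -3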

In all this testing I used an ARPA language model. When using JSGF, results
were usually much worse (I have the latest pocketsphinx branch). Why is that?

Without the audio it's hard to say.

Is there any way in a grammar to give a maximum number of repetitions, like
in a regular expression?

    No
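
JSGF has no regex-style {m,n} quantifier, but repetitions can be bounded
manually with nested optional groups, as the poster did. A sketch with
illustrative rule names:

#JSGF V1.0;
grammar names;

// up to 3 names followed by 1 to 3 digits
public <command> = <name> [<name> [<name>]] <digit> [<digit> [<digit>]];
<name> = Matt | Anna | Tom | Christine;
<digit> = One | Two | Three | Four | Five | Six | Seven | Eight | Nine | Zero;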

When using the ARPA model, I generated it from all possible combinations
(since the dictionary is fixed and really small: ~15 words), but during
testing I was still sometimes receiving illegal results, e.g. Tom Anna
(without any required number). Is there any way to enforce some structure
using an ARPA model?

    No

Should the dictionary be limited to only those ~15 words, or will a full
dictionary only affect speed, not performance?

Dictionary size only affects the memory required for decoding.

Is modifying the dictionary (phonemes) the way to go to improve recognition
when whispering? (I'm not an expert, but when we whisper I guess some words
might sound different?)

    Yes
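
A sketch of what that could look like in a CMUdict-style dictionary, adding an
alternate pronunciation as a numbered variant; the phoneme choices here are
illustrative guesses, not tested values:

TOM      T AA M
TOM(2)   T AO M
ANNA     AE N AH
ANNA(2)  AA N AH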

     
