
Sampling GSM audio for Acoustic Model Training

850mph
2006-09-09
2012-09-22
  • 850mph

    850mph - 2006-09-09

    Sampling GSM audio for Acoustic Model Training

    Hello:

    I’m fortunate enough to have access to a large professional recording studio where multiple voice personalities breeze in and out every day to record radio commercials. A variety of these voices (about 50 people) have graciously agreed to help me record a small (about 20 words) command and control ASR vocabulary for a telephony application.

    THE PROJECT:

    Briefly:
    The (distributed) ASR side of our app uses the existing Perl interface to feed .wav audio files (resampled from GSM audio files) to SPHINX-II over a TCP/IP network connection. SPHINX-II decodes the audio and passes back a result hypothesis.

    Existing Platform:
    Acoustic Model: 8khz/8bit Communicator
    http://www.speech.cs.cmu.edu/sphinx/models/hmm/communicator-2000-11-17-2.tgz

    Perl Code: Speech::Recognizer::SPX
    http://search.cpan.org/~djhd/Speech-Recognizer-SPX-0.0602/SPX.pm

    Decoder: Sphinx-II
    http://rpm.pbone.net/index.php3/stat/4/idpl/669453/com/sphinx2-0.3-2.i386.rpm.html

    Platform: Fedora Core-3, Linux Kernel 2.6
    http://download.fedora.redhat.com/pub/fedora/linux/core/3/

    Example:
    The following example ALREADY works (with high WERs) with the following 8Khz/8Bit/Mono Communicator Acoustic Model and Perl TCP/IP Client/Server code:

    1) User SPEAKS audio Commands into cell phone connected to GSM network
    -- Limited GSM bandwidth discards most of the audio characteristics (data)

    2) Telephony Server SAVES GSM audio
    -- This file has a unique acoustic signature which is NOT the same as wav sampled audio

    3) Sox CONVERTS GSM audio into PCM .wav files
    -- Conversion creates a file which SPHINX-II can decode

    4) .wav files are FED to SPHINX-II over the network for hypothesis creation
    -- The existing Perl client-server interface works pretty well already for the (d)ASR

    5) Hypothesis is PASSED back to telephony server for Command execution
    -- Generally, the commands play pre-recorded audio back to the user

    THE AUDIO-CREATION/MODEL TRAINING PLAN:

    Obviously my main concern is that the mobile-GSM environment degrades the audio signal, leaving very few acoustic characteristics to input into the SPHINX-III Trainer.

    I would like to create the best possible acoustic model for my limited C&C vocabulary.

    I am MOST interested in group comments on the following “Audio-Creation/Model Training Plan” in the hopes of saving myself and others a bit of time:

    1) SAMPLE Command Set via ProTools-HD to wav
    -- File Format=.wav/linear 192.0khz/24bit/Mono NonCompressed approx. 4.6 Mb/s
    -- Linear sampling preserves all acoustic characteristics/lossless
    The theory here is that the source sampling should be archived at the highest possible quality.

    2) DOWN-SAMPLE Command Set via SOX to uLaw
    -- File Format=.ulaw/nonLinear 8khz/8bit/Mono Compressed to approx. 64 kb/s
    -- Convert to a wire-line POTS bitrate
    -- Semilogarithmic Compression/lossy
    The theory here is that this will preserve as much of the source audio's acoustic detail as possible for a POTS telephony environment.

    3) CONVERT Command Set via SOX to GSM
    -- File Format=GSM/nonlinear 8khz/13bit/Mono Compressed to approx. 13 kb/s
    -- Convert to a mobile wireless GSM bitrate
    -- Semilogarithmic Compression/lossy
    -- Non-linear compression within GSM codec creates unique acoustic characteristics
    The theory here is that the GSM conversion will significantly “distort” the original sampled source audio files, creating a spectrum (and corresponding Mel-cepstrum feature set) which will be closer to what the decoder actually receives when the app is live.

    4) UPSAMPLE Command Set via SOX to 16khz WAVE (RIFF)
    -- File Format=16.0khz/16bit/Mono NonCompressed approx. 256 kb/s
    -- Convert to format used by SPHINX-III TRAINER and corresponding DECODER
    -- Linear PCM re-sampling preserves the unique GSM audio characteristics/lossless
    The theory here is that although we obviously do not gain any acoustic definition by up-sampling, we do gain “bits” from the higher granularity, potentially enabling the Trainer to compare finer-resolution frames of sampled data when creating a semi-continuous HMM phone model.

    OR

    4) RESAMPLE Command Set via SOX to 8khz WAVE (RIFF)
    -- File Format=8.0khz/8bit/Mono NonCompressed approx. 64 kb/s
    -- Convert to format used by SPHINX-III TRAINER and corresponding DECODER
    -- Linear re-sampling preserves the unique GSM audio characteristics/lossless
    The theory here is that the Trainer will NOT gain any benefit from Upsampling the GSM audio files produced in Step 3.

    5) TRAIN Command Set with SphinxTrain
    -- The trainer accepts 16khz/16bit/Mono or 8khz/8bit/Mono WAVE or RAW files
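
The bitrates quoted in the plan are simply sample rate x bits per sample x channels. The short Python sketch below redoes that arithmetic for each step. It assumes the GSM codec in question is GSM 06.10 full rate (260 bits per 20 ms frame), which the thread does not state explicitly. Note that the product for the 192 kHz/24-bit master comes to about 4.6 Mb/s.

```python
def pcm_bitrate_kbps(sample_rate_hz: int, bits_per_sample: int, channels: int = 1) -> float:
    """Raw bitrate in kb/s: sample rate x sample width x channels."""
    return sample_rate_hz * bits_per_sample * channels / 1000.0


# (sample rate in Hz, bits per sample) for the mono formats named in the plan.
formats = {
    "step 1: 192 kHz / 24-bit linear": (192_000, 24),
    "step 2: 8 kHz / 8-bit u-law": (8_000, 8),
    "step 4 (upsampled): 16 kHz / 16-bit linear": (16_000, 16),
    "step 4 (resampled): 8 kHz / 8-bit linear": (8_000, 8),
}

for label, (rate, bits) in formats.items():
    print(f"{label}: {pcm_bitrate_kbps(rate, bits):.0f} kb/s")

# Assumption: GSM 06.10 full rate, i.e. 260 bits per 20 ms frame (50 frames/s).
gsm_kbps = 260 * 50 / 1000.0
print(f"step 3: GSM full rate: {gsm_kbps:.0f} kb/s")
```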

    Have I “over-thought” this procedure?

    Can I simply Down-Sample the source to 8Khz/8Bit/Mono wav files—Train and Decode?

    Does the intermediate conversion step (up-sampling GSM to WAVE in step 4) actually CREATE any SAMPLE data beneficial to the training process? We know it will not create any beneficial audio CHARACTERISTIC data, but does the additional GRANULARITY of the samples aid in the MFCC creation process?

    I’m especially interested in ASR success stories within the GSM environment.

    Comments?

    Thanks-
    850mph

    (cross posted to c.s.research 9-9-06)
    (cross posted to CMU Sphinx Forum at Sourceforge)
    Thanks!

     
    • David Huggins-Daines

      If I were you, I'd downsample the audio to 8kHz/16bit, encode it with GSM, decode it back to 8kHz/16bit, and then train with that. In theory this will give you a model that is somewhat attuned to the distortion caused by GSM.
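
The round trip described above (downsample, GSM-encode, decode) could be scripted roughly as follows. This is only a sketch: it assumes a SoX 14.x build with GSM support, the file names are hypothetical, and the commands are printed rather than executed so they can be checked first.

```python
import shlex


def gsm_roundtrip_commands(source_wav: str, stem: str) -> list:
    """Return the three SoX invocations for one utterance (nothing is executed)."""
    pcm_8k = f"{stem}_8k.wav"        # hypothetical intermediate file name
    gsm = f"{stem}.gsm"              # hypothetical GSM-encoded file name
    decoded = f"{stem}_gsm_8k.wav"   # hypothetical training file name
    linear_16 = ["-b", "16", "-e", "signed-integer"]
    return [
        # 1) Downsample the studio master to 8 kHz, 16-bit linear PCM, mono.
        ["sox", source_wav, "-r", "8000", "-c", "1", *linear_16, pcm_8k],
        # 2) Encode with GSM (SoX infers the format from the .gsm extension).
        ["sox", pcm_8k, gsm],
        # 3) Decode back to 16-bit linear PCM at 8 kHz for training.
        ["sox", gsm, *linear_16, decoded],
    ]


for cmd in gsm_roundtrip_commands("yes_take1.wav", "yes_take1"):
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the printed commands look right for the local SoX build.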

      Upsampling will do exactly nothing and actually might hurt you a bit if you use the default MFCC parameters for 16kHz audio, because it will result in a lot of filterbanks with zero power.
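
To see why upsampled audio leaves filters empty, here is a rough Python sketch. It places triangular mel filters for a wideband front end and counts those lying entirely above 4 kHz, the highest frequency 8 kHz audio can contain. The filterbank settings (40 filters from 133.33 Hz to 6855.5 Hz for 16 kHz, 31 filters from 200 Hz to 3500 Hz for 8 kHz) and the mel formula are assumptions based on commonly used Sphinx front-end values, not figures from this thread. An ideal resampler is also assumed.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Common mel-scale formula (assumed here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def filter_edges_hz(n_filters: int, lower_hz: float, upper_hz: float) -> list:
    """(left, centre, right) edges of triangular filters spaced evenly in mel."""
    lo, hi = hz_to_mel(lower_hz), hz_to_mel(upper_hz)
    step = (hi - lo) / (n_filters + 1)
    points = [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]
    return [(points[i], points[i + 1], points[i + 2]) for i in range(n_filters)]


def count_empty_filters(edges: list, content_limit_hz: float) -> int:
    """Count filters whose whole passband lies at or above the content limit."""
    return sum(1 for left, _, _ in edges if left >= content_limit_hz)


# Assumed wideband (16 kHz) and narrowband (8 kHz) front-end settings.
wideband = filter_edges_hz(n_filters=40, lower_hz=133.33334, upper_hz=6855.4976)
narrowband = filter_edges_hz(n_filters=31, lower_hz=200.0, upper_hz=3500.0)

empty = count_empty_filters(wideband, content_limit_hz=4000.0)
print(f"wideband: {empty} of {len(wideband)} filters see no signal energy")
print(f"narrowband: {count_empty_filters(narrowband, 4000.0)} of {len(narrowband)}")
```

Filters that straddle 4 kHz are only partly filled, so the count is a lower bound on how many are affected.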

      Actually, SphinxTrain needs its input to be 16-bit linear PCM; u-Law and 8-bit are not supported (and 8-bit linear should never be used for speech, or anything else really :-))
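
As a rough illustration of the remark about 8-bit linear audio, the textbook signal-to-quantisation-noise ratio for uniform quantisation can be computed as below. The sketch assumes an ideal quantiser and a full-scale sine wave. Speech usually sits well below full scale, so real figures would be worse.

```python
import math


def linear_pcm_sqnr_db(bits: int) -> float:
    """Ideal SQNR in dB of a full-scale sine quantised uniformly with `bits` bits."""
    # Sine power over uniform quantisation-noise power equals 1.5 * 2^(2 * bits),
    # which is the familiar "6.02 * bits + 1.76 dB" rule.
    return 10.0 * math.log10(1.5 * 2.0 ** (2 * bits))


for bits in (8, 16):
    print(f"{bits}-bit linear PCM: about {linear_pcm_sqnr_db(bits):.1f} dB")
```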

       
