CMU Sphinx / Forums / Help: Sampling GSM audio for Acoustic Model Trainin

Sampling GSM audio for Acoustic Model Training

Hello:

I’m fortunate enough to have access to a large professional recording studio where multiple voice personalities breeze in and out every day to record radio commercials. A variety of these voices (about 50 people) have graciously agreed to help me record a small (about 20 words) command and control ASR vocabulary for a telephony application.

THE PROJECT:

Briefly:
The (distributed) ASR side of our app uses the existing Perl interface to feed .wav audio files (re-sampled from GSM audio files) to SPHINX-II via a TCP/IP network connection. SPHINX-II decodes them and passes back a result hypothesis.

Existing Platform:
Acoustic Model: 8khz/8bit Communicator
http://www.speech.cs.cmu.edu/sphinx/models/hmm/communicator-2000-11-17-2.tgz

Perl Code: Speech::Recognizer::SPX
http://search.cpan.org/~djhd/Speech-Recognizer-SPX-0.0602/SPX.pm

Decoder: Sphinx-II
http://rpm.pbone.net/index.php3/stat/4/idpl/669453/com/sphinx2-0.3-2.i386.rpm.html

Platform: Fedora Core-3, Linux Kernel 2.6
http://download.fedora.redhat.com/pub/fedora/linux/core/3/

Example:
The following example ALREADY works (with high WERs) with the following 8Khz/8Bit/Mono Communicator Acoustic Model and Perl TCP/IP Client/Server code:

1) User SPEAKS audio Commands into cell phone connected to GSM network
-- Limited GSM bandwidth discards most of the audio characteristics (data)

2) Telephony Server SAVES GSM audio
-- This file has a unique acoustic signature which is NOT the same as wav sampled audio

3) Sox CONVERTS GSM audio into PCM .wav files
-- Conversion creates a file which SPHINX-II can decode

4) .Wav files are FED to SPHINX-II over the network for hypothesis creation
-- The existing Perl client-server interface works pretty well already for the (d)ASR

5) Hypothesis is PASSED back to telephony server for Command execution
-- Generally, the commands play pre-recorded audio back to the user
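For reference, step 3 could be scripted roughly as follows. This is only a minimal sketch: it assumes SoX 14.x option names (-r, -c, -b, -e) and a SoX build with GSM support. The function names and file names are hypothetical, and the 16-bit output depth is my assumption rather than a requirement of the decoder.

```python
import subprocess


def sox_gsm_to_wav_cmd(gsm_path, wav_path, rate_hz=8000):
    """Build (but do not run) a SoX command that decodes a .gsm file
    to mono signed 16-bit linear PCM WAVE at the given sample rate."""
    return [
        "sox", str(gsm_path),
        # Output format options go immediately before the output file name.
        "-r", str(rate_hz),
        "-c", "1",
        "-b", "16",
        "-e", "signed-integer",
        str(wav_path),
    ]


def convert_gsm_to_wav(gsm_path, wav_path, rate_hz=8000):
    """Run the conversion; raises CalledProcessError if SoX fails."""
    subprocess.run(sox_gsm_to_wav_cmd(gsm_path, wav_path, rate_hz), check=True)


if __name__ == "__main__":
    # Hypothetical file names, printed only so the command can be inspected.
    print(" ".join(sox_gsm_to_wav_cmd("command01.gsm", "command01.wav")))
```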

THE AUDIO-CREATION/MODEL TRAINING PLAN:

Obviously my main concern is that the mobile-GSM environment degrades the audio signal, leaving very few acoustic characteristics to input into the SPHINX-III Trainer.

I would like to create the best possible acoustic model for my limited C&C vocabulary.

I am MOST interested in group comments on the following “Audio-Creation/Model Training Plan” in the hopes of saving myself and others a bit of time:

1) SAMPLE Command Set via ProTools-HD to wav
-- File Format=.wav/linear 192.0khz/24bit/Mono NonCompressed approx: 4.6Mbs
-- Linear sampling preserves all acoustic characteristics/lossless
The theory here is that the source sampling should be archived at highest quality possible

2) DOWN-SAMPLE Command Set via SOX to uLaw
-- File Format=.ulaw/nonLinear 8khz/8bit/Mono Compressed to approx: 64Kbs
-- Convert to a wire-line POTS bitrate
-- Semilogarithmic Compression/lossy
The theory here is that this will preserve the highest possible acoustic characteristics from the source audio for a POTS telephony environment.

3) CONVERT Command Set via SOX to GSM
-- File Format=GSM/nonlinear 8khz/13bit/Mono Compressed to approx: 13Kbs
-- Convert to a mobile wireless GSM bitrate
-- RPE-LTP (linear-predictive) Compression/lossy
-- Non-linear compression within GSM codec creates unique acoustic characteristics
The theory here is that the GSM conversion will significantly “distort” the original sampled source audio files, creating a spectrum (and corresponding Mel-cepstrum feature set) which will be closer to what the decoder actually receives when the app is live.

4) UPSAMPLE Command Set via SOX to 16khz WAVE (RIFF)
-- File Format=16.0khz/16bit/Mono NonCompressed approx: 256Kbs
-- Convert to format used by SPHINX-III TRAINER and corresponding DECODER
-- Linear PCM re-sampling preserves the unique GSM audio characteristics/lossless
The theory here is that although we obviously do not gain any acoustic definition by up-sampling, we do gain “bits” from the higher granularity, potentially enabling the Trainer to compare frames of finer-resolution sample data when creating a semi-continuous HMM phone model.

** OR **

4) RESAMPLE Command Set via SOX to 8khz WAVE (RIFF)
-- File Format=8.0khz/8bit/Mono NonCompressed approx: 64Kbs
-- Convert to format used by SPHINX-III TRAINER and corresponding DECODER
-- Linear re-sampling preserves the unique GSM audio characteristics/lossless
The theory here is that the Trainer will NOT gain any benefit from Upsampling the GSM audio files produced in Step 3.

5) TRAIN Command Set with SphinxTrain
-- The trainer accepts 16khz/16bit/Mono or 8khz/8bit/Mono WAVE or RAW files
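As a sanity check on the bit rates quoted in the steps above, the raw PCM rate is just sample rate x bits per sample x channels. A small sketch (the function and dictionary names are mine, for illustration only; the GSM figure assumes the full-rate codec's 260-bit frame every 20 ms):

```python
def pcm_bitrate(sample_rate_hz, bits_per_sample, channels=1):
    """Raw (uncompressed) bit rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


# (sample rate in Hz, bits per sample) for each format in the plan.
FORMATS = {
    "step 1: studio master, 192 kHz / 24 bit": (192_000, 24),
    "step 2: u-law, 8 kHz / 8 bit": (8_000, 8),
    "step 4a: WAVE, 16 kHz / 16 bit": (16_000, 16),
    "step 4b: WAVE, 8 kHz / 8 bit": (8_000, 8),
}

# GSM full rate is a frame-based codec, so its rate comes from the frame
# size rather than from bits per sample: 260 bits per 20 ms frame.
GSM_FR_BITRATE = 260 * 50

if __name__ == "__main__":
    for name, (rate, bits) in FORMATS.items():
        print(f"{name}: {pcm_bitrate(rate, bits) / 1000:.0f} kb/s")
    print(f"step 3: GSM full rate: {GSM_FR_BITRATE / 1000:.0f} kb/s")
```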

Have I “over-thought” this procedure?

Can I simply Down-Sample the source to 8Khz/8Bit/Mono wav files—Train and Decode?
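One thing that may matter for this question is that 8 bits of u-law and 8 bits of linear PCM are not equivalent: u-law spends its levels on quiet samples. The sketch below uses the continuous u-law formula with mu = 255 (an assumption for illustration; real G.711 uses a piecewise-linear approximation) and hypothetical helper names to compare how a quiet sample survives each path.

```python
import math

MU = 255.0  # companding constant for the continuous u-law formula


def mulaw_compress(x):
    """Map a linear sample in [-1, 1] to the companded domain [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(v, bits):
    """Uniform quantiser on [-1, 1] using signed integer levels."""
    levels = 2 ** (bits - 1) - 1  # 127 for 8 bits, 32767 for 16 bits
    return round(v * levels) / levels


def via_mulaw8(x):
    """Encode to 8-bit u-law, then decode back to a linear value."""
    return mulaw_expand(quantize(mulaw_compress(x), 8))


def via_linear8(x):
    """Store directly as 8-bit linear PCM."""
    return quantize(x, 8)


def via_linear16(x):
    """Store directly as 16-bit linear PCM."""
    return quantize(x, 16)


if __name__ == "__main__":
    quiet = 0.002  # a quiet sample, about -54 dB relative to full scale
    decoded = via_mulaw8(quiet)
    print("original         :", quiet)
    print("8-bit u-law      :", decoded)
    print("8-bit linear     :", via_linear8(quiet))
    print("u-law -> 16-bit  :", via_linear16(decoded))
    print("u-law -> 8-bit   :", via_linear8(decoded))
```

In this sketch the quiet sample survives the u-law round trip and a 16-bit linear container, but collapses to zero in 8-bit linear PCM, which suggests that decoding companded or GSM audio into an 8-bit linear WAVE may throw away detail that a 16-bit WAVE would keep.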

Does the intermediate conversion step (up-sampling GSM to WAVE, step 4) actually CREATE any SAMPLE data beneficial to the training process? We know it will not create any beneficial audio CHARACTERISTIC data, but does the additional GRANULARITY of the samples aid in the MFCC creation process?
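To make the question concrete, here is a toy illustration of why I suspect the extra samples carry no new information. It assumes plain 2x linear interpolation (SoX's real resampler uses a proper filter, so this is only a stand-in), and the function names are hypothetical. Every inserted sample is computed from its 8 kHz neighbours, and dropping the inserted samples returns the original stream exactly.

```python
def upsample_2x_linear(samples):
    """Double the sample rate by inserting the midpoint between neighbours."""
    out = []
    for i, s in enumerate(samples):
        out.append(s)
        # At the end of the stream there is no next sample, so repeat the last.
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append((s + nxt) // 2)
    return out


def decimate_2x(samples):
    """Keep every second sample, undoing upsample_2x_linear."""
    return samples[::2]


if __name__ == "__main__":
    original_8k = [0, 100, -50, 25, 0]  # made-up integer samples
    upsampled_16k = upsample_2x_linear(original_8k)
    print("8 kHz samples :", original_8k)
    print("16 kHz samples:", upsampled_16k)
    print("round trip ok :", decimate_2x(upsampled_16k) == original_8k)
```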

I’m especially interested in ASR success stories within the GSM environment.

Comments?

Thanks-
850mph

(cross posted to c.s.research 9-9-06)
(cross posted to CMU Sphinx Forum at Sourceforge)
Thanks!
