I wonder what sort of results one would expect from training a model solely on synthesized speech? Say, targeting something like a 500-1000 word vocabulary.
One could use multiple available TTS voices, and simulate more by applying effects such as changing pitch and/or tempo.
Basically, trying to outsource all the tedious work to computers...
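To illustrate the augmentation idea: the variants could be produced by resampling each synthesized utterance at a few different rates. The sketch below is numpy-only and deliberately naive (a real pipeline would use sox or librosa's phase-vocoder `time_stretch` so that tempo and pitch can be varied independently); the sine wave merely stands in for a TTS utterance.

```python
import numpy as np

def change_tempo(samples: np.ndarray, factor: float) -> np.ndarray:
    """Naive tempo change by linear resampling.
    Note: this also shifts pitch; a real pipeline would use a
    phase vocoder (e.g. librosa.effects.time_stretch) or sox.
    """
    n_out = int(len(samples) / factor)
    src_idx = np.linspace(0, len(samples) - 1, n_out)
    return np.interp(src_idx, np.arange(len(samples)), samples)

# Stand-in for one synthesized utterance: 1 s of 440 Hz at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
utterance = np.sin(2 * np.pi * 440 * t)

# Simulate extra "speakers" by varying tempo around the original.
variants = [change_tempo(utterance, f) for f in (0.9, 1.0, 1.1)]
print([len(v) for v in variants])  # → [17777, 16000, 14545]
```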
There are more natural ways to obtain the large amounts of speech data required for training. The one we pursue in CMUSphinx is automatic alignment and training on transcribed recordings. Many public transcribed recordings are available, so we can easily reuse the transcriptions to build very accurate models.
Proper tools have to be implemented to support that, but they are in development right now and should reach production state quite soon. Your help is welcome.
It's worth noting that data alone will not be enough to build an accurate recognizer. Advanced algorithms and features have to be implemented too.
I noticed you neatly sidestepped answering the question ;)
What I have in mind right now are limited-vocabulary language models for command and control, using pocketsphinx, and not restricted to English only.
Another approach, and a question: say I were to crowdsource the training to smartphone users. Would audio compression via Opus or MP3 VBR introduce significant degradation into the training material?
As to the request for help, I'll nibble at the bait... What kind of help would be required? Is there a list of tasks somewhere?
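For the command-and-control use case mentioned above, pocketsphinx can be driven by a JSGF grammar rather than a statistical language model, which keeps the vocabulary tightly bounded. A minimal sketch of such a grammar (the rule and word names here are illustrative, not from the original discussion):

```
#JSGF V1.0;
grammar commands;

public <command> = <action> [ the ] <object>;
<action> = turn on | turn off | open | close;
<object> = lights | door | radio;
```

A grammar like this would be loaded with pocketsphinx's JSGF search mode; the acoustic model still has to cover the phones of the target language, which is where the training-data question comes in.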
> I noticed you neatly sidestepped answering the question ;)

I'm trying to point you toward the right way.

> Another approach, and a question: say I were to crowdsource the training to smartphone users. Would audio compression via Opus or MP3 VBR introduce significant degradation into the training material?

No, it doesn't really matter.

> As to the request for help, I'll nibble at the bait... What kind of help would be required?

Software development.

> Is there a list of tasks somewhere?

The task is as I named it: "Make sure that automatic alignment works".