TIDIGITS information

Forum: Help

Creator: ubaid mahmood

Created: 2010-04-18

Updated: 2012-09-22

ubaid mahmood - 2010-04-18

Hello Nickolay,

I noticed that there is a suggested training procedure for a small vocabulary
set. I have searched the forums but found little information on this: only
references to TIDIGITS, with not much on how to configure the environment. Is
there a resource for TIDIGITS? I assume TIDIGITS is only a sample.

All the best.


Nickolay V. Shmyrev - 2010-04-18

how to configure the environment.

The TIDIGITS training template is in sphinxtrain/templates/tidigits.

Is there a resource for TIDIGITS?

I'm not sure what kind of resource you are looking for.


ubaid mahmood - 2010-04-18

Thanks for the info.

The directory reference is what I was looking for.

After taking a look at it, I am a little bit confused. It seems like it is
tied to a specific type of database, and I am not sure whether it can be
adapted to a custom db.

For example in my dictionary, eight is defined as:

EIGHT EIGHT

It is defined as itself because my vocabulary set is small. The phone set is
the words themselves. In the TIDIGITS sample, eight is defined as:

eight EY_eight T_eight

It seems that two different types of feature files are being used. Or maybe it
is not using the words themselves as the phone set (as was suggested for small
vocabularies), but is actually defining the phone set to use multiple phones?

Are there two different approaches for a small vocabulary? Am I missing
something here?

Thanks.


Nickolay V. Shmyrev - 2010-04-18

For example in my dictionary, eight is defined as:

EIGHT EIGHT

It is defined as itself because my vocabulary set is small.

That's not optimal, given that sphinxtrain uses 3 states per phone.

The phone set is the words themselves. In the TIDIGITS sample, eight is
defined as:

eight EY_eight T_eight

This one is better for Sphinxtrain.

Using a single phone per word is common practice in HTK, where you usually
define a different number of states for each phone later (8 states for EIGHT,
10 states for SEVEN). CMUSphinx uses a different approach.
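
To make the point about state counts concrete, here is a minimal sketch. It
assumes only what is stated above, namely a fixed 3 states per phone; the
function name and the two tiny lexicons are hypothetical and are not part of
sphinxtrain.

```python
# Assumption taken from the reply above: every phone gets a fixed 3-state model.
STATES_PER_PHONE = 3


def states_for_word(pronunciation, states_per_phone=STATES_PER_PHONE):
    """Return how many HMM states are available to model one word."""
    return len(pronunciation) * states_per_phone


# Word mapped to itself: the whole word is a single "phone".
word_as_phone = {"EIGHT": ["EIGHT"]}
# TIDIGITS-style entry: the word is split into word-specific phones.
word_dependent = {"eight": ["EY_eight", "T_eight"]}

for label, lexicon in (("word as phone", word_as_phone),
                       ("word-dependent phones", word_dependent)):
    for word, phones in lexicon.items():
        print(f"{label}: {word} -> {states_for_word(phones)} states")
```

With a fixed topology, the whole word EIGHT gets only 3 states in the first
layout and 6 in the second, which is why the HTK-style single-phone dictionary
depends on being able to raise the state count per word.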


ubaid mahmood - 2010-04-18

Hmm. It seems there are a few things that I am not quite familiar with.

How is Eight defined in TIDIGITS:

eight EY_eight T_eight

different than eight defined in an4:

EIGHT EY T

?

It seems like they follow the same principle in defining multiple states.

Also, so that I can understand, what would the 8 enumerated states for EIGHT
be? I assume though that I would use the triphone approach for the sphinx
trainer and decoder.

Also, does that mean that the following:

If you have only about 50-60 words in your vocabulary, and if your entire test data vocabulary is covered by the training data, then you are probably better off training word models rather than phone models. To do this, simply define the phoneset as your set of words themselves and have a dictionary that maps each word to itself and train. Also, use a lesser number of fillers, and if you do need to train phone models make sure that each of your tied states has enough counts (at least 5 or 10 instances of each).

is intended for HTK models and not for Sphinx?

I appreciate your assistance in clarifying these models.

Nickolay V. Shmyrev - 2010-04-18

How is Eight defined in TIDIGITS:

eight EY_eight T_eight

different than eight defined in an4:

EIGHT EY T

In an4

EIGHT EY T
EIGHTEEN EY T IY N

In both words, EY is the same phone with the same models. This is the approach
for large vocabularies.

In tidigits

eight EY_eight T_eight
two T_two OO_two

Here, T is different in eight and two. This is the approach for small
vocabularies, used to model the context dependence of phones: T in eight has a
context that makes it different from T in two.
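
As an illustration of the difference, the sketch below tags each phone of a
shared-phone dictionary with its word, producing entries in the same shape as
the tidigits ones. The helper names are hypothetical, and this is not
necessarily how the tidigits template dictionary was produced.

```python
def make_word_dependent(lexicon):
    """Tag every phone with its word so that no model is shared across words."""
    return {
        word.lower(): [f"{phone}_{word.lower()}" for phone in phones]
        for word, phones in lexicon.items()
    }


def phone_set(lexicon):
    """Collect the distinct phones used anywhere in the lexicon."""
    return sorted({phone for phones in lexicon.values() for phone in phones})


# an4-style entries: EY and T are shared between the two words.
shared = {"EIGHT": ["EY", "T"], "EIGHTEEN": ["EY", "T", "IY", "N"]}
tagged = make_word_dependent(shared)

for word, phones in tagged.items():
    print(word, " ".join(phones))

print(len(phone_set(shared)), "shared phones")
print(len(phone_set(tagged)), "word-dependent phones")
```

The shared layout trains 4 phone models for these two words, while the
word-dependent layout trains 6, one per phone occurrence, so each model only
ever sees a single context.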

Also, so that I can understand, what would the 8 enumerated states for EIGHT
be?

Not sure what you mean by "would". The 8 states in an HTK model are just
states; they have no names.

I assume though that I would use the triphone approach for the sphinx
trainer and decoder.

Be careful with your assumptions: small-vocabulary recognizers don't use
triphones.

Also, does that mean that the following: If you have only about 50-60 words
in your vocabulary, and if your entire test data vocabulary is covered by the
training data, then you are probably better off training word models rather
than phone models.

You also need to be careful when you rely on old, obsolete documentation like
this.


ubaid mahmood - 2010-04-18

Thanks, that makes a lot of sense.

I was training with some of the older documentation along with the newer one,
because it seemed like some of the old documentation still applies, but
obviously I run the risk of using outdated information.

I noticed that the feat.params is different for tidigits and an4. Is it
necessary to use the TIDIGITS feat.params file? Is there information on how
the different parameters are interpreted? (For example, I do not see the
behavior of the "dither" option.)


Nickolay V. Shmyrev - 2010-04-18

I noticed that the feat.params is different for tidigits and an4

Yes, the an4 variant is older. tidigits uses more modern feature extraction
that has proven to be a little more accurate. The pocketsphinx tidigits model
is trained this way. The reasoning for the change can be found here:

http://lima-2.speech.cs.cmu.edu/moinmoin/SphinxHTK

Basically, it arose from an attempt to follow HTK.

Is it necessary to use the TIDIGITS feat.params file?

No, but it gives better accuracy than any other known values.

Is there information on how the different parameters are interpreted? (For
example, I do not see the behavior of the "dither" option.)

Dither is random noise added to the speech to avoid numerical problems (for
example, taking the log of zero) when processing zero-energy regions caused by
silence suppression in telephone recordings. As usual, you can run wave2feat
without arguments to get the embedded help.
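
A minimal sketch of why dither helps, assuming the problem is the logarithm of
a zero-energy frame. This is only an illustration, not the feature extraction
code that wave2feat runs; the function names, the noise amplitude, and the
160-sample frame length are hypothetical.

```python
import math
import random


def log_energy(frame):
    """Log of the summed squared samples; undefined for an all-zero frame."""
    return math.log(sum(sample * sample for sample in frame))


def add_dither(frame, amplitude=1.0, rng=None):
    """Add small uniform noise to every sample (illustrative amplitude)."""
    rng = rng or random.Random(0)
    return [sample + rng.uniform(-amplitude, amplitude) for sample in frame]


# A frame of digital silence, as left behind by silence suppression.
silent = [0] * 160

try:
    log_energy(silent)
except ValueError:
    print("log energy is undefined for an all-zero frame")

# After dithering, the frame has a small non-zero energy and the log is finite.
print(math.isfinite(log_energy(add_dither(silent))))
```

The noise is small relative to real speech samples, so it mainly affects the
frames that would otherwise be exactly zero.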


ubaid mahmood - 2010-04-19

Quick follow-up. I was able to get it set up and am getting OK results for now.

Thanks for the clarifications.
