TIDIGITS information

Help
2010-04-18
2012-09-22
  • ubaid mahmood

    ubaid mahmood - 2010-04-18

    Hello Nickolay,

    I noticed that there is a suggested training procedure for a small vocabulary
    set. I have searched the forums but haven't found too much information on
    this. I have only found references to TIDIGITS but not too much information on
    how to configure the environment. Is there a resource for TIDIGITS? I assume
    TIDIGITS is only a sample.

    All the best.

     
  • Nickolay V. Shmyrev

    how to configure the environment.

    The TIDIGITS training template is in sphinxtrain/templates/tidigits

    Is there a resource for TIDIGITS?

    I'm not sure what kind of resource you are looking for.

     
  • ubaid mahmood

    ubaid mahmood - 2010-04-18

    Thanks for the info.

    The directory reference is what I was looking for.

    After taking a look at it, I am a little bit confused. It seems like it is
    tied to a specific type of database, and I am not sure whether it can be
    adapted to a custom db.

    For example in my dictionary, eight is defined as:

    EIGHT EIGHT

    It is defined to itself because my vocabulary set is small. The phoneset is
    the words themselves. In the TIDIGITS sample, eight is defined as:

    eight EY_eight T_eight

    It seems that two different types of feature files are being used. Or maybe
    it is not using the words themselves as the phoneset (as was suggested for
    small vocabulary), but is actually defining multiple phones per word?

    Are there two different approaches for a small vocabulary? Am I missing
    something here?

    Thanks.

     
  • Nickolay V. Shmyrev

    For example in my dictionary, eight is defined as:

    EIGHT EIGHT

    It is defined to itself because my vocabulary set is small.

    That's not optimal, given that SphinxTrain uses 3 states per phone: a whole
    word modeled as a single phone gets only 3 states.

    The phoneset is the words themselves. In the TIDIGITS sample, eight is
    defined as:

    eight EY_eight T_eight

    This one is better for SphinxTrain.

    Using a single phone per word is common practice in HTK, where you usually
    define a different number of states for each phone later (8 states for
    EIGHT, 10 states for SEVEN). CMUSphinx uses a different approach.
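
To make the state counts concrete, here is a minimal Python sketch. It assumes, as stated above, a fixed 3 states per phone; the function name `states_per_word` and the toy dictionaries are illustrative only and are not part of SphinxTrain.

```python
# Assumption taken from the thread: every phone is modeled with 3 states.
STATES_PER_PHONE = 3


def states_per_word(dictionary):
    """Return {word: total number of states} for a {word: [phones]} dictionary."""
    return {word: STATES_PER_PHONE * len(phones)
            for word, phones in dictionary.items()}


# Word mapped to itself: the whole word is a single "phone".
word_as_phone = {"EIGHT": ["EIGHT"]}

# TIDIGITS-style entries quoted in this thread: several word-specific phones.
word_specific = {
    "eight": ["EY_eight", "T_eight"],
    "two": ["T_two", "OO_two"],
}

print(states_per_word(word_as_phone))   # {'EIGHT': 3}
print(states_per_word(word_specific))   # {'eight': 6, 'two': 6}
```

With a fixed number of states per phone, splitting a word into more phones is the only way to give it more states, which is why the multi-phone entry suits SphinxTrain better than the word-as-phone entry.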

     
  • ubaid mahmood

    ubaid mahmood - 2010-04-18

    Hmm. It seems there are a few things that I am not quite familiar with.

    How is Eight defined in TIDIGITS:

    eight EY_eight T_eight

    different than eight defined in an4:

    EIGHT EY T

    ?

    It seems like they follow the same principle in defining multiple states.

    Also, so that I can understand, what would the 8 enumerated states for EIGHT
    be? I assume though that I would use the triphone approach for the sphinx
    trainer and decoder.

    Also, does that mean that the following:

    If you have only about 50-60 words in your vocabulary, and if your entire test data vocabulary is covered by the training data, then you are probably better off training word models rather than phone models. To do this, simply define the phoneset as your set of words themselves and have a dictionary that maps each word to itself and train. Also, use a lesser number of fillers, and if you do need to train phone models make sure that each of your tied states has enough counts (at least 5 or 10 instances of each).
    

    is intended for HTK models and not for Sphinx?

    I appreciate your assistance in clarifying these models.

     
  • Nickolay V. Shmyrev

    How is Eight defined in TIDIGITS:

    eight EY_eight T_eight

    different than eight defined in an4:

    EIGHT EY T

    In an4

    EIGHT EY T
    EIGHTEEN EY T IY N

    In both words, EY is the same phone with the same model. This is the
    approach for large vocabulary.

    In tidigits

    eight EY_eight T_eight
    two T_two OO_two

    Here T is different in eight and two. This is the small-vocabulary approach
    to modeling the context dependence of phones: the T in eight has a context
    that makes it different from the T in two.
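
The difference between the two dictionary styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `make_word_dependent` and `phone_set` are hypothetical, and the `PHONE_word` naming simply mirrors the entries quoted above rather than any script shipped with the TIDIGITS template.

```python
def make_word_dependent(dictionary):
    """Rename every phone to PHONE_word so no phone is shared between words."""
    result = {}
    for word, phones in dictionary.items():
        tag = word.lower()
        result[tag] = [f"{phone}_{tag}" for phone in phones]
    return result


def phone_set(dictionary):
    """Collect the distinct phones used anywhere in the dictionary."""
    return {phone for phones in dictionary.values() for phone in phones}


# an4-style entries quoted in this thread: EY and T are shared by both words.
an4_style = {"EIGHT": ["EY", "T"], "EIGHTEEN": ["EY", "T", "IY", "N"]}
word_dependent = make_word_dependent(an4_style)

for word, phones in word_dependent.items():
    print(word, " ".join(phones))

print(len(phone_set(an4_style)), "shared phones")            # 4 shared phones
print(len(phone_set(word_dependent)), "word-dependent phones")  # 6 word-dependent phones
```

The phone set grows because each word gets its own copy of every phone, so each copy is trained only on the contexts in which that word actually occurs.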

    Also, so that I can understand, what would the 8 enumerated states for EIGHT
    be?

    Not sure what you mean by "would". The 8 states in an HTK model are just
    states; they have no names.

    I assume though that I would use the triphone approach for the sphinx
    trainer and decoder.

    Be careful with your assumptions: small vocabulary recognizers don't use
    triphones.

    Also, does that mean that the following: If you have only about 50-60 words
    in your vocabulary, and if your entire test data vocabulary is covered by the
    training data, then you are probably better off training word models rather
    than phone models.

    You also need to be careful when relying on old, obsolete documentation like
    this.

     
  • ubaid mahmood

    ubaid mahmood - 2010-04-18

    Thanks, that makes a lot of sense.

    I was training with some of the older documentation along with the newer one,
    because it seemed like some of the old documentation still applies, but
    obviously I run the risk of using outdated information.

    I noticed that the feat.params is different for tidigits and an4. Is it
    necessary to use the TIDIGITS feat.params file? Is there information on how
    the different parameters are interpreted? (For example, I do not see what
    the "dither" option does.)

     
  • Nickolay V. Shmyrev

    I noticed that the feat.params is different for tidigits and an4

    Yes, the an4 variant is older. TIDIGITS uses more modern feature extraction
    that has proven to be a little more accurate. The pocketsphinx TIDIGITS model
    is trained this way. The reasoning for the change can be found here:

    http://lima-2.speech.cs.cmu.edu/moinmoin/SphinxHTK

    Basically, it arose from an attempt to follow HTK.

    Is it necessary to use the TIDIGITS feat.params file?

    No, but it gives better accuracy than any other known values.

    Is there information on how the different parameters are interpreted? (for
    an example, I do not see behavior of "dither" option)

    Dither is random noise added to the speech to avoid numerical overflow when
    processing zero-energy regions, such as those caused by silence suppression
    in telephone recordings. As usual, you can run wave2feat without arguments
    to get the embedded help.
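
The effect described above can be illustrated with a minimal Python sketch. It is not the Sphinx front-end implementation: the helpers `log_energy` and `add_dither`, the noise amplitude of 1.0, the uniform noise distribution, and the 160-sample frame are all assumptions chosen for illustration.

```python
import math
import random


def log_energy(frame):
    """Log of the summed squared samples; -inf when the frame is all zeros."""
    energy = sum(sample * sample for sample in frame)
    # math.log(0) raises ValueError in Python, so the zero case is made explicit.
    return math.log(energy) if energy > 0.0 else float("-inf")


def add_dither(frame, amplitude=1.0, rng=None):
    """Add small uniform random noise to every sample (illustrative amplitude)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so the example is repeatable
    return [sample + rng.uniform(-amplitude, amplitude) for sample in frame]


# A frame of digital silence, as produced by silence suppression.
silent_frame = [0] * 160

print(log_energy(silent_frame))              # -inf
print(math.isfinite(log_energy(add_dither(silent_frame))))  # True
```

A frame of exact zeros has no finite log energy, while the same frame with a little noise does; the noise is small enough that it is negligible for frames that contain real speech.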

     
  • ubaid mahmood

    ubaid mahmood - 2010-04-19

    Quick follow-up: I was able to get it set up and am getting OK results for
    now.

    Thanks for the clarifications.

     
