Adapting the Model

Nicholas
2012-07-16
2012-09-22
  • Nicholas

    Nicholas - 2012-07-16

    I have a number of questions whose answers would really help me understand
    adapting the model to additional users.

    Order of voices added to model

    Does the order in which additional voices are added to the model matter?

    I have two users, Nicholas and Charlotte. I added both voices to the
    original model using the sentences from the arctic example at
    http://cmusphinx.sourceforge.net/wiki/tutorialadapt.

    First I added Nicholas and then Charlotte to the original model, then used
    some different sentences to test the accuracy. I then repeated the same
    experiment, this time adding Charlotte first and then Nicholas to the
    original model. This produced two different adapted models.

    When testing Nicholas against both models the accuracy was the same both
    times:

    TOTAL Words: 49 Correct: 48 Errors: 2
    TOTAL Percent correct = 97.96% Error = 4.08% Accuracy = 95.92%
    TOTAL Insertions: 1 Deletions: 1 Substitutions: 0

    When testing Charlotte against both models the accuracy was different:

    Charlotte added after Nicholas
    TOTAL Words: 49 Correct: 41 Errors: 9
    TOTAL Percent correct = 83.67% Error = 18.37% Accuracy = 81.63%
    TOTAL Insertions: 1 Deletions: 0 Substitutions: 8

    Nicholas added after Charlotte
    TOTAL Words: 49 Correct: 44 Errors: 6
    TOTAL Percent correct = 89.80% Error = 12.24% Accuracy = 87.76%
    TOTAL Insertions: 1 Deletions: 0 Substitutions: 5

    What is the cause for this difference?

    Internals of adding to a model

    How does adding new voices to a model work?

    If you record a sentence and create the required language model and
    dictionary files in order to adapt the model, how does it know which parts
    of the sound relate to which word? How can this be accurate? If it could do
    this, surely you would never need to add any new voices to the model?

    Surely the better option would be to split the sentences up into individual
    words (one word per recording) and adapt the model like this?

    Speech impediments

    If a user has an impediment such as a lisp and they're added to the model,
    how does this affect the resulting model for users without a lisp?

    In the tests I tried, the accuracy for other users decreased. How can the
    model be improved for impediments without impacting other users? Would
    recording each word individually and adding them to the model help in this
    situation?

    Thank you for your time and for providing a quality solution.

     
  • Nickolay V. Shmyrev

    When testing Nicholas against both models the accuracy was the same both
    times. When testing Charlotte against both models the accuracy was different.
    What is the cause for this difference?

    I think it's just an artifact of testing. With more test data the accuracy
    should differ in both cases; there is no reason for it to be the same, so
    the match can only happen by accident.

    How does adding new voices to a model work? If you record a sentence and
    create the required languagemodels and dictionary files in order to add to the
    model how does it know which parts of the sound relates to which word? How can
    this be accurate? If it could do this surely you would never need to add any
    new voices to the model? Surely the better option would be to split the
    sentences up into individual word (one word per recording) and adapt the model
    like this?

    The "adding" is a misconception here. An acoustic model is a statistically
    estimated "average" of human voices; it's not a database. You cannot "add"
    two voices: their differences create a bias that reduces the accuracy on
    both voices, because you would be trying to average just two voices. That
    means you can either adapt to a single speaker, shifting the average closer
    to that speaker, or you need to adapt to many speakers in order to shift
    the average to some reasonable mean.
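    As a toy picture of this (a deliberately simplified sketch, not the
    CMUSphinx internals, with made-up feature values): a model parameter
    estimated from data behaves like a weighted average, so data from two
    different voices pulls it toward a midpoint that fits neither voice well.

```python
# Hypothetical one-dimensional "feature values" for one model parameter.
# Estimating the parameter from data amounts to averaging the data.

def mean(values):
    return sum(values) / len(values)

nicholas = [1.0, 1.2, 0.8]      # made-up values for one voice
charlotte = [-1.1, -0.9, -1.0]  # made-up values for a very different voice

print(mean(nicholas))              # 1.0: fits Nicholas well
print(mean(charlotte))             # -1.0: fits Charlotte well
print(mean(nicholas + charlotte))  # 0.0: a compromise that fits neither
```

    With many speakers the average settles on a reasonable mean; with only two,
    each speaker simply biases the other.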

    There is no need to split into words, because the segmentation into states
    is done automatically by the Baum-Welch algorithm during adaptation, when
    you invoke the "bw" command. To modify the models you need to re-estimate
    the distributions of senones, not words. So if you wanted to split at all,
    you would need to split into sub-phonemes (impossible to do correctly by
    hand), not words.
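    To illustrate why no manual cutting is needed, here is a hypothetical,
    greatly simplified forced alignment: given per-frame scores against a fixed
    left-to-right state sequence, a dynamic-programming pass recovers the
    frame-to-state boundaries on its own. (Baum-Welch in "bw" computes soft
    versions of these assignments; this sketch is only an illustration, not
    the CMUSphinx code.)

```python
# Toy forced alignment: assign each audio frame to a state in a fixed
# left-to-right sequence, maximizing the total score, via Viterbi-style
# dynamic programming.

def force_align(frame_scores):
    """frame_scores[t][s] = log-score of frame t under state s.
    Returns the best monotonic assignment of frames to states."""
    T, S = len(frame_scores), len(frame_scores[0])
    NEG = float("-inf")
    best = [[NEG] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    best[0][0] = frame_scores[0][0]  # must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = best[t - 1][s]                       # remain in state s
            move = best[t - 1][s - 1] if s > 0 else NEG  # advance from s-1
            if move > stay:
                best[t][s], back[t][s] = move + frame_scores[t][s], s - 1
            else:
                best[t][s], back[t][s] = stay + frame_scores[t][s], s
    # Trace back from the final state to read off the segmentation.
    path, s = [], S - 1
    for t in range(T - 1, -1, -1):
        path.append(s)
        s = back[t][s]
    return path[::-1]

# Three states, six frames; the scores make the true boundaries obvious,
# and the aligner recovers them with no manual segmentation:
scores = [
    [0.0, -5.0, -5.0],
    [0.0, -5.0, -5.0],
    [-5.0, 0.0, -5.0],
    [-5.0, 0.0, -5.0],
    [-5.0, -5.0, 0.0],
    [-5.0, -5.0, 0.0],
]
print(force_align(scores))  # [0, 0, 1, 1, 2, 2]
```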

    You can get an introduction to the concepts of speech recognition from the
    tutorial:

    http://cmusphinx.sourceforge.net/wiki/tutorial

    If a user has an impediment such as a lisp and they're added to the model,
    how does this affect the resulting model for users without a lisp? In the
    tests I tried, the accuracy for other users decreased,

    It's expected to decrease, because the "average" becomes noisier than it
    needs to be.

    how can the model be improved for impediments without impacting other users?

    The model would have to be designed and trained in a special way. It would
    need to include the notion of an "impediment" as a hidden variable, and the
    training algorithm would have to be able to distinguish it from the average
    audio.

    Would recording each word individually and adding them to the model help
    with this situation?

    No

     
  • Nicholas

    Nicholas - 2012-07-16

    Thank you for your answers it has been very helpful.

    The "adding" is a misconception here. An acoustic model is a statistically
    estimated "average" of human voices; it's not a database.

    Could you explain a little more about how this average works? If one voice
    is added, how is the whole average affected? Does the new voice adjust the
    model proportionally?

    i.e. if the model was built from 1000 voices, would the addition of one
    voice be worth 0.1% or 50%?

    The model would have to be designed and trained in a special way. It would
    need to include the notion of an "impediment" as a hidden variable, and the
    training algorithm would have to be able to distinguish it from the average
    audio.

    Do you have an example on how to implement this? Would the algorithm determine
    if the user has an impediment and then decide on which model to use?

     
  • Nickolay V. Shmyrev

    Hello

    Could you explain a little more about how this average works? If one voice
    is added, how is the whole average affected? Does the new voice adjust the
    model proportionally? i.e. if the model was built from 1000 voices, would
    the addition of one voice be worth 0.1% or 50%?

    The smoothing with the original model is controlled by a parameter, tau.
    By default tau is selected to maximize the posterior probability of the
    adaptation data, or it can be set explicitly with a command-line parameter.
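    Roughly speaking (a hedged sketch of the MAP idea, not the exact map_adapt
    internals), tau acts like a count of "pseudo-observations" standing in for
    the original model, so the weight of one new voice depends on tau, not on
    how many voices originally trained the model:

```python
# MAP-style update of a single mean: the prior counts as `tau`
# pseudo-observations at the original value, the new data as one
# observation per frame. All numbers here are made up for illustration.

def adapted_mean(prior_mean, tau, new_frames):
    n = len(new_frames)
    return (tau * prior_mean + sum(new_frames)) / (tau + n)

new_voice = [1.0] * 100  # 100 frames from one hypothetical new speaker

# Large tau: the prior dominates, the new voice barely moves the mean.
print(adapted_mean(0.0, tau=10000.0, new_frames=new_voice))  # ~0.0099

# Small tau: the new voice dominates the estimate.
print(adapted_mean(0.0, tau=10.0, new_frames=new_voice))     # ~0.909
```

    So the answer to "0.1% or 50%?" is: whichever tau implies, regardless of
    the original training set size.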

    Do you have an example on how to implement this?

    No

    Would the algorithm determine if the user has an impediment and then decide
    on which model to use?

    This is one of the possible ways.

     
  • Hitarth

    Hitarth - 2012-07-21

    Hi,

    Sorry to butt in like this, but I had a related question.

    Is it possible to adapt an acoustic model "too much", so that the accuracy
    of the resulting model is worse than that of the model we started with?

    If yes, then is it a bad idea to adapt a model to the voices of a lot of
    people, if the aim of the adaptation is to make the model more accurate for
    people with different accents, etc.?

    Cheers

     
  • Nickolay V. Shmyrev

    Is it possible to adapt an acoustic model "too much", so that the accuracy
    of the resulting model is worse than that of the model we started with?

    Yes

    If yes, then is it a bad idea to adapt a model to the voices of a lot of
    people, if the aim of the adaptation is to make the model more accurate for
    people with different accents, etc.?

    Sometimes yes

     
