I have a number of questions that would really help me to understand adapting the model to additional users.
Order of voices added to model
Does the order in which additional voices are added to the model matter?
I have two users, Nicholas and Charlotte, and I added both voices to the original model using the sentences from the arctic example on http://cmusphinx.sourceforge.net/wiki/tutorialadapt.
First I added Nicholas and then Charlotte to the original model, then used some different sentences to test the accuracy. I then repeated the same experiment, this time adding Charlotte and then Nicholas to the original model. This created two different adapted models.
When testing Nicholas against both models the accuracy was the same both
times:
TOTAL Words: 49 Correct: 48 Errors: 2
TOTAL Percent correct = 97.96% Error = 4.08% Accuracy = 95.92%
TOTAL Insertions: 1 Deletions: 1 Substitutions: 0
When testing Charlotte against both models the accuracy was different:
Charlotte added after Nicholas
TOTAL Words: 49 Correct: 41 Errors: 9
TOTAL Percent correct = 83.67% Error = 18.37% Accuracy = 81.63%
TOTAL Insertions: 1 Deletions: 0 Substitutions: 8
Nicholas added after Charlotte
TOTAL Words: 49 Correct: 44 Errors: 6
TOTAL Percent correct = 89.80% Error = 12.24% Accuracy = 87.76%
TOTAL Insertions: 1 Deletions: 0 Substitutions: 5
What is the cause for this difference?
Internals of adding to a model
How does adding new voices to a model work?
If you record a sentence and create the required language models and dictionary files in order to add it to the model, how does it know which parts of the sound relate to which word? How can this be accurate? If it could do this, surely you would never need to add any new voices to the model?
Surely the better option would be to split the sentences up into individual words (one word per recording) and adapt the model that way?
Speech impediments
If a user has an impediment such as a lisp and they're added to the model, how does this affect the resulting model for users without a lisp?
In the tests I tried, the accuracy for other users decreased. How can the model be improved for impediments without impacting other users? Would recording each word individually and adding them to the model help with this situation?
Thank you for your time and for providing a quality solution.
When testing Nicholas against both models the accuracy was the same both
times. When testing Charlotte against both models the accuracy was different.
What is the cause for this difference?
I think it's just an issue in testing. With more tests the accuracy should be different in both cases. There is no reason for it to be the same; it can only happen accidentally.
How does adding new voices to a model work? If you record a sentence and create the required language models and dictionary files in order to add it to the model, how does it know which parts of the sound relate to which word? How can this be accurate? If it could do this, surely you would never need to add any new voices to the model? Surely the better option would be to split the sentences up into individual words (one word per recording) and adapt the model that way?
The "adding" is a misconception here. An acoustic model is a statistically estimated "average" of human voices. It's not a database. You cannot "add" two voices; their differences will create a bias which reduces the accuracy on both voices, because you would be averaging just two voices. That means you can either adapt to a single speaker, thus shifting the average closer to this speaker, or you need to adapt to many speakers in order to shift the average to some reasonable mean.
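To make the bias concrete, here is a toy numeric sketch (this is not CMUSphinx code, and the numbers are invented): a single Gaussian estimated from two dissimilar speakers fits either speaker worse than a model estimated from that speaker alone.

```python
# Toy illustration: averaging two dissimilar speakers produces a single
# Gaussian that fits neither speaker particularly well.
import numpy as np

rng = np.random.default_rng(0)
speaker_a = rng.normal(loc=-2.0, scale=1.0, size=1000)  # e.g. one cepstral dimension
speaker_b = rng.normal(loc=+2.0, scale=1.0, size=1000)

def gaussian_loglik(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

pooled = np.concatenate([speaker_a, speaker_b])
per_speaker = gaussian_loglik(speaker_a, speaker_a.mean(), speaker_a.var()).mean()
averaged    = gaussian_loglik(speaker_a, pooled.mean(), pooled.var()).mean()

print(f"log-likelihood of A under A-only model  : {per_speaker:.2f}")
print(f"log-likelihood of A under averaged model: {averaged:.2f}")  # noticeably lower
```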
There is no need to split into words, because the segmentation into states is done automatically with the Baum-Welch algorithm during adaptation when you invoke the "bw" command. To modify the models you need to re-estimate the distributions of senones, not words. So if you wanted to split, you would need to split into subphonemes (impossible to do right manually), not words.
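For intuition only, here is a small Python sketch of the forward-backward computation that Baum-Welch is built on. Given an HMM topology and per-frame state likelihoods (the model and numbers below are invented for illustration), it produces a soft assignment of every audio frame to a state, which is exactly the alignment that would otherwise have to be done by hand:

```python
# Toy forward-backward (the E-step of Baum-Welch): compute, for each audio
# frame, the probability of being in each HMM state. In real adaptation the
# states are senone states and the frame scores come from the model's
# Gaussians, but the principle is the same.
import numpy as np

# 3-state left-to-right HMM (think: sub-phone states)
A = np.array([[0.7, 0.3, 0.0],      # transition probabilities
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
pi = np.array([1.0, 0.0, 0.0])      # always start in state 0

# B[t, s] = likelihood of frame t under state s (made-up numbers)
B = np.array([[0.9, 0.1, 0.1],
              [0.8, 0.3, 0.1],
              [0.2, 0.9, 0.2],
              [0.1, 0.8, 0.3],
              [0.1, 0.2, 0.9],
              [0.1, 0.1, 0.9]])
T, S = B.shape

alpha = np.zeros((T, S))            # forward probabilities
alpha[0] = pi * B[0]
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * B[t]

beta = np.ones((T, S))              # backward probabilities
for t in range(T - 2, -1, -1):
    beta[t] = A @ (B[t + 1] * beta[t + 1])

gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)   # state occupancy per frame

print(np.round(gamma, 2))           # each row shows which state "owns" that frame
```

Because the alignment falls out of these statistics automatically, recording one word per file buys nothing; whole sentences work fine.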
You can get initial information about the concepts of speech recognition from the tutorial:
http://cmusphinx.sourceforge.net/wiki/tutorial
If a user has an impediment such as a lisp and they're added to the model, how does this affect the resulting model for users without a lisp? In the tests I tried, the accuracy for other users decreased.
It's expected to decrease, because the "average" becomes dirtier than it could be.
How can the model be improved for impediments without impacting other users?
The model should be designed and trained in a special way. It should include the notion of "impediment" as a hidden variable, and the training algorithm should be able to distinguish it from the average audio.
Would recording each word individually and adding them to the model help
with this situation?
No
Thank you for your answers; they have been very helpful.
The "adding" is a misconception here. An acoustic model is a statistically estimated "average" of human voices. It's not a database.
Could you explain a little more about how this average works? If one voice is added, how is the whole average affected? Does the new voice adjust the model proportionally?
I.e. if the model was built from 1000 voices, would the addition of one voice be worth 0.1% or 50%?
The model should be designed and trained in a special way. It should include the notion of "impediment" as a hidden variable, and the training algorithm should be able to distinguish it from the average audio.
Do you have an example of how to implement this? Would the algorithm determine if the user has an impediment and then decide which model to use?
Could you explain a little more about how this average works? If one voice is added, how is the whole average affected? Does the new voice adjust the model proportionally? I.e. if the model was built from 1000 voices, would the addition of one voice be worth 0.1% or 50%?
The smoothing with the original model is controlled by a parameter, tau. By default tau is selected to maximize the posterior probability of the adaptation data, or it can be set explicitly with a command-line parameter.
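A minimal sketch of the MAP-style mean update that tau controls (an illustration, not the actual SphinxTrain code; all numbers are invented): the adapted mean is a blend of the original mean and the adaptation statistics, with tau acting as a virtual count for the original model. So the influence of a new voice is not a fixed share like "1 out of 1000 voices"; it depends on how much adaptation data it contributes relative to tau.

```python
# Sketch of a MAP mean update: blend the prior mean with the
# occupancy-weighted adaptation frames, with tau as the prior weight.
import numpy as np

def map_update_mean(prior_mean, frames, gamma, tau):
    """prior_mean: mean from the original model
    frames      : adaptation feature vectors assigned to this Gaussian
    gamma       : per-frame occupancy weights (from forward-backward)
    tau         : prior weight; larger tau = trust the original model more
    """
    gamma = np.asarray(gamma)
    weighted_sum = (gamma[:, None] * frames).sum(axis=0)
    count = gamma.sum()
    return (tau * prior_mean + weighted_sum) / (tau + count)

prior = np.array([0.0, 0.0])
frames = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]])
gamma = [1.0, 1.0, 1.0]

for tau in (1.0, 10.0, 100.0):
    print(tau, map_update_mean(prior, frames, gamma, tau))
# With only 3 frames of data, tau=100 barely moves the mean,
# while tau=1 moves it most of the way toward the adaptation data.
```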
Do you have an example on how to implement this?
No
Would the algorithm determine if the user has an impediment and then decide which model to use?
This is one of the possible ways.
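One conceivable way to do that routing is sketched below in Python. Note this is not an existing CMUSphinx feature; the model names are hypothetical and `score` stands in for a real decoder call returning an acoustic log-likelihood.

```python
# Sketch: score a short enrollment utterance against a standard model and an
# impediment-adapted model, then route the user to whichever fits better.
from typing import Callable, Dict

def pick_model(audio_path: str,
               models: Dict[str, str],
               score: Callable[[str, str], float]) -> str:
    """Return the name of the model with the highest score on the audio."""
    scores = {name: score(path, audio_path) for name, path in models.items()}
    return max(scores, key=scores.get)

# Example usage with a dummy scorer (real code would decode the audio):
models = {"standard": "en-us", "lisp_adapted": "en-us-lisp"}   # hypothetical names
dummy_score = lambda model_path, audio: {"en-us": -1200.0, "en-us-lisp": -1100.0}[model_path]
print(pick_model("enroll.wav", models, dummy_score))           # -> "lisp_adapted"
```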
Sorry to butt in like this, but I had a related question.
Is it possible to adapt an acoustic model "too much", so that the accuracy of the resulting model is worse than what we started the adaptation with?
If yes, then is it a bad idea to adapt a model to the voices of a lot of people, if the aim of the adaptation is to ensure that the model becomes more accurate for people with different accents etc.?
Cheers
Is it possible to adapt an acoustic model "too much", so that the accuracy of the resulting model is worse than what we started the adaptation with?
Yes
If yes, then is it a bad idea to adapt a model to the voices of a lot of people, if the aim of the adaptation is to ensure that the model becomes more accurate for people with different accents etc.?
Sometimes yes