Let's say I have done an adaptation as per the adaptation tutorial for pocketsphinx. Could I then do another adaptation by generating another set of files using lmtool, say another 50 lines, recording the wav files, appending the new txt, fileids, transcription and dic files to the files used in the original adaptation, and then running the appended files, with the original and new wav files, through the adaptation process? Could I expect another increase in accuracy? If so, at what point would I stop seeing reasonable improvements in accuracy (say 1% or better)?
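To be concrete, by "appending" I mean continuing the same file formats the tutorial uses (the extra_* IDs below are just placeholders). The fileids file would gain new IDs at the end:

    arctic_0001
    ...
    arctic_0020
    extra_0001
    extra_0002

and the transcription file would gain one matching line per new wav file:

    <s> first new adaptation sentence </s> (extra_0001)
    <s> second new adaptation sentence </s> (extra_0002)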
Another way to ask the question: the arctic20 set has only twenty lines of text used to adapt the model, and I'm presuming more is better up to a point. Must this be done all at once (say 20 hours of wav files, for example), or can the adaptation be done over a period of time, for example 20 minutes at a time, by appending the new files to the old? Or is it better to just keep adapting the model, with no appending necessary, as long as you adapt the new (already adapted) model?
regards, Richard
Or is it better to just keep adapting the model, with no appending necessary, as long as you adapt the new (already adapted) model?
Adaptation is an estimation of the "average" speech parameters. If you split your adaptation data, the average of the chunks does not necessarily match the average of the whole. Sometimes the average of a part is a better estimate of the proper parameters (in the case of noisy data); sometimes it's better to consider the dataset as a whole (in the case of clean data). For speech parameters it's better to adapt on the whole set; adapting part by part does not work.
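A toy numeric sketch of the difference (plain Python, not pocketsphinx internals; the tau prior weight is a deliberate simplification of MAP-style adaptation):

    import statistics

    def map_update(prior_mean, data, tau=10.0):
        # MAP-style update: interpolate the prior mean with the sample
        # mean of the new data; tau is the weight given to the prior.
        sample_mean = statistics.fmean(data)
        return (tau * prior_mean + len(data) * sample_mean) / (tau + len(data))

    chunks = [[1.0, 1.2, 0.8], [2.0, 2.2, 1.8], [3.0, 3.1, 2.9]]

    # Sequential: adapt, then adapt the already-adapted model again.
    mean_seq = 0.0
    for chunk in chunks:
        mean_seq = map_update(mean_seq, chunk)

    # Batch: one adaptation over the whole dataset at once.
    mean_batch = map_update(0.0, [x for c in chunks for x in c])

    print(mean_seq, mean_batch)  # ~1.18 vs ~0.95: the estimates differ

Each sequential step re-applies the prior weight, so earlier chunks get down-weighted relative to the batch estimate.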
Okay, I understand. So how would you suggest one proceed? Take the example of robot control, for the sake of discussion. Let's say I were to use the default language model with pocketsphinx and adapt it per the Arctic20 tutorial with some added lines such as "robot move forward", "robot turn left", etc., say another 40 lines of text added to the Arctic20 file, run that through lmtool, and then do the acoustic adaptation. Presumably, if I stuck to just the commands used in the file, I'd have a pretty high accuracy rate, if not nigh perfect.
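(By "added lines" I mean a plain sentence corpus in the form lmtool expects, one utterance per line, something like:

    robot move forward
    robot turn left
    robot turn right
    robot stop

lmtool then returns a matching .lm and .dic for those sentences.)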
What then would be the course of action if I wanted to add new commands, or say a chatbot function to the robot, which would need a much larger vocabulary? Would I change the language model, do another adaptation of the acoustic model, or something else altogether?
In layman's terms, I'm asking how one increases the vocabulary of the pocketsphinx implementation (running on a continuous stream) while maintaining a low WER. I think I'm still subconsciously using the paradigm that there is a one-to-one correspondence between what I teach/adapt Sphinx to and what it can understand, allowing somewhat for further understanding using HMMs to extrapolate, but that's not quite the case, is it?
Also, out of curiosity, just how big is the default language model?
regards, Richard
Would I change the language model, do another adaptation of the acoustic model, or something else altogether?
You do not need to run another adaptation. The voice statistics remain the same; they do not depend on the vocabulary.
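To grow the vocabulary you only change the language model and dictionary; the adapted acoustic model stays as it is. For example (paths and file names here are illustrative):

    pocketsphinx_continuous -inmic yes \
        -hmm /path/to/adapted-model \
        -lm bigger.lm \
        -dict bigger.dic

For a fixed command set, a JSGF grammar (passed with -jsgf instead of -lm) restricts the search even further and usually gives better accuracy than a generic language model.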
In layman's terms, I'm asking how one increases the vocabulary of the pocketsphinx implementation (running on a continuous stream) while maintaining a low WER. I think I'm still subconsciously using the paradigm that there is a one-to-one correspondence between what I teach/adapt Sphinx to and what it can understand, allowing somewhat for further understanding using HMMs to extrapolate, but that's not quite the case, is it?
There is no such correspondence. The acoustic model describes sounds and the language model describes words; they work together to restrict the search space. You may want to read the tutorial about that:
http://cmusphinx.sourceforge.net/wiki/tutorialconcepts
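In symbols: the decoder picks the word sequence W that maximizes P(O|W) * P(W) for the observed audio O. The acoustic model supplies P(O|W) (how well the sounds match the words), the language model supplies P(W) (how plausible the word sequence is), and restricting either one shrinks the search space.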