
Is adaptation cumulative?

2012-07-27
2012-09-22
  • Richard Kappler

    Richard Kappler - 2012-07-27

    Let's say I have done an adaptation as per the adaptation tutorial for
    pocketsphinx. Could I then do another adaptation by generating another set of
    files using lmtool, say another 50 lines, generating the wav files, then
    appending the new txt, fileids, transcription and dic files to the files that
    were used in the original adaptation, and running the appended files with
    the original and new wav files through the adaptation process? Could I expect
    an increase in accuracy again? If so, at what point would I stop seeing
    reasonable improvements in accuracy (say 1% or better)?

    I guess another way to ask the question: the Arctic20 files had maybe ten
    lines of text used to adapt the model. I'm presuming more is better up to a
    point. Must this be done all at once (say, 20 hours of wav files), or can the
    adaptation be done over a period of time, for example 20 minutes at a time,
    by appending the new files to the old ones? Or is it better to just keep
    adapting the model, so that appending is not necessary as long as you adapt
    the new (already adapted) model?

    regards, Richard
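
The appending step Richard describes is plain file concatenation. A minimal sketch, assuming the batch-file naming here (adapt.fileids, batch2.transcription, and so on) is illustrative rather than anything the tutorial prescribes:

```python
# Sketch: merge a new adaptation batch into the control files from a
# previous run, so the adaptation tools can be rerun over the combined set.
# The file names and suffixes are assumptions for illustration only.
import shutil

def append_batch(base, new, suffixes=(".fileids", ".transcription")):
    """Append each file of the new batch onto the corresponding base file."""
    for suffix in suffixes:
        with open(base + suffix, "a") as dst, open(new + suffix) as src:
            shutil.copyfileobj(src, dst)

# append_batch("adapt", "batch2")
# ...then rerun the adaptation process on adapt.* with all the wav files.
```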

     
  • Nickolay V. Shmyrev

    or is it better to just keep adapting the model and appending is not
    necessary as long as you adapt the new (already adapted) model?

    Adaptation is an estimation of the "average" speech parameters. If you split
    your adaptation data, the average of the chunks does not necessarily match
    the average of the whole. Sometimes the average of a part is a better
    estimate of the proper parameters (in the case of noisy data); sometimes
    it's better to consider the dataset as a whole (in the case of clean data).
    For speech parameters it's better to adapt to the whole set; adapting part
    by part does not work.
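
Nickolay's point about chunk averages can be seen with plain numbers: averaging per-chunk means weights every chunk equally, so when chunks have different sizes the result differs from the mean of the pooled data. A toy illustration:

```python
# Toy illustration: the mean of per-chunk means vs. the pooled mean.
# With unequal chunk sizes, the two estimates disagree.
chunks = [[1.0, 1.0, 1.0, 1.0], [9.0, 9.0]]  # 4 samples vs. 2 samples

pooled = [x for chunk in chunks for x in chunk]
pooled_mean = sum(pooled) / len(pooled)                          # 22/6 = 3.67
chunk_mean = sum(sum(c) / len(c) for c in chunks) / len(chunks)  # (1+9)/2 = 5.0

print(pooled_mean, chunk_mean)
```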

     
  • Richard Kappler

    Richard Kappler - 2012-07-27

    Okay, I understand. So how would you suggest one proceed? Take robot control
    as an example, for the sake of discussion. Let's say I were to use the
    default language model with pocketsphinx and adapt it using the Arctic20
    tutorial, with some added lines such as "robot move forward", "robot turn
    left", etc., say another 40 lines of text added to the Arctic20 file, run
    that through lmtool and then do the acoustic adaptation. Presumably, if I
    stuck to just the commands used in the file, I'd have a pretty high accuracy
    rate, if not nigh perfect.

    What then would be the course of action if I wanted to add new commands, or
    say a chatbot function to the robot, which would need a much larger
    vocabulary? Would I change the language model, do another adaptation of the
    acoustic model, or something else altogether?

    In layman's terms, I'm asking how one increases the vocabulary of the
    pocketsphinx implementation (running on a continuous stream) while
    maintaining a low WER. I think I'm still subconsciously using the paradigm
    that there is a one-to-one correspondence between what I teach/adapt Sphinx
    to and what it can understand, allowing somewhat for further understanding
    using HMMs to extrapolate, but that's not quite the case, is it?

    Also, out of curiosity, just how big is the default language model?

    regards, Richard

     
  • Nickolay V. Shmyrev

    Would I change the language model, do another adaptation of the acoustic
    model, or something else altogether?

    You do not need to run another adaptation. The voice statistics remain the
    same; they do not depend on the vocabulary.

    In layman's terms, I'm asking how one increases the vocabulary of the
    pocketsphinx implementation (running on a continuous stream) while
    maintaining a low WER. I think I'm still subconsciously using the paradigm
    that there is a one-to-one correspondence between what I teach/adapt Sphinx
    to and what it can understand, allowing somewhat for further understanding
    using HMMs to extrapolate, but that's not quite the case, is it?

    There is no such correspondence. The acoustic model describes sounds and the
    language model describes words; they work together to restrict the search
    space. You may want to read the tutorial about that:

    http://cmusphinx.sourceforge.net/wiki/tutorialconcepts
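
The vocabulary change Nickolay describes happens entirely on the language-model side: you extend the sentence corpus that feeds the language-model build (for example, the file submitted to lmtool) and rebuild the LM, leaving the adapted acoustic model untouched. A minimal sketch of growing such a corpus; the filename and the uppercase convention are assumptions for illustration:

```python
# Sketch: grow the vocabulary by extending the sentence corpus that feeds
# the language-model build, then rebuild the LM (e.g. with lmtool).
# The acoustic model is not touched. The filename is illustrative.

def extend_corpus(corpus_path, new_sentences):
    """Append new command sentences, skipping ones already present."""
    try:
        with open(corpus_path) as f:
            existing = {line.strip() for line in f}
    except FileNotFoundError:
        existing = set()
    with open(corpus_path, "a") as f:
        for sentence in new_sentences:
            s = sentence.strip().upper()  # assuming an all-uppercase corpus
            if s and s not in existing:
                f.write(s + "\n")
                existing.add(s)

# extend_corpus("robot.corpus", ["robot move forward", "robot turn left"])
# Then rebuild the LM from robot.corpus and point pocketsphinx at the
# resulting .lm and .dic files.
```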

     
