Continuous training

  • Richard Kappler - 2012-08-14

    Stop cursing, Nickolay, and mind your blood pressure. :-) It might not be as
    dumb a question as you usually get from me. Then again, it might.

    I have a good background in image/facial rec and general machine learning. I
    am, as you know, struggling at times to apply what I know to speech rec and
    to learn more. With image rec, I can run an image against a database; for the
    sake of discussion, we'll say I'm using eigenfaces for facial recognition.
    Now, if I run into a new face, say I wish to add a new person to the
    database, it's a fairly trivial exercise to add some images of the new
    person's face to the database and then retrain the model. Is anything along
    these lines feasible for speech recognition?

    My example may be way off base here, but say I were running pocketsphinx and
    the recognizer returned faulty text: maybe I said "do you like baseball" and
    the recognizer's output was "do you might face fall." Would it be possible to
    build in a script of some sort that would allow me to say a key phrase, we'll
    call it "bad response," that would stop the recognition phase and enter some
    sort of training phase to, I dunno, have pocketsphinx pull up the previous
    acoustic input and ask me to enter the matching text, then have pocketsphinx
    assimilate this new data, thus "training" the program as we go? Or am I way
    out in left field here?

    Feel free to yell now.

    regards, Richard

     
  • Nickolay V. Shmyrev

    Now, if I run into a new face, say I wish to add a new person to the
    database, it's a fairly trivial exercise to add some images of the new
    person's face to the database and then retrain the model. Is anything along
    these lines feasible for speech recognition?

    It's the same.

    Would it be possible to build in a script of some sort that would allow me
    to say a key phrase, we'll call it "bad response," that would stop the
    recognition phase and enter some sort of training phase to, I dunno, have
    pocketsphinx pull up the previous acoustic input and ask me to enter the
    matching text, then have pocketsphinx assimilate this new data, thus
    "training" the program as we go?

    Yes, that's what MAP adaptation does. With a carefully selected smoothing
    factor (the -tau parameter) you can control the convergence speed and avoid
    overtraining. You can periodically retrain the model on the whole database
    too, though it's slower.
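
    Roughly, that adaptation step looks like this. This is only a sketch,
    assuming the sphinxtrain tools bw and map_adapt are installed; the model
    name (hub4wsj_sc_8k), the dictionary, and the fileids/transcription files
    are placeholders, and values like -feat and -ts2cbfn must match the model
    you are actually adapting (the adaptation tutorial has the full flag list):

        # Sketch only: MAP-adapt an acoustic model with the sphinxtrain tools.
        # All paths below are placeholders for your own setup.
        import subprocess

        MODEL = "hub4wsj_sc_8k"          # original model, left untouched
        ADAPTED = "hub4wsj_sc_8k_adapt"  # output directory for the adapted files
        TAU = "10"                       # smoothing: larger = more conservative updates

        # 1) Accumulate observation counts over the adaptation recordings
        #    (assumes the .mfc feature files were already made with sphinx_fe).
        subprocess.check_call([
            "bw",
            "-hmmdir", MODEL,
            "-moddeffn", MODEL + "/mdef",  # may need the text mdef, not the binary one
            "-ts2cbfn", ".semi.",          # ".cont." for continuous models
            "-feat", "1s_c_d_dd",
            "-cmn", "current",
            "-agc", "none",
            "-dictfn", "my.dic",
            "-ctlfn", "adapt.fileids",
            "-lsnfn", "adapt.transcription",
            "-accumdir", "accum",
        ])

        # 2) Fold the counts into the model with MAP, controlled by -tau.
        subprocess.check_call([
            "map_adapt",
            "-meanfn", MODEL + "/means",
            "-varfn", MODEL + "/variances",
            "-mixwfn", MODEL + "/mixture_weights",
            "-tmatfn", MODEL + "/transition_matrices",
            "-accumdir", "accum",
            "-mapmeanfn", ADAPTED + "/means",
            "-mapvarfn", ADAPTED + "/variances",
            "-mapmixwfn", ADAPTED + "/mixture_weights",
            "-maptmatfn", ADAPTED + "/transition_matrices",
            "-tau", TAU,
        ])

    The -tau value is the knob mentioned above: a larger value keeps the
    adapted model closer to the original, a smaller one lets the new data
    pull harder.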

     
  • Richard Kappler - 2012-08-20

    Yes, that's what MAP adaptation does. With a carefully selected smoothing
    factor (the -tau parameter) you can control the convergence speed and
    avoid overtraining.

    Any guidance on how to go about this? I get that you answered my question; I
    just don't know enough yet to understand the answer. What I think you're
    saying is that, if I can come up with a script that stops recognition, or,
    with my knowledge level, exits recognition if it hears "bad response", I'd
    have to add a new wave file, run map_adapt to generate new means, variances,
    mixture weights, etc., and then periodically retrain the model with the new
    phrases added to the train.txt, new dic, etc. Or do I have to do a full
    adaptation each time (based on previous discussions, best done on the
    original acoustic model, not the already adapted one)?

    I'm a little lost here, Nickolay; sorry to be a pain in the neck, but throw
    me a bone, eh?

    regards, Richard

     
  • Nickolay V. Shmyrev

    I'd have to add a new wave file, run map_adapt to generate new means,
    variances, mixture weights, etc., and then periodically retrain the model
    with the new phrases added to the train.txt, new dic, etc. Or do I have to
    do a full adaptation each time (based on previous discussions, best done on
    the original acoustic model, not the already adapted one)?

    I don't quite see the difference, because I don't understand what
    "train.txt" is or what the difference is between "full adaptation" and
    "running map_adapt".

    From what I understood, I think it's a good strategy to improve the core
    model over time; you just need to do it carefully, because you can screw up
    the model. If you want to feel safe, the second approach is better.

    This paper might be good reading too:

    http://eprints.sics.se/3600/
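
    To make the "bad response" loop concrete, here is a minimal sketch of the
    safer strategy: keep one growing adaptation set and always re-adapt from
    the untouched original model. Every name in it (the directory layout, the
    add_bad_response and readapt helpers, the model name) is made up for
    illustration, not an existing API:

        # Illustrative only: accumulate misrecognized utterances with their
        # corrected transcripts, then periodically re-adapt from the ORIGINAL
        # model rather than stacking adaptation on top of adaptation.
        import os
        import shutil
        import subprocess

        ADAPT_DIR = "adapt_set"  # one growing corpus of corrected utterances
        FILEIDS = os.path.join(ADAPT_DIR, "adapt.fileids")
        TRANSCRIPTION = os.path.join(ADAPT_DIR, "adapt.transcription")

        def add_bad_response(wav_path, corrected_text):
            """Called after the user says the 'bad response' key phrase and
            types what was actually said: file the utterance away for later."""
            os.makedirs(ADAPT_DIR, exist_ok=True)
            n = sum(1 for _ in open(FILEIDS)) if os.path.exists(FILEIDS) else 0
            utt_id = "utt_%05d" % n
            shutil.copy(wav_path, os.path.join(ADAPT_DIR, utt_id + ".wav"))
            with open(FILEIDS, "a") as f:
                f.write(utt_id + "\n")
            with open(TRANSCRIPTION, "a") as f:
                # bw expects lines like:  <s> what was said </s> (utt_id)
                f.write("<s> %s </s> (%s)\n" % (corrected_text.lower(), utt_id))

        def readapt():
            """Re-run adaptation over the WHOLE set, always starting from the
            pristine base model, never from an already adapted copy."""
            subprocess.check_call([
                "sphinx_fe", "-argfile", "hub4wsj_sc_8k/feat.params",
                "-samprate", "16000", "-c", FILEIDS,
                "-di", ADAPT_DIR, "-do", ADAPT_DIR,
                "-ei", "wav", "-eo", "mfc", "-mswav", "yes",
            ])
            # ...then the bw and map_adapt calls from the sketch above,
            # reading the base model and writing a fresh adapted copy.

    The only important part is that readapt() always reads the original model
    and rewrites the adapted copy from scratch; that is the "feel safe" option
    above, at the cost of redoing the work each time.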

     
  • Richard Kappler - 2012-08-20

    Thanks, Nickolay. I think I get it better now. train.txt was my attempt at a
    generic equivalent of the arctic20.txt from the adaptation tutorial.

    So, to be sure I understand (and yes, I'm off to read that PDF immediately):
    my best bet is to keep track of "bad responses," add them to my adaptation
    set, and basically do a new adaptation on the original acoustic model, the
    idea being that the larger (and more tailored to my use) the adaptation set,
    the better. Yes?

    I think I get it. Thanks again.

    regards, Richard

     
  • Pang Lei - 2012-08-28

    Hi, I currently want to transcribe news videos downloaded from YouTube with
    sphinx4. However, the performance is so poor that I can't believe it.

    Here is what I've done. I use the hub4 trigram language model as well as the
    hub4 acoustic model, and cmu07a.dic as the dictionary. I made only very
    small changes to the Transcriber demo provided in the source code; nothing
    differs except the language model path, the acoustic model path, and the
    dictionary path.

    Here are two links used in my test.
    http://www.youtube.com/watch?v=2vmuFpY428g&feature=youtube_gdata_player
    http://www.youtube.com/watch?v=6MVoHFbBrP8&feature=youtube_gdata_player

    Can anyone tell me your result with your configuration? I'm confused now and
    just want to find out whether I have made a mistake or Sphinx simply
    performs poorly. If you have a good result, it would be greatly appreciated
    if you could tell me your config and code.

     
  • Nickolay V. Shmyrev

    However, the performance is so poor that I can't believe it.

    You need to read the FAQ:

    http://cmusphinx.sourceforge.net/wiki/faq#qwhy_my_accuracy_is_poor

    Can anyone tell me your result with your configuration? I'm confused now
    and just want to find out whether I have made a mistake or Sphinx simply
    performs poorly. If you have a good result, it would be greatly appreciated
    if you could tell me your config and code.

    With hub4, the WER should be about 30-35%.
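
    To put a number on it yourself, you can score the decoder output against a
    reference transcript with word error rate. A minimal, self-contained sketch
    (illustrative only, not a replacement for a proper scoring tool):

        # Word error rate (WER): word-level edit distance divided by the
        # number of reference words.
        def wer(reference, hypothesis):
            ref, hyp = reference.split(), hypothesis.split()
            # d[i][j] = edits needed to turn ref[:i] into hyp[:j]
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    d[i][j] = min(
                        d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # substitution or match
                        d[i - 1][j] + 1,                               # deletion
                        d[i][j - 1] + 1,                               # insertion
                    )
            return d[len(ref)][len(hyp)] / float(max(len(ref), 1))

        # The example from earlier in the thread: 3 errors over 4 reference
        # words, i.e. 0.75 = 75% WER.
        print(wer("do you like baseball", "do you might face fall"))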

     
  • Nickolay V. Shmyrev

    if you could tell me your config and code.

    The config file for hub4 is available in sphinx4/tests/performance/hub4.

     
