Stop cursing, Nikolay; mind your blood pressure. :-) It might not be as dumb a
question as you usually get from me. Then again, it might.
I have a good background in image/facial rec and general machine learning. I
am, as you know, struggling at times to apply what I know to speech rec and
learn more. With image rec, I can run an image against a database; for the sake of
discussion, we'll say I'm using eigenfaces for facial recognition. Now, if I
run into a new face, say I wish to add a new person to the database, it's a
fairly trivial exercise to add some images of the new person's face to the
database and then retrain the model. Is anything along these lines feasible
for speech recognition?
My example may be way off base here, but suppose I were running pocketsphinx and
the recognizer returned a faulty result: maybe I said "do you like baseball" and
the recognizer's output was "do you might face fall." Would it be possible to
build in a script of some sort that would allow me to say a key phrase, we'll
call it "bad response", that would stop the recognition phase and enter some sort
of training phase to, I dunno, have pocketsphinx pull up the previous acoustic
input and ask me to enter the matching text, then have pocketsphinx assimilate
this new data, thus "training" the program as we go? Or am I way out in left
field here?
Feel free to yell now.
regards, Richard
It's the same for speech recognition.
Yes, the kind of on-the-fly training you describe is what MAP adaptation does.
With a carefully selected smoothing factor (the -tau parameter) you can control
the convergence speed and avoid overtraining. You can also periodically retrain
the model on the whole database, though that's slower.
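In case it helps to see the mechanics, here is a rough sketch of what a single MAP pass can look like when driven from a script. The tool names (bw, map_adapt) and the overall flow follow the CMU Sphinx adaptation tutorial, but every path here (en-us, adapt.fileids, adapt.transcription, cmudict-en-us.dict) is a placeholder, and the feature and -ts2cbfn settings have to match the particular acoustic model being adapted, so treat this as an outline rather than a recipe:

    import os
    import subprocess

    MODEL = "en-us"          # original acoustic model directory (placeholder)
    ADAPTED = "en-us-adapt"  # where the MAP-adapted model will be written
    os.makedirs(ADAPTED, exist_ok=True)

    # MFCC features for the adaptation recordings are assumed to have been
    # extracted already with sphinx_fe, as in the adaptation tutorial.

    # 1. Accumulate observation counts from the adaptation data.
    subprocess.run([
        "bw",
        "-hmmdir", MODEL,
        "-moddeffn", os.path.join(MODEL, "mdef.txt"),
        "-ts2cbfn", ".ptm.",             # must match the model type
        "-feat", "1s_c_d_dd",
        "-cmn", "current",
        "-agc", "none",
        "-dictfn", "cmudict-en-us.dict",
        "-ctlfn", "adapt.fileids",       # list of adaptation utterances
        "-lsnfn", "adapt.transcription", # their reference transcripts
        "-accumdir", ".",
    ], check=True)

    # 2. MAP-update the Gaussian parameters.  -tau is the smoothing factor:
    #    larger values keep the result closer to the original model (slower,
    #    safer convergence); smaller values let the new data dominate.
    subprocess.run([
        "map_adapt",
        "-moddeffn", os.path.join(MODEL, "mdef.txt"),
        "-ts2cbfn", ".ptm.",
        "-meanfn", os.path.join(MODEL, "means"),
        "-varfn", os.path.join(MODEL, "variances"),
        "-mixwfn", os.path.join(MODEL, "mixture_weights"),
        "-tmatfn", os.path.join(MODEL, "transition_matrices"),
        "-accumdir", ".",
        "-tau", "10",
        "-mapmeanfn", os.path.join(ADAPTED, "means"),
        "-mapvarfn", os.path.join(ADAPTED, "variances"),
        "-mapmixwfn", os.path.join(ADAPTED, "mixture_weights"),
        "-maptmatfn", os.path.join(ADAPTED, "transition_matrices"),
    ], check=True)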
Any guidance on how to go about this? I get that you answered my question; I
just don't know enough yet to understand the answer. What I think you're saying
is that, if I can come up with a script that stops recognition (or, at my
knowledge level, exits recognition if it hears "bad response"), I'd have to add
a new wave file and run mapadapt to generate new means, variances, mixture
weights, etc., and then periodically retrain the model with the new phrases
added to the train.txt, new dic, etc. Or do I have to do a full adaptation each
time (which, based on previous discussions, is best done on the original
acoustic model, not the already adapted one)?
I'm a little lost here, Nikolay. Sorry to be a pain in the neck, but throw me a
bone, eh?
regards, Richard
I don't quite see a difference between the two options, because I don't
understand what "train.txt" is, or what the difference is between a "full
adaptation" and "running mapadapt".
From what I understood, I think it's a good strategy to improve the core model
over time; you just need to do it carefully, because you can screw up the model.
If you want to play it safe, the second approach is better.
This paper might be a good read too:
http://eprints.sics.se/3600/
Thanks, Nikolay. I think I get it better now. train.txt was my attempt at a
generic equivalent of the arctic20.txt from the adaptation tutorial.
So, to be sure I understand (and yes, I'm off to read that PDF immediately):
my best bet is to keep track of "bad responses", add them to my adaptation set,
and basically do a new adaptation on the original acoustic model, the idea
being that the larger (and the more tailored to my use) the adaptation set, the
better. Yes?
I think I get it. Thanks again.
regards, Richard
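For what it's worth, here is a very rough sketch of the bookkeeping side of that plan, assuming recognition itself is already handled elsewhere: every "bad response" gets its recording and a corrected transcript appended to a growing adaptation set (same file formats as the arctic20 files in the tutorial), and the bw/map_adapt pass sketched above is then periodically re-run over the whole set, always starting from the original model. The recognizer hook and the adapt-data paths below are placeholders, not real pocketsphinx APIs:

    import shutil
    from pathlib import Path

    ADAPT_DIR = Path("adapt-data")          # placeholder location
    ADAPT_DIR.mkdir(exist_ok=True)
    FILEIDS = ADAPT_DIR / "adapt.fileids"
    TRANSCRIPTION = ADAPT_DIR / "adapt.transcription"

    def recognize_utterance():
        """Placeholder: return (hypothesis_text, path_to_recorded_wav) for one
        utterance, however recognition is actually wired up."""
        raise NotImplementedError

    def add_bad_response(wav_path, correct_text):
        """Copy a misrecognized utterance into the adaptation set."""
        n = sum(1 for _ in open(FILEIDS)) if FILEIDS.exists() else 0
        utt_id = "utt%04d" % (n + 1)
        shutil.copy(wav_path, ADAPT_DIR / (utt_id + ".wav"))
        with open(FILEIDS, "a") as f:
            f.write(utt_id + "\n")
        with open(TRANSCRIPTION, "a") as f:
            # same line format as arctic20.transcription: <s> words </s> (utt_id)
            f.write("<s> %s </s> (%s)\n" % (correct_text.lower(), utt_id))

    def listen_loop():
        last_wav = None
        while True:
            hypothesis, wav_path = recognize_utterance()
            if hypothesis.strip().lower() == "bad response" and last_wav:
                correct_text = input("What did you actually say? ")
                add_bad_response(last_wav, correct_text)
            else:
                print(hypothesis)
                last_wav = wav_path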
Hi, I currently want to transcribe news videos downloaded from YouTube with
Sphinx 4, but the performance is so poor that I can't believe it.
Here is what I've done: I use the hub4 trigram language model as well as the
hub4 acoustic model, and cmu07a.dic as the dictionary. I made only very small
changes to the transcriber demo provided in the source code; the only
differences are the language model path, the acoustic model path, and the
dictionary path.
Here are two links used in my test.
http://www.youtube.com/watch?v=2vmuFpY428g&feature=youtube_gdata_player
http://www.youtube.com/watch?v=6MVoHFbBrP8&feature=youtube_gdata_player
Can anyone tell me the results you get with your configuration? I'm confused
and just want to work out whether I've made a mistake or whether Sphinx simply
performs poorly here. If you have good results, I would greatly appreciate it
if you could share your config and code.
You need to read the FAQ:
http://cmusphinx.sourceforge.net/wiki/faq#qwhy_my_accuracy_is_poor
With hub4, the WER should be about 30-35%.
The config file for hub4 is available in sphinx4/tests/performance/hub4.
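For anyone comparing their own numbers against that 30-35% figure: word error rate is (substitutions + deletions + insertions) divided by the number of words in the reference transcript. A minimal, self-contained way to compute it for one sentence pair, using a word-level edit distance (just for illustration):

    def word_error_rate(reference, hypothesis):
        """WER = (substitutions + deletions + insertions) / reference length,
        computed via word-level Levenshtein distance."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution/match
        return dp[len(ref)][len(hyp)] / len(ref)

    # The example from the first post:
    # word_error_rate("do you like baseball", "do you might face fall") -> 0.75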