Hi all,
I have a very general (and possibly very obvious) question, but I wouldn't mind some input nonetheless.
I'm creating a very simple phoneme recognizer based loosely on the yesno example. At the moment it's trained only on audio samples of "s". The LM and grammar are restricted so that it can only return "s" or "not-s".
The problem is that one held-out "s" test file is being incorrectly recognized as "not-s".
I would say the sound is sufficiently like the rest of the training data to be considered an "s", but obviously the recognizer thinks otherwise.
My questions (I invite anyone else reading this to weigh in):
If "not-s" isn't modelled, does the test sentence have to be really exemplary of "s" in order to be let into the "s" group?
And if I provide the system with "not-s" training data, will the forced choice result in my test sentence being placed in the "s" category as the lesser of two evils?
Cheers.
> The problem is that one held-out "s" test file is being incorrectly recognized as "not-s".

It is not quite clear how you are training; you probably need to provide more details. For example, it is not clear how you trained a model with "not-s" without any examples of "not-s"; you should have a lot of warnings in the log. It is also not clear what kind of model you trained (how many mixtures, monophone or tied states), and so on. To make things clear, you could share your whole experiment folder as an archive.
> If "not-s" isn't modelled, does the test sentence have to be really exemplary of "s" in order to be let into the "s" group?

No, it should not be like that. The detector still allows some variance around what it saw in training.
> And if I provide the system with "not-s" training data, will the forced choice result in my test sentence being placed in the "s" category as the lesser of two evils?

Yes.
Update: answer = yes (for my experiment anyway)
So if a test sentence is not being recognised, is it a good plan to do one of the following:
1. test with a better example
2. train with worse examples
3. model the alternatives to force a choice?
> 1. test with a better example

This is cheating.
> 2. train with worse examples

It would be better to say "train with more representative examples". This is a good idea.
> 3. model the alternatives to force a choice?

This is a good idea too. Overall, HMM decoding is designed to discriminate between different classes of sounds; it has no support for detecting a single class against everything else. If you need that, you have to extend the algorithm itself, for example with large-margin estimators.
Joseph Keshet, "Large Margin Algorithms for Discriminative Continuous Speech Recognition" (thesis): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.116.8585&rep=rep1&type=pdf
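To spell out the forced-choice point: decoding always solves a forced choice over the hypotheses the grammar allows, roughly the MAP rule

    w* = argmax over w in {s, not-s} of  p(O | w) * P(w)

so whichever of the two hypotheses scores better wins, even if both fit the audio badly. Detecting "s" against everything else would instead require something like a threshold on p(O | s) alone, which standard Viterbi decoding does not give you.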
Hi Nickolay, I just saw your responses. Thanks for the input. Let me explain further.
I guess it's incorrect of me to say that I trained on "not-s". In fact I only trained on "s" (a monophone HMM), but I allowed for "not-s" in the language model. I've attached the experiment folder if you want to take a look.
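For concreteness, here is a rough sketch of my dict in the yesno-style layout (simplified; the file names follow the yesno recipe, and the real files are in the archive):

    # data/local/dict/lexicon.txt (a sketch, not the exact file)
    <SIL> SIL
    s     s
    not-s a

    # The grammar allows exactly one word per utterance, "s" or "not-s",
    # so "not-s" gets its own phone ('a') and its own HMM even though no
    # training audio is ever labelled "not-s".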
Thanks for the PDF. It looks like something I would like to learn more about. I'm only doing such a simple "s"/"not-s" setup because I want to be sure that I'm doing the acoustic modelling correctly (getting the answers I expect). However, from what you say, it looks like I'm introducing more problems by modelling a scenario that HMMs aren't equipped to deal with.
I do wonder though why, for this one audio sample, the best path is given as "not-s". Here is the lattice:
0 1 not-s 3.24568,2689.9,3_1_1_1_1_1_12_10_18_17_17_17_17_20_22_24_23_23_23_23_23_23_23_23_23_23_23
0 2 s 3.22173,2741.65,3_1_1_1_1_1_12_10_18_17_17_17_17_26_28_27_27_27_30_29_29_29_29_29_29_29_29
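(My reading of the arc fields, so treat this as my assumption about the text-lattice format:)

    # <from-state> <to-state> <word> <graph-cost>,<acoustic-cost>,<per-frame transition-id sequence>
    #
    # Costs are negative log-likelihoods, so lower is better: "not-s" has the
    # lower acoustic cost (2689.9 vs 2741.65), which outweighs its slightly
    # higher graph cost (3.24568 vs 3.22173), so "not-s" wins as the best path.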
I had understood that the acoustic cost reflects the probability that the test audio file was produced by the acoustic model of "not-s". If "not-s" isn't even modelled (recall that no "not-s" audio files were actually used in training; it's just that "not-s" is allowed as a possible transcription), how could an acoustic cost be calculated for not-s?
Am I doing something very silly here to allow for this? Or do HMMs just do something I don't yet know about?
Thanks!
Looking at your folder, I can tell you what happens:
1) On initialization, all Gaussians are initialized with the same global distribution, as usual for HMM training (flat start).
2) Since the 'a' state has zero occupancy, it is never re-estimated, and a warning is displayed in the log:
WARNING (gmm-est:MleDiagGmmUpdate():mle-diag-gmm.cc:365) Gaussian has too little data but not removing it because it is the last Gaussian: i = 0, occ = 0, weight = 1
So the Gaussians for 'a' remain as they were initialized. You can print the 0.mdl and final.mdl Gaussians and check that they remain almost unchanged (see the commands below).
3) On decoding, you essentially compare the trained Gaussians for 's' against the Gaussians for 'a', which still carry the flat-start estimate made from the whole data. Sometimes 'a' wins.
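A quick way to do that check (a sketch; I assume the models live in something like exp/mono, so adjust the paths to your setup):

    # Dump both models to text and compare; the means and variances of the
    # pdfs belonging to phone 'a' should be (almost) identical in both.
    gmm-copy --binary=false exp/mono/0.mdl exp/mono/0.mdl.txt
    gmm-copy --binary=false exp/mono/final.mdl exp/mono/final.mdl.txt
    diff exp/mono/0.mdl.txt exp/mono/final.mdl.txt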
That makes sense! Thanks for all your help.