Hi,
I tested the new WSJ acoustic models CMU published, expecting them to outperform the existing WSJ models trained by Keith Vertanen [1] because of the additional MLLT transformation. But on my task I get evaluation results about 2 times worse than with [1] (new model: WER 8.6%, SER 27.6%; [1]: WER 4.2%, SER 14.0%). This is a bit confusing; perhaps not the whole WSJ corpus was used for training the new models? I'll give the details of my configuration, so perhaps David or Nickolay can tell me whether the results are somehow plausible or whether I'm doing something wrong:
[1] Model used for comparison:
WSJ all; 3 states, no skips; 8000 senones; 16 Gaussians
Config used for decoding (with the new models):
-hmm wsj_all_cd30.mllt_cd_cont_4000
-lda wsj_all_cd30.mllt_cd_cont_4000/feature_transform
-fdict wsj_all_cd30.mllt_cd_cont_4000/noisedict
-lw 15
-feat 1s_c_d_dd
-dict navigate-go7.dic
-fsg nav.fsg
-wip 0.2
-beam 1e-120
-pbeam 1e-120
-wbeam 1e-100
-varnorm no
-cmn current
-hyp result
-op_mode 2
thanks and regards
Masrur D.
Looking at your FSG it does seem unlikely that the mdef file will make a difference.
I wonder if there is an integer overflow in HMM evaluation or something... Unfortunately I don't have time to debug this fully. But that would be something to look for.
You're right, there is only a very minor difference in the error rate (0.3%) with the other mdef.
http://www-users.rwth-aachen.de/Masrur.Doostdar/output_fsg_5.1.6
Here is a log of my decoding run with the 723 utterances.
http://www-users.rwth-aachen.de/Masrur.Doostdar/selecion.tar
Here you find 14 raw files of utterances where, with your model, GO was recognized instead of DRIVE. The transcript, fileids, and logs/alignment of the decoding of those 14 utterances are also included. What is kind of strange: 4 out of these 14 were recognized correctly now, i.e. in a decoding run with only these 14 utterances. Perhaps it's about varying CMN values (I decode with livepretend)?
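For illustration, here is a toy sketch of why live-mode CMN can make a result depend on which utterances were decoded before it. This is not Sphinx's actual CMN code; the running-mean update rule and the values are just assumptions for the sketch.

# Toy illustration (not Sphinx code): with a running cepstral mean, the same
# utterance is normalized differently depending on what was decoded before it.
import numpy as np

def live_cmn(utterances, alpha=0.9):
    # utterances: list of (frames, coeffs) cepstra; the mean is updated after each one
    mean = None
    out = []
    for cep in utterances:
        if mean is None:
            mean = cep.mean(axis=0)          # first utterance seeds the estimate
        out.append(cep - mean)               # normalize with the current estimate
        mean = alpha * mean + (1 - alpha) * cep.mean(axis=0)  # update for the next one
    return out

rng = np.random.default_rng(0)
utt = rng.normal(5.0, 1.0, size=(200, 13))
alone = live_cmn([utt])[0]                                        # decoded on its own
in_context = live_cmn([rng.normal(0.0, 1.0, (200, 13)), utt])[1]  # decoded after another utterance
print(abs(alone - in_context).max())                              # clearly non-zero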
regards
Masrur D.
Hmm. On the actual WSJ test set these models do significantly better than Keith's, but I should go back and recheck that.
Unfortunately, I can think of one very good reason why this is happening.
Keith's models were trained using the full CMU dictionary (about 100k words), while my models were just trained using the subset of the dictionary present in the training corpus (27046 words). So it's likely that there are a lot of missing triphones, particularly when you use these models on a domain other than WSJ.
This is kind of a dumb mistake on my part. Actually, we have a new version of the CMU dictionary now too, so I ought to go back and retrain everything with that anyway.
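If you want to check how big that gap is, something like the following counts the word-internal triphones a dictionary can generate. This is a rough helper of my own, not SphinxTrain code; the dictionary filenames are placeholders, and cross-word triphones are ignored.

# Rough sketch: count word-internal triphones derivable from a pronunciation
# dictionary with lines of the form "WORD PH1 PH2 ...".
def triphones_in_dict(path):
    seen = set()
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 4:              # need at least 3 phones for a triphone
                continue
            phones = parts[1:]
            for i in range(1, len(phones) - 1):
                seen.add((phones[i - 1], phones[i], phones[i + 1]))  # (left, base, right)
    return seen

subset = triphones_in_dict("train_subset.dic")      # e.g. the 27046-word training dictionary
full = triphones_in_dict("cmudict.dic")             # e.g. the full ~100k-word CMU dictionary
print(len(subset), len(full), len(full - subset))   # triphones the subset never covers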
Hmm, I don't understand what difference it makes in the training process whether or not you assume the bigger 100k dictionary, if you only ever see a subset of the words and thus a subset of the triphones. Is it about triphone tying?
However, I compared the sclite alignments for the results of your model and Keith's, hoping to spot something conspicuous about the misrecognitions, but I can't draw a judgement from them. I uploaded the two alignments [1]; perhaps they give you some hints if you have a look at them. You should know that my test corpus consists of about 730 sentences, all generated from a not very big grammar [2]. So among these sentences there are many with similar or identical content, which could be why a flaw in the acoustic model has a bigger influence on the error rate.
By the way, David, I asked some time ago about lattice and N-best generation for FSG decoding. I read in your wiki about your MMIE project and your need for lattices there. Can one hope that this project will contribute FSG lattice/N-best generation, and perhaps even posterior probabilities, to Sphinx3? If yes, do you have any rough idea of how long it might take?
thanks again
regards
Masrur D.
[1] http://www-users.rwth-aachen.de/Masrur.Doostdar/alignment_davids_wsj-mllr_model
http://www-users.rwth-aachen.de/Masrur.Doostdar/alignment_keith_wsj_model
[2] http://www-users.rwth-aachen.de/Masrur.Doostdar/nav-withoutstop.fsg
Yeah, it has to do with triphone tying. The set of triphones in the final mdef file is actually determined by the dictionary used in training, not by the training data. Because state tying is done with decision trees, it's possible to generalize even to triphones that were never seen in training.
This is actually a weakness of Sphinx versus HTK - if a particular triphone does not occur in the mdef file then Sphinx will just use the context-independent phone, whereas HTK is able to back off to a more general model (i.e. one which at least shares the same left or right context).
We could actually work around this in the decoder fairly easily if we knew which partial triphone was the best one to back off to.
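Roughly, such a workaround might look like this. This is an illustration only; the dict-of-models layout is an assumption, not the real mdef lookup API.

# Illustration of decoder-side backoff (assumed data layout, not real Sphinx code):
# try the full triphone, then a partial context, then the CI phone.
def lookup_with_backoff(models, base, left, right):
    # models maps (base, left, right) -> model id; partial contexts and CI
    # phones are assumed to be stored with None in the unused slots.
    for key in ((base, left, right),   # exact triphone
                (base, left, None),    # shares only the left context
                (base, None, right),   # shares only the right context
                (base, None, None)):   # context-independent fallback
        if key in models:
            return models[key]
    raise KeyError("no model at all for phone %r" % base)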
Please try this new model definition file and tell me if you get any differences:
http://www.cs.cmu.edu/~dhuggins/Projects/wsj_all_cd30.mllt.4000.mdef
One very interesting thing about the alignment files you posted: it appears that almost all of the additional errors from my models are due to the word "GO" being recognized in place of something else. This might indicate that likelihood values are getting screwed up internally somehow. Can you put the decoding logs up somewhere?
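If it helps, a quick way to confirm that pattern without sclite is to tally substitutions directly from reference/hypothesis pairs. This is just a throwaway helper of mine; it only counts equal-length replace blocks.

# Tally (reference word, hypothesis word) substitution pairs with difflib.
import difflib
from collections import Counter

def substitution_counts(pairs):
    # pairs: iterable of (reference sentence, hypothesis sentence) strings
    subs = Counter()
    for ref, hyp in pairs:
        r, h = ref.split(), hyp.split()
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, r, h).get_opcodes():
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                subs.update(zip(r[i1:i2], h[j1:j2]))
    return subs

print(substitution_counts([("DRIVE TO THE DOOR", "GO TO THE DOOR")]).most_common(5))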
I'm unsure about the timeline for FSG N-best and posterior probabilities, but it's likely to happen soon, as it will be an integral part of my thesis proposal.