
performance of new WSJ-models

Created 2008-04-08, last updated 2012-09-22
  • Masrur Doostdar

    Masrur Doostdar - 2008-04-08

    Hi,

    I tested the new WSJ acoustic models CMU published, expecting them to outperform the existing WSJ models trained by Keith Vertanen [1] because of the additional MLLT transformation. But on my task I get evaluation results about two times worse than with [1] (new model: WER 8.6%, SER 27.6%; [1]: WER 4.2%, SER 14.0%). This is a bit confusing; perhaps not the whole WSJ corpus was used to train the new models? I'll give the details of my configuration, so perhaps David or Nickolay can tell me whether these results are plausible or whether I'm doing something wrong:

    [1] model used for comparison:
    WSJ all; 3 states, no skips; 8000 senones; 16 Gaussians

    config used for decoding (with new models):

    -hmm wsj_all_cd30.mllt_cd_cont_4000
    -lda wsj_all_cd30.mllt_cd_cont_4000/feature_transform
    -fdict wsj_all_cd30.mllt_cd_cont_4000/noisedict
    -lw 15
    -feat 1s_c_d_dd
    -dict navigate-go7.dic
    -fsg nav.fsg
    -wip 0.2
    -beam 1e-120
    -pbeam 1e-120
    -wbeam 1e-100
    -varnorm no
    -cmn current
    -hyp result
    -op_mode 2
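
    For reference, a quick sketch of how the WER/SER figures above are computed (sclite does this properly; this is just the idea: word error rate as edit distance over reference words, sentence error rate as the fraction of sentences with at least one error):

        # Minimal sketch of WER/SER computation (illustration only).
        def edit_distance(ref, hyp):
            """Word-level Levenshtein distance between two token lists."""
            d = list(range(len(hyp) + 1))
            for i, r in enumerate(ref, 1):
                prev, d[0] = d[0], i
                for j, h in enumerate(hyp, 1):
                    cur = min(d[j] + 1,         # deletion
                              d[j - 1] + 1,     # insertion
                              prev + (r != h))  # substitution or match
                    prev, d[j] = d[j], cur
            return d[-1]

        def wer_ser(pairs):
            """pairs: list of (reference, hypothesis) transcript strings."""
            errors = words = bad = 0
            for ref, hyp in pairs:
                e = edit_distance(ref.split(), hyp.split())
                errors += e
                words += len(ref.split())
                bad += (e > 0)
            return 100.0 * errors / words, 100.0 * bad / len(pairs)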

    thanks and regards
    Masrur D.

     
    • David Huggins-Daines

      Looking at your FSG it does seem unlikely that the mdef file will make a difference.

      I wonder if there is an integer overflow in HMM evaluation or something... Unfortunately I don't have time to debug this fully. But that would be something to look for.
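
      If anyone wants to poke at the overflow theory: acoustic scores are kept as scaled integer log-probabilities, so two very negative path scores can wrap around to a large positive one. A toy illustration of the failure mode (the numbers and the wrap emulation are invented; this is not sphinx3's actual score code):

          # Toy illustration of int32 wrap-around in log-score arithmetic.
          # Values are invented; sphinx3's real score handling differs.
          INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

          def wrapping_add(a, b):
              """Naive 32-bit add: emulates C int overflow in Python."""
              return (a + b + 2**31) % 2**32 - 2**31

          def saturating_add(a, b):
              """Clamp instead of wrapping, so bad paths stay bad."""
              return max(min(a + b, INT32_MAX), INT32_MIN)

          score = -2_000_000_000                 # a very unlikely HMM path
          print(wrapping_add(score, score))      # wraps to a POSITIVE score
          print(saturating_add(score, score))    # pinned at INT32_MIN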

       
      • Masrur Doostdar

        Masrur Doostdar - 2008-04-08

        You're right, there is only a very minor difference in the error rate (0.3%) with the other mdef.

        http://www-users.rwth-aachen.de/Masrur.Doostdar/output_fsg_5.1.6
        Here is a log of my decoding run with the 723 utterances.

        http://www-users.rwth-aachen.de/Masrur.Doostdar/selecion.tar
        Here you find 14 raw files of utterances where, with your model, GO was recognized instead of DRIVE. The transcript, fileids, and logs/alignments of the decoding on those 14 utterances are also included. What is kind of strange: 4 out of these 14 were recognized correctly now, i.e. in a decoding run with only these 14 utterances. Perhaps it's about varying CMN values (I decode with livepretend)?
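
        To make the CMN suspicion concrete: with live-mode normalization, the cepstral mean carried into an utterance depends on everything decoded before it, so the same file can score differently in different runs. A minimal sketch of a running-mean CMN (the decay constant and update rule are illustrative choices, not sphinx's actual estimator):

            import numpy as np

            # Minimal sketch of running ("live") cepstral mean normalization.
            class LiveCMN:
                def __init__(self, dim=13, decay=0.999):
                    self.mean = np.zeros(dim)
                    self.decay = decay

                def normalize(self, frames):
                    """Subtract the running mean, updating it per frame."""
                    out = np.empty_like(frames)
                    for i, f in enumerate(frames):
                        self.mean = self.decay * self.mean + (1 - self.decay) * f
                        out[i] = f - self.mean
                    return out

            cmn = LiveCMN()
            utt = np.random.randn(100, 13) + 5.0   # features with a DC offset
            a = cmn.normalize(utt).mean()
            b = cmn.normalize(utt).mean()          # same audio, different history
            print(a, b)                            # the two passes differ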

        regards
        Masrur D.

         
    • David Huggins-Daines

      Hmm. On the actual WSJ test set these models do significantly better than Keith's, but I should go back and recheck that.

      Unfortunately, I can think of one very good reason why this is happening.

      Keith's models were trained using the full CMU dictionary (about 100k words), while my models were just trained using the subset of the dictionary present in the training corpus (27046 words). So it's likely that there are a lot of missing triphones, particularly when you use these models on a domain other than WSJ.
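
      A quick way to see the size of the gap is to enumerate the within-word triphones each dictionary can produce and diff the sets (cross-word contexts, which matter even more in decoding, are ignored here). A rough sketch with hypothetical file names:

          # Rough sketch: within-word triphone coverage of a dictionary.
          # File names are hypothetical; cross-word triphones are ignored.
          def triphones(path):
              seen = set()
              with open(path) as f:
                  for line in f:
                      if line.startswith(';;'):      # skip comment lines
                          continue
                      parts = line.split()
                      if len(parts) < 4:             # need >= 3 phones
                          continue
                      ph = parts[1:]
                      seen.update(zip(ph, ph[1:], ph[2:]))
              return seen

          small = triphones("wsj_train_27k.dic")     # hypothetical names
          full = triphones("cmudict_100k.dic")
          print(len(full - small), "triphones only the full dict covers")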

      This is kind of a dumb mistake on my part. Actually, we have a new version of the CMU dictionary now too, so I ought to go back and retrain everything with that anyway.

       
      • Masrur Doostdar

        Masrur Doostdar - 2008-04-08

        Hmm, I don't understand what difference it makes in the training process whether or not you assume the bigger 100k dictionary, if you only ever see a subset of the words and thus a subset of the triphones. Is it about triphone tying?

        Anyway, I compared the sclite alignments for the results of your model and Keith's, to see whether anything about the misrecognitions stands out, but I can't draw a judgement from them. I uploaded the two alignments [1]; perhaps they can give you some hints if you have a look. You should know that my test corpus consists of about 730 sentences, all generated from a not very big grammar [2]. So among these 730 there are many sentences with similar or equal content, which could be why a flaw in the acoustic model shows a bigger influence on the error rate.

        By the way, David, I asked some time ago about lattice and N-best generation for FSG decoding. I read in your wiki about your MMIE project and your need for lattices there. Can one hope that this project will contribute FSG lattice/N-best generation, and perhaps even posterior probabilities, to sphinx3? If yes, do you have any rough idea of how long it may take?

        thanks again
        regards
        Masrur D.

        [1] http://www-users.rwth-aachen.de/Masrur.Doostdar/alignment_davids_wsj-mllr_model
        http://www-users.rwth-aachen.de/Masrur.Doostdar/alignment_keith_wsj_model

        [2] http://www-users.rwth-aachen.de/Masrur.Doostdar/nav-withoutstop.fsg

         
        • David Huggins-Daines

          Yeah, it has to do with triphone tying. The set of triphones in the final mdef file is actually determined by the dictionary used in training, not by the data itself. Because state tying is done with decision trees, it's possible to generalize even to triphones that were never seen in training.

          This is actually a weakness of Sphinx versus HTK - if a particular triphone does not occur in the mdef file then Sphinx will just use the context-independent phone, whereas HTK is able to back off to a more general model (i.e. one which at least shares the same left or right context).

          We could actually work around this in the decoder fairly easily if we knew which partial triphone was the best one to back off to.
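
          Something like the following lookup order would express the idea (a sketch, not sphinx3 code; it assumes a table that also has partial-context entries, which is exactly the part the current mdef lacks):

              # Sketch of triphone backoff (not actual sphinx3 code).
              # models maps (base, left, right) -> model id; None means
              # "any context".  Partial-context entries are assumed to
              # exist, which is the hard part with the real mdef.
              def lookup_with_backoff(models, base, left, right):
                  for key in ((base, left, right),   # full triphone
                              (base, left, None),    # share left context
                              (base, None, right),   # share right context
                              (base, None, None)):   # CI phone fallback
                      if key in models:
                          return models[key]
                  raise KeyError(base)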

           
    • David Huggins-Daines

      Please try this new model definition file and tell me if you get any differences:

      http://www.cs.cmu.edu/~dhuggins/Projects/wsj_all_cd30.mllt.4000.mdef

       
    • David Huggins-Daines

      One very interesting thing about the alignment files you posted. It appears that almost all of the additional errors from my models are due to the word "GO" being recognized in place of something else. This might indicate that likelihood values are getting screwed up internally somehow. Can you put the decoding logs up somewhere?

      I'm unsure of any timeline for FSG N-best and posterior probabilities, but it's likely to happen soon, as it will be an integral part of my thesis proposal.
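
      For what it's worth, once you have N-best output, rough word posteriors are just normalized exponentiated path scores; a minimal sketch (assuming the scores are comparable log-likelihoods for the same utterance):

          import math
          from collections import defaultdict

          # Minimal sketch: word posterior estimates from an N-best list.
          # nbest: list of (log_score, "word sequence") for one utterance.
          def word_posteriors(nbest):
              m = max(s for s, _ in nbest)
              w = [math.exp(s - m) for s, _ in nbest]   # avoid underflow
              z = sum(w)
              post = defaultdict(float)
              for weight, (_, hyp) in zip(w, nbest):
                  for word in set(hyp.split()):
                      post[word] += weight / z
              return dict(post)

          print(word_posteriors([(-10.0, "drive to the door"),
                                 (-12.0, "go to the door")]))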

       
