Adapted acoustic model leads to worse recognition

2014-05-21 to 2014-05-27
  • Jeff Acquaviva

    Jeff Acquaviva - 2014-05-21

    Is there a reason why an acoustic model adapted to a specific speaker would do worse on a given test set than the unadapted model for that same speaker?

    I created adaptations of the EN-US acoustic model with 5, 10, 15, and 20 minutes of speech from President G.W. Bush's 2002 State of the Union address. I then ran sphinx4 with each of these adapted AMs on Bush's 2007 State of the Union. Here were my word error rate (WER) results:
    Unadapted: 49.290%
    5 minutes: 49.290%
    10 minutes: untested
    15 minutes: 56.781%
    20 minutes: 53.724%

    Do you have any suggestions as to what might cause this decrease in performance?
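    For context on how percentages like these are computed: WER is the Levenshtein (edit) distance between the reference and hypothesis word sequences, divided by the reference length. A minimal Python sketch of that metric (not the sphinx4 aligner itself, just the same arithmetic):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the state of the union", "the state of union"))  # → 0.2
```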

    Also, I really appreciate how much time you have given to help me with my project.

     
  • Nickolay V. Shmyrev

    There is something wrong with your setup. You can download the Bush 2002 test set you shared before here:

    https://dl.dropboxusercontent.com/u/26073448/bush2002.tar.gz

    My results are way more reasonable than your 49%. Also take into account that the default en-us.lm.dmp vocabulary is pretty small for this task; it must be extended with the required words.

     [java]    Accuracy: 75,448%    Errors: 187  (Sub: 141  Ins: 9  Del: 37)
     [java]    Words: 725   Matches: 547    WER: 25,793%
     [java]    Sentences: 41   Matches: 3   SentenceAcc: 7,317%
     [java]    Total Time Audio: 315,83s  Proc: 1379,74s  Speed: 4,37 X real time
     [java]    Mem  Total: 1102,94 Mb  Free: 624,71 Mb
     [java]    Used: This: 478,22 Mb  Avg: 645,86 Mb  Max: 1061,47 Mb
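    One quick way to act on the vocabulary point above is to list which reference words the decoder's dictionary does not cover. A small Python sketch of that check — in a real run you would load the actual transcript and the dictionary file shipped with the model (file names omitted here, only the set logic is shown):

```python
def find_oov(transcript_words, dictionary_words):
    """Return transcript words that the pronunciation dictionary does not cover."""
    vocab = set(w.lower() for w in dictionary_words)
    return sorted({w.lower() for w in transcript_words} - vocab)

# Toy illustration with made-up word lists:
dictionary = ["the", "state", "of", "union"]
transcript = "the state of the union terrorism axis".split()
print(find_oov(transcript, dictionary))  # → ['axis', 'terrorism']
```

    Every word the sketch reports would need a pronunciation added to the dictionary and probability mass in the language model before the decoder can ever output it.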
    
     
  • Jeff Acquaviva

    Jeff Acquaviva - 2014-05-22

    I'm sorry, I'm confused. Did you test on the bush2002 data? I didn't think it was correct to test on the training data. I was adapting the AM with the bush2002 data and using it on another test set, the bush2007 data. For reference, here is the bush2007 data: https://www.dropbox.com/s/as4pb7l1qyy6lwp/bush2007.tar.gz

    Also, are there any tutorials for adapting the language model to include out-of-vocabulary words?

    I see that the HUB 4 language model is about 3x the size of the generic English one. Would it be better to use that?

     

    Last edit: Jeff Acquaviva 2014-05-22
    • Nickolay V. Shmyrev

      Well, I ran it through bush2007; the results are:

       [java]    Accuracy: 62,175%    Errors: 1804  (Sub: 1167  Ins: 72  Del: 565)
       [java]    Words: 4579   Matches: 2847    WER: 39,397%
       [java]    Sentences: 248   Matches: 7   SentenceAcc: 2,823%
       [java]    Total Time Audio: 1552,89s  Proc: 8513,45s  Speed: 5,48 X real time
       [java]    Mem  Total: 1235,44 Mb  Free: 480,01 Mb
       [java]    Used: This: 755,43 Mb  Avg: 718,46 Mb  Max: 1179,77 Mb
      

      Not good indeed. I think the main issue here is strong reverberation in the audio, which affects accuracy significantly. For that reason I don't think adaptation will help a lot here; it's more a matter of proper dereverberation.

       
      • Jeff Acquaviva

        Jeff Acquaviva - 2014-05-27

        If I created an adapted acoustic model from audio with reverberation similar to that in the 2007 data, would that help?
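        If that route were explored, a common way to fabricate channel-matched adaptation data is to convolve clean speech with a room impulse response resembling the test conditions. A hedged numpy sketch with a purely synthetic exponential-decay impulse response — a real experiment would use an impulse response measured or estimated from the 2007 recordings:

```python
import numpy as np

def add_reverb(signal, impulse_response):
    """Convolve clean speech with an impulse response, keeping length and peak level."""
    wet = np.convolve(signal, impulse_response)[: len(signal)]
    # Renormalize so the reverberant copy peaks at the same level as the input.
    return wet / np.max(np.abs(wet)) * np.max(np.abs(signal))

sample_rate = 16000
t = np.arange(0, 0.3, 1.0 / sample_rate)
ir = np.exp(-t / 0.05)   # toy 300 ms exponentially decaying tail
ir[0] = 1.0              # direct-path component

# Stand-in for one second of clean speech (a 440 Hz tone).
clean = np.sin(2 * np.pi * 440 * np.arange(0, 1.0, 1.0 / sample_rate))
reverberant = add_reverb(clean, ir)
```

        The reverberant copies would then be fed to the usual adaptation procedure in place of the clean recordings.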

         
  • Nickolay V. Shmyrev

    > I was adapting the AM with the bush2002 data and using it on another test set, the bush2007 data. For reference here is the bush 2007 data: https://www.dropbox.com/s/as4pb7l1qyy6lwp/bush2007.tar.gz

    It doesn't matter which data you use. The unadapted WER should be in the range of 20-30%, and the adapted WER should be about 15%.

    > Also, are there any tutorials for adapting the language model to include out of vocabulary words?

    You just build the language model itself and choose the vocabulary you need.
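    To make "build the language model and choose the vocabulary" concrete, here is a deliberately tiny sketch that writes a unigram ARPA-format model over a chosen vocabulary. Real models would be trigram, include sentence markers and backoff weights, and be built with tools such as cmuclmtk or SRILM; this only illustrates that vocabulary selection happens at LM-building time:

```python
import math
from collections import Counter

def unigram_arpa(corpus_tokens, vocabulary):
    """Write a minimal unigram ARPA model over exactly the chosen vocabulary."""
    counts = Counter(w for w in corpus_tokens if w in vocabulary)
    total = sum(counts.values())
    lines = ["\\data\\", f"ngram 1={len(counts)}", "", "\\1-grams:"]
    for word, c in sorted(counts.items()):
        # ARPA stores log10 probabilities; here plain maximum likelihood.
        lines.append(f"{math.log10(c / total):.4f} {word}")
    lines += ["", "\\end\\"]
    return "\n".join(lines)

tokens = "the state of the union the union".split()
print(unigram_arpa(tokens, {"the", "union", "state", "of"}))
```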

    > I see that the HUB 4 language model is about 3X the size of the generic english. Would it be better to use that one?

    Yes, that sounds more reasonable for this kind of task. You can also try the full en-us generic language model with a 70k vocabulary, available in our tracker at http://cmusphinx.info:

    http://cmusphinx.info/file?info_hash=X%C0s%CB%FF%E3%84%3D%03oA%8By%CB%0C%3F%5B%27t%A1

     
  • Jeff Acquaviva

    Jeff Acquaviva - 2014-05-27

    Thanks again for your help. I'm going to try the HUB 4 LM to see if that helps.

    > You just build the language model itself and choose the vocabulary you need

    More for my personal understanding: is there no way to adapt a language model, the way there is for an acoustic model?

    > The unadapted WER must be in range of 20-30%, adapted WER must be about 15%

    Does this mean that for an adapted acoustic model to be helpful, the original recognition error rate should be between 20-30% without adaptation?

    So, for example: if I had a task with a WER of 45%, would acoustic model adaptation not improve the error rate? Why is that?

     

    Last edit: Jeff Acquaviva 2014-05-27
