Adapted acoustic model leads to worse recognition

2014-05-21 to 2014-05-27
  • Jeff Acquaviva

    Jeff Acquaviva - 2014-05-21

    Is there a reason why an acoustic model adapted to a specific speaker would do worse on a given test set than the unadapted model for that same speaker?

    I created adaptations of the EN-US acoustic model with 5, 10, 15, and 20 minutes of speech from President G.W. Bush's 2002 State of the Union address. I then ran sphinx4 with each of these adapted AMs on Bush's 2007 State of the Union. Here were my word error rate (WER) results:
    Unadapted: 49.290%
    5 minutes: 49.290%
    10 minutes: untested
    15 minutes: 56.781%
    20 minutes: 53.724%

    Do you have any suggestions as to what might cause this decrease in performance?
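    For context on how percentages like these are computed: WER is the Levenshtein (edit) distance between the reference and hypothesis word sequences, divided by the reference length. A minimal Python sketch of that metric (not the sphinx4 aligner itself, just the same arithmetic):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i          # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the state of the union", "the state of union"))  # → 0.2
```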

    Also, I really appreciate how much time you have given to help me with my project.

     
  • Nickolay V. Shmyrev

    There is something wrong with your setup. You can download the Bush 2002 test set you shared before here:

    https://dl.dropboxusercontent.com/u/26073448/bush2002.tar.gz

    My results are way more reasonable than your 49%. Also take into account that the default en-us.lm.dmp vocabulary is pretty small for this task; it must be extended with the required words.

     [java]    Accuracy: 75,448%    Errors: 187  (Sub: 141  Ins: 9  Del: 37)
     [java]    Words: 725   Matches: 547    WER: 25,793%
     [java]    Sentences: 41   Matches: 3   SentenceAcc: 7,317%
     [java]    Total Time Audio: 315,83s  Proc: 1379,74s  Speed: 4,37 X real time
     [java]    Mem  Total: 1102,94 Mb  Free: 624,71 Mb
     [java]    Used: This: 478,22 Mb  Avg: 645,86 Mb  Max: 1061,47 Mb
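    One quick way to act on the vocabulary point above is to list which reference words the decoder's dictionary does not cover. A small Python sketch of that check — in a real run you would load the actual transcript and the dictionary file shipped with the model (file names omitted here, only the set logic is shown):

```python
def find_oov(transcript_words, dictionary_words):
    """Return transcript words that the pronunciation dictionary does not cover."""
    vocab = set(w.lower() for w in dictionary_words)
    return sorted({w.lower() for w in transcript_words} - vocab)

# Toy illustration with made-up word lists:
dictionary = ["the", "state", "of", "union"]
transcript = "the state of the union terrorism axis".split()
print(find_oov(transcript, dictionary))  # → ['axis', 'terrorism']
```

    Every word the sketch reports would need a pronunciation added to the dictionary and probability mass in the language model before the decoder can ever output it.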
    
     
  • Jeff Acquaviva

    Jeff Acquaviva - 2014-05-22

    I'm sorry, I'm confused. Did you test on the bush2002 data? I didn't think it was correct to test on the training data. I was adapting the AM with the bush2002 data and using it on another test set, the bush2007 data. For reference, here is the bush2007 data: https://www.dropbox.com/s/as4pb7l1qyy6lwp/bush2007.tar.gz

    Also, are there any tutorials for adapting the language model to include out-of-vocabulary words?

    I see that the HUB 4 language model is about 3x the size of the generic English one. Would it be better to use that?

     

    Last edit: Jeff Acquaviva 2014-05-22
    • Nickolay V. Shmyrev

      Well, I ran it through bush2007; the results are:

       [java]    Accuracy: 62,175%    Errors: 1804  (Sub: 1167  Ins: 72  Del: 565)
       [java]    Words: 4579   Matches: 2847    WER: 39,397%
       [java]    Sentences: 248   Matches: 7   SentenceAcc: 2,823%
       [java]    Total Time Audio: 1552,89s  Proc: 8513,45s  Speed: 5,48 X real time
       [java]    Mem  Total: 1235,44 Mb  Free: 480,01 Mb
       [java]    Used: This: 755,43 Mb  Avg: 718,46 Mb  Max: 1179,77 Mb
      

      Not good indeed. I think the main issue here is strong reverberation in the audio, which affects accuracy significantly. For that reason I don't think adaptation will help a lot here; it's more a matter of proper dereverberation.

       
      • Jeff Acquaviva

        Jeff Acquaviva - 2014-05-27

        If I created an adapted acoustic model from audio with reverberation similar to that in the 2007 data, would that help?
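        If that route were explored, a common way to fabricate channel-matched adaptation data is to convolve clean speech with a room impulse response resembling the test conditions. A hedged numpy sketch with a purely synthetic exponential-decay impulse response — a real experiment would use an impulse response measured or estimated from the 2007 recordings:

```python
import numpy as np

def add_reverb(signal, impulse_response):
    """Convolve clean speech with an impulse response, keeping length and peak level."""
    wet = np.convolve(signal, impulse_response)[: len(signal)]
    # Renormalize so the reverberant copy peaks at the same level as the input.
    return wet / np.max(np.abs(wet)) * np.max(np.abs(signal))

sample_rate = 16000
t = np.arange(0, 0.3, 1.0 / sample_rate)
ir = np.exp(-t / 0.05)   # toy 300 ms exponentially decaying tail
ir[0] = 1.0              # direct-path component

# Stand-in for one second of clean speech (a 440 Hz tone).
clean = np.sin(2 * np.pi * 440 * np.arange(0, 1.0, 1.0 / sample_rate))
reverberant = add_reverb(clean, ir)
```

        The reverberant copies would then be fed to the usual adaptation procedure in place of the clean recordings.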

         
  • Nickolay V. Shmyrev

    > I was adapting the AM with the bush2002 data and using it on another test set, the bush2007 data. For reference here is the bush 2007 data: https://www.dropbox.com/s/as4pb7l1qyy6lwp/bush2007.tar.gz

    It doesn't matter which data you use. The unadapted WER should be in the range of 20-30%, and the adapted WER should be about 15%.

    > Also, are there any tutorials for adapting the language model to include out of vocabulary words?

    You just build the language model itself and choose the vocabulary you need.
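    To make "build the language model and choose the vocabulary" concrete, here is a deliberately tiny sketch that writes a unigram ARPA-format model over a chosen vocabulary. Real models would be trigram, include sentence markers and backoff weights, and be built with tools such as cmuclmtk or SRILM; this only illustrates that vocabulary selection happens at LM-building time:

```python
import math
from collections import Counter

def unigram_arpa(corpus_tokens, vocabulary):
    """Write a minimal unigram ARPA model over exactly the chosen vocabulary."""
    counts = Counter(w for w in corpus_tokens if w in vocabulary)
    total = sum(counts.values())
    lines = ["\\data\\", f"ngram 1={len(counts)}", "", "\\1-grams:"]
    for word, c in sorted(counts.items()):
        # ARPA stores log10 probabilities; here plain maximum likelihood.
        lines.append(f"{math.log10(c / total):.4f} {word}")
    lines += ["", "\\end\\"]
    return "\n".join(lines)

tokens = "the state of the union the union".split()
print(unigram_arpa(tokens, {"the", "union", "state", "of"}))
```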

    > I see that the HUB 4 language model is about 3X the size of the generic english. Would it be better to use that one?

    Yes, that sounds more reasonable for this kind of task. You can also try the full en-us generic language model with a 70k vocabulary, available in our tracker at http://cmusphinx.info:

    http://cmusphinx.info/file?info_hash=X%C0s%CB%FF%E3%84%3D%03oA%8By%CB%0C%3F%5B%27t%A1

     
  • Jeff Acquaviva

    Jeff Acquaviva - 2014-05-27

    Thanks again for your help. I'm going to try the HUB 4 LM to see if that helps.

    > You just build the language model itself and choose the vocabulary you need

    More for my personal understanding: is there no way to adapt a language model, the way there is for an acoustic model?

    > The unadapted WER must be in range of 20-30%, adapted WER must be about 15%

    Does this mean that for an adapted acoustic model to be helpful, the original recognition error rate should be between 20-30% without adaptation?

    So, for example: if I had a task with a WER of 45%, would acoustic model adaptation not improve the error rate? Why is that?

     

    Last edit: Jeff Acquaviva 2014-05-27
