CMU Sphinx / Forums / Speech Recognition Theory: Acoustic model reduction

Gabriel Marques - 2013-01-16

Hello folks,

I'm trying to run an ASR system on a memory-constrained device, in my native language.
One of the problems I'm facing is the size of the models used by the system, which need a lot more RAM than is available to be loaded.

As I'm using the ASR for just a small set of commands, a simple grammar let me get rid of the language model.

The audio model available, though, is one built for large vocabulary user-independent recognition. And it's big.
Even considering that the AM describes phone pronunciations, not related to a set of words, is it possible to reduce the AM by narrowing it to a set of words?
I may be confusing concepts, but maybe some phonemes are never spoken in a small set of words.

Another approach is reducing the AM by compromising the ASR error rate - it may sound like the opposite of everyone's work, but speed and memory gains may justify this.
The question is, how can that be accomplished on an HTK ASCII AM that is already 'compiled'?

Nickolay V. Shmyrev - 2013-01-16

Even considering that the AM describes phone pronunciations, not related to a set of words, is it possible to reduce the AM by narrowing it to a set of words?

Yes, with a custom tool you can remove the senones and triphones that you will never see.
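
As a rough illustration of "triphones you will never see", the sketch below enumerates the triphones a small command vocabulary can produce, using HTK's `left-phone+right` naming. The lexicon, the phone symbols and the function names are hypothetical, and it assumes any word may follow any other word or silence, which over-approximates what a real grammar allows.

```python
from itertools import product

# Hypothetical pronunciation dictionary; phone symbols are illustrative only.
LEXICON = {
    "hello": ["h", "e", "l", "o"],
    "stop": ["s", "t", "o", "p"],
}


def triphones(phones, left="sil", right="sil"):
    """Return HTK-style names (l-p+r) for one word with the given boundary contexts."""
    padded = [left] + list(phones) + [right]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def needed_triphones(lexicon, silence="sil"):
    """Collect every triphone that can occur if any word may follow any other word."""
    # The left context of a word is the last phone of the previous word (or silence);
    # the right context is the first phone of the next word (or silence).
    lefts = {p[-1] for p in lexicon.values()} | {silence}
    rights = {p[0] for p in lexicon.values()} | {silence}
    needed = set()
    for phones in lexicon.values():
        for left, right in product(lefts, rights):
            needed.update(triphones(phones, left, right))
    return needed


if __name__ == "__main__":
    for name in sorted(needed_triphones(LEXICON)):
        print(name)
```

The resulting set is the list of models to keep. Silence and short-pause models are not generated here and would normally be kept as well.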

The question is, how can that be accomplished on an HTK ASCII AM that is already 'compiled'?

You need to be more precise about the type of model you have in order to get more definite advice. What does 'compiled' mean for an HTK model? Is it converted to binary? If so, you can convert it back easily.

- Gabriel Marques - 2013-01-17
  
  Yes, you can remove senones and triphones which you will never see with a custom tool
  
  Looking at the ASCII MMF file, it doesn't seem to be complex, just large.
  "...Triphone model is a phonetic model which takes into consideration both the left and the right neighbouring phonemes..."
  So, given a word 'hello' there's a unique triphone representation of it? (sil-H-E / H-E-L / E-L-L..)
  
  Then I can find all the triphones that can be generated by the grammar I've defined, study the HTK file format, find them in the file along with their associated data (senones), and keep the ones I need.
  
  If I got it right, the MMF file contains a big set of triphones, and each one is composed of HMM transition data and the audio representation as senones. Right?
  
  If I'm too far from the point RTFM is an acceptable answer :D
  
  You need to be more precise about type of the model you have to get more definite advice. What is 'compiled' in HTK model, is it converted to binary? You can convert it back easily.
  
  I meant that I didn't create the AM, I just downloaded it from a public repository. So I don't have the audio sources to re-train it, as these are proprietary and not provided by the university's repo.
  
  - Nickolay V. Shmyrev - 2013-01-17
    
    So, given a word 'hello' there's a unique triphone representation of it? (sil-H-E / H-E-L / E-L-L..)
    
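    Triphones also take neighbouring words into account, so the triphones at the start and end of a word depend on what is spoken before and after it. In HTK, triphones are named with - for the left context and + for the right context: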
    
    sil-H+E
    
    If I got it right, the MMF file contains a big set of triphones, and each one is composed by HMM transition data and the audio representation as senones. Right?
    
    Yes
    
Nickolay V. Shmyrev - 2013-01-16

And of course there are other methods to reduce the size of the acoustic model without reducing the accuracy. One of them is quantization: the size of a CMUSphinx model can be significantly reduced by quantizing the mixture weights to 4 bits.
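
To give a feel for what 4-bit quantization of mixture weights means, here is a minimal sketch. It is not the actual CMUSphinx implementation or file format: the uniform grid in the log domain, the weight floor and all function names are assumptions made for illustration.

```python
import math


def quantize_4bit(weights, floor=1e-7):
    """Map mixture weights to 4-bit codes on a uniform grid in the log domain."""
    logs = [math.log(max(w, floor)) for w in weights]
    lo, hi = min(logs), max(logs)
    step = (hi - lo) / 15  # 16 levels: codes 0..15
    if step == 0.0:
        step = 1.0  # all weights equal; any step works
    codes = [min(15, max(0, round((x - lo) / step))) for x in logs]
    return codes, lo, step


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte, low nibble first."""
    padded = list(codes) + [0] * (len(codes) % 2)
    return bytes(padded[i] | (padded[i + 1] << 4) for i in range(0, len(padded), 2))


def unpack_nibbles(data, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in data:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes[:count]


def dequantize(codes, lo, step):
    """Turn 4-bit codes back into approximate mixture weights."""
    return [math.exp(lo + c * step) for c in codes]
```

Compared with storing each weight as a 32-bit float, two weights per byte is an 8x reduction for that part of the model. The means and variances are untouched in this sketch.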

- Gabriel Marques - 2013-01-17
  
  Is that done in the ASCII to binary MMF conversion? (I ask so I can study the proper tool.)
  
  And thanks for both replies Nickolay.
  
  Last edit: Gabriel Marques 2013-01-17
  
  - Nickolay V. Shmyrev - 2013-01-17
    
    This is done with CMUSphinx models only. HTK doesn't support this feature.
    
The Grand Janitor - 2013-01-16

The audio model available, though, is one built for large vocabulary user-independent recognition.

This sentence confused me a bit. The acoustic model and the language model are two separate models, trained from resources that are related but not always the same.

Guessing from your question, it sounds like what you actually want is to build a smaller grammar that contains just a small set of commands. That will reduce the size of the LM. Sphinx 4's tutorial

Another approach is reducing the AM by compromising the ASR error rate.

You can do that too. One way to go is to train a smaller AM.

(Also just saw Nick's reply. You probably should clarify too.)

Arthur
