Acoustic model reduction

2013-01-16
2013-01-17
  • Gabriel Marques

    Gabriel Marques - 2013-01-16

    Hello folks,

    I'm trying to run ASR on a memory-constrained device, in my native language.
    One of the problems I'm facing is the size of the models used by the system, which need far more RAM than is available to load.

    Since I'm using the ASR for just a small set of commands, a simple grammar let me get rid of the language model.

    The audio model available, though, is one built for large vocabulary user-independent recognition. And it's big.
    Even considering that the AM describes phone pronunciations, not related to a set of words, is it possible to reduce the AM by narrowing it to a set of words?
    I may be confusing concepts, but maybe some phonemes are never spoken in a small set of words.

    Another approach is reducing the AM by compromising the ASR error rate - it may sound the opposite of everyone's work, but speed and memory gains may justify this.
    The question is, how that can be accomplished on a HTK ASCII AM already 'compiled'.

     
  • Nickolay V. Shmyrev

    Even considering that the AM describes phone pronunciations, not related to a set of words, is it possible to reduce the AM by narrowing it to a set of words?

    Yes, you can remove senones and triphones which you will never see with a custom tool

    The question is, how that can be accomplished on a HTK ASCII AM already 'compiled'.

    You need to be more precise about type of the model you have to get more definite advice. What is 'compiled' in HTK model, is it converted to binary? You can convert it back easily.

     
    • Gabriel Marques

      Gabriel Marques - 2013-01-17

      Yes, you can remove senones and triphones which you will never see with a custom tool

      Looking at the ASCII MMF file, it doesn't seem to be complex, just large.
      "...Triphone model is a phonetic model which takes into consideration both the left and the right neighbouring phonemes..."
      So, given a word 'hello' there's a unique triphone representation of it? (sil-H-E / H-E-L / E-L-L..)

      Then I can find all triphones that can be generated by the grammar I've defined, study the HTK file format, find them on the file with associated data (senones) and keep the ones I need.

      If I got it right, the MMF file contains a big set of triphones, and each one is composed by HMM transition data and the audio representation as senones. Right?

      If I'm too far from the point RTFM is an acceptable answer :D

      You need to be more precise about type of the model you have to get more definite advice. What is 'compiled' in HTK model, is it converted to binary? You can convert it back easily.

      I meant that I didn't create the AM, I just downloaded it from a public repository. So I don't have the audio sources to re-train it, as these are proprietary and not provided by the university's repo.

       
      • Nickolay V. Shmyrev

        So, given a word 'hello' there's a unique triphone representation of it? (sil-H-E / H-E-L / E-L-L..)

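        Not quite: triphones also take neighbouring words into account, so the context at word boundaries depends on what comes before and after the word. In HTK, triphones are named with - for the left context and + for the right context: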

        sil-H+E
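
        To make the naming concrete, here is a minimal sketch of how the triphones needed by a small command grammar could be enumerated, including the cross-word ones. The lexicon, the phone symbols and the function names are hypothetical, and it assumes every phrase is surrounded by silence ("sil"); a real setup would read the pronunciations from the dictionary shipped with the model.

```python
def phrase_triphones(phrase, lexicon, boundary="sil"):
    """Return HTK-style triphone names (left-phone+right) for one phrase.

    The phones of all words are concatenated, so triphones at word
    boundaries use the neighbouring word's phones as context.
    """
    phones = [boundary]
    for word in phrase.split():
        phones.extend(lexicon[word])
    phones.append(boundary)
    # Slide a window of three phones over the sequence.
    return [f"{l}-{p}+{r}" for l, p, r in zip(phones, phones[1:], phones[2:])]


def grammar_triphones(phrases, lexicon):
    """Collect the set of triphones needed by every phrase of the grammar."""
    needed = set()
    for phrase in phrases:
        needed.update(phrase_triphones(phrase, lexicon))
    return needed


if __name__ == "__main__":
    # Hypothetical pronunciations, for illustration only.
    lexicon = {"hello": ["H", "E", "L", "O"], "stop": ["S", "T", "O", "P"]}
    for name in sorted(grammar_triphones(["hello", "hello stop"], lexicon)):
        print(name)
```

        Note that "hello stop" needs L-O+S and O-S+T, which never occur when either word is spoken alone. Context-independent models such as silence would have to be kept as well.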

        If I got it right, the MMF file contains a big set of triphones, and each one is composed by HMM transition data and the audio representation as senones. Right?

        Yes
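
        As a rough illustration of the "custom tool" idea, the sketch below filters an ASCII MMF down to a given set of model names. It is not an HTK tool and rests on several assumptions: macro names are quoted, only states (~s) and transition matrices (~t) are shared between models, and any ~ line between <BEGINHMM> and <ENDHMM> is a reference rather than a definition. Models that tie mixtures, means or variances would need more careful parsing, and the model list used by the decoder would have to be updated to match.

```python
import re

# Matches a macro line such as: ~h "sil-H+E"  or a bare: ~o
MACRO = re.compile(r'^~(\w)\s*(?:"([^"]*)")?')


def split_blocks(lines):
    """Split MMF lines into top-level macro blocks, recording ~x references."""
    blocks, in_hmm = [], False
    for line in lines:
        text = line.strip()
        m = MACRO.match(text)
        if m and not in_hmm:
            # A macro line outside an HMM body starts a new definition.
            blocks.append({"kind": m.group(1), "name": m.group(2),
                           "lines": [line], "refs": set()})
        else:
            if not blocks:
                blocks.append({"kind": None, "name": None,
                               "lines": [], "refs": set()})
            blocks[-1]["lines"].append(line)
            if m:  # a macro line inside an HMM body is a reference
                blocks[-1]["refs"].add((m.group(1), m.group(2)))
        upper = text.upper()
        if upper.startswith("<BEGINHMM>"):
            in_hmm = True
        elif upper.startswith("<ENDHMM>"):
            in_hmm = False
    return blocks


def prune_mmf(lines, keep):
    """Keep only the HMMs named in `keep` plus the ~s/~t macros they use."""
    blocks = split_blocks(lines)
    used = set()
    for b in blocks:
        if b["kind"] == "h" and b["name"] in keep:
            used |= b["refs"]
    out = []
    for b in blocks:
        if b["kind"] == "h":
            wanted = b["name"] in keep
        elif b["kind"] in ("s", "t"):
            wanted = (b["kind"], b["name"]) in used
        else:
            wanted = True  # options and anything unrecognised are kept
        if wanted:
            out.extend(b["lines"])
    return out
```

        It would be called with the lines of the MMF and the set of triphone names required by the grammar, and the result written back out as a new, smaller MMF.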

         
  • Nickolay V. Shmyrev

    And of course there are other methods to reduce the size of the acoustic model without reducing the accuracy, one of them is quantization. CMUSphinx model size can be significantly reduced with quantization of mixture weights to 4 bits.
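
    To give an idea of what 4-bit quantization of mixture weights means, here is a generic sketch. It is not the CMUSphinx implementation or file format, just one possible scheme under stated assumptions: weights are stored as negative logs, mapped uniformly onto 16 levels, and packed two codes per byte. All function names are made up for the illustration.

```python
import math


def quantize_4bit(weights):
    """Quantize mixture weights to 4-bit codes, packed two per byte.

    Returns (packed_bytes, scale). The scale is needed to map codes back.
    """
    neglog = [-math.log(max(w, 1e-10)) for w in weights]
    top = max(neglog) or 1.0          # avoid a zero scale if all weights are 1
    scale = top / 15                  # 16 levels: codes 0..15
    codes = [min(15, round(v / scale)) for v in neglog]
    if len(codes) % 2:
        codes.append(0)               # pad to an even count for packing
    packed = bytes((codes[i] << 4) | codes[i + 1]
                   for i in range(0, len(codes), 2))
    return packed, scale


def dequantize_4bit(packed, scale, count):
    """Recover approximate weights from packed 4-bit codes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> 4, byte & 0x0F))
    return [math.exp(-c * scale) for c in codes[:count]]


if __name__ == "__main__":
    weights = [0.5, 0.25, 0.125, 0.125]
    packed, scale = quantize_4bit(weights)
    print(len(packed), "bytes instead of", 4 * len(weights), "as 32-bit floats")
    print(dequantize_4bit(packed, scale, len(weights)))
```

    Each weight takes half a byte instead of four bytes as a 32-bit float, at the cost of a small rounding error in the weights.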

     
    • Gabriel Marques

      Gabriel Marques - 2013-01-17

      Is that done during the ASCII-to-binary MMF conversion? (So I can study the proper tool.)

      And thanks for both replies Nickolay.

       

      Last edit: Gabriel Marques 2013-01-17
      • Nickolay V. Shmyrev

        This is done with CMUSphinx models only; HTK doesn't support this feature.

         
  • The Grand Janitor

    The audio model available, though, is one built for large vocabulary user-independent recognition.

    This sentence confused me a bit. The acoustic model and the language model are two separate models, trained on resources that are related but not always the same.

    Guessing from your question, it sounds like what you actually want is to build a smaller grammar, which contains just a small set of commands. That will result in a reduction of the size of LM. Sphinx 4's tutorial

    Another approach is reducing the AM by compromising the ASR error rate .

    You can do that too. One way to go is to train a smaller AM.

    (Also just saw Nick's reply. You probably should clarify too.)

    Arthur

     
