
Some thoughts to tune Sphinx

Yueyu/Lin
2008-09-30
2012-09-22
  • Yueyu/Lin

    Yueyu/Lin - 2008-09-30

    I'm currently using a very small NGram to test Sphinx performance. I called Timer.dumpAll() to show the time spent in the different steps. The following are my results:

    --------------- Summary statistics ---------

    Total Time Audio: 1.62s Proc: 0.17s Speed: 0.10 X real time
    result is are your grandparents still living,duration is 6251

    ----------------------------- Timers----------------------------------------

    Name Count CurTime MinTime MaxTime AvgTime TotTime

    streamDataSourc 25 0.0000s 0.0000s 0.0010s 0.0003s 0.0070s
    premphasizer 25 0.0000s 0.0000s 0.0010s 0.0001s 0.0020s
    windower 25 0.0000s 0.0000s 0.0010s 0.0002s 0.0060s
    fft 163 0.0000s 0.0000s 0.0040s 0.0003s 0.0550s
    melFilterBank 163 0.0000s 0.0000s 0.0010s 0.0000s 0.0030s
    dct 163 0.0000s 0.0000s 0.0010s 0.0000s 0.0040s
    featureExtracti 159 0.0000s 0.0000s 0.0010s 0.0000s 0.0010s
    Score 162 0.0000s 0.0000s 0.0810s 0.0007s 0.1160s
    Prune 971 0.0000s 0.0000s 0.0010s 0.0000s 0.0010s
    Grow 972 0.0000s 0.0000s 0.0040s 0.0001s 0.1050s
    DictionaryLoad 1 0.5430s 0.5430s 0.5430s 0.5430s 0.5430s
    AM_Load 1 3.4910s 3.4910s 3.4910s 3.4910s 3.4910s
    compile 1 1.6120s 1.6120s 1.6120s 1.6120s 1.6120s
    buildHmmPool 1 1.6010s 1.6010s 1.6010s 1.6010s 1.6010s
    Create HMMTree 1 0.0060s 0.0060s 0.0060s 0.0060s 0.0060s

    From the above, we can see that AM_Load (acoustic model loading) costs a lot of time.
    I'm using edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model
    The call to loader.load() is what costs the time.
    In my Configure.xml, the loader is
    edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.ModelLoader

    The first step is to improve the performance of ModelLoader.

    There are two options:
    1. Re-use the Acoustic Model instance for multiple recognizers. From what I understand now, it's not possible. Am I right?
    2. Re-implement the Acoustic Model, especially the Acoustic Model Loader. I actually think this is a practical way. The model is a tree, and we can keep the tree data alive across the whole JVM. What I'm planning is to use an embedded DB (for example, BerkeleyDB) as the cache: for each model, store the acoustic model data structure in the database once, so multiple recognizers can share the same database instance. Since it's a read-only operation, access will be very fast (from my industry experience in other fields).
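
    A JVM-wide cache keyed by model location would capture the spirit of both options. The sketch below is illustrative only — Model and AcousticModelCache are hypothetical stand-ins, not Sphinx classes — but it shows how ConcurrentHashMap.computeIfAbsent can guarantee that the expensive load runs at most once per location while every recognizer shares the same instance:

    ```java
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical stand-in for the data structure built by loader.load();
    // not a Sphinx class.
    final class Model {
        final String location;
        Model(String location) { this.location = location; }
    }

    // JVM-wide cache: one Model per location, shared by every recognizer.
    final class AcousticModelCache {
        private static final Map<String, Model> CACHE = new ConcurrentHashMap<>();
        private static int loads = 0;  // counts real loads, for illustration

        static Model get(String location) {
            // computeIfAbsent runs the expensive load at most once per key,
            // even when several recognizers ask for the model concurrently.
            return CACHE.computeIfAbsent(location, loc -> {
                loads++;  // a real version would call loader.load() here,
                          // or read a pre-serialized model from BerkeleyDB
                return new Model(loc);
            });
        }

        static int loadCount() { return loads; }
    }
    ```

    On this design, the embedded-DB variant only changes what happens inside the lambda; the sharing guarantee stays the same.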

    Do you have any more thoughts about this?

    If the acoustic model can be tuned like this, buildHmmPool and Create HMMTree can also be tuned in the same way. What do you think about it?

     
    • Yueyu/Lin

      Yueyu/Lin - 2008-10-01

      I have checked the code of LexTextLinguist.java. In theory, the acoustic model and the NGram grammar can be separated. Unfortunately, the allocate() function is responsible both for initializing the acoustic model and for compiling the grammar. I can't find any way to recompile the grammar without touching the acoustic model. Even worse, the allocate() function sets the acousticModel variable to null, so I can't even access it later.
      But I can still manage to handle that.
      1. I copied all the code out of LexTextLinguist.java and renamed it MyLexTextLinguist.java
      2. Keep the acousticModel variable so it can be reused later
      3. Allocate the MyLexTextLinguist instance
      4. Use the recognizer to do the recognition
      4.1 I have reimplemented an NGram grammar that can be reused multiple times with different files
      5. Set the languageModel's resource to a new file location
      6. Recompile the languageModel for the MyLexTextLinguist instance
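
      The steps above can be sketched as follows. ReusableLinguist, setGrammarLocation() and recompileGrammar() are illustrative names for the pattern, not the Sphinx-4 API:

      ```java
      // Illustrative sketch of the reuse pattern; class and method names are
      // hypothetical, not the Sphinx-4 API.
      final class AcousticModel {
          boolean allocated = false;
          void allocate() { allocated = true; }  // the expensive step, done once
      }

      final class ReusableLinguist {
          private final AcousticModel acousticModel;  // kept, never set to null
          private String grammarLocation;
          private boolean grammarCompiled = false;

          ReusableLinguist(AcousticModel am, String grammarLocation) {
              this.acousticModel = am;
              this.grammarLocation = grammarLocation;
          }

          // Steps 2-3: allocate once; the acoustic model survives for reuse.
          void allocate() {
              if (!acousticModel.allocated) acousticModel.allocate();
              grammarCompiled = true;  // compile the initial grammar
          }

          // Step 5: point the language model at a new grammar file.
          void setGrammarLocation(String location) {
              grammarLocation = location;
              grammarCompiled = false;
          }

          // Step 6: recompile only the grammar; the acoustic model is untouched.
          void recompileGrammar() { grammarCompiled = true; }

          AcousticModel acousticModel() { return acousticModel; }
          boolean ready() { return acousticModel.allocated && grammarCompiled; }
      }
      ```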

      That will work for my cases. Now I can continue tuning the application to cache the grammar in an embedded database, so I won't have to recompile it every time I use it. The text NGram grammar compilation is not cheap either. I expect I can make it reasonably fast.

      But I really hope the Sphinx team will abstract out the data access layer, to leave more room for building different kinds of applications.

      BTW: I have also used the Nuance recognizer. It lets all recognizers share the same copy of the acoustic and language models: for an acoustic model and language model referred to by the same URL, the Nuance SDK always points to the same copy instead of loading and compiling them over and over. I can actually make Sphinx behave like this, but it feels much more like a hack, because I can't find support for it in the current framework design.

      That's all I'm thinking about and doing now. Any comments are welcome. Thanks.

       
      • Nickolay V. Shmyrev

        > That will work for my cases.

        You are welcome to submit a patch

        > But I really hope the Sphinx team may abstract the data access layer out to enable more spaces to create different applications.

        Right now there is no immediate need for this.

         
    • Yueyu/Lin

      Yueyu/Lin - 2008-09-30

      After profiling with NetBeans, I found:
      1. edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.ModelLoader.loadHMMPool(boolean, java.io.InputStream, String) costs 61% of the time
      2. edu.cmu.sphinx.model.acoustic.WSJ_8gau_13dCep_16k_40mel_130Hz_6800Hz.Model.lookupNearestHMM(edu.cmu.sphinx.linguist.acoustic.Unit, edu.cmu.sphinx.linguist.acoustic.HMMPosition, boolean) costs 22% of the time

      Since what I want to do is reuse the recognizer with different grammars (i.e., different LanguageModel instances), I think the first cost can be avoided, since all grammars can share the same acoustic model. As for the second part, it should be possible to improve it as much as we can.
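
      For the second hot spot, one low-risk idea is to memoize the lookup results, since the same (unit, position) pairs recur across frames. This is a generic sketch with stand-in types — the String-keyed cache below is not Sphinx code, just the memoization pattern applied to a lookup like lookupNearestHMM:

      ```java
      import java.util.HashMap;
      import java.util.Map;

      // Generic memoization sketch for a hot lookup such as lookupNearestHMM;
      // the String unit/position and result types are stand-ins, not Sphinx classes.
      final class HmmLookupCache {
          private final Map<String, String> cache = new HashMap<>();
          private int misses = 0;

          // Stand-in for the real (expensive) nearest-HMM search.
          private String slowLookup(String unit, String position) {
              misses++;
              return unit + "@" + position;
          }

          String lookup(String unit, String position) {
              // Cache hit: no search. Cache miss: search once, remember the result.
              return cache.computeIfAbsent(unit + "|" + position,
                      k -> slowLookup(unit, position));
          }

          int misses() { return misses; }
      }
      ```

      Whether this pays off depends on how often the same pairs repeat in practice; profiling the hit rate first would be prudent.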

      Do you have any opinions? Thanks in advance.

       
      • Nickolay V. Shmyrev

        There was similar discussion already.

        https://sourceforge.net/forum/message.php?msg_id=4625020

        I think that, first of all, as in any optimization, you need to know what to optimize and understand the application and its behaviour. So, please provide your testing application first; model loading should really be a minor thing compared to speech processing.

        As for speech processing speed, there are well-known tricks that should speed up processing significantly, for example multipass recognition with lattice rescoring.

         
        • Yueyu/Lin

          Yueyu/Lin - 2008-09-30

          My application is not a typical recognition application.
          I have a lot of small grammars that need to be used by different recognition processes.
          That means I can't use one grammar for all recognition; in fact, the grammar is different every time.
          That requires loading the acoustic model and compiling the grammars to be as fast as possible. Since the grammars are small, recognition performance is not the major issue for this application.
          For a typical recognition application, the grammar does not change frequently and there are not that many grammars (a different grammar for each recognition session), so grammar loading and acoustic model loading are not an issue: they can be loaded once and reused in later sessions.

          That's why I want to improve the performance of loading the acoustic model and compiling the grammars.

          Do you have any ideas about it? Thanks.

           
          • Nickolay V. Shmyrev

            I'm sure it's not required to reload the acoustic model to recompile a grammar; you should load the acoustic model once and reuse it everywhere. Probably something went wrong in your code.

            Grammar compilation must be relatively fast.

             
