CMU Sphinx / Forums / Help: Merging acoustic models

morten_aau - 2006-07-11

Hi all,

I wish to train a garbage model for the ASR system I'm working on, and then merge it with the acoustic model I already have. My idea is to do this in the following way:

1) Train an ergodic HMM (with many states) on all the available speech (the garbage model).
2) Merge the garbage model with the existing acoustic model (Danish 3 state left-to-right triphones).
3) Convert to Sphinx4 model format.
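
Conceptually, step 2 amounts to concatenating the state pools of the two models and shifting the state indices of the second one. A minimal Python sketch of that idea, assuming a simplified in-memory representation: the `AcousticModel` class, its fields, and the unit name `GARBAGE` are hypothetical and are not the SphinxTrain file layout.

```python
from dataclasses import dataclass, field


@dataclass
class AcousticModel:
    # Hypothetical container: a pool of state output distributions
    # (plain strings here as placeholders for real GMM parameters).
    states: list = field(default_factory=list)
    # Unit name -> list of indices into `states`.
    units: dict = field(default_factory=dict)


def merge_models(base, extra):
    """Return a new model holding the units of both inputs."""
    clash = set(base.units) & set(extra.units)
    if clash:
        raise ValueError(f"unit names defined in both models: {sorted(clash)}")
    # States of `extra` are appended after those of `base`,
    # so every index that `extra` uses must be shifted by this offset.
    offset = len(base.states)
    merged = AcousticModel(states=base.states + extra.states,
                           units=dict(base.units))
    for name, ids in extra.units.items():
        merged.units[name] = [i + offset for i in ids]
    return merged


# Toy data: one 3-state unit plus a 4-state garbage unit.
triphones = AcousticModel(states=["s0", "s1", "s2"], units={"a": [0, 1, 2]})
garbage = AcousticModel(states=["g0", "g1", "g2", "g3"],
                        units={"GARBAGE": [0, 1, 2, 3]})
merged = merge_models(triphones, garbage)
print(merged.units["GARBAGE"])  # [3, 4, 5, 6]
```

A real merger would presumably need the same index shifting for every parameter table that the model definition refers to by index, which is why the actual file layout matters.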

I figure that step 1) is pretty straightforward (the technical part, not the design), but for the second step I would need to know the format of the binary acoustic model file created by SphinxTrain, which I don't. Can anyone help me out here?

As far as I know, Sphinx4 is capable of handling any model topology, so in theory the plan should work, unless the S3 model loader assumes a uniform model topology?

The reason I'm trying this is that I want to do word spotting. The idea is to create a rule grammar with the words (or word sequences) I want to spot and the garbage model in a loop.
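
A rough sketch of what such a loop grammar could look like, written as a small Python helper that emits JSGF text. The token `GARBAGE`, the grammar name `spotter`, and the rule names are assumptions for illustration; the garbage token would still have to be mapped to the garbage model in the pronunciation dictionary.

```python
def keyword_loop_grammar(keywords, garbage_token="GARBAGE", name="spotter"):
    """Build JSGF text accepting any sequence of keywords and garbage tokens."""
    if not keywords:
        raise ValueError("need at least one keyword")
    # Each keyword may be a single word or a word sequence such as "open door".
    alternatives = " | ".join(keywords)
    return (
        "#JSGF V1.0;\n"
        f"grammar {name};\n"
        "public <utterance> = ( <keyword> | <garbage> )+;\n"
        f"<keyword> = {alternatives};\n"
        f"<garbage> = {garbage_token};\n"
    )


print(keyword_loop_grammar(["hello", "open door"]))
```

The `+` makes the group repeatable, so the recognizer can alternate freely between spotted words and garbage.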

Actually I already tried to do word spotting by using a phone loop (60 base phones) as garbage model. It seemed to work all right, but it was rather slow: 10 x RT at an acceptable recognition performance.
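
To be explicit about the figure: by 10 x RT I mean processing time divided by audio duration, i.e. 1 second of speech takes about 10 seconds to decode. A minimal sketch of the arithmetic, with made-up timings and a hypothetical helper name:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio duration; values above 1 are slower than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


# Illustrative numbers only: 5 s of audio decoded in 50 s.
rtf = real_time_factor(processing_seconds=50.0, audio_seconds=5.0)
print(f"{rtf:.1f} x RT")  # 10.0 x RT
```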

I hope that someone can tell me what the structure of the acoustic model file is, or whom I should ask to find out.

Best regards
Morten

- morten_aau - 2006-07-14
  
  Hi Arthur and Holger,
  
  Thank you very much for the input! I'm not sure whether I will write the merger or try this approach instead:
  
  1) "copy" the acoustic training material
  2) train the triphone models on one part and a garbage model on the other part (in the same training step, effectively giving one binary model file)
  
  For now, I'll go on vacation ;o)
  
  Best regards
  Morten
  
- Holger Brandl - 2006-07-11
  
  2) Probably a good place to start is the class edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader
  
  There you can find a method called loadModelFiles, which loads all the binary sp3 files. Maybe you could use this class as a starting point for implementing a merging procedure.
  
  What do you mean by
  "it was rather slow: 10 x RT at an acceptable recognition performance."?
  Personally I think that 10xRT is quite nice performance.
  
  Holger
  
  - morten_aau - 2006-07-12
    
    Hi Holger,
    
    Thank you for the reply!
    
    Yes, I might be forced to try to figure out the format myself (I believe I could also take a look at some of the code in SphinxTrain/src/libs/libio/*). It would be nice, though, if someone knew the exact format.
    
    The application I'm working on requires near real time performance, so 10xRT is too slow.
    
    Best regards
    Morten
    
    - The Grand Janitor - 2006-07-12
      
      Will this help you?
      
      http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#4b
      
      I can answer more detailed questions if you want.
      
      For starters, printp will be a great help to you.
      
      -a
      
- Holger Brandl - 2006-07-12
  
  Hi Morten,
  
  > I might be forced to try and figure out the format myself
  Not really, if you want to use Java: you can load all model properties directly with the Sphinx3Loader. You only need to write a merging function.
  Handling binary data is (imho) never a pleasure, so I would try to avoid it and use existing code.
  
  > The application I'm working on requires near real time performance, so 10xRT is too slow.
  My mistake. If someone says "a method runs in 10xRT", I interpret that to mean the method is able to process 10 seconds of speech in 1 second. Am I wrong?
  
  For sure 0.1xRT (in my reading of the term) is not sufficient. What is the HMM topology of your phone loop? If it is ergodic, your search space might blow up enormously, which would lead (together with badly adjusted pruning parameters) to bad performance.
  
  Best regards, Holger
  
  - The Grand Janitor - 2006-07-12
    
    Hi Holger and Morten,
    Holger:
    Glad that you brought up the S3Loader. I think it is a good suggestion.
    
    As for the binary format: it is a legacy. When the acoustic model code was first built, fast loading was a big issue. That is why a format that mirrors the in-memory data structures was adopted. It is neither right nor wrong.
    
    As for 10xRT: in speech recognition speed research, when people say 10xRT, they actually mean that 1 second of speech is processed in 10 seconds. In commercial descriptions of speech recognizers, though, when people say 10xRT they use your meaning. Again, I don't think you are right or wrong in this case. It's just a different convention in a different field.
    
    Morten:
    If you could contribute the merger to us, I would be very grateful.
    
    Regards,
    Arthur
    
