Merging acoustic models

morten_aau
2006-07-11
2012-09-22
  • morten_aau

    morten_aau - 2006-07-11

    Hi all,

    I wish to train a garbage model for the ASR system I'm working on, and then merge it with the acoustic model I already have. My idea is to do this in the following way:

    1) Train an ergodic HMM (with many states) on all the speech available. (the garbage model)
    2) Merge the garbage model with the existing acoustic model (Danish 3 state left-to-right triphones).
    3) Convert to Sphinx4 model format.
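To make the difference between the two topologies in steps 1) and 2) concrete, here is a minimal Java sketch that builds initial transition matrices for a left-to-right HMM and an ergodic HMM and counts the allowed arcs. The class and method names are hypothetical and are not part of SphinxTrain or Sphinx4; the flat initial probabilities are placeholder values that training would re-estimate.

```java
import java.util.Arrays;

/** Initial transition matrices for n emitting states plus one exit column (index n). */
final class HmmTopology {
    private HmmTopology() {}

    /** Left-to-right: each state may only loop on itself or advance to the next state (or exit). */
    static double[][] leftToRight(int n) {
        if (n < 1) throw new IllegalArgumentException("need at least one state");
        double[][] t = new double[n][n + 1];
        for (int i = 0; i < n; i++) {
            t[i][i] = 0.5;      // self-loop
            t[i][i + 1] = 0.5;  // next state, or the exit column for the last state
        }
        return t;
    }

    /** Ergodic: every state may move to every state (including itself) or exit. */
    static double[][] ergodic(int n) {
        if (n < 1) throw new IllegalArgumentException("need at least one state");
        double[][] t = new double[n][n + 1];
        for (double[] row : t) {
            Arrays.fill(row, 1.0 / (n + 1)); // flat placeholder probabilities
        }
        return t;
    }

    /** Number of transitions with non-zero probability. */
    static int countArcs(double[][] t) {
        int arcs = 0;
        for (double[] row : t) {
            for (double p : row) {
                if (p > 0.0) arcs++;
            }
        }
        return arcs;
    }

    public static void main(String[] args) {
        System.out.println("left-to-right, 3 states: " + countArcs(leftToRight(3)) + " arcs");
        System.out.println("ergodic, 16 states: " + countArcs(ergodic(16)) + " arcs");
    }
}
```

The arc count grows linearly with the number of states for the left-to-right model and quadratically for the ergodic one, which is worth keeping in mind when choosing "many states" for the garbage model.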

    I figure that step 1) is pretty straightforward (the technical part, not the design), but for the second step I would need to know the format of the binary acoustic model file created by SphinxTrain, which I don't. Can anyone help me out here?

    As far as I know, Sphinx4 can handle any model topology, so in theory the plan should work, unless the S3 model loader assumes a uniform model topology?

    The reason I'm trying this is that I want to do word spotting. The idea is to create a rule grammar with the words (or word sequences) I want to spot and the garbage model in a loop.

    Actually, I have already tried word spotting using a phone loop (60 base phones) as the garbage model. It seemed to work all right, but it was rather slow: 10 x RT at an acceptable recognition performance.

    I hope someone can tell me what the structure of the acoustic model file is, or whom I should ask to find out.

    Best regards
    Morten

     
    • morten_aau

      morten_aau - 2006-07-14

      Hi Arthur and Holger,

      Thank you very much for the input! I'm not sure whether I will write the merger or try this approach instead:

      1) "copy" the acoustic training material
      2) train the triphone models on one part and a garbage model on the other part (in the same training step, effectively giving one binary model file)

      For now, I'll go on vacation ;o)

      Best regards
      Morten

       
    • Holger Brandl

      Holger Brandl - 2006-07-11

      2) Probably a good place to start is the class edu.cmu.sphinx.linguist.acoustic.tiedstate.Sphinx3Loader

      There you can find a method called loadModelFiles, which loads all the binary sp3 files. Maybe you could use this class as a starting point for implementing a merging procedure.

      What do you mean by
      "it was rather slow: 10 x RT at an acceptable recognition performance"?
      Personally, I think 10xRT is quite a nice performance.

      Holger

       
      • morten_aau

        morten_aau - 2006-07-12

        Hi Holger,

        Thank you for the reply!

        Yes, I might be forced to try and figure out the format myself (I believe I also could take a look at some of the code in SphinxTrain/src/libs/libio/*). It would, though, be nice if someone knew the exact format.

        The application I'm working on requires near real time performance, so 10xRT is too slow.

        Best regards
        Morten

         
        • The Grand Janitor

          Will this help you?

          http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#4b

          I can answer more detailed questions if you want.

          For starters, printp will be a great help to you.

          -a

           
    • Holger Brandl

      Holger Brandl - 2006-07-12

      Hi Morten,

      > I might be forced to try and figure out the format myself
      Not really, if you want to use Java: you can load all model properties directly with the Sphinx3Loader. You only need to write a merging function.
      Handling binary data is (imho) never a pleasure, so I would try to avoid it and use existing code.
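As a rough illustration of what such a merging function has to do, the following Java sketch concatenates two pools of tied states and shifts the state IDs of the second model so they still point at the right entries. ModelSet and its fields are hypothetical stand-ins and are not the Sphinx4 data structures; a real merger would also have to combine mixture weights and transition matrices, and both models would need the same feature dimension.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model: a pool of tied states plus units that index into it. */
final class ModelSet {
    /** One placeholder parameter vector per tied state. */
    final List<double[]> statePool = new ArrayList<>();
    /** Unit name (e.g. a triphone or the garbage model) mapped to its state IDs in statePool. */
    final Map<String, int[]> units = new LinkedHashMap<>();

    /** Returns a new set with all of a's states and units, followed by b's with shifted IDs. */
    static ModelSet merge(ModelSet a, ModelSet b) {
        ModelSet out = new ModelSet();
        // Parameter vectors are shared by reference, not deep-copied.
        out.statePool.addAll(a.statePool);
        out.statePool.addAll(b.statePool);

        for (Map.Entry<String, int[]> e : a.units.entrySet()) {
            out.units.put(e.getKey(), e.getValue().clone());
        }

        int offset = a.statePool.size(); // b's state 0 now lives at this index
        for (Map.Entry<String, int[]> e : b.units.entrySet()) {
            if (out.units.containsKey(e.getKey())) {
                throw new IllegalArgumentException("duplicate unit: " + e.getKey());
            }
            int[] shifted = e.getValue().clone();
            for (int i = 0; i < shifted.length; i++) {
                shifted[i] += offset;
            }
            out.units.put(e.getKey(), shifted);
        }
        return out;
    }
}
```

The index shift is the essential step: without it, the second model's units would silently refer to the first model's states.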

      > The application I'm working on requires near real time performance, so 10xRT is too slow.
      My mistake. When someone says "a method runs at 10xRT", I interpret it to mean that the method can process 10 seconds of speech in 1 second. Am I wrong?

      0.1xRT is certainly not sufficient. What is the HMM topology of your phone loop? If it is ergodic, your search space might blow up enormously, which (together with badly adjusted pruning parameters) would lead to bad performance.

      Best regards, Holger

       
      • The Grand Janitor

        Hi Holger and Morten,
        Holger:
        Glad that you brought up the S3Loader. I think it is a good point.

        As for the binary format: it is a legacy. When the acoustic model (a.m.) format was first designed, fast loading was a big issue. That is why a data-structure-like format became the norm. Nothing right or wrong about it.

        As for 10xRT: in speech recognition speed research, when people say 10xRT, they actually mean that 1 second of speech is processed in 10 seconds. In commercial descriptions of speech recognition, though, when people say 10xRT, they use your meaning. Again, I don't think you are right or wrong here. It's just a convention that differs between fields.
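The two readings of "10xRT" described above are simply reciprocals of each other. A small Java sketch, with hypothetical method names chosen only for this illustration, makes that explicit:

```java
/** The two conventions for "x times real time" discussed above. */
final class RealTimeFactor {
    private RealTimeFactor() {}

    /** Research convention: processing seconds per second of audio. Above 1.0 is slower than real time. */
    static double researchXrt(double audioSeconds, double processingSeconds) {
        if (audioSeconds <= 0.0) {
            throw new IllegalArgumentException("audio duration must be positive");
        }
        return processingSeconds / audioSeconds;
    }

    /** Speed-up reading: seconds of audio handled per second of processing. Above 1.0 is faster than real time. */
    static double speedUp(double audioSeconds, double processingSeconds) {
        if (processingSeconds <= 0.0) {
            throw new IllegalArgumentException("processing time must be positive");
        }
        return audioSeconds / processingSeconds;
    }

    public static void main(String[] args) {
        double audio = 6.0;       // seconds of speech
        double processing = 60.0; // seconds the recognizer needed
        System.out.printf("research convention: %.1f xRT%n", researchXrt(audio, processing));
        System.out.printf("speed-up reading:    %.1f xRT%n", speedUp(audio, processing));
    }
}
```

Stating the ratio together with its direction (for example "10 seconds of processing per second of speech") avoids the ambiguity altogether.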

        Morten:
        If you could contribute the merger to us, I would be very grateful.

        Regards,
        Arthur

         
