Which Sphinx version best for phone lattices?

Forum: Help
Creator: Alex S
Created: 2008-01-31
Updated: 2012-09-22
  • Alex S

    Alex S - 2008-01-31

    Hi,
    I need to generate phone lattices (preferably in HTK format), and I'd like to know which version of Sphinx is best to use for this. As I understand it, S3.7 can generate HTK word lattices, but have the phone lattice capabilities been removed entirely?
    Which of {S2-0.6, S3-0.6, other versions} is best for getting phone lattices?
    Can that version output them in HTK format, and if not, is there existing software to convert from Sphinx to HTK format?

    Thanks
    Alex

     
    • David Huggins-Daines

      As far as I know, Sphinx 3.7 can generate phone lattices just fine. In fact it ought to be able to generate them in HTK format. Just run it with -mode allphone -outlatdir . -outlatfmt htk.
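
A minimal sketch of how the options named above could be assembled into a command line from Python. Only `-mode allphone`, `-outlatdir` and `-outlatfmt htk` come from this thread; the executable name `sphinx3_decode`, the helper name `allphone_lattice_command`, and the file name `phone.lm` are assumptions for illustration, and a real run needs further options (acoustic model, control file and so on) that are not shown here.

```python
def allphone_lattice_command(extra_options=None, outlatdir=".",
                             decoder="sphinx3_decode"):
    """Build an argument list for allphone decoding with HTK lattice output.

    `decoder` is an assumed executable name; `extra_options` is a dict of
    additional flag -> value pairs supplied by the caller.
    """
    # These three options are the ones quoted in the post above.
    cmd = [decoder,
           "-mode", "allphone",
           "-outlatdir", outlatdir,
           "-outlatfmt", "htk"]
    # Append any caller-supplied options unchanged, in insertion order.
    for flag, value in (extra_options or {}).items():
        cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # "phone.lm" is a hypothetical language model file.
    print(" ".join(allphone_lattice_command({"-lm": "phone.lm"})))
```

The list can be handed to `subprocess.run(cmd, check=True)` once the remaining decoder options for your setup have been added.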

       
      • Nagendra Kumar Goel

        From my experience the last time I used it, they are in HTK format, but there are subtle differences.
        One important difference is that node numbers run backwards in time; a second is that the final node is
        not a unique, well-defined node at the end-of-signal time.
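
If a downstream tool expects node numbers to increase with time, the lattice can be renumbered after decoding. The sketch below is one way to do that, assuming a simple HTK SLF layout in which each node line starts with `I=<id> t=<time>` and each link line starts with `J=<id> S=<start> E=<end>`. The function name `renumber_slf_nodes` is hypothetical, and the sketch does not address the non-unique final node.

```python
import re

# Assumed line shapes for a simple HTK SLF file.
NODE_RE = re.compile(r"^I=(\d+)\s+t=(\S+)(.*)$")
LINK_RE = re.compile(r"^J=(\d+)\s+S=(\d+)\s+E=(\d+)(.*)$")


def renumber_slf_nodes(lines):
    """Return SLF lines with node IDs reassigned in order of increasing time."""
    header, nodes, links = [], [], []
    for raw in lines:
        line = raw.rstrip("\n")
        m = NODE_RE.match(line)
        if m:
            # (time, old id, original time text, rest of the line)
            nodes.append((float(m.group(2)), int(m.group(1)),
                          m.group(2), m.group(3)))
            continue
        m = LINK_RE.match(line)
        if m:
            links.append((int(m.group(1)), int(m.group(2)),
                          int(m.group(3)), m.group(4)))
            continue
        header.append(line)  # VERSION=, N= L=, comments, etc.

    # Sort by time; ties are broken by the old ID to keep the result stable.
    nodes.sort(key=lambda n: (n[0], n[1]))
    new_id = {old: new for new, (_, old, _, _) in enumerate(nodes)}

    out = list(header)
    for new, (_, _, t_text, rest) in enumerate(nodes):
        out.append(f"I={new} t={t_text}{rest}")
    for j, start, end, rest in links:
        out.append(f"J={j} S={new_id[start]} E={new_id[end]}{rest}")
    return out
```

Header lines are passed through untouched, so the node and link counts stay valid; only the `I=`, `S=` and `E=` values change.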

         
      • Alex S

        Alex S - 2008-03-05

        Sphinx 3.7 does not support the -phonetp flag and seems to have no equivalent. Is it possible to specify phone transition probabilities in allphone mode?

        I'm using Sphinx to create phone lattices. If I were to run Sphinx in one of the standard modes (not allphone), with a dictionary that consists of one word for every phone, would this work? Would it have any advantages/disadvantages over running it in allphone mode?

        If this is not slower/less accurate, it would give me the advantage of using phone trigrams by specifying a language model for these "phone words".

         
        • David Huggins-Daines

          The -phonetp flag has been removed because now you are able to use a standard trigram language model instead.

           
    • Alex S

      Alex S - 2008-03-11

      I've been able to successfully generate phone lattices in allphone mode and fwdtree mode, but the operation of allphone mode is undocumented and not very clear.
      In both modes, I specify a dictionary of 40 words (each word is one phone):

      word phone
      ---- -----
      AA AA
      AE AE
      AH AH
      etc.

      I also have a filler dictionary with
      <s> SIL
      </s> SIL
      <sil> SIL
      ++BREATH++ +BREATH+
      ++COUGH++ +COUGH+
      ++SMACK++ +SMACK+
      ++UH++ +UH+
      ++UM++ +UM+

      Finally, I specify a language model (phone unigrams and bigrams, and optionally trigrams).
      It looks like:
      \2-grams:
      -2.164211 AA AH -0.4287878
      -3.005496 AA AO -0.1684585
      etc

      I understand what happens in fwdtree mode with all this information, but can someone please explain what happens in allphone mode? Is any of this info unused? It would make sense that the dictionary and fillerdict are completely ignored, and that my LM is interpreted as a phone LM, P(phone(t) | phone(t-1), phone(t-2)), for computing the LM scores. I would also imagine that the difference in operation is that fillers can be inserted in fwdtree mode but not in allphone mode?

      Is this close to what actually happens?

      I end up with lattices of different densities and with different n-best lists when using allphone and fwdtree with the same params.

      Thanks again for your help

      -Alex
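
The one-phone-per-word dictionary described in this post can be generated from a phone list rather than typed by hand. This is a small sketch; the function name `phone_dictionary_lines` is hypothetical and the three phones are only the ones quoted above, not the full set of 40.

```python
def phone_dictionary_lines(phones):
    """Return dictionary lines where each 'word' is pronounced as itself."""
    seen = set()
    lines = []
    for phone in phones:
        if phone in seen:
            continue  # skip duplicates so each entry appears once
        seen.add(phone)
        lines.append(f"{phone} {phone}")
    return lines


if __name__ == "__main__":
    for line in phone_dictionary_lines(["AA", "AE", "AH"]):
        print(line)
```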

       
      • David Huggins-Daines

        Hi, sorry about the lack of documentation...

        The dictionary and filler dict are completely ignored in allphone mode, you're correct. They are just there to satisfy some parts of the decoder which expect them to be there.

        As for the difference between allphone and fwdtree, there are two big differences. The first one you've already mentioned, which is that fwdtree will try to insert fillers between each phone.

        The second one is that, because fwdtree thinks each phone is a "word", it will only search the single-word triphones for them. This means that it is not actually using a good chunk of your acoustic model.

        Also, left and right contexts at word boundaries are approximated by fwdtree search using "composite senones", which means that you're not getting full triphone modeling.

        In practice I think using fwdtree for "phone" decoding is about 20-30% less accurate (relative).

         
  • Pranav Jawale

    Pranav Jawale - 2012-02-29

    Hello,

    As mentioned above, fillerdict is ignored in allphone mode. How can I make sure
    that BOTH fillers and phones are recognized in allphone mode?

    If I add fillers as phones, i.e. BREATH +BREATH+ in the dictionary, it doesn't
    work because my LM doesn't contain fillers. My phone bigram is built from
    phonetic transcriptions. How do I specify fillers in the LM? Would adding them as
    unigrams with some probability work (as below)?

    \1-grams:
    -99.0000 HORN  0.0000
    -99.0000 BREATH 0.0000
    -99.0000 SMACK  0.0000
    

    Or is some other strategy advised for choosing the filler probability?

     
  • Nickolay V. Shmyrev

    > If I add fillers as phones, i.e. BREATH +BREATH+ in the dictionary, it doesn't
    > work because my LM doesn't contain fillers. My phone bigram is built from
    > phonetic transcriptions. How do I specify fillers in the LM? Would adding them as
    > unigrams with some probability work (as below)?

    You can create a new LM with fillers from the phonetic transcription of a
    medium-sized text. Ideally you should have some real-world training material, but you
    can also model a certain percentage of fillers in the training phonetic
    text. The whole purpose of the LM in allphone mode is to estimate phone sequence
    probabilities. You only need to make this estimate accurate enough.
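
One way to "model a certain percentage of fillers" is to insert filler tokens at random into the phonetic training text before building the LM. The sketch below assumes the transcript is a list of phone tokens and inserts a filler before each phone with a fixed probability. The function name `inject_fillers`, the 5% rate and the uniform choice among fillers are illustrative assumptions, not values recommended in this thread.

```python
import random


def inject_fillers(phones, fillers, rate, rng):
    """Insert a randomly chosen filler before each phone with probability `rate`."""
    out = []
    for phone in phones:
        if rng.random() < rate:
            out.append(rng.choice(fillers))
        out.append(phone)
    return out


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the training text is reproducible
    transcript = ["HH", "AH", "L", "OW"]
    print(" ".join(inject_fillers(transcript, ["BREATH", "SMACK"], 0.05, rng)))
```

The augmented text can then be passed to whatever LM toolkit built the original phone bigram, so that the fillers receive n-gram estimates alongside the phones.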

     
