Hi,
I need to generate phone lattices (preferably in HTK format), and I'd like to know which version of Sphinx is best to use for this. As I understand it, S3.7 can generate HTK word lattices, but have the phone lattice capabilities been removed entirely?
Which of {S2-0.6, S3-0.6, other versions} is best for getting phone lattices?
Can that version output them in HTK format, and if not, is there existing software to convert from Sphinx to HTK format?
Thanks
Alex
As far as I know, Sphinx 3.7 can generate phone lattices just fine. In fact it ought to be able to generate them in HTK format. Just run it with -mode allphone -outlatdir . -outlatfmt htk.
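For batch runs, the flags above can be assembled in a small script. This is only a sketch: the decoder binary name sphinx3_decode, the helper name and the extra_args parameter are assumptions that are not confirmed anywhere in this thread. Only -mode, -outlatdir and -outlatfmt come from the reply above, and model or control-file arguments are left to the caller.

```python
import shlex

def allphone_lattice_args(decoder="sphinx3_decode", outlatdir=".", extra_args=()):
    """Build an argv list for phone-lattice generation in HTK format.

    `decoder` is an assumed binary name; adjust it to your installation.
    """
    args = [
        decoder,
        "-mode", "allphone",      # phone-loop decoding
        "-outlatdir", outlatdir,  # directory that receives the lattices
        "-outlatfmt", "htk",      # write lattices in HTK format
    ]
    args.extend(extra_args)       # acoustic model, LM, control file, etc.
    return args

if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(allphone_lattice_args()))
```

The list can be passed to subprocess.run once the remaining model arguments have been added.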
From my experience the last time I used it, the lattices are in HTK format, but there are subtle differences.
One important difference is that the node numbers run backwards in time; the second is that there is no
unique, well-defined final node at the end time of the signal.
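If a downstream tool expects node numbers that increase with time, the ids can be remapped after decoding. The following is a minimal sketch, assuming the usual HTK lattice field names (I= and t= on node lines, S= and E= on link lines, optional start=/end= in the header); the function name is hypothetical. It only renumbers ids, leaves the lines in their original order, and does not address the non-unique final node.

```python
import re

# Node-id fields: I= on node lines, S=/E= on link lines, and start=/end=
# if a header carries them. Fields are whitespace-delimited.
NODE_REF = re.compile(r"(?<!\S)(I|S|E|start|end)=(\d+)(?!\S)")

def renumber_nodes_by_time(slf_text):
    """Return lattice text whose node ids increase with node time."""
    lines = slf_text.splitlines()
    nodes = []  # (time, old_id) for every node definition line
    for line in lines:
        fields = dict(f.split("=", 1) for f in line.split() if "=" in f)
        if "I" in fields and "t" in fields:
            nodes.append((float(fields["t"]), int(fields["I"])))
    # Sort by time (ties broken by old id) and hand out new ids 0..N-1.
    new_id = {old: new for new, (_, old) in enumerate(sorted(nodes))}

    def remap(match):
        return f"{match.group(1)}={new_id[int(match.group(2))]}"

    return "\n".join(NODE_REF.sub(remap, line) for line in lines)

if __name__ == "__main__":
    # Toy lattice, made up for illustration only.
    sample = "\n".join([
        "VERSION=1.0",
        "N=3 L=2",
        "I=0 t=0.50",
        "I=1 t=0.20",
        "I=2 t=0.00",
        "J=0 S=2 E=1 W=AA a=-10.0 l=-1.0",
        "J=1 S=1 E=0 W=AH a=-12.0 l=-2.0",
    ])
    print(renumber_nodes_by_time(sample))
```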
Sphinx 3.7 does not support the -phonetp tag and seems to have no equivalent. Is it possible to specify phone transition probabilities in allphone mode?
I'm using Sphinx to create phone lattices. If I were to run Sphinx in one of the standard modes (not allphone), with a dictionary that consists of one word for every phone, would this work? Would it have any advantages/disadvantages over running it in allphone mode?
If this is not slower/less accurate, it would give me the advantage of using phone trigrams by specifying a language model for these "phone words".
The -phonetp flag has been removed because now you are able to use a standard trigram language model instead.
I've been able to successfully generate phone lattices in allphone mode and fwdtree mode, but the operation of allphone mode is undocumented and not very clear.
In both modes, I specify a dictionary of 40 words (each word is one phone):
word phone
---- -----
AA AA
AE AE
AH AH
etc.
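A dictionary of this shape can be generated from a phone list rather than typed by hand. A minimal sketch, with a hypothetical helper name and a three-phone list standing in for the full set of 40:

```python
def phone_dictionary_lines(phones):
    """Return dictionary lines in which each phone is a one-phone 'word'."""
    # Deduplicate and sort so the output is stable across runs.
    return [f"{phone} {phone}" for phone in sorted(set(phones))]

if __name__ == "__main__":
    for line in phone_dictionary_lines(["AH", "AA", "AE"]):
        print(line)
```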
I also have a filler dictionary with
<s> SIL
</s> SIL
<sil> SIL
++BREATH++ +BREATH+
++COUGH++ +COUGH+
++SMACK++ +SMACK+
++UH++ +UH+
++UM++ +UM+
Finally, I specify a language model (phone unigrams, bigrams, and optionally trigrams).
It looks like:
\2-grams:
-2.164211 AA AH -0.4287878
-3.005496 AA AO -0.1684585
etc
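For readers less familiar with ARPA-format LMs: each n-gram line holds a base-10 log probability, the n-gram itself and, except at the highest order, a base-10 log backoff weight. A minimal sketch for reading lines like the ones above (the function name is hypothetical):

```python
def parse_arpa_ngram(line, order):
    """Split one ARPA n-gram line into (log10 prob, tokens, log10 backoff or None)."""
    fields = line.split()
    logprob = float(fields[0])
    tokens = tuple(fields[1:1 + order])
    # The backoff weight is absent on highest-order n-grams.
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
    return logprob, tokens, backoff

if __name__ == "__main__":
    logprob, (prev, cur), backoff = parse_arpa_ngram(
        "-2.164211 AA AH -0.4287878", order=2)
    # Convert from log10 back to a plain conditional probability.
    print(f"P({cur} | {prev}) = {10 ** logprob:.5f}, log10 backoff = {backoff}")
```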
I understand what happens in fwdtree mode with all this information, but can someone please explain what happens in allphone mode? Is any of this info unused? It would make sense that the dictionary and fillerdict are completely ignored, and that my LM is interpreted as a phone LM, P(phone(t) | phone(t-1), phone(t-2)), for computing the LM scores. I would also guess that the difference in operation comes from filler insertion being possible in fwdtree mode but not in allphone mode.
Is this close to what actually happens?
I end up with lattices of different densities and with different n-best lists when using allphone and fwdtree with the same params.
Thanks again for your help
-Alex
Hi, sorry about the lack of documentation...
The dictionary and filler dict are completely ignored in allphone mode, you're correct. They are just there to satisfy some parts of the decoder which expect them to be there.
As for the difference between allphone and fwdtree, there are two big differences. The first one you've already mentioned, which is that fwdtree will try to insert fillers between each phone.
The second one is that, because fwdtree thinks each phone is a "word", it will only search the single-word triphones for them. This means that it is not actually using a good chunk of your acoustic model.
Also, left and right contexts at word boundaries are approximated by fwdtree search using "composite senones", which means that you're not getting full triphone modeling.
In practice I think using fwdtree for "phone" decoding is about 20-30% less accurate (relative).
Hello,
As mentioned above, the fillerdict is ignored in allphone mode. How can I make sure
that BOTH fillers and phones are recognized in allphone mode?
If I add fillers as phones, e.g. BREATH +BREATH+ in the dictionary, it doesn't
work because my LM doesn't contain fillers. My phone bigram is built from a
phonetic transcription. How do I specify fillers in the LM? Would adding them as
unigrams with some probability work (as below),
or is some other strategy advised for choosing the filler probability?
You can build a new LM that includes fillers from the phonetic transcription of a
medium-sized text. Ideally you would have real-world training material, but you
can also insert fillers into the training phonetic text at a chosen rate.
The whole purpose of the LM in allphone mode is to estimate phone sequence
probabilities. You only need to make this estimate accurate enough.
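One way to approximate this when no real filler-annotated data is available is to sprinkle fillers into the phonetic training text before building the LM. A rough sketch: the function name, the 5% rate and the sample phone line are made up for illustration, and the filler names are taken from the filler dictionary quoted earlier in the thread.

```python
import random

def add_fillers(phone_lines, fillers, rate, seed=0):
    """Insert a randomly chosen filler before each phone with probability `rate`."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    out = []
    for line in phone_lines:
        tokens = []
        for phone in line.split():
            if rng.random() < rate:
                tokens.append(rng.choice(fillers))
            tokens.append(phone)
        out.append(" ".join(tokens))
    return out

if __name__ == "__main__":
    fillers = ["+BREATH+", "+COUGH+", "+SMACK+", "+UH+", "+UM+"]
    text = ["AA AH AO AE AH AA"]  # toy phonetic transcription
    for line in add_fillers(text, fillers, rate=0.05):
        print(line)
```

The augmented text would then go through whatever LM toolkit built the original phone LM, so the fillers get n-gram estimates like any other phone. The rate is something to tune against real recordings if any are available.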