I am currently working on a comparison of a number of speech recognition toolkits, including HTK, ISIP and Sphinx. I've created models in HTK, but now I'd like to convert them to the Sphinx 3 format, so that I can evaluate the performance of the Sphinx 3 decoder.
There are two things that I'd like to ask:
Has anyone already succeeded in creating a converter from HTK to Sphinx 3 or 4 (I believe these formats are the same)? A number of people on this forum have said that they were working on one, but there has never been a follow-up saying whether or not they succeeded.
Could anyone give me some information about the file formats used for the Sphinx 3 models? The documentation about the file formats (http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#4b) doesn't seem to be correct. (I don't see any correspondence between the documentation and the actual files, at least.) The documentation implies that the models are in ASCII form, but all the models I have seen are binary with a textual header which isn't mentioned in the documentation.
If a converter doesn't exist (which is what I expect), I'd like to write one to convert my HTK models to Sphinx 3 format.
I would be thankful for any help you could give me!
Best regards,
Wout
Isn't it easier to train a Sphinx model on the same data?
I don't know, as I haven't tried for two reasons:
Because I want to compare the performance (WER and time) of a number of decoders, I'd like to have them use the same data as far as possible. This includes the feature vectors, language model and HMMs. If I retrain using Sphinx, it will be difficult to get the same model, I believe.
After looking at the tutorial and the documentation for SphinxTrain, I think that recreating the models and getting to know SphinxTrain and its (many) configuration options would take more time than converting the HTK models. I know from using HTK that it takes a long time to figure out the training process.
I'd be happy to hear from you if you think I'm wrong ;-).
Best regards,
Wout Maaskant
Gosh, that documentation really isn't very good. The model files are binary files. What that documentation is describing is the ASCII output of the 'pdump' program which displays their contents.
Nobody has completed a converter at this point. I have about zero familiarity with HTK (and in some sense I'd like to keep it that way since HTK is not open-source) but I'd like to be able to interoperate with its model files. As far as I know the basic structure of Sphinx3 and HTK models is the same, only the file format is different.
Briefly, the Sphinx3 acoustic model consists of a model definition file, Gaussian parameter files, a mixture weights file, and a transition matrices file. There is also an optional tied-state to codebook mapping file which is (unfortunately) not supported by all decoders at this time. In the absence of this file the acoustic model is assumed to be either fully continuous (each tied state has its own set of Gaussians and mixture weights) or semi-continuous (all tied states share a single set of Gaussians, but have their own mixture weights).
The model definition file is text (but PocketSphinx has a binary format which will probably make its way into the other decoders). Its purpose is to map triphones to transition matrix IDs and to sequences of state distribution IDs. I believe this file is documented correctly: http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#4b http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#24 http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#30
The Gaussian parameter files are binary files with a short text header. There is one parameter file for the means and another one for the (diagonal) variance vectors. (full covariances are also supported but the file format is the same - the parameter vectors are just the matrices in row-major order, yes, this isn't very efficient). This header looks like:
s3
<key> <value>
...
endhdr
In other words, the first line is 's3', followed by whitespace-separated key/value pairs, followed by 'endhdr'. The binary part of the file follows. The first four bytes are a byte-order marker, which is 0x11223344 in the byte order of the file. In other words, if the file is little-endian, it will consist of the four bytes 0x44, 0x33, 0x22, 0x11, and if the file is big-endian, they will be in the opposite order.
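In Python, reading the header up to and including the byte-order marker might look something like this (an untested sketch; the function name is mine):
import struct

def read_s3_header(fh):
    # First line must be 's3'.
    assert fh.readline().strip() == b"s3"
    info = {}
    while True:
        line = fh.readline().strip()
        if line == b"endhdr":
            break
        key, _, value = line.partition(b" ")
        info[key.decode()] = value.decode()
    # The next four bytes are 0x11223344 in the byte order of the file.
    marker = fh.read(4)
    if struct.unpack("<I", marker)[0] == 0x11223344:
        endian = "<"  # little-endian
    elif struct.unpack(">I", marker)[0] == 0x11223344:
        endian = ">"  # big-endian
    else:
        raise ValueError("bad byte-order marker")
    return info, endian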
The next four bytes are the number of codebooks (Gaussian mixture models) in the file.
The next four bytes are the number of feature streams (this is usually 1).
The next four bytes are the number of densities per codebook.
For each feature stream, in order, there are four bytes containing the dimensionality of that stream (usually there is one stream, and its dimensionality is 39).
Finally, the next four bytes are the number of 32-bit floating-point numbers in the file. This should be equal to the product of the number of codebooks and densities and the total dimensionality of all streams.
The actual data follows. It consists of an array in row-major order of 32-bit floating-point numbers in the form [codebook][feature][density][dimension]. In other words, the values are ordered:
codebook 0 feature 0 density 0 [1..39]
codebook 0 feature 0 density 1 [1..39]
...
codebook N feature M density K [1..39]
There is an optional checksum at the end of the file (if the header contains "chksum0 yes", then it is present). I don't usually bother with this, so you'll have to look at the code to see how it's computed.
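Putting the above together, a sketch for reading a whole means or variances file with NumPy (again untested; it assumes a single feature stream and reuses read_s3_header from above):
import numpy as np

def read_s3_gaussian(path):
    with open(path, "rb") as fh:
        info, endian = read_s3_header(fh)
        u32, f32 = np.dtype(endian + "u4"), np.dtype(endian + "f4")
        n_cb, n_feat, n_density = np.fromfile(fh, u32, 3)
        veclen = np.fromfile(fh, u32, n_feat)  # dimensionality per stream
        n_values = int(np.fromfile(fh, u32, 1)[0])
        assert n_values == n_cb * n_density * veclen.sum()
        data = np.fromfile(fh, f32, n_values)
        # An optional checksum may follow; it is ignored here.
    # Reshape to [codebook][feature][density][dimension] (single stream only;
    # multiple streams would need a ragged split on the last axis).
    return data.reshape(int(n_cb), int(n_feat), int(n_density), int(veclen[0]))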
The mixture weights file is mostly the same. The header and byte-order markers are exactly the same. Following them, there are:
A four-byte integer containing the number of state output distributions (senones, GMMs)
A four-byte integer containing the number of feature streams
A four-byte integer containing the number of densities per mixture
The mixture weight data follows. Again, it is a 3-dimensional array of 32-bit floats indexed by [senone][feature][density]. An important thing to note is that the mixture weights are stored in UNNORMALIZED format. That is, they don't sum to one across each senone, so you may wish to normalize them before using them. (The reason they are in this format is that it allows the same files to be used for accumulating mixture-weight counts in training. As a happy accident, it also makes MAP adaptation much easier.)
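If you want actual probabilities, normalizing is a one-liner in NumPy; a minimal sketch (the function name and the small floor value are mine):
import numpy as np

def normalize_mixw(mixw, floor=1e-7):
    # mixw: array of shape (n_senone, n_feat, n_density) read from the file.
    # Add a small floor so all-zero rows don't divide by zero, then
    # normalize over the density axis so each senone's weights sum to one.
    mixw = mixw.astype(np.float64) + floor
    return mixw / mixw.sum(axis=2, keepdims=True)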
Finally, the transition matrix file is, again, of the same general format, with the same header and byte-order markers. After the byte-order marker, there are:
A four-byte integer containing the number of transition matrices (this is usually, but doesn't have to be, the same as the number of context-independent phones)
A four-byte integer containing the number of emitting states per phone
A four-byte integer containing the total number of states per phone (i.e. the previous number + one non-emitting final state)
And there is yet another 3-dimensional array of 32-bit floats. For much the same reason as the mixture weights, the transition matrices are also UNNORMALIZED.
Hope this helps! There is Python/NumPy code which reimplements these file formats in SphinxTrain/python/sphinx - look at s3gau.py, s3tmat.py, and s3mixw.py.
> And there is yet another 3-dimensional array of 32-bit floats. For much the same reason as the mixture weights, the transition matrices are also UNNORMALIZED.
I can't figure out what the three dimensions are. Shouldn't this just be a list of transition matrices, each consisting of [number of states]^2 32-bit floats?
Right, the first dimension is the transition matrix ID :) For 3-state HMMs with 40 context-independent phones, you have a 40x3x4 array.
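Normalizing them is analogous to the mixture weights, just row by row within each matrix; a sketch (the function name is mine):
import numpy as np

def normalize_tmat(tmat):
    # tmat: shape (n_tmat, n_emit, n_state) of unnormalized counts.
    # Make each row of each matrix sum to one; all-zero rows
    # (disallowed transitions) are left at zero.
    totals = tmat.sum(axis=2, keepdims=True)
    return np.where(totals > 0, tmat / np.where(totals > 0, totals, 1), 0.0)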
Thanks for your reply yesterday!
Am I correct in assuming that the state indices used in the model definition are implicitly defined by the ordering used in the means, variances and mixture weights files? And that the order of the related means, variances and mixture weights in those files should therefore be the same?
Have you had a chance to look at my question about the position column in the model definition (https://sourceforge.net/forum/message.php?msg_id=4356228)?
I currently have a parser for the HTK model file in ASCII format and I can output the Sphinx 3 transition matrices file.
Best regards,
Wout
Hi,
I'm making progress: the tmat, mean, var and mixw files can be converted, and the mdef is almost finished. I haven't checked whether Sphinx 3 will accept the files, though.
I hope you don't mind that I keep asking all kinds of detailed questions, because I've got two more:
You mentioned that Sphinx 3 expects monophone models to be present for the base phone of every triphone model. In the tutorial, all triphone models share their transition matrix with the monophone model for their base phone. Is this a requirement of Sphinx?
My other question concerns one of the variables at the start of the model definition file:
Do I understand correctly that n_tied_tmat is the number of unique transition matrices used (and the same as the number of transition matrices in the tmat file), including those for triphone models?
I am checking this because this parameter is described in the tutorial as "The HMM for each CI phone has a transition probability matrix associated it. This is the total number of transition matrices for the given set of models." Because this description is given in the CI part of the tutorial, I wasn't sure if CI phones were mentioned because of the location in the tutorial or because this parameter /really/ only concerns the base phones.
Best regards,
Wout
> - You mentioned that Sphinx 3 expects monophone models to be present for the base phone of every triphone model. In the tutorial, all triphone models share their transition matrix with the monophone model for their base phone. Is this a requirement of Sphinx?
It's not a requirement, but there are some parts of the decoder that assume it to be true. This is true in particular for the "fast" (lexicon tree based) decoder, which is the standard one in Sphinx3.6 - where "composite" HMMs are used to represent multiple triphones with the same base phone (and different senone sequences), it is assumed that they share the same transition matrix.
In practice it probably doesn't matter much because the transition probabilities for different triphones are unlikely to be sufficiently different to affect accuracy.
> - Do I understand correctly that n_tied_tmat is the number of unique transition matrices used (and the same as the number of transition matrices in the tmat file), including those for triphone models?
Yes, this should be the same as the number in the tmat file. Or, rather, it should be the maximum of the 'tmat' column in the list of phone definitions. The confusion stems from the fact that this is usually the same as the number of CI phones, due to the assumption mentioned above.
> Gosh, that documentation really isn't very good. The model files are binary files. What that documentation is describing is the ASCII output of the 'pdump' program which displays their contents.
Ah, that explains it :-)
Thanks very much for all the information you've given me! It will take some time to study it and (try to) use it, but I'll let you know what happens.
You mention that the mixture weights and transition matrices are unnormalized in Sphinx. In HTK these are normalized and, according to the FAQ (http://www.speech.cs.cmu.edu/sphinxman/FAQ.html), it is possible to convert the transition matrices by taking logbase 1.0001 of the probabilities. Do you know whether this also holds for the mixture weights?
In the Sphinx documentation, and now in your reply as well, I noticed that an HMM structure of a few (say 3) emitting states and 1 non-emitting state is assumed. The models I created with HTK have 5 states, of which 3 are emitting. Is this a real difference, or is it just a different way of describing the same thing (i.e. does Sphinx implicitly create a non-emitting state before each HMM)? The sp model I'm using is a 'tee-model'.
Best regards,
Wout
The Sphinx decoders implicitly assume that the first state in a model is the initial state, and that it is an emitting state.
Given that the system only works with left-to-right models, this is a fairly reasonable assumption. I guess HTK allows an explicit non-emitting start state because it is able to handle a more general class of HMM topologies.
What this means in practice is that if you have more than one transition out of your non-emitting initial state, then you can't use your model with Sphinx. In particular "tee" models where you can transition from the non-emitting initial state directly to the non-emitting final state, thereby skipping an HMM entirely, can't be decoded by Sphinx.
What you could do to hack around this problem is just remove the initial non-emitting state completely, which has the effect of declaring the first emitting state to be the only possible start state. I don't know how much this will affect accuracy.
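For a standard HTK 5-state prototype (states 1 and 5 non-emitting) that boils down to slicing the transition matrix; a sketch of what I mean, assuming HTK's rows already sum to one:
import numpy as np

def htk_to_sphinx_tmat(htk_trans):
    # htk_trans: 5x5 HTK matrix; state 1 is the non-emitting initial state,
    # state 5 the non-emitting final state. Dropping state 1 leaves rows for
    # the 3 emitting states and columns for those states plus the final one.
    htk_trans = np.asarray(htk_trans)
    assert htk_trans.shape == (5, 5)
    return htk_trans[1:4, 1:5]  # a 3x4 Sphinx-style matrix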
Mixture weights for Sphinx are stored as floating-point numbers in the file so all you have to do is normalize them to get probabilities.
If you have normalized mixture weights already (from HTK, say), and you want to create a Sphinx mixture-weight file from them, you don't have to do anything to them. Sphinx will happily normalize them and get the exact same numbers back :-)
The logbase of 1.0001 thing is done internally to the decoder and has nothing to do with the model file formats. The only place it matters is when you are looking at the decoder outputs.
Thank you for your reply.
The initial (non-emitting) state in my HTK models only has a transition to the second (emitting) state, so it's no problem that the second state becomes the initial state. According to the tutorial (http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html), Sphinx adds a non-emitting state to the end (search for "n_state_map" to find it). But which states will have a transition to this state? I only want the last emitting state to have such a transition.
I will have to figure out what to do with my sp and sil models. I already mentioned that sp is a "tee" model. sil currently has these transitions:
1: 2
2: 2 3
3: 3 4
4: 2 4 5
5:
How are short pauses between words (sp) and silences before and after sentences (sil) generally modeled in Sphinx?
It's really nice that there are Python scripts to read and write S3 model files. I've had a look and I think I'll try to use them for my conversion scripts.
Best regards,
Wout
Sphinx uses silence and filler models that are trained the same as other context-independent phones, and have the same topology. The decoder inserts transitions to them with a flat probability in between all words.
The final non-emitting state is actually present in the transition matrices for Sphinx. Usually the only transition into it is from the final emitting state.
For the Sphinx3.6 "fast" decoder (this is NOT well documented, sorry...) there are only two allowable topologies, either a 3-state left-to-right:
x x 0 0
0 x x 0
0 0 x x
or a 5-state Bakis:
x x x 0 0 0
0 x x x 0 0
0 0 x x x 0
0 0 0 x x x
0 0 0 0 x x
For the "slow" decoder the only requirement is that the transition matrix be upper-triangular.
The 3-state L-to-R topology is exactly what I want to use :-).
I have a few questions concerning the model definition file format. I'll post my questions in this thread, as the main discussion happens here ;-).
> Sphinx uses silence and filler models that are trained the same as other context-independent phones, and have the same topology. The decoder inserts transitions to them with a flat probability in between all words.
Do I understand correctly that I do need to define those models in my model definition file, and use "filler" in the "attrib" column? (I.e. the silence and filler models are not defined implicitly by Sphinx?)
In the tutorial (http://www.speech.cs.cmu.edu/sphinxman/scriptman1.html#30) the monophone (context-independent) models are kept in the model definition file after the triphone (context-dependent) models are created. Is this a requirement of Sphinx, or just something that is done in the tutorial? To clarify: the file contains
base lft rt p attrib tmat ... state id's ...
AE - - - n/a 1 3 4 5 N
AX - - - n/a 2 6 7 8 N
but also:
AE B T i n/a 1 15 16 17 N
AE T B i n/a 1 18 16 19 N
AX AX AX s n/a 2 20 21 22 N
AX AX B s n/a 2 23 21 22 N
The models I am currently using with HTK only contain triphone HMMs. Would Sphinx accept a model definition file with just the context-dependent part?
Another question concerns the position ("p") column in the model definition file. The tutorial lists "b", "e", "i" and "s" as valid values, for word-beginning, word-ending, word-internal and single-word triphones, respectively. Is there any difference in how Sphinx handles triphone models based on their position? HTK doesn't record this value, so I will probably just choose a default value for all triphones if there isn't much difference.
Thank you again for your help!
Best regards,
Wout
> The models I am currently using with HTK only contain triphone HMMs. Would Sphinx accept a model definition file with just the context-dependent part?
Not without some hacking... Yes, Sphinx assumes that the context-independent phones are also present in the model. It uses them to speed up GMM computation, and it will also fall back on them if a triphone from the dictionary is not available.
> Do I understand correctly that I do need to define those models in my model definition file, and use "filler" in the "attrib" column? (I.e. the silence and filler models are not defined implicitly by Sphinx?)
Yes, that's correct.
> Another question concerns the position ("p") column in the model definition file. The tutorial lists "b", "e", "i" and "s" as valid values, for word-beginning, word-ending, word-internal and single-word triphones, respectively. Is there any difference in how Sphinx handles triphone models based on their position? HTK doesn't record this value, so I will probably just choose a default value for all triphones if there isn't much difference.
When building word HMMs or lexicon trees, Sphinx will try to use the triphone with the appropriate word position. If it can't find it, then it will try to use any instance of that triphone. If it can't find one, then it will use the context-independent phone. (there are some instances where it doesn't back off to other instances of the triphone and just uses the CI phone - this may be a bug)
You can just pick an arbitrary one and use it. I'd suggest either "i" or "s". But accuracy will probably suffer somewhat.
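In pseudo-Python, the lookup order is roughly this (a sketch only - the dictionary keys and helper are hypothetical, not the actual decoder API):
WPOS = ("b", "e", "i", "s")

def find_hmm(models, base, left, right, wpos):
    # models: maps (base, left, right, wpos) to an HMM;
    # CI phones are assumed stored as (base, None, None, None).
    if (base, left, right, wpos) in models:        # exact word position
        return models[(base, left, right, wpos)]
    for other in WPOS:                             # any instance of the triphone
        if (base, left, right, other) in models:
            return models[(base, left, right, other)]
    return models[(base, None, None, None)]        # back off to the CI phone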
Also, have you seen Keith Vertanen's comparison of Sphinx3 and HTK? He compared models trained with the respective trainers on a couple of tasks. What you're proposing to do is a step further, and I am very interested to see what the results are. See:
http://www.inference.phy.cam.ac.uk/kv227/htk/
http://www.inference.phy.cam.ac.uk/kv227/sphinx/
Btw, this project looks interesting:
http://sourceforge.net/projects/srmc/
Currently it only converts in the opposite direction, from Sphinx3 to HTK, but it will probably be useful for you. Let's hope other formats will be supported too.
Yeah, unfortunately neither Arthur nor Udhay (the two originators of this project) are actively working on it at this point. Udhay works on JANUS now and I'm not sure what Arthur is doing...
I can tell you a little bit more about this project. From what I could see, it supports only monophones so far, not triphones. Some code for triphones was written, but not tested or debugged.
One addendum to what I mentioned before: Internally, the HMM computation code in Sphinx 3.x does use a non-emitting initial state, but it's assumed that it has a single transition to the first emitting state, partly because the model file format doesn't contain an initial non-emitting state.
Also, Sphinx 3.x does allow 3-state Bakis (skip-state) topologies. But nobody uses them.
I'm doing something very similar right now (in my spare time): instead of converting, I'm writing a loader in Sphinx4 for HTK 3ph and 1ph models. I'm nearly done, but there's still a lot to debug.
I have one question though: I'm not familiar with the Sphinx4 algorithms, and I'm a bit surprised that there is no "hmm tying" (the HTK tiedlist)? There is no regression tree in the models either (?), so what happens when a triphone that's not in the models is required during decoding? I thought cross-word triphones were supported, aren't they?
Thanks a lot to a Sphinx4 guru for this help!
I sent an e-mail to the poster of this message: https://sourceforge.net/forum/message.php?msg_id=3224874, to inquire about the results they got. This is what she replied:
We succeeded only partially in using our existing HMM models. The ASR accepted them and made something out of them - but the recognition rate was far from good. We had to make slight changes to the frontend routines in SPHINX 4 - which means that we changed the source code of the recognizer rather than the models.
This work was done two years ago, though, and we stopped without satisfying results.
The results were far better with HTK. We compared very small samples running both HTK and SPHINX4.
I don't want to discourage you, I just thought it might be useful for you to have this information ;-).
I think this thread may also be of use to you: http://sourceforge.net/forum/message.php?msg_id=2674597
> I have one question though: I'm not familiar with the Sphinx4 algorithms, and I'm a bit surprised that there is no "hmm tying" (the HTK tiedlist)?
I'm not really familiar with HTK, but in Sphinx3 (I can't speak for Sphinx4...) we do something called "senone sequence compression". All triphones with the same sequence of output distributions are mapped to a single HMM. Maybe this is what you mean?
> There is no regression tree in the models either (?), so what happens when a triphone that's not in the models is required during decoding? I thought cross-word triphones were supported, aren't they?
I kind of answered this earlier... Sphinx will back off to a context-independent phone.
The regression tree would be a really nice thing to have in the models, because not only would it help with unknown triphones, but we could also use it for speaker adaptation (which I guess is the other thing HTK uses it for)
Also, one thing you should all be aware of is that the "MFCC" features in Sphinx are not the same as the ones used in HTK. The computation of the mel-scale is different and there is a quirk (maybe a bug) in the implementation of the inverse transform.
So you should either implement HTK-style MFCCs in the Sphinx front end, or if you are just doing batch recognition you should use HCopy to generate the feature files and then convert them to Sphinx format.
The Sphinx format for feature files is really stupid, it's just a 32-bit integer listing the number of data points, followed by the features as a row-major array of 32-bit floats. It can be whatever endianness you want.
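So converting HCopy output is mostly a matter of rewriting the array; a sketch, assuming you have already parsed the HTK file into an n_frames x n_dim array and reading "number of data points" as the total float count:
import numpy as np

def write_sphinx_feat(path, feats):
    # feats: n_frames x n_dim array (e.g. parsed from an HTK feature file).
    feats = np.asarray(feats, dtype="<f4")  # little-endian; either order works
    with open(path, "wb") as fh:
        np.array([feats.size], dtype="<i4").tofile(fh)  # total number of floats
        feats.tofile(fh)  # row-major 32-bit float data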
Thanks a lot Wout and David for your answers: indeed, Paul already explained the algorithm for 3ph selection before. (Sorry, I didn't manage to find the message myself; now I'll keep it in a safe place :-) )
I'm aware that the MFCCs in HTK and S4 are different, but it is pretty easy to write something for batch tests, and for live use one can also consider saving the MFCCs computed with S4 on the training corpus, or writing a new frontend - I think this can be solved.
I see that several people have tried to develop this HTKLoader before and then given up. As I already have most of the code, I think I'll give it a try anyway :-) If it happens to work somehow, I'll post again. Otherwise, you won't hear from me again.