The Sphinx II user docs say that forced alignment can be accomplished using the finite-state grammar capability.
With a text transcription of my utterance, I'm assuming that I would fill out an s2_fsg_trans_t struct for each word transition. What should I use for the "prob" variable, which stores the transition probability? For each transition, would I then have to create a null transition where the probability is (1-prob)?
Thanks!
I don't know about using the recently-added FSG capability, but the built-in batch-mode forced-alignment capability as described under http://cmusphinx.sourceforge.net/sphinx2/#sec_allphone_api should be simpler.
I have also used the s3align program in the old, "slow" Sphinx3 distribution with good results.
cheers,
jerry
Thanks Jerry,
I've used the traditional forced alignment API with much success for shorter utterances. The problem I am having is that for longer utterances, all the paths are getting pruned out. I am hoping to have better results with the FSG on longer utterances (and utterances with poor audio quality).
I am not using the forced alignment for training, as most applications do. If I were, I would just cut up the audio into smaller chunks, as has been recommended on the forums. For me, the forced alignment API is a critical part of the Sphinx solution. I am evaluating Sphinx for use in a facial animation solution for video games. The end user would input a WAV file and an optional text file, and we would generate the animation data. So we need Sphinx to be able to give us its best guess even when the utterances are long. Other solutions we have looked at can do this, and I'm hoping that Sphinx's FSG capabilities will be more robust in the scenario of longer utterances. Any help would be appreciated.
Doug
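For reference, the linear-chain grammar described in the original question could be sketched roughly as below. This is only an illustration under stated assumptions: word_arc_t and build_linear_arcs are hypothetical names made up for this sketch, not the real s2_fsg_trans_t or any Sphinx2 call, and the field layout is a guess at what such a transition needs (source state, destination state, probability, word). It also assumes that, because each state in a forced-alignment chain has exactly one outgoing word arc, a probability of 1.0 is a reasonable value and no (1-prob) null transition is needed; whether Sphinx2 actually expects this has not been confirmed in this thread.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for one word transition.
 * NOT the real s2_fsg_trans_t; the fields are assumptions. */
typedef struct {
    int from_state;    /* state the arc leaves */
    int to_state;      /* state the arc enters */
    float prob;        /* transition probability */
    const char *word;  /* word emitted on this arc */
} word_arc_t;

/* Build one arc per transcript word: state i --words[i]--> state i+1.
 * A transcript of n words therefore uses n+1 states, with state 0 as
 * the start and state n as the final state.
 * Returns a malloc'd array of n_words arcs (caller frees), or NULL if
 * n_words <= 0 or allocation fails. */
static word_arc_t *build_linear_arcs(const char *const *words, int n_words)
{
    assert(words != NULL);
    if (n_words <= 0)
        return NULL;

    word_arc_t *arcs = malloc((size_t)n_words * sizeof *arcs);
    if (arcs == NULL)
        return NULL;

    for (int i = 0; i < n_words; i++) {
        arcs[i].from_state = i;
        arcs[i].to_state = i + 1;
        /* Assumption: a single outgoing arc per state, so all of the
         * probability mass goes to it and nothing is left over for a
         * competing null transition. */
        arcs[i].prob = 1.0f;
        arcs[i].word = words[i];
    }
    return arcs;
}
```

Each element of the returned array would then be copied into whatever transition structure the FSG API actually requires. Optional silence or filler words between transcript words are not modelled here.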