I would like to use output from Sphinx4, edited and corrected, as input to
another round of Sphinx acoustic training.
An initial system, perhaps trained with a corpus that has been edited by hand,
will create language and acoustic models for a recognizer. Say, with 1 hour of
training data.
I want that recognizer to aid in the production of more data that can, in
turn, be used to train a better recognizer. Say, with two hours of training
data.
I can see that it would help to partially recognize the sound files and use the
output to prepare training data. The partially recognized files would still be
edited and corrected by hand, but that should be easier than starting with
nothing recognized at all.
The output of one recognizer can give me some of the clues for the next round.
I can write out each word it recognizes, correctly or incorrectly, along with
the timing tags needed to turn it into training data.
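As a minimal sketch of that step, the following assumes the recognizer output has already been extracted as (word, start time, end time) tuples, and writes one transcript line in the usual SphinxTrain style of "&lt;s&gt; WORDS &lt;/s&gt; (utt_id)"; the input format and the utterance id are illustrative, not something Sphinx4 emits in exactly this shape:

```python
def to_transcript_line(words, utt_id):
    """Turn timed word hypotheses into one training transcript line.

    words: list of (word, start_seconds, end_seconds) tuples.
    The timings are kept by the caller for hand correction; only the
    word sequence goes into the transcript line itself.
    """
    text = " ".join(w for (w, _start, _end) in words)
    return "<s> %s </s> (%s)" % (text, utt_id)

hyp = [("HELLO", 0.12, 0.48), ("WORLD", 0.50, 0.95)]
print(to_transcript_line(hyp, "utt_001"))
# -> <s> HELLO WORLD </s> (utt_001)
```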
I have kept sentence endings, so that I can use them to break recognized
output into sentences and translate it back into training data.
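The splitting itself can be sketched like this; "&lt;/s&gt;" is assumed here as the sentence-end marker the recognizer was configured to keep, so substitute whatever token your setup actually emits:

```python
def split_sentences(tokens, end_marker="</s>"):
    """Split a flat stream of recognized tokens into sentences."""
    sentences, current = [], []
    for tok in tokens:
        if tok == end_marker:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(tok)
    if current:  # trailing words without a final end marker
        sentences.append(current)
    return sentences

stream = ["HELLO", "WORLD", "</s>", "HOW", "ARE", "YOU", "</s>"]
print(split_sentences(stream))
# -> [['HELLO', 'WORLD'], ['HOW', 'ARE', 'YOU']]
```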
It seems, however, that all fillers are treated the same. I am finding <sil>
reported in places where I know there must have been ++breath++ or ++noise++.
To make the next round of training data, however, I would like to get at the
different fillers so that I can write them out to the next round of
transcripts. So far, I have been unable to figure out how to do this.
Besides any specific help, I would appreciate any stories about "roundtrip"
training.
LAT
You can use fillers like <sil>, etc., as described in the documentation; there
are three fillers which you can use for your filler dictionary.
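For reference, a filler dictionary in the SphinxTrain convention maps filler words to filler phones, along the lines of the following; the exact entries and names here are illustrative and should be taken from your own setup:

```
<s>         SIL
</s>        SIL
<sil>       SIL
++BREATH++  +BREATH+
++NOISE++   +NOISE+
```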
I would like to use output from Sphinx4, edited and corrected, as input to
another round of Sphinx acoustic training.
This process is generally referred to in the literature as unsupervised
training, and there are many papers describing how it is done.
It seems, however, that all fillers are treated the same. I am finding <sil>
reported in places where I know there must have been ++breath++ or ++noise++.
This needs to be supported by numbers. What is the filler error rate? What
percentage of fillers is incorrectly recognized? If it's really large, that's
a problem. With a good initial model it should recognize them correctly,
though I have never checked how good it is.
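The number in question could be measured like this; the sketch assumes the filler events have already been aligned between a hand-checked reference and the recognizer output (for example by time overlap), which is the hard part not shown here:

```python
def filler_error_rate(ref_fillers, hyp_fillers):
    """Fraction of aligned filler events whose label differs from the
    hand-checked reference label."""
    assert len(ref_fillers) == len(hyp_fillers)
    wrong = sum(1 for r, h in zip(ref_fillers, hyp_fillers) if r != h)
    return wrong / len(ref_fillers)

# Illustrative labels: the recognizer collapsed everything to <sil>.
ref = ["++breath++", "<sil>", "++noise++", "<sil>"]
hyp = ["<sil>",      "<sil>", "<sil>",     "<sil>"]
print(filler_error_rate(ref, hyp))
# -> 0.5
```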
On this subject I have two thoughts:
1. There is a case for not retraining fillers at all. Only context-independent models are kept for them, and those should already be good given your initial data. Unfortunately, SphinxTrain doesn't support keeping a few models as they are while updating the others.
2. I would also try removing fillers from the recognizer output and inserting them during the forced-alignment step of training. This could potentially be more useful.
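The second idea amounts to a simple token filter before alignment; the filler set below mirrors a typical filler dictionary and is an assumption to adapt to your own:

```python
# Filler tokens to strip before forced alignment re-inserts them.
FILLERS = {"<s>", "</s>", "<sil>", "++breath++", "++noise++"}

def strip_fillers(tokens):
    """Remove all filler tokens from a recognizer hypothesis."""
    return [t for t in tokens if t not in FILLERS]

hyp = ["<s>", "hello", "<sil>", "world", "++breath++", "</s>"]
print(strip_fillers(hyp))
# -> ['hello', 'world']
```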
Besides any specific help, I would appreciate any stories about "roundtrip"
training.
It's important to develop a good confidence measure to check the first-round
recognizer output. Such a confidence measure could use external properties to
strip incorrectly transcribed utterances.
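Once a per-utterance confidence score is available, the stripping step is straightforward; the scores and the 0.9 threshold below are illustrative, and in practice the scores would come from the recognizer's lattice or posteriors:

```python
def filter_by_confidence(utterances, threshold=0.9):
    """Keep only utterances whose confidence clears the threshold.

    utterances: list of (utt_id, transcript, confidence) tuples.
    Returns (utt_id, transcript) pairs for the next training round.
    """
    return [(uid, text) for (uid, text, conf) in utterances
            if conf >= threshold]

utts = [("utt_001", "hello world", 0.97),
        ("utt_002", "how are you", 0.62),
        ("utt_003", "good morning", 0.91)]
print(filter_by_confidence(utts))
# -> [('utt_001', 'hello world'), ('utt_003', 'good morning')]
```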
ramsdoe: I don't understand your comment. LAT
Hello daktari3
You can start with this paper:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.8634
But of course there are more recent ones, such as:
http://www.bbn.com/resources/pdf/icassp07_unsupervised_training.pdf