
Round trip / recycling training for Sphinx

Daktari3
2010-07-22
2012-09-22
  • Daktari3

    Daktari3 - 2010-07-22

    I would like to use output from Sphinx4, edited and corrected, as input to
    another round of Sphinx acoustic training.

    An initial system, perhaps trained with a corpus that has been edited by hand,
    will create language and acoustic models for a recognizer. Say, with 1 hour of
    training data.

    I want that recognizer to aid in the production of more data that can, in
    turn, be used to train a better recognizer. Say, with two hours of training
    data.

    I can see that it would help to partially recognize sound files and use the
    result to prepare training data. The partially recognized files would still
    be edited and corrected by hand, but that should be easier than starting
    with nothing recognized at all.
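    The loop described above can be sketched as follows. The three pluggable
    steps are placeholders for the real SphinxTrain / Sphinx4 stages and the
    manual correction pass, not actual APIs:

    ```python
    # Minimal sketch of the round-trip training loop. The train, recognize
    # and hand_correct callables stand in for the real SphinxTrain / Sphinx4
    # stages and the manual edit/correction pass.

    def round_trip(seed_corpus, raw_audio, train, recognize, hand_correct,
                   rounds=2):
        corpus = list(seed_corpus)
        for _ in range(rounds):
            model = train(corpus)               # retrain on everything so far
            hyps = recognize(model, raw_audio)  # decode the unlabelled audio
            corpus += hand_correct(hyps)        # corrected output grows the corpus
        return corpus
    ```

    Each round trains on the corpus accumulated so far, so the second-round
    recognizer benefits from the corrected first-round output.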

    The output of one recognizer can give me some of the clues for the next
    round. I can write out each word it recognizes, correctly or incorrectly,
    with the timing tags needed to make it training data.

    I have kept sentence endings, so that I can use them to break recognized
    output into sentences and translate it back into training data.
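    Writing recognized words back out as a training transcript can be sketched
    like this. The `(utt_id)` suffix is the usual SphinxTrain transcript
    convention; the `(word, start, end)` input format is an assumption about
    what is extracted from the recognizer result:

    ```python
    # Sketch: turn recognizer output (word, start_sec, end_sec) tuples into a
    # SphinxTrain-style transcript line for the next training round. The
    # timings would be used separately to cut the audio into utterances.

    def to_transcript(words, utt_id):
        """words: list of (token, start_sec, end_sec) from the recognizer."""
        tokens = " ".join(w for w, _, _ in words)
        return "<s> %s </s> (%s)" % (tokens, utt_id)

    hyp = [("hello", 0.10, 0.55), ("world", 0.60, 1.10)]
    print(to_transcript(hyp, "utt_0001"))
    # -> <s> hello world </s> (utt_0001)
    ```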

    It seems, however, that all fillers are treated the same. I am finding <sil>
    reported in places where I know that there must have been ++breath++ or
    ++noise++.

    To make the next round of training data, however, I need to distinguish the
    different fillers so that I can write them out correctly for the next round
    of training. So far, I have been unable to figure out how to do this.

    Besides any specific help, I would appreciate any stories about "roundtrip"
    training.

    LAT

     
  • rams

    rams - 2010-07-23

    You can use fillers like <sil>, as described in the documentation. There
    are three fillers you can use for your filler dictionary.
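    For reference, a SphinxTrain filler dictionary maps each filler word to a
    filler phone. Entries like these are typical (the +BREATH+ and +NOISE+
    phones must also appear in the phone list; the exact set depends on your
    setup):

    ```
    <s>          SIL
    </s>         SIL
    <sil>        SIL
    ++BREATH++   +BREATH+
    ++NOISE++    +NOISE+
    ```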

     
  • Daktari3

    Daktari3 - 2010-07-24

    ramsdoe: I don't understand your comment. LAT

     
  • Nickolay V. Shmyrev

    Hello daktari3

    I would like to use output from Sphinx4, edited and corrected, as input to
    another round of Sphinx acoustic training.

    This process is generally referred to in the literature as unsupervised
    training, and there are many papers describing how to do it.

    It seems, however, that all fillers are treated the same. I am finding <sil>
    reported in places where I know that there must have been ++breath++ or
    ++noise++.

    This needs to be supported by numbers. What is the filler error rate? What
    percentage of fillers is incorrectly recognized? If it's really high,
    that's a problem. With a good initial model it should recognize them
    correctly, though I have never checked how good it is.

    On this subject I have two thoughts:

    1. It may make sense not to retrain fillers at all. Only context-independent models are kept for them, and those should already be good from your initial data. Unfortunately, SphinxTrain doesn't support keeping a few models as they are while updating the others.

    2. I would also try removing fillers from the recognizer output and inserting them in the forced-alignment step of training. This could potentially be more useful.
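    The first half of point 2, stripping fillers from first-round hypotheses so
    forced alignment can re-insert them, could look roughly like this. The
    token names follow the usual Sphinx conventions; the one-hypothesis-per-line
    input format is an assumption:

    ```python
    # Sketch: strip filler tokens from a first-round hypothesis line so the
    # forced-alignment step can place fillers itself. Comparison is
    # case-insensitive because transcripts vary in casing.

    FILLERS = {"<S>", "</S>", "<SIL>", "++BREATH++", "++NOISE++"}

    def strip_fillers(hyp_line):
        return " ".join(t for t in hyp_line.split()
                        if t.upper() not in FILLERS)

    print(strip_fillers("<s> hello <sil> world ++breath++ </s>"))
    # -> hello world
    ```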

    Besides any specific help, I would appreciate any stories about "roundtrip"
    training.

    It's important to develop a good confidence measure to check the
    first-round recognizer output. Such a confidence measure could use external
    properties to strip incorrectly transcribed utterances.
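    Once you have per-utterance confidences, the filtering step itself is
    simple. Here is a minimal sketch; the tuple format and the 0.9 cutoff are
    illustrative assumptions, and in practice the threshold would be tuned on
    held-out data:

    ```python
    # Sketch: keep only utterances whose confidence clears a threshold before
    # adding them to the next round's training set.

    def filter_by_confidence(utts, threshold=0.9):
        """utts: list of (utt_id, transcript, confidence) tuples."""
        return [(uid, text) for uid, text, conf in utts if conf >= threshold]

    utts = [("utt_0001", "hello world", 0.95),
            ("utt_0002", "noisy guess", 0.42)]
    print(filter_by_confidence(utts))
    # -> [('utt_0001', 'hello world')]
    ```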

    You can start with this paper

    http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.8634

    But of course there are more recent ones like

    http://www.bbn.com/resources/pdf/icassp07_unsupervised_training.pdf

     
