Using SphinxTrain with phonemic transcriptions

  • Ivan Uemlianin

    Ivan Uemlianin - 2002-10-28

    I've accumulated a (not large) training dataset from a number of sources, and it turns out that some of the data has been phonemically as well as orthographically transcribed.  The transcription is basically a list of phonemes with timestamp and duration for each one.

    Naively, it seems to me that it might be useful to use this phonemic transcription instead of the orthographic one: all the uncertainty in training that comes from variation in how words are pronounced would be bypassed.

    1.  Can anyone tell me whether this is too naive, or is it an 'empirical question'?

    2.  Would I need to treat the phonemic transcription as an orthographic one - and put the phoneme symbols into the PD - or is there a more economical way of doing it?

    3.  Is there anything I can do with the timestamps?

    I've read tinydoc, the manual and the FAQ, but nothing has leapt out at me on this issue (I should say I've read *through* the manual).  If I've missed something, or if there are other docs I should consult, please point me in the right direction.

    Thanks and best wishes

    Ivan

    • brabus

      brabus - 2007-04-03

      I am also interested in whether this is possible...

    • David Huggins-Daines

      Hi!

      Yes, you can do this. I have trained models for phoneme recognition this way (from TIMIT) and they work pretty well.

      If you want to use the resulting models for connected-word recognition, they will work, but they will probably not be as good as models trained from word transcripts (provided the word dictionary and transcripts have all the relevant pronunciation variants in them).

      The reason is that context-dependent phones in Sphinx take into account not just the previous and next phonemes but also the position in the word. This often makes no difference (because the resulting context-dependent phones end up sharing the same tied state sequence anyway), but sometimes it does.

      If you train from phonemic transcripts in the way you're suggesting, only the "single-phone word" triphones will be trained from the data in question. I don't have any empirical results to say for sure that this is worse, but it probably is.
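
      To make the word-position distinction concrete, here is a rough illustration (the exact label format depends on the model definition file, so treat the notation as a sketch). Trained from a word transcript, the EH in "hello" becomes a word-internal triphone; trained from a bare phone string, every phone is its own single-phone word:

      EH (left HH, right L, position i)   from the word transcript
      EH (left HH, right L, position s)   from the phone string

      SphinxTrain distinguishes the word positions b (begin), i (internal), e (end) and s (single-phone word), so these are two different triphones and can be assigned different tied states.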

      If you do have word boundary information, what you should do is create new words in the dictionary and adjust your transcript to use them. For example, if you had this transcription (with # marking a word boundary):

      HH EH L OW # W ER L D

      You could change it to this:

      HH_EH_L_OW W_ER_L_D

      And add these entries to the dictionary:

      HH_EH_L_OW HH EH L OW
      W_ER_L_D W ER L D
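
      If you have many utterances, this conversion is easy to script. Here is a minimal sketch in Python, assuming one utterance per line with phonemes separated by spaces and # marking word boundaries; the file names are placeholders, not anything SphinxTrain requires:

      # Convert a boundary-marked phonemic transcript into compound-word
      # utterances plus the matching dictionary entries.
      words = set()

      with open("phones.txt") as fin, open("transcript.txt", "w") as fout:
          for line in fin:
              # Split each utterance into words at the "#" markers.
              groups = [seg.split() for seg in line.split("#")]
              groups = [g for g in groups if g]
              # Join each word's phonemes with "_" to form a compound word.
              fout.write(" ".join("_".join(g) for g in groups) + "\n")
              words.update(tuple(g) for g in groups)

      # One dictionary entry per distinct compound word,
      # e.g. "HH_EH_L_OW HH EH L OW".
      with open("phones.dic", "w") as dic:
          for w in sorted(words):
              dic.write("_".join(w) + " " + " ".join(w) + "\n")

      The two output files can then play the role of the transcript and dictionary in an ordinary word-based training run.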

