Using SphinxTrain with phonemic transcriptions

  • Ivan Uemlianin

    Ivan Uemlianin - 2002-10-28

    I've accumulated a (not large) training dataset from a number of sources, and it turns out that some of the data has been phonemically as well as orthographically transcribed.  The transcription is basically a list of phonemes with timestamp and duration for each one.

    Naively, it seems to me that it might be useful to use this phonemic transcription instead of the orthographic one: all the uncertainty in training that comes from variation in how words are pronounced would be bypassed.

    1.  Can anyone tell me whether this is too naive, or is it an 'empirical question'?

    2.  Would I need to treat the phonemic transcription as an orthographic one - and put the phoneme symbols into the PD - or is there a more economical way of doing it?

    3.  Is there anything I can do with the timestamps?

    I've read tinydoc, the manual and the FAQ, but nothing has leapt out at me on this issue (I should say I've read *through* the manual).  If I've missed something, or if there are other docs I should consult, please point me in the right direction.

    Thanks and best wishes

    Ivan

    • brabus

      brabus - 2007-04-03

      I am also interested in whether this is possible...

    • David Huggins-Daines

      Hi!

      Yes, you can do this. I have trained models for phoneme recognition this way (from TIMIT) and they work pretty well.

      If you want to use the resulting models for connected-word recognition, they will work, but they will probably not be as good as models trained from word transcripts (provided the word dictionary and transcripts have all the relevant pronunciation variants in them).

      The reason is that context-dependent phones in Sphinx take into account not just the previous and next phonemes but also the position in the word. This often makes no difference (because the resulting context-dependent phones end up sharing the same tied state sequence anyway), but sometimes it does.

      If you train from phonemic transcripts in the way you're suggesting, only the "single-phone word" triphones will be trained from the data in question. I don't have any empirical results to say for sure that this is worse, but it probably is.
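
      To make the word-position distinction concrete, here is a rough illustration (the exact label format depends on the model definition file, so treat the notation as a sketch). Trained from a word transcript, the EH in "hello" becomes a word-internal triphone; trained from a bare phone string, every phone is its own single-phone word:

      EH (left HH, right L, position i)   from the word transcript
      EH (left HH, right L, position s)   from the phone string

      SphinxTrain distinguishes the word positions b (begin), i (internal), e (end) and s (single-phone word), so these are two different triphones and can be assigned different tied states.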

      If you do have word boundary information, what you should do is create new words in the dictionary and adjust your transcript to use them. For example, if you had this transcription (with # marking a word boundary):

      HH EH L OW # W ER L D

      You could change it to this:

      HH_EH_L_OW W_ER_L_D

      And add these entries to the dictionary:

      HH_EH_L_OW HH EH L OW
      W_ER_L_D W ER L D
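
      If you have many utterances, this conversion is easy to script. Here is a minimal sketch in Python, assuming one utterance per line with phonemes separated by spaces and # marking word boundaries; the file names are placeholders, not anything SphinxTrain requires:

      # Convert a boundary-marked phonemic transcript into compound-word
      # utterances plus the matching dictionary entries.
      words = set()

      with open("phones.txt") as fin, open("transcript.txt", "w") as fout:
          for line in fin:
              # Split each utterance into words at the "#" markers.
              groups = [seg.split() for seg in line.split("#")]
              groups = [g for g in groups if g]
              # Join each word's phonemes with "_" to form a compound word.
              fout.write(" ".join("_".join(g) for g in groups) + "\n")
              words.update(tuple(g) for g in groups)

      # One dictionary entry per distinct compound word,
      # e.g. "HH_EH_L_OW HH EH L OW".
      with open("phones.dic", "w") as dic:
          for w in sorted(words):
              dic.write("_".join(w) + " " + " ".join(w) + "\n")

      The two output files can then play the role of the transcript and dictionary in an ordinary word-based training run.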

