I followed the steps on the page http://cmusphinx.sourceforge.net/wiki/phonemerecognition to extract phonemes using CMU Sphinx. I'm trying to build a language model for Tamil, and I'm using cmuclmtk to build the LM. I did the following steps:
Step 1: text2wfreq.exe < input.txt > input.wfreq
Step 2: wfreq2vocab.exe < input.wfreq > input.vocab
Step 3: text2idngram.exe -vocab input.vocab -idngram input.idngram < input.txt
Step 4: idngram2lm.exe -idngram input.idngram -vocab input.vocab -arpa input.arpa
Step 5: echo "perplexity -text input.txt" | evallm -arpa input.arpa
The CMU wiki says to "Just replace the words with their corresponding transcription". Where should I do this? I only have the phoneme list for my language. How do I incorporate it into cmuclmtk? And are there any tools to transcribe text to phonemes for my language?
You have to write your own script to replace the words in the training text with their phonemic sequences. You can use any scripting language for that: Python, Perl, etc.
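As a rough illustration of such a script, here is a minimal Python sketch. It assumes you have (or can build) a pronunciation dictionary with one entry per line in the form "word PH1 PH2 ...", which is an assumption on my part since the question only mentions a phoneme list. The file names in `transcribe_file` and all function names are hypothetical, and dropping lines with unknown words is just one possible policy.

```python
def load_dictionary(lines):
    """Parse 'word PH1 PH2 ...' lines into a {word: [phones]} mapping."""
    lexicon = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed entries
        word, phones = parts[0], parts[1:]
        lexicon.setdefault(word, phones)  # keep only the first pronunciation
    return lexicon


def transcribe_line(line, lexicon):
    """Return the phone string for one line, or None if any word is unknown."""
    phones = []
    for word in line.split():
        if word not in lexicon:
            return None
        phones.extend(lexicon[word])
    return " ".join(phones)


def transcribe_corpus(text_lines, lexicon):
    """Transcribe every line, dropping empty lines and lines with unknown words."""
    transcribed = []
    for line in text_lines:
        result = transcribe_line(line, lexicon)
        if result:  # None (unknown word) and "" (empty line) are both skipped
            transcribed.append(result)
    return transcribed


def transcribe_file(dict_path, text_path, out_path):
    """Read a dictionary and a training text, write the phonetic training text."""
    with open(dict_path, encoding="utf-8") as f:
        lexicon = load_dictionary(f)
    with open(text_path, encoding="utf-8") as f:
        phone_lines = transcribe_corpus(f, lexicon)
    with open(out_path, "w", encoding="utf-8") as f:
        for phone_line in phone_lines:
            f.write(phone_line + "\n")
```

Calling something like `transcribe_file("tamil.dic", "input.txt", "input.phones.txt")` would produce a text file in which each token is a phoneme. You would then run the same cmuclmtk steps on that file instead of the original word-level text, so the resulting vocabulary and n-grams are over phonemes rather than words.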