Hello,
I have been using PocketSphinx to do phoneme recognition as documented here: http://cmusphinx.sourceforge.net/wiki/phonemerecognition
By adding the -time argument, I can get the timing of each phoneme, which I use to segment the source file into small chunks.
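Roughly, the slicing I have in mind looks like the Python sketch below. It assumes each timing line is "PHONE start_sec end_sec" (adjust for whatever the -time output actually prints), and the file names are placeholders:

```python
# Rough sketch: cut a WAV file into per-phoneme chunks using a timings file.
# Assumes each timing line looks like "PHONE start_sec end_sec"; extra columns
# are ignored and non-timing lines are skipped. File names are placeholders.
import wave

def cut_phoneme_chunks(wav_path, timing_path, out_prefix="chunk"):
    with wave.open(wav_path, "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        bytes_per_frame = src.getsampwidth() * src.getnchannels()
        audio = src.readframes(src.getnframes())

    with open(timing_path) as f:
        for i, line in enumerate(f):
            parts = line.split()
            if len(parts) < 3:
                continue
            try:
                phone, start, end = parts[0], float(parts[1]), float(parts[2])
            except ValueError:
                continue  # header or other non-timing line
            lo = int(start * rate) * bytes_per_frame
            hi = int(end * rate) * bytes_per_frame
            with wave.open("%s_%04d_%s.wav" % (out_prefix, i, phone), "wb") as dst:
                dst.setparams(params)
                dst.writeframes(audio[lo:hi])

cut_phoneme_chunks("utterance.wav", "timings.txt")
```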
The software has been easy to set up and use; however, as mentioned on that page, the accuracy of running phoneme recognition directly is not very good. My goal is to build a collection of short sound files sorted by phoneme.
If I have a transcription for the source audio, is there some way to do "alignment" on the audio to get better segmentations? The audio files are only 5-20 seconds long.
There is a mention in another thread that you can get alignment in PocketSphinx by simply using the transcription as the grammar: https://sourceforge.net/p/cmusphinx/discussion/help/thread/dd998add/
How would I build such a language model for allphone/phoneme recognition? From what I understand, the allphone search only takes an ngram model. Should I replace every word in my transcription with its phonemes and feed that to SRILM's ngram-count to use as the language model?
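To make that concrete, the Python sketch below is the kind of thing I had in mind: look each word up in a CMUdict-style dictionary, strip the stress digits, and write one phoneme "sentence" per line as training text for ngram-count. The file names and dictionary format are assumptions on my part:

```python
# Sketch: expand a word transcription into phoneme "sentences" for ngram-count.
# Assumes a CMUdict-style dictionary where each line is "WORD PH1 PH2 ...";
# stress digits are stripped and out-of-vocabulary words are simply skipped.
import re

def load_lexicon(path):
    lexicon = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith(";;;") or not line.strip():
                continue  # cmudict comment or blank line
            word, *phones = line.split()
            word = re.sub(r"\(\d+\)$", "", word)  # drop alternate-pronunciation markers
            lexicon.setdefault(word.lower(), [re.sub(r"\d", "", p) for p in phones])
    return lexicon

def to_phones(text, lexicon):
    phones = []
    for word in re.findall(r"[a-z']+", text.lower()):
        phones.extend(lexicon.get(word, []))
    return " ".join(phones)

lexicon = load_lexicon("cmudict-en-us.dict")
with open("transcriptions.txt") as fin, open("phoneme_corpus.txt", "w") as fout:
    for line in fin:
        fout.write(to_phones(line, lexicon) + "\n")

# then, roughly:  ngram-count -text phoneme_corpus.txt -order 3 -lm phoneme.lm
```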
Or is there some better way to get phoneme timings through alignment in PocketSphinx?
Thank you very much!
Also, I think this page has a typo near the bottom: "feed this text file into strilm" should read "srilm":
http://cmusphinx.sourceforge.net/wiki/phonemerecognition
You can still use a grammar with a dictionary of single-phone words.
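For example, something along these lines (a rough sketch using the Python bindings; the model path, file names, and phone sequence are placeholders, and the grammar is just the expected phone string spelled out as single-phone words):

```python
# Sketch: align audio against a known phone sequence with a JSGF grammar whose
# "words" are single phones. Paths, file names, and the phone sequence are
# placeholders; use whatever acoustic model you have installed.
from pocketsphinx.pocketsphinx import Decoder

# phone.dict -- every "word" is a phone that maps to itself, e.g.:
#   HH HH
#   AH AH
#   L  L
#   OW OW
#
# phones.jsgf -- the transcription expanded into phones:
#   #JSGF V1.0;
#   grammar phones;
#   public <utt> = HH AH L OW;

config = Decoder.default_config()
config.set_string('-hmm', '/path/to/en-us')   # acoustic model directory (placeholder)
config.set_string('-dict', 'phone.dict')
config.set_string('-jsgf', 'phones.jsgf')
decoder = Decoder(config)

decoder.start_utt()
with open('utterance.raw', 'rb') as f:        # 16 kHz, 16-bit mono raw audio
    while True:
        buf = f.read(1024)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

# Each segment is one phone; frames are 10 ms by default.
for seg in decoder.seg():
    print(seg.word, seg.start_frame / 100.0, seg.end_frame / 100.0)
```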
There is also the ps_alignment API; you can find an example of it in the tests. But it requires a very exact match between the reference phoneme string and the actual audio content.
Thank you for the notice, fixed.