I am a novice to speech recognition. What I understand from reading Ravi's thesis is that sampled voice in the time domain is processed to generate 4 feature streams at 100 frames per second. Question: can these feature streams be used to reconstruct the sound in the time domain? If yes, has anyone done it? Thanks.
Shih-Lien
Good question. I imagine the feature streams are fairly highly processed, probably an intermediate step between samples and phonemes. So even though I don't know a thing about this architecture, I can at least say this: if the processing discards data, the original sound cannot be reconstructed exactly, but an approximation is possible if you don't mind the loss. For example, you can still use a stream of phonemes if you just want to run it into a text-to-speech engine later. What do you plan to use the processed data for?
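To make the data-loss point concrete, here is a minimal sketch. It does not implement the front end from the thesis: the feature used (one log-energy value per frame) is a deliberately simple stand-in, and the 16 kHz sampling rate and the helper names `frame_log_energy` and `tone` are assumptions chosen for illustration. Only the 100 frames per second figure comes from the question. The sketch shows two clearly different signals that produce the same feature stream, so that stream alone cannot tell you which signal to reconstruct.

```python
import math

SAMPLE_RATE = 16000                     # assumed; not stated in the thread
FRAME_RATE = 100                        # frames per second, as in the question
FRAME_LEN = SAMPLE_RATE // FRAME_RATE   # 160 samples per frame


def frame_log_energy(samples):
    """Toy feature: one log-energy value per non-overlapping frame."""
    features = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        energy = sum(x * x for x in frame)
        features.append(math.log(energy + 1e-10))  # small offset avoids log(0)
    return features


def tone(freq_hz, n_samples):
    """Unit-amplitude sine wave at freq_hz, sampled at SAMPLE_RATE."""
    return [math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
            for n in range(n_samples)]


# Two different one-second signals: a 200 Hz tone and a 400 Hz tone.
signal_a = tone(200.0, SAMPLE_RATE)
signal_b = tone(400.0, SAMPLE_RATE)

features_a = frame_log_energy(signal_a)
features_b = frame_log_energy(signal_b)

# The waveforms differ a lot, yet the per-frame features agree
# up to floating-point rounding.
largest_sample_gap = max(abs(a - b) for a, b in zip(signal_a, signal_b))
largest_feature_gap = max(abs(a - b) for a, b in zip(features_a, features_b))
print(len(features_a), largest_sample_gap, largest_feature_gap)
```

Real recognition features keep far more information than a single energy value, so an approximate, intelligible resynthesis may still be possible from them. The sketch only illustrates why an exact inverse cannot exist once the mapping from samples to features is many-to-one.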