Hello, all.
My name is Artur, and I am currently working on my PhD in Computer Science at
the University of Louisville. My interests include the application of data
mining methods to social networking and viral marketing. I am very new to
speech recognition, and to Sphinx in particular, so excuse me if my questions
are too basic :).
Let me first explain what I am trying to do. I am trying to develop an
application where a user speaks into the microphone and a cartoon character
repeats the words afterwards (preferably in real time). So it is basically a
lip-sync application.
Searching through this forum, it became clear to me that building a phoneme
recognition system is a nontrivial task. But for my application I don't even
need the phonemes; I need only visemes (facial expressions). Currently, I have
only 18 visemes (O, OO, R, FV, S, SH, EE, TH, L, ...), which I think should be
enough.
Could you please give me some advice: what is the best way to use Sphinx?
1. Train it with 39 phonemes and map them to 18 visemes.
2. Train it with 18 visemes.
3. Use Sphinx as is, recognize words, and map them to visemes.
4. Something else?
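For option 1, the mapping step itself is small. Below is a minimal sketch, assuming the recognizer outputs time-aligned phoneme segments using the 39 ARPAbet symbols of the CMU dictionary (possibly with stress digits such as `IY1`). Only the nine viseme labels named above are used; the grouping of phonemes into them, the `REST` fallback label, and the function names are illustrative assumptions, not part of Sphinx.

```python
# Partial, illustrative phoneme-to-viseme table. Only the nine viseme labels
# named in the post are used; the grouping itself is an assumption.
PHONEME_TO_VISEME = {
    "AO": "O", "OW": "O",
    "UW": "OO", "UH": "OO",
    "R": "R", "ER": "R",
    "F": "FV", "V": "FV",
    "S": "S", "Z": "S",
    "SH": "SH", "ZH": "SH", "CH": "SH", "JH": "SH",
    "IY": "EE", "IH": "EE",
    "TH": "TH", "DH": "TH",
    "L": "L",
}
FALLBACK = "REST"  # hypothetical placeholder for phonemes not covered above


def to_viseme(phoneme):
    """Map one phoneme label (e.g. 'IY1') to a viseme label."""
    base = phoneme.rstrip("012").upper()  # drop the optional stress digit
    return PHONEME_TO_VISEME.get(base, FALLBACK)


def visemes_from_segments(segments):
    """Convert (phoneme, start_sec, end_sec) tuples, given in time order,
    into a viseme track, merging neighbours that share a mouth shape."""
    track = []
    for phoneme, start, end in segments:
        viseme = to_viseme(phoneme)
        if track and track[-1][0] == viseme:
            # Same mouth shape as the previous segment: extend it.
            track[-1] = (viseme, track[-1][1], end)
        else:
            track.append((viseme, start, end))
    return track


# Example: "see" as S IY1 with made-up timings.
track = visemes_from_segments([("S", 0.0, 0.1), ("IY1", 0.1, 0.3)])
```

Because several phonemes share one viseme, merging adjacent segments keeps the animation from re-triggering the same mouth shape.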
I also have some requirements for the application:
1. Speaker independence
2. Continuous speech
3. Noise level adaptation
4. Mobile platform, if possible
And one more: speed is more important than accuracy.
Thank you!
Hello
According to this paper:
Comparison of Phoneme and Viseme Based Acoustic Units for Speech Driven
Realistic Lip Animation by Bozkurt et al.
http://staff.eng.bahcesehir.edu.tr/~cigdemeroglu/papers/international_conference_papers/C_07_3DTV_phoneme.pdf
viseme-based units perform only slightly better, but they are still worth
using on theoretical grounds: the fewer parameters you have to train, the
better.
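To give a rough sense of the parameter argument, here is a back-of-the-envelope sketch. It assumes context-independent units, 3 emitting HMM states per unit, 8 diagonal-covariance Gaussians per state, and 39-dimensional features. All of these numbers are illustrative assumptions, not figures from the paper or Sphinx defaults, and transition probabilities are ignored.

```python
def gmm_hmm_parameter_count(num_units, states_per_unit=3,
                            mixtures_per_state=8, feature_dim=39):
    """Rough count of Gaussian mixture parameters in a set of HMMs.

    Each diagonal Gaussian has feature_dim means, feature_dim variances,
    and one mixture weight. Transition probabilities are not counted.
    """
    per_gaussian = 2 * feature_dim + 1
    return num_units * states_per_unit * mixtures_per_state * per_gaussian


phoneme_params = gmm_hmm_parameter_count(39)  # 39 phoneme units
viseme_params = gmm_hmm_parameter_count(18)   # 18 viseme units
print(phoneme_params, viseme_params)
```

Under these assumptions the count scales linearly with the number of units, so 18 viseme models need a little under half the parameters of 39 phoneme models.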
I also recommend checking these papers:
Real-time language independent lip synchronization method using a genetic
algorithm by Goranka Zoric and Igor S. Pandzic
http://www.fer.unizg.hr/images/50009013/sp06.pdf
and
Real-Time Continuous Phoneme Recognition System Using Class-Dependent
Tied-Mixture HMM With HBT Structure for Speech-Driven Lip-Sync by Junho Park
and Hanseok Ko
Thank you for such a fast reply. I'll look into those papers.