Re: [Algorithms] lipsynch
From: Igor K. <ig...@ob...> - 2003-01-22 15:12:41
I've just seen that LipSynchro's SDK (if that is indeed what you use) can recognize the voice energy level. Maybe it's just the examples on the site, which are pretty crappy and don't show the real power of the SDK at all. If you have some references or links showing something made with LipSynchro, I would be interested to have a look.

Igor

----- Original Message -----
From: Chris Haarmeijer
To: gda...@li...
Sent: Wednesday, January 22, 2003 8:29 AM
Subject: RE: [Algorithms] lipsynch

What do you mean by preprocessing? We're using LipSynchro, which can be integrated through its SDK. You pass it a WAV file and it returns the phonemes used, along with other data that can be used for gestures. The whole process is automatic. We pass the returned data to our facial animation software, which translates the phonemes into visemes and morphs accordingly.

Chris

---
Keep IT Simple Software
Van Alphenstraat 12
7514 DD Enschede
W: http://www.keepitsimple.nl
E: mailto:in...@ke...
T: +31 53 4356687

-----Original Message-----
From: gda...@li... [mailto:gda...@li...] On Behalf Of Igor Kravtchenko
Sent: Wednesday, January 22, 2003 4:28
To: gda...@li...
Subject: [Algorithms] lipsynch

I would be interested in sharing experiences on real-time lipsync.

What I'm talking about is lipsync that works from a single audio file and does not need a preprocessing step such as phoneme analysis (as TalkMaster does). You just pass it a WAV and the head talks.

After several experiments, we ended up with a quite decent result. Here, roughly, is how we do it.

We start from a model set up with one endomorph (called a "blendshape" under 3DS Max) for the mouth and another for the two eyes. A motion rate is then chosen to animate the head, usually 25 Hz. At 25 Hz, the audio signal is read every 40 ms; at 44.1 kHz that represents a read of 1764 samples per poll. We then take the average of the differences between consecutive samples. Why not the average of the samples themselves? Because a sound is heard (and therefore exists) because of its variation, not because of its absolute values. Finally, that average delta is normalized between 0 and 1, so we generate 25 normalized float values per second.

These values are added to a motion with one keyframe every 40 ms. A noise reduction is then applied to the motion (a "blur" between keyframes), and on that result, adjacent keyframes that don't represent a major change (judged by a threshold) are removed from the motion. Finally, everything is bound together with a good old TCB spline. At audio playback, the value computed at a given time is directly interpreted as a percentage applied to the endomorph (blendshape) that controls the opening of the mouth.

For a 40-second speech, this represents a .mot file of only 24 KB (ASCII format) when we bring the motion back into Lightwave. The size of course depends on the WAV, but for a "normal" speech it doesn't change much.

The result is pretty convincing for something automatically generated. However, I'm not sure we have reached the quality of some games like LOTR (when Gandalf talks in the first cinematic, the result is just... nice!). Maybe there are some (easy?) tricks to improve the quality of such an automatic lipsync? Any hint is welcome,

Igor
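[Editor's note: a minimal C++ sketch of the envelope pass Igor describes, assuming 16-bit mono PCM at 44.1 kHz already decoded into memory. Every name in it (Keyframe, extractMouthEnvelope, blurKeyframes, pruneKeyframes, the 0.02 threshold) is illustrative, not taken from his actual code.]

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Keyframe {
        float time;   // seconds
        float value;  // normalized mouth opening, 0..1
    };

    // One keyframe per 40 ms frame (25 Hz): the mean of |delta| between
    // consecutive samples, since the audible energy lives in the variation
    // of the signal, not in its absolute values.
    std::vector<Keyframe> extractMouthEnvelope(const std::vector<int16_t>& samples,
                                               int sampleRate = 44100,
                                               int motionRate = 25)
    {
        const size_t samplesPerFrame = sampleRate / motionRate;  // 1764 at 44.1 kHz
        std::vector<Keyframe> keys;
        float maxDelta = 0.0f;

        for (size_t start = 0; start + samplesPerFrame <= samples.size();
             start += samplesPerFrame)
        {
            float sum = 0.0f;
            for (size_t i = 1; i < samplesPerFrame; ++i)
                sum += std::abs(float(samples[start + i]) -
                                float(samples[start + i - 1]));
            float avgDelta = sum / float(samplesPerFrame - 1);
            maxDelta = std::max(maxDelta, avgDelta);
            keys.push_back({ float(start) / float(sampleRate), avgDelta });
        }

        // Normalize the average deltas into 0..1.
        if (maxDelta > 0.0f)
            for (Keyframe& k : keys)
                k.value /= maxDelta;
        return keys;
    }

    // The "blur" between keyframes, done here as a simple 3-tap box filter.
    void blurKeyframes(std::vector<Keyframe>& keys)
    {
        std::vector<Keyframe> out = keys;
        for (size_t i = 1; i + 1 < keys.size(); ++i)
            out[i].value = (keys[i - 1].value + keys[i].value +
                            keys[i + 1].value) / 3.0f;
        keys = out;
    }

    // Drop keyframes that don't represent a major change relative to the
    // last kept key; the survivors are later joined with a TCB spline.
    std::vector<Keyframe> pruneKeyframes(const std::vector<Keyframe>& keys,
                                         float threshold = 0.02f)
    {
        std::vector<Keyframe> out;
        for (const Keyframe& k : keys)
            if (out.empty() || std::abs(k.value - out.back().value) > threshold)
                out.push_back(k);
        if (!keys.empty() && out.back().time != keys.back().time)
            out.push_back(keys.back());  // keep the final key so playback ends cleanly
        return out;
    }

At playback you would fit a TCB spline (or any smooth interpolant) through the surviving keys, evaluate it at the current audio time, and feed the result straight into the mouth blendshape weight.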