Re: [Algorithms] lipsynch
From: Igor K. <ig...@ob...> - 2003-01-22 15:12:41
I've just seen that LipSynchro's SDK (if that is indeed what you use) can recognize the voice energy level. Maybe it's just the examples on the site, which are pretty crappy and don't show the real power of the SDK at all. If you have some references or links showing something made with LipSynchro, I would be interested to have a look.

Igor

----- Original Message -----
From: Chris Haarmeijer
To: gda...@li...
Sent: Wednesday, January 22, 2003 8:29 AM
Subject: RE: [Algorithms] lipsynch

What do you mean by preprocessing? We're using LipSynchro, which can be integrated through its SDK. You pass it a WAV file and it returns the phonemes used, along with other data that can be used for gestures. The whole process is automatic. We pass the returned data to our facial animation software, which translates the phonemes into visemes and morphs accordingly.

Chris

---
Keep IT Simple Software
Van Alphenstraat 12
7514 DD Enschede
W: http://www.keepitsimple.nl
E: mailto:in...@ke...
T: +31 53 4356687

-----Original Message-----
From: gda...@li... [mailto:gda...@li...] On Behalf Of Igor Kravtchenko
Sent: Wednesday, January 22, 2003 4:28
To: gda...@li...
Subject: [Algorithms] lipsynch

I would be interested in sharing experiences on real-time lipsync.

What I'm talking about is lipsync that works from a single audio file and does not need a preprocessing step such as phoneme analysis (as TalkMaster does). You just pass it a WAV and the head talks.

After several experiments, we ended up with a quite decent result. Here, roughly, is how we do it.

We start from a model set up with one endomorph (called a "blendshape" under 3DS Max) for the mouth and another for the two eyes. A motion rate is then chosen to animate the head, usually 25 Hz. At 25 Hz, the audio signal is read every 40 ms; at 44.1 kHz that represents a read of 1764 samples per poll. We then take the average of the differences between consecutive samples. Why not the average of the samples themselves? Because a sound is heard (and therefore exists) because of its variation, not because of its absolute values. Finally, that average delta is normalized between 0 and 1, so we generate 25 normalized float values per second.

These values are added to a motion with one keyframe every 40 ms. A noise reduction is then applied to the motion (a "blur" between keyframes), and on that result, adjacent keyframes that don't represent a major change (judged by a threshold) are removed from the motion. Finally, everything is bound together with a good old TCB spline. At audio playback, the value computed at a given time is directly interpreted as a percentage applied to the endomorph (blendshape) that controls the opening of the mouth.

For a 40-second speech, this represents a .mot file of only 24 KB (ASCII format) when we bring the motion back into Lightwave. The size of course depends on the WAV, but for a "normal" speech it doesn't change much.

The result is pretty convincing for something automatically generated. However, I'm not sure we have reached the quality of some games like LOTR (when Gandalf talks in the first cinematic, the result is just... nice!). Maybe there are some (easy?) tricks to improve the quality of such an automatic lipsync? Any hint is welcome,

Igor
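[Editor's note: a minimal C++ sketch of the envelope pass Igor describes, assuming 16-bit mono PCM at 44.1 kHz already decoded into memory. Every name in it (Keyframe, extractMouthEnvelope, blurKeyframes, pruneKeyframes, the 0.02 threshold) is illustrative, not taken from his actual code.]

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <vector>

    struct Keyframe {
        float time;   // seconds
        float value;  // normalized mouth opening, 0..1
    };

    // One keyframe per 40 ms frame (25 Hz): the mean of |delta| between
    // consecutive samples, since the audible energy lives in the variation
    // of the signal, not in its absolute values.
    std::vector<Keyframe> extractMouthEnvelope(const std::vector<int16_t>& samples,
                                               int sampleRate = 44100,
                                               int motionRate = 25)
    {
        const size_t samplesPerFrame = sampleRate / motionRate;  // 1764 at 44.1 kHz
        std::vector<Keyframe> keys;
        float maxDelta = 0.0f;

        for (size_t start = 0; start + samplesPerFrame <= samples.size();
             start += samplesPerFrame)
        {
            float sum = 0.0f;
            for (size_t i = 1; i < samplesPerFrame; ++i)
                sum += std::abs(float(samples[start + i]) -
                                float(samples[start + i - 1]));
            float avgDelta = sum / float(samplesPerFrame - 1);
            maxDelta = std::max(maxDelta, avgDelta);
            keys.push_back({ float(start) / float(sampleRate), avgDelta });
        }

        // Normalize the average deltas into 0..1.
        if (maxDelta > 0.0f)
            for (Keyframe& k : keys)
                k.value /= maxDelta;
        return keys;
    }

    // The "blur" between keyframes, done here as a simple 3-tap box filter.
    void blurKeyframes(std::vector<Keyframe>& keys)
    {
        std::vector<Keyframe> out = keys;
        for (size_t i = 1; i + 1 < keys.size(); ++i)
            out[i].value = (keys[i - 1].value + keys[i].value +
                            keys[i + 1].value) / 3.0f;
        keys = out;
    }

    // Drop keyframes that don't represent a major change relative to the
    // last kept key; the survivors are later joined with a TCB spline.
    std::vector<Keyframe> pruneKeyframes(const std::vector<Keyframe>& keys,
                                         float threshold = 0.02f)
    {
        std::vector<Keyframe> out;
        for (const Keyframe& k : keys)
            if (out.empty() || std::abs(k.value - out.back().value) > threshold)
                out.push_back(k);
        if (!keys.empty() && out.back().time != keys.back().time)
            out.push_back(keys.back());  // keep the final key so playback ends cleanly
        return out;
    }

At playback you would fit a TCB spline (or any smooth interpolant) through the surviving keys, evaluate it at the current audio time, and feed the result straight into the mouth blendshape weight.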