RE: [Algorithms] skeletal animation system

Oh, I see what you mean now! That's pretty cunning.

> What makes you think that cache performance is going to improve on
> PS3? Or xbox360?
> As far as I have read the penalties for cache misses in both platforms
> are considerable. Actually relative to the speed of the processing
> units situation is much worse than on PS2.

Agreed. In fact, I believe the miss to main memory is almost exactly the
same between PS2 and Xbox 360 in nanoseconds, but of course the
instruction throughput has increased by somewhere around 30x!

However, it's only the latency that is the same. For DMA systems, setting
up a transfer is somewhat costly, and there's often an overhead to
starting and stopping. On a cache-based system, it's a single
instruction, and since everything is cache-line-sized, there's very
little penalty for small transfers. Or actually, it's that there's very
little gain from large transfers!

So while you need to set up your fetches earlier in both systems, setting
them up in a cache-based system is cheaper. Most crucially, if you
mis-predict one in a thousand times, or run out of local storage, cache
systems don't crash, they just run a bit slower :-) For example, looping
animations need you to fetch the very end and the very start of the
animation. On a cache system, just prefetch one place - it's not worth
the extra computation to optimise for that rare case. On a DMA-based
system, you don't have a choice.
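
To make the cache side of that concrete, here is a rough sketch in C++.
It is not Granny code: Character, sample and update_all are made-up
names, the "sampling" is a placeholder sum, and __builtin_prefetch is the
GCC/Clang builtin, assumed here as a stand-in for whatever prefetch
instruction the platform has. It issues one prefetch for the next
character's controls and does nothing clever about the looping case.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-character view of the spline controls needed this frame.
struct Character {
    const float* controls;  // first control value for this frame
    std::size_t  count;     // number of control values
    float        result;    // stand-in for the sampled pose
};

// Prefetch hint. On GCC/Clang this emits a prefetch where the target has
// one; on other compilers it compiles to nothing.
inline void prefetch(const void* address) {
#if defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(address);
#else
    (void)address;
#endif
}

// Placeholder "sample and blend": just sums the controls.
float sample(const Character& c) {
    assert(c.controls != nullptr || c.count == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < c.count; ++i) sum += c.controls[i];
    return sum;
}

void update_all(std::vector<Character>& characters) {
    for (std::size_t i = 0; i < characters.size(); ++i) {
        // Start pulling in the next character's data before working on
        // the current one, so the fetch overlaps with the computation.
        if (i + 1 < characters.size()) prefetch(characters[i + 1].controls);
        characters[i].result = sample(characters[i]);
    }
}
```

If the prefetched address turns out to be the wrong one, the code still
produces the right answer and simply takes the miss, which is the whole
attraction compared with a DMA transfer that has to be exactly right.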

Sorry, little rant about caches vs DMA there. I know you weren't
suggesting they were inherently good. And of course it's perfectly true
that a cache takes more gates to implement.

> If you have a system working on PS2 it much easier to port it to
> PC/Xbox than the other way around - especially if you want to get some
> performance out. PS3 is again different from the other two and still
> closer to PS2 in programming ideology.

Erm... not really. The PS3 is similar to nothing in its ideology. Having
a good PS2 engine doesn't help you at all in having a good PS3 engine -
except in raising your pain threshold. In theory you could say that VU0
was like an SPU, but in practice VU0 was so crippled it was almost
useless (in my previous life writing games we used it as extra scratchpad
memory :-). Hopefully SPUs will be slightly more useful.

It is true that if you have a PS2 engine you can relatively easily port
it to PC/Xbox (and I have done exactly that - when designing a
cross-platform engine, the PS2 was the focus and then the other two were
easy). However, the PS3 is somewhat at right-angles to everything. It's
true that you can simply put each SPU on a "thread" and then instead of
DMAs you do memcpy, but it's not exactly elegant or efficient.
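
For what it's worth, the back-port I mean looks roughly like the C++
sketch below. Everything in it is hypothetical: fake_dma, fake_spu,
run_on_fake_spus and the 256-float buffer size are invented for
illustration, and the "work" is a placeholder multiply. Each thread plays
the part of one SPU, with a private buffer standing in for local store
and memcpy standing in for the DMA get and put.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical size of each worker's private buffer (the pretend local store).
constexpr std::size_t kLocalFloats = 256;

// Stand-in for a DMA get/put: on a cache-based machine it is just a memcpy.
void fake_dma(void* dst, const void* src, std::size_t bytes) {
    std::memcpy(dst, src, bytes);
}

// One emulated SPU: copy a chunk in, work on the private copy, copy it back.
void fake_spu(float* main_memory, std::size_t begin, std::size_t end) {
    float local[kLocalFloats];
    for (std::size_t i = begin; i < end; i += kLocalFloats) {
        const std::size_t n = std::min(kLocalFloats, end - i);
        fake_dma(local, main_memory + i, n * sizeof(float));
        for (std::size_t j = 0; j < n; ++j) local[j] *= 2.0f;  // placeholder work
        fake_dma(main_memory + i, local, n * sizeof(float));
    }
}

// Split the data into contiguous ranges and give each range to its own thread.
void run_on_fake_spus(std::vector<float>& data, std::size_t num_spus) {
    assert(num_spus > 0);
    const std::size_t per_spu = (data.size() + num_spus - 1) / num_spus;
    std::vector<std::thread> threads;
    for (std::size_t s = 0; s < num_spus; ++s) {
        const std::size_t begin = std::min(s * per_spu, data.size());
        const std::size_t end = std::min(begin + per_spu, data.size());
        threads.emplace_back(fake_spu, data.data(), begin, end);
    }
    for (std::thread& t : threads) t.join();
}
```

The two copies per chunk buy nothing on a machine that already has a
cache, which is why I call it inelegant: the work is the same, plus the
copying.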

> Anyway I don't pretend what I have is best. What I'm saying is try ASAP
> your PS3 plan. Tweak/change your plan until you are happy with the
> performance. When that happens see how it maps to the other platforms
> and you'll find that there is hardly anything left to do. ;)

Well, maybe. It does mangle the PC/Xb/Xb360/Mac/GC/PS2/PSP/Rev (yeah, we
do a lot of platforms!) code quite a lot for this one platform, and one
of the things we value highly is a clean codebase, since customers
actually get to see it all. I also think that at best the PS3->whatever
back-port will be the same speed - but it might well be slower, because
essentially you're trading computing cycles to make up for not having as
much intelligence in the memory hierarchy.

But I do like your incremental update idea - certainly worth looking at.
I'm slightly worried about keeping chunks of temporary memory around
per-instance, but it probably isn't all that big.

TomF.

> -----Original Message-----
> From: gda...@li...
> [mailto:gda...@li...] On Behalf Of Yordan Gyurchev
> Sent: 29 October 2005 05:33
> To: gda...@li...
> Subject: Re: [Algorithms] skeletal animation system
>
>
> I just want to point out that I never advocated reusing same animation
> samples (or caching them) in multiple characters. And as Tom points out
> chances of that happening are tiny unless your scene dictates otherwise
> or you are not dealing with skeletal animation at all.
> So let's put this point to rest. ;)
>
> What I have are small structures per-character that hold cached sample
> info and are used mostly when characters are updated. One structure =
> one character update. They have maximum locality and require minimum
> fetch operations that can be predicted/anticipated. This is where my
> no-cache misses approach comes from. It also allows you to not care if
> you have the whole animation in memory, allowing you to stream from
> disk if you have some crazy long cutscenes.
>
> Think of it as a reader head that goes through the stream and caches the
> current data... when new data is needed it fetches it from the stream.
> Fetch here can be asynchronous operation as the sample cache is reading
> ahead. (cache is used as a programming concept here rather than real hw
> cache)
>
> structure
> ------
> control data
> time in the animation
> control points for bone 1 (could be 5 controls - 4 used and 1
>     prefetched in anticipation)
> control points for bone 2
> ...
> control points for bone N
> ...
> OutputSkeleton State
> ------
>
> So the algorithm works all the time on this structure, which is local to
> the PU (or prefetched). No main memory access. When the time in the
> animation is advanced enough so control point 4 becomes control point 3
> and 5 -> 4... we need to prefetch new control point 5, and all this is
> asynchronous so no stalls (i.e. code waiting for the data) occur
> anywhere.
>
> Update will be something like this:
> ----
> pre-fetch next character
> calculate current character
> store current character
> move next : make current character = next character
> ----
>
> > Note that I have not tried this on the PS2's scratchpad - that is
> > rather small. The cache performance on the PS2 was so abysmal in every
> > way that it would have been a big big job to get it even half-decent
> > on there.
> My approach works on PS2 scratchpad. The differences to PC code are 5
> lines of code that are ifdef-ed. All other code is identical in all
> platforms. (this is with normal lerp for key frames... with four
> controls per bone spline I don't fit in scratchpad either). I also
> double buffer the scratch pad so I keep the processor busy.
>
> What makes you think that cache performance is going to improve on PS3?
> Or xbox360?
> As far as I have read the penalties for cache misses in both platforms
> are considerable. Actually relative to the speed of the processing units
> situation is much worse than on PS2.
>
> If you have a system working on PS2 it much easier to port it to PC/Xbox
> than the other way around - especially if you want to get some
> performance out. PS3 is again different from the other two and still
> closer to PS2 in programming ideology.
>
> As you point out Tom the other (non PS) two platforms are trivial. Most
> techniques John mentioned are supported directly by OpenMP. This is no
> coincidence as the symmetric/shared memory architecture favours that.
> They are trivial because most techniques in these architectures are
> based around "how to quickly make a serial program into a parallel one".
> Loop parallelism being by far the most popular. My approach is based
> around mainly that pattern with several sync points.
>
> Anyway I don't pretend what I have is best. What I'm saying is try ASAP
> your PS3 plan. Tweak/change your plan until you are happy with the
> performance. When that happens see how it maps to the other platforms
> and you'll find that there is hardly anything left to do. ;)
>
> John, you talk about writing schedulers. Do these run in the main
> thread? Have you tried separate threads/processors to schedule
> themselves with shared task queue/task pools approach? Just curious what
> the differences are going to be in PU utilization when you add more PUs
> although 1.9 is pretty good :)
>
> -Yordan
>
> ----- Original Message -----
> From: "Tom Forsyth" <tom...@ee...>
> To: <gda...@li...>
> Sent: Saturday, October 29, 2005 4:12 AM
> Subject: RE: [Algorithms] skeletal animation system
>
>
> > My argument here was that new keyframes/knots are required
> > infrequently. With a
> > good spline fitting algorithm you should be experiencing the
> > same.
>
> Well, from frame 1 of character 3, bone 4 to frame 2 of character 3,
> bone 4, yes - you will need almost exactly the same data. The problem
> is, those two samplings are an entire 60th of a second apart. You'll
> want to fill local store/cache with tons of other data in the meantime.
>
> The chances of characters 4, 5, 6, etc needing the same 3-4 knots or
> keyframes in frame 1 is tiny, according to my data. But yes, maybe if
> you have a big crowd of characters all running the same animation at
> roughly the same phase, you'll get re-use. I'm just saying, I don't
> think that is even close to the worst case performance, nor do I think
> it is even a common case. But it does depend on your game type.
>
> > Do you have similar control structures or you rely that
> > animation is going
> > to be prefetched in local storage? If so do you have
> > guarantee that every
> > single of your animations is going to fit there? Remember
> > this would be
> > about 1/3 (in practice more like 1/4) of real local storage
> > space if you
> > plan to triple buffer to keep the PU busy.
>
> I only fetch the 3-4 control points in each spline that the current
> character needs. If the next character happens to also need them, then
> hey - that's fine - the cache might do useful work. But I assume it
> doesn't, and I assume that each character has to fetch all-new data,
> because that is the worst case, and it's also the common case. So I need
> to prefetch that data before I do the sampling & blending.
>
> Space usually isn't a problem - I only need 4 control points from pos &
> orn for each bone for each animation. In practice, the size of the
> required control data (e.g. how long each animation is, what sort of
> compression each spline uses, etc) is bigger than the control point data
> I need this particular frame. But none of it is really very large - it
> fits in all sensible-sized caches and local data storage just fine.
>
> Note that I have not tried this on the PS2's scratchpad - that is rather
> small. The cache performance on the PS2 was so abysmal in every way that
> it would have been a big big job to get it even half-decent on there.
> Since I have five other platforms to deal with, I just waited for
> someone to try to run a complex scene on the PS2 before I worried about
> it. And nobody ever did, so it never had to be done. And probably never
> will now. Which I am very happy about :-)
>
> > You have to decide where in the animation stream are each quadruples
> > of controls for each bone in order to get only those quadruples and
> > DMA them in the local PU memory.
>
> Correct.
>
> > Now I can tell you straight away (and I'm almost sure things won't
> > change much) that on PS2 one solid DMA transfer is far better than a
> > number of small ones. They have some setup time and need to be
> > prioritized and etc. So the less you have the better. Not only that
> > you have to spend CPU some time deciding/calculating where exactly in
> > your animation are these quadruples of keyframes and that is not
> > something you want to be doing every frame, or you do?
>
> Well, you have to decide which controls you need anyway - that's part of
> the spline sampling process. So that's no extra cost.
>
> Let me preface this with a warning - I have not tried a DMA-based scheme
> yet. That's because on every other platform, we're not talking about
> DMAs, we're talking about normal cache accesses, and they don't care
> about small/large transfers - everything is just the size of a cache
> line, and setting up a transfer is just executing a "prefetch" with the
> relevant address - extremely cheap. All you need to do is make sure you
> do those prefetches far enough in advance (to absorb the latency of the
> transfer from main memory), and life is good. The only platforms that
> DMAs are needed for are PS2 (avoided doing this - see above) and PS3.
>
> So, here is my entirely theoretical plan for PS3 - due to be implemented
> some time next year.
>
> Granny allows animation splines to be listed in any order you like in
> memory - they are remapped to the skeleton on the fly. This is a big
> bonus for a bunch of things, but that would be a very long discussion,
> so let's just take that as read. Some splines will have lots of
> knots/controls, and some will be able to use different sorts of
> compression, and so they will all be different lengths in bytes. So,
> given that, order your spline data in memory from smallest to largest.
>
> Now, decide what your magic byte length is for your platform. This is
> going to be largely trial and error, but it's roughly the place where
> the cost of setting up a new transfer exceeds the cost of transferring
> the extra data. So let's say that's 64 bytes (chosen at random). That
> is, if you wanted to transfer two chunks of 12 bytes that were 64 bytes
> apart, it is the same speed to set up a single transfer that is 64+12
> bytes long, as to do two transfers of 12 bytes each.
>
> OK, so we know we need 12 bytes (roughly - varies with compression and
> spline degree) from each spline in the list. And they're ordered by
> size. So we just need to find the first spline with data longer than 64
> bytes. All the splines before it in the list just get their entire data
> DMA'd over in a single chunk - it's not efficient to do them with
> separate transfers. All the ones afterwards just get a 12-byte transfer
> set up for each of them, because that's the only bit you need, and you'd
> probably run out of local store if you transferred the whole lot, as
> well as having to wait longer for the transfer.
>
> So that's the theory. I have yet to try it in practice though. Now, it
> may be that the number is not 64 bytes, it's more like 4kbytes or
> something. In which case this mostly just collapses to "transfer the
> whole animation over". Well, so be it. But people do use some pretty
> long animations with Granny (I've had people export 15-minute credit
> sequences as a single animation!), so at some point you need to cope
> with not being able to DMA the whole thing over.
>
> As I said above, on every platform but the PS*, it's trivial. They have
> caches, and you just do a prefetch instruction per spline. On PS3, it's
> somewhat more effort <shakes fist at Sony/IBM for their crazy
> architecture>.
>
>
> TomF.
>
>
> > -----Original Message-----
> > From: gda...@li...
> > [mailto:gda...@li...] On
> > Behalf Of Yordan Gyurchev
> > Sent: 28 October 2005 14:29
> > To: gda...@li...
> > Subject: Re: [Algorithms] skeletal animation system
> >
> >
> > Ok, you've won me over. Processing one character/entity on one PU
> > thing. So, one character = one task.
> >
> > When I think about it though it fits my approach, just the scale is
> > different.
> >
> > > By "fetch", I meant you're fetching them from main memory
> > into either a
> > > cache or some sort of local storage (e.g. SPU memory). I
> > was assuming the
> > > animation has already been loaded off disk, otherwise
> > you're doomed!
> > > Granny
> > > uses splines rather than keyframes, but the data access
> > patterns are
> > > roughly
> > > similar to a sparse keyframe system.
> > Fetch is = main memory => PU
> > I use the word keyframe for both normal keyframes and spline
> > knots (probably
> > wrong but in our case its still some data at some point in
> > time regardless
> > of the interp method).
> >
> > > So IMHO it's best to focus optimisation on what you do at
> > each leaf, which
> > > can usually be expressed as sample-and-blend.
> > Agreed. This is what I'm most interested in as well. And
> > analyser data shows
> > most time is spend there.
> >
> > I find it difficult to follow what exactly you do with the PU
> > and what goes
> > to local memory and back to main etc (see later my rant on
> > DMA transfers).
> > IMHO there needs to be a number of support structures. Inputs
> > like where in
> > the animation you are, blending weights, parameters, etc. And
> > outputs like
> > the final skeleton state (what is rendered), other output
> > info (we return
> > for example velocity of the root or absolute movement).
> >
> > In my approach I put all my support structures in one solid
> > chunk and call
> > that animation state. Its all the data the task will need
> > (regardless of the
> > scale of the task - be it a full character or not).
> >
> > So we move the animation state into local memory, process
> > that character,
> > then move it back. It can be structured so read-only data
> > goes only one way
> > without problem.
> >
> > Now the question is what goes in this animation state and
> > what stays in main
> > memory and is fetched on request (or pre-streamed). With my
> > approach I keep
> > a duplicates of the control points (key frames) that are currently
> > interpolated into local interpolation structures (local to
> > the animation
> > state). So, only when I move to the next keyframe (require
> > next knot) I
> > touch main memory to fetch it (or not if it is pre-streamed
> > in local). My
> > argument here was that new keyframes/knots are required
> > infrequently. With a
> > good spline fitting algorithm you should be experiencing the
> > same. Mine is
> > (I assume) worse than yours by a mile and still keyframe
> > fetches (and for
> > that reason cache misses) are not an issue when I analyze the
> > execution.
> > Even more you can fetch asynchronously an extra
> > keyframe/control so you
> > don't have any stalls.
> >
> > Do you have similar control structures or you rely that
> > animation is going
> > to be prefetched in local storage? If so do you have
> > guarantee that every
> > single of your animations is going to fit there? Remember
> > this would be
> > about 1/3 (in practice more like 1/4) of real local storage
> > space if you
> > plan to triple buffer to keep the PU busy.
> >
> > May be I'm missing something here.
> >
> > > Whereas source animations can be compressed far smaller
> > than this - we
> > > average around 4 bytes per spline control, and you usually
> > need to sample
> > > the nearest three controls in a spline (quadratic
> > interpolation), and it's
> > > read-only data, so around 12 bytes of traffic. Obviously
> > these are really
> > > rough figures, but there's still a significant difference.
> > Yes I understand the maths, but I think you are focusing on the wrong
> > thing here. It's not always how much you transfer but in what chunks
> > and how often. You have to decide where in the animation stream are
> > each quadruples of controls for each bone in order to get only those
> > quadruples and DMA them in the local PU memory. Now I can tell you
> > straight away (and I'm almost sure things won't change much) that on
> > PS2 one solid DMA transfer is far better than a number of small ones.
> > They have some setup time and need to be prioritized and etc. So the
> > less you have the better. Not only that you have to spend CPU some
> > time deciding/calculating where exactly in your animation are these
> > quadruples of keyframes and that is not something you want to be
> > doing every frame, or you do?
> >
> > If you want to directly sample data you need to move most of
> > your animation
> > (if possible all of it) in local mem. Or consider some genius
> > animation
> > format where you can DMA solid blocks every time you need to
> > sample data
> > (once per animation).
> >
> > I must say that load every animation and sample all characters seems
> > attractive. Though it hides potential problems of either not
> > being able to
> > fit animation (too long animation) or sample data (too many
> > character).
> >
> > On the topic of perfect load balancing I agree with you.
> >
> > -Yordan
> >
> > ----- Original Message -----
> > From: "Tom Forsyth" <tom...@ee...>
> > To: <gda...@li...>
> > Sent: Friday, October 28, 2005 8:27 PM
> > Subject: RE: [Algorithms] skeletal animation system
> >
> >
> > > You are talking about these
> > > few most important characters that you see in the foreground.
> >
> > I'm also talking about decent numbers of characters. In fact,
> > if there's
> > 1000s of characters in the scene, then keeping each
> > character's animation
> > inside a single PU makes even more sense.
> >
> > > Due to the overlapping nature of the memory allocation
> > (especially in
> > > cross-fade blends) you have implied data separation that
> > > might as well be utilized.
> >
> > Sorry, you've totally lost me there :-)
> >
> > > > You do very few blends between already-sampled data.
> > > Actually we almost never do anything but that. Different
> > > perspective here.
> > > We do a few N-blends based on parameters like vectors,
> > > quaternions, etc.
> > > You can do an "artistic" IK look-alike that way. I'm sure
> > > there was link
> > > in this list some time ago... about this.
> >
> > Bad mis-communication on my part. What I meant is that
> > because trees are
> > broad but shallow, you are almost always blending together
> > data sampled
> > directly from an animation (the leaves of the tree). You
> > rarely blend data
> > that is the result of a previous blend (the branches and
> > trunk of the tree).
> > So IMHO it's best to focus optimisation on what you do at
> > each leaf, which
> > can usually be expressed as sample-and-blend.
> >
> > Of course, in a purely binary tree, you have roughly equal
> > numbers of the
> > two operations. But that's one reason why a binary tree isn't
> > that great an
> > idea in general :-)
> >
> > However, you do say that in your data sets you often have
> > just one animation
> > running, so it's a sample with no blending. Granny also has a
> > special-cased
> > path for this (only one active animation). In practice, some
> > games hit that
> > a lot, some hit it almost never - it all depends on the data
> > sets. I would
> > be cautious about optimising for the case that is already pretty fast!
> > There's a minor code-maintenance issue, but it's only two
> > paths, and they
> > call a lot of the same functions (that the compiler inlines
> > perfectly well).
> >
> > > most of the time you are doing interpolation between key frames
> > > (that are stored locally) rather than fetching them. If your streams
> > > are that densely populated that you have to fetch every update you
> > > are doing something wrong, which i cant imagine being the case with
> > > Granny for example.
> >
> > By "fetch", I meant you're fetching them from main memory
> > into either a
> > cache or some sort of local storage (e.g. SPU memory). I was
> > assuming the
> > animation has already been loaded off disk, otherwise you're
> > doomed! Granny
> > uses splines rather than keyframes, but the data access
> > patterns are roughly
> > similar to a sparse keyframe system.
> >
> > The way Granny is oriented is on a character-by-character basis. Each
> > character is sampled & blended fully, pulling in the data on
> > demand, and
> > then you do the next one. There's obvious exceptions for
> > characters that
> > interact with each other, but they're certainly not the
> > average case. This
> > keeps all the intermediate data nice and warm in caches or
> > local storage,
> > and only the inputs (animations) and final outputs (what you
> > render) live in
> > main memory.
> >
> > To do effective streaming from main memory, you can either
> > prefetch/DMA all
> > the animations for a character before you start sampling the
> > first one (this
> > works well when characters are playing a lot of animations),
> > or you can
> > prefetch character 2 before you sample & blend character 1
> > (this works well
> > when characters are playing much fewer or simpler
> > animations). If prefetch
> > times are long, and local storage/cache space is large, you
> > can prefetch
> > character 3 or 4 before doing character 1, but that tends to get
> > counterproductive and can thrash caches (which is of course lethal).
> >
> > Now, you could possibly orient it around animations instead. Load each
> > animation just once, then do all the sampling of that animation
> > required for all the characters in the scene. This means you only
> > fetch each animation's data just once. The problem is, the results of
> > the sampling need to be stored somewhere, but because you're doing all
> > the sampling of that animation in one go, you're unlikely to use that
> > result soon enough for it to stay in a cache or fit in limited local
> > storage. And it's big - typically a vec3 of position and a vec4 of
> > orientation, or 28 bytes in total, and you need to write the data then
> > read it back later - bus traffic of 56 bytes. Whereas source
> > animations can be compressed far smaller than this - we average around
> > 4 bytes per spline control, and you usually need to sample the nearest
> > three controls in a spline (quadratic interpolation), and it's
> > read-only data, so around 12 bytes of traffic. Obviously these are
> > really rough figures, but there's still a significant difference.
> >
> > > Granularity. By doing this unless you build some very fine
> > > scheduling metric
> > > you are going to have trouble balancing PU utilization. Some
> > > characters
> > > will have longer blend trees than other, taking more
> > processing times,
> > > etc. Unless you are doing all anim operations on one PU.
> > Still without
> > > test data on this one I may be completely off target.
> >
> > In general (obviously waving my hands a bit here), if you have more
> > than about 4x the number of tasks than processors, and the time taken
> > by each task is distributed in a roughly bell-curvish way, your
> > granularity is "fine enough". Splitting the work into finer chunks
> > might get slightly more level load-balancing, but the overhead of
> > doing the splitting, and the extra synchronisation required because
> > you have more tasks, means that you don't actually get any benefit.
> >
> > To put this another way - the worst case scenario is that all PUs
> > finish all available tasks, and have to wait for a single PU to do a
> > single task. They never have to wait for that last PU to do more than
> > one task (because otherwise one of the idle ones would do it instead).
> > So if the total number of tasks divided by the number of PUs is more
> > than 10, then the worst you can do is waste 10% of your processing
> > power. Wasting only 10% of your total power is considered superb
> > efficiency in the multiprocessing world! This assumes that the tasks
> > are not dependent and can be processed by any PU in any order, but if
> > you partition by characters, this is 99% true (and for the few that
> > are dependent on something else, just schedule those first). Certainly
> > with 1000s of characters and only around 8 PUs (e.g. PS3), I just
> > can't see it being a problem.
> >
> >
> > TomF.
> >
> >
> > > -----Original Message-----
> > > From: gda...@li...
> > > [mailto:gda...@li...] On
> > > Behalf Of yg...@gy...
> > > Sent: 28 October 2005 03:08
> > > To: gda...@li...
> > > Subject: RE: [Algorithms] skeletal animation system
> > >
> > >
> > > Hi Tom,
> > >
> > > Points, well made.
> > >
> > > I think we come from different points of view. You are
> > > talking about these
> > > few most important characters that you see in the foreground.
> > > I'm focusing on LOTS of characters in the background
> > > (instancing numbers,
> > > hundred, thousands, etc) and not so complicated animation operations
> > > (cross-fade blends being by far the most popular blend in my
> > > experience in
> > > these cases).
> > > Due to the overlapping nature of the memory allocation
> > (especially in
> > > cross-fade blends) you have implied data separation that
> > > might as well be
> > > utilized.
> > >
> > > Also in my experience especially when animating all kinds of
> > > entities (and
> > > have some external attachment between them) its good to process all
> > > animations before you get to actually processing your
> > > entities (characters
> > > if you want). I think we agree on this point.
> > >
> > > > Every now and then you get multiple characters doing
> > > roughly the same
> > > > animation, and the cache helps, but I regard this as luck.
> > > We have actually done some research and testing on that. I
> > > dont think I
> > > agree with this statement when it comes to crowds. We (a
> > colleague of
> > > mine) do massive character scenes by sorting morphed data. We
> > > upload the
> > > morph targets and then send instance data (translate,rotate). This
> > > achieves amazing (talking 1000+) quantities of characters on
> > > screen on a
> > > current console platform. I was convinced at first that this
> > > would never
> > > work because like you I didnt imagine there would be any reasonable
> > > exploitable "roughly the same keyframe" animation.
> > >
> > > > we need to optimise the worse case scenarios, not the best!
> > > Quite right ;) In my case worst scenario happens when you do
> > > fast forward
> > > cause you hit the anim stream way too often. Unless
> > > pre-streamed like you
> > > suggest.
> > >
> > > > So, in conclusion - assume every time you sample an
> > > animation, you will
> > > > miss every cache.
> > > Agreed. However, I'm arguing (and in my experience and analyser
> > > data) that most of the time you are doing interpolation between key
> > > frames (that are stored locally) rather than fetching them. If your
> > > streams are that densely populated that you have to fetch every
> > > update you are doing something wrong, which i cant imagine being the
> > > case with Granny for example.
> > >
> > > I do agree that blend trees are usually not that tall and
> > > binary blending
> > > (except for cross-fades) is not the best under the sun.
> > >
> > > > You do very few blends between already-sampled data.
> > > Actually we almost never do anything but that. Different
> > > perspective here.
> > > We do a few N-blends based on parameters like vectors,
> > > quaternions, etc.
> > > You can do an "artistic" IK look-alike that way. I'm sure
> > > there was link
> > > in this list some time ago... about this.
> > >
> > > > Thus, although it is elegant to split the processing into
> > > (a) sample all
> > > > your animations into temporary buffers and then (b) blend
> > > the buffers
> > > > together, in practice it makes more sense to have your
> > > animation sampling
> > > > also do the weighted accumulation, so that you can focus all your
> > > > optimisation on that one routine, and not have that
> > > intermediate data hang
> > > > around polluting caches - as soon as you sample a bone's
> > > data, you add it
> > > > to
> > > > its blend.
> > > My approach is: dont do any blends if you dont have to. So you
> > > sample/interpolate your anims and most often, if they are not
> > > in a cross
> > > fade, you are done. If there is something else to be done
> > its another
> > > task. Makes for better code separation. With your solution
> > you have to
> > > make a fit-all process function.
> > > Still your approach will be less memory consuming and less
> > > burden on the
> > > bus so it could be worth the maintainability-performance=20
> trade-off.
> > >
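A rough C++ sketch of the fused "sample and accumulate" routine discussed above. It is heavily simplified and all names (`Clip`, `Layer`, `sampleAndAccumulate`) are made up for illustration: one float per bone stands in for a real rotation/translation, frames are not interpolated, and weights are simply normalised at the end. The point is only that each sampled value goes straight into the running blend, so no per-animation temporary pose buffer exists.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified clip: one float per bone per frame.
struct Clip {
    std::size_t bones;
    std::vector<float> frames;  // frame-major: frames[frame * bones + bone]
    float at(std::size_t frame, std::size_t bone) const {
        return frames[frame * bones + bone];
    }
};

struct Layer {
    const Clip* clip;
    std::size_t frame;
    float weight;
};

// Sample each layer and add it directly into the blended pose.
std::vector<float> sampleAndAccumulate(const std::vector<Layer>& layers,
                                       std::size_t bones) {
    std::vector<float> pose(bones, 0.0f);
    float total = 0.0f;
    for (const Layer& l : layers) {
        assert(l.clip->bones == bones);
        if (l.weight <= 0.0f) continue;  // skip layers that contribute nothing
        for (std::size_t b = 0; b < bones; ++b)
            pose[b] += l.weight * l.clip->at(l.frame, b);
        total += l.weight;
    }
    if (total > 0.0f)
        for (float& v : pose) v /= total;  // normalise the accumulated weights
    return pose;
}
```

With a single layer of weight 1 this reduces to plain sampling, so the "no blend unless you need one" case still costs one pass. The price, as noted above, is that one routine has to handle every case.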
> > > > > So the point I'm trying to (finally) get to is that trying to
> > > > > parallelise the operations done on one particular character seems
> > > > > like a lot of effort for little benefit.
> > > > I'm not after parallelising operations on one character. I'm for
> > > > parallelising all operations on all characters in one go. The
> > > > animation system doesn't know about a "character" as such... it is
> > > > just animating entities...
> > > > Your main point here is that a dominant serial term in the algorithm
> > > > reduces or invalidates any benefit from the parallel part. I must
> > > > agree that it's a good point, but without any test data I can't say
> > > > whether that is the case or not. If task scheduling (and DMA
> > > > transfers) is properly interleaved (double or triple buffered) you
> > > > don't get any penalty for that. Still, a very valid point.
> > >
> > > > > IMHO, a far better thing is to assume that the number of characters
> > > > > on-screen is significantly larger than the number of PUs, and
> > > > > simply do all the anim sampling and blending of each character on
> > > > > a single PU.
> > > > Granularity. By doing this, unless you build some very fine
> > > > scheduling metric, you are going to have trouble balancing PU
> > > > utilisation. Some characters will have longer blend trees than
> > > > others, taking more processing time, etc. Unless you are doing all
> > > > anim operations on one PU. Still, without test data on this one I may
> > > > be completely off target.
> > >
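One possible shape for the "scheduling metric" mentioned above, as a C++ sketch only. It assumes you can attach an estimated cost to each character (for example its blend-tree node count, which is an assumption here, not something measured in this thread) and then hands characters to PUs heaviest first, always picking the least-loaded PU. `assignToPUs` is a made-up name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Greedy balancing: heaviest characters first, each to the least-loaded PU.
// Returns the PU index chosen for each character.
std::vector<std::size_t> assignToPUs(const std::vector<float>& cost,
                                     std::size_t puCount) {
    assert(puCount > 0);

    // Character indices sorted by descending estimated cost.
    std::vector<std::size_t> order(cost.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return cost[a] > cost[b]; });

    std::vector<float> load(puCount, 0.0f);
    std::vector<std::size_t> pu(cost.size());
    for (std::size_t c : order) {
        std::size_t best = static_cast<std::size_t>(
            std::min_element(load.begin(), load.end()) - load.begin());
        pu[c] = best;
        load[best] += cost[c];
    }
    return pu;
}
```

This keeps each character on a single PU and still evens out the load reasonably when there are many more characters than PUs. How well the cost estimate tracks real processing time is exactly the part that needs test data.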
> > > > Thanks ever so much for the feedback. In fact I think I'm going to
> > > > try some of your ideas and see how they compare with the existing
> > > > solution.
> > >
> > > -Yordan
> >
> >
> >
> >
> >
> > -------------------------------------------------------
> > This SF.Net email is sponsored by the JBoss Inc.
> > Get Certified Today * Register for a JBoss Training Course
> > Free Certification Exam for All Training Attendees Through End of 2005
> > Visit http://www.jboss.com/services/certification for more information
> > _______________________________________________
> > GDAlgorithms-list mailing list
> > GDA...@li...
> > https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
> > Archives:
> > http://sourceforge.net/mailarchive/forum.php?forum_id=6188