RE: [Algorithms] skeletal animation system
From: Tom F. <tom...@ee...> - 2005-10-29 03:12:43
> My argument here was that new keyframes/knots are required
> infrequently. With a good spline fitting algorithm you should be
> experiencing the same.

Well, from frame 1 of character 3, bone 4 to frame 2 of character 3, bone
4, yes - you will need almost exactly the same data. The problem is, those
two samplings are an entire 60th of a second apart. You'll want to fill
local store/cache with tons of other data in the meantime.

The chances of characters 4, 5, 6, etc. needing the same 3-4 knots or
keyframes in frame 1 are tiny, according to my data. But yes, maybe if you
have a big crowd of characters all running the same animation at roughly
the same phase, you'll get re-use. I'm just saying, I don't think that is
even close to the worst-case performance, nor do I think it is even a
common case. But it does depend on your game type.

> Do you have similar control structures or do you rely on the animation
> being prefetched in local storage? If so, do you have a guarantee that
> every single one of your animations is going to fit there? Remember this
> would be about 1/3 (in practice more like 1/4) of real local storage
> space if you plan to triple buffer to keep the PU busy.

I only fetch the 3-4 control points in each spline that the current
character needs. If the next character happens to also need them, then hey
- that's fine - the cache might do useful work. But I assume it doesn't,
and I assume that each character has to fetch all-new data, because that
is the worst case, and it's also the common case. So I need to prefetch
that data before I do the sampling & blending.

Space usually isn't a problem - I only need 4 control points from pos &
orn for each bone for each animation. In practice, the size of the
required control data (e.g. how long each animation is, what sort of
compression each spline uses, etc.) is bigger than the control point data
I need this particular frame.
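As a rough illustration of why only a handful of controls per bone need
fetching each frame, here is a sketch of picking the control window for a
quadratic sample. Everything here is hypothetical and simplified - real
Granny data is compressed, and a real implementation would resume the
search from the previous frame's position rather than scanning:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Window of control points needed to sample one spline at time t.
struct ControlWindow {
    std::size_t first;  // index of the first control point needed
    std::size_t count;  // number of control points (3 for quadratic)
};

// Assumes at least 3 knots; knot times are sorted ascending.
inline ControlWindow FindWindow(const std::vector<float>& knots, float t) {
    // Find the knot interval containing t (linear scan for clarity).
    std::size_t i = 0;
    while (i + 1 < knots.size() && knots[i + 1] <= t) ++i;
    // Quadratic interpolation needs the control at i and its neighbours.
    std::size_t first = (i == 0) ? 0 : i - 1;
    std::size_t count = 3;
    if (first + count > knots.size()) first = knots.size() - count;
    return {first, count};
}
```

Only those `count` controls (roughly 12 bytes, compressed) ever need to
leave main memory for this spline this frame.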
But none of it is really very large - it fits in all sensible-sized
caches and local data storage just fine.

Note that I have not tried this on the PS2's scratchpad - that is rather
small. The cache performance on the PS2 was so abysmal in every way that
it would have been a big big job to get it even half-decent on there.
Since I have five other platforms to deal with, I just waited for someone
to try to run a complex scene on the PS2 before I worried about it. And
nobody ever did, so it never had to be done. And probably never will now.
Which I am very happy about :-)

> You have to decide where in the animation stream each quadruple of
> controls for each bone is, in order to get only those quadruples and
> DMA them into local PU memory.

Correct.

> Now I can tell you straight away (and I'm almost sure things won't
> change much) that on PS2 one solid DMA transfer is far better than a
> number of small ones. They have some setup time and need to be
> prioritized, etc. So the fewer you have the better. Not only that, you
> have to spend some CPU time deciding/calculating where exactly in your
> animation these quadruples of keyframes are, and that is not something
> you want to be doing every frame, or do you?

Well, you have to decide which controls you need anyway - that's part of
the spline sampling process. So that's no extra cost.

Let me preface this with a warning - I have not tried a DMA-based scheme
yet. That's because on every other platform, we're not talking about
DMAs, we're talking about normal cache accesses, and they don't care
about small/large transfers - everything is just the size of a cache
line, and setting up a transfer is just executing a "prefetch" with the
relevant address - extremely cheap. All you need to do is make sure you
do those prefetches far enough in advance (to absorb the latency of the
transfer from main memory), and life is good.
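On a cache-based platform the whole scheme is roughly the following
sketch. The builtin is GCC/Clang-specific, `Spline` and `SampleSpline`
are stand-ins rather than real Granny API, and the prefetch distance is a
made-up tuning knob:

```cpp
#include <cstddef>

// Stand-in spline type for illustration; real data is compressed.
struct Spline { const unsigned char* controls; };

void SampleSpline(const Spline&, float) { /* decode & accumulate here */ }

// Hint the control data of a spline a few iterations ahead while
// sampling the current one, so main-memory latency hides behind useful
// work. Returns the number of hints issued, just to make it checkable.
std::size_t SampleAll(const Spline* splines, std::size_t n, float t) {
    const std::size_t kAhead = 4;  // tuning knob: prefetch distance
    std::size_t hints = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n) {
#if defined(__GNUC__)
            __builtin_prefetch(splines[i + kAhead].controls);
#endif
            ++hints;
        }
        SampleSpline(splines[i], t);
    }
    return hints;
}
```

The hint is free if the data is already resident, which is why assuming
the worst case (every fetch misses) costs nothing when you're wrong.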
The only platforms that DMAs are needed for are PS2 (avoided doing this -
see above) and PS3. So, here is my entirely theoretical plan for PS3 -
due to be implemented some time next year.

Granny allows animation splines to be listed in any order you like in
memory - they are remapped to the skeleton on the fly. This is a big
bonus for a bunch of things, but that would be a very long discussion, so
let's just take that as read. Some splines will have lots of
knots/controls, and some will be able to use different sorts of
compression, and so they will all be different lengths in bytes. So,
given that, order your spline data in memory from smallest to largest.

Now, decide what your magic byte length is for your platform. This is
going to be largely trial and error, but it's roughly the place where the
cost of setting up a new transfer exceeds the cost of transferring the
extra data. So let's say that's 64 bytes (chosen at random). That is, if
you wanted to transfer two chunks of 12 bytes that were 64 bytes apart,
it is the same speed to set up a single transfer that is 64+12 bytes
long, as to do two transfers of 12 bytes each.

OK, so we know we need 12 bytes (roughly - varies with compression and
spline degree) from each spline in the list. And they're ordered by size.
So we just need to find the first spline with data longer than 64 bytes.
All the splines before it in the list just get their entire data DMA'd
over in a single chunk - it's not efficient to do them with separate
transfers. All the ones afterwards just get a 12-byte transfer set up for
each of them, because that's the only bit you need, and you'd probably
run out of local store if you transferred the whole lot, as well as
having to wait longer for the transfer.

So that's the theory. I have yet to try it in practice though. Now, it
may be that the number is not 64 bytes, it's more like 4kbytes or
something.
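The theory above, sketched out (untested, as I said - the threshold, the
12-byte window, and all the names are the placeholder numbers from the
discussion, not a real implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Transfer { std::size_t offset, bytes; };  // into the spline block

// Plan DMA transfers for one frame. splineSizes is ordered smallest to
// largest (splines are laid out contiguously in that order), and
// windowOffsets[i] is where this frame's ~12-byte control window sits
// within spline i. kMagicBytes is the trial-and-error point where a new
// transfer's setup cost exceeds transferring the extra data.
std::vector<Transfer> PlanTransfers(
        const std::vector<std::size_t>& splineSizes,
        const std::vector<std::size_t>& windowOffsets,
        std::size_t kMagicBytes = 64,
        std::size_t kWindowBytes = 12) {
    std::vector<Transfer> plan;
    // Everything at or below the threshold is a contiguous prefix of the
    // block, so it all goes over in one solid chunk.
    std::size_t i = 0, prefixBytes = 0;
    while (i < splineSizes.size() && splineSizes[i] <= kMagicBytes) {
        prefixBytes += splineSizes[i];
        ++i;
    }
    if (prefixBytes) plan.push_back({0, prefixBytes});
    // Each remaining (large) spline gets one small transfer for just the
    // control window needed this frame.
    std::size_t base = prefixBytes;
    for (; i < splineSizes.size(); ++i) {
        plan.push_back({base + windowOffsets[i], kWindowBytes});
        base += splineSizes[i];
    }
    return plan;
}
```

Raise `kMagicBytes` to 4096 and the whole-data prefix usually swallows
everything, which is the collapse-to-"transfer the whole animation" case.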
In which case this mostly just collapses to "transfer the whole animation
over". Well, so be it. But people do use some pretty long animations with
Granny (I've had people export 15-minute credit sequences as a single
animation!), so at some point you need to cope with not being able to DMA
the whole thing over.

As I said above, on every platform but the PS*, it's trivial. They have
caches, and you just do a prefetch instruction per spline. On PS3, it's
somewhat more effort <shakes fist at Sony/IBM for their crazy
architecture>.

TomF.

> -----Original Message-----
> From: gda...@li...
> [mailto:gda...@li...] On
> Behalf Of Yordan Gyurchev
> Sent: 28 October 2005 14:29
> To: gda...@li...
> Subject: Re: [Algorithms] skeletal animation system
>
> Ok, you've won me over. Processing one character/entity on one PU. So,
> one character = one task.
>
> When I think about it though, it fits my approach, just the scale is
> different.
>
> > By "fetch", I meant you're fetching them from main memory into either
> > a cache or some sort of local storage (e.g. SPU memory). I was
> > assuming the animation has already been loaded off disk, otherwise
> > you're doomed! Granny uses splines rather than keyframes, but the
> > data access patterns are roughly similar to a sparse keyframe system.
>
> Fetch is = main memory => PU.
> I use the word keyframe for both normal keyframes and spline knots
> (probably wrong, but in our case it's still some data at some point in
> time regardless of the interp method).
>
> > So IMHO it's best to focus optimisation on what you do at each leaf,
> > which can usually be expressed as sample-and-blend.
>
> Agreed. This is what I'm most interested in as well. And analyser data
> shows most time is spent there.
> I find it difficult to follow what exactly you do with the PU and what
> goes to local memory and back to main etc. (see later my rant on DMA
> transfers). IMHO there needs to be a number of support structures.
> Inputs like where in the animation you are, blending weights,
> parameters, etc. And outputs like the final skeleton state (what is
> rendered), other output info (we return for example velocity of the
> root or absolute movement).
>
> In my approach I put all my support structures in one solid chunk and
> call that the animation state. It's all the data the task will need
> (regardless of the scale of the task - be it a full character or not).
>
> So we move the animation state into local memory, process that
> character, then move it back. It can be structured so read-only data
> goes only one way without problem.
>
> Now the question is what goes in this animation state and what stays in
> main memory and is fetched on request (or pre-streamed). With my
> approach I keep duplicates of the control points (keyframes) that are
> currently interpolated in local interpolation structures (local to the
> animation state). So, only when I move to the next keyframe (require
> the next knot) do I touch main memory to fetch it (or not, if it is
> pre-streamed locally). My argument here was that new keyframes/knots
> are required infrequently. With a good spline fitting algorithm you
> should be experiencing the same. Mine is (I assume) worse than yours by
> a mile and still keyframe fetches (and for that reason cache misses)
> are not an issue when I analyze the execution. Even more, you can fetch
> an extra keyframe/control asynchronously so you don't have any stalls.
>
> Do you have similar control structures or do you rely on the animation
> being prefetched in local storage?
> If so, do you have a guarantee that every single one of your animations
> is going to fit there? Remember this would be about 1/3 (in practice
> more like 1/4) of real local storage space if you plan to triple buffer
> to keep the PU busy.
>
> Maybe I'm missing something here.
>
> > Whereas source animations can be compressed far smaller than this -
> > we average around 4 bytes per spline control, and you usually need to
> > sample the nearest three controls in a spline (quadratic
> > interpolation), and it's read-only data, so around 12 bytes of
> > traffic. Obviously these are really rough figures, but there's still
> > a significant difference.
>
> Yes, I understand the maths, but I think you are focusing on the wrong
> thing here. It's not always how much you transfer but in what chunks
> and how often. You have to decide where in the animation stream each
> quadruple of controls for each bone is in order to get only those
> quadruples and DMA them into local PU memory. Now I can tell you
> straight away (and I'm almost sure things won't change much) that on
> PS2 one solid DMA transfer is far better than a number of small ones.
> They have some setup time and need to be prioritized, etc. So the fewer
> you have the better. Not only that, you have to spend some CPU time
> deciding/calculating where exactly in your animation these quadruples
> of keyframes are, and that is not something you want to be doing every
> frame, or do you?
>
> If you want to directly sample data you need to move most of your
> animation (if possible all of it) into local mem. Or consider some
> genius animation format where you can DMA solid blocks every time you
> need to sample data (once per animation).
>
> I must say that "load every animation and sample all characters" seems
> attractive.
> Though it hides the potential problems of either not being able to fit
> the animation (too long an animation) or the sample data (too many
> characters).
>
> On the topic of perfect load balancing I agree with you.
>
> -Yordan
>
> ----- Original Message -----
> From: "Tom Forsyth" <tom...@ee...>
> To: <gda...@li...>
> Sent: Friday, October 28, 2005 8:27 PM
> Subject: RE: [Algorithms] skeletal animation system
>
> > You are talking about these
> > few most important characters that you see in the foreground.
>
> I'm also talking about decent numbers of characters. In fact, if
> there's 1000s of characters in the scene, then keeping each character's
> animation inside a single PU makes even more sense.
>
> > Due to the overlapping nature of the memory allocation (especially in
> > cross-fade blends) you have implied data separation that
> > might as well be utilized.
>
> Sorry, you've totally lost me there :-)
>
> > > You do very few blends between already-sampled data.
> > Actually we almost never do anything but that. Different perspective
> > here. We do a few N-blends based on parameters like vectors,
> > quaternions, etc. You can do an "artistic" IK look-alike that way.
> > I'm sure there was a link in this list some time ago... about this.
>
> Bad mis-communication on my part. What I meant is that because trees
> are broad but shallow, you are almost always blending together data
> sampled directly from an animation (the leaves of the tree). You rarely
> blend data that is the result of a previous blend (the branches and
> trunk of the tree). So IMHO it's best to focus optimisation on what you
> do at each leaf, which can usually be expressed as sample-and-blend.
>
> Of course, in a purely binary tree, you have roughly equal numbers of
> the two operations.
> But that's one reason why a binary tree isn't that great an idea in
> general :-)
>
> However, you do say that in your data sets you often have just one
> animation running, so it's a sample with no blending. Granny also has a
> special-cased path for this (only one active animation). In practice,
> some games hit that a lot, some hit it almost never - it all depends on
> the data sets. I would be cautious about optimising for the case that
> is already pretty fast! There's a minor code-maintenance issue, but
> it's only two paths, and they call a lot of the same functions (that
> the compiler inlines perfectly well).
>
> > most of the time you are doing interpolation between keyframes (that
> > are stored locally) rather than fetching them. If your streams are so
> > densely populated that you have to fetch every update, you are doing
> > something wrong, which I can't imagine being the case with Granny for
> > example.
>
> By "fetch", I meant you're fetching them from main memory into either a
> cache or some sort of local storage (e.g. SPU memory). I was assuming
> the animation has already been loaded off disk, otherwise you're
> doomed! Granny uses splines rather than keyframes, but the data access
> patterns are roughly similar to a sparse keyframe system.
>
> The way Granny is oriented is on a character-by-character basis. Each
> character is sampled & blended fully, pulling in the data on demand,
> and then you do the next one. There's obvious exceptions for characters
> that interact with each other, but they're certainly not the average
> case. This keeps all the intermediate data nice and warm in caches or
> local storage, and only the inputs (animations) and final outputs (what
> you render) live in main memory.
> To do effective streaming from main memory, you can either prefetch/DMA
> all the animations for a character before you start sampling the first
> one (this works well when characters are playing a lot of animations),
> or you can prefetch character 2 before you sample & blend character 1
> (this works well when characters are playing much fewer or simpler
> animations). If prefetch times are long, and local storage/cache space
> is large, you can prefetch character 3 or 4 before doing character 1,
> but that tends to get counterproductive and can thrash caches (which is
> of course lethal).
>
> Now, you could possibly orient it around animations instead. Load each
> animation just once, then do all the sampling of that animation
> required for all the characters in the scene. This means you only fetch
> each animation's data just once. The problem is, the results of the
> sampling need to be stored somewhere, but because you're doing all the
> sampling of that animation in one go, you're unlikely to use that
> result soon enough for it to stay in a cache or fit in limited local
> storage. And it's big - typically a vec3 of position and a vec4 of
> orientation, or 28 bytes in total, and you need to write the data then
> read it back later - bus traffic of 56 bytes. Whereas source animations
> can be compressed far smaller than this - we average around 4 bytes per
> spline control, and you usually need to sample the nearest three
> controls in a spline (quadratic interpolation), and it's read-only
> data, so around 12 bytes of traffic. Obviously these are really rough
> figures, but there's still a significant difference.
>
> > Granularity. By doing this, unless you build some very fine
> > scheduling metric, you are going to have trouble balancing PU
> > utilization. Some characters will have longer blend trees than
> > others, taking more processing time, etc.
> > Unless you are doing all anim operations on one PU. Still, without
> > test data on this one I may be completely off target.
>
> In general (obviously waving my hands a bit here), if you have more
> than about 4x the number of tasks than processors, and the time taken
> by each task is distributed in a roughly bell-curvish way, your
> granularity is "fine enough". Splitting the work into finer chunks
> might get slightly more level load-balancing, but the overhead of doing
> the splitting, and the extra synchronisation required because you have
> more tasks, means that you don't actually get any benefit.
>
> To put this another way - the worst case scenario is that all PUs
> finish all available tasks, and have to wait for a single PU to do a
> single task. They never have to wait for that last PU to do more than
> one task (because otherwise one of the idle ones would do it instead).
> So if the total number of tasks divided by the number of PUs is more
> than 10, then the worst you can do is waste 10% of your processing
> power. Wasting only 10% of your total power is considered superb
> efficiency in the multiprocessing world! This assumes that the tasks
> are not dependent and can be processed by any PU in any order, but if
> you partition by characters, this is 99% true (and for the few that are
> dependent on something else, just schedule those first). Certainly with
> 1000s of characters and only around 8 PUs (e.g. PS3), I just can't see
> it being a problem.
>
> TomF.
>
> > -----Original Message-----
> > From: gda...@li...
> > [mailto:gda...@li...] On
> > Behalf Of yg...@gy...
> > Sent: 28 October 2005 03:08
> > To: gda...@li...
> > Subject: RE: [Algorithms] skeletal animation system
> >
> > Hi Tom,
> >
> > Points, well made.
> >
> > I think we come from different points of view.
> > You are talking about these few most important characters that you
> > see in the foreground. I'm focusing on LOTS of characters in the
> > background (instancing numbers, hundreds, thousands, etc.) and not so
> > complicated animation operations (cross-fade blends being by far the
> > most popular blend in my experience in these cases).
> > Due to the overlapping nature of the memory allocation (especially in
> > cross-fade blends) you have implied data separation that might as
> > well be utilized.
> >
> > Also in my experience, especially when animating all kinds of
> > entities (which have some external attachments between them), it's
> > good to process all animations before you get to actually processing
> > your entities (characters if you want). I think we agree on this
> > point.
> >
> > > Every now and then you get multiple characters doing roughly the
> > > same animation, and the cache helps, but I regard this as luck.
> > We have actually done some research and testing on that. I don't
> > think I agree with this statement when it comes to crowds. We (a
> > colleague of mine) do massive character scenes by sorting morphed
> > data. We upload the morph targets and then send instance data
> > (translate, rotate). This achieves amazing (talking 1000+) quantities
> > of characters on screen on a current console platform. I was
> > convinced at first that this would never work because, like you, I
> > didn't imagine there would be any reasonable exploitable "roughly the
> > same keyframe" animation.
> >
> > > we need to optimise the worst case scenarios, not the best!
> > Quite right ;) In my case the worst scenario happens when you do fast
> > forward, because you hit the anim stream way too often. Unless
> > pre-streamed like you suggest.
> >
> > > So, in conclusion - assume every time you sample an animation, you
> > > will miss every cache.
> > Agreed.
> > However, I'm arguing (and in my experience and analyser data) that
> > most of the time you are doing interpolation between keyframes (that
> > are stored locally) rather than fetching them. If your streams are so
> > densely populated that you have to fetch every update, you are doing
> > something wrong, which I can't imagine being the case with Granny
> > for example.
> >
> > I do agree that blend trees are usually not that tall and binary
> > blending (except for cross-fades) is not the best under the sun.
> >
> > > You do very few blends between already-sampled data.
> > Actually we almost never do anything but that. Different perspective
> > here. We do a few N-blends based on parameters like vectors,
> > quaternions, etc. You can do an "artistic" IK look-alike that way.
> > I'm sure there was a link in this list some time ago... about this.
> >
> > > Thus, although it is elegant to split the processing into (a)
> > > sample all your animations into temporary buffers and then (b)
> > > blend the buffers together, in practice it makes more sense to have
> > > your animation sampling also do the weighted accumulation, so that
> > > you can focus all your optimisation on that one routine, and not
> > > have that intermediate data hang around polluting caches - as soon
> > > as you sample a bone's data, you add it to its blend.
> > My approach is: don't do any blends if you don't have to. So you
> > sample/interpolate your anims and most often, if they are not in a
> > cross-fade, you are done. If there is something else to be done, it's
> > another task. Makes for better code separation. With your solution
> > you have to make a fit-all process function.
> > Still, your approach will be less memory consuming and less of a
> > burden on the bus, so it could be worth the
> > maintainability-performance trade-off.
> >
> > > So the point I'm trying to (finally) get to is that trying to
> > > parallelise the operations done on one particular character seems
> > > like a lot of effort for little benefit.
> > I'm not after parallelising operations on one character. I'm for
> > parallelising all operations on all characters in one go. The
> > animation system doesn't know about a "character" as such... it is
> > just animating entities...
> > Your main point here is that the serial term of the algorithm being
> > dominant reduces or invalidates any benefit from the parallel part. I
> > must agree that it's a good point, but without any test data I can't
> > say if that is the case or not. If task scheduling (and DMA
> > transfers) is properly interleaved (double or triple buffered) you
> > don't get any penalty for that. Still, a very valid point.
> >
> > > IMHO, a far better thing is to assume that the number of characters
> > > on-screen is significantly larger than the number of PUs, and
> > > simply do all the anim sampling and blending of each character on a
> > > single PU.
> > Granularity. By doing this, unless you build some very fine
> > scheduling metric, you are going to have trouble balancing PU
> > utilization. Some characters will have longer blend trees than
> > others, taking more processing time, etc. Unless you are doing all
> > anim operations on one PU. Still, without test data on this one I may
> > be completely off target.
> >
> > Thanks ever so much for the feedback. In fact I think I'm going to
> > try some of your ideas and see how they compare with the existing
> > solution.
> >
> > -Yordan
>
> -------------------------------------------------------
> This SF.Net email is sponsored by the JBoss Inc.
> Get Certified Today * Register for a JBoss Training Course
> Free Certification Exam for All Training Attendees Through End of 2005
> Visit http://www.jboss.com/services/certification for more information
> _______________________________________________
> GDAlgorithms-list mailing list
> GDA...@li...
> https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
> Archives:
> http://sourceforge.net/mailarchive/forum.php?forum_id=6188