Re: [Algorithms] skeletal animation system
From: Yordan G. <yg...@gy...> - 2005-10-29 20:44:20
> Erm... not really. The PS3 is similar to nothing in its ideology. Having a good PS2 engine doesn't help you at all in having a good PS3 engine - except in raising your pain threshold.

Probably should have made my statement clearer. I had in mind the idea of a processing unit that has local (on-chip) memory with almost zero-latency access to it. This was the case with the EE/scratchpad on the PS2, and now on the PS3 you have seven of these (well, similar) things. I'm just saying the programming idioms are similar to the PS2 in this respect (alongside the DMA stuff).

> Sorry, little rant about caches vs DMA there. I know you weren't suggesting they were inherently good. And of course it's perfectly true that a cache takes more gates to implement.

No, I agree with you. There are actually more problems with DMA here. DMA-based systems usually have caches as well, and the combination spells even more trouble: when you DMA data to your local storage (i.e. scratchpad), you want to flush from your caches the data you are actually moving, otherwise the cache commits something that you later DMA back, or some similar cache/DMA race condition occurs - and you are in "painland".

> But I do like your incremental update idea - certainly worth looking at. I'm slightly worried about keeping chunks of temporary memory around per-instance, but it probably isn't all that big.

Some numbers along that line of thought. Obviously, the more bones I have in my animated skeletons, the more temp data I need. I use keyframes (two controls needed per interpolation), but my samples are in raw form. My structure also contains a few bytes of control data plus the output transforms. I managed to squeeze a maximum of 61 bones into 8K (half the scratchpad), so my animated skeletons have a 61-bone cap. Of course, this 8K of data needs to live in main memory as well. So 100 characters = 800KB; at 1000+ we are talking ~8MB. Though if you have 1000+ 61-boned characters you are asking for trouble anyway. Still, 8MB is 8MB ;)
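(To make that budget concrete, here is a rough C++ sketch of such a per-character block; the field names and per-bone layout are illustrative guesses from the figures above, not the actual engine structure.)

----
#include <cstdint>

struct Vec3      { float x, y, z; };    // 12 bytes
struct Quat      { float x, y, z, w; }; // 16 bytes
struct Matrix4x3 { float m[12]; };      // 48 bytes - one output transform

// Two raw keyframe controls per bone, as described above.
struct BoneTrack {
    Vec3  posKey[2];   // 24 bytes
    Quat  ornKey[2];   // 32 bytes
    float keyTime[2];  //  8 bytes - timestamps of the two controls
};                     // 64 bytes of input data per bone

const int kMaxBones = 61;

struct CharacterAnimState {
    float     time;              // where we are in the animation
    uint32_t  boneCount;         // a few bytes of control data
    BoneTrack tracks[kMaxBones]; // 61 * 64 = 3904 bytes
    Matrix4x3 output[kMaxBones]; // 61 * 48 = 2928 bytes
};

// ~6.7K so far; the real structure presumably spends the remaining budget
// on more per-track control data (stream cursors, flags, etc).
static_assert(sizeof(CharacterAnimState) <= 8192, "must fit half the scratchpad");
----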
-Yordan

----- Original Message -----
From: "Tom Forsyth" <tom...@ee...>
To: <gda...@li...>
Sent: Saturday, October 29, 2005 7:51 PM
Subject: RE: [Algorithms] skeletal animation system

Oh, I see what you mean now! That's pretty cunning.

> What makes you think that cache performance is going to improve on PS3? Or Xbox 360? As far as I have read, the penalties for cache misses on both platforms are considerable. Actually, relative to the speed of the processing units the situation is much worse than on PS2.

Agreed. In fact, I believe the miss to main memory is almost exactly the same between PS2 and Xbox 360 in nanoseconds, but of course the instruction throughput has increased by somewhere around 30x!

However, it's only the latency that is the same. For DMA systems, setting up a transfer is somewhat costly, and there's often an overhead to starting and stopping. On a cache-based system, it's a single instruction, and since everything is cache-line-sized, there's very little penalty for small transfers. Or rather, there's very little gain from large transfers! So while you need to set up your fetches earlier in both systems, setting them up in a cache-based system is cheaper. Most crucially, if you mis-predict one time in a thousand, or run out of local storage, cache systems don't crash, they just run a bit slower :-)

For example, looping animations need you to fetch the very end and the very start of the animation. On a cache system, just prefetch one place - it's not worth the extra computation to optimise for that rare case. On a DMA-based system, you don't have a choice.

Sorry, little rant about caches vs DMA there. I know you weren't suggesting they were inherently good. And of course it's perfectly true that a cache takes more gates to implement.

> If you have a system working on PS2 it's much easier to port it to PC/Xbox than the other way around - especially if you want to get some performance out. PS3 is again different from the other two and still closer to PS2 in programming ideology.

Erm... not really. The PS3 is similar to nothing in its ideology. Having a good PS2 engine doesn't help you at all in having a good PS3 engine - except in raising your pain threshold. In theory you could say that VU0 was like an SPU, but in practice VU0 was so crippled it was almost useless (in my previous life writing games we used it as extra scratchpad memory :-). Hopefully SPUs will be slightly more useful.

It is true that if you have a PS2 engine you can relatively easily port it to PC/Xbox (and I have done exactly that - when designing a cross-platform engine, the PS2 was the focus and then the other two were easy). However, the PS3 is somewhat at right-angles to everything. It's true that you can simply put each SPU on a "thread" and then instead of DMAs you do memcpy, but it's not exactly elegant or efficient.

> Anyway I don't pretend what I have is best. What I'm saying is try your PS3 plan ASAP. Tweak/change your plan until you are happy with the performance. When that happens, see how it maps to the other platforms and you'll find that there is hardly anything left to do. ;)

Well, maybe. It does mangle the PC/Xb/Xb360/Mac/GC/PS2/PSP/Rev (yeah, we do a lot of platforms!) code quite a lot for this one platform, and one of the things we value highly is a clean codebase, since customers actually get to see it all. I also think that at best the PS3->whatever back-port will be the same speed - but it might well be slower, because essentially you're trading computing cycles to make up for not having as much intelligence in the memory hierarchy.

But I do like your incremental update idea - certainly worth looking at. I'm slightly worried about keeping chunks of temporary memory around per-instance, but it probably isn't all that big.

TomF.

> -----Original Message-----
> From: gda...@li... [mailto:gda...@li...] On Behalf Of Yordan Gyurchev
> Sent: 29 October 2005 05:33
> To: gda...@li...
> Subject: Re: [Algorithms] skeletal animation system
>
> I just want to point out that I never advocated reusing the same animation samples (or caching them) in multiple characters. And as Tom points out, the chances of that happening are tiny unless your scene dictates otherwise or you are not dealing with skeletal animation at all. So let's put this point to rest. ;)
>
> What I have are small structures per-character that hold cached sample info and are used mostly when characters are updated. One structure = one character update. They have maximum locality and require minimum fetch operations that can be predicted/anticipated. This is where my no-cache-misses approach comes from. It also allows you not to care whether you have the whole animation in memory, allowing you to stream from disk if you have some crazy long cutscenes.
>
> Think of it as a reader head that goes through the stream and caches the current data... when new data is needed it fetches it from the stream. The fetch here can be an asynchronous operation as the sample cache is reading ahead.
> (cache is used as a programming concept here rather than a real hw cache)
>
> structure
> ------
> control data
> time in the animation
> control points for bone 1 (could be 5 controls - 4 used and 1 prefetched in anticipation)
> control points for bone 2
> ...
> control points for bone N
> ...
> OutputSkeleton state
> ------
>
> So the algorithm works all the time on this local-to-the-PU (or prefetched) structure. No main memory access. When the time in the animation has advanced enough that control point 4 becomes control point 3, and 5 -> 4... we need to prefetch a new control point 5, and all this asynchronously so no stalls occur anywhere, i.e. no code waiting for data.
>
> Update will be something like this:
> ----
> pre-fetch next character
> calculate current character
> store current character
> move next: make current character = next character
> ----
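(In C++ terms, that double-buffered loop might look like the sketch below; AnimState and the dma_* primitives are stand-ins for the per-character block and the platform's asynchronous transfer API, i.e. assumptions, not real calls.)

----
struct AnimState { char payload[8192]; };  // stands in for the block above

// assumed transfer primitives - not a real API:
void dma_start_read(void* localDst, const void* mainSrc, unsigned bytes);
void dma_start_write(void* mainDst, const void* localSrc, unsigned bytes);
void dma_wait(const void* localBuf);  // block until that slot's transfers finish
void dma_wait_all();
void Calculate(AnimState* s);         // sample/interpolate one character

// Two slots in local memory (scratchpad): while one character is being
// calculated, the next is in flight and the previous is storing back.
// Assumes transfers queued on the same slot complete in order.
AnimState local_buf[2];

void UpdateAllCharacters(AnimState* mainMem, int count) {
    dma_start_read(&local_buf[0], &mainMem[0], sizeof(AnimState)); // prime
    for (int i = 0; i < count; ++i) {
        int cur = i & 1;
        if (i + 1 < count)  // pre-fetch next character into the other slot
            dma_start_read(&local_buf[cur ^ 1], &mainMem[i + 1], sizeof(AnimState));
        dma_wait(&local_buf[cur]);  // current character has arrived
        Calculate(&local_buf[cur]); // all accesses are local - no misses
        dma_start_write(&mainMem[i], &local_buf[cur], sizeof(AnimState)); // store
        // "move next" is implicit: cur flips to the other slot next iteration
    }
    dma_wait_all();  // drain the last writebacks
}
----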
> > Note that I have not tried this on the PS2's scratchpad - that is rather small. The cache performance on the PS2 was so abysmal in every way that it would have been a big big job to get it even half-decent on there.
>
> My approach works on the PS2 scratchpad. The differences from the PC code are 5 lines that are ifdef-ed; all other code is identical on all platforms. (This is with a normal lerp for keyframes... with four spline controls per bone I don't fit in the scratchpad either.) I also double-buffer the scratchpad so I keep the processor busy.
>
> What makes you think that cache performance is going to improve on PS3? Or Xbox 360? As far as I have read, the penalties for cache misses on both platforms are considerable. Actually, relative to the speed of the processing units the situation is much worse than on PS2.
>
> If you have a system working on PS2 it's much easier to port it to PC/Xbox than the other way around - especially if you want to get some performance out. PS3 is again different from the other two and still closer to PS2 in programming ideology.
>
> As you point out, Tom, the other two (non-PS) platforms are trivial. Most techniques John mentioned are supported directly by OpenMP. This is no coincidence, as the symmetric/shared-memory architecture favours that. They are trivial because most techniques on these architectures are based around "how to quickly make a serial program into a parallel one", loop parallelism being by far the most popular. My approach is based mainly around that pattern, with several sync points.
>
> Anyway I don't pretend what I have is best. What I'm saying is try your PS3 plan ASAP. Tweak/change your plan until you are happy with the performance. When that happens, see how it maps to the other platforms and you'll find that there is hardly anything left to do. ;)
>
> John, you talk about writing schedulers. Do these run in the main thread? Have you tried having separate threads/processors schedule themselves with a shared task queue/task pool approach? Just curious what the differences are going to be in PU utilization when you add more PUs, although 1.9 is pretty good :)
>
> -Yordan
>
> ----- Original Message -----
> From: "Tom Forsyth" <tom...@ee...>
> To: <gda...@li...>
> Sent: Saturday, October 29, 2005 4:12 AM
> Subject: RE: [Algorithms] skeletal animation system
>
> > My argument here was that new keyframes/knots are required infrequently. With a good spline fitting algorithm you should be experiencing the same.
>
> Well, from frame 1 of character 3, bone 4 to frame 2 of character 3, bone 4, yes - you will need almost exactly the same data. The problem is, those two samplings are an entire 60th of a second apart. You'll want to fill local store/cache with tons of other data in the meantime.
>
> The chances of characters 4, 5, 6, etc. needing the same 3-4 knots or keyframes in frame 1 are tiny, according to my data. But yes, maybe if you have a big crowd of characters all running the same animation at roughly the same phase, you'll get re-use. I'm just saying, I don't think that is even close to the worst-case performance, nor do I think it is even a common case. But it does depend on your game type.
>
> > Do you have similar control structures or do you rely on the animation being prefetched into local storage? If so, do you have a guarantee that every single one of your animations is going to fit there? Remember this would be about 1/3 (in practice more like 1/4) of the real local storage space if you plan to triple-buffer to keep the PU busy.
>
> I only fetch the 3-4 control points in each spline that the current character needs. If the next character happens to also need them, then hey - that's fine - the cache might do useful work. But I assume it doesn't, and I assume that each character has to fetch all-new data, because that is the worst case, and it's also the common case. So I need to prefetch that data before I do the sampling & blending.
>
> Space usually isn't a problem - I only need 4 control points from pos & orn for each bone for each animation. In practice, the size of the required control data (e.g. how long each animation is, what sort of compression each spline uses, etc.) is bigger than the control point data I need this particular frame. But none of it is really very large - it fits in all sensible-sized caches and local data storage just fine.
>
> Note that I have not tried this on the PS2's scratchpad - that is rather small. The cache performance on the PS2 was so abysmal in every way that it would have been a big big job to get it even half-decent on there. Since I have five other platforms to deal with, I just waited for someone to try to run a complex scene on the PS2 before I worried about it. And nobody ever did, so it never had to be done. And probably never will now. Which I am very happy about :-)
>
> > You have to decide where in the animation stream each quadruple of controls for each bone is, in order to get only those quadruples and DMA them into the local PU memory.
>
> Correct.
>
> > Now I can tell you straight away (and I'm almost sure things won't change much) that on PS2 one solid DMA transfer is far better than a number of small ones. They have some setup time, need to be prioritized, etc. So the fewer you have the better. Not only that, you have to spend some CPU time deciding/calculating where exactly in your animation these quadruples of keyframes are, and that is not something you want to be doing every frame - or do you?
>
> Well, you have to decide which controls you need anyway - that's part of the spline sampling process. So that's no extra cost.
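(That decision is just a knot search over the spline's control times - something the sampler does anyway. A minimal sketch, assuming a plain sorted array of knot times; real formats are compressed, but the idea is the same.)

----
#include <algorithm>

// Returns the index of the first control needed to sample at time t with
// quadratic interpolation (three controls), clamped at the spline's ends.
// Requires knotCount >= 3.
int FirstControlIndex(const float* knotTimes, int knotCount, float t)
{
    // locate the knot span containing t
    const float* it = std::upper_bound(knotTimes, knotTimes + knotCount, t);
    int span = int(it - knotTimes) - 1;
    if (span < 1) span = 1;
    if (span > knotCount - 2) span = knotCount - 2;
    return span - 1;   // fetch controls [span-1 .. span+1]
}
----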
> Let me preface this with a warning - I have not tried a DMA-based scheme yet. That's because on every other platform, we're not talking about DMAs, we're talking about normal cache accesses, and they don't care about small/large transfers - everything is just the size of a cache line, and setting up a transfer is just executing a "prefetch" with the relevant address - extremely cheap. All you need to do is make sure you do those prefetches far enough in advance (to absorb the latency of the transfer from main memory), and life is good. The only platforms that DMAs are needed for are PS2 (avoided doing this - see above) and PS3.
>
> So, here is my entirely theoretical plan for PS3 - due to be implemented some time next year.
>
> Granny allows animation splines to be listed in any order you like in memory - they are remapped to the skeleton on the fly. This is a big bonus for a bunch of things, but that would be a very long discussion, so let's just take that as read. Some splines will have lots of knots/controls, and some will be able to use different sorts of compression, and so they will all be different lengths in bytes. So, given that, order your spline data in memory from smallest to largest.
>
> Now, decide what your magic byte length is for your platform. This is going to be largely trial and error, but it's roughly the place where the cost of setting up a new transfer exceeds the cost of transferring the extra data. So let's say that's 64 bytes (chosen at random). That is, if you wanted to transfer two chunks of 12 bytes that were 64 bytes apart, it would be the same speed to set up a single transfer of 64+12 bytes as to do two transfers of 12 bytes each.
>
> OK, so we know we need 12 bytes (roughly - it varies with compression and spline degree) from each spline in the list. And they're ordered by size. So we just need to find the first spline with data longer than 64 bytes. All the splines before it in the list just get their entire data DMA'd over in a single chunk - it's not efficient to do them with separate transfers. All the ones afterwards just get a 12-byte transfer set up for each of them, because that's the only bit you need, and you'd probably run out of local store if you transferred the whole lot, as well as having to wait longer for the transfer.
>
> So that's the theory. I have yet to try it in practice though. Now, it may be that the number is not 64 bytes, it's more like 4 kbytes or something. In which case this mostly just collapses to "transfer the whole animation over". Well, so be it. But people do use some pretty long animations with Granny (I've had people export 15-minute credit sequences as a single animation!), so at some point you need to cope with not being able to DMA the whole thing over.
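(A sketch of that plan in C++, under the same assumptions: splines laid out contiguously from smallest to largest, kMagicBytes as the trial-and-error crossover, and dma_fetch plus ControlWindowOffset as hypothetical stand-ins, not a real API.)

----
const int kMagicBytes = 64;  // "chosen at random" - tune per platform

struct SplineRef {
    const char* data;      // control stream in main memory
    int         sizeBytes; // total length; the list is sorted by this
};

// stand-ins: dma_fetch queues a transfer and returns the advanced local-store
// cursor; ControlWindowOffset finds the ~12-byte control window for time t.
char* dma_fetch(char* localDst, const char* mainSrc, int bytes);
int   ControlWindowOffset(const SplineRef& s, float t);

char* FetchSplineData(const SplineRef* splines, int count,
                      const char* splineBase, char* localStore, float t)
{
    // find the first spline longer than the magic length
    int firstBig = 0;
    while (firstBig < count && splines[firstBig].sizeBytes <= kMagicBytes)
        ++firstBig;

    // the small splines form a contiguous prefix: one DMA for the lot
    if (firstBig > 0) {
        const SplineRef& last = splines[firstBig - 1];
        int prefixBytes = int(last.data + last.sizeBytes - splineBase);
        localStore = dma_fetch(localStore, splineBase, prefixBytes);
    }

    // each big spline gets its own ~12-byte windowed transfer
    for (int s = firstBig; s < count; ++s) {
        int offset = ControlWindowOffset(splines[s], t);
        localStore = dma_fetch(localStore, splines[s].data + offset, 12);
    }
    return localStore;
}
----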
> > > > > By "fetch", I meant you're fetching them from main memory > > into either a > > > cache or some sort of local storage (e.g. SPU memory). I > > was assuming the > > > animation has already been loaded off disk, otherwise > > you're doomed! > > > Granny > > > uses splines rather than keyframes, but the data access > > patterns are > > > roughly > > > similar to a sparse keyframe system. > > Fetch is = main memory => PU > > I use the word keyframe for both normal keyframes and spline > > knots (probably > > wrong but in our case its still some data at some point in > > time regardless > > of the interp method). > > > > > So IMHO it's best to focus optimisation on what you do at > > each leaf, which > > > can usually be expressed as sample-and-blend. > > Agreed. This is what I'm most interested in as well. And > > analyser data shows > > most time is spend there. > > > > I find it difficult to follow what exactly you do with the PU > > and what goes > > to local memory and back to main etc (see later my rant on > > DMA transfers). > > IMHO there needs to be a number of support structures. Inputs > > like where in > > the animation you are, blending weights, parameters, etc. And > > outputs like > > the final skeleton state (what is rendered), other output > > info (we return > > for example velocity of the root or absolute movement). > > > > In my approach I put all my support structures in one solid > > chunk and call > > that animation state. Its all the data the task will need > > (regardless of the > > scale of the task - be it a full character or not). > > > > So we move the animation state into local memory, process > > that character, > > then move it back. It can be structured so read-only data > > goes only one way > > without problem. > > > > Now the question is what goes in this animation state and > > what stays in main > > memory and is fetched on request (or pre-streamed). With my > > approach I keep > > a duplicates of the control points (key frames) that are currently > > interpolated into local interpolation structures (local to > > the animation > > state). So, only when I move to the next keyframe (require > > next knot) I > > touch main memory to fetch it (or not if it is pre-streamed > > in local). My > > argument here was that new keyframes/knots are required > > infrequently. With a > > good spline fitting algorithm you should be experiencing the > > same. Mine is > > (I assume) worse than yours by a mile and still keyframe > > fetches (and for > > that reason cache misses) are not an issue when I analyze the > > execution. > > Even more you can fetch asynchronously an extra > > keyframe/control so you > > don't have any stalls. > > > > Do you have similar control structures or you rely that > > animation is going > > to be prefetched in local storage? If so do you have > > guarantee that every > > single of your animations is going to fit there? Remember > > this would be > > about 1/3 (in practice more like 1/4) of real local storage > > space if you > > plan to triple buffer to keep the PU busy. > > > > May be I'm missing something here. > > > > > Whereas source animations can be compressed far smaller > > than this - we > > > average around 4 bytes per spline control, and you usually > > need to sample > > > the nearest three controls in a spline (quadratic > > interpolation), and it's > > > read-only data, so around 12 bytes of traffic. Obviously > > these are really > > > rough figures, but there's still a significant difference. 
> > Yes, I understand the maths, but I think you are focusing on the wrong thing here. It's not always how much you transfer, but in what chunks and how often. You have to decide where in the animation stream each quadruple of controls for each bone is, in order to get only those quadruples and DMA them into the local PU memory. Now I can tell you straight away (and I'm almost sure things won't change much) that on PS2 one solid DMA transfer is far better than a number of small ones. They have some setup time, need to be prioritized, etc. So the fewer you have the better. Not only that, you have to spend some CPU time deciding/calculating where exactly in your animation these quadruples of keyframes are, and that is not something you want to be doing every frame - or do you?
> >
> > If you want to directly sample data you need to move most of your animation (if possible all of it) into local mem. Or consider some genius animation format where you can DMA solid blocks every time you need to sample data (once per animation).
> >
> > I must say that "load every animation and sample all characters" seems attractive, though it hides the potential problems of either not being able to fit the animation (too long an animation) or the sample data (too many characters).
> >
> > On the topic of perfect load balancing I agree with you.
> >
> > -Yordan
> >
> > ----- Original Message -----
> > From: "Tom Forsyth" <tom...@ee...>
> > To: <gda...@li...>
> > Sent: Friday, October 28, 2005 8:27 PM
> > Subject: RE: [Algorithms] skeletal animation system
> >
> > > You are talking about these few most important characters that you see in the foreground.
> >
> > I'm also talking about decent numbers of characters. In fact, if there's 1000s of characters in the scene, then keeping each character's animation inside a single PU makes even more sense.
> >
> > > Due to the overlapping nature of the memory allocation (especially in cross-fade blends) you have implied data separation that might as well be utilized.
> >
> > Sorry, you've totally lost me there :-)
> >
> > > > You do very few blends between already-sampled data.
> > >
> > > Actually we almost never do anything but that. Different perspective here. We do a few N-blends based on parameters like vectors, quaternions, etc. You can do an "artistic" IK look-alike that way. I'm sure there was a link on this list some time ago... about this.
> >
> > Bad mis-communication on my part. What I meant is that because trees are broad but shallow, you are almost always blending together data sampled directly from an animation (the leaves of the tree). You rarely blend data that is the result of a previous blend (the branches and trunk of the tree). So IMHO it's best to focus optimisation on what you do at each leaf, which can usually be expressed as sample-and-blend.
> >
> > Of course, in a purely binary tree, you have roughly equal numbers of the two operations. But that's one reason why a binary tree isn't that great an idea in general :-)
> >
> > However, you do say that in your data sets you often have just one animation running, so it's a sample with no blending. Granny also has a special-cased path for this (only one active animation).
> > In practice, some games hit that a lot, some hit it almost never - it all depends on the data sets. I would be cautious about optimising for the case that is already pretty fast! There's a minor code-maintenance issue, but it's only two paths, and they call a lot of the same functions (which the compiler inlines perfectly well).
> >
> > > most of the time you are doing interpolation between key frames (that are stored locally) rather than fetching them. If your streams are so densely populated that you have to fetch on every update you are doing something wrong, which I can't imagine being the case with Granny for example.
> >
> > By "fetch", I meant you're fetching them from main memory into either a cache or some sort of local storage (e.g. SPU memory). I was assuming the animation has already been loaded off disk, otherwise you're doomed! Granny uses splines rather than keyframes, but the data access patterns are roughly similar to a sparse keyframe system.
> >
> > The way Granny is oriented is on a character-by-character basis. Each character is sampled & blended fully, pulling in the data on demand, and then you do the next one. There are obvious exceptions for characters that interact with each other, but they're certainly not the average case. This keeps all the intermediate data nice and warm in caches or local storage, and only the inputs (animations) and final outputs (what you render) live in main memory.
> >
> > To do effective streaming from main memory, you can either prefetch/DMA all the animations for a character before you start sampling the first one (this works well when characters are playing a lot of animations), or you can prefetch character 2 before you sample & blend character 1 (this works well when characters are playing much fewer or simpler animations). If prefetch times are long, and local storage/cache space is large, you can prefetch character 3 or 4 before doing character 1, but that tends to get counterproductive and can thrash caches (which is of course lethal).
> >
> > Now, you could possibly orient it around animations instead. Load each animation just once, then do all the sampling of that animation required for all the characters in the scene. This means you only fetch each animation's data just once. The problem is, the results of the sampling need to be stored somewhere, but because you're doing all the sampling of that animation in one go, you're unlikely to use that result soon enough for it to stay in a cache or fit in limited local storage. And it's big - typically a vec3 of position and a vec4 of orientation, or 28 bytes in total, and you need to write the data then read it back later - bus traffic of 56 bytes. Whereas source animations can be compressed far smaller than this - we average around 4 bytes per spline control, and you usually need to sample the nearest three controls in a spline (quadratic interpolation), and it's read-only data, so around 12 bytes of traffic. Obviously these are really rough figures, but there's still a significant difference.
> >
> > > Granularity. By doing this, unless you build some very fine scheduling metric, you are going to have trouble balancing PU utilization.
> > > Some characters will have longer blend trees than others, taking more processing time, etc. Unless you are doing all anim operations on one PU. Still, without test data on this one I may be completely off target.
> >
> > In general (obviously waving my hands a bit here), if you have more than about 4x the number of tasks than processors, and the time taken by each task is distributed in a roughly bell-curvish way, your granularity is "fine enough". Splitting the work into finer chunks might get slightly more level load-balancing, but the overhead of doing the splitting, and the extra synchronisation required because you have more tasks, means that you don't actually get any benefit.
> >
> > To put this another way - the worst-case scenario is that all PUs finish all available tasks, and have to wait for a single PU to do a single task. They never have to wait for that last PU to do more than one task (because otherwise one of the idle ones would do it instead). So if the total number of tasks divided by the number of PUs is more than 10, then the worst you can do is waste 10% of your processing power. Wasting only 10% of your total power is considered superb efficiency in the multiprocessing world! This assumes that the tasks are not dependent and can be processed by any PU in any order, but if you partition by characters, this is 99% true (and for the few that are dependent on something else, just schedule those first). Certainly with 1000s of characters and only around 8 PUs (e.g. PS3), I just can't see it being a problem.
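(To make that bound concrete - a worked example with the figures from the text, assuming roughly comparable task times: with P PUs and N independent tasks, the worst case is P-1 PUs idling for one task out of the ~N/P each PU averages, i.e. a wasted fraction of about P/N. With 8 PUs, 80 characters gives 80/8 = 10 tasks per PU and at most ~10% waste; 1000 characters gives 125 per PU and well under 1%.)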
> > TomF.
> >
> > > -----Original Message-----
> > > From: gda...@li... [mailto:gda...@li...] On Behalf Of yg...@gy...
> > > Sent: 28 October 2005 03:08
> > > To: gda...@li...
> > > Subject: RE: [Algorithms] skeletal animation system
> > >
> > > Hi Tom,
> > >
> > > Points, well made.
> > >
> > > I think we come from different points of view. You are talking about these few most important characters that you see in the foreground. I'm focusing on LOTS of characters in the background (instancing numbers, hundreds, thousands, etc.) and not-so-complicated animation operations (cross-fade blends being by far the most popular blend in my experience in these cases). Due to the overlapping nature of the memory allocation (especially in cross-fade blends) you have implied data separation that might as well be utilized.
> > >
> > > Also in my experience, especially when animating all kinds of entities (that have some external attachments between them), it's good to process all animations before you get to actually processing your entities (characters if you want). I think we agree on this point.
> > >
> > > > Every now and then you get multiple characters doing roughly the same animation, and the cache helps, but I regard this as luck.
> > >
> > > We have actually done some research and testing on that. I don't think I agree with this statement when it comes to crowds. We (a colleague of mine) do massive character scenes by sorting morphed data. We upload the morph targets and then send instance data (translate, rotate). This achieves amazing (talking 1000+) quantities of characters on screen on a current console platform. I was convinced at first that this would never work because, like you, I didn't imagine there would be any reasonably exploitable "roughly the same keyframe" animation.
> > >
> > > > we need to optimise the worst-case scenarios, not the best!
> > >
> > > Quite right ;) In my case the worst scenario happens when you do fast forward, because you hit the anim stream way too often. Unless pre-streamed like you suggest.
> > >
> > > > So, in conclusion - assume every time you sample an animation, you will miss every cache.
> > >
> > > Agreed. However, I'm arguing (and in my experience and analyser data) that most of the time you are doing interpolation between key frames (that are stored locally) rather than fetching them. If your streams are so densely populated that you have to fetch on every update you are doing something wrong, which I can't imagine being the case with Granny for example.
> > >
> > > I do agree that blend trees are usually not that tall, and binary blending (except for cross-fades) is not the best under the sun.
> > >
> > > > You do very few blends between already-sampled data.
> > >
> > > Actually we almost never do anything but that. Different perspective here. We do a few N-blends based on parameters like vectors, quaternions, etc. You can do an "artistic" IK look-alike that way. I'm sure there was a link on this list some time ago... about this.
> > >
> > > > Thus, although it is elegant to split the processing into (a) sample all your animations into temporary buffers and then (b) blend the buffers together, in practice it makes more sense to have your animation sampling also do the weighted accumulation, so that you can focus all your optimisation on that one routine, and not have that intermediate data hang around polluting caches - as soon as you sample a bone's data, you add it to its blend.
> > >
> > > My approach is: don't do any blends if you don't have to. So you sample/interpolate your anims and most often, if they are not in a cross-fade, you are done. If there is something else to be done it's another task. Makes for better code separation. With your solution you have to make a fit-all process function. Still, your approach will be less memory-consuming and less of a burden on the bus, so it could be worth the maintainability-performance trade-off.
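(A sketch of that fused sample-and-accumulate routine; the types and Sample*/math helpers are illustrative assumptions, not Granny's API. The usual caveat applies that summed quaternions need normalising at the end.)

----
struct Vec3 { float x, y, z; };
struct Quat { float x, y, z, w; };
struct Animation;  // opaque here

// assumed sampling/math helpers - illustrative only:
Vec3  SamplePosition(const Animation&, int bone, float t);
Quat  SampleOrientation(const Animation&, int bone, float t);
Vec3  Add(Vec3 a, Vec3 b);   Vec3 Scale(Vec3 v, float s);
Quat  Add(Quat a, Quat b);   Quat Scale(Quat q, float s);
float Dot(Quat a, Quat b);   Quat Negate(Quat q);

// For each active animation, sample each bone and immediately add the
// weighted result into the output accumulators; the intermediate sample
// never has to live in a buffer, so it can't pollute caches.
void SampleAndAccumulate(const Animation& anim, float t, float weight,
                         Vec3* posAccum, Quat* ornAccum, int boneCount)
{
    for (int b = 0; b < boneCount; ++b) {
        Vec3 p = SamplePosition(anim, b, t);   // spline/keyframe sample
        Quat q = SampleOrientation(anim, b, t);
        posAccum[b] = Add(posAccum[b], Scale(p, weight));
        // keep quaternions in the same hemisphere before summing
        if (Dot(q, ornAccum[b]) < 0.0f) q = Negate(q);
        ornAccum[b] = Add(ornAccum[b], Scale(q, weight));
    }
    // once every contributing animation has been accumulated, each
    // ornAccum[b] is normalised to give the blended orientation
}
----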
> > > > So the point I'm trying to (finally) get to is that trying to parallelise the operations done on one particular character seems like a lot of effort for little benefit.
> > >
> > > I'm not after parallelising operations on one character. I'm for parallelising all operations on all characters in one go. The animation system doesn't know about a "character" as such... it is just animating entities...
> > >
> > > Your main point here is that the serial term of the algorithm being dominant reduces or invalidates any benefit from the parallel part. I must agree that it's a good point, but without any test data I can't say if that is the case or not. If task scheduling (and DMA transfers) is properly interleaved (double or triple buffered) you don't get any penalty for that. Still, a very valid point.
> > >
> > > > IMHO, a far better thing is to assume that the number of characters on-screen is significantly larger than the number of PUs, and simply do all the anim sampling and blending of each character on a single PU.
> > >
> > > Granularity. By doing this, unless you build some very fine scheduling metric, you are going to have trouble balancing PU utilization. Some characters will have longer blend trees than others, taking more processing time, etc. Unless you are doing all anim operations on one PU. Still, without test data on this one I may be completely off target.
> > >
> > > Thanks ever so much for the feedback. In fact I think I'm going to try some of your ideas and see how they compare with the existing solution.
> > >
> > > -Yordan