RE: [Algorithms] skeletal animation system
From: Tom F. <tom...@ee...> - 2005-10-29 03:12:43
> My argument here was that new keyframes/knots are required
> infrequently. With a good spline fitting algorithm you should be
> experiencing the same.

Well, from frame 1 of character 3, bone 4 to frame 2 of character 3, bone
4, yes - you will need almost exactly the same data. The problem is, those
two samplings are an entire 60th of a second apart. You'll want to fill
local store/cache with tons of other data in the meantime.

The chances of characters 4, 5, 6, etc. needing the same 3-4 knots or
keyframes in frame 1 are tiny, according to my data. But yes, maybe if you
have a big crowd of characters all running the same animation at roughly
the same phase, you'll get re-use. I'm just saying, I don't think that is
even close to the worst-case performance, nor do I think it is even a
common case. But it does depend on your game type.

> Do you have similar control structures or do you rely on the animation
> being prefetched in local storage? If so, do you have a guarantee that
> every single one of your animations is going to fit there? Remember this
> would be about 1/3 (in practice more like 1/4) of real local storage
> space if you plan to triple buffer to keep the PU busy.

I only fetch the 3-4 control points in each spline that the current
character needs. If the next character happens to also need them, then hey
- that's fine - the cache might do useful work. But I assume it doesn't,
and I assume that each character has to fetch all-new data, because that
is the worst case, and it's also the common case. So I need to prefetch
that data before I do the sampling & blending.

Space usually isn't a problem - I only need 4 control points from pos &
orn for each bone for each animation. In practice, the size of the
required control data (e.g. how long each animation is, what sort of
compression each spline uses, etc.) is bigger than the control point data
I need this particular frame.
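As a rough illustration of why only a handful of controls per bone need
fetching each frame, here is a sketch of picking the control window for a
quadratic sample. Everything here is hypothetical and simplified - real
Granny data is compressed, and a real implementation would resume the
search from the previous frame's position rather than scanning:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Window of control points needed to sample one spline at time t.
struct ControlWindow {
    std::size_t first;  // index of the first control point needed
    std::size_t count;  // number of control points (3 for quadratic)
};

// Assumes at least 3 knots; knot times are sorted ascending.
inline ControlWindow FindWindow(const std::vector<float>& knots, float t) {
    // Find the knot interval containing t (linear scan for clarity).
    std::size_t i = 0;
    while (i + 1 < knots.size() && knots[i + 1] <= t) ++i;
    // Quadratic interpolation needs the control at i and its neighbours.
    std::size_t first = (i == 0) ? 0 : i - 1;
    std::size_t count = 3;
    if (first + count > knots.size()) first = knots.size() - count;
    return {first, count};
}
```

Only those `count` controls (roughly 12 bytes, compressed) ever need to
leave main memory for this spline this frame.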
But none of it is really very large - it fits in all sensible-sized
caches and local data storage just fine.

Note that I have not tried this on the PS2's scratchpad - that is rather
small. The cache performance on the PS2 was so abysmal in every way that
it would have been a big big job to get it even half-decent on there.
Since I have five other platforms to deal with, I just waited for someone
to try to run a complex scene on the PS2 before I worried about it. And
nobody ever did, so it never had to be done. And probably never will now.
Which I am very happy about :-)

> You have to decide where in the animation stream each quadruple of
> controls for each bone is, in order to get only those quadruples and
> DMA them into local PU memory.

Correct.

> Now I can tell you straight away (and I'm almost sure things won't
> change much) that on PS2 one solid DMA transfer is far better than a
> number of small ones. They have some setup time and need to be
> prioritized, etc. So the fewer you have the better. Not only that, you
> have to spend some CPU time deciding/calculating where exactly in your
> animation these quadruples of keyframes are, and that is not something
> you want to be doing every frame, or do you?

Well, you have to decide which controls you need anyway - that's part of
the spline sampling process. So that's no extra cost.

Let me preface this with a warning - I have not tried a DMA-based scheme
yet. That's because on every other platform, we're not talking about
DMAs, we're talking about normal cache accesses, and they don't care
about small/large transfers - everything is just the size of a cache
line, and setting up a transfer is just executing a "prefetch" with the
relevant address - extremely cheap. All you need to do is make sure you
do those prefetches far enough in advance (to absorb the latency of the
transfer from main memory), and life is good.
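On a cache-based platform the whole scheme is roughly the following
sketch. The builtin is GCC/Clang-specific, `Spline` and `SampleSpline`
are stand-ins rather than real Granny API, and the prefetch distance is a
made-up tuning knob:

```cpp
#include <cstddef>

// Stand-in spline type for illustration; real data is compressed.
struct Spline { const unsigned char* controls; };

void SampleSpline(const Spline&, float) { /* decode & accumulate here */ }

// Hint the control data of a spline a few iterations ahead while
// sampling the current one, so main-memory latency hides behind useful
// work. Returns the number of hints issued, just to make it checkable.
std::size_t SampleAll(const Spline* splines, std::size_t n, float t) {
    const std::size_t kAhead = 4;  // tuning knob: prefetch distance
    std::size_t hints = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kAhead < n) {
#if defined(__GNUC__)
            __builtin_prefetch(splines[i + kAhead].controls);
#endif
            ++hints;
        }
        SampleSpline(splines[i], t);
    }
    return hints;
}
```

The hint is free if the data is already resident, which is why assuming
the worst case (every fetch misses) costs nothing when you're wrong.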
The only platforms that DMAs are needed for are PS2 (avoided doing this -
see above) and PS3. So, here is my entirely theoretical plan for PS3 -
due to be implemented some time next year.

Granny allows animation splines to be listed in any order you like in
memory - they are remapped to the skeleton on the fly. This is a big
bonus for a bunch of things, but that would be a very long discussion, so
let's just take that as read. Some splines will have lots of
knots/controls, and some will be able to use different sorts of
compression, and so they will all be different lengths in bytes. So,
given that, order your spline data in memory from smallest to largest.

Now, decide what your magic byte length is for your platform. This is
going to be largely trial and error, but it's roughly the place where the
cost of setting up a new transfer exceeds the cost of transferring the
extra data. So let's say that's 64 bytes (chosen at random). That is, if
you wanted to transfer two chunks of 12 bytes that were 64 bytes apart,
it is the same speed to set up a single transfer that is 64+12 bytes
long, as to do two transfers of 12 bytes each.

OK, so we know we need 12 bytes (roughly - varies with compression and
spline degree) from each spline in the list. And they're ordered by size.
So we just need to find the first spline with data longer than 64 bytes.
All the splines before it in the list just get their entire data DMA'd
over in a single chunk - it's not efficient to do them with separate
transfers. All the ones afterwards just get a 12-byte transfer set up for
each of them, because that's the only bit you need, and you'd probably
run out of local store if you transferred the whole lot, as well as
having to wait longer for the transfer.

So that's the theory. I have yet to try it in practice though. Now, it
may be that the number is not 64 bytes, it's more like 4kbytes or
something.
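The theory above, sketched out (untested, as I said - the threshold, the
12-byte window, and all the names are the placeholder numbers from the
discussion, not a real implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Transfer { std::size_t offset, bytes; };  // into the spline block

// Plan DMA transfers for one frame. splineSizes is ordered smallest to
// largest (splines are laid out contiguously in that order), and
// windowOffsets[i] is where this frame's ~12-byte control window sits
// within spline i. kMagicBytes is the trial-and-error point where a new
// transfer's setup cost exceeds transferring the extra data.
std::vector<Transfer> PlanTransfers(
        const std::vector<std::size_t>& splineSizes,
        const std::vector<std::size_t>& windowOffsets,
        std::size_t kMagicBytes = 64,
        std::size_t kWindowBytes = 12) {
    std::vector<Transfer> plan;
    // Everything at or below the threshold is a contiguous prefix of the
    // block, so it all goes over in one solid chunk.
    std::size_t i = 0, prefixBytes = 0;
    while (i < splineSizes.size() && splineSizes[i] <= kMagicBytes) {
        prefixBytes += splineSizes[i];
        ++i;
    }
    if (prefixBytes) plan.push_back({0, prefixBytes});
    // Each remaining (large) spline gets one small transfer for just the
    // control window needed this frame.
    std::size_t base = prefixBytes;
    for (; i < splineSizes.size(); ++i) {
        plan.push_back({base + windowOffsets[i], kWindowBytes});
        base += splineSizes[i];
    }
    return plan;
}
```

Raise `kMagicBytes` to 4096 and the whole-data prefix usually swallows
everything, which is the collapse-to-"transfer the whole animation" case.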
In which case this mostly just collapses to "transfer the whole animation
over". Well, so be it. But people do use some pretty long animations with
Granny (I've had people export 15-minute credit sequences as a single
animation!), so at some point you need to cope with not being able to DMA
the whole thing over.

As I said above, on every platform but the PS*, it's trivial. They have
caches, and you just do a prefetch instruction per spline. On PS3, it's
somewhat more effort <shakes fist at Sony/IBM for their crazy
architecture>.

TomF.

> -----Original Message-----
> From: gda...@li...
> [mailto:gda...@li...] On
> Behalf Of Yordan Gyurchev
> Sent: 28 October 2005 14:29
> To: gda...@li...
> Subject: Re: [Algorithms] skeletal animation system
>
> Ok, you've won me over. Processing one character/entity on one PU. So,
> one character = one task.
>
> When I think about it though, it fits my approach, just the scale is
> different.
>
> > By "fetch", I meant you're fetching them from main memory into either
> > a cache or some sort of local storage (e.g. SPU memory). I was
> > assuming the animation has already been loaded off disk, otherwise
> > you're doomed! Granny uses splines rather than keyframes, but the
> > data access patterns are roughly similar to a sparse keyframe system.
>
> Fetch is = main memory => PU.
> I use the word keyframe for both normal keyframes and spline knots
> (probably wrong, but in our case it's still some data at some point in
> time regardless of the interp method).
>
> > So IMHO it's best to focus optimisation on what you do at each leaf,
> > which can usually be expressed as sample-and-blend.
>
> Agreed. This is what I'm most interested in as well. And analyser data
> shows most time is spent there.
> I find it difficult to follow what exactly you do with the PU and what
> goes to local memory and back to main etc. (see later my rant on DMA
> transfers). IMHO there needs to be a number of support structures.
> Inputs like where in the animation you are, blending weights,
> parameters, etc. And outputs like the final skeleton state (what is
> rendered), other output info (we return for example velocity of the
> root or absolute movement).
>
> In my approach I put all my support structures in one solid chunk and
> call that the animation state. It's all the data the task will need
> (regardless of the scale of the task - be it a full character or not).
>
> So we move the animation state into local memory, process that
> character, then move it back. It can be structured so read-only data
> goes only one way without problem.
>
> Now the question is what goes in this animation state and what stays in
> main memory and is fetched on request (or pre-streamed). With my
> approach I keep duplicates of the control points (keyframes) that are
> currently interpolated in local interpolation structures (local to the
> animation state). So, only when I move to the next keyframe (require
> the next knot) do I touch main memory to fetch it (or not, if it is
> pre-streamed locally). My argument here was that new keyframes/knots
> are required infrequently. With a good spline fitting algorithm you
> should be experiencing the same. Mine is (I assume) worse than yours by
> a mile and still keyframe fetches (and for that reason cache misses)
> are not an issue when I analyze the execution. Even more, you can fetch
> an extra keyframe/control asynchronously so you don't have any stalls.
>
> Do you have similar control structures or do you rely on the animation
> being prefetched in local storage?
> If so, do you have a guarantee that every single one of your animations
> is going to fit there? Remember this would be about 1/3 (in practice
> more like 1/4) of real local storage space if you plan to triple buffer
> to keep the PU busy.
>
> Maybe I'm missing something here.
>
> > Whereas source animations can be compressed far smaller than this -
> > we average around 4 bytes per spline control, and you usually need to
> > sample the nearest three controls in a spline (quadratic
> > interpolation), and it's read-only data, so around 12 bytes of
> > traffic. Obviously these are really rough figures, but there's still
> > a significant difference.
>
> Yes, I understand the maths, but I think you are focusing on the wrong
> thing here. It's not always how much you transfer but in what chunks
> and how often. You have to decide where in the animation stream each
> quadruple of controls for each bone is in order to get only those
> quadruples and DMA them into local PU memory. Now I can tell you
> straight away (and I'm almost sure things won't change much) that on
> PS2 one solid DMA transfer is far better than a number of small ones.
> They have some setup time and need to be prioritized, etc. So the fewer
> you have the better. Not only that, you have to spend some CPU time
> deciding/calculating where exactly in your animation these quadruples
> of keyframes are, and that is not something you want to be doing every
> frame, or do you?
>
> If you want to directly sample data you need to move most of your
> animation (if possible all of it) into local mem. Or consider some
> genius animation format where you can DMA solid blocks every time you
> need to sample data (once per animation).
>
> I must say that "load every animation and sample all characters" seems
> attractive.
> Though it hides the potential problems of either not being able to fit
> the animation (too long an animation) or the sample data (too many
> characters).
>
> On the topic of perfect load balancing I agree with you.
>
> -Yordan
>
> ----- Original Message -----
> From: "Tom Forsyth" <tom...@ee...>
> To: <gda...@li...>
> Sent: Friday, October 28, 2005 8:27 PM
> Subject: RE: [Algorithms] skeletal animation system
>
> > You are talking about these
> > few most important characters that you see in the foreground.
>
> I'm also talking about decent numbers of characters. In fact, if
> there's 1000s of characters in the scene, then keeping each character's
> animation inside a single PU makes even more sense.
>
> > Due to the overlapping nature of the memory allocation (especially in
> > cross-fade blends) you have implied data separation that
> > might as well be utilized.
>
> Sorry, you've totally lost me there :-)
>
> > > You do very few blends between already-sampled data.
> > Actually we almost never do anything but that. Different perspective
> > here. We do a few N-blends based on parameters like vectors,
> > quaternions, etc. You can do an "artistic" IK look-alike that way.
> > I'm sure there was a link in this list some time ago... about this.
>
> Bad mis-communication on my part. What I meant is that because trees
> are broad but shallow, you are almost always blending together data
> sampled directly from an animation (the leaves of the tree). You rarely
> blend data that is the result of a previous blend (the branches and
> trunk of the tree). So IMHO it's best to focus optimisation on what you
> do at each leaf, which can usually be expressed as sample-and-blend.
>
> Of course, in a purely binary tree, you have roughly equal numbers of
> the two operations.
> But that's one reason why a binary tree isn't that great an idea in
> general :-)
>
> However, you do say that in your data sets you often have just one
> animation running, so it's a sample with no blending. Granny also has a
> special-cased path for this (only one active animation). In practice,
> some games hit that a lot, some hit it almost never - it all depends on
> the data sets. I would be cautious about optimising for the case that
> is already pretty fast! There's a minor code-maintenance issue, but
> it's only two paths, and they call a lot of the same functions (that
> the compiler inlines perfectly well).
>
> > most of the time you are doing interpolation between keyframes (that
> > are stored locally) rather than fetching them. If your streams are so
> > densely populated that you have to fetch every update, you are doing
> > something wrong, which I can't imagine being the case with Granny for
> > example.
>
> By "fetch", I meant you're fetching them from main memory into either a
> cache or some sort of local storage (e.g. SPU memory). I was assuming
> the animation has already been loaded off disk, otherwise you're
> doomed! Granny uses splines rather than keyframes, but the data access
> patterns are roughly similar to a sparse keyframe system.
>
> The way Granny is oriented is on a character-by-character basis. Each
> character is sampled & blended fully, pulling in the data on demand,
> and then you do the next one. There's obvious exceptions for characters
> that interact with each other, but they're certainly not the average
> case. This keeps all the intermediate data nice and warm in caches or
> local storage, and only the inputs (animations) and final outputs (what
> you render) live in main memory.
> To do effective streaming from main memory, you can either prefetch/DMA
> all the animations for a character before you start sampling the first
> one (this works well when characters are playing a lot of animations),
> or you can prefetch character 2 before you sample & blend character 1
> (this works well when characters are playing much fewer or simpler
> animations). If prefetch times are long, and local storage/cache space
> is large, you can prefetch character 3 or 4 before doing character 1,
> but that tends to get counterproductive and can thrash caches (which is
> of course lethal).
>
> Now, you could possibly orient it around animations instead. Load each
> animation just once, then do all the sampling of that animation
> required for all the characters in the scene. This means you only fetch
> each animation's data just once. The problem is, the results of the
> sampling need to be stored somewhere, but because you're doing all the
> sampling of that animation in one go, you're unlikely to use that
> result soon enough for it to stay in a cache or fit in limited local
> storage. And it's big - typically a vec3 of position and a vec4 of
> orientation, or 28 bytes in total, and you need to write the data then
> read it back later - bus traffic of 56 bytes. Whereas source animations
> can be compressed far smaller than this - we average around 4 bytes per
> spline control, and you usually need to sample the nearest three
> controls in a spline (quadratic interpolation), and it's read-only
> data, so around 12 bytes of traffic. Obviously these are really rough
> figures, but there's still a significant difference.
>
> > Granularity. By doing this, unless you build some very fine
> > scheduling metric, you are going to have trouble balancing PU
> > utilization. Some characters will have longer blend trees than
> > others, taking more processing time, etc.
> > Unless you are doing all anim operations on one PU. Still, without
> > test data on this one I may be completely off target.
>
> In general (obviously waving my hands a bit here), if you have more
> than about 4x the number of tasks than processors, and the time taken
> by each task is distributed in a roughly bell-curvish way, your
> granularity is "fine enough". Splitting the work into finer chunks
> might get slightly more level load-balancing, but the overhead of doing
> the splitting, and the extra synchronisation required because you have
> more tasks, means that you don't actually get any benefit.
>
> To put this another way - the worst case scenario is that all PUs
> finish all available tasks, and have to wait for a single PU to do a
> single task. They never have to wait for that last PU to do more than
> one task (because otherwise one of the idle ones would do it instead).
> So if the total number of tasks divided by the number of PUs is more
> than 10, then the worst you can do is waste 10% of your processing
> power. Wasting only 10% of your total power is considered superb
> efficiency in the multiprocessing world! This assumes that the tasks
> are not dependent and can be processed by any PU in any order, but if
> you partition by characters, this is 99% true (and for the few that are
> dependent on something else, just schedule those first). Certainly with
> 1000s of characters and only around 8 PUs (e.g. PS3), I just can't see
> it being a problem.
>
> TomF.
>
> > -----Original Message-----
> > From: gda...@li...
> > [mailto:gda...@li...] On
> > Behalf Of yg...@gy...
> > Sent: 28 October 2005 03:08
> > To: gda...@li...
> > Subject: RE: [Algorithms] skeletal animation system
> >
> > Hi Tom,
> >
> > Points, well made.
> >
> > I think we come from different points of view.
> > You are talking about these few most important characters that you
> > see in the foreground. I'm focusing on LOTS of characters in the
> > background (instancing numbers, hundreds, thousands, etc.) and not so
> > complicated animation operations (cross-fade blends being by far the
> > most popular blend in my experience in these cases).
> > Due to the overlapping nature of the memory allocation (especially in
> > cross-fade blends) you have implied data separation that might as
> > well be utilized.
> >
> > Also in my experience, especially when animating all kinds of
> > entities (which have some external attachments between them), it's
> > good to process all animations before you get to actually processing
> > your entities (characters if you want). I think we agree on this
> > point.
> >
> > > Every now and then you get multiple characters doing roughly the
> > > same animation, and the cache helps, but I regard this as luck.
> > We have actually done some research and testing on that. I don't
> > think I agree with this statement when it comes to crowds. We (a
> > colleague of mine) do massive character scenes by sorting morphed
> > data. We upload the morph targets and then send instance data
> > (translate, rotate). This achieves amazing (talking 1000+) quantities
> > of characters on screen on a current console platform. I was
> > convinced at first that this would never work because, like you, I
> > didn't imagine there would be any reasonable exploitable "roughly the
> > same keyframe" animation.
> >
> > > we need to optimise the worst case scenarios, not the best!
> > Quite right ;) In my case the worst scenario happens when you do fast
> > forward, because you hit the anim stream way too often. Unless
> > pre-streamed like you suggest.
> >
> > > So, in conclusion - assume every time you sample an animation, you
> > > will miss every cache.
> > Agreed.
> > However, I'm arguing (and in my experience and analyser data) that
> > most of the time you are doing interpolation between keyframes (that
> > are stored locally) rather than fetching them. If your streams are so
> > densely populated that you have to fetch every update, you are doing
> > something wrong, which I can't imagine being the case with Granny
> > for example.
> >
> > I do agree that blend trees are usually not that tall and binary
> > blending (except for cross-fades) is not the best under the sun.
> >
> > > You do very few blends between already-sampled data.
> > Actually we almost never do anything but that. Different perspective
> > here. We do a few N-blends based on parameters like vectors,
> > quaternions, etc. You can do an "artistic" IK look-alike that way.
> > I'm sure there was a link in this list some time ago... about this.
> >
> > > Thus, although it is elegant to split the processing into (a)
> > > sample all your animations into temporary buffers and then (b)
> > > blend the buffers together, in practice it makes more sense to have
> > > your animation sampling also do the weighted accumulation, so that
> > > you can focus all your optimisation on that one routine, and not
> > > have that intermediate data hang around polluting caches - as soon
> > > as you sample a bone's data, you add it to its blend.
> > My approach is: don't do any blends if you don't have to. So you
> > sample/interpolate your anims and most often, if they are not in a
> > cross-fade, you are done. If there is something else to be done, it's
> > another task. Makes for better code separation. With your solution
> > you have to make a fit-all process function.
> > Still, your approach will be less memory consuming and less of a
> > burden on the bus, so it could be worth the
> > maintainability-performance trade-off.
> >
> > > So the point I'm trying to (finally) get to is that trying to
> > > parallelise the operations done on one particular character seems
> > > like a lot of effort for little benefit.
> > I'm not after parallelising operations on one character. I'm for
> > parallelising all operations on all characters in one go. The
> > animation system doesn't know about a "character" as such... it is
> > just animating entities...
> > Your main point here is that the serial term of the algorithm being
> > dominant reduces or invalidates any benefit from the parallel part. I
> > must agree that it's a good point, but without any test data I can't
> > say if that is the case or not. If task scheduling (and DMA
> > transfers) is properly interleaved (double or triple buffered) you
> > don't get any penalty for that. Still, a very valid point.
> >
> > > IMHO, a far better thing is to assume that the number of characters
> > > on-screen is significantly larger than the number of PUs, and
> > > simply do all the anim sampling and blending of each character on a
> > > single PU.
> > Granularity. By doing this, unless you build some very fine
> > scheduling metric, you are going to have trouble balancing PU
> > utilization. Some characters will have longer blend trees than
> > others, taking more processing time, etc. Unless you are doing all
> > anim operations on one PU. Still, without test data on this one I may
> > be completely off target.
> >
> > Thanks ever so much for the feedback. In fact I think I'm going to
> > try some of your ideas and see how they compare with the existing
> > solution.
> >
> > -Yordan
>
> -------------------------------------------------------
> This SF.Net email is sponsored by the JBoss Inc.
> Get Certified Today * Register for a JBoss Training Course
> Free Certification Exam for All Training Attendees Through End of 2005
> Visit http://www.jboss.com/services/certification for more information
> _______________________________________________
> GDAlgorithms-list mailing list
> GDA...@li...
> https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
> Archives:
> http://sourceforge.net/mailarchive/forum.php?forum_id=6188