Thread: [Algorithms] skeletal animation system
From: Yordan G. <yg...@gy...> - 2005-10-28 06:16:22
Hey everyone,

I wrote this some time ago but never really got to releasing it:
http://www.gyurchev.com/docs/animsys_gyurchev05.pdf

It's a skeletal animation system that attempts to utilize concurrency in a multiprocessor environment in order to deliver masses of animated characters. It's based on a working solution that was targeted at the PlayStation 2 but is scalable to next-gen.

I hope you find it useful. Feedback is appreciated. ;)

Thanks,
Yordan

PS: I posted this from my other account by mistake and it says it awaits approval due to a non-member post. So, sorry if it arrives twice.
From: Tom F. <tom...@ee...> - 2005-10-28 07:23:35
Interesting. I've thought about this a fair bit myself (porting Granny to the PS3 and Xbox 360 being the obvious applications). In general, we have similar overall structures, but there's a few points where we don't agree.

> [there are] resources that are still in the shared main memory - the
> animation streams. In the majority of the cases the animation stream is
> used to fetch new key frames. However if the delta times are large or/and
> key-frames densely populate the timeline it is possible that using this
> read-only resource will become a performance problem.

My experience is that in scenes with a "decent" number of animated characters, caching the animation source data is almost useless. The chances of two characters being at roughly the same time of the same source animation in a single rendered frame is rather low. So the only way a cache can help is if it can cache data from doing a character's animation in frame 1, all the way through to doing the same character's animation in frame 2. This never ever happens of course - even the largest caches are far too small to get any coherency from one frame to the next!

By "decent", I mean - enough to cause the above situation. Rather self-fulfilling definition, I know. The thing is, if your animation (and general game) demands are low enough that you don't completely fill all your caches multiple times from one frame to the next, then you are rendering some very simple scenes indeed. Which means that typically you are spending trivial amounts of single-thread CPU time doing animation, and the effort to multi-thread it is a total waste of time!

Every now and then you get multiple characters doing roughly the same animation, and the cache helps, but I regard this as luck. As others have said frequently - a smooth framerate is more important than a good average - we need to optimise the worst-case scenarios, not the best!
Unless it's something very specific like a squad of soldiers marching in a parade, having two characters do the same animation a significant amount of the time usually means you don't have enough interesting animations!

So, in conclusion - assume every time you sample an animation, you will miss every cache. So now the biggest speed improvement is _not_ multi-threading, it's that you arrange your animation sampling so that you can prefetch the data in advance. Fortunately, the same thing that makes animation relatively easy to thread also makes it relatively easy to stream your processing so that you prefetch/DMA the data well in advance of actually using it.

Second, you talk about taking a single character's blend tree (i.e. the hierarchy of animation sampling and blending of the results) and parallelising that. Typically, blend trees are not actually that large on a single character - they have maybe 5-10 simultaneous animations on average, with peaks of 20-30 for characters doing complex things. In addition, the blend trees are not very "tall" - the common operation is a cumulative sum of a bunch of animations (each with a weight - sometimes a per-bone weight), followed by a normalise. You do very few blends between already-sampled data.

Thus, although it is elegant to split the processing into (a) sample all your animations into temporary buffers and then (b) blend the buffers together, in practice it makes more sense to have your animation sampling also do the weighted accumulation, so that you can focus all your optimisation on that one routine, and not have that intermediate data hang around polluting caches - as soon as you sample a bone's data, you add it to its blend.

Also, splitting everything into a binary blend is far from efficient. Yes, it's general, but usually people want n-way blends (and _commuting_ n-way blends!), not binary ones.
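[The sample-and-accumulate scheme described above - add each sampled bone straight into its blend, then normalise once - might be sketched roughly like this. The types and names are illustrative, not Granny's API, and the quaternion handling assumes all inputs lie in the same hemisphere:]

```cpp
#include <cassert>
#include <cmath>

// Illustrative types -- not any real engine's data layout.
struct BoneSample { float pos[3]; float quat[4]; };

// Accumulate a weighted sample straight into the blend target, so the
// per-animation sample never lives in a separate intermediate buffer.
void sampleAndAccumulate(const BoneSample& s, float weight, BoneSample& accum)
{
    for (int i = 0; i < 3; ++i) accum.pos[i] += weight * s.pos[i];
    for (int i = 0; i < 4; ++i) accum.quat[i] += weight * s.quat[i];
}

// After all animations for this bone are accumulated: one normalise pass.
void normaliseBlend(BoneSample& accum, float totalWeight)
{
    for (int i = 0; i < 3; ++i) accum.pos[i] /= totalWeight;
    float len = std::sqrt(accum.quat[0] * accum.quat[0] +
                          accum.quat[1] * accum.quat[1] +
                          accum.quat[2] * accum.quat[2] +
                          accum.quat[3] * accum.quat[3]);
    for (int i = 0; i < 4; ++i) accum.quat[i] /= len;
}
```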
There are specific uses for binary blends (usually what I call "cross-fade" blends), but they're certainly not the common case.

So the point I'm trying to (finally) get to is that trying to parallelise the operations done on one particular character seems like a lot of effort for little benefit. You have quite a lot of overhead, because if you sample different anims on different PUs, then to combine them together you need to transfer the data from one PU to the other (at minimum - sometimes you may need to transfer both results to a third PU). By "transfer" I either mean an explicit thing like a PS3 SPU DMA, or I simply mean an inter-cache communication on a multiprocessor system (often via main memory).

IMHO, a far better thing is to assume that the number of characters on-screen is significantly larger than the number of PUs, and simply do all the anim sampling and blending of each character on a single PU. No transfers, no synchronisation, no dependency or priority scheduling, you can do complex serialised operations in the middle of the blend tree (e.g. IK fixups, then blend the results of that), you can break the strict hierarchy of the blend tree in places without worrying about multiple PUs (which is sometimes necessary for some effects), and you get the best possible caching of intermediate results.

The only way you use multiple PUs is that each character can go onto a different one. As stated before, there is almost no sharing of source data between different characters, and (by design) no sharing of intermediate data, so this is pretty simple to do. Also, there are only two synchronisation points - the first is to make sure two PUs don't try to process the same character, and the second is to make sure all animation is finished before going on to do the rest of the game (you can do per-character synchronisation just before you render each particular character, but I'm not sure it's worth the extra complication).
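[The two synchronisation points described above reduce to one shared counter and one join. A minimal sketch using modern C++ threads as a stand-in for PUs - the names and the thread model are illustrative, not how either system was actually built:]

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Sync point 1: an atomic counter guarantees no two PUs claim the same
// character. Sync point 2: joining the workers before the rest of the
// frame runs. Each character's whole blend tree stays on one PU.
void animateAll(std::vector<int>& characters, int numPUs,
                void (*animate)(int&))
{
    std::atomic<size_t> next{0};
    std::vector<std::thread> pus;
    for (int p = 0; p < numPUs; ++p)
        pus.emplace_back([&] {
            for (size_t i = next++; i < characters.size(); i = next++)
                animate(characters[i]); // no cross-PU transfers needed
        });
    for (auto& t : pus) t.join(); // all animation done before the game continues
}
```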
Again, my claim is that if you have the same number of characters as PUs, or fewer, then well - you're just not doing very much animation - go think about parallelising something that takes more than a few % of CPU time!

Hmm... that turned out longer and slightly more ranty than I intended. Sorry - take it with a pinch of salt.

TomF.
From: <yg...@gy...> - 2005-10-28 10:08:29
Hi Tom,

Points well made. I think we come from different points of view. You are talking about those few most important characters that you see in the foreground. I'm focusing on LOTS of characters in the background (instancing numbers: hundreds, thousands, etc.) and not-so-complicated animation operations (cross-fade blends being by far the most popular blend in my experience in these cases). Due to the overlapping nature of the memory allocation (especially in cross-fade blends) you have implied data separation that might as well be utilized.

Also, in my experience, especially when animating all kinds of entities (that have some external attachments between them), it's good to process all animations before you get to actually processing your entities (characters if you want). I think we agree on this point.

> Every now and then you get multiple characters doing roughly the same
> animation, and the cache helps, but I regard this as luck.

We have actually done some research and testing on that. I don't think I agree with this statement when it comes to crowds. We (a colleague of mine) do massive character scenes by sorting morphed data. We upload the morph targets and then send instance data (translate, rotate). This achieves amazing (talking 1000+) quantities of characters on screen on a current console platform. I was convinced at first that this would never work because, like you, I didn't imagine there would be any reasonable exploitable "roughly the same keyframe" animation.

> we need to optimise the worst-case scenarios, not the best!

Quite right ;) In my case the worst scenario happens when you do fast forward, because you hit the anim stream way too often. Unless pre-streamed like you suggest.

> So, in conclusion - assume every time you sample an animation, you will
> miss every cache.

Agreed. However, I'm arguing (from my experience and analyser data) that most of the time you are doing interpolation between key frames (that are stored locally) rather than fetching them.
If your streams are so densely populated that you have to fetch on every update, you are doing something wrong - which I can't imagine being the case with Granny, for example.

I do agree that blend trees are usually not that tall, and that binary blending (except for cross-fades) is not the best thing under the sun.

> You do very few blends between already-sampled data.

Actually we almost never do anything but that. Different perspective here. We do a few N-blends based on parameters like vectors, quaternions, etc. You can do an "artistic" IK look-alike that way. I'm sure there was a link about this on this list some time ago.

> Thus, although it is elegant to split the processing into (a) sample all
> your animations into temporary buffers and then (b) blend the buffers
> together, in practice it makes more sense to have your animation sampling
> also do the weighted accumulation, so that you can focus all your
> optimisation on that one routine, and not have that intermediate data hang
> around polluting caches - as soon as you sample a bone's data, you add it
> to its blend.

My approach is: don't do any blends if you don't have to. So you sample/interpolate your anims and, most often, if they are not in a cross-fade, you are done. If there is something else to be done, it's another task. Makes for better code separation. With your solution you have to make a fit-all process function. Still, your approach will be less memory consuming and less of a burden on the bus, so it could be worth the maintainability-performance trade-off.

> So the point I'm trying to (finally) get to is that trying to parallelise
> the operations done on one particular character seems like a lot of effort
> for little benefit.

I'm not after parallelizing operations on one character. I'm for parallelizing all operations on all characters in one go. The animation system doesn't know about a "character" as such... it's just animating entities...
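[Yordan's "don't blend if you don't have to" fast path - sample only, and lerp only when an entity is actually in a cross-fade - might look like this in miniature. A pose is collapsed to a single float for illustration; a real system blends full bone transforms:]

```cpp
#include <cassert>

// Illustrative stand-in: sampling an animation just returns a value here.
float samplePose(float animValue) { return animValue; }

// Fast path: if the entity is not cross-fading, sampling IS the result --
// no blend pass, no intermediate buffers, no fit-all process function.
float updateEntity(float animA, float animB, bool crossFading, float t)
{
    float a = samplePose(animA);
    if (!crossFading)
        return a;               // the common case: done after sampling
    float b = samplePose(animB);
    return a + t * (b - a);     // binary cross-fade lerp
}
```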
Your main point here is that the serial term of the algorithm being dominant reduces or invalidates any benefit from the parallel part. I must agree that it's a good point, but without any test data I can't say whether that is the case or not. If task scheduling (and the DMA transfers) is properly interleaved (double or triple buffered), you don't get any penalty for that. Still, a very valid point.

> IMHO, a far better thing is to assume that the number of characters
> on-screen is significantly larger than the number of PUs, and simply do
> all the anim sampling and blending of each character on a single PU.

Granularity. By doing this, unless you build some very fine scheduling metric, you are going to have trouble balancing PU utilization. Some characters will have longer blend trees than others, taking more processing time, etc. Unless you are doing all anim operations on one PU. Still, without test data on this one, I may be completely off target.

Thanks ever so much for the feedback. In fact I think I'm going to try some of your ideas and see how they compare with the existing solution.

-Yordan
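[The double/triple-buffered interleaving Yordan refers to is a rotating-buffer pipeline: while slot N is being processed, slot N+1 is being DMA'd in and slot N-1 DMA'd out, so the PU never waits on a transfer. A synchronous toy model of the rotation (the "DMA" stages here are just array reads/writes, and doubling a value stands in for processing):]

```cpp
#include <cassert>
#include <vector>

// Triple-buffer pipeline sketch. Each step overlaps three stages on three
// distinct buffers: DMA out of slot (step-2), process slot (step-1),
// DMA in to slot (step). n tasks drain in n + 2 steps.
std::vector<int> runPipeline(const std::vector<int>& tasks)
{
    const int n = (int)tasks.size();
    std::vector<int> results;
    int buf[3] = {0, 0, 0};
    for (int step = 0; step < n + 2; ++step) {
        if (step >= 2)              results.push_back(buf[(step - 2) % 3]); // "DMA out"
        if (step >= 1 && step <= n) buf[(step - 1) % 3] *= 2;               // process
        if (step < n)               buf[step % 3] = tasks[step];            // "DMA in"
    }
    return results;
}
```

In a real SPU-style implementation the three stages run concurrently, which is exactly why the serial transfer cost Yordan mentions can hide behind the compute.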
From: Tom F. <tom...@ee...> - 2005-10-28 19:27:22
> You are talking about these few most important characters that you see
> in the foreground.

I'm also talking about decent numbers of characters. In fact, if there's 1000s of characters in the scene, then keeping each character's animation inside a single PU makes even more sense.

> Due to the overlapping nature of the memory allocation (especially in
> cross-fade blends) you have implied data separation that might as well
> be utilized.

Sorry, you've totally lost me there :-)

> > You do very few blends between already-sampled data.
> Actually we almost never do anything but that. Different perspective here.
> We do a few N-blends based on parameters like vectors, quaternions, etc.
> You can do an "artistic" IK look-alike that way. I'm sure there was a link
> about this on this list some time ago.

Bad mis-communication on my part. What I meant is that because trees are broad but shallow, you are almost always blending together data sampled directly from an animation (the leaves of the tree). You rarely blend data that is the result of a previous blend (the branches and trunk of the tree). So IMHO it's best to focus optimisation on what you do at each leaf, which can usually be expressed as sample-and-blend.

Of course, in a purely binary tree, you have roughly equal numbers of the two operations. But that's one reason why a binary tree isn't that great an idea in general :-)

However, you do say that in your data sets you often have just one animation running, so it's a sample with no blending. Granny also has a special-cased path for this (only one active animation). In practice, some games hit that a lot, some hit it almost never - it all depends on the data sets. I would be cautious about optimising for the case that is already pretty fast! There's a minor code-maintenance issue, but it's only two paths, and they call a lot of the same functions (which the compiler inlines perfectly well).
> most of the time you are doing interpolation between key frames (that are
> stored locally) rather than fetching them. If your streams are that
> densely populated that you have to fetch every update you are doing
> something wrong, which I can't imagine being the case with Granny for
> example.

By "fetch", I meant you're fetching them from main memory into either a cache or some sort of local storage (e.g. SPU memory). I was assuming the animation has already been loaded off disk, otherwise you're doomed! Granny uses splines rather than keyframes, but the data access patterns are roughly similar to a sparse keyframe system.

The way Granny is oriented is on a character-by-character basis. Each character is sampled & blended fully, pulling in the data on demand, and then you do the next one. There's obvious exceptions for characters that interact with each other, but they're certainly not the average case. This keeps all the intermediate data nice and warm in caches or local storage, and only the inputs (animations) and final outputs (what you render) live in main memory.

To do effective streaming from main memory, you can either prefetch/DMA all the animations for a character before you start sampling the first one (this works well when characters are playing a lot of animations), or you can prefetch character 2 before you sample & blend character 1 (this works well when characters are playing much fewer or simpler animations). If prefetch times are long, and local storage/cache space is large, you can prefetch character 3 or 4 before doing character 1, but that tends to get counterproductive and can thrash caches (which is of course lethal).

Now, you could possibly orient it around animations instead. Load each animation just once, then do all the sampling of that animation required for all the characters in the scene. This means you only fetch each animation's data just once.
The problem is, the results of the sampling need to be stored somewhere, but because you're doing all the sampling of that animation in one go, you're unlikely to use that result soon enough for it to stay in a cache or fit in limited local storage. And it's big - typically a vec3 of position and a vec4 of orientation, or 28 bytes in total, and you need to write the data then read it back later - bus traffic of 56 bytes. Whereas source animations can be compressed far smaller than this - we average around 4 bytes per spline control, and you usually need to sample the nearest three controls in a spline (quadratic interpolation), and it's read-only data, so around 12 bytes of traffic. Obviously these are really rough figures, but there's still a significant difference.

> Granularity. By doing this unless you build some very fine scheduling
> metric you are going to have trouble balancing PU utilization. Some
> characters will have longer blend trees than others, taking more
> processing time, etc. Unless you are doing all anim operations on one PU.
> Still without test data on this one I may be completely off target.

In general (obviously waving my hands a bit here), if you have more than about 4x as many tasks as processors, and the time taken by each task is distributed in a roughly bell-curvish way, your granularity is "fine enough". Splitting the work into finer chunks might get slightly more level load-balancing, but the overhead of doing the splitting, and the extra synchronisation required because you have more tasks, means that you don't actually get any benefit.

To put this another way - the worst-case scenario is that all PUs finish all available tasks, and have to wait for a single PU to do a single task. They never have to wait for that last PU to do more than one task (because otherwise one of the idle ones would do it instead).
So if the total number of tasks divided by the number of PUs is more than 10, then the worst you can do is waste 10% of your processing power. Wasting only 10% of your total power is considered superb efficiency in the multiprocessing world! This assumes that the tasks are not dependent and can be processed by any PU in any order, but if you partition by characters, this is 99% true (and for the few that are dependent on something else, just schedule those first).

Certainly with 1000s of characters and only around 8 PUs (e.g. PS3), I just can't see it being a problem.

TomF.
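[Tom's worst-case argument is essentially the classic greedy-scheduling bound: with each idle PU grabbing the next available task, the finish time is at most (total work / number of PUs) + (longest single task). A toy simulation, illustrative only:]

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Greedy list scheduling: each task goes to whichever PU frees up first.
// Worst case, every PU waits for one PU to finish ONE last task, so
//   makespan <= (sum of task times) / numPUs + longest task.
double greedyMakespan(const std::vector<double>& tasks, int numPUs)
{
    std::vector<double> finish(numPUs, 0.0);
    for (double t : tasks) {
        auto pu = std::min_element(finish.begin(), finish.end());
        *pu += t; // assign to the earliest-free PU
    }
    return *std::max_element(finish.begin(), finish.end());
}
```

With 1000 character-tasks on 8 PUs, the "+ longest task" term is tiny relative to the total, which is Tom's 10%-waste point.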
From: Yordan G. <yg...@gy...> - 2005-10-28 21:29:37
OK, you've won me over on the processing-one-character/entity-on-one-PU thing. So, one character = one task. When I think about it, though, it fits my approach; just the scale is different.

> By "fetch", I meant you're fetching them from main memory into either a
> cache or some sort of local storage (e.g. SPU memory). I was assuming the
> animation has already been loaded off disk, otherwise you're doomed!
> Granny uses splines rather than keyframes, but the data access patterns
> are roughly similar to a sparse keyframe system.

Fetch is main memory => PU. I use the word keyframe for both normal keyframes and spline knots (probably wrong, but in our case it's still some data at some point in time regardless of the interp method).

> So IMHO it's best to focus optimisation on what you do at each leaf, which
> can usually be expressed as sample-and-blend.

Agreed. This is what I'm most interested in as well. And analyser data shows most time is spent there.

I find it difficult to follow what exactly you do with the PU and what goes to local memory and back to main, etc. (see later my rant on DMA transfers). IMHO there needs to be a number of support structures. Inputs like where in the animation you are, blending weights, parameters, etc. And outputs like the final skeleton state (what is rendered) and other output info (we return, for example, the velocity of the root or absolute movement). In my approach I put all my support structures in one solid chunk and call that the animation state. It's all the data the task will need (regardless of the scale of the task - be it a full character or not). So we move the animation state into local memory, process that character, then move it back. It can be structured so read-only data goes only one way without problem. Now the question is what goes in this animation state and what stays in main memory and is fetched on request (or pre-streamed).
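[Yordan's "one solid chunk" animation state might be laid out something like the struct below. Every field name and size here is made up for illustration; the point is only that inputs, locally cached control points, and outputs travel as one contiguous block per task:]

```cpp
#include <cassert>

// Illustrative only: one contiguous block per task, so everything the PU
// needs goes over in a single DMA, and everything the game needs comes
// back the same way. kMaxBones and all field names are hypothetical.
struct Quat { float x, y, z, w; };
struct Vec3 { float x, y, z; };

const int kMaxBones = 64;

struct AnimationState {
    // Inputs (read-only: main memory -> local store)
    float localTime;           // where in the animation we are
    float blendWeight;         // cross-fade parameter
    // Working set: cached control points for the current segment
    Quat  curKey[kMaxBones];
    Quat  nextKey[kMaxBones];
    // Outputs (local store -> main memory)
    Quat  pose[kMaxBones];     // final skeleton state, ready to render
    Vec3  rootVelocity;        // extra info the game wants back
};
```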
With my approach I keep duplicates of the control points (key frames) that are currently being interpolated in local interpolation structures (local to the animation state). So only when I move to the next keyframe (require the next knot) do I touch main memory to fetch it (or not, if it is pre-streamed in local). My argument here was that new keyframes/knots are required infrequently. With a good spline-fitting algorithm you should be experiencing the same. Mine is (I assume) worse than yours by a mile, and still keyframe fetches (and for that reason cache misses) are not an issue when I analyze the execution. Even more, you can fetch an extra keyframe/control asynchronously so you don't have any stalls.

Do you have similar control structures, or do you rely on the animation being prefetched into local storage? If so, do you have a guarantee that every single one of your animations is going to fit there? Remember, this would be about 1/3 (in practice more like 1/4) of real local storage space if you plan to triple buffer to keep the PU busy. Maybe I'm missing something here.

> Whereas source animations can be compressed far smaller than this - we
> average around 4 bytes per spline control, and you usually need to sample
> the nearest three controls in a spline (quadratic interpolation), and it's
> read-only data, so around 12 bytes of traffic. Obviously these are really
> rough figures, but there's still a significant difference.

Yes, I understand the maths, but I think you are focusing on the wrong thing here. It's not always how much you transfer, but in what chunks and how often. You have to decide where in the animation stream each quadruple of controls for each bone is, in order to get only those quadruples and DMA them into the local PU memory. Now I can tell you straight away (and I'm almost sure things won't change much) that on PS2 one solid DMA transfer is far better than a number of small ones. They have some setup time, need to be prioritized, etc.
So the fewer you have, the better. Not only that, you have to spend some CPU time deciding/calculating where exactly in your animation these quadruples of keyframes are, and that is not something you want to be doing every frame - or do you? If you want to directly sample data you need to move most of your animation (if possible all of it) into local mem. Or consider some genius animation format where you can DMA solid blocks every time you need to sample data (once per animation).

I must say that "load every animation and sample all characters" seems attractive. Though it hides potential problems of either not being able to fit the animation (too long an animation) or the sample data (too many characters).

On the topic of perfect load balancing, I agree with you.

-Yordan
> I'm focusing on LOTS of characters in the background > (instancing numbers, > hundred, thousands, etc) and not so complicated animation operations > (cross-fade blends being by far the most popular blend in my > experience in > these cases). > Due to the overlapping nature of the memory allocation (especially in > cross-fade blends) you have implied data separation that > might as well be > utilized. > > Also in my experience especially when animating all kinds of > entities (and > have some external attachment between them) its good to process all > animations before you get to actually processing your > entities (characters > if you want). I think we agree on this point. > > > Every now and then you get multiple characters doing > roughly the same > > animation, and the cache helps, but I regard this as luck. > We have actually done some research and testing on that. I > dont think I > agree with this statement when it comes to crowds. We (a colleague of > mine) do massive character scenes by sorting morphed data. We > upload the > morph targets and then send instance data (translate,rotate). This > achieves amazing (talking 1000+) quantities of characters on > screen on a > current console platform. I was convinced at first that this > would never > work because like you I didnt imagine there would be any reasonable > exploitable "roughly the same keyframe" animation. > > > we need to optimise the worse case scenarios, not the best! > Quite right ;) In my case worst scenario happens when you do > fast forward > cause you hit the anim stream way too often. Unless > pre-streamed like you > suggest. > > > So, in conclusion - assume every time you sample an > animation, you will > > miss every cache. > Agreed. However, I'm arguing (and in my experience and > analyser data) that > most of the time you are doing interpolation between key > frames (that are > stored locally) rather than fetching them. 
If your streams are that > densely populated that you have to fetch every update you are doing > something wrong, which i cant imagine being the case with Granny for > example. > > I do agree that blend trees are usually not that tall and > binary blending > (except for cross-fades) is not the best under the sun. > > > You do very few blends between already-sampled data. > Actually we almost never do anything but that. Different > perspective here. > We do a few N-blends based on parameters like vectors, > quaternions, etc. > You can do an "artistic" IK look-alike that way. I'm sure > there was link > in this list some time ago... about this. > > > Thus, although it is elegant to split the processing into > (a) sample all > > your animations into temporary buffers and then (b) blend > the buffers > > together, in practice it makes more sense to have your > animation sampling > > also do the weighted accumulation, so that you can focus all your > > optimisation on that one routine, and not have that > intermediate data hang > > around polluting caches - as soon as you sample a bone's > data, you add it > > to > > its blend. > My approach is: dont do any blends if you dont have to. So you > sample/interpolate your anims and most often, if they are not > in a cross > fade, you are done. If there is something else to be done its another > task. Makes for better code separation. With your solution you have to > make a fit-all process function. > Still your approach will be less memory consuming and less > burden on the > bus so it could be worth the maintainability-performance trade-off. > > > So the point I'm trying to (finally) get to is that trying > to parallelise > > the operations done on one particular character seems like > a lot of effort > > for little benefit. > I'm not after parallelsing operations on one character. I'm for > parallelizing all operations on all characters on one go. The > animation > system doesnt know about "character" as such... 
it just animating > entities... > Your main point here is that the serial term of the algorithm being > dominant is reducing or invalidating any benefit from the > parallel part. I > must agree that its a good point but without any test data I > cant say if > that is the case or not. If task scheduling (and DMA transfers) is > properly interlieved (double or tripple buffered) you dont > get any penalty > for that. Still, a very valid point. > > > IMHO, a far better thing is to assume that the number of characters > > on-screen is significantly larger than the number of PUs, > > and simply do > > all > > the anim sampling and blending of each character on a single PU. > Granularity. By doing this unless you build some very fine > scheduling metric > you are going to have trouble balancing PU utilization. Some > characters > will have longer blend trees than other, taking more processing times, > etc. Unless you are doing all anim operations on one PU. Still without > test data on this one I may be completely off target. > > Thanks ever so much for the feedback. In fact I think I'm going to try > some of you ideas and see how they compare with the existing solution. > > -Yordan ------------------------------------------------------- This SF.Net email is sponsored by the JBoss Inc. Get Certified Today * Register for a JBoss Training Course Free Certification Exam for All Training Attendees Through End of 2005 Visit http://www.jboss.com/services/certification for more information _______________________________________________ GDAlgorithms-list mailing list GDA...@li... https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list Archives: http://sourceforge.net/mailarchive/forum.php?forum_id=6188 |
From: Tom F. <tom...@ee...> - 2005-10-29 03:12:43
|
> My argument here was that new keyframes/knots are required > infrequently. With a > good spline fitting algorithm you should be experiencing the > same. Well, from frame 1 of character 3, bone 4 to frame 2 of character 3, bone 4, yes - you will need almost exactly the same data. The problem is, those two samplings are an entire 60th of a second apart. You'll want to fill local store/cache with tons of other data in the meantime. The chances of characters 4, 5, 6, etc needing the same 3-4 knots or keyframes in frame 1 are tiny, according to my data. But yes, maybe if you have a big crowd of characters all running the same animation at roughly the same phase, you'll get re-use. I'm just saying, I don't think that is even close to the worst-case performance, nor do I think it is even a common case. But it does depend on your game type. > Do you have similar control structures or you rely that > animation is going > to be prefetched in local storage? If so do you have > guarantee that every > single of your animations is going to fit there? Remember > this would be > about 1/3 (in practice more like 1/4) of real local storage > space if you > plan to triple buffer to keep the PU busy. I only fetch the 3-4 control points in each spline that the current character needs. If the next character happens to also need them, then hey - that's fine - the cache might do useful work. But I assume it doesn't, and I assume that each character has to fetch all-new data, because that is the worst case, and it's also the common case. So I need to prefetch that data before I do the sampling & blending. Space usually isn't a problem - I only need 4 control points from pos & orn for each bone for each animation. In practice, the size of the required control data (e.g. how long each animation is, what sort of compression each spline uses, etc) is bigger than the control point data I need this particular frame. 
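As a back-of-envelope check on the "space usually isn't a problem" claim, here is a sketch of the per-character working set. Every figure (bone count, number of active animations, ~4 controls of ~4 bytes per spline) is an illustrative assumption built from the rough numbers quoted in this thread, not Granny's actual data layout:

```cpp
#include <cassert>
#include <cstddef>

// Rough size of the control-point data one character needs for one frame of
// sampling. All parameters are assumptions, not measured Granny figures.
std::size_t controlBytesPerCharacter(std::size_t bones,
                                     std::size_t activeAnims,
                                     std::size_t controlsPerSpline,  // ~4: quadratic interp + spare
                                     std::size_t bytesPerControl)    // ~4: compressed control
{
    const std::size_t channels = 2;  // one position spline + one orientation spline per bone
    return bones * channels * activeAnims * controlsPerSpline * bytesPerControl;
}
```

With 50 bones and 4 simultaneously playing animations this comes to 50 * 2 * 4 * 4 * 4 = 6400 bytes, which is tiny next to any sensible cache or a 256 KB SPU local store.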
But none of it is really very large - it fits in all sensible-sized caches and local data storage just fine. Note that I have not tried this on the PS2's scratchpad - that is rather small. The cache performance on the PS2 was so abysmal in every way that it would have been a big big job to get it even half-decent on there. Since I have five other platforms to deal with, I just waited for someone to try to run a complex scene on the PS2 before I worried about it. And nobody ever did, so it never had to be done. And probably never will now. Which I am very happy about :-) > You have to decide where in the animation stream are > each quadruples > of controls for each bone in order to get only those > quadruples and DMA them > in the local PU memory. Correct. > Now I can tell you stright away (and > I'm almost sure > things wont change much) that on PS2 one solid DMA transfer > is far better > than a number of small ones. They have some setup time and need to be > prioritized and etc. So the less you have the better. Not > only that you have > to spend CPU some time deciding/calculating where exactly in > your animation > are these quadruples of keyframes and that is not something > you want to be > doing every frame, or you do? Well, you have to decide which controls you need anyway - that's part of the spline sampling process. So that's no extra cost. Let me preface this with a warning - I have not tried a DMA-based scheme yet. That's because on every other platform, we're not talking about DMAs, we're talking about normal cache accesses, and they don't care about small/large transfers - everything is just the size of a cache line, and setting up a transfer is just executing a "prefetch" with the relevant address - extremely cheap. All you need to do is make sure you do those prefetches far enough in advance (to absorb the latency of the transfer from main memory), and life is good. 
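The prefetch-far-enough-in-advance pattern can be sketched roughly like this. `SplineData`, `sampleCharacter` and `animateAll` are hypothetical stand-ins (this is not Granny's API), and `__builtin_prefetch` is a GCC/Clang-specific hint:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical compressed spline window: the 3-4 controls needed this frame.
struct SplineData { float controls[4]; };

// Stand-in for sample-and-blend of one character's splines.
float sampleCharacter(const std::vector<SplineData>& splines)
{
    float accum = 0.0f;
    for (const SplineData& s : splines)
        accum += 0.25f * (s.controls[0] + s.controls[1] +
                          s.controls[2] + s.controls[3]);
    return accum;
}

// Sample every character, hinting the NEXT character's data into cache while
// the current one is processed, so the main-memory latency hides behind work.
void animateAll(const std::vector<std::vector<SplineData>>& characters,
                std::vector<float>& poses)
{
    poses.resize(characters.size());
    for (std::size_t i = 0; i < characters.size(); ++i) {
        if (i + 1 < characters.size())
            for (const SplineData& s : characters[i + 1])
                __builtin_prefetch(&s, 0 /*read*/, 1 /*modest temporal locality*/);
        poses[i] = sampleCharacter(characters[i]);  // data is (hopefully) warm now
    }
}
```

On platforms without hardware caches the same shape survives, but the prefetch hint becomes an explicit double-buffered DMA into local store.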
The only platforms that DMAs are needed for are PS2 (avoided doing this - see above) and PS3. So, here is my entirely theoretical plan for PS3 - due to be implemented some time next year. Granny allows animation splines to be listed in any order you like in memory - they are remapped to the skeleton on the fly. This is a big bonus for a bunch of things, but that would be a very long discussion, so let's just take that as read. Some splines will have lots of knots/controls, and some will be able to use different sorts of compression, and so they will all be different lengths in bytes. So, given that, order your spline data in memory from smallest to largest. Now, decide what your magic byte length is for your platform. This is going to be largely trial and error, but it's roughly the place where the cost of setting up a new transfer exceeds the cost of transferring the extra data. So let's say that's 64 bytes (chosen at random). That is, if you wanted to transfer two chunks of 12 bytes that were 64 bytes apart, it is the same speed to set up a single transfer that is 64+12 bytes long, as to do two transfers of 12 bytes each. OK, so we know we need 12 bytes (roughly - varies with compression and spline degree) from each spline in the list. And they're ordered by size. So we just need to find the first spline with data longer than 64 bytes. All the splines before it in the list just get their entire data DMA'd over in a single chunk - it's not efficient to do them with separate transfers. All the ones afterwards just get a 12-byte transfer set up for each of them, because that's the only bit you need, and you'd probably run out of local store if you transferred the whole lot, as well as having to wait longer for the transfer. So that's the theory. I have yet to try it in practice though. Now, it may be that the number is not 64 bytes, it's more like 4kbytes or something. 
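The size-sorted batching scheme just described can be sketched as follows. The spline sizes, the 64-byte threshold and the 12-byte window are the illustrative figures from the text, and `planTransfers` is a hypothetical helper, not real Granny code:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct TransferPlan {
    std::size_t bulkBytes;       // one contiguous DMA covering every small spline
    std::size_t smallTransfers;  // one individual transfer per large spline
    std::size_t smallBytes;      // total bytes moved by those small transfers
};

// splineSizes must be sorted ascending (splines laid out smallest-first in
// memory). Splines no bigger than `threshold` are cheaper to pull over whole
// as one block DMA; larger splines each get a separate transfer of just the
// `windowBytes` of controls needed this frame.
TransferPlan planTransfers(const std::vector<std::size_t>& splineSizes,
                           std::size_t threshold,
                           std::size_t windowBytes)
{
    // First spline whose full data exceeds the threshold.
    auto split = std::upper_bound(splineSizes.begin(), splineSizes.end(), threshold);
    TransferPlan plan{0, 0, 0};
    for (auto it = splineSizes.begin(); it != split; ++it)
        plan.bulkBytes += *it;  // whole spline rides along in the bulk chunk
    plan.smallTransfers = static_cast<std::size_t>(splineSizes.end() - split);
    plan.smallBytes = plan.smallTransfers * windowBytes;
    return plan;
}
```

With sizes {8, 16, 40, 128, 512}, a 64-byte threshold and 12-byte windows, the plan comes out as one 64-byte bulk DMA plus two 12-byte transfers.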
In which case this mostly just collapses to "transfer the whole animation over". Well, so be it. But people do use some pretty long animations with Granny (I've had people export 15-minute credit sequences as a single animation!), so at some point you need to cope with not being able to DMA the whole thing over. As I said above, on every platform but the PS*, it's trivial. They have caches, and you just do a prefetch instruction per spline. On PS3, it's somewhat more effort <shakes fist at Sony/IBM for their crazy architecture>. TomF. > -----Original Message----- > From: gda...@li... > [mailto:gda...@li...] On > Behalf Of Yordan Gyurchev > Sent: 28 October 2005 14:29 > To: gda...@li... > Subject: Re: [Algorithms] skeletal animation system > > > Ok, you've won me over. Processing one character/entity on > one PU thing. So, > one character = one task. > > When I think about it though it fits my approach, just the scale is > different. > > > By "fetch", I meant you're fetching them from main memory > into either a > > cache or some sort of local storage (e.g. SPU memory). I > was assuming the > > animation has already been loaded off disk, otherwise > you're doomed! > > Granny > > uses splines rather than keyframes, but the data access > patterns are > > roughly > > similar to a sparse keyframe system. > Fetch is = main memory => PU > I use the word keyframe for both normal keyframes and spline > knots (probably > wrong but in our case its still some data at some point in > time regardless > of the interp method). > > > So IMHO it's best to focus optimisation on what you do at > each leaf, which > > can usually be expressed as sample-and-blend. > Agreed. This is what I'm most interested in as well. And > analyser data shows > most time is spend there. 
> > I find it difficult to follow what exactly you do with the PU > and what goes > to local memory and back to main etc (see later my rant on > DMA transfers). > IMHO there needs to be a number of support structures. Inputs > like where in > the animation you are, blending weights, parameters, etc. And > outputs like > the final skeleton state (what is rendered), other output > info (we return > for example velocity of the root or absolute movement). > > In my approach I put all my support structures in one solid > chunk and call > that animation state. Its all the data the task will need > (regardless of the > scale of the task - be it a full character or not). > > So we move the animation state into local memory, process > that character, > then move it back. It can be structured so read-only data > goes only one way > without problem. > > Now the question is what goes in this animation state and > what stays in main > memory and is fetched on request (or pre-streamed). With my > approach I keep > a duplicates of the control points (key frames) that are currently > interpolated into local interpolation structures (local to > the animation > state). So, only when I move to the next keyframe (require > next knot) I > touch main memory to fetch it (or not if it is pre-streamed > in local). My > argument here was that new keyframes/knots are required > infrequently. With a > good spline fitting algorithm you should be experiencing the > same. Mine is > (I assume) worse than yours by a mile and still keyframe > fetches (and for > that reason cache misses) are not an issue when I analyze the > execution. > Even more you can fetch asynchronously an extra > keyframe/control so you > don't have any stalls. > > Do you have similar control structures or you rely that > animation is going > to be prefetched in local storage? 
If so do you have > guarantee that every > single of your animations is going to fit there? Remember > this would be > about 1/3 (in practice more like 1/4) of real local storage > space if you > plan to triple buffer to keep the PU busy. > > May be I'm missing something here. > > > Whereas source animations can be compressed far smaller > than this - we > > average around 4 bytes per spline control, and you usually > need to sample > > the nearest three controls in a spline (quadratic > interpolation), and it's > > read-only data, so around 12 bytes of traffic. Obviously > these are really > > rough figures, but there's still a significant difference. > Yes I understand the maths, but I think you are focusing on > the wrong thing > here. Its not always how much you transfer but on what > chunks and how > often.You have to decide where in the animation stream are > each quadruples > of controls for each bone in order to get only those > quadruples and DMA them > in the local PU memory. Now I can tell you stright away (and > I'm almost sure > things wont change much) that on PS2 one solid DMA transfer > is far better > than a number of small ones. They have some setup time and need to be > prioritized and etc. So the less you have the better. Not > only that you have > to spend CPU some time deciding/calculating where exactly in > your animation > are these quadruples of keyframes and that is not something > you want to be > doing every frame, or you do? > > If you want to directly sample data you need to move most of > your animation > (if possible all of it) in local mem. Or consider some genius > animation > format where you can DMA solid blocks every time you need to > sample data > (once per animation). > > I must say that load every animation and sample all characters seems > attractive. 
Though it hides potential problems of either not > being able to > fit animation (too long animation) or sample data (too many > character). > > On the topic of perfect load balancing I agree with you. > > -Yordan > > ----- Original Message ----- > From: "Tom Forsyth" <tom...@ee...> > To: <gda...@li...> > Sent: Friday, October 28, 2005 8:27 PM > Subject: RE: [Algorithms] skeletal animation system > > > > You are talking about these > > few most important characters that you see in the foreground. > > I'm also talking about decent numbers of characters. In fact, > if there's > 1000s of characters in the scene, then keeping each > character's animation > inside a single PU makes even more sense. > > > Due to the overlapping nature of the memory allocation > (especially in > > cross-fade blends) you have implied data separation that > > might as well be utilized. > > Sorry, you've totally lost me there :-) > > > > You do very few blends between already-sampled data. > > Actually we almost never do anything but that. Different > > perspective here. > > We do a few N-blends based on parameters like vectors, > > quaternions, etc. > > You can do an "artistic" IK look-alike that way. I'm sure > > there was link > > in this list some time ago... about this. > > Bad mis-communication on my part. What I meant is that > because trees are > broad but shallow, you are almost always blending together > data sampled > directly from an animation (the leaves of the tree). You > rarely blend data > that is the result of a previous blend (the branches and > trunk of the tree). > So IMHO it's best to focus optimisation on what you do at > each leaf, which > can usually be expressed as sample-and-blend. > > Of course, in a purely binary tree, you have roughly equal > numbers of the > two operations. 
But that's one reason why a binary tree isn't > that great an > idea in general :-) > > However, you do say that in your data sets you often have > just one animation > running, so it's a sample with no blending. Granny also has a > special-cased > path for this (only one active animation). In practice, some > games hit that > a lot, some hit it almost never - it all depends on the data > sets. I would > be cautious about optimising for the case that is already pretty fast! > There's a minor code-maintenance issue, but it's only two > paths, and they > call a lot of the same functions (that the compiler inlines > perfectly well). > > > most of the time you are doing interpolation between key > > frames (that are > > stored locally) rather than fetching them. If your streams are that > > densely populated that you have to fetch every update you are doing > > something wrong, which i cant imagine being the case with Granny for > > example. > > By "fetch", I meant you're fetching them from main memory > into either a > cache or some sort of local storage (e.g. SPU memory). I was > assuming the > animation has already been loaded off disk, otherwise you're > doomed! Granny > uses splines rather than keyframes, but the data access > patterns are roughly > similar to a sparse keyframe system. > > The way Granny is oriented is on a character-by-character basis. Each > character is sampled & blended fully, pulling in the data on > demand, and > then you do the next one. There's obvious exceptions for > characters that > interact with each other, but they're certainly not the > average case. This > keeps all the intermediate data nice and warm in caches or > local storage, > and only the inputs (animations) and final outputs (what you > render) live in > main memory. 
> > To do effective streaming from main memory, you can either > prefetch/DMA all > the animations for a character before you start sampling the > first one (this > works well when characters are playing a lot of animations), > or you can > prefetch character 2 before you sample & blend character 1 > (this works well > when characters are playing much fewer or simpler > animations). If prefetch > times are long, and local storage/cache space is large, you > can prefetch > character 3 or 4 before doing character 1, but that tends to get > counterproductive and can thrash caches (which is of course lethal). > > Now, you could possibly orient it around animations instead. Load each > animation just once, then do all the sampling of that > animation required for > all the characters in the scene. This means you only fetch > each animation's > data just once. The problem is, the results of the sampling need to be > stored somewhere, but because you're doing all the sampling of that > animation in one go, you're unlikely to use that result soon > enough for it > to stay in a cache or fit in limited local storage. And it's > big - typically > a vec3 of position and a vec4 of orientation, or 28 bytes in > total, and you > need to write the data then read it back later - bus traffic > of 56 bytes. > Whereas source animations can be compressed far smaller than this - we > average around 4 bytes per spline control, and you usually > need to sample > the nearest three controls in a spline (quadratic > interpolation), and it's > read-only data, so around 12 bytes of traffic. Obviously > these are really > rough figures, but there's still a significant difference. > > > Granularity. By doing this unless you build some very fine > > scheduling metric > > you are going to have trouble balancing PU utilization. Some > > characters > > will have longer blend trees than other, taking more > processing times, > > etc. 
Unless you are doing all anim operations on one PU. > Still without > > test data on this one I may be completely off target. > > In general (obviously waving my hands a bit here), if you > have more than > about 4x the number of tasks than processors, and the time > taken by each > task is distributed in a roughly bell-curvish way, your > granularity is "fine > enough". Splitting the work into finer chunks might get > slightly more level > load-balancing, but the overhead of doing the splitting, and the extra > synchronisation required because you have more tasks, means > that you don't > actually get any benefit. > > To put this another way - the worst case scenario is that all > PUs finish all > available tasks, and have to wait for a single PU to do a > single task. They > never have to wait for that last PU to do more than one task (because > otherwise one of the idle ones would do it instead). So if > the total number > of tasks divided by the number of PUs is more than 10, then > the worst you > can do is waste 10% of your processing power. Wasting only > 10% of your total > power is considered superb efficiency in the multiprocessing > world! This > assumes that the tasks are not dependent and can be processed > by any PU in > any order, but if you partition by characters, this is 99% > true (and for the > few that are dependent on something else, just schedule those first). > Certainly with 1000s of characters and only around 8 PUs > (e.g. PS3), I just > can't see it being a problem. > > > TomF. > > > > -----Original Message----- > > From: gda...@li... > > [mailto:gda...@li...] On > > Behalf Of yg...@gy... > > Sent: 28 October 2005 03:08 > > To: gda...@li... > > Subject: RE: [Algorithms] skeletal animation system > > > > > > Hi Tom, > > > > Points, well made. > > > > I think we come from different points of view. 
You are > > talking about these > > few most important characters that you see in the foreground. > > I'm focusing on LOTS of characters in the background > > (instancing numbers, > > hundred, thousands, etc) and not so complicated animation operations > > (cross-fade blends being by far the most popular blend in my > > experience in > > these cases). > > Due to the overlapping nature of the memory allocation > (especially in > > cross-fade blends) you have implied data separation that > > might as well be > > utilized. > > > > Also in my experience especially when animating all kinds of > > entities (and > > have some external attachment between them) its good to process all > > animations before you get to actually processing your > > entities (characters > > if you want). I think we agree on this point. > > > > > Every now and then you get multiple characters doing > > roughly the same > > > animation, and the cache helps, but I regard this as luck. > > We have actually done some research and testing on that. I > > dont think I > > agree with this statement when it comes to crowds. We (a > colleague of > > mine) do massive character scenes by sorting morphed data. We > > upload the > > morph targets and then send instance data (translate,rotate). This > > achieves amazing (talking 1000+) quantities of characters on > > screen on a > > current console platform. I was convinced at first that this > > would never > > work because like you I didnt imagine there would be any reasonable > > exploitable "roughly the same keyframe" animation. > > > > > we need to optimise the worse case scenarios, not the best! > > Quite right ;) In my case worst scenario happens when you do > > fast forward > > cause you hit the anim stream way too often. Unless > > pre-streamed like you > > suggest. > > > > > So, in conclusion - assume every time you sample an > > animation, you will > > > miss every cache. > > Agreed. 
However, I'm arguing (and in my experience and > > analyser data) that > > most of the time you are doing interpolation between key > > frames (that are > > stored locally) rather than fetching them. If your streams are that > > densely populated that you have to fetch every update you are doing > > something wrong, which i cant imagine being the case with Granny for > > example. > > > > I do agree that blend trees are usually not that tall and > > binary blending > > (except for cross-fades) is not the best under the sun. > > > > > You do very few blends between already-sampled data. > > Actually we almost never do anything but that. Different > > perspective here. > > We do a few N-blends based on parameters like vectors, > > quaternions, etc. > > You can do an "artistic" IK look-alike that way. I'm sure > > there was link > > in this list some time ago... about this. > > > > > Thus, although it is elegant to split the processing into > > (a) sample all > > > your animations into temporary buffers and then (b) blend > > the buffers > > > together, in practice it makes more sense to have your > > animation sampling > > > also do the weighted accumulation, so that you can focus all your > > > optimisation on that one routine, and not have that > > intermediate data hang > > > around polluting caches - as soon as you sample a bone's > > data, you add it > > > to > > > its blend. > > My approach is: dont do any blends if you dont have to. So you > > sample/interpolate your anims and most often, if they are not > > in a cross > > fade, you are done. If there is something else to be done > its another > > task. Makes for better code separation. With your solution > you have to > > make a fit-all process function. > > Still your approach will be less memory consuming and less > > burden on the > > bus so it could be worth the maintainability-performance trade-off. 
> > > > > So the point I'm trying to (finally) get to is that trying > > to parallelise > > > the operations done on one particular character seems like > > a lot of effort > > > for little benefit. > > I'm not after parallelsing operations on one character. I'm for > > parallelizing all operations on all characters on one go. The > > animation > > system doesnt know about "character" as such... it just animating > > entities... > > Your main point here is that the serial term of the algorithm being > > dominant is reducing or invalidating any benefit from the > > parallel part. I > > must agree that its a good point but without any test data I > > cant say if > > that is the case or not. If task scheduling (and DMA transfers) is > > properly interlieved (double or tripple buffered) you dont > > get any penalty > > for that. Still, a very valid point. > > > > > IMHO, a far better thing is to assume that the number of=20 > characters > > > on-screen is significantly larger than the number of PUs, > > and simply do > > > all > > > the anim sampling and blending of each character on a single PU. > > Granularity. By doing this unless you build some very fine > > scheduling metric > > you are going to have trouble balancing PU utilization. Some > > characters > > will have longer blend trees than other, taking more=20 > processing times, > > etc. Unless you are doing all anim operations on one PU.=20 > Still without > > test data on this one I may be completely off target. > > > > Thanks ever so much for the feedback. In fact I think I'm=20 > going to try > > some of you ideas and see how they compare with the=20 > existing solution. > > > > -Yordan >=20 >=20 >=20 >=20 >=20 > ------------------------------------------------------- > This SF.Net email is sponsored by the JBoss Inc. 
> Get Certified Today * Register for a JBoss Training Course > Free Certification Exam for All Training Attendees Through End of 2005 > Visit http://www.jboss.com/services/certification for more information > _______________________________________________ > GDAlgorithms-list mailing list > GDA...@li... > https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list > Archives: > http://sourceforge.net/mailarchive/forum.php?forum_ida88 >=20 >=20 >=20 >=20 > ------------------------------------------------------- > This SF.Net email is sponsored by the JBoss Inc. > Get Certified Today * Register for a JBoss Training Course > Free Certification Exam for All Training Attendees Through End of 2005 > Visit http://www.jboss.com/services/certification for more information > _______________________________________________ > GDAlgorithms-list mailing list > GDA...@li... > https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list > Archives: > http://sourceforge.net/mailarchive/forum.php?forum_id=3D6188 >=20 |
From: Yordan G. <yg...@gy...> - 2005-10-29 12:33:05
|
I just want to point out that I never advocated reusing the same animation
samples (or caching them) across multiple characters. And as Tom points out,
the chances of that happening are tiny unless your scene dictates otherwise
or you are not dealing with skeletal animation at all. So let's put this
point to rest. ;)

What I have are small per-character structures that hold cached sample info
and are used mostly when characters are updated. One structure = one
character update. They have maximum locality and require minimum fetch
operations, which can be predicted/anticipated. This is where my
no-cache-misses approach comes from. It also means you don't have to care
whether the whole animation is in memory, allowing you to stream from disk
if you have some crazy long cutscenes. Think of it as a reader head that
goes through the stream and caches the current data... when new data is
needed it fetches it from the stream. The fetch here can be an asynchronous
operation as the sample cache is reading ahead. (Cache is used as a
programming concept here rather than a real hw cache.)

structure
------
control data
time in the animation
control points for bone 1 (could be 5 controls - 4 used and 1 prefetched in
anticipation)
control points for bone 2
...
control points for bone N
...
Output Skeleton State
------

So the algorithm works all the time on this local-to-the-PU (or prefetched)
structure. No main memory access. When the time in the animation has
advanced enough that control point 4 becomes control point 3 and 5 -> 4...
we need to prefetch a new control point 5 - and all of this asynchronously,
so no stalls occur anywhere, i.e. no code waiting for the data.

The update will be something like this:
----
pre-fetch next character
calculate current character
store current character
move next: make current character = next character
----

> Note that I have not tried this on the PS2's scratchpad - that is rather
> small.
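In rough C++ terms, the per-bone reader head could look like this (a minimal
sketch only; the struct and function names are invented for illustration,
not taken from the actual system):

```cpp
#include <cstddef>

// One control = one key/knot. The window holds 4 controls in use plus 1
// prefetched in anticipation, matching the structure sketch above.
struct Control { float time; float value; };

struct BoneTrack {
    Control     window[5];  // local to the PU: 4 used + 1 read ahead
    std::size_t next;       // index of the next control in the stream
};

// Advance the reader head. When the animation time passes the third
// control, shift the window left and pull in a new fifth control. The
// fetch callback stands in for an asynchronous read from the animation
// stream in main memory (or disk) - the only place the stream is touched.
template <typename Fetch>
void advance(BoneTrack& t, float time, std::size_t stream_len, Fetch fetch) {
    while (time >= t.window[2].time && t.next < stream_len) {
        for (int i = 0; i < 4; ++i) t.window[i] = t.window[i + 1];
        t.window[4] = fetch(t.next++);  // read-ahead: no stall at use time
    }
}
```

Interpolation then happens entirely inside `window`, so in the steady state
the main-memory stream is only touched when the head crosses a knot
boundary.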
> The cache performance on the PS2 was so abysmal in every way that it
> would have been a big big job to get it even half-decent on there.

My approach works in the PS2 scratchpad. The differences from the PC code
are 5 lines of code that are ifdef-ed; all other code is identical on all
platforms. (This is with normal lerp for keyframes... with four spline
controls per bone I don't fit in the scratchpad either.) I also
double-buffer the scratchpad so I keep the processor busy.

What makes you think that cache performance is going to improve on PS3? Or
Xbox360? As far as I have read, the penalties for cache misses on both
platforms are considerable. Actually, relative to the speed of the
processing units the situation is much worse than on PS2. If you have a
system working on PS2 it's much easier to port it to PC/Xbox than the other
way around - especially if you want to get some performance out. PS3 is
again different from the other two, and still closer to PS2 in programming
ideology.

As you point out, Tom, the other two (non-PS) platforms are trivial. Most
techniques John mentioned are supported directly by OpenMP. This is no
coincidence, as the symmetric/shared-memory architecture favours that. They
are trivial because most techniques on these architectures are based around
"how to quickly make a serial program into a parallel one", loop parallelism
being by far the most popular. My approach is based mainly around that
pattern, with several sync points.

Anyway, I don't pretend what I have is best. What I'm saying is: try your
PS3 plan ASAP. Tweak/change it until you are happy with the performance.
When that happens, see how it maps to the other platforms and you'll find
there is hardly anything left to do. ;)

John, you talk about writing schedulers. Do these run in the main thread?
Have you tried having separate threads/processors schedule themselves with a
shared task queue/task pool approach?
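By the loop-parallelism pattern I mean the trivial one-pragma kind. A sketch
(AnimState and update_character here are stand-ins, not the real
interfaces):

```cpp
#include <cstddef>
#include <vector>

// One character = one task, parallelised as a flat loop over animation
// states. Compiled without OpenMP the pragma is simply ignored and the
// loop runs serially, which is still correct.
struct AnimState {
    float time;    // position in the animation
    float output;  // stand-in for the output skeleton state
};

void update_character(AnimState& s, float dt) {
    s.time += dt;             // advance the reader head
    s.output = s.time * 2.f;  // stand-in for sample/interpolate/blend
}

void update_all(std::vector<AnimState>& states, float dt) {
    // Iterations are independent - the animation system never looks across
    // entities - so dynamic scheduling balances uneven blend trees.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(states.size()); ++i)
        update_character(states[static_cast<std::size_t>(i)], dt);
}
```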
Just curious what the differences are going to be in PU utilization when you
add more PUs, although 1.9 is pretty good :)

-Yordan

----- Original Message -----
From: "Tom Forsyth" <tom...@ee...>
To: <gda...@li...>
Sent: Saturday, October 29, 2005 4:12 AM
Subject: RE: [Algorithms] skeletal animation system

> My argument here was that new keyframes/knots are required
> infrequently. With a
> good spline fitting algorithm you should be experiencing the
> same.

Well, from frame 1 of character 3, bone 4 to frame 2 of character 3, bone 4,
yes - you will need almost exactly the same data. The problem is, those two
samplings are an entire 60th of a second apart. You'll want to fill local
store/cache with tons of other data in the meantime. The chances of
characters 4, 5, 6, etc. needing the same 3-4 knots or keyframes in frame 1
are tiny, according to my data.

But yes, maybe if you have a big crowd of characters all running the same
animation at roughly the same phase, you'll get re-use. I'm just saying, I
don't think that is even close to the worst-case performance, nor do I think
it is even a common case. But it does depend on your game type.

> Do you have similar control structures or you rely that
> animation is going
> to be prefetched in local storage? If so do you have
> guarantee that every
> single of your animations is going to fit there? Remember
> this would be
> about 1/3 (in practice more like 1/4) of real local storage
> space if you
> plan to triple buffer to keep the PU busy.

I only fetch the 3-4 control points in each spline that the current
character needs. If the next character happens to also need them, then hey -
that's fine - the cache might do useful work. But I assume it doesn't, and I
assume that each character has to fetch all-new data, because that is the
worst case, and it's also the common case.
So I need to prefetch that data before I do the sampling & blending. Space
usually isn't a problem - I only need 4 control points from pos & orn for
each bone for each animation. In practice, the size of the required control
data (e.g. how long each animation is, what sort of compression each spline
uses, etc.) is bigger than the control point data I need this particular
frame. But none of it is really very large - it fits in all sensible-sized
caches and local data storage just fine.

Note that I have not tried this on the PS2's scratchpad - that is rather
small. The cache performance on the PS2 was so abysmal in every way that it
would have been a big big job to get it even half-decent on there. Since I
have five other platforms to deal with, I just waited for someone to try to
run a complex scene on the PS2 before I worried about it. And nobody ever
did, so it never had to be done. And probably never will now. Which I am
very happy about :-)

> You have to decide where in the animation stream are
> each quadruples
> of controls for each bone in order to get only those
> quadruples and DMA them
> in the local PU memory.

Correct.

> Now I can tell you straight away (and
> I'm almost sure
> things won't change much) that on PS2 one solid DMA transfer
> is far better
> than a number of small ones. They have some setup time and need to be
> prioritized and etc. So the less you have the better. Not
> only that you have
> to spend CPU some time deciding/calculating where exactly in
> your animation
> are these quadruples of keyframes and that is not something
> you want to be
> doing every frame, or you do?

Well, you have to decide which controls you need anyway - that's part of the
spline sampling process. So that's no extra cost.

Let me preface this with a warning - I have not tried a DMA-based scheme
yet.
That's because on every other platform, we're not talking about DMAs, we're talking about normal cache accesses, and they don't care about small/large transfers - everything is just the size of a cache line, and setting up a transfer is just executing a "prefetch" with the relevant address - extremely cheap. All you need to do is make sure you do those prefetches far enough in advance (to absorb the latency of the transfer from main memory), and life is good. The only platforms that DMAs are needed for are PS2 (avoided doing this - see above) and PS3. So, here is my entirely theoretical plan for PS3 - due to be implemented some time next year. Granny allows animation splines to be listed in any order you like in memory - they are remapped to the skeleton on the fly. This is a big bonus for a bunch of things, but that would be a very long discussion, so let's just take that as read. Some splines will have lots of knots/controls, and some will be able to use different sorts of compression, and so they will all be different lengths in bytes. So, given that, order your spline data in memory from smallest to largest. Now, decide what your magic byte length is for your platform. This is going to be largely trial and error, but it's roughly the place where the cost of setting up a new transfer exceeds the cost of transferring the extra data. So let's say that's 64 bytes (chosen at random). That is, if you wanted to transfer two chunks of 12 bytes that were 64 bytes apart, it is the same speed to set up a single transfer that is 64+12 bytes long, as to do two transfers of 12 bytes each. OK, so we know we need 12 bytes (roughly - varies with compression and spline degree) from each spline in the list. And they're ordered by size. So we just need to find the first spline with data longer than 64 bytes. All the splines before it in the list just get their entire data DMA'd over in a single chunk - it's not efficient to do them with separate transfers. 
All the ones afterwards just get a 12-byte transfer set up for each of them, because that's the only bit you need, and you'd probably run out of local store if you transferred the whole lot, as well as having to wait longer for the transfer. So that's the theory. I have yet to try it in practice though. Now, it may be that the number is not 64 bytes, it's more like 4kbytes or something. In which case this mostly just collapses to "transfer the whole animation over". Well, so be it. But people do use some pretty long animations with Granny (I've had people export 15-minute credit sequences as a single animation!), so at some point you need to cope with not being able to DMA the whole thing over. As I said above, on every platform but the PS*, it's trivial. They have caches, and you just do a prefetch instruction per spline. On PS3, it's somewhat more effort <shakes fist at Sony/IBM for their crazy architecture>. TomF. > -----Original Message----- > From: gda...@li... > [mailto:gda...@li...] On > Behalf Of Yordan Gyurchev > Sent: 28 October 2005 14:29 > To: gda...@li... > Subject: Re: [Algorithms] skeletal animation system > > > Ok, you've won me over. Processing one character/entity on > one PU thing. So, > one character = one task. > > When I think about it though it fits my approach, just the scale is > different. > > > By "fetch", I meant you're fetching them from main memory > into either a > > cache or some sort of local storage (e.g. SPU memory). I > was assuming the > > animation has already been loaded off disk, otherwise > you're doomed! > > Granny > > uses splines rather than keyframes, but the data access > patterns are > > roughly > > similar to a sparse keyframe system. > Fetch is = main memory => PU > I use the word keyframe for both normal keyframes and spline > knots (probably > wrong but in our case its still some data at some point in > time regardless > of the interp method). 
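The small/large split could be planned something like this (hypothetical
sketch: it assumes the splines are packed back-to-back in size order
starting at offset 0, and the 64-byte threshold is just the made-up example
number from above):

```cpp
#include <cstddef>
#include <vector>

// Each spline records its total compressed size and the absolute offset of
// the ~12-byte window of controls actually needed this frame.
struct Spline   { std::size_t bytes; std::size_t window_offset; };
struct Transfer { std::size_t offset; std::size_t bytes; };

// splines: sorted smallest to largest, contiguous from offset 0. Everything
// at or below the threshold is cheaper to fetch whole, so it all collapses
// into one solid DMA; each larger spline gets one small window transfer.
std::vector<Transfer> plan_transfers(const std::vector<Spline>& splines,
                                     std::size_t threshold,
                                     std::size_t window_bytes) {
    std::vector<Transfer> out;
    std::size_t i = 0, batch = 0;
    while (i < splines.size() && splines[i].bytes <= threshold)
        batch += splines[i++].bytes;  // whole-data region, one chunk
    if (batch != 0)
        out.push_back({0, batch});
    for (; i < splines.size(); ++i)   // big splines: just the window
        out.push_back({splines[i].window_offset, window_bytes});
    return out;
}
```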
> > > So IMHO it's best to focus optimisation on what you do at > each leaf, which > > can usually be expressed as sample-and-blend. > Agreed. This is what I'm most interested in as well. And > analyser data shows > most time is spend there. > > I find it difficult to follow what exactly you do with the PU > and what goes > to local memory and back to main etc (see later my rant on > DMA transfers). > IMHO there needs to be a number of support structures. Inputs > like where in > the animation you are, blending weights, parameters, etc. And > outputs like > the final skeleton state (what is rendered), other output > info (we return > for example velocity of the root or absolute movement). > > In my approach I put all my support structures in one solid > chunk and call > that animation state. Its all the data the task will need > (regardless of the > scale of the task - be it a full character or not). > > So we move the animation state into local memory, process > that character, > then move it back. It can be structured so read-only data > goes only one way > without problem. > > Now the question is what goes in this animation state and > what stays in main > memory and is fetched on request (or pre-streamed). With my > approach I keep > a duplicates of the control points (key frames) that are currently > interpolated into local interpolation structures (local to > the animation > state). So, only when I move to the next keyframe (require > next knot) I > touch main memory to fetch it (or not if it is pre-streamed > in local). My > argument here was that new keyframes/knots are required > infrequently. With a > good spline fitting algorithm you should be experiencing the > same. Mine is > (I assume) worse than yours by a mile and still keyframe > fetches (and for > that reason cache misses) are not an issue when I analyze the > execution. > Even more you can fetch asynchronously an extra > keyframe/control so you > don't have any stalls. 
> > Do you have similar control structures or you rely that > animation is going > to be prefetched in local storage? If so do you have > guarantee that every > single of your animations is going to fit there? Remember > this would be > about 1/3 (in practice more like 1/4) of real local storage > space if you > plan to triple buffer to keep the PU busy. > > May be I'm missing something here. > > > Whereas source animations can be compressed far smaller > than this - we > > average around 4 bytes per spline control, and you usually > need to sample > > the nearest three controls in a spline (quadratic > interpolation), and it's > > read-only data, so around 12 bytes of traffic. Obviously > these are really > > rough figures, but there's still a significant difference. > Yes I understand the maths, but I think you are focusing on > the wrong thing > here. Its not always how much you transfer but on what > chunks and how > often.You have to decide where in the animation stream are > each quadruples > of controls for each bone in order to get only those > quadruples and DMA them > in the local PU memory. Now I can tell you stright away (and > I'm almost sure > things wont change much) that on PS2 one solid DMA transfer > is far better > than a number of small ones. They have some setup time and need to be > prioritized and etc. So the less you have the better. Not > only that you have > to spend CPU some time deciding/calculating where exactly in > your animation > are these quadruples of keyframes and that is not something > you want to be > doing every frame, or you do? > > If you want to directly sample data you need to move most of > your animation > (if possible all of it) in local mem. Or consider some genius > animation > format where you can DMA solid blocks every time you need to > sample data > (once per animation). > > I must say that load every animation and sample all characters seems > attractive. 
Though it hides potential problems of either not > being able to > fit animation (too long animation) or sample data (too many > character). > > On the topic of perfect load balancing I agree with you. > > -Yordan > > ----- Original Message ----- > From: "Tom Forsyth" <tom...@ee...> > To: <gda...@li...> > Sent: Friday, October 28, 2005 8:27 PM > Subject: RE: [Algorithms] skeletal animation system > > > > You are talking about these > > few most important characters that you see in the foreground. > > I'm also talking about decent numbers of characters. In fact, > if there's > 1000s of characters in the scene, then keeping each > character's animation > inside a single PU makes even more sense. > > > Due to the overlapping nature of the memory allocation > (especially in > > cross-fade blends) you have implied data separation that > > might as well be utilized. > > Sorry, you've totally lost me there :-) > > > > You do very few blends between already-sampled data. > > Actually we almost never do anything but that. Different > > perspective here. > > We do a few N-blends based on parameters like vectors, > > quaternions, etc. > > You can do an "artistic" IK look-alike that way. I'm sure > > there was link > > in this list some time ago... about this. > > Bad mis-communication on my part. What I meant is that > because trees are > broad but shallow, you are almost always blending together > data sampled > directly from an animation (the leaves of the tree). You > rarely blend data > that is the result of a previous blend (the branches and > trunk of the tree). > So IMHO it's best to focus optimisation on what you do at > each leaf, which > can usually be expressed as sample-and-blend. > > Of course, in a purely binary tree, you have roughly equal > numbers of the > two operations. 
But that's one reason why a binary tree isn't > that great an > idea in general :-) > > However, you do say that in your data sets you often have > just one animation > running, so it's a sample with no blending. Granny also has a > special-cased > path for this (only one active animation). In practice, some > games hit that > a lot, some hit it almost never - it all depends on the data > sets. I would > be cautious about optimising for the case that is already pretty fast! > There's a minor code-maintenance issue, but it's only two > paths, and they > call a lot of the same functions (that the compiler inlines > perfectly well). > > > most of the time you are doing interpolation between key > > frames (that are > > stored locally) rather than fetching them. If your streams are that > > densely populated that you have to fetch every update you are doing > > something wrong, which i cant imagine being the case with Granny for > > example. > > By "fetch", I meant you're fetching them from main memory > into either a > cache or some sort of local storage (e.g. SPU memory). I was > assuming the > animation has already been loaded off disk, otherwise you're > doomed! Granny > uses splines rather than keyframes, but the data access > patterns are roughly > similar to a sparse keyframe system. > > The way Granny is oriented is on a character-by-character basis. Each > character is sampled & blended fully, pulling in the data on > demand, and > then you do the next one. There's obvious exceptions for > characters that > interact with each other, but they're certainly not the > average case. This > keeps all the intermediate data nice and warm in caches or > local storage, > and only the inputs (animations) and final outputs (what you > render) live in > main memory. 
> > To do effective streaming from main memory, you can either > prefetch/DMA all > the animations for a character before you start sampling the > first one (this > works well when characters are playing a lot of animations), > or you can > prefetch character 2 before you sample & blend character 1 > (this works well > when characters are playing much fewer or simpler > animations). If prefetch > times are long, and local storage/cache space is large, you > can prefetch > character 3 or 4 before doing character 1, but that tends to get > counterproductive and can thrash caches (which is of course lethal). > > Now, you could possibly orient it around animations instead. Load each > animation just once, then do all the sampling of that > animation required for > all the characters in the scene. This means you only fetch > each animation's > data just once. The problem is, the results of the sampling need to be > stored somewhere, but because you're doing all the sampling of that > animation in one go, you're unlikely to use that result soon > enough for it > to stay in a cache or fit in limited local storage. And it's > big - typically > a vec3 of position and a vec4 of orientation, or 28 bytes in > total, and you > need to write the data then read it back later - bus traffic > of 56 bytes. > Whereas source animations can be compressed far smaller than this - we > average around 4 bytes per spline control, and you usually > need to sample > the nearest three controls in a spline (quadratic > interpolation), and it's > read-only data, so around 12 bytes of traffic. Obviously > these are really > rough figures, but there's still a significant difference. > > > Granularity. By doing this unless you build some very fine > > scheduling metric > > you are going to have trouble balancing PU utilization. Some > > characters > > will have longer blend trees than other, taking more > processing times, > > etc. Unless you are doing all anim operations on one PU. 
> Still without > > test data on this one I may be completely off target. > > In general (obviously waving my hands a bit here), if you > have more than > about 4x the number of tasks than processors, and the time > taken by each > task is distributed in a roughly bell-curvish way, your > granularity is "fine > enough". Splitting the work into finer chunks might get > slightly more level > load-balancing, but the overhead of doing the splitting, and the extra > synchronisation required because you have more tasks, means > that you don't > actually get any benefit. > > To put this another way - the worst case scenario is that all > PUs finish all > available tasks, and have to wait for a single PU to do a > single task. They > never have to wait for that last PU to do more than one task (because > otherwise one of the idle ones would do it instead). So if > the total number > of tasks divided by the number of PUs is more than 10, then > the worst you > can do is waste 10% of your processing power. Wasting only > 10% of your total > power is considered superb efficiency in the multiprocessing > world! This > assumes that the tasks are not dependent and can be processed > by any PU in > any order, but if you partition by characters, this is 99% > true (and for the > few that are dependent on something else, just schedule those first). > Certainly with 1000s of characters and only around 8 PUs > (e.g. PS3), I just > can't see it being a problem. > > > TomF. > > > > -----Original Message----- > > From: gda...@li... > > [mailto:gda...@li...] On > > Behalf Of yg...@gy... > > Sent: 28 October 2005 03:08 > > To: gda...@li... > > Subject: RE: [Algorithms] skeletal animation system > > > > > > Hi Tom, > > > > Points, well made. > > > > I think we come from different points of view. You are > > talking about these > > few most important characters that you see in the foreground. 
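That worst-case bound is easy to sanity-check with toy arithmetic. Assuming
N roughly equal-cost, independent tasks greedily scheduled on P processors
(a simplification of the bell-curve case above; illustrative only, not code
from either animation system):

```cpp
// Idle PUs wait at most one task length while the last task finishes, so
// the wasted fraction of total capacity shrinks as N/P grows.
double worst_case_waste(int tasks, int pus) {
    int    makespan = (tasks + pus - 1) / pus;           // ceil(N/P) task lengths
    double ideal    = static_cast<double>(tasks) / pus;  // perfectly balanced
    return 1.0 - ideal / makespan;                       // fraction of capacity idle
}
```

With 81 tasks on 8 PUs (just over the 10x ratio) the model wastes under 10%;
with only 9 tasks on 8 PUs it wastes over 40% - which is why granularity
matters.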
> > I'm focusing on LOTS of characters in the background > > (instancing numbers, > > hundred, thousands, etc) and not so complicated animation operations > > (cross-fade blends being by far the most popular blend in my > > experience in > > these cases). > > Due to the overlapping nature of the memory allocation > (especially in > > cross-fade blends) you have implied data separation that > > might as well be > > utilized. > > > > Also in my experience especially when animating all kinds of > > entities (and > > have some external attachment between them) its good to process all > > animations before you get to actually processing your > > entities (characters > > if you want). I think we agree on this point. > > > > > Every now and then you get multiple characters doing > > roughly the same > > > animation, and the cache helps, but I regard this as luck. > > We have actually done some research and testing on that. I > > dont think I > > agree with this statement when it comes to crowds. We (a > colleague of > > mine) do massive character scenes by sorting morphed data. We > > upload the > > morph targets and then send instance data (translate,rotate). This > > achieves amazing (talking 1000+) quantities of characters on > > screen on a > > current console platform. I was convinced at first that this > > would never > > work because like you I didnt imagine there would be any reasonable > > exploitable "roughly the same keyframe" animation. > > > > > we need to optimise the worse case scenarios, not the best! > > Quite right ;) In my case worst scenario happens when you do > > fast forward > > cause you hit the anim stream way too often. Unless > > pre-streamed like you > > suggest. > > > > > So, in conclusion - assume every time you sample an > > animation, you will > > > miss every cache. > > Agreed. 
However, I'm arguing (and in my experience and > > analyser data) that > > most of the time you are doing interpolation between key > > frames (that are > > stored locally) rather than fetching them. If your streams are that > > densely populated that you have to fetch every update you are doing > > something wrong, which i cant imagine being the case with Granny for > > example. > > > > I do agree that blend trees are usually not that tall and > > binary blending > > (except for cross-fades) is not the best under the sun. > > > > > You do very few blends between already-sampled data. > > Actually we almost never do anything but that. Different > > perspective here. > > We do a few N-blends based on parameters like vectors, > > quaternions, etc. > > You can do an "artistic" IK look-alike that way. I'm sure > > there was link > > in this list some time ago... about this. > > > > > Thus, although it is elegant to split the processing into > > (a) sample all > > > your animations into temporary buffers and then (b) blend > > the buffers > > > together, in practice it makes more sense to have your > > animation sampling > > > also do the weighted accumulation, so that you can focus all your > > > optimisation on that one routine, and not have that > > intermediate data hang > > > around polluting caches - as soon as you sample a bone's > > data, you add it > > > to > > > its blend. > > My approach is: dont do any blends if you dont have to. So you > > sample/interpolate your anims and most often, if they are not > > in a cross > > fade, you are done. If there is something else to be done > its another > > task. Makes for better code separation. With your solution > you have to > > make a fit-all process function. > > Still your approach will be less memory consuming and less > > burden on the > > bus so it could be worth the maintainability-performance trade-off. 
> > > So the point I'm trying to (finally) get to is that trying
> > > to parallelise
> > > the operations done on one particular character seems like
> > > a lot of effort
> > > for little benefit.
> > I'm not after parallelising operations on one character. I'm for
> > parallelizing all operations on all characters in one go. The
> > animation
> > system doesn't know about "character" as such... it's just animating
> > entities...
> > Your main point here is that the serial term of the algorithm being
> > dominant is reducing or invalidating any benefit from the
> > parallel part. I
> > must agree that it's a good point but without any test data I
> > can't say if
> > that is the case or not. If task scheduling (and DMA transfers) is
> > properly interleaved (double or triple buffered) you don't
> > get any penalty
> > for that. Still, a very valid point.
> >
> > > IMHO, a far better thing is to assume that the number of characters
> > > on-screen is significantly larger than the number of PUs,
> > > and simply do all
> > > the anim sampling and blending of each character on a single PU.
> > Granularity. By doing this, unless you build some very fine
> > scheduling metric
> > you are going to have trouble balancing PU utilization. Some
> > characters
> > will have longer blend trees than others, taking more
> > processing time,
> > etc. Unless you are doing all anim operations on one PU.
> > Still without
> > test data on this one I may be completely off target.
> >
> > Thanks ever so much for the feedback. In fact I think I'm
> > going to try
> > some of your ideas and see how they compare with the
> > existing solution.
> >
> > -Yordan

-------------------------------------------------------
This SF.Net email is sponsored by the JBoss Inc.
Get Certified Today * Register for a JBoss Training Course
Free Certification Exam for All Training Attendees Through End of 2005
Visit http://www.jboss.com/services/certification for more information
_______________________________________________
GDAlgorithms-list mailing list
GDA...@li...
https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
Archives:
http://sourceforge.net/mailarchive/forum.php?forum_id=6188
|