RE: [Algorithms] skeletal animation system
From: Tom F. <tom...@ee...> - 2005-10-29 18:51:17
Oh, I see what you mean now! That's pretty cunning.

> What makes you think that cache performance is going to improve on PS3?
> Or xbox360?
> As far as I have read the penalties for cache misses in both platforms
> are considerable. Actually relative to the speed of the processing units
> situation is much worse than on PS2.

Agreed. In fact, I believe the miss to main memory is almost exactly the
same between PS2 and Xbox 360 in nanoseconds, but of course the
instruction throughput has increased by somewhere around 30x!

However, that's the latency that is the same. For DMA systems, setting up
a transfer is somewhat costly, and there's often an overhead to starting
and stopping. On a cache-based system, it's a single instruction, and
since everything is cache-line-sized, there's very little penalty for
small transfers. Or actually, it's that there's very little gain from
large transfers!

So while you need to set up your fetches earlier in both systems, setting
them up in a cache-based system is cheaper. Most crucially, if you
mis-predict one in a thousand times, or run out of local storage, cache
systems don't crash, they just run a bit slower :-) For example, looping
animations need you to fetch the very end and the very start of the
animation. On a cache system, just prefetch one place - it's not worth
the extra computation to optimise for that rare case. On a DMA-based
system, you don't have a choice.

Sorry, little rant about caches vs DMA there. I know you weren't
suggesting they were inherently good. And of course it's perfectly true
that a cache takes more gates to implement.

> If you have a system working on PS2 it much easier to port it to PC/Xbox
> than the other way around - especially if you want to get some
> performance out. PS3 is again different from the other two and still
> closer to PS2 in programming ideology.

Erm... not really. The PS3 is similar to nothing in its ideology.
Having a good PS2 engine doesn't help you at all in having a good PS3
engine - except in raising your pain threshold. In theory you could say
that VU0 was like an SPU, but in practice VU0 was so crippled it was
almost useless (in my previous life writing games we used it as extra
scratchpad memory :-). Hopefully SPUs will be slightly more useful.

It is true that if you have a PS2 engine you can relatively easily port
it to PC/Xbox (and I have done exactly that - when designing a
cross-platform engine, the PS2 was the focus and then the other two were
easy). However, the PS3 is somewhat at right-angles to everything. It's
true that you can simply put each SPU on a "thread" and then instead of
DMAs you do memcpy, but it's not exactly elegant or efficient.

> Anyway I don't pretend what I have is best. What I'm saying is try ASAP
> your PS3 plan. Tweak/change your plan until you are happy with the
> performance. When that happens see how it maps to the other platforms
> and you'll find that there is hardly anything left to do. ;)

Well, maybe. It does mangle the PC/Xb/Xb360/Mac/GC/PS2/PSP/Rev (yeah, we
do a lot of platforms!) code quite a lot for this one platform, and one
of the things we value highly is a clean codebase, since customers
actually get to see it all. I also think that at best the PS3->whatever
back-port will be the same speed - but it might well be slower, because
essentially you're trading computing cycles to make up for not having as
much intelligence in the memory hierarchy.

But I do like your incremental update idea - certainly worth looking at.
I'm slightly worried about keeping chunks of temporary memory around
per-instance, but it probably isn't all that big.

TomF.

> -----Original Message-----
> From: gda...@li...
> [mailto:gda...@li...] On
> Behalf Of Yordan Gyurchev
> Sent: 29 October 2005 05:33
> To: gda...@li...
> Subject: Re: [Algorithms] skeletal animation system
>
>
> I just want to point out that I never advocated reusing same animation
> samples (or caching them) in multiple characters. And as Tom points out
> chances of that happening are tiny unless your scene dictates otherwise
> or you are not dealing with skeletal animation at all.
> So let's put this point to rest. ;)
>
> What I have are small structures per-character that hold cached sample
> info and are used mostly when characters are updated. One structure = one
> character update. They have maximum locality and require minimum fetch
> operations that can be predicted/anticipated. This is where my no-cache
> misses approach comes from. It also allows you to not care if you have
> the whole animation in memory, allowing you to stream from disk if you
> have some crazy long cutscenes.
>
> Think of it as a reader head that goes through the stream and caches the
> current data... when new data is needed it fetches it from the stream.
> Fetch here can be asynchronous operation as the sample cache is reading
> ahead. (cache is used as a programming concept here rather than real hw
> cache)
>
> structure
> ------
> control data
> time in the animation
> control points for bone 1 (could be 5 controls - 4 used and 1 prefetched
> in anticipation)
> control points for bone 2
> ...
> control points for bone N
> ...
> OutputSkeleton State
> ------
>
> So the algorithm works all the time on this local to the PU (or
> prefetched) structure. No main memory access. When the time in the
> animation is advanced enough so control point 4 becomes control point 3
> and 5 -> 4... we need to prefetch new control point 5 and all this
> asynchronous so no stalls occur anywhere i.e.
> code waiting for the data
>
> Update will be something like this:
> ----
> pre-fetch next character
> calculate current character
> store current character
> move next : make current character = next character
> ----
>
> > Note that I have not tried this on the PS2's scratchpad - that is
> > rather small. The cache performance on the PS2 was so abysmal in every
> > way that it would have been a big big job to get it even half-decent
> > on there.
> My approach works on PS2 scratchpad. The differences to PC code are 5
> lines of code that are ifdef-ed. All other code is identical in all
> platforms. (this is with normal lerp for key frames... with four
> controls per bone spline I don't fit in scratchpad either). I also
> double buffer the scratch pad so I keep the processor busy.
>
> What makes you think that cache performance is going to improve on PS3?
> Or xbox360?
> As far as I have read the penalties for cache misses in both platforms
> are considerable. Actually relative to the speed of the processing units
> situation is much worse than on PS2.
>
> If you have a system working on PS2 it much easier to port it to PC/Xbox
> than the other way around - especially if you want to get some
> performance out. PS3 is again different from the other two and still
> closer to PS2 in programming ideology.
>
> As you point out Tom the other (non PS) two platforms are trivial. Most
> techniques John mentioned are supported directly by OpenMP. This is no
> coincidence as the symmetric/shared memory architecture favours that.
> They are trivial because most techniques in these architectures are
> based around "how to quickly make a serial program into a parallel one".
> Loop parallelism being by far the most popular. My approach is based
> around mainly that pattern with several sync points.
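
The "reader head" with locally cached control points described above can
be sketched roughly as follows. This is only an illustration under stated
assumptions, not Yordan's actual code: it uses plain linear interpolation
with two keys in use plus one read ahead (rather than the 4+1 spline
controls he mentions), it only supports forward playback, and the class
and method names (ReaderHead, sample, fetches) are hypothetical.

```python
class ReaderHead:
    """Forward-only reader over one bone's keyframe stream (hypothetical sketch)."""

    def __init__(self, stream):
        # stream: list of (time, value) keys sorted by time, at least two entries.
        # It stands in for animation data living in main memory.
        self.stream = stream
        self.next_index = 0
        self.fetches = 0      # how many times "main memory" was touched
        self.window = []      # local copies: two keys in use plus one read ahead
        for _ in range(3):
            self._fetch()

    def _fetch(self):
        # In a real system this would be an asynchronous prefetch or DMA.
        if self.next_index < len(self.stream):
            self.window.append(self.stream[self.next_index])
            self.next_index += 1
            self.fetches += 1

    def sample(self, t):
        # Slide the window once t passes the second key in use. The key needed
        # next was already read ahead, so only a new look-ahead key is fetched.
        while len(self.window) > 2 and t >= self.window[1][0]:
            self.window.pop(0)
            self._fetch()
        (t0, v0), (t1, v1) = self.window[0], self.window[1]
        u = min(max((t - t0) / (t1 - t0), 0.0), 1.0)
        return v0 + (v1 - v0) * u


head = ReaderHead([(0.0, 0.0), (1.0, 10.0), (2.0, 20.0), (3.0, 0.0)])
samples = [head.sample(t) for t in (0.25, 0.5, 0.75, 1.5)]
```

The first three samples fall between the same two keys, so they touch
only the local window; a new key is pulled from the stream only when the
fourth sample crosses a key boundary.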
>
> Anyway I don't pretend what I have is best. What I'm saying is try ASAP
> your PS3 plan. Tweak/change your plan until you are happy with the
> performance. When that happens see how it maps to the other platforms
> and you'll find that there is hardly anything left to do. ;)
>
> John, you talk about writing schedulers. Do these run in the main
> thread? Have you tried separate threads/processors to schedule
> themselves with shared task queue/task pools approach? Just curious what
> the differences are going to be in PU utilization when you add more PUs
> although 1.9 is pretty good :)
>
> -Yordan
>
> ----- Original Message -----
> From: "Tom Forsyth" <tom...@ee...>
> To: <gda...@li...>
> Sent: Saturday, October 29, 2005 4:12 AM
> Subject: RE: [Algorithms] skeletal animation system
>
>
> > My argument here was that new keyframes/knots are required
> > infrequently. With a good spline fitting algorithm you should be
> > experiencing the same.
>
> Well, from frame 1 of character 3, bone 4 to frame 2 of character 3,
> bone 4, yes - you will need almost exactly the same data. The problem
> is, those two samplings are an entire 60th of a second apart. You'll
> want to fill local store/cache with tons of other data in the meantime.
>
> The chances of characters 4, 5, 6, etc needing the same 3-4 knots or
> keyframes in frame 1 is tiny, according to my data. But yes, maybe if
> you have a big crowd of characters all running the same animation at
> roughly the same phase, you'll get re-use. I'm just saying, I don't
> think that is even close to the worst case performance, nor do I think
> it is even a common case. But it does depend on your game type.
>
> > Do you have similar control structures or you rely that animation is
> > going to be prefetched in local storage?
> > If so do you have guarantee that every single of your animations is
> > going to fit there? Remember this would be about 1/3 (in practice more
> > like 1/4) of real local storage space if you plan to triple buffer to
> > keep the PU busy.
>
> I only fetch the 3-4 control points in each spline that the current
> character needs. If the next character happens to also need them, then
> hey - that's fine - the cache might do useful work. But I assume it
> doesn't, and I assume that each character has to fetch all-new data,
> because that is the worst case, and it's also the common case. So I need
> to prefetch that data before I do the sampling & blending.
>
> Space usually isn't a problem - I only need 4 control points from pos &
> orn for each bone for each animation. In practice, the size of the
> required control data (e.g. how long each animation is, what sort of
> compression each spline uses, etc) is bigger than the control point data
> I need this particular frame. But none of it is really very large - it
> fits in all sensible-sized caches and local data storage just fine.
>
> Note that I have not tried this on the PS2's scratchpad - that is rather
> small. The cache performance on the PS2 was so abysmal in every way that
> it would have been a big big job to get it even half-decent on there.
> Since I have five other platforms to deal with, I just waited for
> someone to try to run a complex scene on the PS2 before I worried about
> it. And nobody ever did, so it never had to be done. And probably never
> will now. Which I am very happy about :-)
>
> > You have to decide where in the animation stream are each quadruples
> > of controls for each bone in order to get only those quadruples and
> > DMA them in the local PU memory.
>
> Correct.
>
> > Now I can tell you straight away (and I'm almost sure things won't
> > change much) that on PS2 one solid DMA transfer is far better than a
> > number of small ones. They have some setup time and need to be
> > prioritized and etc. So the less you have the better. Not only that
> > you have to spend CPU some time deciding/calculating where exactly in
> > your animation are these quadruples of keyframes and that is not
> > something you want to be doing every frame, or you do?
>
> Well, you have to decide which controls you need anyway - that's part of
> the spline sampling process. So that's no extra cost.
>
> Let me preface this with a warning - I have not tried a DMA-based scheme
> yet. That's because on every other platform, we're not talking about
> DMAs, we're talking about normal cache accesses, and they don't care
> about small/large transfers - everything is just the size of a cache
> line, and setting up a transfer is just executing a "prefetch" with the
> relevant address - extremely cheap. All you need to do is make sure you
> do those prefetches far enough in advance (to absorb the latency of the
> transfer from main memory), and life is good. The only platforms that
> DMAs are needed for are PS2 (avoided doing this - see above) and PS3.
>
> So, here is my entirely theoretical plan for PS3 - due to be implemented
> some time next year.
>
> Granny allows animation splines to be listed in any order you like in
> memory - they are remapped to the skeleton on the fly. This is a big
> bonus for a bunch of things, but that would be a very long discussion,
> so let's just take that as read. Some splines will have lots of
> knots/controls, and some will be able to use different sorts of
> compression, and so they will all be different lengths in bytes. So,
> given that, order your spline data in memory from smallest to largest.
>
> Now, decide what your magic byte length is for your platform. This is
> going to be largely trial and error, but it's roughly the place where
> the cost of setting up a new transfer exceeds the cost of transferring
> the extra data. So let's say that's 64 bytes (chosen at random). That
> is, if you wanted to transfer two chunks of 12 bytes that were 64 bytes
> apart, it is the same speed to set up a single transfer that is 64+12
> bytes long, as to do two transfers of 12 bytes each.
>
> OK, so we know we need 12 bytes (roughly - varies with compression and
> spline degree) from each spline in the list. And they're ordered by
> size. So we just need to find the first spline with data longer than 64
> bytes. All the splines before it in the list just get their entire data
> DMA'd over in a single chunk - it's not efficient to do them with
> separate transfers. All the ones afterwards just get a 12-byte transfer
> set up for each of them, because that's the only bit you need, and you'd
> probably run out of local store if you transferred the whole lot, as
> well as having to wait longer for the transfer.
>
> So that's the theory. I have yet to try it in practice though. Now, it
> may be that the number is not 64 bytes, it's more like 4kbytes or
> something. In which case this mostly just collapses to "transfer the
> whole animation over". Well, so be it. But people do use some pretty
> long animations with Granny (I've had people export 15-minute credit
> sequences as a single animation!), so at some point you need to cope
> with not being able to DMA the whole thing over.
>
> As I said above, on every platform but the PS*, it's trivial. They have
> caches, and you just do a prefetch instruction per spline. On PS3, it's
> somewhat more effort <shakes fist at Sony/IBM for their crazy
> architecture>.
>
>
> TomF.
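
The size-sorted transfer plan in the message above could look roughly
like this. It is a sketch under assumptions, not Granny code: the
function name plan_transfers, the knot_offsets argument (the byte offset
inside each spline of the controls needed this frame) and the contiguous
layout are all hypothetical, and the 64-byte threshold and 12-byte read
are simply the example figures from the message.

```python
from bisect import bisect_right
from itertools import accumulate


def plan_transfers(sizes, knot_offsets, needed=12, threshold=64):
    """Return a list of (byte_offset, length) transfers for one animation.

    sizes: spline sizes in bytes, sorted smallest to largest and assumed to
    be laid out back to back in memory in that order.
    knot_offsets: per-spline byte offset of the controls needed this frame.
    """
    if any(a > b for a, b in zip(sizes, sizes[1:])):
        raise ValueError("sizes must be sorted smallest to largest")
    # Start offset of every spline in the contiguous block.
    starts = [0] + list(accumulate(sizes))[:-1]
    # Index of the first spline whose data is longer than the threshold.
    split = bisect_right(sizes, threshold)
    transfers = []
    if split > 0:
        # All the small splines go over whole, in one transfer.
        transfers.append((0, starts[split - 1] + sizes[split - 1]))
    for i in range(split, len(sizes)):
        # Large splines: fetch only the controls needed this frame.
        transfers.append((starts[i] + knot_offsets[i], needed))
    return transfers


plan = plan_transfers([16, 40, 64, 200, 1000], [0, 0, 0, 24, 500])
```

With these made-up sizes the three small splines collapse into one
120-byte transfer and the two large ones get a 12-byte transfer each.
Raising the threshold far enough reduces the plan to a single transfer of
the whole animation.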
>
>
> > -----Original Message-----
> > From: gda...@li...
> > [mailto:gda...@li...] On
> > Behalf Of Yordan Gyurchev
> > Sent: 28 October 2005 14:29
> > To: gda...@li...
> > Subject: Re: [Algorithms] skeletal animation system
> >
> >
> > Ok, you've won me over. Processing one character/entity on one PU
> > thing. So, one character = one task.
> >
> > When I think about it though it fits my approach, just the scale is
> > different.
> >
> > > By "fetch", I meant you're fetching them from main memory into
> > > either a cache or some sort of local storage (e.g. SPU memory). I
> > > was assuming the animation has already been loaded off disk,
> > > otherwise you're doomed! Granny uses splines rather than keyframes,
> > > but the data access patterns are roughly similar to a sparse
> > > keyframe system.
> > Fetch is = main memory => PU
> > I use the word keyframe for both normal keyframes and spline knots
> > (probably wrong but in our case it's still some data at some point in
> > time regardless of the interp method).
> >
> > > So IMHO it's best to focus optimisation on what you do at each leaf,
> > > which can usually be expressed as sample-and-blend.
> > Agreed. This is what I'm most interested in as well. And analyser data
> > shows most time is spent there.
> >
> > I find it difficult to follow what exactly you do with the PU and what
> > goes to local memory and back to main etc (see later my rant on DMA
> > transfers). IMHO there needs to be a number of support structures.
> > Inputs like where in the animation you are, blending weights,
> > parameters, etc. And outputs like the final skeleton state (what is
> > rendered), other output info (we return for example velocity of the
> > root or absolute movement).
> >
> > In my approach I put all my support structures in one solid chunk and
> > call that animation state.
> > It's all the data the task will need (regardless of the scale of the
> > task - be it a full character or not).
> >
> > So we move the animation state into local memory, process that
> > character, then move it back. It can be structured so read-only data
> > goes only one way without problem.
> >
> > Now the question is what goes in this animation state and what stays
> > in main memory and is fetched on request (or pre-streamed). With my
> > approach I keep duplicates of the control points (key frames) that are
> > currently interpolated into local interpolation structures (local to
> > the animation state). So, only when I move to the next keyframe
> > (require next knot) I touch main memory to fetch it (or not if it is
> > pre-streamed in local). My argument here was that new keyframes/knots
> > are required infrequently. With a good spline fitting algorithm you
> > should be experiencing the same. Mine is (I assume) worse than yours
> > by a mile and still keyframe fetches (and for that reason cache
> > misses) are not an issue when I analyze the execution. Even more you
> > can fetch asynchronously an extra keyframe/control so you don't have
> > any stalls.
> >
> > Do you have similar control structures or you rely that animation is
> > going to be prefetched in local storage? If so do you have guarantee
> > that every single of your animations is going to fit there? Remember
> > this would be about 1/3 (in practice more like 1/4) of real local
> > storage space if you plan to triple buffer to keep the PU busy.
> >
> > May be I'm missing something here.
> >
> > > Whereas source animations can be compressed far smaller than this -
> > > we average around 4 bytes per spline control, and you usually need
> > > to sample the nearest three controls in a spline (quadratic
> > > interpolation), and it's read-only data, so around 12 bytes of
> > > traffic.
> > > Obviously these are really rough figures, but there's still a
> > > significant difference.
> > Yes I understand the maths, but I think you are focusing on the wrong
> > thing here. It's not always how much you transfer but on what chunks
> > and how often. You have to decide where in the animation stream are
> > each quadruples of controls for each bone in order to get only those
> > quadruples and DMA them in the local PU memory. Now I can tell you
> > straight away (and I'm almost sure things won't change much) that on
> > PS2 one solid DMA transfer is far better than a number of small ones.
> > They have some setup time and need to be prioritized and etc. So the
> > less you have the better. Not only that you have to spend CPU some
> > time deciding/calculating where exactly in your animation are these
> > quadruples of keyframes and that is not something you want to be doing
> > every frame, or you do?
> >
> > If you want to directly sample data you need to move most of your
> > animation (if possible all of it) in local mem. Or consider some
> > genius animation format where you can DMA solid blocks every time you
> > need to sample data (once per animation).
> >
> > I must say that load every animation and sample all characters seems
> > attractive. Though it hides potential problems of either not being
> > able to fit animation (too long animation) or sample data (too many
> > characters).
> >
> > On the topic of perfect load balancing I agree with you.
> >
> > -Yordan
> >
> > ----- Original Message -----
> > From: "Tom Forsyth" <tom...@ee...>
> > To: <gda...@li...>
> > Sent: Friday, October 28, 2005 8:27 PM
> > Subject: RE: [Algorithms] skeletal animation system
> >
> >
> > > You are talking about these
> > > few most important characters that you see in the foreground.
> >
> > I'm also talking about decent numbers of characters.
> > In fact, if there's 1000s of characters in the scene, then keeping
> > each character's animation inside a single PU makes even more sense.
> >
> > > Due to the overlapping nature of the memory allocation (especially
> > > in cross-fade blends) you have implied data separation that might as
> > > well be utilized.
> >
> > Sorry, you've totally lost me there :-)
> >
> > > > You do very few blends between already-sampled data.
> > > Actually we almost never do anything but that. Different perspective
> > > here. We do a few N-blends based on parameters like vectors,
> > > quaternions, etc. You can do an "artistic" IK look-alike that way.
> > > I'm sure there was link in this list some time ago... about this.
> >
> > Bad mis-communication on my part. What I meant is that because trees
> > are broad but shallow, you are almost always blending together data
> > sampled directly from an animation (the leaves of the tree). You
> > rarely blend data that is the result of a previous blend (the branches
> > and trunk of the tree). So IMHO it's best to focus optimisation on
> > what you do at each leaf, which can usually be expressed as
> > sample-and-blend.
> >
> > Of course, in a purely binary tree, you have roughly equal numbers of
> > the two operations. But that's one reason why a binary tree isn't that
> > great an idea in general :-)
> >
> > However, you do say that in your data sets you often have just one
> > animation running, so it's a sample with no blending. Granny also has
> > a special-cased path for this (only one active animation). In
> > practice, some games hit that a lot, some hit it almost never - it all
> > depends on the data sets. I would be cautious about optimising for the
> > case that is already pretty fast! There's a minor code-maintenance
> > issue, but it's only two paths, and they call a lot of the same
> > functions (that the compiler inlines perfectly well).
> >
> > > most of the time you are doing interpolation between key frames
> > > (that are stored locally) rather than fetching them. If your streams
> > > are that densely populated that you have to fetch every update you
> > > are doing something wrong, which I can't imagine being the case with
> > > Granny for example.
> >
> > By "fetch", I meant you're fetching them from main memory into either
> > a cache or some sort of local storage (e.g. SPU memory). I was
> > assuming the animation has already been loaded off disk, otherwise
> > you're doomed! Granny uses splines rather than keyframes, but the data
> > access patterns are roughly similar to a sparse keyframe system.
> >
> > The way Granny is oriented is on a character-by-character basis. Each
> > character is sampled & blended fully, pulling in the data on demand,
> > and then you do the next one. There's obvious exceptions for
> > characters that interact with each other, but they're certainly not
> > the average case. This keeps all the intermediate data nice and warm
> > in caches or local storage, and only the inputs (animations) and final
> > outputs (what you render) live in main memory.
> >
> > To do effective streaming from main memory, you can either
> > prefetch/DMA all the animations for a character before you start
> > sampling the first one (this works well when characters are playing a
> > lot of animations), or you can prefetch character 2 before you sample
> > & blend character 1 (this works well when characters are playing much
> > fewer or simpler animations). If prefetch times are long, and local
> > storage/cache space is large, you can prefetch character 3 or 4 before
> > doing character 1, but that tends to get counterproductive and can
> > thrash caches (which is of course lethal).
> >
> > Now, you could possibly orient it around animations instead.
> > Load each animation just once, then do all the sampling of that
> > animation required for all the characters in the scene. This means you
> > only fetch each animation's data just once. The problem is, the
> > results of the sampling need to be stored somewhere, but because
> > you're doing all the sampling of that animation in one go, you're
> > unlikely to use that result soon enough for it to stay in a cache or
> > fit in limited local storage. And it's big - typically a vec3 of
> > position and a vec4 of orientation, or 28 bytes in total, and you need
> > to write the data then read it back later - bus traffic of 56 bytes.
> > Whereas source animations can be compressed far smaller than this - we
> > average around 4 bytes per spline control, and you usually need to
> > sample the nearest three controls in a spline (quadratic
> > interpolation), and it's read-only data, so around 12 bytes of
> > traffic. Obviously these are really rough figures, but there's still a
> > significant difference.
> >
> > > Granularity. By doing this unless you build some very fine
> > > scheduling metric you are going to have trouble balancing PU
> > > utilization. Some characters will have longer blend trees than
> > > other, taking more processing times, etc. Unless you are doing all
> > > anim operations on one PU. Still without test data on this one I may
> > > be completely off target.
> >
> > In general (obviously waving my hands a bit here), if you have more
> > than about 4x the number of tasks than processors, and the time taken
> > by each task is distributed in a roughly bell-curvish way, your
> > granularity is "fine enough". Splitting the work into finer chunks
> > might get slightly more level load-balancing, but the overhead of
> > doing the splitting, and the extra synchronisation required because
> > you have more tasks, means that you don't actually get any benefit.
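
The granularity point above can be checked numerically with a tiny
simulation. This is a sketch under assumptions rather than anything from
the thread: it models a greedy shared task queue where each PU takes the
next task the moment it is free, ignores scheduling and synchronisation
overhead, and the function name simulate is hypothetical. With such a
queue no PU sits idle at the end for longer than one task, which is what
bounds the waste.

```python
import heapq


def simulate(task_times, num_pus):
    """Greedy shared-queue schedule; returns (makespan, wasted_fraction)."""
    # free_at holds the time at which each PU next becomes idle.
    free_at = [0] * num_pus
    heapq.heapify(free_at)
    for t in task_times:
        # The PU that frees up first takes the next task in the queue.
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + t)
    makespan = max(free_at)
    busy = sum(task_times)
    # Fraction of total PU time spent idle while waiting for the last task.
    wasted = 1.0 - busy / (makespan * num_pus)
    return makespan, wasted


# 81 equal tasks on 8 PUs: seven PUs wait while one runs the odd task out.
makespan, wasted = simulate([1] * 81, 8)
```

Here 81 unit-length tasks on 8 PUs finish at time 11 with roughly 8% of
the PU time idle. Feeding in task times with a wide spread, or only a
handful of tasks per PU, shows where that stops holding.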
> >
> > To put this another way - the worst case scenario is that all PUs
> > finish all available tasks, and have to wait for a single PU to do a
> > single task. They never have to wait for that last PU to do more than
> > one task (because otherwise one of the idle ones would do it instead).
> > So if the total number of tasks divided by the number of PUs is more
> > than 10, then the worst you can do is waste 10% of your processing
> > power. Wasting only 10% of your total power is considered superb
> > efficiency in the multiprocessing world! This assumes that the tasks
> > are not dependent and can be processed by any PU in any order, but if
> > you partition by characters, this is 99% true (and for the few that
> > are dependent on something else, just schedule those first). Certainly
> > with 1000s of characters and only around 8 PUs (e.g. PS3), I just
> > can't see it being a problem.
> >
> >
> > TomF.
> >
> >
> > > -----Original Message-----
> > > From: gda...@li...
> > > [mailto:gda...@li...] On
> > > Behalf Of yg...@gy...
> > > Sent: 28 October 2005 03:08
> > > To: gda...@li...
> > > Subject: RE: [Algorithms] skeletal animation system
> > >
> > >
> > > Hi Tom,
> > >
> > > Points, well made.
> > >
> > > I think we come from different points of view. You are talking about
> > > these few most important characters that you see in the foreground.
> > > I'm focusing on LOTS of characters in the background (instancing
> > > numbers, hundred, thousands, etc) and not so complicated animation
> > > operations (cross-fade blends being by far the most popular blend in
> > > my experience in these cases).
> > > Due to the overlapping nature of the memory allocation (especially
> > > in cross-fade blends) you have implied data separation that might as
> > > well be utilized.
> > >
> > > Also in my experience especially when animating all kinds of
> > > entities (and have some external attachment between them) it's good
> > > to process all animations before you get to actually processing your
> > > entities (characters if you want). I think we agree on this point.
> > >
> > > > Every now and then you get multiple characters doing roughly the
> > > > same animation, and the cache helps, but I regard this as luck.
> > > We have actually done some research and testing on that. I don't
> > > think I agree with this statement when it comes to crowds. We (a
> > > colleague of mine) do massive character scenes by sorting morphed
> > > data. We upload the morph targets and then send instance data
> > > (translate, rotate). This achieves amazing (talking 1000+)
> > > quantities of characters on screen on a current console platform. I
> > > was convinced at first that this would never work because like you I
> > > didn't imagine there would be any reasonable exploitable "roughly
> > > the same keyframe" animation.
> > >
> > > > we need to optimise the worst case scenarios, not the best!
> > > Quite right ;) In my case worst scenario happens when you do fast
> > > forward cause you hit the anim stream way too often. Unless
> > > pre-streamed like you suggest.
> > >
> > > > So, in conclusion - assume every time you sample an animation, you
> > > > will miss every cache.
> > > Agreed. However, I'm arguing (and in my experience and analyser
> > > data) that most of the time you are doing interpolation between key
> > > frames (that are stored locally) rather than fetching them. If your
> > > streams are that densely populated that you have to fetch every
> > > update you are doing something wrong, which I can't imagine being
> > > the case with Granny for example.
> > >
> > > I do agree that blend trees are usually not that tall and binary
> > > blending (except for cross-fades) is not the best under the sun.
> > >
> > > > You do very few blends between already-sampled data.
> > > Actually we almost never do anything but that. Different perspective
> > > here. We do a few N-blends based on parameters like vectors,
> > > quaternions, etc. You can do an "artistic" IK look-alike that way.
> > > I'm sure there was link in this list some time ago... about this.
> > >
> > > > Thus, although it is elegant to split the processing into (a)
> > > > sample all your animations into temporary buffers and then (b)
> > > > blend the buffers together, in practice it makes more sense to
> > > > have your animation sampling also do the weighted accumulation, so
> > > > that you can focus all your optimisation on that one routine, and
> > > > not have that intermediate data hang around polluting caches - as
> > > > soon as you sample a bone's data, you add it to its blend.
> > > My approach is: don't do any blends if you don't have to. So you
> > > sample/interpolate your anims and most often, if they are not in a
> > > cross fade, you are done. If there is something else to be done it's
> > > another task. Makes for better code separation. With your solution
> > > you have to make a fit-all process function.
> > > Still your approach will be less memory consuming and less burden on
> > > the bus so it could be worth the maintainability-performance
> > > trade-off.
> > >
> > > > So the point I'm trying to (finally) get to is that trying to
> > > > parallelise the operations done on one particular character seems
> > > > like a lot of effort for little benefit.
> > > I'm not after parallelising operations on one character. I'm for
> > > parallelizing all operations on all characters on one go.
> > > The animation system doesn't know about "character" as such... it's just
> > > animating entities...
> > > Your main point here is that the serial term of the algorithm, being
> > > dominant, reduces or invalidates any benefit from the parallel part. I
> > > must agree that it's a good point, but without any test data I can't say if
> > > that is the case or not. If task scheduling (and DMA transfers) is
> > > properly interleaved (double or triple buffered) you don't get any penalty
> > > for that. Still, a very valid point.
> > >
> > > > IMHO, a far better thing is to assume that the number of characters
> > > > on-screen is significantly larger than the number of PUs, and simply do
> > > > all the anim sampling and blending of each character on a single PU.
> > > Granularity. By doing this, unless you build some very fine scheduling
> > > metric, you are going to have trouble balancing PU utilization. Some
> > > characters will have longer blend trees than others, taking more processing
> > > time, etc. Unless you are doing all anim operations on one PU. Still,
> > > without test data on this one I may be completely off target.
> > >
> > > Thanks ever so much for the feedback. In fact I think I'm going to try
> > > some of your ideas and see how they compare with the existing solution.
> > >
> > > -Yordan
> > >
> >
> > -------------------------------------------------------
> > This SF.Net email is sponsored by the JBoss Inc.
> > Get Certified Today * Register for a JBoss Training Course
> > Free Certification Exam for All Training Attendees Through End of 2005
> > Visit http://www.jboss.com/services/certification for more information
> > _______________________________________________
> > GDAlgorithms-list mailing list
> > GDA...@li...
> > https://lists.sourceforge.net/lists/listinfo/gdalgorithms-list
> > Archives:
> > http://sourceforge.net/mailarchive/forum.php?forum_id=6188