## Re: [Quesa-develop] Re: Quesa - Optimization ideas.


### Thread view

Re: [Quesa-develop] Re: Quesa - Optimization ideas. From: Edward K. Chew - 2004-02-12 17:29:46

```
On Feb 12, 2004, at 11:06, Dair Grant wrote:

> Edward K. Chew wrote:
>
>>> What about __fsqrt? Brian used it in all of his code. I'm assuming it
>>> was because it was/is faster than sqrt.
>>
>> I think __fsqrt is only implemented on the G5 (PPC970), am I right?
>
> That's correct, I think Seth is referring to __frsqrte.
>
>> It would be great to use __fsqrt, but presumably you would then need
>> two versions of the Quesa library: one for the G5 and one for pre-G5s.
>
> The optimisations would be done through some kind of function pointer
> which Q3Math_SquareRoot and Q3Math_InvSquareRoot could invoke.

Sounds good, though __fsqrt, being the single instruction in assembly
that it is, is just itching to be inlined, isn't it? ;-)  But you're
right: even in a function which does nothing else, you should see a
significant benefit.

> I.e., if you knew you could be happy with lower precision results you
> can use one of those routines rather than sqrt or 1/sqrt.
>
> Today they just call through to sqrt and 1/sqrt, but if this was done
> through a function pointer you could have a sqrt-based approach for
> non-PowerPCs and __frsqrte for PowerPCs (and __fsqrt for G5s).

I had one case recently in which a unit normal vector somehow wound up
with a length of 1.00001, presumably thanks to my version of sqrt.  Then
I think I did an asin on it and it was, of course, NaN City after that.
Generally, I doubt that for 3D graphics, a slight loss of precision would
cause any harm, but there may be a few special cases to watch out for.

I guess if you are using one level of indirection through a function
pointer, you could add an API to let one choose between the fast
transcendental functions and safer ones?

>> If you need a whole batch of single-precision square roots, I would
>> recommend vec_rsqrte as an alternative to __frsqrte.
>
> Yes, Q3Vector3D_DotArray and Q3Triangle_CrossProductArray could also use
> a similar scheme - calling a scalar implementation for the general case,
> and AltiVec/SSE when available.

You know, if you added a fourth redundant dimension to the Cartesian
coordinate system, you could load points, vectors, etc. as true altivec
vectors and perform a number of operations VERY quickly, albeit at the
cost of some unused memory.  Hmm...

-Ted
```
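The "NaN City" failure described in the message above (a nominally unit-length value of 1.00001 fed to asin) is commonly avoided by clamping the argument first. The following is a minimal sketch of that guard; `clamp_unit` is a hypothetical helper name, not a Quesa or QD3D function.

```c
#include <assert.h>

/* Hypothetical helper, not a Quesa call: force a value into [-1, 1] so
   that a later asin() or acos() stays inside its domain even when the
   input has drifted slightly past 1 through rounding. */
static double clamp_unit(double s)
{
    assert(s == s);            /* a NaN input cannot be repaired by clamping */
    if (s > 1.0)  return 1.0;
    if (s < -1.0) return -1.0;
    return s;
}
```

A caller would then write something like `angle = asin(clamp_unit(sine));`. The clamp hides small rounding errors only; a value far outside the range still points to a bug upstream.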
Re: [Quesa-develop] Re: Quesa - Optimization ideas. From: Dair Grant - 2004-02-12 17:52:01

```
Edward K. Chew wrote:

>Sounds good, though __fsqrt, being the single instruction in assembly
>that it is, is just itching to be inlined, isn't it? ;-)

Yes, this is actually the reason I haven't got round to implementing this
yet... :-)

For the Mac, what we would probably want would be a Q3Fastxxx version of
these calls so that we could do the operation inline.

That would then need to be conditionalised by platform - was in two minds
what the cleanest way to do this would be, given that we want a
macro/function pointer/standard implementation.

>that. Generally, I doubt that for 3D graphics, a slight loss of
>precision would cause any harm, but there may be a few special cases to
>watch out for.

Yes, for interactive graphics you can almost always get away with it: it
does tend to be more of a problem for non-interactive rendering (or for
anything cumulative).

>I guess if you are using one level of indirection through a function
>pointer, you could add an API to let one choose between the fast
>transcendental functions and safer ones?

The Q3Math root functions are deliberately intended as fast and
approximate: rather than have a fast version and a safe version, it felt
better to just provide a fast implementation and assume that anyone using
precision can call sqrt as normal.

Of course at the moment our fast implementation is no faster, since it
also calls sqrt - so it's really "<=" in terms of speed rather than
"<"... :-)

>You know, if you added a fourth redundant dimension to the Cartesian
>coordinate system, you could load points, vectors, etc. as true altivec
>vectors and perform a number of operations VERY quickly, albeit at the
>cost of some unused memory.  Hmm...

Right now QD3D vectors and colours are always three components.

This is something we might want to change, or at least add the option
that you can specify points as quads and colours as argb (rather than
distinct diffuse colour + diffuse transparency colour).

I believe OpenGL on the Mac can perform better if you're sending down
vertices which are aligned like this, so it may be worth it even if we
don't do any vector processing ourselves.

-dair (ideally I think we should do as little work as possible: i.e., we
manage the data, but we should touch it as infrequently as we can as it
passes from the app through us to the rendering API)
___________________________________________________
mailto:dair+refnum.com              http://www.refnum.com/
```
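The function-pointer scheme discussed in the two messages above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the names (`InvSqrtFn`, `select_inv_sqrt`, `fast_inv_sqrt`) are hypothetical and are not the Q3Math API, and both back ends are written out by hand so that the sketch has no platform or library dependencies. In a real port the default slot would presumably call 1/sqrt and the fast slot would wrap a hardware estimate such as __frsqrte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef float (*InvSqrtFn)(float);

/* Stand-in for a hardware reciprocal-square-root estimate: the well-known
   bit-level initial guess followed by one Newton-Raphson step.  Accurate
   to roughly 0.2% relative error. */
static float inv_sqrt_estimate(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);          /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);        /* initial guess                */
    memcpy(&y, &bits, sizeof y);
    return y * (1.5f - 0.5f * x * y * y);    /* one refinement step          */
}

/* Stand-in for the careful path: the same estimate refined by three more
   Newton-Raphson steps, which brings it close to float precision. */
static float inv_sqrt_refined(float x)
{
    float y = inv_sqrt_estimate(x);
    int i;
    for (i = 0; i < 3; i++)
        y = y * (1.5f - 0.5f * x * y * y);
    return y;
}

/* The single level of indirection: chosen once, e.g. at start-up. */
static InvSqrtFn g_inv_sqrt = inv_sqrt_refined;

static void select_inv_sqrt(int prefer_fast)
{
    g_inv_sqrt = prefer_fast ? inv_sqrt_estimate : inv_sqrt_refined;
}

/* Hypothetical public entry point that callers would use. */
static float fast_inv_sqrt(float x)
{
    assert(x > 0.0f);                        /* sketch handles positive input only */
    return g_inv_sqrt(x);
}
```

The trade-off raised in the thread shows up directly: a call through a pointer cannot be inlined, so a single-instruction estimate loses part of its advantage, which is why an inline Q3Fastxxx-style macro is mentioned as the alternative.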
Re: [Quesa-develop] Re: Quesa - Optimization ideas. From: Edward K. Chew - 2004-02-12 19:01:20

```
On Feb 12, 2004, at 12:50, Dair Grant wrote:

> Edward K. Chew wrote:
>
>> You know, if you added a fourth redundant dimension to the Cartesian
>> coordinate system, you could load points, vectors, etc. as true altivec
>> vectors and perform a number of operations VERY quickly, albeit at the
>> cost of some unused memory.  Hmm...
>
> Right now QD3D vectors and colours are always three components.
>
> This is something we might want to change, or at least add the option
> that you can specify points as quads and colours as argb (rather than
> distinct diffuse colour + diffuse transparency colour).
>
> I believe OpenGL on the Mac can perform better if you're sending down
> vertices which are aligned like this, so it may be worth it even if we
> don't do any vector processing ourselves.

I have been thinking about this.  Even if you don't restructure
everything globally, you might derive some benefit from applying it
locally.  In other words, you could temporarily realign the vectors, do
something time-consuming, and set them back to normal.  Altivec has
useful instructions for moving memory around in just this way.

I work on scientific apps and deal with Cartesian coordinates pretty
frequently.  One other trick I have learned to help get altivec in on the
action is to rearrange your array of vectors to look like this:

Original:   x0 y0 z0 x1   y1 z1 x2 y2   z2 x3 y3 z3
Optimized:  x0 x1 x2 x3   y0 y1 y2 y3   z0 z1 z2 z3

Now you can do just about anything you were used to doing with scalar
instructions, only four times faster!  Well, in theory, anyway... ;-)
Turns out it takes six vec_perm instructions to convert every group of
four vectors into this form, which is not too shabby, really.

At any rate, if you are interested in a function which does this, let me
know.

-Ted
```
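To illustrate why the "optimized" layout above suits a four-wide vector unit, here is a portable scalar sketch of four dot products computed from that layout. The `Quad3` type and `quad_dot` function are hypothetical names introduced for this example. Each pass of the loop touches the same lane of every component, so an AltiVec version could replace the whole loop with a handful of four-wide multiply-add operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: four 3D vectors stored component-wise, i.e.
   x0 x1 x2 x3, then y0 y1 y2 y3, then z0 z1 z2 z3. */
typedef struct {
    float x[4];
    float y[4];
    float z[4];
} Quad3;

/* Compute out[i] = dot(a_i, b_i) for the four vector pairs at once.
   No shuffling between lanes is needed in this layout. */
static void quad_dot(const Quad3 *a, const Quad3 *b, float out[4])
{
    int lane;
    assert(a != NULL && b != NULL && out != NULL);
    for (lane = 0; lane < 4; lane++) {
        out[lane] = a->x[lane] * b->x[lane]
                  + a->y[lane] * b->y[lane]
                  + a->z[lane] * b->z[lane];
    }
}
```

With the original interleaved layout (x0 y0 z0 x1 ...), the same computation would need the components of each vector to be summed across lanes, which is the part that vector units handle poorly.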
Re: [Quesa-develop] Re: Quesa - Optimization ideas. From: Edward K. Chew - 2004-02-13 17:39:14

```
On Feb 12, 2004, at 13:59, Edward K.Chew wrote:

> On Feb 12, 2004, at 12:50, Dair Grant wrote:
>
>> Edward K. Chew wrote:
>>
>>> You know, if you added a fourth redundant dimension to the Cartesian
>>> coordinate system, you could load points, vectors, etc. as true
>>> altivec vectors and perform a number of operations VERY quickly,
>>> albeit at the cost of some unused memory.  Hmm...
>>
>> Right now QD3D vectors and colours are always three components.
>>
>> This is something we might want to change, or at least add the option
>> that you can specify points as quads and colours as argb (rather than
>> distinct diffuse colour + diffuse transparency colour).
>>
>> I believe OpenGL on the Mac can perform better if you're sending down
>> vertices which are aligned like this, so it may be worth it even if we
>> don't do any vector processing ourselves.
>
> I have been thinking about this.  Even if you don't restructure
> everything globally, you might derive some benefit from applying it
> locally.  In other words, you could temporarily realign the vectors,
> do something time-consuming, and set them back to normal.  Altivec has
> useful instructions for moving memory around in just this way.
>
> I work on scientific apps and deal with Cartesian coordinates pretty
> frequently.  One other trick I have learned to help get altivec in on
> the action is to rearrange your array of vectors to look like this:
>
> Original:   x0 y0 z0 x1   y1 z1 x2 y2   z2 x3 y3 z3
> Optimized:  x0 x1 x2 x3   y0 y1 y2 y3   z0 z1 z2 z3
>
> Now you can do just about anything you were used to doing with scalar
> instructions, only four times faster!  Well, in theory, anyway... ;-)
> Turns out it takes six vec_perm instructions to convert every group of
> four vectors into this form, which is not too shabby, really.
>
> At any rate, if you are interested in a function which does this, let
> me know.

Well, here it is anyway.  Maybe someone might find it useful.  Note: I
had to mess with it a bit to dredge it out of the morass of
interdependent C++ classes it came from into something stand-alone and in
plain C, so hopefully I didn't break it in the process! :-)

-Ted

/* -------------------------------------------------- */

#include

typedef __vector unsigned char VUInt8;
typedef __vector unsigned long VUInt32;
typedef __vector float VFloat32;

#define USING_CODEWARRIOR

/*
    This is a hack to overcome a shortcoming of the C/C++ compiler in
    CodeWarrior 8.3 and earlier.  (This is the latest version I own, so I
    cannot say whether newer versions would also benefit.)  The problem is
    that the compiler always wants to load vector constants at the last
    moment before they are used.  You can see this upon disassembly even
    at the highest level of optimization.  In many cases, however, you
    want to pre-load a few registers BEFORE you enter a tight loop, thank
    you very much.  The seemingly redundant vec_min instruction gives the
    compiler just the kick in the pants it needs to force it to load the
    register immediately.
*/

#ifdef USING_CODEWARRIOR
#define LoadVUInt32(kr0,kr1,kr2,kr3) \
    vec_min((VUInt32)(kr0,kr1,kr2,kr3),(VUInt32)(kr0,kr1,kr2,kr3))
#else
#define LoadVUInt32(kr0,kr1,kr2,kr3) (VUInt32)(kr0,kr1,kr2,kr3)
#endif

/*
    PackQuads takes an array of 3D Cartesian coordinates (PCoords) and
    rearranges it in a form that is more amenable to vector operations
    using altivec.  This is what you get on output:

        PCoords:  x0 y0 z0 x1   y1 z1 x2 y2   z2 x3 y3 z3   x4 y4 z4 x5...
        vpQuads:  x0 x1 x2 x3   y0 y1 y2 y3   z0 z1 z2 z3   x4 x5 x6 x7...

    If the number of vectors is not a multiple of 4, the last 3 quads will
    be partially complete and padded out with zeros.  For example, say
    there are 10 vectors.  The last 3 quads should read:

        vpQuads:  x8 x9 0 0   y8 y9 0 0   z8 z9 0 0

    PackQuads can perform this rearrangement in place.  That is, you can
    pass the same address in for both vpQuads and PCoords if you like.
    Its counterpart, UnpackQuads, can also restore the original
    coordinates in place.  If you plan to do this, however, make sure you
    have allocated enough memory to house all the vectors in quad-form.
    Here are some formulae you can use:

        UInt32 NumCoords = NumVectors * 3;
        UInt32 NumQuadVectors = (NumCoords + 11) / 12;
        UInt32 NumQuads = NumQuadVectors * 3;

    This is a low-level function which doesn't make any assumptions about
    how you organize your data, but you will likely want to define two
    structures along the lines:

        struct TVector { Float32 x, y, z; };
        struct TQuadVector { VFloat32 x, y, z; };

    You can then cast arrays of these into the appropriate types for the
    PackQuads and UnpackQuads functions.
*/

void PackQuads(VFloat32* vpQuads,const Float32* PCoords,UInt32 NumVectors)
{
    const VFloat32* VPCoords;
    VUInt8 VPerm0A,VPerm0B,VPerm1A,VPerm1B,VPerm2A,VPerm2B;
    VFloat32 VZero;
    UInt32 NumGroups,NumExtras;
    VFloat32 vCoords0,vCoords1,vCoords2;
    VFloat32 vQuad0,vQuad1,vQuad2;
    VFloat32 vExtras[3];
    Float32* pExtras;
    UInt32 i;

    /* Permutation tables for the vector permute (vec_perm) instruction. */
    VPerm0A = (VUInt8)LoadVUInt32(0x00010203,0x0C0D0E0F,0x18191A1B,0x00000000);
    VPerm0B = (VUInt8)LoadVUInt32(0x00010203,0x04050607,0x08090A0B,0x14151617);
    VPerm1A = (VUInt8)LoadVUInt32(0x04050607,0x10111213,0x1C1D1E1F,0x00000000);
    VPerm1B = (VUInt8)LoadVUInt32(0x00010203,0x04050607,0x08090A0B,0x18191A1B);
    VPerm2A = (VUInt8)LoadVUInt32(0x08090A0B,0x14151617,0x00000000,0x00000000);
    VPerm2B = (VUInt8)LoadVUInt32(0x00010203,0x04050607,0x10111213,0x1C1D1E1F);

    /* The upcoming loop deals with groups of 3 altivec words, converting
       each into 1 quad vector. */
    NumGroups = NumVectors >> 2;    /* = NumVectors / 4 */
    VPCoords = (const VFloat32*)PCoords;
    for(i = 0; i < NumGroups; i++, VPCoords += 3, vpQuads += 3)
    {
        /* Load words from PCoords array. */
        vCoords0 = VPCoords[0];
        vCoords1 = VPCoords[1];
        vCoords2 = VPCoords[2];

        /* Permute them into quad vector form. */
        vQuad0 = vec_perm(vCoords0,vCoords1,VPerm0A);
        vQuad1 = vec_perm(vCoords0,vCoords1,VPerm1A);
        vQuad2 = vec_perm(vCoords0,vCoords1,VPerm2A);
        vQuad0 = vec_perm(vQuad0,vCoords2,VPerm0B);
        vQuad1 = vec_perm(vQuad1,vCoords2,VPerm1B);
        vQuad2 = vec_perm(vQuad2,vCoords2,VPerm2B);

        /* Write them back to the quads array. */
        vpQuads[0] = vQuad0;    /* x quad */
        vpQuads[1] = vQuad1;    /* y quad */
        vpQuads[2] = vQuad2;    /* z quad */
    }

    /* Check if there are any extra Cartesian vectors which have yet to be
       converted straggling along at the end.  These will need to be
       processed with regular scalar operations. */
    NumExtras = NumVectors & 3;     /* = NumVectors % 4 */
    if(NumExtras > 0)
    {
        PCoords = (const Float32*)VPCoords;

        /* At this point, I use a small array called vExtras to prepare
           the final 3 quads.  First I clear the whole thing so that any
           unwritten areas will contain zero.  The vec_splat instruction
           takes the last word from VPerm0A (which happens to be 0) and
           assigns it to all 4 words making up VZero. */
        VZero = vec_splat((VFloat32)VPerm0A,3);
        for(i = 0; i < 3; i++)
            vExtras[i] = VZero;

        /* Copy each remaining Cartesian coordinate to the appropriate
           location in vExtras. */
        pExtras = (Float32*)vExtras;
        for(i = 0; i < NumExtras; i++, pExtras++)
        {
            pExtras[0] = *PCoords++;    /* x(i) */
            pExtras[4] = *PCoords++;    /* y(i) */
            pExtras[8] = *PCoords++;    /* z(i) */
        }

        /* Finally, copy vExtras over to vpQuads and we're done. */
        for(i = 0; i < 3; i++)
            vpQuads[i] = vExtras[i];
    }
}

/*
    This is the functional reverse of PackQuads.  As it works in much the
    same way, I have not bothered to document it.
*/

void UnpackQuads(Float32* pCoords,const VFloat32* VPQuads,UInt32 NumVectors)
{
    VUInt8 VPerm0A,VPerm0B,VPerm1A,VPerm1B,VPerm2A,VPerm2B;
    UInt32 NumGroups,NumExtras;
    const Float32* PExtras;
    VFloat32* vpCoords;
    VFloat32 vCoords0,vCoords1,vCoords2;
    VFloat32 vQuad0,vQuad1,vQuad2;
    VFloat32 vExtras[3];
    UInt32 i;

    VPerm0A = (VUInt8)LoadVUInt32(0x00010203,0x10111213,0x00000000,0x04050607);
    VPerm0B = (VUInt8)LoadVUInt32(0x00010203,0x04050607,0x10111213,0x0C0D0E0F);
    VPerm1A = (VUInt8)LoadVUInt32(0x14151617,0x00000000,0x08090A0B,0x18191A1B);
    VPerm1B = (VUInt8)LoadVUInt32(0x00010203,0x14151617,0x08090A0B,0x0C0D0E0F);
    VPerm2A = (VUInt8)LoadVUInt32(0x00000000,0x0C0D0E0F,0x1C1D1E1F,0x00000000);
    VPerm2B = (VUInt8)LoadVUInt32(0x18191A1B,0x04050607,0x08090A0B,0x1C1D1E1F);

    NumGroups = NumVectors >> 2;
    vpCoords = (VFloat32*)pCoords;
    for(i = 0; i < NumGroups; i++, VPQuads += 3, vpCoords += 3)
    {
        vQuad0 = VPQuads[0];
        vQuad1 = VPQuads[1];
        vQuad2 = VPQuads[2];

        vCoords0 = vec_perm(vQuad0,vQuad1,VPerm0A);
        vCoords1 = vec_perm(vQuad0,vQuad1,VPerm1A);
        vCoords2 = vec_perm(vQuad0,vQuad1,VPerm2A);
        vCoords0 = vec_perm(vCoords0,vQuad2,VPerm0B);
        vCoords1 = vec_perm(vCoords1,vQuad2,VPerm1B);
        vCoords2 = vec_perm(vCoords2,vQuad2,VPerm2B);

        vpCoords[0] = vCoords0;
        vpCoords[1] = vCoords1;
        vpCoords[2] = vCoords2;
    }

    NumExtras = NumVectors & 3;
    if(NumExtras > 0)
    {
        for(i = 0; i < 3; i++)
            vExtras[i] = VPQuads[i];

        PExtras = (const Float32*)vExtras;
        pCoords = (Float32*)vpCoords;
        for(i = 0; i < NumExtras; i++, PExtras++)
        {
            *pCoords++ = PExtras[0];
            *pCoords++ = PExtras[4];
            *pCoords++ = PExtras[8];
        }
    }
}

/* -------------------------------------------------- */
```
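For checking an AltiVec build of PackQuads on a machine without AltiVec, a plain scalar reference can be handy. The sketch below is not Ted's code: `packed_float_count` and `pack_quads_scalar` are hypothetical names, the size calculation follows the formulae given in the message's comments, and unlike PackQuads this version does not work in place (source and destination must not overlap).

```c
#include <assert.h>
#include <stddef.h>

/* Number of floats the packed form occupies.  Mirrors the formulae in the
   message: NumQuadVectors = (NumCoords + 11) / 12, with 3 quads of 4
   floats for every group of up to 4 vectors. */
static size_t packed_float_count(size_t num_vectors)
{
    size_t num_coords = num_vectors * 3;
    size_t num_quad_vectors = (num_coords + 11) / 12;
    return num_quad_vectors * 3 * 4;
}

/* Scalar reference for the packing step.  dst must hold
   packed_float_count(num_vectors) floats and must not overlap src. */
static void pack_quads_scalar(float *dst, const float *src, size_t num_vectors)
{
    size_t total = packed_float_count(num_vectors);
    size_t i;

    assert(dst != NULL && src != NULL);

    /* Zero everything first so a partial final group is padded with 0. */
    for (i = 0; i < total; i++)
        dst[i] = 0.0f;

    for (i = 0; i < num_vectors; i++) {
        size_t group = i / 4;           /* which block of three quads   */
        size_t lane  = i % 4;           /* position inside each quad    */
        float *quads = dst + group * 12;
        quads[0 + lane] = src[3 * i + 0];   /* x quad */
        quads[4 + lane] = src[3 * i + 1];   /* y quad */
        quads[8 + lane] = src[3 * i + 2];   /* z quad */
    }
}
```

For the 10-vector example in the message, `packed_float_count(10)` is 36 floats (9 quads), and the last three quads come out as x8 x9 0 0, y8 y9 0 0, z8 z9 0 0.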
Re: [Quesa-develop] Re: Quesa - Optimization ideas. From: Roger Holmes - 2004-02-13 13:16:52

```
On Thursday, February 12, 2004, at 05:50 pm, Dair Grant wrote:

>> You know, if you added a fourth redundant dimension to the Cartesian
>> coordinate system, you could load points, vectors, etc. as true altivec
>> vectors and perform a number of operations VERY quickly, albeit at the
>> cost of some unused memory.  Hmm...
>
> Right now QD3D vectors and colours are always three components.
>
> This is something we might want to change, or at least add the option
> that you can specify points as quads and colours as argb (rather than
> distinct diffuse colour + diffuse transparency colour).

I may be saying the obvious as we are all at different points in our
understanding of graphics/coordinate geometry, but when I saw the AltiVec
specification I thought it was ideal for 3D work because a 3D vector has
an implied W of zero and a point has an implied W of one.  If you do a
matrix multiply by a vector the bottom (translation) row gets multiplied
by zero and has no effect whilst for a point it gets multiplied by one
and added on.  If you take the average of a number of points their Ws get
added together and normally you divide by the number of points, bringing
the W back to one.  Of course the implied divide by W does it for you
anyway.

The other advantage is that C typing does not get in the way.  For
points, ( A + B ) / 2 is not normally allowed, though you can do
A + ( B - A ) / 2 using standard routines as B - A gives a vector.  Using
4 coordinates, point = point + vector also works automatically, as does
lots of other stuff.

Do we really need point3D and vector3D on the G4?  I don't think so
unless we have millions of them to pack into as small a space as
possible.  Of course we have to make it all work on other machines, like
G3 too in a transparent way and that's the challenge.

Roger.
```
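The point Roger makes about the implied fourth coordinate can be checked with a few lines of scalar C. This is a minimal sketch under stated assumptions: `Vec4`, `Mat4` and the functions below are hypothetical names, the fourth coordinate is called `w` here, and the matrix uses the row-vector convention with the translation in the bottom row, as described in the message.

```c
#include <assert.h>

/* Hypothetical types: w = 0 marks a direction vector, w = 1 a point. */
typedef struct { float x, y, z, w; } Vec4;
typedef struct { float m[4][4]; } Mat4;   /* translation in row 3 */

/* Row vector times matrix: r = v * t. */
static Vec4 vec4_transform(Vec4 v, const Mat4 *t)
{
    Vec4 r;
    r.x = v.x * t->m[0][0] + v.y * t->m[1][0] + v.z * t->m[2][0] + v.w * t->m[3][0];
    r.y = v.x * t->m[0][1] + v.y * t->m[1][1] + v.z * t->m[2][1] + v.w * t->m[3][1];
    r.z = v.x * t->m[0][2] + v.y * t->m[1][2] + v.z * t->m[2][2] + v.w * t->m[3][2];
    r.w = v.x * t->m[0][3] + v.y * t->m[1][3] + v.z * t->m[2][3] + v.w * t->m[3][3];
    return r;
}

static Vec4 vec4_add(Vec4 a, Vec4 b)
{
    Vec4 r;
    r.x = a.x + b.x;  r.y = a.y + b.y;  r.z = a.z + b.z;  r.w = a.w + b.w;
    return r;
}

static Vec4 vec4_scale(Vec4 a, float s)
{
    Vec4 r;
    r.x = a.x * s;  r.y = a.y * s;  r.z = a.z * s;  r.w = a.w * s;
    return r;
}

/* (A + B) / 2 on two points: the w values sum to 2 and the division
   brings w back to 1, so the result is again a point. */
static Vec4 point_midpoint(Vec4 a, Vec4 b)
{
    assert(a.w == 1.0f && b.w == 1.0f);
    return vec4_scale(vec4_add(a, b), 0.5f);
}
```

With a pure translation matrix, a value with w = 1 is moved and a value with w = 0 passes through unchanged, and point + vector yields w = 1 without any type-specific code. All four fields go through the same arithmetic, which is what makes the representation a natural fit for a four-float vector register.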
