From: Brian P. <br...@va...> - 2000-10-19 03:20:40
|
In Mesa, software blending was failing the Glean blendFunc test because of using bit-shifts instead of integer divide. I've fixed that. Using the tdfx-3 driver on Voodoo5 I found and fixed a few blending problems caused by mismapping of GL blend modes to Glide blend modes. Interestingly, the Voodoo5 hardware appears to have some inaccuracy in its blending circuits because a variety of blend modes fail in Glean with errors ranging from 1.02949 to 1.16156 bits. Not much we can do about that. I'll check these fixes in as soon as SourceForge CVS is working again. -Brian |
From: Allen A. <ak...@po...> - 2000-10-19 05:49:31
|
On Wed, Oct 18, 2000 at 08:21:52PM -0600, Brian Paul wrote: | | In Mesa, software blending was failing the Glean blendFunc test because | of using bit-shifts instead of integer divide. I've fixed that. I haven't checked, but I suspect something like unsigned char src = ...; unsigned char srcFactor = ...; unsigned char dst = ...; unsigned char dstFactor = ...; unsigned int pix = src * srcFactor + dst * dstFactor; unsigned char result = (pix + (pix >> 8)) >> 8; might be close enough to pass, and would avoid the performance cost of a full integer divide. (I'm not positive about the math, but something like that expression is probably pretty close. Boundary conditions still need to be checked.) | Using the tdfx-3 driver on Voodoo5 I found and fixed a few blending | problems caused by mismapping of GL blend modes to Glide blend modes. | Interestingly, the Voodoo5 hardware appears to have some inaccuracy in | its blending circuits because a variety of blend modes fail in Glean | with errors ranging from 1.02949 to 1.16156 bits. Not much we can do | about that. At this stage it's OK, just something people should know about. If lots of implementations fail, then the test is too stringent, and we should relax it a little. Allen |
From: Keith P. <ke...@ke...> - 2000-10-19 06:39:24
|
unsigned char src = ...; unsigned char srcFactor = ...; unsigned char dst = ...; unsigned char dstFactor = ...; unsigned int pix = src * srcFactor + dst * dstFactor; unsigned char result = (pix + (pix >> 8)) >> 8; I don't think this is quite soup yet, here's some more ideas -- from code stolen from Jim Blinn: /* multiply two bytes representing 0..1 yielding a byte representing 0..1 */ #define IntMult(a,b,t) ( (t) = (a) * (b) + 0x80, ( ( ( (t)>>8 ) + (t) )>>8 ) ) unsigned short t; unsigned short pix = IntMult(src,srcFactor,t) + IntMult(dst,dstFactor,t); unsigned char result = (pix | (0 - (pix >> 8))); The final bit of twiddling saturates the result at 0xff with no compare. I suspect the two IntMults could be merged to avoid all of the extra shifts, but that would take actually compiling and testing the code... ke...@ke... XFree86 Core Team SuSE, Inc. |
From: Brian P. <br...@va...> - 2000-10-19 13:57:54
|
Keith Packard wrote: > > unsigned char src = ...; > unsigned char srcFactor = ...; > unsigned char dst = ...; > unsigned char dstFactor = ...; > unsigned int pix = src * srcFactor + dst * dstFactor; > unsigned char result = (pix + (pix >> 8)) >> 8; > > I don't think this is quite soup yet, here's some more ideas -- from code > stolen from Jim Blinn: > > /* multiply two bytes representing 0..1 yielding a byte representing 0..1 */ > #define IntMult(a,b,t) ( (t) = (a) * (b) + 0x80, ( ( ( (t)>>8 ) + (t) )>>8 ) ) > > unsigned short t; > unsigned short pix = IntMult(src,srcFactor,t) + IntMult(dst,dstFactor,t); > unsigned char result = (pix | (0 - (pix >> 8))); > > The final bit of twiddling saturates the result at 0xff with no compare. > > I suspect the two IntMults could be merged to avoid all of the extra shifts, > but that would take actually compiling and testing the code... The actual computation I'm doing for the common case of glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA) is result = (source * alpha + dest * (255 - alpha)) / 255; for each color channel. All vars are in the range [0,255]. With a little tinkering, I found that the following macro exactly computes X / 255 for X in [0, 65535]: #define DIV255(X) (((X) * ((1 << 16) / 255) + 256) >> 16) The macro's cost is an integer multiply, add and shift. The total cost for one color channel is 3 mults, 3 adds/subs and 1 shift. Blinn's method uses 2 mults, 5 adds/subs, 5 shifts and a temp var. I haven't done any benchmarking. -Brian |
From: Nathan H. <na...@ma...> - 2000-10-19 15:55:44
Attachments:
t.c
|
On Thu, Oct 19, 2000 at 07:04:12AM -0600, Brian Paul wrote: > > #define DIV255(X) (((X) * ((1 << 16) / 255) + 256) >> 16) When this code is passed through gcc -O2 the assembly produced is movl %edx,%eax # %eax = x sall $8,%eax # %eax = x << 8 leal 256(%edx,%eax),%eax # %eax = (x << 8) + x + 256 sarl $16,%eax # %eax = ((x << 8) + x + 256) >> 16 So gcc converts DIV255 into the method I posted earlier. > The macro's cost is an integer multiply, add and shift. > The total cost for one color channel is 3 mults, 3 adds/subs and 1 shift. After -O2 the cost is 2 shifts and 2 adds. > Blinn's method uses 2 mults, 5 adds/subs, 5 shifts and a temp var. > > I haven't done any benchmarking. I've attached a simple benchmark. Compile with -O2. Results are DIV255 5639408 approx 5629603 exact 5629793 Smaller is better. Numbers represent microseconds elapsed. DIV255 without optimisation is 1 multiply, 1 add, 1 shift. (((x) * ((1 << 16) / 255) + 256) >> 16) The approximate version isn't a significant win. ((x << 8) + x) >> 16 The exact version is the same as DIV255 after -O2, but is explicit. (((x << 8) + x) + 256) >> 16 3dnow/MMX can do all 4 channels (rgba) at once, quadrupling performance. |
From: Brian P. <br...@va...> - 2000-10-19 16:08:12
|
Nathan Hand wrote: > > On Thu, Oct 19, 2000 at 07:04:12AM -0600, Brian Paul wrote: > > > > #define DIV255(X) (((X) * ((1 << 16) / 255) + 256) >> 16) > > When this code is passed through gcc -O2 the assembly produced is > > movl %edx,%eax # %eax = x > sall $8,%eax # %eax = x << 8 > leal 256(%edx,%eax),%eax # %eax = (x << 8) + x + 256 > sarl $16,%eax # %eax = ((x << 8) + x + 256) >> 16 Huh? I think you're looking at the x86 code for your method, not my DIV255 macro. > So gcc converts DIV255 into the method I posted earlier. > > > The macro's cost is an integer multiply, add and shift. > > The total cost for one color channel is 3 mults, 3 adds/subs and 1 shift. > > After -O2 the cost is 2 shifts and 2 adds. > > > Blinn's method uses 2 mults, 5 adds/subs, 5 shifts and a temp var. > > > > I haven't done any benchmarking. > > I've attached a simple benchmark. Compile with -O2. Results are > > DIV255 5639408 > approx 5629603 > exact 5629793 > > Smaller is better. Numbers represent microseconds elapsed. > > DIV255 without optimisation is 1 multiply, 1 add, 1 shift. > > (((x) * ((1 << 16) / 255) + 256) >> 16) > > The approximate version isn't a significant win. > > ((x << 8) + x) >> 16 > > The exact version is the same as DIV255 after -O2, but is explicit. > > (((x << 8) + x) + 256) >> 16 I don't see how the int multiply in the DIV255 macro can be optimized away and be the same as your macro. -Brian |
From: Nathan H. <na...@ma...> - 2000-10-19 16:49:37
|
On Thu, Oct 19, 2000 at 09:14:34AM -0600, Brian Paul wrote: > Nathan Hand wrote: > > > > On Thu, Oct 19, 2000 at 07:04:12AM -0600, Brian Paul wrote: > > > > > > #define DIV255(X) (((X) * ((1 << 16) / 255) + 256) >> 16) > > > > When this code is passed through gcc -O2 the assembly produced is > > > > movl %edx,%eax # %eax = x > > sall $8,%eax # %eax = x << 8 > > leal 256(%edx,%eax),%eax # %eax = (x << 8) + x + 256 > > sarl $16,%eax # %eax = ((x << 8) + x + 256) >> 16 > > Huh? I think you're looking at the x86 code for your method, > not my DIV255 macro. Both my method and your DIV255 method produce the same assembly if compiled with -O2. You can verify this yourself with objdump. > > So gcc converts DIV255 into the method I posted earlier. > > > > > The macro's cost is an integer multiply, add and shift. > > > The total cost for one color channel is 3 mults, 3 adds/subs and 1 shift. > > > > After -O2 the cost is 2 shifts and 2 adds. > > > > > Blinn's method uses 2 mults, 5 adds/subs, 5 shifts and a temp var. > > > > > > I haven't done any benchmarking. > > > > I've attached a simple benchmark. Compile with -O2. Results are > > > > DIV255 5639408 > > approx 5629603 > > exact 5629793 > > > > Smaller is better. Numbers represent microseconds elapsed. > > > > DIV255 without optimisation is 1 multiply, 1 add, 1 shift. > > > > (((x) * ((1 << 16) / 255) + 256) >> 16) > > > > The approximate version isn't a significant win. > > > > ((x << 8) + x) >> 16 > > > > The exact version is the same as DIV255 after -O2, but is explicit. > > > > (((x << 8) + x) + 256) >> 16 > > I don't see how the int multiply in the DIV255 macro can be optimized > away and be the same as your macro. The int multiply is by a constant. X * 257 = X * 256 + X * 1. It's possible to convert any constant multiplier into strings of shifts and adds. GCC is smart enough to know when doing so is a win. -- Sydney 2000 - The Best Olympic Games Ever |
From: Brian P. <br...@va...> - 2000-10-19 17:19:44
|
Nathan Hand wrote: > > On Thu, Oct 19, 2000 at 09:14:34AM -0600, Brian Paul wrote: > > Nathan Hand wrote: > > > > > > On Thu, Oct 19, 2000 at 07:04:12AM -0600, Brian Paul wrote: > > > > > > > > #define DIV255(X) (((X) * ((1 << 16) / 255) + 256) >> 16) > > > > > > When this code is passed through gcc -O2 the assembly produced is > > > > > > movl %edx,%eax # %eax = x > > > sall $8,%eax # %eax = x << 8 > > > leal 256(%edx,%eax),%eax # %eax = (x << 8) + x + 256 > > > sarl $16,%eax # %eax = ((x << 8) + x + 256) >> 16 > > > > Huh? I think you're looking at the x86 code for your method, > > not my DIV255 macro. > > Both my method and your DIV255 method produce the same assembly if > compiled with -O2. You can verify this yourself with objdump. > > > > So gcc converts DIV255 into the method I posted earlier. > > > > > > > The macro's cost is an integer multiply, add and shift. > > > > The total cost for one color channel is 3 mults, 3 adds/subs and 1 shift. > > > > > > After -O2 the cost is 2 shifts and 2 adds. > > > > > > > Blinn's method uses 2 mults, 5 adds/subs, 5 shifts and a temp var. > > > > > > > > I haven't done any benchmarking. > > > > > > I've attached a simple benchmark. Compile with -O2. Results are > > > > > > DIV255 5639408 > > > approx 5629603 > > > exact 5629793 > > > > > > Smaller is better. Numbers represent microseconds elapsed. > > > > > > DIV255 without optimisation is 1 multiply, 1 add, 1 shift. > > > > > > (((x) * ((1 << 16) / 255) + 256) >> 16) > > > > > > The approximate version isn't a significant win. > > > > > > ((x << 8) + x) >> 16 > > > > > > The exact version is the same as DIV255 after -O2, but is explicit. > > > > > > (((x << 8) + x) + 256) >> 16 > > > > I don't see how the int multiply in the DIV255 macro can be optimized > > away and be the same as your macro. > > The int multiply is by a constant. X * 257 = X * 256 + X * 1. It's > possible to convert any constant multiplier into strings of shifts > and adds. GCC is smart enough to know when doing so is a win. Ah, right. I didn't think it through that far. Thanks. -Brian |
From: Ian D R. <id...@cs...> - 2000-10-19 17:29:23
|
> > > #define DIV255(X) (((X) * ((1 << 16) / 255) + 256) >> 16) > > > > When this code is passed through gcc -O2 the assembly produced is > > > > movl %edx,%eax # %eax = x > > sall $8,%eax # %eax = x << 8 > > leal 256(%edx,%eax),%eax # %eax = (x << 8) + x + 256 > > sarl $16,%eax # %eax = ((x << 8) + x + 256) >> 16 > > Huh? I think you're looking at the x86 code for your method, > not my DIV255 macro. > > > So gcc converts DIV255 into the method I posted earlier. Actually, he's probably right. I converted the two macros to functions and pumped them through Sequent's (now IBM) C compiler for ptx. Here's what it came up with: .text .align 16 .BK1: div255: / FUNCTION movl 0x4(%esp),%eax imull $0x101,%eax addl $0x100,%eax sarl $0x10,%eax ret ..fs2: .text .align 16 .BK2: div255_2: / FUNCTION movl 0x4(%esp),%eax movl %eax,%edx shll $0x8,%edx addl %edx,%eax addl $0x100,%eax sarl $0x10,%eax ret ..fs4: Both blocks of code are the same. I guess our compiler just isn't quite as aggressive about optimizing multiples (does it really make a difference on the P6 arch?) as GCC. -- "Government is like a baby. An alimentary canal with a big appetite at one end and no sense of responsibility at the other." -- Ronald Reagan There is no responsibility at http://www.cs.pdx.edu/~idr |
From: Allen A. <ak...@po...> - 2000-10-19 15:17:10
|
On Wed, Oct 18, 2000 at 11:39:21PM -0700, Keith Packard wrote: | | I don't think this is quite soup yet, ... Well, I *did* offer a few caveats in the original post. :-) | ...here's some more ideas -- from code | stolen from Jim Blinn... | [ good code omitted for brevity ] Fortunately we don't have to be too anal-retentive, for two reasons: 1. OpenGL's blending arithmetic specification leaves wiggle room in a few areas. For example, the precision of the source blend factor isn't necessarily related to the depth of the framebuffer's color channels. So going the extra mile to get a precise result using 8-bit colors might still yield results that differ from "ideal" hardware. 2. Mesa's software renderer is a production renderer for some people, so it's OK to provide slightly less blending precision in return for some extra speed. Simpler blending arithmetic might cover both of those bases. From Brian's followup it looks like he's got a good solution. Allen |
From: Rik F. <fa...@va...> - 2000-10-19 08:35:36
|
On Wed 18 Oct 2000 22:49:29 -0700, Allen Akin <ak...@po...> wrote: > On Wed, Oct 18, 2000 at 08:21:52PM -0600, Brian Paul wrote: > | > | In Mesa, software blending was failing the Glean blendFunc test because > | of using bit-shifts instead of integer divide. I've fixed that. > > I haven't checked, but I suspect something like > > unsigned char src = ...; > unsigned char srcFactor = ...; > unsigned char dst = ...; > unsigned char dstFactor = ...; > unsigned int pix = src * srcFactor + dst * dstFactor; > unsigned char result = (pix + (pix >> 8)) >> 8; > > might be close enough to pass, and would avoid the performance cost of > a full integer divide. (I'm not positive about the math, but > something like that expression is probably pretty close. Boundary > conditions still need to be checked.) Most (probably all) modern compilers will change a division to a shift when they can. Even with -O0, gcc changes 'a/256' to a 'sarl $8'. In this case (and other cases with constant divisors), I think it's worth looking at the code that the compiler generates before you re-write it just to get the shift -- you may already be getting a shift and can leave the code in a more readable form. [If you don't, I'd like to see the code to understand why.] [When a constant is involved, I advocate writing code using / instead of >> because it's more readable, often more correct (because ANSI C doesn't define how sign bits are handled for shifts), and because the compiler will use a shift anyway. Even when a constant isn't involve, unless the pathway is performance critical, writing a / often produces more readable code.] |
From: Brian P. <br...@va...> - 2000-10-19 13:16:19
|
Rik Faith wrote: > > On Wed 18 Oct 2000 22:49:29 -0700, > Allen Akin <ak...@po...> wrote: > > On Wed, Oct 18, 2000 at 08:21:52PM -0600, Brian Paul wrote: > > | > > | In Mesa, software blending was failing the Glean blendFunc test because > > | of using bit-shifts instead of integer divide. I've fixed that. > > > > I haven't checked, but I suspect something like > > > > unsigned char src = ...; > > unsigned char srcFactor = ...; > > unsigned char dst = ...; > > unsigned char dstFactor = ...; > > unsigned int pix = src * srcFactor + dst * dstFactor; > > unsigned char result = (pix + (pix >> 8)) >> 8; > > > > might be close enough to pass, and would avoid the performance cost of > > a full integer divide. (I'm not positive about the math, but > > something like that expression is probably pretty close. Boundary > > conditions still need to be checked.) > > Most (probably all) modern compilers will change a division to a shift > when they can. Even with -O0, gcc changes 'a/256' to a 'sarl $8'. In > this case (and other cases with constant divisors), I think it's worth > looking at the code that the compiler generates before you re-write it > just to get the shift -- you may already be getting a shift and can > leave the code in a more readable form. [If you don't, I'd like to > see the code to understand why.] > > [When a constant is involved, I advocate writing code using / instead > of >> because it's more readable, often more correct (because ANSI C > doesn't define how sign bits are handled for shifts), and because the > compiler will use a shift anyway. Even when a constant isn't involve, > unless the pathway is performance critical, writing a / often produces > more readable code.] We need to divide by 255, not 256. Shifting by 8 is a close approximation, but not quite good enough. -Brian |
From: Nathan H. <na...@ma...> - 2000-10-19 14:29:05
|
On Thu, Oct 19, 2000 at 06:22:43AM -0600, Brian Paul wrote: > > We need to divide by 255, not 256. Shifting by 8 is a close approximation, > but not quite good enough. Better approximation is (x << 8 - x) >> 16. Close enough? |
From: Nathan H. <na...@ma...> - 2000-10-19 15:07:24
|
On Fri, Oct 20, 2000 at 01:28:48AM +1100, nathanh wrote: > On Thu, Oct 19, 2000 at 06:22:43AM -0600, Brian Paul wrote: > > > > We need to divide by 255, not 256. Shifting by 8 is a close approximation, > > but not quite good enough. > > Better approximation is (x << 8 - x) >> 16. Close enough? Blah, I meant ((x << 8) + x) >> 16. Check twice. Post once. A more accurate version is (((x << 8) + x) + 256) >> 16. But the more accurate version requires an extra add. -- Sydney 2000 - The Best Olympic Games Ever |
From: Brian P. <br...@va...> - 2000-10-19 15:46:39
|
Nathan Hand wrote: > > On Fri, Oct 20, 2000 at 01:28:48AM +1100, nathanh wrote: > > On Thu, Oct 19, 2000 at 06:22:43AM -0600, Brian Paul wrote: > > > > > > We need to divide by 255, not 256. Shifting by 8 is a close approximation, > > > but not quite good enough. > > > > Better approximation is (x << 8 - x) >> 16. Close enough? > > Blah, I meant ((x << 8) + x) >> 16. Check twice. Post once. > > A more accurate version is (((x << 8) + x) + 256) >> 16. > > But the more accurate version requires an extra add. Nice! It gives exact results for X in [0,65535] without the multiply that I used. I'll use it. Thanks! -Brian |
From: Brian P. <br...@va...> - 2000-10-19 13:09:16
|
Allen Akin wrote: > > On Wed, Oct 18, 2000 at 08:21:52PM -0600, Brian Paul wrote: > | > | In Mesa, software blending was failing the Glean blendFunc test because > | of using bit-shifts instead of integer divide. I've fixed that. > > I haven't checked, but I suspect something like > > unsigned char src = ...; > unsigned char srcFactor = ...; > unsigned char dst = ...; > unsigned char dstFactor = ...; > unsigned int pix = src * srcFactor + dst * dstFactor; > unsigned char result = (pix + (pix >> 8)) >> 8; > > might be close enough to pass, and would avoid the performance cost of > a full integer divide. (I'm not positive about the math, but > something like that expression is probably pretty close. Boundary > conditions still need to be checked.) I've got some integer divide approximation code around here somewhere I could dig up. I think I've seen what you suggest above before and will try it out. > | Using the tdfx-3 driver on Voodoo5 I found and fixed a few blending > | problems caused by mismapping of GL blend modes to Glide blend modes. > | Interestingly, the Voodoo5 hardware appears to have some inaccuracy in > | its blending circuits because a variety of blend modes fail in Glean > | with errors ranging from 1.02949 to 1.16156 bits. Not much we can do > | about that. > > At this stage it's OK, just something people should know about. If > lots of implementations fail, then the test is too stringent, and we > should relax it a little. OK. -Brian |