Thread: RE: [GD-Windows] BitBlt() syncing to VBL
From: Brian S. <bs...@mi...> - 2002-01-15 18:24:40
A lot of what you're describing is surprising. When I was doing Mac stuff, PCI video was the high end and NuBus was more common, so that's a ways away from your G4 - but I never had an app that ran faster on the Mac than the PC.

Also, what you say about DDraw blits seems strange. Given that memcpy blitters are so easy to write, don't you think that the DX team would have used memcpy if it was dramatically faster? :) I'm not saying you're wrong - I haven't tested it myself - it just seems strange.

You also don't mention if you're page flipping, which would obviously lock you to the refresh rate. DrawSprocket and QuickDraw don't support page flipping (this is on pre-OSX), so that could be letting your Mac app run faster than your PC. On OSX, I don't know what the state of affairs is.

Sorry that none of that really helps your problem,

--brian

-----Original Message-----
From: Brian Hook [mailto:bri...@py...]
Sent: Tuesday, January 15, 2002 8:13 AM
To: gam...@li...
Subject: [GD-Windows] BitBlt() syncing to VBL

My blitting code on a PowerMac G4/867 is running over 50% faster than on a PC of higher clock speed (and with a faster video card to boot). It feels like it's syncing to VBL on the PC -- 75fps on a 76Hz LCD -- but I was under the impression that GDI's BitBlt() doesn't sync to VBL. Is this not the case?

I was getting the distinct feeling that it was syncing to VBL using DDraw's Blt() function, which is one reason I switched to GDI (not to mention that blitting using memcpy() has proven to be light years faster than using DDraw's blitters).

Brian
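For reference, the GDI path being described boils down to something like the sketch below: a system-memory DIB section used as a backbuffer and pushed to the window with BitBlt(). This is an illustrative reconstruction, not code from the thread - the dimensions and the g_* names are placeholders.

    /* Minimal sketch: 16-bit DIB section backbuffer + BitBlt() to the window. */
    #include <windows.h>

    #define WIDTH  640
    #define HEIGHT 480

    static HBITMAP g_backBmp;
    static void   *g_backBits;   /* raw pixels, writable with memcpy()-style code */
    static HDC     g_backDC;

    void CreateBackbuffer(HDC windowDC)
    {
        BITMAPINFO bmi = {0};
        bmi.bmiHeader.biSize        = sizeof(bmi.bmiHeader);
        bmi.bmiHeader.biWidth       = WIDTH;
        bmi.bmiHeader.biHeight      = -HEIGHT;   /* negative height = top-down rows */
        bmi.bmiHeader.biPlanes      = 1;
        bmi.bmiHeader.biBitCount    = 16;        /* with BI_RGB this is x555 */
        bmi.bmiHeader.biCompression = BI_RGB;

        g_backBmp = CreateDIBSection(windowDC, &bmi, DIB_RGB_COLORS,
                                     &g_backBits, NULL, 0);
        g_backDC  = CreateCompatibleDC(windowDC);
        SelectObject(g_backDC, g_backBmp);
    }

    void Present(HDC windowDC)
    {
        /* GDI is not documented to wait for vertical blank here; any apparent
         * VBL lock would be coming from the driver, not this call. */
        BitBlt(windowDC, 0, 0, WIDTH, HEIGHT, g_backDC, 0, 0, SRCCOPY);
    }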
From: Brian H. <bri...@py...> - 2002-01-15 18:35:29
> A lot of what you're describing is surprising. When I was doing Mac
> stuff, PCI video was the high end and NuBus was more common, so that's
> a ways away from your G4 - but I never had an app that ran faster on
> the Mac than the PC.

Well, Macs have gotten out of the Stone Age and are somewhere in the Iron Age, so they at least have AGP now =)

> Also, what you say about DDraw blits seems strange.

Well, it was a bit of an unfair comparison. With DDraw I was always assuming that the source surface had transparency, even if it didn't, so I'm sure it was going through a slower path than necessary. With my GDI blitter, I have separate opaque and transparent blitters.

> You also don't mention if you're page flipping, which would obviously
> lock you to the refresh rate.

Nope. This is on OS X using off-screen GWorlds and CopyBits(). This is pretty much the MacOS equivalent to DIBSections and BitBlt(). In fact, I should have even WORSE performance under OS X because I'm actually triple buffering -- my blit goes into the window's off-screen backbuffer, which is then blitted by the OS later. Under Windows I'm just going straight from the DIB section to the window's front buffer (in theory).

My guess is that there's still some VBL action going on somewhere (note: the Mac also VBLs, even though I'm getting 125fps, but the triple buffering is probably accelerating things by allowing multiple blits in a single refresh?), since I'm locked very close to my monitor's ostensible frame rate. I've tried disabling all the various "sync" parameters in the driver properties, but to no avail.

I do find this quite a bit odd simply because I was expecting to do a lot of optimization work on the Mac, since the Mac has a slower clock speed and significantly less memory bandwidth. My nearest guess is that I'm either doing something terribly wrong on the Windows side, or the Mac has some kind of mad, stupid Altivec optimized memcpy()/CopyBits().

-Brian
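For comparison, the Mac-side path described here (off-screen GWorld, CopyBits() into the window) looks roughly like the sketch below. This is reconstructed from memory of the QuickDraw/Carbon API, not the poster's code; error handling, pixel-depth setup and the flags that force system-memory allocation are left out, and 'win'/'bounds' are placeholders.

    /* Rough sketch: copy an off-screen GWorld into a window with CopyBits(). */
    #include <Carbon/Carbon.h>

    void BlitGWorldToWindow(GWorldPtr offscreen, WindowRef win, const Rect *bounds)
    {
        PixMapHandle pm = GetGWorldPixMap(offscreen);

        if (LockPixels(pm)) {                        /* pin the pixels during the copy */
            SetPortWindowPort(win);                  /* draw into the window's port */
            CopyBits(GetPortBitMapForCopyBits(offscreen),
                     GetPortBitMapForCopyBits(GetWindowPort(win)),
                     bounds, bounds, srcCopy, NULL); /* opaque copy, no mask region */
            UnlockPixels(pm);
        }
    }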
From: Jon W. <hp...@mi...> - 2002-01-15 18:45:47
For what it's worth, when I was doing Mac programming, the Apple line was always:

  "You should assume that CopyBits() is written by super intelligent
   space aliens and will always perform optimally."

Cheers,

/ h+

> I'm either doing something terribly wrong on the Windows side, or the
> Mac has some kind of mad, stupid Altivec optimized memcpy()/CopyBits().
From: Tom H. <to...@3d...> - 2002-01-15 18:57:34
At 10:45 AM 1/15/2002, Jon Watte wrote:
> For what it's worth, when I was doing Mac programming, the Apple line
> was always:
>
>   "You should assume that CopyBits() is written by super intelligent
>    space aliens and will always perform optimally."

Sadly, CopyBits() is _much_ slower than a memcpy() buried in a for loop. So much for super intelligent space aliens :P

Tom
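The memcpy()-in-a-loop blitter being compared against is essentially the following - a straight row-by-row copy between two buffers that already share a pixel format. The names and the bytes-per-row ("pitch") convention are illustrative, not anyone's actual code.

    /* Row-by-row copy between same-format buffers; pitches are in bytes. */
    #include <string.h>

    void BlitRows(unsigned char *dst, int dstPitch,
                  const unsigned char *src, int srcPitch,
                  int rowBytes, int rows)
    {
        int y;
        for (y = 0; y < rows; ++y) {
            memcpy(dst, src, rowBytes);   /* one contiguous scanline at a time */
            dst += dstPitch;
            src += srcPitch;
        }
    }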
From: Jon W. <hp...@mi...> - 2002-01-15 20:24:18
> > For what it's worth, when I was doing Mac programming, the Apple line
> > was always:
> >
> >   "You should assume that CopyBits() is written by super intelligent
> >    space aliens and will always perform optimally."
>
> Sadly, CopyBits() is _much_ slower than a memcpy() buried in a for
> loop. So much for super intelligent space aliens :P

Well, it usually was the case that you had to "prime" CopyBits by making sure the moon phases were aligned for the source and destination GWorlds, but you could usually get it rolling pretty well. All the cost in CopyBits comes from pre-copy set-up, so the bigger the copy, and the more moon-phase alignment you can manage, the faster it gets.

Of course, there were special cases, like the "pixel doubling blit" which stuffed one 32-bit pixel twice into a double and used the 64-bit data path to the frame buffer, where CopyBits was lagging behind for a while. But I'd be really surprised if they haven't gotten around to fixing that (and others) by now, seeing as that was five years ago...

Cheers,

/ h+
From: Brian S. <bs...@mi...> - 2002-01-15 19:19:07
> Nope. This is on OS X using off-screen GWorlds and CopyBits(). This is
> pretty much the MacOS equivalent to DIBSections and BitBlt(). In fact,
> I should have even WORSE performance under OS X because I'm actually
> triple buffering -- my blit goes into the window's off-screen
> backbuffer, which is then blitted by the OS later. Under Windows I'm
> just going straight from the DIB section to the window's front buffer
> (in theory).

Yeah, but assuming that the offscreen backbuffer is on the video card, the vidmem-to-vidmem blit is so fast that it's essentially free. Certainly doesn't cost you any CPU time to queue it up.

> My guess is that there's still some VBL action going on somewhere
> (note: the Mac also VBLs, even though I'm getting 125fps, but the
> triple buffering is probably accelerating things by allowing multiple
> blits in a single refresh?), since I'm locked very close to my
> monitor's ostensible frame rate.

Yeah - with your frame rate sitting right at the refresh rate, it would be a stretch to suspect anything else. Can you disable bits of your pipeline and log your framerate to see where the bottleneck is? i.e. if you build your frame but don't blit it to the card, how many fps do you get? What if you just blit an empty frame to the card every time? Etc. etc.

I think the recipe for speed is to minimize blits across the bus - composite the new frame in system memory, do one blit to the back buffer of the video card, then flip or blit back to front. You don't want to send something across the bus that will later be overdrawn.

> I've tried disabling all the various "sync" parameters in the driver
> properties, but to no avail.
>
> I do find this quite a bit odd simply because I was expecting to do a
> lot of optimization work on the Mac, since the Mac has a slower clock
> speed and significantly less memory bandwidth. My nearest guess is that
> I'm either doing something terribly wrong on the Windows side, or the
> Mac has some kind of mad, stupid Altivec optimized memcpy()/CopyBits().

I would bet that CopyBits is heavily optimized, but BitBlt should be too. I think the key for both is to make sure you're on the fast path - no transparency, pixel formats match, palettes (if any) match - so that the function can just blast bits. I prefer DirectDraw over GDI because if you're not on the fast path you can tell immediately - either nothing will draw, or in the case of 1555 vs. 565, everything looks very, very odd.

--brian
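One way to do the kind of bisection suggested above is to time the build and blit phases separately with QueryPerformanceCounter(). A minimal sketch - BuildFrame() and BlitFrame() are hypothetical stand-ins for whatever the real pipeline stages are:

    /* Time the frame-build and blit phases independently with QPC. */
    #include <windows.h>
    #include <stdio.h>

    extern void BuildFrame(void);   /* composite the frame in system memory */
    extern void BlitFrame(void);    /* push the finished frame to the window */

    void TimeOneFrame(void)
    {
        LARGE_INTEGER freq, t0, t1, t2;
        QueryPerformanceFrequency(&freq);

        QueryPerformanceCounter(&t0);
        BuildFrame();
        QueryPerformanceCounter(&t1);
        BlitFrame();
        QueryPerformanceCounter(&t2);

        printf("build %.3f ms, blit %.3f ms\n",
               1000.0 * (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart,
               1000.0 * (double)(t2.QuadPart - t1.QuadPart) / (double)freq.QuadPart);
    }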
From: Brian H. <bri...@py...> - 2002-01-15 19:27:08
> Yeah, but assuming that the offscreen backbuffer is on the video card,
> the vidmem-to-vidmem blit is so fast that it's essentially free.
> Certainly doesn't cost you any CPU time to queue it up.

If only it were so simple =) You can specify where you want your offscreen GWorld allocated, and all my allocations are hardcoded into system memory. DX programming has taught me to stay away from anything twitchy -- like VRAM/AGP buffers =)

> Yeah - with your frame rate sitting right at the refresh rate, it
> would be a stretch to suspect anything else. Can you disable bits of
> your pipeline and log your framerate to see where the bottleneck is?
> i.e. if you build your frame but don't blit it to the card, how many
> fps do you get? What if you just blit an empty frame to the card every
> time? Etc. etc.

I'm going to check all that again. On the DX list this erupted into the "how to measure time" thread, but for now I'm using QPC, and I'll see what my timings are like for the screen build and the blit.

> I think the recipe for speed is to minimize blits across the bus -
> composite the new frame in system memory, do one blit to the back
> buffer of the video card, then flip or blit back to front. You don't
> want to send something across the bus that will later be overdrawn.

That's what I'm doing.

> I would bet that CopyBits is heavily optimized, but BitBlt should be
> too.

Not necessarily -- I mean, I would imagine that the Windows engineers long ago decided that GDI acceleration probably isn't going to be a major priority and have concentrated their efforts elsewhere. The Mac engineers, however, recognize that they need to look for Altivec optimizations anywhere they can, since that's part of their marketing strategy (the "MHz Myth").

> I think the key for both is to make sure you're on the fast path - no
> transparency, pixel formats match, palettes (if any) match - so that
> the function can just blast bits.

Well, this goes back to the XLATEOBJ_hGetColorTransform thread. There is some conversion happening, but it's nearly unavoidable unless I write some huge explosion of blitters. Right now I'm taking a friend's advice to basically do everything in some canonical format (x555 in my case) and let the back blitter handle conversion. This is theoretically slower than making a DDB and then writing every permutation of blitter necessary to support building my buffers in DDB land, but that just seems like a failure case waiting to happen, given the sheer number of pixel formats that are available.

Brian
From: Andy G. <an...@mi...> - 2002-01-15 19:39:33
The comment at the end about using 555 buffers is interesting. Maybe that's why you are using GDI, not DDRAW? DDRAW blits do -not- do any pixel format conversion - specifically because those conversions are generally slow, and it would be more optimal to convert your source data.

I have used GDI in the past to do 555-to-primary blits with very few issues; it's very fast, especially when stretching is involved. However, I know of very few displays that actually have a desktop running in 555 - almost everything these days has a 565 desktop (in 16-bit mode). Maybe the difference between the Mac and the PC is that the Mac has a 555 desktop? Maybe the PC is being forced to do an expensive conversion from 555 to 565.

I bet you would get better performance if you wrote your own blitter using MMX code to go from 555->565 and 555->8888, blit to an off-screen buffer, and then flip/blit that. If the primary is not 555, 565 or 8888, then just let GDI do it. There are some advantages to 555, like true gray scales and an easier single code path - but these are very, very slight.

Andy Glaister.
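A plain-C version of the 555->565 conversion described above comes down to widening the green channel from 5 bits to 6; the MMX blitter suggested here would do the same masks and shifts on four pixels at a time. The function name and buffer layout below are illustrative only.

    /* Convert one run of x555 pixels to 565. Red and blue keep 5 bits;
     * green is widened to 6 bits by replicating its top bit. */
    void Convert555To565(unsigned short *dst, const unsigned short *src, int pixels)
    {
        int i;
        for (i = 0; i < pixels; ++i) {
            unsigned short p  = src[i];
            unsigned short r  = (p >> 10) & 0x1F;
            unsigned short g5 = (p >> 5)  & 0x1F;
            unsigned short b  =  p        & 0x1F;
            unsigned short g6 = (unsigned short)((g5 << 1) | (g5 >> 4));
            dst[i] = (unsigned short)((r << 11) | (g6 << 5) | b);
        }
    }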
From: Brian H. <bri...@py...> - 2002-01-15 19:57:03
> The comment at the end about using 555 buffers is interesting. Maybe
> that's why you are using GDI, not DDRAW?

Actually, I'm using 555 BECAUSE I'm using GDI, not the other way around. GDI's 16-bit DIB sections are 555.

> I have used GDI in the past to do 555-to-primary blits with very few
> issues; it's very fast, especially when stretching is involved.
> However, I know of very few displays that actually have a desktop
> running in 555 - almost everything these days has a 565 desktop (in
> 16-bit mode).

Correct.

> Maybe the difference between the Mac and the PC is that the Mac has a
> 555 desktop? Maybe the PC is being forced to do an expensive
> conversion from 555 to 565.

It definitely is (as per the XLATEOBJ_hGetColorTransform thread of a while ago), but I don't think that's accounting for all the difference. Both machines have NVidia graphics accelerators, both are running in 16-bit, and I don't think the GF series supports a 555 mode.

Brian
From: Andy G. <an...@mi...> - 2002-01-15 20:15:46
CreateDIBSection supports both 555 and 565 16-bit modes - check out the docs. I would either work in 565 all the time (as most people will have this) or do your own custom color-convert/blt code - you should be able to easily match or beat the speed of GDI, and it's kinda fun MMX code.

Andy.
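For the record, the 565 DIB section mentioned here goes through BI_BITFIELDS rather than BI_RGB. A sketch of what that setup might look like (the function name and the top-down height convention are just illustrative choices):

    /* Create a 16-bit 565 DIB section. With BI_RGB and biBitCount == 16,
     * GDI assumes x555; explicit channel masks via BI_BITFIELDS give 565. */
    #include <windows.h>

    HBITMAP Create565DIBSection(HDC hdc, int width, int height, void **bits)
    {
        struct {
            BITMAPINFOHEADER header;
            DWORD            masks[3];    /* red, green, blue bit masks */
        } bmi = {0};

        bmi.header.biSize        = sizeof(bmi.header);
        bmi.header.biWidth       = width;
        bmi.header.biHeight      = -height;   /* negative height = top-down rows */
        bmi.header.biPlanes      = 1;
        bmi.header.biBitCount    = 16;
        bmi.header.biCompression = BI_BITFIELDS;
        bmi.masks[0] = 0xF800;                /* red:   5 bits */
        bmi.masks[1] = 0x07E0;                /* green: 6 bits */
        bmi.masks[2] = 0x001F;                /* blue:  5 bits */

        return CreateDIBSection(hdc, (BITMAPINFO *)&bmi, DIB_RGB_COLORS,
                                bits, NULL, 0);
    }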
From: Brian H. <bri...@py...> - 2002-01-15 20:33:04
Holy cow, you're right! It isn't mentioned in the CreateDIBSection() docs - I had to do an MSDN search for "555 CreateDIBSection", and even then it only turned up in one sample doc. It's also in the docs for BITMAPINFOHEADER, but that didn't show up in the search.

Thanks for the heads up - that should be worth at least 10-20%.

Brian
From: Jon W. <hp...@mi...> - 2002-01-16 01:54:43
I'm hosting an Internet Explorer control in my application. I would like to expose some DOM properties with functions on them to JScript running in web pages inside that control. I've searched MSDN for a good 45 minutes, but couldn't find anything quite relevant (though lots of near misses).

Has anyone done this, and/or does anyone know which interface name I should start my investigation at? Either "this is the interface you implement to publish one of these guys" or "this is the interface you call to register your such guy" would be fine by me.

Cheers,

/ h+