From: Brian J. J. <bjj...@ya...> - 2001-03-01 22:12:57
Folks,

While looking over the window-mode VOSF screen update code, it struck me that maintaining the_buffer_copy and calculating the minimal rectangle to update is a rather expensive operation: it involves scanning many kilobytes of memory on every screen update, from both the_buffer and the_buffer_copy. If we're going to scan the entire page anyway, why not just blit the whole thing to the screen (after all, DGA and XShm blits are essentially memory copy operations) and bypass all the overhead of maintaining the_buffer_copy?

So I ran some experiments.

The machine: my dual-processor SGI Octane, IRIX 6.5.11. BII compiled with SGI's MIPSPro compilers with "-Ofast" optimization (massive interprocedural analysis), output to the local display (DISPLAY set to ":0", which should allow XShm).

BII version: CVS current as of early 2001 (the diff below is against video_vosf.h rev. 1.14).

BII video mode: 30 Hz, 800x600, window video (DGA doesn't work on IRIX. I should really write an OpenGL BII video driver, since OpenGL is the fast path to the screen under IRIX. One of these decades....) Bit depths as described below.

The test: boot MacOS 7.6.1, start Speedometer, run 3 iterations of the video test, save, quit, and shut down. Speedometer doesn't do 24bpp, so the test for that depth was: boot 7.6.1, start the game Continuum (a real video hog), and play one level, running into the far wall.

I applied the following patch, which bypasses the the_buffer_copy code by assuming that all bytes in pages flagged by VOSF are modified:

--- video_vosf.h	2001/02/10 15:29:01	1.14
+++ video_vosf.h	2001/03/01 03:17:34
@@ -234,6 +234,7 @@
 	int x1, x2, width;
 	if (depth == 1) {
+#ifdef MINIMAL_VOSF_REDRAWS
 		x1 = VideoMonitor.x - 1;
 		for (j = y1; j <= y2; j++) {
 			uint8 * const p1 = &the_buffer[j * bytes_per_row];
@@ -245,7 +246,11 @@
 			}
 		}
 	}
+#else
+		x1 = 0;
+#endif
+#ifdef MINIMAL_VOSF_REDRAWS
 		x2 = x1;
 		for (j = y2; j >= y1; j--) {
 			uint8 * const p1 = &the_buffer[j * bytes_per_row];
@@ -257,18 +262,24 @@
 			}
 		}
 	}
+#else
+		x2 = (((VideoMonitor.x>>3) - 1) << 3) + 7;
+#endif
 		width = x2 - x1 + 1;

 		// Update the_host_buffer and copy of the_buffer
 		i = y1 * bytes_per_row + (x1 >> 3);
 		for (j = y1; j <= y2; j++) {
 			Screen_blit(the_host_buffer + i, the_buffer + i, width >> 3);
+#ifdef MINIMAL_VOSF_REDRAWS
 			memcpy(the_buffer_copy + i, the_buffer + i, width >> 3);
+#endif
 			i += bytes_per_row;
 		}
 	}
 	else {
+#ifdef MINIMAL_VOSF_REDRAWS
 		x1 = VideoMonitor.x * bytes_per_pixel - 1;
 		for (j = y1; j <= y2; j++) {
 			uint8 * const p1 = &the_buffer[j * bytes_per_row];
@@ -280,8 +291,12 @@
 			}
 		}
 	}
+#else
+		x1 = 0;
+#endif
 		x1 /= bytes_per_pixel;
+#ifdef MINIMAL_VOSF_REDRAWS
 		x2 = x1 * bytes_per_pixel;
 		for (j = y2; j >= y1; j--) {
 			uint8 * const p1 = &the_buffer[j * bytes_per_row];
@@ -293,6 +308,9 @@
 			}
 		}
 	}
+#else
+		x2 = VideoMonitor.x * bytes_per_pixel - 1;
+#endif
 		x2 /= bytes_per_pixel;
 		width = x2 - x1 + 1;
@@ -300,7 +318,9 @@
 		i = y1 * bytes_per_row + x1 * bytes_per_pixel;
 		for (j = y1; j <= y2; j++) {
 			Screen_blit(the_host_buffer + i, the_buffer + i, bytes_per_pixel * width);
+#ifdef MINIMAL_VOSF_REDRAWS
 			memcpy(the_buffer_copy + i, the_buffer + i, bytes_per_pixel * width);
+#endif
 			i += bytes_per_row;
 		}
 	}

I used SGI's Speedshop profiler to measure the time the video update thread spent in the routine update_display_window_vosf and its children. (I began the Speedshop experiment with the MacOS boot, after dismissing the BII configuration GUI.)
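For anyone who'd rather not trace the #ifdefs, the net effect of the patch with MINIMAL_VOSF_REDRAWS turned off boils down to the following. This is a sketch only, using the buffer names from video_vosf.h and showing just the depth > 1 path; the depth == 1 path is analogous but works in bits:

// Sketch: with minimal redraws disabled, update the full width of
// every row in the dirty range [y1, y2].  No left/right dirty-column
// scan, and no memcpy into the_buffer_copy afterwards.
const int x1 = 0;
const int x2 = VideoMonitor.x - 1;	// rightmost pixel: full screen width
const int width = x2 - x1 + 1;

int i = y1 * bytes_per_row;		// x1 == 0, so no column offset
for (int j = y1; j <= y2; j++) {
	// One sequential copy/convert per row, straight to the host buffer
	Screen_blit(the_host_buffer + i, the_buffer + i, bytes_per_pixel * width);
	i += bytes_per_row;
}

In other words, the per-update work becomes one sequential copy per dirty row, which is roughly the same memory traffic the dirty-rectangle scan was already paying just to find the bounds.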
In the tables below, "perf" is the result of 3 iterations of Speedometer's video test (higher is better), %self is the percent of the video update thread's time spent in update_display_window_vosf itself, and %incl is the percent of its time spent in update_display_window_vosf and its children. On all runs, the great majority of the update thread's time was spent in nanosleep(), as it should be. I can send full profile output if anyone's interested.

Command line:

setenv _SPEEDSHOP_INIT_DEFERRED_SIG 17
setenv _SPEEDSHOP_DEBUG_NO_SIG_TRAPS
ssrun -totaltime ./BasiliskII

Vanilla BasiliskII (MINIMAL_VOSF_REDRAWS turned on):

        perf   %incl   %self
 8bpp   .46    11.8     8.4    m48592 (288 samples)
15bpp   .40    13.0    12.1    m48442 (745 samples; more time blitting)
15bpp   .45     5.5     5.0    m50604 (289 samples; 20 sec. shorter)
24bpp    -     29.9    28.0    m49782 (1030 samples)

With MINIMAL_VOSF_REDRAWS turned off:

        perf   %incl   %self
 8bpp   .52    12.4     0.0    m48456 (only 34 samples -- not much time spent in the video code!)
15bpp   .53     4.1     0.0    m49772 (145 samples)
24bpp    -      3.2     0.0    m51065 (84 samples)

So at 8bpp, Speedometer measured a performance increase of 13%, and at 15bpp an increase of 33% or 18%, depending on the run.

At all settings I got occasional freezes, although they seemed less prevalent with the modified code and at the lower bit depths. They probably affected the measurements, e.g. the difference between the two 15bpp vanilla runs. (The hangs look like an IRIX pthreads bug: the video thread gets stuck in nanosleep(). SGI's internal bug database suggests that there's an unwholesome interaction among pthreads, signals, and nanosleep, and that pthreads_sv_timedwait can be used instead of nanosleep as a workaround. I'll have to give that a try.)

The subjective performance increase was very noticeable as well. With my modifications, BII "felt" like a real Mac. Continuum was actually playable in 8bpp mode with MINIMAL_VOSF_REDRAWS turned off, unlike in any other mode! (It was still slightly sluggish, but then, it's slightly sluggish on a real '030 Mac in anything but black-and-white mode. Which raises another question: why isn't one-bit "color" supported in Quadra, as opposed to Classic, mode?)

In addition, I tried adding an XSync() call at the end of every video frame, to make sure that the X server stays caught up with all the data BII is throwing at it. This made a tremendous improvement in a game I ported to an HP Bobcat (68020!) workstation ages ago, so I thought it might help BII:

diff -u -r1.36 video_x.cpp
--- video_x.cpp	2001/01/28 14:05:19	1.36
+++ video_x.cpp	2001/03/01 03:17:34
@@ -2053,6 +2053,9 @@
 			LOCK_VOSF;
 			update_display_window_vosf();
 			UNLOCK_VOSF;
+#ifndef NO_FRAME_SYNC
+			XSync(x_display, false);	// Let the server catch up
+#endif
 		}
 	}
 }

This patch noticeably improved the smoothness of the video, and it also seemed to reduce the hangs I was seeing. In fact, the _only_ way I've been able to run BII successfully on my SGI O2 workstation (as opposed to the Octane) is with the XSync call and with DISPLAY set to localhost:0 (which presumably defeats XShm). And it runs quite well in that configuration.

I'd imagine that minimal updates (the existing BII code) would give better performance on non-local displays, where the cost of shipping the pixmaps to the server is much greater, so perhaps MINIMAL_VOSF_REDRAWS should be a prefs item instead of a compile-time option. Ideas?
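If a runtime switch sounds like the right direction, something along these lines would do it. This is a sketch only, assuming a hypothetical "minimalredraws" prefs key read through BII's existing PrefsFindBool() API (a default would need registering alongside the other prefs defaults), with the #ifdef blocks in update_display_window_vosf() turned into an ordinary branch:

#include "prefs.h"

// Hypothetical prefs key -- not in current BII.  Read once at video
// init rather than once per frame.
static bool minimal_vosf_redraws;

void video_vosf_read_prefs(void)
{
	minimal_vosf_redraws = PrefsFindBool("minimalredraws");
}

// Then, inside update_display_window_vosf(), in place of the #ifdefs
// (depth > 1 case shown):
if (minimal_vosf_redraws) {
	// ... existing left/right dirty-column scans, plus the memcpy
	// into the_buffer_copy after each Screen_blit ...
} else {
	x1 = 0;
	x2 = VideoMonitor.x * bytes_per_pixel - 1;	// full width, in bytes
}

That way remote-display users could keep the minimal-rectangle scans, where shipping fewer pixels to the server is the dominant cost, while local XShm/DGA users get the cheap full-row blits.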
I'd be very interested in hearing how these patches affect BII performance on other platforms, especially those with DGA (similar hacks would need to be made to update_display_dga_vosf(), of course).

Thanks,

=====
Brian J. Johnson