From: Andrew M. <ak...@os...> - 2005-07-15 10:40:13
Begin forwarded message:

Date: Fri, 15 Jul 2005 12:14:37 +0200
From: Knut Petersen <Knu...@t-...>
To: lin...@vg...
Subject: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

Hi everybody!

There is a serious performance loss between 2.6.12 and 2.6.13-rc3
affecting _all_ framebuffer devices, especially those with fast
bitblit functions.

System:             Via Epia 5000
CPU:                Via Samuel 2, 533MHz
Graphics core:      Cyberblade/i1 (Blade 3D core integrated in 8601A)
Framebuffer driver: Not yet released fully accelerated framebuffer driver cyblafb

Test setup
==========

video mode: 1280x1024, vyres=2662, bpp=8, 8x16 font, ypan scrollmode
kernel 2.6.13-rc3 is compiled with HZ==1000

Measurement 1: Compile framebuffer modules

Result: 2.6.13-rc3 is slightly slower, but this is an almost invisible
performance loss of about 1%

Measurement 2: time cat of a file consisting of 2000 empty lines

Result:
                                          | 2.6.12 / 2.6.13-rc3
------------------------------------------+----------------------
total time                                | 0.182s / 0.220s

Measurement 3: time cat of a file consisting of 2000 full lines of
160 characters each.

Result:
                                          | 2.6.12 / 2.6.13-rc3
------------------------------------------+----------------------
total time                                | 0.853s / 1.062s
time spent in framebuffer bitblit routine | 0.256s / 0.257s
time spent for kernel bitblit overhead    | 0.426s / 0.623s !!!
other time (scrolling, disk io etc)       | 0.171s / 0.182s

Discussion of measurements
==========================

Framebuffer compiling shows that the general kernel performance is
more or less unchanged between 2.6.12 and 2.6.13-rc3.

Cat-ing the file consisting of 2000 empty lines takes about 20.9% more
time, cat-ing the file consisting of 2000 full lines takes about 24%
more time.

As the time spent in the bitblit function of the framebuffer driver does
not change, I assume that the data sent to the framebuffer driver
has not changed. But the kernel routines in front of the driver take
about 46% longer (0.426s -> 0.623s).

All framebuffer drivers should be affected by this performance loss, but
the faster the bitblit of the framebuffer driver in use is, the more it
will affect the general performance. You will not see such a great
difference if e.g. vesafb is used.

Please have a serious look at the changed code of fbcon/fbmem etc. or
switch back to the old routines.

cu,
 Knut
From: Antonino A. D. <ad...@gm...> - 2005-07-22 04:57:55
On Friday 15 July 2005 18:39, Andrew Morton wrote:
> Begin forwarded message:
>
> Date: Fri, 15 Jul 2005 12:14:37 +0200
> From: Knut Petersen <Knu...@t-...>
> To: lin...@vg...
> Subject: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3
>
>
> Hi everybody!
>
> There is a serious performance loss between 2.6.12 and 2.6.13-rc3
> affecting _all_ framebuffer devices, especially those with fast
> bitblit functions.
>

I haven't seen any significant performance penalty between 2.6.12-rc5-mm1
and 2.6.13-rc3-mm1.

Based on your results, I would pinpoint the culprit to be in
video/console/bitblit.c. However, the changes there are minor, and should not
alter the performance.

Tony
From: Andrew M. <ak...@os...> - 2005-07-29 07:19:01
"Antonino A. Daplas" <ad...@gm...> wrote:
>
> On Friday 15 July 2005 18:39, Andrew Morton wrote:
> > Begin forwarded message:
> >
> > Date: Fri, 15 Jul 2005 12:14:37 +0200
> > From: Knut Petersen <Knu...@t-...>
> > To: lin...@vg...
> > Subject: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3
> >
> >
> > Hi everybody!
> >
> > There is a serious performance loss between 2.6.12 and 2.6.13-rc3
> > affecting _all_ framebuffer devices, especially those with fast
> > bitblit functions.
> >
>
> I haven't seen any significant performance penalty between 2.6.12-rc5-mm1
> and 2.6.13-rc3-mm1.
>
> Based on your results, I would pinpoint the culprit to be in
> video/console/bitblit.c. However, the changes there are minor, and should not
> alter the performance.
>

So.. what happened here? Is the problem still present in 2.6.13-rc4?
From: Knut P. <Knu...@t-...> - 2005-07-29 14:52:57
Hi everybody!

>>I haven't seen any significant performance penalty between 2.6.12-rc5-mm1
>>and 2.6.13-rc3-mm1.
>>
>>Based on your results, I would pinpoint the culprit to be in
>>video/console/bitblit.c. However, the changes there are minor, and should not
>>alter the performance.
>>
>
>So.. what happened here? Is the problem still present in 2.6.13-rc4?
>

Yes, the problem still is present in 2.6.13-rc4.
================================================

There is only an insignificant difference of max +/- 2ms between
2.6.13-rc3 and 2.6.13-rc4 for all measurements.

Test 1: reset;time cat scrolltest0
Test 2: reset;time cat scrolltest80
Test 3: reset;time cat scrolltest160

scrolltest0 is a file with 2000 empty lines.
scrolltest80 is a file with 2000 lines of 80 characters each.
scrolltest160 is a file with 2000 lines of 160 characters each.

vesafb tests are made with the original vesafb of the respective
kernel versions, cyblafb tests all use the same source file,
accelerations: fillrect, bitblit, copyarea

The 2.6.13-rc* kernels are compiled for a 1000Hz system timer, as is
also used for 2.6.12.

chipset: trident cyberblade/i1
video mode: vesa 0x307 (1280x1024@75hz)
8x16 font

Nothing but the kernel changed between the tests, the time values
given are system time in seconds.

vga=0x307  |     video=vesafb:ypan      |        video=vesafb
           | test 1   test 2   test 3   | test 1   test 2    test 3
-----------+----------------------------+-----------------------------
2.6.12     | 3.753s   4.825s   5.936s   | 4.258s   65.645s   126.898s
2.6.13-rc4 | 3.937s   5.135s   6.302s   | 4.304s   71.515s   138.674s
           | +4.9%    +6.42%   +6.17%   | +1.08%   +8.94%    +9.28%

vga=0x307  |       video=cyblafb        |    video=cyblafb:noypan
           | test 1   test 2   test 3   | test 1   test 2    test 3
-----------+----------------------------+-----------------------------
2.6.12     | 0.228s   0.549s   0.870s   | 7.692s   8.015s    8.335s
2.6.13-rc4 | 0.235s   0.654s   1.072s   | 7.699s   8.120s    8.549s
           | +3.07%   +19.13%  +23.22%  | +0.09%   +1.31%    +2.57%

The numbers show very clearly that 2.6.13-rc* blitting is much slower
than the blitting of 2.6.12. For cyblafb the time spent on the actual
blitting is about 257ms for test 3, so the actual performance loss for
the pre-driver part is above 30%.

Now for a real world example: reset; time cat patch-2.6.13-rc4

cyblafb, kernel 2.6.12     : 173.013s
cyblafb, kernel 2.6.13-rc4 : 196.181s
difference                 :  23.168s ( +13.4% )

Could anyone take the time to measure performance of some other drivers?
Those using ypan scrolling and hardware accelerated bitblit should be
most affected.

cu,
 Knut
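The "above 30%" figure comes from subtracting the driver's own blit time from the totals before comparing them. A minimal C sketch of that arithmetic, assuming the driver time is the same 0.257s on both kernels (the earlier measurement showed 0.256s vs 0.257s); the helper name `overhead_growth_percent` is made up for illustration:

```c
#include <assert.h>

/* Growth, in percent, of the time spent outside the driver's blit routine.
 * Hypothetical helper for illustration; driver_time is assumed to be
 * identical for both kernels. */
static double overhead_growth_percent(double total_before,
                                      double total_after,
                                      double driver_time)
{
    double before = total_before - driver_time;
    double after  = total_after  - driver_time;

    assert(before > 0.0);
    return (after - before) / before * 100.0;
}
```

For test 3 with cyblafb and ypan, `overhead_growth_percent(0.870, 1.072, 0.257)` evaluates to roughly 33, since (1.072 - 0.257) / (0.870 - 0.257) is about 1.33.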
From: Antonino A. D. <ad...@gm...> - 2005-07-29 15:42:41
Knut Petersen wrote:
> Hi everybody!
>
>>> I haven't seen any significant performance penalty, between
>>> 2.6.12-rc5-mm1
>>> and 2.6.13-rc3-mm1.
>>>
>>> Based on your results, I would pinpoint the culprit to be in
>>> video/console/bitblit.c. However, the changes there are minor, and
>>> should not
>>> alter the peformance.
>>>
>>
>> So.. what happened here? Is the problem still present in 2.6.13-rc4?
>>
> Yes, the problem still is present in 2.6.13-rc4.
> ================================================

Thank you for your persistence. I think I know the culprit. Someone
insisted on using memcpy in fb_pad_aligned_buffer(). I have already
fixed this before, but apparently, the memcpy was brought back. Try
the attached patch and let me know.

Tony

fbdev: Replace memcpy with for-loop when preparing bitmap

Do not use memcpy in fb_pad_aligned_buffer. It is suboptimal because only
a few bytes are moved at a time. Replace with a for-loop.

From: Antonino Daplas <ad...@po...>
Signed-off-by: Antonino Daplas <ad...@po...>
---

 fbmem.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

--- a/drivers/video/fbmem.c
+++ b/drivers/video/fbmem.c
@@ -80,10 +80,12 @@ EXPORT_SYMBOL(fb_get_color_depth);
  */
 void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
 {
-	int i;
+	int i, j;
 
 	for (i = height; i--; ) {
-		memcpy(dst, src, s_pitch);
+		/* s_pitch is a few bytes at the most, memcpy is suboptimal */
+		for (j = 0; j < s_pitch; j++)
+			dst[j] = src[j];
 		src += s_pitch;
 		dst += d_pitch;
 	}
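For anyone who wants to compare the two variants outside the kernel, here is a rough userspace sketch of both row-copy loops. The `u8`/`u32` typedefs, the function names `pad_rows_loop` and `pad_rows_memcpy`, and the `s_pitch <= d_pitch` precondition are assumptions of this sketch, not kernel code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Userspace stand-ins for the kernel's u8/u32 types (assumption). */
typedef uint8_t  u8;
typedef uint32_t u32;

/* Byte-loop variant, mirroring the patch above. */
static void pad_rows_loop(u8 *dst, u32 d_pitch, const u8 *src,
                          u32 s_pitch, u32 height)
{
    u32 i, j;

    assert(s_pitch <= d_pitch);  /* assumed: destination rows are padded */
    for (i = 0; i < height; i++) {
        for (j = 0; j < s_pitch; j++)
            dst[j] = src[j];
        src += s_pitch;
        dst += d_pitch;
    }
}

/* memcpy variant, mirroring the code the patch replaces. */
static void pad_rows_memcpy(u8 *dst, u32 d_pitch, const u8 *src,
                            u32 s_pitch, u32 height)
{
    u32 i;

    assert(s_pitch <= d_pitch);
    for (i = 0; i < height; i++) {
        memcpy(dst, src, s_pitch);
        src += s_pitch;
        dst += d_pitch;
    }
}
```

Both functions write the same bytes; only the per-row copy mechanism differs, which is the part the thread is measuring. Timing them in userspace says nothing definite about the kernel's memcpy, so treat any such numbers as a hint only.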
From: James S. <jsi...@in...> - 2005-07-29 19:51:46
> Thank you for your persistence. I think I know the culprit. Someone
> insisted on using memcpy in fb_pad_aligned_buffer(). I have already
> fixed this before, but apparently, the memcpy was brought back. Try
> the attached patch and let me know.

Yipes, I did that. The memcpy function is supposed to be optimized for the
platform. See string.h in the include/asm directory. I have seen, for example,
the Athlon use the 3DNow instruction set to copy data. Something
is really wrong with memcpy if moving byte by byte is faster!
A lot of drivers use memcpy. If memcpy sucks, then drivers should be copying
byte by byte. The question I have is whether this is the case for non-Intel
platforms as well. Could someone run the numbers on other platforms?
From: Jon S. <jon...@gm...> - 2005-07-29 20:21:44
On 7/29/05, James Simmons <jsi...@in...> wrote:
>
> > Thank you for your persistence. I think I know the culprit. Someone
> > insisted on using memcpy in fb_pad_aligned_buffer(). I have already
> > fixed this before, but apparently, the memcpy was brought back. Try
> > the attached patch and let me know.
>
> Yipes, I did that. The memcpy function is suppose to be optimized for the
> platform. See string.h in the include/asm directory. I seen for example
> the Athlon would use the 3DNow instruction set to copy data. Something
> is really wrong with memcpy if moving byte by byte is faster !!!!
> Alot of drivers use memcpy. If memcpy sucks then drivers should be copying
> byte by byte then. The question I have is this the case for non intel
> platforms as well. Could someone run the numbers on other platforms?

memmove/memcpy is faster. memcpy is faster than memmove, so use it if
you can. But there is a lower limit, probably around 16 bytes or so,
below which the loop becomes faster. So if you know that you will always
be copying small fragments, use the loop. The compiler can't decide
between loop/memcpy for you since it doesn't know the upper limit on
the length; it is forced to use memcpy since you told it so.

For small things it is even better to use a structure assignment if
possible. That lets the compiler decide to do a loop or memcpy since
the length is known.

In this case, if we could figure out how to give the compiler an upper
bound on the loop, it might decide to unroll it and use multiple moves.

--
Jon Smirl
jon...@gm...
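One way to hand the compiler a known length, in the spirit of the suggestion above, is to branch on the small sizes and copy with a compile-time constant length in each branch. This is only a sketch: the helper name `copy_row_small` is hypothetical, and the 1..4 byte range is an assumption based on the font widths discussed in this thread. Whether a given compiler actually turns each case into one or two moves depends on the compiler and its options.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy one font row of 1..4 bytes.
 * Every memcpy below has a constant length, so an optimizing compiler
 * is free to expand it inline instead of emitting a function call. */
static void copy_row_small(unsigned char *dst, const unsigned char *src,
                           size_t n)
{
    assert(n >= 1 && n <= 4);  /* assumed upper bound for this sketch */
    switch (n) {
    case 1: dst[0] = src[0];     break;
    case 2: memcpy(dst, src, 2); break;
    case 3: memcpy(dst, src, 3); break;
    case 4: memcpy(dst, src, 4); break;
    }
}
```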
From: Luca <kr...@pe...> - 2005-07-29 22:45:32
On Fri, Jul 29, 2005 at 08:51:34PM +0100, James Simmons wrote:
> > Thank you for your persistence. I think I know the culprit. Someone
> > insisted on using memcpy in fb_pad_aligned_buffer(). I have already
> > fixed this before, but apparently, the memcpy was brought back. Try
> > the attached patch and let me know.
>
> Yipes, I did that. The memcpy function is suppose to be optimized for the
> platform. See string.h in the include/asm directory. I seen for example
> the Athlon would use the 3DNow instruction set to copy data. Something
> is really wrong with memcpy if moving byte by byte is faster !!!!

For small copies MMX/3DNow are not used at all. In the current kernel the
MMX/3DNow memcpy is used only when the data size is greater than 512 bytes.
Remember that MMX/3DNow uses the FPU, so the kernel must save/restore its
state, and this overhead would make the copy slow for small chunks.

Luca
--
Home: http://kronoz.cjb.net
If a man's destiny is to drown, he will drown even in a glass of water.
Yiddish proverb
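The trade-off described here, a fixed setup cost that only pays off above some size, can be sketched as a simple threshold dispatch. This is not the kernel's implementation: the name `copy_with_threshold` is hypothetical, the 512-byte figure is simply the one quoted in the message, and plain `memcpy` stands in for a routine with a fixed setup cost such as an FPU state save/restore.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BULK_COPY_THRESHOLD 512  /* figure quoted in the message above */

/* Returns 1 if the "bulk" path was taken, 0 for the small-copy path. */
static int copy_with_threshold(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    assert(dst != NULL && src != NULL);
    if (n > BULK_COPY_THRESHOLD) {
        /* Stand-in for a routine with a fixed setup cost. */
        memcpy(dst, src, n);
        return 1;
    }
    /* Small copy: no setup cost, just move the bytes. */
    for (i = 0; i < n; i++)
        d[i] = s[i];
    return 0;
}
```

A font row of 1 to 4 bytes always lands on the small path, which is why the wide-register copy never helps in this case.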
From: Knut P. <Knu...@t-...> - 2005-07-29 20:09:00
Hi Tony,

> Thank you for your persistence. I think I know the culprit. Someone
> insisted on using memcpy in fb_pad_aligned_buffer(). I have already
> fixed this before, but apparently, the memcpy was brought back. Try
> the attached patch and let me know.
>
> Tony

Replacing memcpy() with this inline code helps. Performance is slightly
slower than it was in 2.6.12, but this is hardly measurable and could be
caused by other changes in the kernel.

The most affected test (test 3, cyblafb, ypan) is now about 7ms slower
than it was in 2.6.12. Without your patch the performance penalty was
202ms!

Yes, please send the patch to Linus asap, it's a must for 2.6.13.

Someone should look at memcpy ;-)

cu,
 Knut
From: Andrew M. <ak...@os...> - 2005-07-29 19:04:00
"Antonino A. Daplas" <ad...@gm...> wrote:
>
> fbdev: Replace memcpy with for-loop when preparing bitmap

Whee, progress. Please let me know if/when you want this sent to Linus.
From: James S. <jsi...@in...> - 2005-07-29 19:53:35
> "Antonino A. Daplas" <ad...@gm...> wrote:
> >
> > fbdev: Replace memcpy with for-loop when preparing bitmap
>
> Whee, progress. Please let me know if/when you want this sent to Linus.

Before you do, I'd like to know why memcpy is slower than byte-by-byte
copying. This just seems to be wrong!
From: James S. <jsi...@in...> - 2005-07-29 19:59:37
Can you do some performance measurements with this patch instead? I have a
theory. I bet that because we didn't include the Linux version of string.h,
we are using the glibc version instead, which is slower. In fact, I bet that
with this include memcpy will be faster than the byte-by-byte copy. Give it
a try.

--- /usr/src/linus-2.6/drivers/video/fbmem.c	2005-07-28 10:24:11.000000000 -0700
+++ fbmem.c	2005-07-29 12:53:30.000000000 -0700
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 #include <linux/types.h>
+#include <linux/string.h>
 #include <linux/errno.h>
 #include <linux/sched.h>
 #include <linux/smp_lock.h>
From: Antonino A. D. <ad...@gm...> - 2005-07-29 22:45:18
Jon Smirl wrote:
> On 7/29/05, James Simmons <jsi...@in...> wrote:
>>> Thank you for your persistence. I think I know the culprit. Someone
>>> insisted on using memcpy in fb_pad_aligned_buffer(). I have already
>>> fixed this before, but apparently, the memcpy was brought back. Try
>>> the attached patch and let me know.
>> Yipes, I did that. The memcpy function is suppose to be optimized for the
>> platform. See string.h in the include/asm directory. I seen for example
>> the Athlon would use the 3DNow instruction set to copy data. Something
>> is really wrong with memcpy if moving byte by byte is faster !!!!
>> Alot of drivers use memcpy. If memcpy sucks then drivers should be copying
>> byte by byte then. The question I have is this the case for non intel
>> platforms as well. Could someone run the numbers on other platforms?
>
> memmove/memcpy is faster. memcpy is faster than memmove so use it if
> you can. But, there is a lower limit probably around 16 bytes or so
> where the loop becomes faster. So if you know that you will always be
> copying small fragments use the loop. The compiler can't decide

Yes, the loop copies each row of a font character. For an 8x16 font
that's 1 byte. The maximum font width is 32. A 12x22 font does not pass
through this function because the width is not a multiple of 8. So,
currently, it's used mostly for 8x16 fonts.

I already know people using 16x30 fonts. There are probably others bigger
than that.

Of course, we can always use Duff's version to loop-unroll that particular
section, but even at 4 bytes, I don't know if it's worth the effort. Does
anyone know people using 32-wide fonts?

Tony
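To make the sizes concrete, here is a small sketch of the row-size arithmetic described above, followed by a Duff's-device style copy of the kind mentioned. Both helpers (`font_row_bytes`, `duff_copy`) are hypothetical names for illustration and are not taken from the kernel; the unroll factor of 4 is an arbitrary choice matching the 4-byte maximum row.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per glyph row for widths that are a multiple of 8 and at most 32;
 * returns 0 for other widths (e.g. 12), which per the message above do
 * not go through the aligned-buffer path. */
static size_t font_row_bytes(unsigned int width)
{
    if (width == 0 || width > 32 || (width % 8) != 0)
        return 0;
    return width / 8;
}

/* Duff's device: a byte copy unrolled by 4, jumping into the middle of
 * the loop body to handle the remainder first. */
static void duff_copy(unsigned char *dst, const unsigned char *src,
                      size_t count)
{
    size_t n;

    assert(dst != NULL && src != NULL);
    if (count == 0)
        return;
    n = (count + 3) / 4;  /* number of passes through the loop */
    switch (count % 4) {
    case 0: do { *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--n > 0);
    }
}
```

With `font_row_bytes(8) == 1` and `font_row_bytes(32) == 4`, the copy never exceeds a single pass through the unrolled body, which supports the doubt about whether the extra complexity pays off.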
From: James S. <jsi...@in...> - 2005-08-03 17:30:12
> Yes, the loop copies each row of a font character. For an 8x16 font
> that's 1 byte. The maximum fontwidth is 32. A 12x22 font does not pass
> through this function because the width is not a multiple of 8. So,
> currently, it's used mostly for 8x16 fonts.
>
> I already know people using 16x30 fonts. There are probably others bigger
> than that.
>
> Of course, we can always use Duff's version to loop-unroll that particular
> section, but even at 4 bytes, I don't know if it's worth the effort. Anyone
> knows people using 32 wide fonts?

The console system supports up to 32 pixel wide fonts. Even at that maximum
size we only copy 4 bytes of data at a time. Unrolling the loop is right.