Thread: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

[Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: Andrew M. <ak...@os...> - 2005-07-15 10:40:13

Begin forwarded message:

Date: Fri, 15 Jul 2005 12:14:37 +0200
From: Knut Petersen <Knu...@t-...>
To: lin...@vg...
Subject: framebuffer blitting performance loss  2.6.12 -> 2.6.13-rc3

Hi everybody!

There is a serious performance loss between 2.6.12 and 2.6.13-rc3
affecting _all_ framebuffer devices, especially those with fast
bitblit functions.

System: Via Epia 5000
CPU: Via Samuel 2, 533MHz
Graphics core: Cyberblade/i1 (Blade 3D core integrated in 8601A)
Framebuffer driver: Not yet released fully accelerated framebuffer
                    driver cyblafb

Test setup
==========

video mode: 1280x1024, vyres=2662, bpp=8, 8x16 font, ypan scrollmode
kernel 2.6.13-rc3 is compiled with HZ==1000

Measurement 1: Compile framebuffer modules
       Result: 2.6.13-rc3 is slightly slower, but this is an almost
               invisible performance loss of about 1%

Measurement 2: time cat of file consisting of 2000 empty lines
       Result:
                                          |  2.6.12 / 2.6.13-rc3
------------------------------------------+----------------------
total time                                |  0.182s / 0.220s

Measurement 3: time cat of file consisting of 2000 full lines of
               160 characters each
       Result:
                                          |  2.6.12 / 2.6.13-rc3
------------------------------------------+----------------------
total time                                |  0.853s / 1.062s
time spent in framebuffer bitblit routine |  0.256s / 0.257s
time spent for kernel bitblit overhead    |  0.426s / 0.623s !!!
other time (scrolling, disk io etc)       |  0.171s / 0.182s

Discussion of measurements
==========================

Framebuffer compiling shows that the general kernel performance is
more or less unchanged between 2.6.12 and 2.6.13-rc3.

Cat-ing of the file consisting of 2000 empty lines takes about 20.9%
more time, cat-ing of the file consisting of 2000 full lines takes about
24% more time.

As the time spent in the bitblit function of the framebuffer driver
does not change, I assume that the data sent to the framebuffer
driver has not changed. But the kernel routines in front of the driver
take about 46% longer (0.426s -> 0.623s).

All framebuffer drivers should be affected by this performance loss,
but the faster the bitblit of the used framebuffer driver is, the
more it will affect the general performance. You will not see such
a great difference if e.g. vesafb is used.

Please have a serious look at the changed code of fbcon/fbmem etc
or switch back to the old routines.

cu,
 Knut
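
The overhead row in Measurement 3 appears to be the total time minus the driver blit time minus the "other" time. A minimal C sketch of that arithmetic follows. The names blit_timing, kernel_overhead and percent_slower are hypothetical helpers for illustration, not anything from the original post or the kernel.

```c
/* Timings in seconds, as reported in Measurement 3. */
struct blit_timing {
    double total;   /* total time of the cat run */
    double driver;  /* time spent in the driver's bitblit routine */
    double other;   /* scrolling, disk io etc. */
};

/* Assumption: overhead is whatever remains after driver and other time. */
static double kernel_overhead(const struct blit_timing *t)
{
    return t->total - t->driver - t->other;
}

/* Relative growth of 'after' over 'before', in percent. */
static double percent_slower(double before, double after)
{
    return (after - before) / before * 100.0;
}
```

Fed with 0.853/0.256/0.171 for 2.6.12 and 1.062/0.257/0.182 for 2.6.13-rc3, this reproduces the 0.426s and 0.623s overhead figures and a growth of about 46%.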

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: Antonino A. D. <ad...@gm...> - 2005-07-22 04:57:55

On Friday 15 July 2005 18:39, Andrew Morton wrote:
> Begin forwarded message:
>
> Date: Fri, 15 Jul 2005 12:14:37 +0200
> From: Knut Petersen <Knu...@t-...>
> To: lin...@vg...
> Subject: framebuffer blitting performance loss  2.6.12 -> 2.6.13-rc3
>
>
> Hi everybody!
>
> There is a serious performance loss between 2.6.12 and 2.6.13-rc3
> affecting _all_ framebuffer devices, especially those with fast
> bitblit functions.
>

I haven't seen any significant performance penalty between 2.6.12-rc5-mm1
and 2.6.13-rc3-mm1.

Based on your results, I would pinpoint the culprit to be in
video/console/bitblit.c.  However, the changes there are minor and should not
alter the performance.

Tony

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: Andrew M. <ak...@os...> - 2005-07-29 07:19:01

"Antonino A. Daplas" <ad...@gm...> wrote:
>
> On Friday 15 July 2005 18:39, Andrew Morton wrote:
> > Begin forwarded message:
> >
> > Date: Fri, 15 Jul 2005 12:14:37 +0200
> > From: Knut Petersen <Knu...@t-...>
> > To: lin...@vg...
> > Subject: framebuffer blitting performance loss  2.6.12 -> 2.6.13-rc3
> >
> >
> > Hi everybody!
> >
> > There is a serious performance loss between 2.6.12 and 2.6.13-rc3
> > affecting _all_ framebuffer devices, especially those with fast
> > bitblit functions.
> >
> 
> I haven't seen any significant performance penalty between 2.6.12-rc5-mm1
> and 2.6.13-rc3-mm1.
> 
> Based on your results, I would pinpoint the culprit to be in
> video/console/bitblit.c.  However, the changes there are minor and should not
> alter the performance.
> 

So.. what happened here?  Is the problem still present in 2.6.13-rc4?

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: Knut P. <Knu...@t-...> - 2005-07-29 14:52:57

Hi everybody!

>>I haven't seen any significant performance penalty between 2.6.12-rc5-mm1
>>and 2.6.13-rc3-mm1.
>>
>>Based on your results, I would pinpoint the culprit to be in
>>video/console/bitblit.c.  However, the changes there are minor and should not
>>alter the performance.
>>
>>
>
>So.. what happened here?  Is the problem still present in 2.6.13-rc4?
>
>
Yes, the problem still is present in 2.6.13-rc4.
================================================

There is only an insignificant difference of max +/- 2ms between
2.6.13-rc3 and 2.6.13-rc4 for all measurements.

Test 1:   reset;time cat scrolltest0
Test 2:   reset;time cat scrolltest80
Test 3:   reset;time cat scrolltest160

scrolltest0 is a file with 2000 empty lines.
scrolltest80 is a file with 2000 lines of 80 characters each.
scrolltest160 is a file with 2000 lines of 160 characters each.
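
Test files of this shape could be generated with something like the following C sketch. The helper name write_scrolltest and the filler character 'x' are assumptions; the post does not say which characters the lines contain.

```c
#include <stdio.h>

/*
 * Write 'lines' lines of 'width' filler characters each to 'path'.
 * width == 0 produces empty lines. Returns 0 on success, -1 on error.
 */
static int write_scrolltest(const char *path, int lines, int width)
{
    FILE *f = fopen(path, "w");
    int i, j;

    if (!f)
        return -1;
    for (i = 0; i < lines; i++) {
        for (j = 0; j < width; j++) {
            if (fputc('x', f) == EOF) {   /* 'x' is an arbitrary filler */
                fclose(f);
                return -1;
            }
        }
        if (fputc('\n', f) == EOF) {
            fclose(f);
            return -1;
        }
    }
    return fclose(f) == 0 ? 0 : -1;
}
```

For example, write_scrolltest("scrolltest80", 2000, 80) would produce a file matching the description of scrolltest80 above.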

vesafb tests are made with the original vesafb of the respective kernel
versions; cyblafb tests all use the same source file, with these
accelerations: fillrect, bitblit, copyarea.
2.6.13-rc* are compiled for a 1000Hz system timer, as is also used for
2.6.12.

chipset: trident cyberblade/i1
video mode: vesa 0x307 (1280x1024@75hz)
8x16 font

Nothing but the kernel changed between the tests,
the time values given are system time in seconds.


 vga=0x307 | test 1   test 2   test 3  | test 1   test 2    test 3 
           |    video=vesafb:ypan      |       video=vesafb
-----------+---------------------------+---------------------------
2.6.12     | 3.753s   4.825s   5.936s  | 4.258s  65.645s  126.898s
2.6.13-rc4 | 3.937s   5.135s   6.302s  | 4.304s  71.515s  138.674s
           |  +4.9%   +6.42%   +6.17%  | +1.08%   +8.94%    +9.28%


 vga=0x307 | test 1   test 2   test 3  | test 1   test 2    test 3 
           |    video=cyblafb          |   video=cyblafb:noypan
-----------+---------------------------+---------------------------
2.6.12     | 0.228s   0.549s   0.870s  | 7.692s   8.015s    8.335s
2.6.13-rc4 | 0.235s   0.654s   1.072s  | 7.699s   8.120s    8.549s
           | +3.07%  +19.13%  +23.22%  | +0.09%   +1.31%    +2.57%


The numbers show very clearly that 2.6.13-rc* blitting is much slower than
the blitting of 2.6.12. For cyblafb the time spent on the actual blitting
is about 257ms for test 3, so the actual performance loss for the
pre-driver part is above 30%.

Now for a real world example:

      reset; time cat patch-2.6.13-rc4

cyblafb, kernel 2.6.12     : 173.013s
cyblafb, kernel 2.6.13-rc4 : 196.181s
   difference              :  23.168s ( +13.4% )

Could anyone take the time to measure performance of some other drivers?
Those using ypan scrolling and hardware accelerated bitblit should be
most affected.


cu,
 Knut

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: Antonino A. D. <ad...@gm...> - 2005-07-29 15:42:41

Knut Petersen wrote:
> Hi everybody!
> 
>>> I haven't seen any significant performance penalty between
>>> 2.6.12-rc5-mm1 and 2.6.13-rc3-mm1.
>>>
>>> Based on your results, I would pinpoint the culprit to be in
>>> video/console/bitblit.c.  However, the changes there are minor and
>>> should not alter the performance.
>>>
>>>
>>
>> So.. what happened here?  Is the problem still present in 2.6.13-rc4?
>>
>>
> Yes, the problem still is present in 2.6.13-rc4.
> ================================================

Thank you for your persistence.  I think I know the culprit.  Someone
insisted on using memcpy in fb_pad_aligned_buffer().  I have already
fixed this before, but apparently, the memcpy was brought back.  Try
the attached patch and let me know.

Tony

   fbdev: Replace memcpy with for-loop when preparing bitmap

    Do not use memcpy in fb_pad_aligned_buffer. It is suboptimal because only
    a few bytes are moved at a time. Replace with a for-loop.

    From: Antonino Daplas <ad...@po...>
    Signed-off-by: Antonino Daplas <ad...@po...>
---

 fbmem.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

--- a/drivers/video/fbmem.c
+++ b/drivers/video/fbmem.c
@@ -80,10 +80,12 @@ EXPORT_SYMBOL(fb_get_color_depth);
  */
 void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
 {
-	int i;
+	int i, j;
 
 	for (i = height; i--; ) {
-		memcpy(dst, src, s_pitch);
+		/* s_pitch is a few bytes at the most, memcpy is suboptimal */
+		for (j = 0; j < s_pitch; j++)
+			dst[j] = src[j];
 		src += s_pitch;
 		dst += d_pitch;
 	}
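
To try the two variants outside the kernel, a userspace sketch along these lines may be enough. The function names pad_aligned_memcpy and pad_aligned_loop and the u8/u32 typedefs are assumptions made for this illustration; only the loop bodies mirror the patch above.

```c
#include <stdint.h>
#include <string.h>

typedef uint8_t u8;    /* userspace stand-ins for the kernel types */
typedef uint32_t u32;

/* Variant before the patch: one memcpy call per glyph row. */
static void pad_aligned_memcpy(u8 *dst, u32 d_pitch, const u8 *src,
                               u32 s_pitch, u32 height)
{
    u32 i;

    for (i = 0; i < height; i++) {
        memcpy(dst, src, s_pitch);
        src += s_pitch;
        dst += d_pitch;
    }
}

/* Variant after the patch: plain byte loop per glyph row. */
static void pad_aligned_loop(u8 *dst, u32 d_pitch, const u8 *src,
                             u32 s_pitch, u32 height)
{
    u32 i, j;

    for (i = 0; i < height; i++) {
        for (j = 0; j < s_pitch; j++)
            dst[j] = src[j];
        src += s_pitch;
        dst += d_pitch;
    }
}
```

Both variants should produce identical destination buffers; only the per-row call overhead differs, which is what matters when s_pitch is 1 to 4 bytes.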

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: James S. <jsi...@in...> - 2005-07-29 19:51:46

> Thank you for your persistence.  I think I know the culprit.  Someone
> insisted on using memcpy in fb_pad_aligned_buffer().  I have already
> fixed this before, but apparently, the memcpy was brought back.  Try
> the attached patch and let me know.

Yipes, I did that. The memcpy function is supposed to be optimized for the
platform. See string.h in the include/asm directory. I have seen, for
example, that the Athlon would use the 3DNow instruction set to copy data.
Something is really wrong with memcpy if moving byte by byte is faster!
A lot of drivers use memcpy. If memcpy sucks then drivers should be copying
byte by byte. The question I have is whether this is the case for non-Intel
platforms as well. Could someone run the numbers on other platforms?

> Tony
> 
>    fbdev: Replace memcpy with for-loop when preparing bitmap
> 
>     Do not use memcpy in fb_pad_aligned_buffer. It is suboptimal because only
>     a few bytes are moved at a time. Replace with a for-loop.
> 
>     From: Antonino Daplas <ad...@po...>
>     Signed-off-by: Antonino Daplas <ad...@po...>
> ---
> 
>  fbmem.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> --- a/drivers/video/fbmem.c
> +++ b/drivers/video/fbmem.c
> @@ -80,10 +80,12 @@ EXPORT_SYMBOL(fb_get_color_depth);
>   */
>  void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
>  {
> -	int i;
> +	int i, j;
>  
>  	for (i = height; i--; ) {
> -		memcpy(dst, src, s_pitch);
> +		/* s_pitch is a few bytes at the most, memcpy is suboptimal */
> +		for (j = 0; j < s_pitch; j++)
> +			dst[j] = src[j];
>  		src += s_pitch;
>  		dst += d_pitch;
>  	}

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: Jon S. <jon...@gm...> - 2005-07-29 20:21:44

On 7/29/05, James Simmons <jsi...@in...> wrote:
>
> > Thank you for your persistence.  I think I know the culprit.  Someone
> > insisted on using memcpy in fb_pad_aligned_buffer().  I have already
> > fixed this before, but apparently, the memcpy was brought back.  Try
> > the attached patch and let me know.
>
> Yipes, I did that. The memcpy function is supposed to be optimized for the
> platform. See string.h in the include/asm directory. I have seen, for
> example, that the Athlon would use the 3DNow instruction set to copy data.
> Something is really wrong with memcpy if moving byte by byte is faster!
> A lot of drivers use memcpy. If memcpy sucks then drivers should be copying
> byte by byte. The question I have is whether this is the case for non-Intel
> platforms as well. Could someone run the numbers on other platforms?

memmove/memcpy is faster. memcpy is faster than memmove so use it if
you can. But, there is a lower limit probably around 16 bytes or so
where the loop becomes faster.  So if you know that you will always be
copying small fragments, use the loop.  The compiler can't decide
between loop and memcpy for you: since it doesn't know the upper limit
on the length, it is forced to use memcpy because you told it to.

For small things it is even better to use a structure assignment if
possible. That lets the compiler decide whether to do a loop or memcpy,
since the length is known.

In this case if we could figure out how to give the compiler an upper
bound on the loop it might decide to unroll it and use multiple moves.
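
A rough C sketch of both ideas, under the assumption that a glyph row never exceeds 4 bytes. The type glyph_row, the constant MAX_ROW_BYTES and both function names are hypothetical. Whether a given compiler actually unrolls the bounded loop depends on the compiler and its optimization flags.

```c
#include <stdint.h>

enum { MAX_ROW_BYTES = 4 };   /* assumed upper bound: 32 pixels at 1 bpp */

/* Fixed-size row: the length is part of the type. */
struct glyph_row {
    uint8_t b[MAX_ROW_BYTES];
};

/* Structure assignment: size is known at compile time, so the compiler
 * is free to pick a single move, a loop, or a call. */
static void copy_row_struct(struct glyph_row *dst, const struct glyph_row *src)
{
    *dst = *src;
}

/* Variable length, but clamped so the compiler can see the upper bound.
 * Note that a larger n is silently truncated to MAX_ROW_BYTES. */
static void copy_row_bounded(uint8_t *dst, const uint8_t *src, unsigned n)
{
    unsigned j;

    if (n > MAX_ROW_BYTES)
        n = MAX_ROW_BYTES;
    for (j = 0; j < n; j++)
        dst[j] = src[j];
}
```

The clamp is only acceptable if callers already guarantee n <= MAX_ROW_BYTES; otherwise it changes behaviour rather than just informing the optimizer.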

>
> > Tony
> >
> >    fbdev: Replace memcpy with for-loop when preparing bitmap
> >
> >     Do not use memcpy in fb_pad_aligned_buffer. It is suboptimal because only
> >     a few bytes are moved at a time. Replace with a for-loop.
> >
> >     From: Antonino Daplas <ad...@po...>
> >     Signed-off-by: Antonino Daplas <ad...@po...>
> > ---
> >
> >  fbmem.c |    6 ++++--
> >  1 files changed, 4 insertions(+), 2 deletions(-)
> >
> > --- a/drivers/video/fbmem.c
> > +++ b/drivers/video/fbmem.c
> > @@ -80,10 +80,12 @@ EXPORT_SYMBOL(fb_get_color_depth);
> >   */
> >  void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
> >  {
> > -     int i;
> > +     int i, j;
> >
> >       for (i = height; i--; ) {
> > -             memcpy(dst, src, s_pitch);
> > +             /* s_pitch is a few bytes at the most, memcpy is suboptimal */
> > +             for (j = 0; j < s_pitch; j++)
> > +                     dst[j] = src[j];
> >               src += s_pitch;
> >               dst += d_pitch;
> >       }


-- 
Jon Smirl
jon...@gm...

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: Luca <kr...@pe...> - 2005-07-29 22:45:32

On Fri, Jul 29, 2005 at 08:51:34PM +0100, James Simmons wrote:
> > Thank you for your persistence.  I think I know the culprit.  Someone
> > insisted on using memcpy in fb_pad_aligned_buffer().  I have already
> > fixed this before, but apparently, the memcpy was brought back.  Try
> > the attached patch and let me know.
> 
> Yipes, I did that. The memcpy function is supposed to be optimized for the
> platform. See string.h in the include/asm directory. I have seen, for
> example, that the Athlon would use the 3DNow instruction set to copy data.
> Something is really wrong with memcpy if moving byte by byte is faster!

For small copies MMX/3DNow are not used at all. In the current kernel the
MMX/3DNow memcpy is used only when the data size is greater than 512 bytes.
Remember that MMX/3DNow uses the FPU, so the kernel must save/restore its
state, and this overhead would make the copy slow for small chunks.
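
The size-based dispatch described here can be pictured with a userspace sketch like the one below. Everything in it is an assumption for illustration: WIDE_COPY_THRESHOLD, plain_copy, wide_copy, dispatch_copy and the wide_copy_calls counter are hypothetical, and wide_copy merely stands in for a path that would have to save and restore FPU state in the kernel.

```c
#include <stddef.h>
#include <string.h>

#define WIDE_COPY_THRESHOLD 512   /* bytes, figure quoted in this message */

static int wide_copy_calls;       /* counts uses of the expensive path */

/* Cheap path: no setup cost, fine for a handful of bytes. */
static void *plain_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n--)
        *d++ = *s++;
    return dst;
}

/* Stand-in for a copy with a fixed setup cost (e.g. saving FPU state). */
static void *wide_copy(void *dst, const void *src, size_t n)
{
    wide_copy_calls++;
    return memcpy(dst, src, n);
}

/* Only pay the setup cost when the copy is large enough to amortize it. */
static void *dispatch_copy(void *dst, const void *src, size_t n)
{
    if (n > WIDE_COPY_THRESHOLD)
        return wide_copy(dst, src, n);
    return plain_copy(dst, src, n);
}
```

A one-byte glyph row never comes near the threshold, so under this scheme it always takes the plain path.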

Luca
-- 
Home: http://kronoz.cjb.net
If a man's destiny is to drown, he will drown even in a glass of
water.
Yiddish proverb

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: Knut P. <Knu...@t-...> - 2005-07-29 20:09:00

Hi Tony,

>
> Thank you for your persistence.  I think I know the culprit.  Someone
> insisted on using memcpy in fb_pad_aligned_buffer().  I have already
> fixed this before, but apparently, the memcpy was brought back.  Try
> the attached patch and let me know.
>
> Tony

Replacing memcpy() with this inline code helps. Performance is slightly
slower than it was in 2.6.12, but this is hardly measurable and could be
caused by other changes in the kernel.

The most affected test (test 3, cyblafb, ypan) is now about 7ms slower
than it was in 2.6.12. Without your patch the performance penalty was 202ms!

Yes, please send the patch to Linus asap, it's a must for 2.6.13.

Someone should look at memcpy ;-)

cu,
 Knut

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: Andrew M. <ak...@os...> - 2005-07-29 19:04:00

"Antonino A. Daplas" <ad...@gm...> wrote:
>
>     fbdev: Replace memcpy with for-loop when preparing bitmap

Whee, progress.  Please let me know if/when you want this sent to Linus.

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: James S. <jsi...@in...> - 2005-07-29 19:53:35

> "Antonino A. Daplas" <ad...@gm...> wrote:
> >
> >     fbdev: Replace memcpy with for-loop when preparing bitmap
> 
> Whee, progress.  Please let me know if/when you want this sent to Linus.

Before you do, I'd like to know whether memcpy really is slower than
byte-by-byte copying. This just seems to be wrong!

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: James S. <jsi...@in...> - 2005-07-29 19:59:37

Can you do some performance measurements with this patch instead? I have
a theory. I bet that because we didn't include the Linux version of string.h
we are using the glibc version instead, which is slower. In fact I bet that
with the include, memcpy will be faster than the byte-by-byte copy. Give it
a try.

--- /usr/src/linus-2.6/drivers/video/fbmem.c	2005-07-28 10:24:11.000000000 -0700
+++ fbmem.c	2005-07-29 12:53:30.000000000 -0700
@@ -15,6 +15,7 @@
 #include <linux/module.h>
 
 #include <linux/types.h>
+#include <linux/string.h>
 #include <linux/errno.h>
 #include <linux/sched.h>
 #include <linux/smp_lock.h>

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: Antonino A. D. <ad...@gm...> - 2005-07-29 22:45:18

Jon Smirl wrote:
> On 7/29/05, James Simmons <jsi...@in...> wrote:
>>> Thank you for your persistence.  I think I know the culprit.  Someone
>>> insisted on using memcpy in fb_pad_aligned_buffer().  I have already
>>> fixed this before, but apparently, the memcpy was brought back.  Try
>>> the attached patch and let me know.
>> Yipes, I did that. The memcpy function is supposed to be optimized for the
>> platform. See string.h in the include/asm directory. I have seen, for
>> example, that the Athlon would use the 3DNow instruction set to copy data.
>> Something is really wrong with memcpy if moving byte by byte is faster!
>> A lot of drivers use memcpy. If memcpy sucks then drivers should be copying
>> byte by byte. The question I have is whether this is the case for non-Intel
>> platforms as well. Could someone run the numbers on other platforms?
> 
> memmove/memcpy is faster. memcpy is faster than memmove so use it if
> you can. But, there is a lower limit probably around 16 bytes or so
> where the loop becomes faster.  So if you know that you will always be
> copying small fragments use the loop.  The compiler can't decide

Yes, the loop copies each row of a font character.  For an 8x16 font
that's 1 byte. The maximum fontwidth is 32. A 12x22 font does not pass
through this function because the width is not a multiple of 8.  So,
currently, it's used mostly for 8x16 fonts. 

I already know people using 16x30 fonts. There are probably others bigger
than that. 

Of course, we can always use Duff's device to unroll the loop in that
particular section, but even at 4 bytes, I don't know if it's worth the
effort. Does anyone know people using 32-pixel-wide fonts?

Tony
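
For reference, a Duff's device copy of the kind mentioned above might look like this in plain C. The name duff_copy is hypothetical and nothing like it is part of the posted patch; it only illustrates the unrolling technique.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy 'count' bytes using Duff's device, unrolled four ways.
 * The switch jumps into the middle of the do/while body to handle
 * the count % 4 leftover bytes on the first pass.
 */
static void duff_copy(uint8_t *dst, const uint8_t *src, size_t count)
{
    size_t n;

    if (count == 0)
        return;
    n = (count + 3) / 4;          /* number of passes through the body */
    switch (count % 4) {
    case 0: do { *dst++ = *src++; /* fall through */
    case 3:      *dst++ = *src++; /* fall through */
    case 2:      *dst++ = *src++; /* fall through */
    case 1:      *dst++ = *src++;
            } while (--n > 0);
    }
}
```

With rows of at most 4 bytes the body runs exactly once, so the gain over the simple for-loop is likely small, which matches the doubt expressed above.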

Re: [Linux-fbdev-devel] Fw: framebuffer blitting performance loss 2.6.12 -> 2.6.13-rc3

From: James S. <jsi...@in...> - 2005-08-03 17:30:12

> Yes, the loop copies each row of a font character.  For an 8x16 font
> that's 1 byte. The maximum fontwidth is 32. A 12x22 font does not pass
> through this function because the width is not a multiple of 8.  So,
> currently, it's used mostly for 8x16 fonts. 
> 
> I already know people using 16x30 fonts. There are probably others bigger
> than that. 
> 
> Of course, we can always use Duff's device to unroll the loop in that
> particular section, but even at 4 bytes, I don't know if it's worth the
> effort. Does anyone know people using 32-pixel-wide fonts?

The console system supports up to 32 pixel wide fonts. Even at that 
maximum size we only copy 4 bytes of data at a time. Unrolling the loop 
is right.
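
As a sketch of what such an unrolled copy could look like for rows of at most 4 bytes (the names glyph_row_bytes and copy_row_unrolled are hypothetical, and this is not the code that was merged):

```c
#include <stdint.h>

/* Bytes per glyph row at 1 bit per pixel: 8 -> 1, 16 -> 2, 32 -> 4. */
static unsigned glyph_row_bytes(unsigned width)
{
    return (width + 7) / 8;
}

/*
 * Copy one glyph row of 0..4 bytes with straight-line code.
 * Returns 0 on success, -1 if n exceeds the assumed 4-byte maximum.
 */
static int copy_row_unrolled(uint8_t *dst, const uint8_t *src, unsigned n)
{
    switch (n) {
    case 4: dst[3] = src[3]; /* fall through */
    case 3: dst[2] = src[2]; /* fall through */
    case 2: dst[1] = src[1]; /* fall through */
    case 1: dst[0] = src[0]; /* fall through */
    case 0: return 0;
    default: return -1;      /* wider than a 32-pixel font row */
    }
}
```

For an 8x16 font this degenerates to a single byte store per row, with no loop counter and no function call into memcpy.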