Thread: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

Brought to you by: bradd, geertu, jsimmons, wyo

linux-fbdev-devel

[Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Knut P. <Knu...@t-...> - 2005-08-30 16:12:24

This trivial patch gives a performance boost to the framebuffer console

Constructing the bitmaps that are given to the bitblit functions of the 
framebuffer
drivers is time consuming. Here we avoide a call to the slow 
fb_pad_aligned_buffer().
The patch replaces that call with a simple but much more efficient 
bytewise copy.

The kernel spends a significant time at this place if you use 8x* fonts. 
Every
pixel displayed on your screen is prepared here.

Some benchmark results:

Displaying a file of 2000 lines with 160 characters each takes 889 ms 
system
time using  cyblafb on my system  (I´m using a 1280x1024 video mode,
resulting in a 160x64 character console)

Displaying the same file with the enclosed patch applied to 2.6.13 only 
takes
760 ms system time, saving 129 ms or 14.5%.

Font widths other than 8 are not affected.

The advantage and correctness of this patch should be obvious.

Signed-off-by: Knut Petersen <Knu...@t-...>

diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/drivers/video/console/bitblit.c linux/drivers/video/console/bitblit.c
--- linuxorig/drivers/video/console/bitblit.c	2005-08-29 01:41:01.000000000 +0200
+++ linux/drivers/video/console/bitblit.c	2005-08-30 17:19:57.000000000 +0200
@@ -114,7 +114,7 @@ static void bit_putcs(struct vc_data *vc
 	unsigned int scan_align = info->pixmap.scan_align - 1;
 	unsigned int buf_align = info->pixmap.buf_align - 1;
 	unsigned int shift_low = 0, mod = vc->vc_font.width % 8;
-	unsigned int shift_high = 8, pitch, cnt, size, k;
+	unsigned int shift_high = 8, pitch, cnt, size, i, k;
 	unsigned int idx = vc->vc_font.width >> 3;
 	unsigned int attribute = get_attribute(info, scr_readw(s));
 	struct fb_image image;
@@ -175,7 +175,11 @@ static void bit_putcs(struct vc_data *vc
 					src = buf;
 				}
 
-				fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
+				if (idx == 1)
+					for(i=0; i < image.height; i++)
+						dst[pitch*i] = src[i];
+				else
+					fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
 				dst += width;
 			}
 		}

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Geert U. <ge...@li...> - 2005-08-30 16:18:53

On Tue, 30 Aug 2005, Knut Petersen wrote:
> linux/drivers/video/console/bitblit.c
> --- linuxorig/drivers/video/console/bitblit.c	2005-08-29 01:41:01.000000000
> +0200
> +++ linux/drivers/video/console/bitblit.c	2005-08-30 17:19:57.000000000
> +0200
> @@ -114,7 +114,7 @@ static void bit_putcs(struct vc_data *vc
> 	unsigned int scan_align = info->pixmap.scan_align - 1;
> 	unsigned int buf_align = info->pixmap.buf_align - 1;
> 	unsigned int shift_low = 0, mod = vc->vc_font.width % 8;
> -	unsigned int shift_high = 8, pitch, cnt, size, k;
> +	unsigned int shift_high = 8, pitch, cnt, size, i, k;
> 	unsigned int idx = vc->vc_font.width >> 3;
> 	unsigned int attribute = get_attribute(info, scr_readw(s));
> 	struct fb_image image;
> @@ -175,7 +175,11 @@ static void bit_putcs(struct vc_data *vc
> 					src = buf;
> 				}
> 
> -				fb_pad_aligned_buffer(dst, pitch, src, idx,
> image.height);
> +				if (idx == 1)
> +					for(i=0; i < image.height; i++)
> +						dst[pitch*i] = src[i];
                                                    ^^^^^^^
> +				else
> +					fb_pad_aligned_buffer(dst, pitch, src,
> idx, image.height);
> 				dst += width;
> 			}
> 		}

Probably you can make it even faster by avoiding the multiplication, like

    unsigned int offset = 0;
    for (i = 0; i < image.height; i++) {
	dst[offset] = src[i];
	offset += pitch;
    }

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@li...

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
							    -- Linus Torvalds

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Knut P. <Knu...@t-...> - 2005-08-30 17:55:51

>Probably you can make it even faster by avoiding the multiplication, like
>
>    unsigned int offset = 0;
>    for (i = 0; i < image.height; i++) {
>	dst[offset] = src[i];
>	offset += pitch;
>    }
>

More than two decades ago I learned to avoid mul and imul. Use shifts, 
add and lea instead,
that was the credo those days. The name of the game was CP/M 80/86, a86, 
d86 and ddt ;-)

But let´s get serious again.

Your proposed change of the patch results in a 21 ms performance 
decrease on my system.
Yes, I do know that this is hard to believe. I tested a similar 
variation before, and the results
were even worse.

Avoiding mul is a good idea in assembly language today, but often it is 
better to write a
multiplication  with the loop counter in C and not to introduce an extra 
variable instead. The
compiler will optimize the code and it´s easier for gcc without that 
extra variable.

More interesting would be the question what should be done for idx==2 or 
idx==3. Probably
fb_pad_aligned_buffer() is also slower for those cases. But does anybody 
use such fonts?

cu,
 knut

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Roman Z. <zi...@li...> - 2005-08-30 19:14:29

Hi,

On Tue, 30 Aug 2005, Knut Petersen wrote:

> > Probably you can make it even faster by avoiding the multiplication, li=
ke
> >=20
> >    unsigned int offset =3D 0;
> >    for (i =3D 0; i < image.height; i++) {
> > =09dst[offset] =3D src[i];
> > =09offset +=3D pitch;
> >    }
> >=20
>=20
> More than two decades ago I learned to avoid mul and imul. Use shifts, ad=
d and
> lea instead,
> that was the credo those days. The name of the game was CP/M 80/86, a86, =
d86
> and ddt ;-)
>=20
> But let=B4s get serious again.
>=20
> Your proposed change of the patch results in a 21 ms performance decrease=
 on
> my system.
> Yes, I do know that this is hard to believe. I tested a similar variation
> before, and the results
> were even worse.

Could you try the patch below, for a few extra cycles you might want to=20
make it an inline function.

bye, Roman

Index: linux-2.6/drivers/video/fbmem.c
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
--- linux-2.6.orig/drivers/video/fbmem.c=092005-08-30 01:55:18.000000000 +0=
200
+++ linux-2.6/drivers/video/fbmem.c=092005-08-30 21:10:25.705462837 +0200
@@ -82,11 +82,11 @@ void fb_pad_aligned_buffer(u8 *dst, u32=20
 {
 =09int i, j;
=20
+=09d_pitch -=3D s_pitch;
 =09for (i =3D height; i--; ) {
 =09=09/* s_pitch is a few bytes at the most, memcpy is suboptimal */
 =09=09for (j =3D 0; j < s_pitch; j++)
-=09=09=09dst[j] =3D src[j];
-=09=09src +=3D s_pitch;
+=09=09=09*dst++ =3D *src++;
 =09=09dst +=3D d_pitch;
 =09}
 }

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Knut P. <Knu...@t-...> - 2005-08-30 22:24:01

Hi Roman,

>Could you try the patch below, for a few extra cycles you might want to 
>make it an inline function.
>  
>
No, it does not help. If there is any difference, it is too small to be 
measured on
my system ... and my system does run at 1000 Hz.

After 2.6.12 fb_pad_aligned_buffer() was changed to use memcpy() instead 
of a
bytewise copy. That slowed things down a lot, some weeks ago that was 
reverted.
fb_pad_aligned _buffer() isn´t that slow, it´s just an external function 
to be called
and that means a lot of unnecessary code.

How could I make it an inline function? It is used in console/bitblit.c, 
nvidia/nvidia.c,
riva/fbdev.c and softcursor.c.

cu,
 Knut

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Roman Z. <zi...@li...> - 2005-08-31 00:51:18

Hi,

On Wed, 31 Aug 2005, Knut Petersen wrote:

> How could I make it an inline function? It is used in console/bitblit.c,
> nvidia/nvidia.c,
> riva/fbdev.c and softcursor.c.

Something like below, which has the advantange that there is still only 
one implementation of the function and if it's still slower, we really 
need to check the compiler.

bye, Roman

 drivers/video/console/bitblit.c |    5 ++++-
 drivers/video/fbmem.c           |   10 +---------
 include/linux/fb.h              |   13 +++++++++++++
 3 files changed, 18 insertions(+), 10 deletions(-)

Index: linux-2.6/drivers/video/fbmem.c
===================================================================
--- linux-2.6.orig/drivers/video/fbmem.c	2005-08-30 21:17:44.000000000 +0200
+++ linux-2.6/drivers/video/fbmem.c	2005-08-31 01:20:37.000000000 +0200
@@ -80,15 +80,7 @@ EXPORT_SYMBOL(fb_get_color_depth);
  */
 void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
 {
-	int i, j;
-
-	for (i = height; i--; ) {
-		/* s_pitch is a few bytes at the most, memcpy is suboptimal */
-		for (j = 0; j < s_pitch; j++)
-			dst[j] = src[j];
-		src += s_pitch;
-		dst += d_pitch;
-	}
+	__fb_pad_aligned_buffer(dst, d_pitch, src, s_pitch, height);
 }
 EXPORT_SYMBOL(fb_pad_aligned_buffer);
 
Index: linux-2.6/drivers/video/console/bitblit.c
===================================================================
--- linux-2.6.orig/drivers/video/console/bitblit.c	2005-08-30 01:55:20.000000000 +0200
+++ linux-2.6/drivers/video/console/bitblit.c	2005-08-31 01:25:30.000000000 +0200
@@ -175,7 +175,10 @@ static void bit_putcs(struct vc_data *vc
 					src = buf;
 				}
 
-				fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
+				if (likely(idx == 1))
+					__fb_pad_aligned_buffer(dst, pitch, src, 1, image.height);
+				else
+					fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
 				dst += width;
 			}
 		}
Index: linux-2.6/include/linux/fb.h
===================================================================
--- linux-2.6.orig/include/linux/fb.h	2005-08-30 01:56:29.000000000 +0200
+++ linux-2.6/include/linux/fb.h	2005-08-31 01:21:04.000000000 +0200
@@ -824,6 +824,19 @@ extern int fb_get_color_depth(struct fb_
 extern int fb_get_options(char *name, char **option);
 extern int fb_new_modelist(struct fb_info *info);
 
+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
+{
+	int i, j;
+
+	d_pitch -= s_pitch;
+	for (i = height; i--; ) {
+		/* s_pitch is a few bytes at the most, memcpy is suboptimal */
+		for (j = 0; j < s_pitch; j++)
+			*dst++ = *src++;
+		dst += d_pitch;
+	}
+}
+
 extern struct fb_info *registered_fb[FB_MAX];
 extern int num_registered_fb;

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Knut P. <Knu...@t-...> - 2005-08-31 12:43:53

>Something like below, which has the advantange that there is still only 
>one implementation of the function
>
True, that´s a great advantage.

> and if it's still slower, we really need to check the compiler
>  
>
Please have a look at the following patch. It takes your idea of 
inlining but moves
the special cases into the macro, speeding things up for the very likely 
case of
s_pitch == 1 and the less likely case of s_pitch of 2. Treating s_pitch 
== 2 special
gives a still significant performance improvement of more than 10 % for 
16x30
fonts.

This way also bit_putcs looks better again ...

Andrew, as this way is better than and still as fast as my first 
approach I think
framebuffer-bit_putcs-optimization-for-8x.patch should be reverted and the
following patch should be applied instead.

Antonino, Roman, Geert, do you agree?


cu,
 knut

diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/drivers/video/console/bitblit.c linux/drivers/video/console/bitblit.c
--- linuxorig/drivers/video/console/bitblit.c	2005-08-29 01:41:01.000000000 +0200
+++ linux/drivers/video/console/bitblit.c	2005-08-31 10:06:22.000000000 +0200
@@ -175,7 +175,7 @@ static void bit_putcs(struct vc_data *vc
 					src = buf;
 				}
 
-				fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
+				__fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
 				dst += width;
 			}
 		}
diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/drivers/video/fbmem.c linux/drivers/video/fbmem.c
--- linuxorig/drivers/video/fbmem.c	2005-08-29 01:41:01.000000000 +0200
+++ linux/drivers/video/fbmem.c	2005-08-31 13:36:16.000000000 +0200
@@ -80,15 +80,7 @@ EXPORT_SYMBOL(fb_get_color_depth);
  */
 void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
 {
-	int i, j;
-
-	for (i = height; i--; ) {
-		/* s_pitch is a few bytes at the most, memcpy is suboptimal */
-		for (j = 0; j < s_pitch; j++)
-			dst[j] = src[j];
-		src += s_pitch;
-		dst += d_pitch;
-	}
+	__fb_pad_aligned_buffer(dst, d_pitch, src, s_pitch, height);
 }
 EXPORT_SYMBOL(fb_pad_aligned_buffer);
 
diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/include/linux/fb.h linux/include/linux/fb.h
--- linuxorig/include/linux/fb.h	2005-08-29 01:41:01.000000000 +0200
+++ linux/include/linux/fb.h	2005-08-31 12:45:08.000000000 +0200
@@ -824,6 +824,38 @@ extern int fb_get_color_depth(struct fb_
 extern int fb_get_options(char *name, char **option);
 extern int fb_new_modelist(struct fb_info *info);
 
+
+/*
+ * Don't change without testing performance of framebuffer
+ * bitblitting. Inlining is necessary for performance reasons.
+ * Although the code might not _look_ fast because of some
+ * multiplications, it really _is_ fast as it is easier for gcc
+ * to optimize well.
+ */
+
+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, 
+						u32 s_pitch, u32 height)
+{
+	int i, j;
+
+	if (likely(s_pitch==1))
+		for(i=0; i < height; i++)
+			dst[d_pitch*i] = src[i];
+	else if (s_pitch==2)
+		for(i=0; i < height; i++) {
+			*(u16 *)dst = ((u16 *)src)[i];
+			dst += d_pitch;
+		}
+	else {
+		d_pitch -= s_pitch;
+		for (i = height; i--; ) {
+			for (j = 0; j < s_pitch; j++)
+				*dst++ = *src++;
+			dst += d_pitch;
+		}
+	}
+}
+
 extern struct fb_info *registered_fb[FB_MAX];
 extern int num_registered_fb;

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Roman Z. <zi...@li...> - 2005-08-31 17:15:37

Hi,

On Wed, 31 Aug 2005, Knut Petersen wrote:

> +static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, +
> u32 s_pitch, u32 height)
> +{
> +	int i, j;
> +
> +	if (likely(s_pitch==1))
> +		for(i=0; i < height; i++)
> +			dst[d_pitch*i] = src[i];
> +	else if (s_pitch==2)
> +		for(i=0; i < height; i++) {
> +			*(u16 *)dst = ((u16 *)src)[i];
> +			dst += d_pitch;
> +		}
> +	else {
> +		d_pitch -= s_pitch;
> +		for (i = height; i--; ) {
> +			for (j = 0; j < s_pitch; j++)
> +				*dst++ = *src++;
> +			dst += d_pitch;
> +		}
> +	}
> +}

Why did you add the multiply back?
You have now 3 slightly different variants of the same, which isn't really 
an improvement. In my example I showed you how to generate the first and 
last version from the same source.
If you also want to optimize for other sizes, you might want to always 
inline the function, if the function call overhead is the largest part 
anyway, the special case for 2 bytes might not be needed anymore.

BTW this version saves another condition:

static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
{
	int i, j;

	d_pitch -= s_pitch;
	i = height;
	do {
		/* s_pitch is a few bytes at the most, memcpy is suboptimal */
		j = s_pitch;
		do
			*dst++ = *src++;
		while (--j > 0);
		dst += d_pitch;
	} while (--i > 0);
}

bye, Roman

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Knut P. <Knu...@t-...> - 2005-08-31 19:16:35

  Hi Roman!

>>+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, +
>>u32 s_pitch, u32 height)
>>+{
>>+	int i, j;
>>+
>>+	if (likely(s_pitch==1))
>>+		for(i=0; i < height; i++)
>>+			dst[d_pitch*i] = src[i];
>>    
>>

I added the multiply back because gcc (v. 3.3.4) does generate the 
fastest code
if I write it this way. I compiled, inspected the generated assembly and 
benchmarked
about a  dozend variations of the code (benchmark as previously described).

The special case for s_pitch == 1 saves about 10 ms system time (770 ms 
-> 760 ms)

The special case for s_pitch == 2 saves about 270 ms system time (2120 
-> 1850ms)
with a 16x30 font.

The third case is for even bigger fonts ... I believe that it will not 
often be used but
something like that must be present.

>You have now 3 slightly different variants of the same, which isn't really 
>an improvement. In my example I showed you how to generate the first and 
>last version from the same source.
>
The first version will only be generated when gcc can be sure that 
s_pitch is 1.
Therefore you had to explicitly call __fb_pad_aligned_buffer with that 
value:

	if (likely(idx == 1))
		__fb_pad_aligned_buffer(dst, pitch, src, 1, image.height);
	else
		fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);
 
With the version I propose it´s enough to write

		__fb_pad_aligned_buffer(dst, pitch, src, idx, image.height);

instead, and you will get good performance for all cases. If the value of
idx/s_pitch is know at compile time, the compiler can and will ignore the
other cases.

>If you also want to optimize for other sizes, you might want to always 
>inline the function, if the function call overhead is the largest part 
>anyway, the special case for 2 bytes might not be needed anymore.
>  
>
fb_pad_aligned_buffer() is  usefull to save some space in cases like 
softcursor.
It´s also used by some  drivers (nvidia and riva), but the authors of 
those drivers
have to decide if they prefer the inlined version or the version fixed 
in fbmem.

>BTW this version saves another condition:
>
>static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height)
>{
>	int i, j;
>
>	d_pitch -= s_pitch;
>	i = height;
>	do {
>		/* s_pitch is a few bytes at the most, memcpy is suboptimal */
>		j = s_pitch;
>		do
>			*dst++ = *src++;
>		while (--j > 0);
>		dst += d_pitch;
>	} while (--i > 0);
>}
>
I tested that code, together with the followig code in but_putcs():

             if(idx==1)
                    __fb_pad_aligned_buffer(dst, pitch, src, 1, 
image.height);
             else if (idx==2)
                    __fb_pad_aligned_buffer(dst, pitch, src, 2, 
image.height);
             else
                    fb_pad_aligned_buffer(dst, pitch, src, idx, 
image.height);
             dst += width;

It´s as fast/slow as your previous version, the measurements are almost 
identical.

Let´s summarize:

Your version of __fb_pad_aligned_buffer looks much better, but it needs 
not so nice
conditionals when used. My version looks bad, but it is easier to use 
and it is
_faster_.

cu,
 knut

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Roman Z. <zi...@li...> - 2005-08-31 19:34:40

Hi,

On Wed, 31 Aug 2005, Knut Petersen wrote:

> I added the multiply back because gcc (v. 3.3.4) does generate the fastes=
t
> code
> if I write it this way.

The multiply is not generally faster, so your version may be the fastest,=
=20
but in other situations it will be a lot slower. My version is generally=20
fast.

> The special case for s_pitch =3D=3D 2 saves about 270 ms system time (212=
0 ->
> 1850ms)
> with a 16x30 font.

Compared to what? How much is the function call overhead?

> It=B4s as fast/slow as your previous version, the measurements are almost
> identical.

The generated code is a bit smaller, so it mostly depends on the icache to=
=20
see a difference.

bye, Roman

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Knut P. <Knu...@t-...> - 2005-08-31 19:49:22

>>The special case for s_pitch == 2 saves about 270 ms system time (2120 ->
>>1850ms)
>>with a 16x30 font.
>>    
>>
>Compared to what? How much is the function call overhead?
>
>  
>
Your version of the inline code inserted after an if (idx==2) in 
bit_putcs against my version of the
inline code.

cu,
 knut

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Geert U. <ge...@li...> - 2005-08-30 20:00:21

On Tue, 30 Aug 2005, Knut Petersen wrote:
> > Probably you can make it even faster by avoiding the multiplication, =
like
> >=20
> >    unsigned int offset =3D 0;
> >    for (i =3D 0; i < image.height; i++) {
> > 	dst[offset] =3D src[i];
> > 	offset +=3D pitch;
> >    }
>=20
> More than two decades ago I learned to avoid mul and imul. Use shifts, =
add and
> lea instead,
> that was the credo those days. The name of the game was CP/M 80/86, a86=
, d86
> and ddt ;-)
>=20
> But let=B4s get serious again.

On modern CPUs, a multiplication indeed takes 1 cycle, just like an addit=
ion.
But on older CPUs (still supported by Linux), this is not true.

> Your proposed change of the patch results in a 21 ms performance decrea=
se on
> my system.
> Yes, I do know that this is hard to believe. I tested a similar variati=
on
> before, and the results
> were even worse.
>=20
> Avoiding mul is a good idea in assembly language today, but often it is=
 better
> to write a
> multiplication  with the loop counter in C and not to introduce an extr=
a
> variable instead. The
> compiler will optimize the code and it=B4s easier for gcc without that =
extra
> variable.

But you are right. On actual inspection of the generated assembly code fo=
r a
very simple test case, it turns out both (m68k-linux-)gcc 2.95.2 and 3.3.=
3 are
smart enough to convert the multiplication to an addition...

And interestingly, if I avoid the multiplication explicitly, gcc 2.95.2 s=
till
generates the same code, but 3.3.3 adds a few extra instructions to
save/restore local vars. So this probably explains why it turned out to b=
e
slower for you. Ugh...

Gr{oetje,eeting}s,

						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m6=
8k.org

In personal conversations with technical people, I call myself a hacker. =
But
when I'm talking to journalists I just say "programmer" or something like=
 that.
							    -- Linus Torvalds

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Antonino A. D. <ad...@gm...> - 2005-08-31 01:15:21

Knut Petersen wrote:
> fb_pad_aligned_buffer() is also slower for those cases. But does anybody
> use such fonts?

Yes, there are 16x30 fonts out there in the wild. 

Tony

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Antonino A. D. <ad...@gm...> - 2005-08-31 06:43:20

Roman Zippel wrote:
> Hi,
> 
> On Wed, 31 Aug 2005, Knut Petersen wrote:
> 
>> How could I make it an inline function? It is used in console/bitblit.c,
>> nvidia/nvidia.c,
>> riva/fbdev.c and softcursor.c.
> 
> Something like below, which has the advantange that there is still only 
> one implementation of the function and if it's still slower, we really 
> need to check the compiler.
> 

I do get better numbers with this, not much, but better than Knut's (5ms), and
definitely much better than the old uninlined one (100ms).

Andrew, don't get this yet.  I'll incorporate this with the bit_putcs()
breakup that I'm currently doing.

Roman, okay if you have a 'Signed-off-by' line?

Tony

Re: [Linux-fbdev-devel] [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Roman Z. <zi...@li...> - 2005-08-31 15:50:12

Hi,

On Wed, 31 Aug 2005, Antonino A. Daplas wrote:

> Roman, okay if you have a 'Signed-off-by' line?

Okay.

bye, Roman

[Linux-fbdev-devel] Re: [PATCH 1/1 2.6.13] framebuffer: bit_putcs() optimization for 8x* fonts

From: Antonino A. D. <ad...@gm...> - 2005-08-31 01:15:09

Knut Petersen wrote:
> This trivial patch gives a performance boost to the framebuffer console
> 
> Constructing the bitmaps that are given to the bitblit functions of the
> framebuffer
> drivers is time consuming. Here we avoide a call to the slow
> fb_pad_aligned_buffer().
> The patch replaces that call with a simple but much more efficient
> bytewise copy.
> 
> The kernel spends a significant time at this place if you use 8x* fonts.
> Every
> pixel displayed on your screen is prepared here.
> 
> Some benchmark results:
> 
> Displaying a file of 2000 lines with 160 characters each takes 889 ms
> system
> time using  cyblafb on my system  (I´m using a 1280x1024 video mode,
> resulting in a 160x64 character console)
> 
> Displaying the same file with the enclosed patch applied to 2.6.13 only
> takes
> 760 ms system time, saving 129 ms or 14.5%.

Where did this 14.5% come from?  Is it bit_putcs alone or is more
real world, such as 'time cat text_file'? If I'm to guess, a large
percent of the improvement is caused by the inlining of the code.

I'm not against the patch, it will benefit drivers with very fast
imageblits.  However, since most drivers have imageblits done in software,
a large proportion of the processing time will go to the imageblit itself,
so I don't think you'll get that high a number (I get only a 4%
improvement, and this is in a driver with accelerated blits, and it will
probably be lower, ie, in vesafb).

Anyway, with the addition of your patch, bit_putcs has now reached an
'ugliness threshhold' for a revamp.

Tony