From: Knut P. <Knu...@t-...> - 2005-08-30 16:12:24
|
This trivial patch gives a performance boost to the framebuffer console Constructing the bitmaps that are given to the bitblit functions of the framebuffer drivers is time consuming. Here we avoide a call to the slow fb_pad_aligned_buffer(). The patch replaces that call with a simple but much more efficient bytewise copy. The kernel spends a significant time at this place if you use 8x* fonts. Every pixel displayed on your screen is prepared here. Some benchmark results: Displaying a file of 2000 lines with 160 characters each takes 889 ms system time using cyblafb on my system (I´m using a 1280x1024 video mode, resulting in a 160x64 character console) Displaying the same file with the enclosed patch applied to 2.6.13 only takes 760 ms system time, saving 129 ms or 14.5%. Font widths other than 8 are not affected. The advantage and correctness of this patch should be obvious. Signed-off-by: Knut Petersen <Knu...@t-...> diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/drivers/video/console/bitblit.c linux/drivers/video/console/bitblit.c --- linuxorig/drivers/video/console/bitblit.c 2005-08-29 01:41:01.000000000 +0200 +++ linux/drivers/video/console/bitblit.c 2005-08-30 17:19:57.000000000 +0200 @@ -114,7 +114,7 @@ static void bit_putcs(struct vc_data *vc unsigned int scan_align = info->pixmap.scan_align - 1; unsigned int buf_align = info->pixmap.buf_align - 1; unsigned int shift_low = 0, mod = vc->vc_font.width % 8; - unsigned int shift_high = 8, pitch, cnt, size, k; + unsigned int shift_high = 8, pitch, cnt, size, i, k; unsigned int idx = vc->vc_font.width >> 3; unsigned int attribute = get_attribute(info, scr_readw(s)); struct fb_image image; @@ -175,7 +175,11 @@ static void bit_putcs(struct vc_data *vc src = buf; } - fb_pad_aligned_buffer(dst, pitch, src, idx, image.height); + if (idx == 1) + for(i=0; i < image.height; i++) + dst[pitch*i] = src[i]; + else + fb_pad_aligned_buffer(dst, pitch, src, idx, image.height); dst += width; } } |
From: Geert U. <ge...@li...> - 2005-08-30 16:18:53
|
On Tue, 30 Aug 2005, Knut Petersen wrote: > linux/drivers/video/console/bitblit.c > --- linuxorig/drivers/video/console/bitblit.c 2005-08-29 01:41:01.000000000 > +0200 > +++ linux/drivers/video/console/bitblit.c 2005-08-30 17:19:57.000000000 > +0200 > @@ -114,7 +114,7 @@ static void bit_putcs(struct vc_data *vc > unsigned int scan_align = info->pixmap.scan_align - 1; > unsigned int buf_align = info->pixmap.buf_align - 1; > unsigned int shift_low = 0, mod = vc->vc_font.width % 8; > - unsigned int shift_high = 8, pitch, cnt, size, k; > + unsigned int shift_high = 8, pitch, cnt, size, i, k; > unsigned int idx = vc->vc_font.width >> 3; > unsigned int attribute = get_attribute(info, scr_readw(s)); > struct fb_image image; > @@ -175,7 +175,11 @@ static void bit_putcs(struct vc_data *vc > src = buf; > } > > - fb_pad_aligned_buffer(dst, pitch, src, idx, > image.height); > + if (idx == 1) > + for(i=0; i < image.height; i++) > + dst[pitch*i] = src[i]; ^^^^^^^ > + else > + fb_pad_aligned_buffer(dst, pitch, src, > idx, image.height); > dst += width; > } > } Probably you can make it even faster by avoiding the multiplication, like unsigned int offset = 0; for (i = 0; i < image.height; i++) { dst[offset] = src[i]; offset += pitch; } Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@li... In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds |
From: Knut P. <Knu...@t-...> - 2005-08-30 17:55:51
|
>Probably you can make it even faster by avoiding the multiplication, like > > unsigned int offset = 0; > for (i = 0; i < image.height; i++) { > dst[offset] = src[i]; > offset += pitch; > } > More than two decades ago I learned to avoid mul and imul. Use shifts, add and lea instead, that was the credo those days. The name of the game was CP/M 80/86, a86, d86 and ddt ;-) But let´s get serious again. Your proposed change of the patch results in a 21 ms performance decrease on my system. Yes, I do know that this is hard to believe. I tested a similar variation before, and the results were even worse. Avoiding mul is a good idea in assembly language today, but often it is better to write a multiplication with the loop counter in C and not to introduce an extra variable instead. The compiler will optimize the code and it´s easier for gcc without that extra variable. More interesting would be the question what should be done for idx==2 or idx==3. Probably fb_pad_aligned_buffer() is also slower for those cases. But does anybody use such fonts? cu, knut |
From: Roman Z. <zi...@li...> - 2005-08-30 19:14:29
|
Hi, On Tue, 30 Aug 2005, Knut Petersen wrote: > > Probably you can make it even faster by avoiding the multiplication, li= ke > >=20 > > unsigned int offset =3D 0; > > for (i =3D 0; i < image.height; i++) { > > =09dst[offset] =3D src[i]; > > =09offset +=3D pitch; > > } > >=20 >=20 > More than two decades ago I learned to avoid mul and imul. Use shifts, ad= d and > lea instead, > that was the credo those days. The name of the game was CP/M 80/86, a86, = d86 > and ddt ;-) >=20 > But let=B4s get serious again. >=20 > Your proposed change of the patch results in a 21 ms performance decrease= on > my system. > Yes, I do know that this is hard to believe. I tested a similar variation > before, and the results > were even worse. Could you try the patch below, for a few extra cycles you might want to=20 make it an inline function. bye, Roman Index: linux-2.6/drivers/video/fbmem.c =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D --- linux-2.6.orig/drivers/video/fbmem.c=092005-08-30 01:55:18.000000000 +0= 200 +++ linux-2.6/drivers/video/fbmem.c=092005-08-30 21:10:25.705462837 +0200 @@ -82,11 +82,11 @@ void fb_pad_aligned_buffer(u8 *dst, u32=20 { =09int i, j; =20 +=09d_pitch -=3D s_pitch; =09for (i =3D height; i--; ) { =09=09/* s_pitch is a few bytes at the most, memcpy is suboptimal */ =09=09for (j =3D 0; j < s_pitch; j++) -=09=09=09dst[j] =3D src[j]; -=09=09src +=3D s_pitch; +=09=09=09*dst++ =3D *src++; =09=09dst +=3D d_pitch; =09} } |
From: Knut P. <Knu...@t-...> - 2005-08-30 22:24:01
|
Hi Roman, >Could you try the patch below, for a few extra cycles you might want to >make it an inline function. > > No, it does not help. If there is any difference, it is too small to be measured on my system ... and my system does run at 1000 Hz. After 2.6.12 fb_pad_aligned_buffer() was changed to use memcpy() instead of a bytewise copy. That slowed things down a lot, some weeks ago that was reverted. fb_pad_aligned _buffer() isn´t that slow, it´s just an external function to be called and that means a lot of unnecessary code. How could I make it an inline function? It is used in console/bitblit.c, nvidia/nvidia.c, riva/fbdev.c and softcursor.c. cu, Knut |
From: Roman Z. <zi...@li...> - 2005-08-31 00:51:18
|
Hi, On Wed, 31 Aug 2005, Knut Petersen wrote: > How could I make it an inline function? It is used in console/bitblit.c, > nvidia/nvidia.c, > riva/fbdev.c and softcursor.c. Something like below, which has the advantange that there is still only one implementation of the function and if it's still slower, we really need to check the compiler. bye, Roman drivers/video/console/bitblit.c | 5 ++++- drivers/video/fbmem.c | 10 +--------- include/linux/fb.h | 13 +++++++++++++ 3 files changed, 18 insertions(+), 10 deletions(-) Index: linux-2.6/drivers/video/fbmem.c =================================================================== --- linux-2.6.orig/drivers/video/fbmem.c 2005-08-30 21:17:44.000000000 +0200 +++ linux-2.6/drivers/video/fbmem.c 2005-08-31 01:20:37.000000000 +0200 @@ -80,15 +80,7 @@ EXPORT_SYMBOL(fb_get_color_depth); */ void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height) { - int i, j; - - for (i = height; i--; ) { - /* s_pitch is a few bytes at the most, memcpy is suboptimal */ - for (j = 0; j < s_pitch; j++) - dst[j] = src[j]; - src += s_pitch; - dst += d_pitch; - } + __fb_pad_aligned_buffer(dst, d_pitch, src, s_pitch, height); } EXPORT_SYMBOL(fb_pad_aligned_buffer); Index: linux-2.6/drivers/video/console/bitblit.c =================================================================== --- linux-2.6.orig/drivers/video/console/bitblit.c 2005-08-30 01:55:20.000000000 +0200 +++ linux-2.6/drivers/video/console/bitblit.c 2005-08-31 01:25:30.000000000 +0200 @@ -175,7 +175,10 @@ static void bit_putcs(struct vc_data *vc src = buf; } - fb_pad_aligned_buffer(dst, pitch, src, idx, image.height); + if (likely(idx == 1)) + __fb_pad_aligned_buffer(dst, pitch, src, 1, image.height); + else + fb_pad_aligned_buffer(dst, pitch, src, idx, image.height); dst += width; } } Index: linux-2.6/include/linux/fb.h =================================================================== --- linux-2.6.orig/include/linux/fb.h 2005-08-30 01:56:29.000000000 +0200 +++ linux-2.6/include/linux/fb.h 2005-08-31 01:21:04.000000000 +0200 @@ -824,6 +824,19 @@ extern int fb_get_color_depth(struct fb_ extern int fb_get_options(char *name, char **option); extern int fb_new_modelist(struct fb_info *info); +static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height) +{ + int i, j; + + d_pitch -= s_pitch; + for (i = height; i--; ) { + /* s_pitch is a few bytes at the most, memcpy is suboptimal */ + for (j = 0; j < s_pitch; j++) + *dst++ = *src++; + dst += d_pitch; + } +} + extern struct fb_info *registered_fb[FB_MAX]; extern int num_registered_fb; |
From: Knut P. <Knu...@t-...> - 2005-08-31 12:43:53
|
>Something like below, which has the advantange that there is still only >one implementation of the function > True, that´s a great advantage. > and if it's still slower, we really need to check the compiler > > Please have a look at the following patch. It takes your idea of inlining but moves the special cases into the macro, speeding things up for the very likely case of s_pitch == 1 and the less likely case of s_pitch of 2. Treating s_pitch == 2 special gives a still significant performance improvement of more than 10 % for 16x30 fonts. This way also bit_putcs looks better again ... Andrew, as this way is better than and still as fast as my first approach I think framebuffer-bit_putcs-optimization-for-8x.patch should be reverted and the following patch should be applied instead. Antonino, Roman, Geert, do you agree? cu, knut diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/drivers/video/console/bitblit.c linux/drivers/video/console/bitblit.c --- linuxorig/drivers/video/console/bitblit.c 2005-08-29 01:41:01.000000000 +0200 +++ linux/drivers/video/console/bitblit.c 2005-08-31 10:06:22.000000000 +0200 @@ -175,7 +175,7 @@ static void bit_putcs(struct vc_data *vc src = buf; } - fb_pad_aligned_buffer(dst, pitch, src, idx, image.height); + __fb_pad_aligned_buffer(dst, pitch, src, idx, image.height); dst += width; } } diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/drivers/video/fbmem.c linux/drivers/video/fbmem.c --- linuxorig/drivers/video/fbmem.c 2005-08-29 01:41:01.000000000 +0200 +++ linux/drivers/video/fbmem.c 2005-08-31 13:36:16.000000000 +0200 @@ -80,15 +80,7 @@ EXPORT_SYMBOL(fb_get_color_depth); */ void fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height) { - int i, j; - - for (i = height; i--; ) { - /* s_pitch is a few bytes at the most, memcpy is suboptimal */ - for (j = 0; j < s_pitch; j++) - dst[j] = src[j]; - src += s_pitch; - dst += d_pitch; - } + __fb_pad_aligned_buffer(dst, d_pitch, src, s_pitch, height); } EXPORT_SYMBOL(fb_pad_aligned_buffer); diff -uprN -X linux/Documentation/dontdiff -x '*.bak' -x '*.ctx' linuxorig/include/linux/fb.h linux/include/linux/fb.h --- linuxorig/include/linux/fb.h 2005-08-29 01:41:01.000000000 +0200 +++ linux/include/linux/fb.h 2005-08-31 12:45:08.000000000 +0200 @@ -824,6 +824,38 @@ extern int fb_get_color_depth(struct fb_ extern int fb_get_options(char *name, char **option); extern int fb_new_modelist(struct fb_info *info); + +/* + * Don't change without testing performance of framebuffer + * bitblitting. Inlining is necessary for performance reasons. + * Although the code might not _look_ fast because of some + * multiplications, it really _is_ fast as it is easier for gcc + * to optimize well. + */ + +static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, + u32 s_pitch, u32 height) +{ + int i, j; + + if (likely(s_pitch==1)) + for(i=0; i < height; i++) + dst[d_pitch*i] = src[i]; + else if (s_pitch==2) + for(i=0; i < height; i++) { + *(u16 *)dst = ((u16 *)src)[i]; + dst += d_pitch; + } + else { + d_pitch -= s_pitch; + for (i = height; i--; ) { + for (j = 0; j < s_pitch; j++) + *dst++ = *src++; + dst += d_pitch; + } + } +} + extern struct fb_info *registered_fb[FB_MAX]; extern int num_registered_fb; |
From: Roman Z. <zi...@li...> - 2005-08-31 17:15:37
|
Hi, On Wed, 31 Aug 2005, Knut Petersen wrote: > +static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, + > u32 s_pitch, u32 height) > +{ > + int i, j; > + > + if (likely(s_pitch==1)) > + for(i=0; i < height; i++) > + dst[d_pitch*i] = src[i]; > + else if (s_pitch==2) > + for(i=0; i < height; i++) { > + *(u16 *)dst = ((u16 *)src)[i]; > + dst += d_pitch; > + } > + else { > + d_pitch -= s_pitch; > + for (i = height; i--; ) { > + for (j = 0; j < s_pitch; j++) > + *dst++ = *src++; > + dst += d_pitch; > + } > + } > +} Why did you add the multiply back? You have now 3 slightly different variants of the same, which isn't really an improvement. In my example I showed you how to generate the first and last version from the same source. If you also want to optimize for other sizes, you might want to always inline the function, if the function call overhead is the largest part anyway, the special case for 2 bytes might not be needed anymore. BTW this version saves another condition: static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height) { int i, j; d_pitch -= s_pitch; i = height; do { /* s_pitch is a few bytes at the most, memcpy is suboptimal */ j = s_pitch; do *dst++ = *src++; while (--j > 0); dst += d_pitch; } while (--i > 0); } bye, Roman |
From: Knut P. <Knu...@t-...> - 2005-08-31 19:16:35
|
Hi Roman! >>+static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, + >>u32 s_pitch, u32 height) >>+{ >>+ int i, j; >>+ >>+ if (likely(s_pitch==1)) >>+ for(i=0; i < height; i++) >>+ dst[d_pitch*i] = src[i]; >> >> I added the multiply back because gcc (v. 3.3.4) does generate the fastest code if I write it this way. I compiled, inspected the generated assembly and benchmarked about a dozend variations of the code (benchmark as previously described). The special case for s_pitch == 1 saves about 10 ms system time (770 ms -> 760 ms) The special case for s_pitch == 2 saves about 270 ms system time (2120 -> 1850ms) with a 16x30 font. The third case is for even bigger fonts ... I believe that it will not often be used but something like that must be present. >You have now 3 slightly different variants of the same, which isn't really >an improvement. In my example I showed you how to generate the first and >last version from the same source. > The first version will only be generated when gcc can be sure that s_pitch is 1. Therefore you had to explicitly call __fb_pad_aligned_buffer with that value: if (likely(idx == 1)) __fb_pad_aligned_buffer(dst, pitch, src, 1, image.height); else fb_pad_aligned_buffer(dst, pitch, src, idx, image.height); With the version I propose it´s enough to write __fb_pad_aligned_buffer(dst, pitch, src, idx, image.height); instead, and you will get good performance for all cases. If the value of idx/s_pitch is know at compile time, the compiler can and will ignore the other cases. >If you also want to optimize for other sizes, you might want to always >inline the function, if the function call overhead is the largest part >anyway, the special case for 2 bytes might not be needed anymore. > > fb_pad_aligned_buffer() is usefull to save some space in cases like softcursor. It´s also used by some drivers (nvidia and riva), but the authors of those drivers have to decide if they prefer the inlined version or the version fixed in fbmem. >BTW this version saves another condition: > >static inline void __fb_pad_aligned_buffer(u8 *dst, u32 d_pitch, u8 *src, u32 s_pitch, u32 height) >{ > int i, j; > > d_pitch -= s_pitch; > i = height; > do { > /* s_pitch is a few bytes at the most, memcpy is suboptimal */ > j = s_pitch; > do > *dst++ = *src++; > while (--j > 0); > dst += d_pitch; > } while (--i > 0); >} > I tested that code, together with the followig code in but_putcs(): if(idx==1) __fb_pad_aligned_buffer(dst, pitch, src, 1, image.height); else if (idx==2) __fb_pad_aligned_buffer(dst, pitch, src, 2, image.height); else fb_pad_aligned_buffer(dst, pitch, src, idx, image.height); dst += width; It´s as fast/slow as your previous version, the measurements are almost identical. Let´s summarize: Your version of __fb_pad_aligned_buffer looks much better, but it needs not so nice conditionals when used. My version looks bad, but it is easier to use and it is _faster_. cu, knut |
From: Roman Z. <zi...@li...> - 2005-08-31 19:34:40
|
Hi, On Wed, 31 Aug 2005, Knut Petersen wrote: > I added the multiply back because gcc (v. 3.3.4) does generate the fastes= t > code > if I write it this way. The multiply is not generally faster, so your version may be the fastest,= =20 but in other situations it will be a lot slower. My version is generally=20 fast. > The special case for s_pitch =3D=3D 2 saves about 270 ms system time (212= 0 -> > 1850ms) > with a 16x30 font. Compared to what? How much is the function call overhead? > It=B4s as fast/slow as your previous version, the measurements are almost > identical. The generated code is a bit smaller, so it mostly depends on the icache to= =20 see a difference. bye, Roman |
From: Knut P. <Knu...@t-...> - 2005-08-31 19:49:22
|
>>The special case for s_pitch == 2 saves about 270 ms system time (2120 -> >>1850ms) >>with a 16x30 font. >> >> >Compared to what? How much is the function call overhead? > > > Your version of the inline code inserted after an if (idx==2) in bit_putcs against my version of the inline code. cu, knut |
From: Geert U. <ge...@li...> - 2005-08-30 20:00:21
|
On Tue, 30 Aug 2005, Knut Petersen wrote: > > Probably you can make it even faster by avoiding the multiplication, = like > >=20 > > unsigned int offset =3D 0; > > for (i =3D 0; i < image.height; i++) { > > dst[offset] =3D src[i]; > > offset +=3D pitch; > > } >=20 > More than two decades ago I learned to avoid mul and imul. Use shifts, = add and > lea instead, > that was the credo those days. The name of the game was CP/M 80/86, a86= , d86 > and ddt ;-) >=20 > But let=B4s get serious again. On modern CPUs, a multiplication indeed takes 1 cycle, just like an addit= ion. But on older CPUs (still supported by Linux), this is not true. > Your proposed change of the patch results in a 21 ms performance decrea= se on > my system. > Yes, I do know that this is hard to believe. I tested a similar variati= on > before, and the results > were even worse. >=20 > Avoiding mul is a good idea in assembly language today, but often it is= better > to write a > multiplication with the loop counter in C and not to introduce an extr= a > variable instead. The > compiler will optimize the code and it=B4s easier for gcc without that = extra > variable. But you are right. On actual inspection of the generated assembly code fo= r a very simple test case, it turns out both (m68k-linux-)gcc 2.95.2 and 3.3.= 3 are smart enough to convert the multiplication to an addition... And interestingly, if I avoid the multiplication explicitly, gcc 2.95.2 s= till generates the same code, but 3.3.3 adds a few extra instructions to save/restore local vars. So this probably explains why it turned out to b= e slower for you. Ugh... Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m6= 8k.org In personal conversations with technical people, I call myself a hacker. = But when I'm talking to journalists I just say "programmer" or something like= that. -- Linus Torvalds |
From: Antonino A. D. <ad...@gm...> - 2005-08-31 01:15:21
|
Knut Petersen wrote: > fb_pad_aligned_buffer() is also slower for those cases. But does anybody > use such fonts? Yes, there are 16x30 fonts out there in the wild. Tony |
From: Antonino A. D. <ad...@gm...> - 2005-08-31 06:43:20
|
Roman Zippel wrote: > Hi, > > On Wed, 31 Aug 2005, Knut Petersen wrote: > >> How could I make it an inline function? It is used in console/bitblit.c, >> nvidia/nvidia.c, >> riva/fbdev.c and softcursor.c. > > Something like below, which has the advantange that there is still only > one implementation of the function and if it's still slower, we really > need to check the compiler. > I do get better numbers with this, not much, but better than Knut's (5ms), and definitely much better than the old uninlined one (100ms). Andrew, don't get this yet. I'll incorporate this with the bit_putcs() breakup that I'm currently doing. Roman, okay if you have a 'Signed-off-by' line? Tony |
From: Roman Z. <zi...@li...> - 2005-08-31 15:50:12
|
Hi, On Wed, 31 Aug 2005, Antonino A. Daplas wrote: > Roman, okay if you have a 'Signed-off-by' line? Okay. bye, Roman |
From: Antonino A. D. <ad...@gm...> - 2005-08-31 01:15:09
|
Knut Petersen wrote: > This trivial patch gives a performance boost to the framebuffer console > > Constructing the bitmaps that are given to the bitblit functions of the > framebuffer > drivers is time consuming. Here we avoide a call to the slow > fb_pad_aligned_buffer(). > The patch replaces that call with a simple but much more efficient > bytewise copy. > > The kernel spends a significant time at this place if you use 8x* fonts. > Every > pixel displayed on your screen is prepared here. > > Some benchmark results: > > Displaying a file of 2000 lines with 160 characters each takes 889 ms > system > time using cyblafb on my system (I´m using a 1280x1024 video mode, > resulting in a 160x64 character console) > > Displaying the same file with the enclosed patch applied to 2.6.13 only > takes > 760 ms system time, saving 129 ms or 14.5%. Where did this 14.5% come from? Is it bit_putcs alone or is more real world, such as 'time cat text_file'? If I'm to guess, a large percent of the improvement is caused by the inlining of the code. I'm not against the patch, it will benefit drivers with very fast imageblits. However, since most drivers have imageblits done in software, a large proportion of the processing time will go to the imageblit itself, so I don't think you'll get that high a number (I get only a 4% improvement, and this is in a driver with accelerated blits, and it will probably be lower, ie, in vesafb). Anyway, with the addition of your patch, bit_putcs has now reached an 'ugliness threshhold' for a revamp. Tony |