|
From: Antonino D. <ad...@po...> - 2003-02-23 04:42:05
Attachments:
tileblit1.diff
cfb_tileops.diff
|
Hi James, If fbcon will support different methods of rasterizing to the framebuffer, it's best if display->dispsw is revived. Currently, fbcon supports classic blitting, but if you decide to add text mode support (for matroxfb and sbusfb), then tile blitting needs to be supported too. Also, rotation will require another method, and so will texture mapping if you ever decide to add this (Isn't this a bit overkill?!). The struct display is not visible to the drivers, so fbcon need to know which dispsw to use, possibly based on a new field in struct fb_info or the presence of the function pointer. I prefer an info->caps field so drivers need not mangle function pointers. Here's an updated Tile Blitting patch. The feature is selectable via kernel configuration (under Advanced Low Level Driver Features). Currently, it detects the presence of info->tileops but bit setting info->caps should be better. The second patch adds generic tile blitting operation, to be used only for testing purposes since it's very slow. It just wraps fb_imageblit and fb_copyarea. If anyone wants to test this, just link the driver cfb_tileops.o, add "info->tileops = &cfb_tileops" during driver initialization, and enable Tile Blitting support in the kernel. Tony PS: reverse the "accel_putcs() optimazation patch" if applied |
|
From: Antonino D. <ad...@po...> - 2003-02-23 07:42:57
Attachments:
tileblit1.diff
|
On Sun, 2003-02-23 at 12:42, Antonino Daplas wrote: > Here's an updated Tile Blitting patch. The feature is selectable via > kernel configuration (under Advanced Low Level Driver Features). > Currently, it detects the presence of info->tileops but bit setting > info->caps should be better. Stupid me :-(, the patch does not work properly. Try this one instead. Tony |
|
From: Antonino D. <ad...@po...> - 2003-02-23 11:07:29
|
On Sun, 2003-02-23 at 15:43, Antonino Daplas wrote: > On Sun, 2003-02-23 at 12:42, Antonino Daplas wrote: > > > Here's an updated Tile Blitting patch. The feature is selectable via > > kernel configuration (under Advanced Low Level Driver Features). > > Currently, it detects the presence of info->tileops but bit setting > > info->caps should be better. > > Stupid me :-(, the patch does not work properly. Try this one instead. > The optimization logic is flawed :-( Please apply this patch also. Tony diff -Naur linux-2.5.61-fbdev/drivers/video/console/fbcon_accelops.c linux-2.5.61/drivers/video/console/fbcon_accelops.c --- linux-2.5.61-fbdev/drivers/video/console/fbcon_accelops.c 2003-02-23 18:47:53.000000000 +0800 +++ linux-2.5.61/drivers/video/console/fbcon_accelops.c 2003-02-23 18:52:23.000000000 +0800 @@ -113,13 +113,16 @@ src++; } dst[idx] &= mask; - dst[idx] |= *src++ >> shift_low; + dst[idx] |= *src >> shift_low; + if (shift_high < mod) + dst[idx+1] = *src << shift_high; + src++; dst += pitch; } shift_low += mod; + dst0 += (shift_low >= 8) ? width : width - 1; shift_low &= 7; shift_high = 8 - shift_low; - dst0 += (shift_low) ? width - 1 : width; } info->fbops->fb_imageblit(info, image); |
|
From: James S. <jsi...@in...> - 2003-02-26 20:12:57
|
Boy this has been tricky to handle. I have been thinking about how to
handle image blitting from normal memory to texture mappings to tiles.
Then after that I have to make it abstract to fit all these models.
Pretty much I have come to the conclusion that we have two models.
Model 1: Consistent mappings.
In this model we allocate one time a buffer to store image data.
The fbcon classic is loadfont. It could also be creating texures
and saving it in a permentant texture map buffer that is present
on the card. Same for tiles. We create a bunch of tiles and save
them somewhere. We then use a index of some kind later to draw
the image.
Model 2: Streaming mappings.
This model has us create a temporary memory pool to store data
then draw it. After drawing is complete we release the memory.
As you can see the standard imageblit function falls into model 2. At
present we allocate a static buffer :-( Now for a PCI DMA based card we
want a hook to allocate a chunck of memory via pci_alloc_consittent. To
free the memory we use pci_free_consistent. Also for AGP there could be
hooks just for it. So model 2 can be broken into 2 parts.
A) Memory mangement. We first allocate the memory needed. After drawing
the image free the memory.
B) Draw the image. This occurs between the two events in A.
Now for model 1. A example would be the matrox fbdev driver for fast
fonts. We allocate a region to store the font images before using them.
This buffer is pretty much permanent. Then we use indices to tell which
section of the allocated buffer to have the graphics engine use.
So this is the model I have come up with. The tricky part I haven't
figured out yet is working with irqs and how to relate fbcon with it.
Comments?
|
|
From: Antonino D. <ad...@po...> - 2003-02-27 00:34:28
|
On Thu, 2003-02-27 at 04:11, James Simmons wrote:
>
> Boy this has been tricky to handle. I have been thinking about how to
> handle image blitting from normal memory to texture mappings to tiles.
> Then after that I have to make it abstract to fit all these models.
> Pretty much I have come to the conclusion that we have two models.
>
> Model 1: Consistent mappings.
>
> In this model we allocate one time a buffer to store image data.
> The fbcon classic is loadfont. It could also be creating texures
> and saving it in a permentant texture map buffer that is present
> on the card. Same for tiles. We create a bunch of tiles and save
> them somewhere. We then use a index of some kind later to draw
> the image.
>
> Model 2: Streaming mappings.
>
> This model has us create a temporary memory pool to store data
> then draw it. After drawing is complete we release the memory.
>
>
> As you can see the standard imageblit function falls into model 2. At
> present we allocate a static buffer :-( Now for a PCI DMA based card we
> want a hook to allocate a chunck of memory via pci_alloc_consittent. To
> free the memory we use pci_free_consistent. Also for AGP there could be
> hooks just for it. So model 2 can be broken into 2 parts.
>
> A) Memory mangement. We first allocate the memory needed. After drawing
> the image free the memory.
>
> B) Draw the image. This occurs between the two events in A.
>
For model 2, do we have to allocate/deallocate per iteration? First, we
are not dealing with large-sized bitmaps here (a single character at
most will have 64 bytes). Secondly, we cannot deallocate the memory
until the GPU is done rendering. This means we have to synchronize for
each imageblit, further slowing it down. Modern GPU's have deep
pipelines, let's take advantage of it.
So, how about letting the driver allocate the memory for us, and this
will last throughout the lifetime of the driver? This also becomes a
consistent mapping. The main difference is, we treat this memory as a a
ringbuffer, ie:
Memory is at address p, size N.
The first bitmap, in terms of time of arrival, (bitmap1) will be at 'p',
bitmap2 at 'p+size1', bitmap 3 at 'p+size1+size2' and so on and so
forth. Once fbcon reaches the end of the buffer, 'p+N', it calls
fb_sync() and start all over again, at 'p'.
The advantages of the above are:
1. no need to allocate/deallocate memory which is disproportionately
more expansive relative to the bitmap sizes fbcon is dealing with.
2. no chance of memory becoming unavailable during
memory-starved/emergency states.
3. the whole process is very fast and asynchronous. The GPU can be
rendering, while the CPU is preparing the bitmap. The only time fbcon
synchronizes is during the "wrap-around".
This is actually the initial patch that I submitted to you months ago,
but you rejected it. That's why I came up with the simpler
implementation (statically allocated buffer). As Geert and DaveM has
mentioned to me, the current implementation might not be thread-safe
(although I see more of a concurrency problem between CPU and GPU).
Thus, the restriction that the buffer must be completely copied by the
driver before returning. And because of this restriction, an extra copy
which might be unnecessary cannot be avoided (this was noted by Petr).
Treating the buffer as a ringbuffer, we eliminate these restrictions.
So:
struct fb_pixmap {
__u8 *addr;
__u32 size;
__u32 tail;
__u32 buf_align;
__u32 scan_align;
__u32 flags;
}
a. addr - pointer to memory
b. tail - this is the current offset to the buffer
c. buf_align - start alignment per bitmap
d. scan_align - alignment for each scanline, cfb_imageblit requires 1,
i810fb, 2, rivafb and tgafb(?) 4.
e. flags = location of buffer (system or graphics/pci/dma) so fbcon can
choose how to access the memory.
The structure is prepared by the driver at initialization. If it chooses
not too, addr should be NULL and fbcon will just allocate memory for it,
and use default values (size = 8K, buf_align = 1, scan_align = 1, flags
= system).
Tony
|
|
From: James S. <jsi...@in...> - 2003-02-27 01:19:09
|
> For model 2, do we have to allocate/deallocate per iteration?
I'm looking at the streaming DMA model used by filesystem buffers being
written/read by a SCSI device. You are right tho. If you use pci_dma_sync
then you don't need to create and remove the mappings. The question is
there any hardware to limited where we have to create/destory mapping
constantly?
> Secondly, we cannot deallocate the memory
> until the GPU is done rendering. This means we have to synchronize for
> each imageblit, further slowing it down. Modern GPU's have deep
> pipelines, let's take advantage of it.
Okay good point.
> So, how about letting the driver allocate the memory for us, and this
> will last throughout the lifetime of the driver? This also becomes a
> consistent mapping. The main difference is, we treat this memory as a a
> ringbuffer, ie:
>
> Memory is at address p, size N.
>
> The first bitmap, in terms of time of arrival, (bitmap1) will be at 'p',
> bitmap2 at 'p+size1', bitmap 3 at 'p+size1+size2' and so on and so
> forth. Once fbcon reaches the end of the buffer, 'p+N', it calls
> fb_sync() and start all over again, at 'p'.
>
> The advantages of the above are:
>
> 1. no need to allocate/deallocate memory which is disproportionately
> more expansive relative to the bitmap sizes fbcon is dealing with.
>
> 2. no chance of memory becoming unavailable during
> memory-starved/emergency states.
Very good.
> 3. the whole process is very fast and asynchronous. The GPU can be
> rendering, while the CPU is preparing the bitmap. The only time fbcon
> synchronizes is during the "wrap-around".
> This is actually the initial patch that I submitted to you months ago,
> but you rejected it.
Well I was wrong :-( I rejected because I was hoping to keep the api
object orientated (rectangle and bitmap/pixmaps). Now I see that without
this kind of solution we end up with a bigger mess. I admit I made the
wrong judgement call on this.
> As Geert and DaveM has
> mentioned to me, the current implementation might not be thread-safe
> (although I see more of a concurrency problem between CPU and GPU).
I agree I see more of a problem with CPU GPU syncing issue. I do have a
fix in BK with allocating and deallocating continuely but it is the wrong
approach.
> Thus, the restriction that the buffer must be completely copied by the
> driver before returning. And because of this restriction, an extra copy
> which might be unnecessary cannot be avoided (this was noted by Petr).
>
> Treating the buffer as a ringbuffer, we eliminate these restrictions.
I didn't realize that the below was a ringbuffer implementation. The name
threw me off.
> So:
>
> struct fb_pixmap {
> __u8 *addr;
> __u32 size;
> __u32 tail;
> __u32 buf_align;
> __u32 scan_align;
> __u32 flags;
> }
>
> a. addr - pointer to memory
>
> b. tail - this is the current offset to the buffer
>
> c. buf_align - start alignment per bitmap
>
> d. scan_align - alignment for each scanline, cfb_imageblit requires 1,
> i810fb, 2, rivafb and tgafb(?) 4.
>
> e. flags = location of buffer (system or graphics/pci/dma) so fbcon can
> choose how to access the memory.
>
> The structure is prepared by the driver at initialization. If it chooses
> not too, addr should be NULL and fbcon will just allocate memory for it,
> and use default values (size = 8K, buf_align = 1, scan_align = 1, flags
> = system).
Do you still have the original patch?
|
|
From: Antonino D. <ad...@po...> - 2003-02-27 14:14:26
Attachments:
pixmap2.diff
rivafb_pixmap.diff
|
On Thu, 2003-02-27 at 09:18, James Simmons wrote:
> > As Geert and DaveM has
> > mentioned to me, the current implementation might not be thread-safe
> > (although I see more of a concurrency problem between CPU and GPU).
>
> I agree I see more of a problem with CPU GPU syncing issue. I do have a
> fix in BK with allocating and deallocating continuely but it is the wrong
> approach.
>
We can avoid concurrency problems by assigning different sections of the
buffer per call to fb_imageblit(). There is really no need to map/unmap
or allocate/deallocate buffers unless you do not trust the drivers to
behave :-). The buffer will not be exposed to userland, unlike DRM
which has to implement "map->user access->unmap->pass to hardware".
> > Thus, the restriction that the buffer must be completely copied by the
> > driver before returning. And because of this restriction, an extra copy
> > which might be unnecessary cannot be avoided (this was noted by Petr).
> >
> > Treating the buffer as a ringbuffer, we eliminate these restrictions.
>
> I didn't realize that the below was a ringbuffer implementation. The name
> threw me off.
Well, it's not strictly a ringbuffer implementation. This would require
a head and tail pointer where fbcon will adjust the tail and the
driver/hardware will adjust the head. This will be very difficult to
implement in a device independent manner. So we just cheat by issuing
an fb_sync() per loop to flush all pending commands.
>
> Do you still have the original patch?
>
Here's a revised one. Driver's can choose to fill up the following
structure, or leave it empty:
#define FB_PIXMAP_DEFAULT 1 /* used internally by fbcon */
#define FB_PIXMAP_SYSTEM 2 /* memory is in system RAM */
#define FB_PIXMAP_IO 4 /* memory is iomapped */
#define FB_PIXMAP_SYNC 256 /* set if GPU can DMA */
struct fb_pixmap {
__u8 *addr;
__u32 size;
__u32 offset;
__u32 buf_align;
__u32 scan_align;
__u32 flags;
void (*outbuf)(u8 dst, u8 *addr);
u8 (*inbuf) (u8 *addr);
unsigned long lock_flags;
spinlock_t lock;
};
The buffer can be anywhere, system or io. If it's in special memory
(ie, offscreen graphics), access methods must be specified (outbuf,
inbuf). If the buffer is DMA'able by the GPU, then FB_PIXMAP_SYNC must
be set (it issues an fb_sync()), otherwise leave it cleared (ie soft
accels). The buf_align and scan_align are hardware specific. This will
let fbcon format the bitmap into a form acceptable by the hardware. The
modified rivafb sets the alignment according to its needs, which greatly
simplified the rivafb_imageblit() function.
The spinlock may be necessary because fbcon_cursor, which is called via
timer or interrupt, might also use fb_imageblit(). You can change it to
a more appropriate locking method if you want.
The patch should work without driver breakage.
Diff is against linux-2.5.61 + your fbdev.diff.gz + my accel_putcs
optimization patch + Geert's logo updates. I know you already applied
them all in your tree.
Tony
|
|
From: Michel <mi...@da...> - 2003-02-27 18:25:10
|
On Don, 2003-02-27 at 15:15, Antonino Daplas wrote: > On Thu, 2003-02-27 at 09:18, James Simmons wrote: > > > > Thus, the restriction that the buffer must be completely copied by the > > > driver before returning. And because of this restriction, an extra copy > > > which might be unnecessary cannot be avoided (this was noted by Petr). > > > > > > Treating the buffer as a ringbuffer, we eliminate these restrictions. > > > > I didn't realize that the below was a ringbuffer implementation. The name > > threw me off. > > Well, it's not strictly a ringbuffer implementation. This would require > a head and tail pointer where fbcon will adjust the tail and the > driver/hardware will adjust the head. This will be very difficult to > implement in a device independent manner. So we just cheat by issuing > an fb_sync() per loop to flush all pending commands. That still seems suboptimal though. What the DRM often does is have the chip write an age value to a scratch register when it's done processing something. Maybe something like that could be used to avoid waiting for the chip to go idle at all? -- Earthling Michel Dänzer (MrCooper)/ Debian GNU/Linux (powerpc) developer XFree86 and DRI project member / CS student, Free Software enthusiast |
|
From: James S. <jsi...@in...> - 2003-02-27 19:49:57
|
> On Don, 2003-02-27 at 15:15, Antonino Daplas wrote: > > On Thu, 2003-02-27 at 09:18, James Simmons wrote: > > > > > > Thus, the restriction that the buffer must be completely copied by the > > > > driver before returning. And because of this restriction, an extra copy > > > > which might be unnecessary cannot be avoided (this was noted by Petr). > > > > > > > > Treating the buffer as a ringbuffer, we eliminate these restrictions. > > > > > > I didn't realize that the below was a ringbuffer implementation. The name > > > threw me off. > > > > Well, it's not strictly a ringbuffer implementation. This would require > > a head and tail pointer where fbcon will adjust the tail and the > > driver/hardware will adjust the head. This will be very difficult to > > implement in a device independent manner. So we just cheat by issuing > > an fb_sync() per loop to flush all pending commands. > > That still seems suboptimal though. What the DRM often does is have the > chip write an age value to a scratch register when it's done processing > something. Maybe something like that could be used to avoid waiting for > the chip to go idle at all? Don't waste your time. I'm removing all the changes that have been done since 2.5.51. After that I will no longer be co-maintainter of the framebuffer layer. |
|
From: Geert U. <ge...@li...> - 2003-03-02 12:17:48
|
On Thu, 27 Feb 2003, James Simmons wrote:
> > On Don, 2003-02-27 at 15:15, Antonino Daplas wrote:
> > > On Thu, 2003-02-27 at 09:18, James Simmons wrote:
> > >
> > > > > Thus, the restriction that the buffer must be completely copied by the
> > > > > driver before returning. And because of this restriction, an extra copy
> > > > > which might be unnecessary cannot be avoided (this was noted by Petr).
> > > > >
> > > > > Treating the buffer as a ringbuffer, we eliminate these restrictions.
> > > >
> > > > I didn't realize that the below was a ringbuffer implementation. The name
> > > > threw me off.
> > >
> > > Well, it's not strictly a ringbuffer implementation. This would require
> > > a head and tail pointer where fbcon will adjust the tail and the
> > > driver/hardware will adjust the head. This will be very difficult to
> > > implement in a device independent manner. So we just cheat by issuing
> > > an fb_sync() per loop to flush all pending commands.
> >
> > That still seems suboptimal though. What the DRM often does is have the
> > chip write an age value to a scratch register when it's done processing
> > something. Maybe something like that could be used to avoid waiting for
> > the chip to go idle at all?
>
> Don't waste your time. I'm removing all the changes that have been done
> since 2.5.51. After that I will no longer be co-maintainter of the
> framebuffer layer.
Are you sure about that?!?!?
What about adding a pointer to a
struct con_ops {
set_font();
putc();
putcs();
}
to struct fb_info as an interim solution, until tile blitting has matured? That
would make Petr (matroxfb) and Davem (sbusfb) happy?
Gr{oetje,eeting}s,
Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@li...
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds
|
|
From: Antonino D. <ad...@po...> - 2003-03-03 21:31:14
|
On Sun, 2003-03-02 at 20:14, Geert Uytterhoeven wrote: > On Thu, 27 Feb 2003, James Simmons wrote: > > > On Don, 2003-02-27 at 15:15, Antonino Daplas wrote: > > > > On Thu, 2003-02-27 at 09:18, James Simmons wrote: > > > > > > > > > > Thus, the restriction that the buffer must be completely copied by the > > > > > > driver before returning. And because of this restriction, an extra copy > > > > > > which might be unnecessary cannot be avoided (this was noted by Petr). > > > > > > > > > > > > Treating the buffer as a ringbuffer, we eliminate these restrictions. > > > > > > > > > > I didn't realize that the below was a ringbuffer implementation. The name > > > > > threw me off. > > > > > > > > Well, it's not strictly a ringbuffer implementation. This would require > > > > a head and tail pointer where fbcon will adjust the tail and the > > > > driver/hardware will adjust the head. This will be very difficult to > > > > implement in a device independent manner. So we just cheat by issuing > > > > an fb_sync() per loop to flush all pending commands. > > > > > > That still seems suboptimal though. What the DRM often does is have the > > > chip write an age value to a scratch register when it's done processing > > > something. Maybe something like that could be used to avoid waiting for > > > the chip to go idle at all? > > > > Don't waste your time. I'm removing all the changes that have been done > > since 2.5.51. After that I will no longer be co-maintainter of the > > framebuffer layer. > > Are you sure about that?!?!? > I agree. Actually, I'm not worried about Tile blitting. It should be mature since it's basically a rehash of the 2.4 API. If I'm going to worry about something, it's the standard accel_putcs() code, especially if thread-safety is an issue. Is thread-safety a real problem? That was also my concern before, which is why I looked at vt.c. From what I can see, there's always an "acquire_console_sem()" before calling any of the console methods. The console semaphore is supposed to guarantee serialization and exclusive access to the console (see linux/kernel/printk.c). Tony |
|
From: Antonino D. <ad...@po...> - 2003-02-27 21:46:13
|
On Fri, 2003-02-28 at 02:25, Michel D=E4nzer wrote: > On Don, 2003-02-27 at 15:15, Antonino Daplas wrote:=20 > > On Thu, 2003-02-27 at 09:18, James Simmons wrote: > > =20 > > > > Thus, the restriction that the buffer must be completely copied by = the > > > > driver before returning. And because of this restriction, an extra= copy > > > > which might be unnecessary cannot be avoided (this was noted by Pet= r). > > > >=20 > > > > Treating the buffer as a ringbuffer, we eliminate these restriction= s. > > >=20 > > > I didn't realize that the below was a ringbuffer implementation. The = name > > > threw me off.=20 > >=20 > > Well, it's not strictly a ringbuffer implementation. This would requir= e > > a head and tail pointer where fbcon will adjust the tail and the > > driver/hardware will adjust the head. This will be very difficult to > > implement in a device independent manner. So we just cheat by issuing > > an fb_sync() per loop to flush all pending commands. >=20 > That still seems suboptimal though. What the DRM often does is have the > chip write an age value to a scratch register when it's done processing > something. Maybe something like that could be used to avoid waiting for > the chip to go idle at all? >=20 Yes, it is certainly suboptimal. But the fb_sync() will be done only when the pointer wraps to the beginning of the buffer and only if the buffer can be DMA'd by the chip. To support these types of buffers, we can change pixmap.offset to head and tail. The buffer is empty when head =3D=3D tail. Then, fbcon continually adjusts the tail to tail+size, and when the chip is done processing that particular bitmap, it will adjust the head to head+size. Then instead of doing an fb_sync(), fbcon will just busy loop waiting for the buffer to have enough space for the next set of bitmaps. This will involve per chipset support, the reason why I did not do it like this. It can be added, but it will not be default behavior. PS: I remember that practically all drivers do some form of syncing before each operation, and no one complained :-). Tony =20 =20 |