From: Ian R. <id...@us...> - 2008-05-19 09:23:52
Keith Packard wrote:
| On Mon, 2008-05-19 at 00:14 -0700, Ian Romanick wrote:
|
|> - I'm pretty sure that the read_domain = GPU, write_domain = CPU case
|> needs to be handled.  I know of at least one piece of hardware with a
|> kooky command buffer that wants to be used that way.
|
| Oh, so mapping the same command buffer for both activities.
|
| For Intel, we use batch buffers written with the CPU and queued to the
| GPU by the kernel, using suitable flushing to get data written to memory
| before the GPU is asked to read it.
|
| It could be that this 'command domain' just needs to be separate, and
| mapped coherent between GPU and CPU so that this works.
|
| However, instead of messing with the API on some theoretical hardware,
| I'm really only interested in seeing how the API fits to actual
| hardware.  Having someone look at how a gem-like API would work on Radeon
| or nVidia hardware would go a long ways to exploring what pieces are
| general purpose and which are UMA- (or even Intel-) specific.

Sorry for being subtle.  It isn't theoretical hardware.  It's XP10.  It
uses a weird linked-list mechanism for commands.  Each command has a
header that contains a pointer to the next command and a flag.  The flag
says whether the command is valid or an end-of-list sentinel.  The driver
can then keep linking in new commands and changing sentinels into commands
while the hardware is running.  I'd have to go back and look, but I think
MGA would work well with a similar domain setting.

|> - I suspect that in the (near) future we may want multiple read_domains.
|
| That's why the argument is called 'read_domains' and not 'read_domain'.
| We already have operations that read objects to both the sampler and
| render caches.

Ah, so it is.  It wasn't clear from the document that the domain values
are bit flags; they read more like enums.

|> - I think drm_i915_gem_relocation_entry should have a "size" field.
|> There are a lot of cases in the current GL API (and more to come) where
|> the entire object will trivially not be used.  Clamped LOD on textures
|> is a trivial example, but others exist as well.

The specific situation I was thinking of above is where a 2048x2048
mipmapped texture has been evicted from texturable memory.  The LOD range
of the card is clamped so that only the 512x512 mipmap will be used
(imagine doing render-to-texture to generate the 256x256 mipmap from the
512x512).  Having both an offset and a size allows the memory manager to
bring back in only the required subset of the texture.

| There are a couple of places where this might be useful (presumably both
| offset and length); the 'set_domain' operation seems like one of them,
| and if we place it there, then other places where domain information is
| passed across the API might be good places to include that as well.
|
| The most obvious benefit here is reducing clflush action as we flip
| buffers from GPU to CPU for fallbacks; however, flipping objects back
| and forth should be avoided anyway, eliminating this kind of fallback
| would provide more performance benefit than making the fallback a bit
| faster.
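To make the offset/size idea concrete, what I have in mind is just a byte
range attached to the per-object domain information.  The struct below is
purely illustrative (the field names are invented and this is not a
proposal for the real ioctl layout):

    #include <stdint.h>

    /* Illustrative only: a set_domain-style request that also names the
     * byte range the caller actually intends to touch, so the kernel can
     * limit clflush and swap-in work to that range instead of the whole
     * object. */
    struct gem_set_domain_range {
            uint32_t handle;        /* GEM object being moved between domains */
            uint32_t read_domains;  /* bitmask of domains that will read it */
            uint32_t write_domain;  /* at most one domain that will write it */
            uint64_t offset;        /* start of the interesting range, in bytes */
            uint64_t size;          /* length of that range; 0 = whole object */
    };

In the clamped-LOD case above, that would let the kernel bring back in
only the bytes backing the 512x512 level instead of the whole 2048x2048
mipmap chain.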
There is also a bunch of up-and-coming GL functionality that you may not
be aware of that changes this picture a *LOT*:

- GL_EXT_texture_buffer_object allows a portion of a buffer object to be
  used to back a texture.

- GL_EXT_bindable_uniform allows a portion of a buffer object to be used
  to back a block of uniforms.

- GL_EXT_transform_feedback allows the output of the vertex shader or
  geometry shader to be stored to buffer objects.

- Long's Peak has functionality where a buffer object can be mapped
  *without* waiting for all previous GL commands to complete.
  GL_APPLE_flush_buffer_range has similar functionality.

- Long's Peak has NV_fence-like synchronization objects.

The usage scenario that ISVs want (and that other vendors are going to
make fast) is one where transform feedback is used to "render" a bunch of
objects to a single buffer object.  There is a fair amount of overhead in
changing all the output buffer object bindings, so ISVs don't want to be
forced to take that performance hit.  If a fence is set after each object
is sent down the pipe, the app can wait for one object to finish
rendering, map the buffer object, and operate on the data.

Similar situations can occur even without transform feedback.  Imagine an
app that is streaming data into a buffer object.  It streams in one
object (via MapBuffer), does a draw command, sets a fence, streams in the
next, etc.  When the buffer is full, it waits on the first fence and
starts back at the beginning.  Apparently a lot of ISVs want to do this.

I'm not a big fan of this usage.  It seems that nobody ever got
fire-and-forget buffer objects (the repeated cycle of allocate, fill, use,
delete) to be fast, so this is what ISVs want instead.  In other news, app
developers *hate* BufferSubData.  They much prefer to just map the buffer
and write to it or read from it.

All of this points to mapping buffers to the CPU in, on, and around GPU
usage being a very common operation.  It's also an operation that app
developers expect to be fast.
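For concreteness, the streaming pattern described above looks roughly like
this with today's API (GL 2.x buffer-object entry points plus NV_fence).
This is only a sketch: the chunk size, vertex format, and the
fill_vertices() helper are invented for illustration.

    /* Sketch only: stream vertex data into a ring of chunks within one
     * buffer object, fencing each chunk's draw so the CPU never scribbles
     * over data the GPU is still reading. */
    #define GL_GLEXT_PROTOTYPES 1
    #include <GL/gl.h>
    #include <GL/glext.h>
    #include <stdint.h>

    #define CHUNKS     4
    #define CHUNK_SIZE (36 * 1024)   /* bytes; arbitrary for the example */

    extern void fill_vertices(char *dst, unsigned len);   /* app-supplied */

    static void stream_and_draw(unsigned nframes)
    {
        GLuint vbo, fences[CHUNKS];
        unsigned i;

        glGenBuffersARB(1, &vbo);
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
        glBufferDataARB(GL_ARRAY_BUFFER_ARB, CHUNKS * CHUNK_SIZE, NULL,
                        GL_STREAM_DRAW_ARB);
        glGenFencesNV(CHUNKS, fences);
        glEnableClientState(GL_VERTEX_ARRAY);

        for (i = 0; i < nframes; i++) {
            unsigned chunk = i % CHUNKS;

            /* Once the ring wraps, wait for the oldest chunk's draw to
             * complete before reusing its storage. */
            if (i >= CHUNKS)
                glFinishFenceNV(fences[chunk]);

            /* This map is the expensive part today: it can stall on all
             * outstanding draws from the buffer, not just the chunk we
             * actually want to overwrite. */
            char *ptr = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
            fill_vertices(ptr + chunk * CHUNK_SIZE, CHUNK_SIZE);
            glUnmapBufferARB(GL_ARRAY_BUFFER_ARB);

            glVertexPointer(3, GL_FLOAT, 0,
                            (const GLvoid *)(uintptr_t)(chunk * CHUNK_SIZE));
            glDrawArrays(GL_TRIANGLES, 0, CHUNK_SIZE / (3 * sizeof(GLfloat)));
            glSetFenceNV(fences[chunk], GL_ALL_COMPLETED_NV);
        }
    }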