From: Thomas H. <th...@tu...> - 2008-05-13 16:32:06
Dave,

Could you list what fixes / changes you think are needed to get TTM into the mainline kernel?

/Thomas
From: Dave A. <ai...@li...> - 2008-05-13 20:35:24
> Dave,
>
> Could you list what fixes / changes you think are needed to get TTM into
> the mainline kernel?

Two main reasons:

1) I feel there hasn't been enough open driver coverage to prove it. So far we have done an Intel IGD; we have a lot of code that isn't required for these devices, so the question is how much code exists purely to support the Poulsbo closed-source userspace, and why we need to live with it. Both radeon and nouveau developers have expressed frustration about the fencing internals being really hard to work with, which doesn't bode well for maintainability in the future.

2) Intel have asked that we don't push i915 support upstream as they believe it isn't ready, and as they end up supporting the kernel module in the longer term I cannot go against that without a good reason. I have no other driver to push, hence stalled. I'll leave keithp to comment on this further.

Dave.
From: Keith P. <ke...@ke...> - 2008-05-14 02:10:36
On Tue, 2008-05-13 at 21:35 +0100, Dave Airlie wrote:

> 2) Intel have asked that we don't push i915 support upstream as they
> believe it isn't ready, and as they end up supporting the kernel module in
> the longer term I cannot go against that without a good reason. I have no
> other driver to push, hence stalled. I'll leave keithp to comment on this
> further.

We've spent the last couple of weeks writing a different manager for the kernel, called 'gem' (for 'graphics execution manager'). It takes the lessons we've learned from TTM and constructs just the API we need to implement the dri_bufmgr interface.

On 915, performance for openarena is 50% faster than classic (15.4 fps to 23.6 fps for a demo Eric recorded), and our favorite benchmark, glxgears, runs 60% faster (551 fps to 889 fps). The glxgears number is semi-interesting because I think it shows the bandwidth available between CPU and GPU for command execution. Performance for 965 is similar to classic mode, although we're working mostly on gen3 hardware as that's a lot easier to use, so we haven't started taking advantage of the new gem-specific APIs.

This code is not complete yet; the biggest missing feature is proper latency throttling, where we'd like to keep the ring nearly empty and pend new requests while the ring executes older ones. We should have that finished up this week, at which point I think the code will be functionally complete.

Here's the 'drm-gem.txt' document from the drm-gem branch of my drm repository ( git://people.freedesktop.org/~keithp/drm ). There are parallel drm-gem branches in my mesa and xf86-video-intel repositories.

Key features:

 * Memory is allocated using shmfs; objects not pinned to the GTT are pageable.

 * Cache synchronization is handled automatically by the kernel; for GPU->GPU object transfers, no ring stall is required.

 * Objects can be written (using pwrite) from user space. This eliminates most cache effects from clflush, as pwrite uses non-temporal stores.

 * There are no fences exposed for the Intel driver.

This document reflects the current status of the implementation.

-----

		The Graphics Execution Manager
	    Part of the Direct Rendering Manager
	    ==============================

		Keith Packard <ke...@ke...>
		Eric Anholt <er...@an...>
			 2008-5-9

Contents:

 1. GEM Overview
 2. API overview and conventions
 3. Object Creation/Destruction
 4. Reading/writing contents
 5. Mapping objects to userspace
 6. Memory Domains
 7. Execution (Intel specific)
 8. Other misc Intel-specific functions

1. Graphics Execution Manager Overview

Gem is designed to manage graphics memory, control access to the graphics device execution context and handle the essentially NUMA environment unique to modern graphics hardware. Gem allows multiple applications to share graphics device resources without the need to constantly reload the entire graphics card. Data may be shared between multiple applications with gem ensuring that the correct memory synchronization occurs.

Graphics data can consume arbitrary amounts of memory, with 3D applications constructing ever larger sets of textures and vertices. With graphics cards' memory space growing larger every year, and graphics APIs growing more complex, we can no longer insist that each application save a complete copy of their graphics state so that the card can be re-initialized from user space at each context switch. Ensuring that graphics data remains persistent across context switches allows applications significant new functionality while also improving performance for existing APIs.
Modern linux desktops include significant 3D rendering as a fundamental component of the desktop image construction process. 2D and 3D applications paint their content to offscreen storage, and the central 'compositing manager' constructs the final screen image from those window contents. This means that pixel image data from these applications must move within reach of the compositing manager and be used as source operands for screen image rendering operations.

Gem provides simple mechanisms to manage graphics data and control execution flow within the linux operating system. Using many existing kernel subsystems, it does this with a modest amount of code.

2. API Overview and Conventions

All APIs here are defined in terms of ioctls applied to the DRM file descriptor. To create and manipulate objects, an application must be 'authorized' using the DRI or DRI2 protocols with the X server. To relax that, we will need to implement some better access control mechanisms within the hardware portion of the driver to prevent inappropriate cross-application data access.

Any DRM driver which does not support GEM will return -ENODEV for all of these ioctls. Invalid object handles return -EINVAL. Invalid object names return -ENOENT. Other errors are as documented in the specific API below.

To avoid the need to translate ioctl contents on mixed-size systems (with 32-bit user space running on a 64-bit kernel), the ioctl data structures contain explicitly sized objects, using 64 bits for all size and pointer data and 32 bits for identifiers. In addition, the 64-bit objects are all carefully aligned on 64-bit boundaries. Because of this, all pointers in the ioctl data structures are passed as uint64_t values. Suitable casts will be necessary.

One significant operation which is explicitly left out of this API is object locking. Applications are expected to perform locking of shared objects outside of the GEM api. This kind of locking is not necessary to safely manipulate the graphics engine, and with multiple objects interacting in unknown ways, per-object locking would likely introduce all kinds of lock-order issues. Punting this to the application seems like the only sensible plan. Given that DRM already offers a global lock on the hardware, this doesn't change the current situation.
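As a quick illustration of these conventions, here is a minimal sketch (the structure below is a simplified stand-in for illustration, not one of the actual GEM ioctls):

	#include <stdint.h>

	/*
	 * An example payload following the conventions above: 32-bit
	 * identifiers, 64-bit sizes and pointers, and explicit padding
	 * so the 64-bit members stay aligned on 64-bit boundaries.
	 */
	struct example_ioctl_data {
		uint32_t handle;	/* 32-bit identifier */
		uint32_t pad;		/* keeps 'size' 64-bit aligned */
		uint64_t size;		/* 64-bit size */
		uint64_t data_ptr;	/* a void *, passed as uint64_t */
	};

	/*
	 * User space casts pointers through uintptr_t so the same code
	 * is correct for both 32-bit and 64-bit processes.
	 */
	data.data_ptr = (uint64_t) (uintptr_t) buf;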
3. Object Creation and Destruction

Gem provides explicit memory management primitives. System pages are allocated when the object is created, either as the fundamental storage for hardware where system memory is used by the graphics processor directly, or as backing store for graphics-processor resident memory.

Objects are referenced from user space using handles. These are, for all intents and purposes, equivalent to file descriptors. We could simply use file descriptors were it not for the small limit (1024) of file descriptors available to applications, and for the fact that the X server (a rather significant user of this API) uses 'select' and has a limited maximum file descriptor for that operation. Given the ability to allocate more file descriptors, and given the ability to place these 'higher' in the file descriptor space, we'd love to simply use file descriptors.

Objects may be published with a name so that other applications can access them. The name remains valid as long as the object exists. Right now, our DRI APIs use 32-bit integer names, so that's what we expose here.

A. Creation

	struct drm_gem_create {
		/**
		 * Requested size for the object.
		 *
		 * The (page-aligned) allocated size for the object
		 * will be returned.
		 */
		uint64_t size;

		/**
		 * Returned handle for the object.
		 *
		 * Object handles are nonzero.
		 */
		uint32_t handle;
		uint32_t pad;
	};

	/* usage */
	create.size = 16384;
	ret = ioctl (fd, DRM_IOCTL_GEM_CREATE, &create);
	if (ret == 0)
		return create.handle;

Note that the size is rounded up to a page boundary, and that the rounded-up size is returned in 'size'. No name is assigned to this object, making it local to this process. If insufficient memory is available, -ENOMEM will be returned.

B. Closing

	struct drm_gem_close {
		/** Handle of the object to be closed. */
		uint32_t handle;
		uint32_t pad;
	};

	/* usage */
	close.handle = <handle>;
	ret = ioctl (fd, DRM_IOCTL_GEM_CLOSE, &close);

This call makes the specified handle invalid, and if no other applications are using the object, any necessary graphics hardware synchronization is performed and the resources used by the object are released.

C. Naming

	struct drm_gem_flink {
		/** Handle for the object being named */
		uint32_t handle;

		/** Returned global name */
		uint32_t name;
	};

	/* usage */
	flink.handle = <handle>;
	ret = ioctl (fd, DRM_IOCTL_GEM_FLINK, &flink);
	if (ret == 0)
		return flink.name;

Flink creates a name for the object and returns it to the application. This name can be used by other applications to gain access to the same object.

D. Opening by name

	struct drm_gem_open {
		/** Name of object being opened */
		uint32_t name;

		/** Returned handle for the object */
		uint32_t handle;

		/** Returned size of the object */
		uint64_t size;
	};

	/* usage */
	open.name = <name>;
	ret = ioctl (fd, DRM_IOCTL_GEM_OPEN, &open);
	if (ret == 0) {
		*sizep = open.size;
		return open.handle;
	}

Open accesses an existing object and returns a handle for it. If the object doesn't exist, -ENOENT is returned. The size of the object is also returned. This handle has all the same capabilities as the handle used to create the object. In particular, the object is not destroyed until all handles are closed.

4. Basic read/write operations

By default, gem objects are not mapped to the application's address space; getting data in and out of them is done with I/O operations instead. This allows the data to reside in otherwise unmapped pages, including pages in video memory on an attached discrete graphics card. In addition, using explicit I/O operations allows better control over cache contents, as graphics devices are generally not cache coherent with the CPU; mapping pages used for graphics into an application address space requires the use of expensive cache flushing operations. Providing direct control over graphics data access ensures that data are handled in the most efficient possible fashion.

A. Reading

	struct drm_gem_pread {
		/** Handle for the object being read. */
		uint32_t handle;
		uint32_t pad;

		/** Offset into the object to read from */
		uint64_t offset;

		/** Length of data to read */
		uint64_t size;

		/** Pointer to write the data into. */
		uint64_t data_ptr;	/* void * */
	};

This copies data out of the specified object into the waiting user memory. Any necessary graphics device synchronization and flushing will be done automatically.

B. Writing

	struct drm_gem_pwrite {
		/** Handle for the object being written to. */
		uint32_t handle;
		uint32_t pad;

		/** Offset into the object to write to */
		uint64_t offset;

		/** Length of data to write */
		uint64_t size;

		/** Pointer to read the data from. */
		uint64_t data_ptr;	/* void * */
	};

This copies data from user memory into the specified object at the specified position. Again, device synchronization will be handled by the kernel to ensure user space sees a consistent view of the graphics device.
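Neither request has a usage snippet above, so here is a minimal sketch in the same style, assuming the ioctl names DRM_IOCTL_GEM_PREAD and DRM_IOCTL_GEM_PWRITE by analogy with the other ioctls (the names are not spelled out in this document):

	/* usage (sketch): write a buffer into an object, then read it back */
	char buf[4096];

	pwrite.handle = <handle>;
	pwrite.offset = 0;
	pwrite.size = sizeof buf;
	pwrite.data_ptr = (uint64_t) (uintptr_t) buf;
	ret = ioctl (fd, DRM_IOCTL_GEM_PWRITE, &pwrite);

	pread.handle = <handle>;
	pread.offset = 0;
	pread.size = sizeof buf;
	pread.data_ptr = (uint64_t) (uintptr_t) buf;
	ret = ioctl (fd, DRM_IOCTL_GEM_PREAD, &pread);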
5. Mapping objects to user space

For most objects, reading/writing is the preferred interaction mode. However, when the CPU is involved in rendering to cover deficiencies in hardware support for particular operations, the CPU will want to directly access the relevant objects.

Because mmap is fairly heavyweight, we allow applications to retain maps to objects persistently and then update how they're using the memory through a separate interface. Applications which fail to use this separate interface may exhibit unpredictable behaviour, as memory consistency will not be preserved.

A. Mapping

	struct drm_gem_mmap {
		/** Handle for the object being mapped. */
		uint32_t handle;
		uint32_t pad;

		/** Offset in the object to map. */
		uint64_t offset;

		/**
		 * Length of data to map.
		 *
		 * The value will be page-aligned.
		 */
		uint64_t size;

		/** Returned pointer the data was mapped at */
		uint64_t addr_ptr;	/* void * */
	};

	/* usage */
	mmap.handle = <handle>;
	mmap.offset = <offset>;
	mmap.size = <size>;
	ret = ioctl (fd, DRM_IOCTL_GEM_MMAP, &mmap);
	if (ret == 0)
		return (void *) (uintptr_t) mmap.addr_ptr;

B. Unmapping

	munmap (addr, length);

Nothing strange here, just use the normal munmap syscall.

6. Memory Domains

Graphics devices remain a strong bastion of non cache-coherent memory. As a result, accessing data through one functional unit will end up loading that cache with data which then needs to be manually synchronized when that data is used with another functional unit.

Tracking where data are resident is done by identifying how functional units deal with caches. Each cache is labeled as a separate memory domain. Then, each sequence of operations is expected to load data into various read domains and leave data in at most one write domain. Gem tracks the read and write memory domains of each object and performs the necessary synchronization operations when objects move from one domain set to another.

For example, if operation 'A' constructs an image that is immediately used by operation 'B', then when the read domain for 'B' is not the same as the write domain for 'A', the write domain must be flushed and the read domain invalidated. If these two operations are both executed in the same command queue, then the flush operation can go in between them in the same queue, avoiding any kind of CPU-based synchronization and leaving the GPU to do the work itself.

6.1 Memory Domains (GPU-independent)

 * DRM_GEM_DOMAIN_CPU. Objects in this domain are using caches which are connected to the CPU. Moving objects from non-CPU domains into the CPU domain can involve waiting for the GPU to finish with operations using this object. Moving objects from this domain to a GPU domain can involve flushing CPU caches and chipset buffers.

6.2 GPU-independent memory domain ioctl

This ioctl is independent of the GPU in use. So far, no use other than synchronizing objects to the CPU domain has been found; if that turns out to be generally true, this ioctl may be simplified further.

A. Explicit domain control

	struct drm_gem_set_domain {
		/** Handle for the object */
		uint32_t handle;

		/** New read domains */
		uint32_t read_domains;

		/** New write domain */
		uint32_t write_domain;
	};

	/* usage */
	set_domain.handle = <handle>;
	set_domain.read_domains = <read_domains>;
	set_domain.write_domain = <write_domain>;
	ret = ioctl (fd, DRM_IOCTL_GEM_SET_DOMAIN, &set_domain);

When the application wants to explicitly manage memory domains for an object, it can use this function. Usually, this is only used when the application wants to synchronize object contents between the GPU and CPU-based application rendering. In that case, the <read_domains> would be set to DRM_GEM_DOMAIN_CPU, and if the application were going to write to the object, the <write_domain> would also be set to DRM_GEM_DOMAIN_CPU.

After the call, gem guarantees that all previous rendering operations involving this object are complete. The application is then free to access the object through the address returned by the mmap call. Afterwards, when the application again uses the object through the GPU, any necessary CPU flushing will occur and the object will be correctly synchronized with the GPU.

Note that this synchronization is not required for any accesses going through the driver itself. The pread, pwrite and execbuffer ioctls all perform the necessary domain management internally. Explicit synchronization is only necessary when accessing the object through the mmap'd address.
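Putting the mapping and domain calls together, a minimal sketch of CPU access to an object (same style as the usage snippets above; error handling omitted):

	/* usage (sketch): CPU read/write of a mapped object */
	mmap.handle = <handle>;
	mmap.offset = 0;
	mmap.size = <size>;
	ret = ioctl (fd, DRM_IOCTL_GEM_MMAP, &mmap);
	ptr = (uint32_t *) (uintptr_t) mmap.addr_ptr;

	/* pull the object into the CPU domain before touching it */
	set_domain.handle = <handle>;
	set_domain.read_domains = DRM_GEM_DOMAIN_CPU;
	set_domain.write_domain = DRM_GEM_DOMAIN_CPU;
	ret = ioctl (fd, DRM_IOCTL_GEM_SET_DOMAIN, &set_domain);

	/* safe now: all previous GPU rendering to the object is complete */
	ptr[0] = 0xdeadbeef;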
7. Execution (Intel specific)

Managing the command buffers is inherently chip-specific, so the core of gem doesn't have any intrinsic functions. Rather, execution is left to the device-specific portions of the driver.

The Intel DRM_I915_GEM_EXECBUFFER ioctl takes a list of gem objects, all of which are mapped to the graphics device. The last object in the list is the command buffer.

7.1. Relocations

Command buffers often refer to other objects, and to allow the kernel driver to move objects around, a sequence of relocations is associated with each object. Device-specific relocation operations are used to place the target-object relative value into the object.

The Intel driver has a single relocation type:

	struct drm_i915_gem_relocation_entry {
		/**
		 * Handle of the buffer being pointed to by this
		 * relocation entry.
		 *
		 * It's appealing to make this be an index into the
		 * mm_validate_entry list to refer to the buffer,
		 * but this allows the driver to create a relocation
		 * list for state buffers and not re-write it per
		 * exec using the buffer.
		 */
		uint32_t target_handle;

		/**
		 * Value to be added to the offset of the target
		 * buffer to make up the relocation entry.
		 */
		uint32_t delta;

		/**
		 * Offset in the buffer the relocation entry will be
		 * written into
		 */
		uint64_t offset;

		/**
		 * Offset value of the target buffer that the
		 * relocation entry was last written as.
		 *
		 * If the buffer has the same offset as last time, we
		 * can skip syncing and writing the relocation. This
		 * value is written back out by the execbuffer ioctl
		 * when the relocation is written.
		 */
		uint64_t presumed_offset;

		/**
		 * Target memory domains read by this operation.
		 */
		uint32_t read_domains;

		/**
		 * Target memory domains written by this operation.
		 *
		 * Note that only one domain may be written by the
		 * whole execbuffer operation, so that where there are
		 * conflicts, the application will get -EINVAL back.
		 */
		uint32_t write_domain;
	};

'target_handle' is the handle to the target object. This object must be one of the objects listed in the execbuffer request or bad things will happen. The kernel doesn't check for this.

'offset' is where, in the source object, the relocation data are written. Each relocation value is a 32-bit value consisting of the location of the target object in the GPU memory space plus the 'delta' value included in the relocation.

'presumed_offset' is where user space believes the target object lies in GPU memory space. If this value matches where the object actually is, then no relocation data are written; the kernel assumes that user space has set up data in the source object using this presumption. This offers a fairly important optimization, as writing relocation data requires mapping the source object into the kernel memory space.

'read_domains' and 'write_domain' list the usage by the source object of the target object. The kernel unions all of the domain information from all relocations in the execbuffer request. No more than one write_domain is allowed; otherwise, -EINVAL is returned. read_domains must contain write_domain. This domain information is used to synchronize buffer contents as described above in the section on domains.
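For concreteness, a hedged sketch of user space filling in one relocation for a command in a batch buffer; 'target_handle', 'target_last_offset', 'batch' and 'cmd_offset' are illustrative names, not part of the API:

	/* sketch: a command at byte 'cmd_offset' in the batch points
	 * 256 bytes into the target buffer */
	struct drm_i915_gem_relocation_entry reloc;

	reloc.target_handle = target_handle;
	reloc.delta = 256;				/* offset within the target */
	reloc.offset = cmd_offset + 4;			/* where the GPU address lands */
	reloc.presumed_offset = target_last_offset;	/* from a previous exec */
	reloc.read_domains = DRM_GEM_DOMAIN_I915_RENDER;
	reloc.write_domain = DRM_GEM_DOMAIN_I915_RENDER;

	/* user space writes the address it presumes into the batch;
	 * the kernel patches it only if the object has moved */
	batch[(cmd_offset + 4) / 4] =
		(uint32_t) (reloc.presumed_offset + reloc.delta);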
7.1.1 Memory Domains (Intel specific)

The Intel GPU has several internal caches which are not coherent and hence require explicit synchronization. Memory domains provide the necessary data to synchronize what is needed while leaving other cache contents intact.

 * DRM_GEM_DOMAIN_I915_RENDER. The GPU 3D and 2D rendering operations use a unified rendering cache, so operations doing 3D painting and 2D blts will use this domain.

 * DRM_GEM_DOMAIN_I915_SAMPLER. Textures are loaded by the sampler through a separate cache, so any texture reading will use this domain. Note that the sampler and renderer use different caches, so moving an object from render target to texture source will require a domain transfer.

 * DRM_GEM_DOMAIN_I915_COMMAND. The command buffer doesn't have an explicit cache (although it does read ahead quite a bit), so this domain just indicates that the object needs to be flushed to the GPU.

 * DRM_GEM_DOMAIN_I915_INSTRUCTION. All of the programs on Gen4 and later chips use an instruction cache to speed program execution. It must be explicitly flushed when new programs are written to memory by the CPU.

 * DRM_GEM_DOMAIN_I915_VERTEX. Vertex data uses two different vertex caches, but they're both flushed with the same instruction.

7.2 Execution object list (Intel specific)

	struct drm_i915_gem_exec_object {
		/**
		 * User's handle for a buffer to be bound into the GTT
		 * for this operation.
		 */
		uint32_t handle;

		/**
		 * List of relocations to be performed on this buffer
		 */
		uint32_t relocation_count;
		/* struct drm_i915_gem_relocation_entry *relocs */
		uint64_t relocs_ptr;

		/**
		 * Required alignment in graphics aperture
		 */
		uint64_t alignment;

		/**
		 * Returned value of the updated offset of the object,
		 * for future presumed_offset writes.
		 */
		uint64_t offset;
	};

Each object involved in a particular execution operation must be listed using one of these structures.

'handle' references the object.

'relocs_ptr' is a user-mode pointer to an array of 'relocation_count' drm_i915_gem_relocation_entry structs (see above) that define the relocations necessary in this buffer. Note that all relocations must reference other exec_object structures in the same execbuffer ioctl, and that those other buffers must come earlier in the exec_object array. In other words, the dependencies mapped by the exec_object relocations must form a directed acyclic graph.

'alignment' is the byte alignment necessary for this buffer. Each object has specific alignment requirements; as the kernel doesn't know what each object is being used for, those requirements must be provided by user mode. If an object is used in two different ways, it's quite possible that the alignment requirements will differ.

'offset' is a return value, receiving the location of the object during this execbuffer operation. The application should use this as the presumed offset in future operations; if the object does not move, then the kernel need not write relocation data.
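A sketch of the corresponding object list for a submission with one target buffer plus the batch buffer, which must come last ('target_handle', 'batch_handle' and 'reloc' continue the illustrative example above):

	/* sketch: object list for one referenced buffer plus the batch */
	struct drm_i915_gem_exec_object exec[2];

	exec[0].handle = target_handle;		/* referenced buffer first */
	exec[0].relocation_count = 0;
	exec[0].relocs_ptr = 0;
	exec[0].alignment = 0;			/* no special requirement */

	exec[1].handle = batch_handle;		/* batch buffer comes last */
	exec[1].relocation_count = 1;
	exec[1].relocs_ptr = (uint64_t) (uintptr_t) &reloc;
	exec[1].alignment = 4096;		/* illustrative value */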
7.3 Execbuffer ioctl (Intel specific)

	struct drm_i915_gem_execbuffer {
		/**
		 * List of buffers to be validated with their
		 * relocations to be performed on them.
		 *
		 * These buffers must be listed in an order such that
		 * all relocations a buffer is performing refer to
		 * buffers that have already appeared in the validate
		 * list.
		 */
		/* struct drm_i915_gem_exec_object *buffers */
		uint64_t buffers_ptr;
		uint32_t buffer_count;

		/**
		 * Offset in the batchbuffer to start execution from.
		 */
		uint32_t batch_start_offset;

		/**
		 * Bytes used in batchbuffer from batch_start_offset
		 */
		uint32_t batch_len;
		uint32_t DR1;
		uint32_t DR4;
		uint32_t num_cliprects;
		/* struct drm_clip_rect *cliprects */
		uint64_t cliprects_ptr;
	};

'buffers_ptr' is a user-mode pointer to an array of 'buffer_count' drm_i915_gem_exec_object structures which contains the complete set of objects required for this execbuffer operation. The last entry in this array, the 'batch buffer', is the buffer of commands which will be linked to the ring and executed.

'batch_start_offset' is the byte offset within the batch buffer which contains the first command to execute. So far, we haven't found a reason to use anything other than '0' here, but the thought was that some space might be allocated for additional initialization which could be skipped in some cases. This must be a multiple of 4.

'batch_len' is the length, in bytes, of the data to be executed (i.e., the amount of data after batch_start_offset). This must be a multiple of 4.

'num_cliprects' and 'cliprects_ptr' reference an array of drm_clip_rect structures that is num_cliprects long. The entire batch buffer will be executed multiple times, once for each rectangle in this list. If num_cliprects is 0, then no clipping rectangle will be set.

'DR1' and 'DR4' are portions of the 3DSTATE_DRAWING_RECTANGLE command which will be queued when this operation is clipped (num_cliprects != 0).

	DR1 bit		definition
	31		Fast Scissor Clip Disable (debug only). Disables a
			hardware optimization that improves performance.
			This should have no visible effect, other than
			reducing performance.
	30		Depth Buffer Coordinate Offset Disable. This
			disables the addition of the depth buffer offset
			bits which are used to change the location of the
			depth buffer relative to the front buffer.
	27:26		X Dither Offset. Specifies the X pixel offset to
			use when accessing the dither table.
	25:24		Y Dither Offset. Specifies the Y pixel offset to
			use when accessing the dither table.

	DR4 bit		definition
	31:16		Drawing Rectangle Origin Y. Specifies the Y origin
			of coordinates relative to the draw buffer.
	15:0		Drawing Rectangle Origin X. Specifies the X origin
			of coordinates relative to the draw buffer.

As you can see, these two fields are necessary for correctly offsetting drawing within a buffer which contains multiple surfaces. Note that DR1 is only used on Gen3 and earlier hardware and that newer hardware sticks the dither offset elsewhere.
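And a minimal usage sketch for the ioctl itself, continuing the illustrative example and assuming the macro name DRM_IOCTL_I915_GEM_EXECBUFFER to match the style of the other ioctls:

	/* usage (sketch): submit the two-entry object list from above */
	struct drm_i915_gem_execbuffer execbuf;

	execbuf.buffers_ptr = (uint64_t) (uintptr_t) exec;
	execbuf.buffer_count = 2;
	execbuf.batch_start_offset = 0;
	execbuf.batch_len = batch_used;		/* bytes of commands, multiple of 4 */
	execbuf.DR1 = 0;
	execbuf.DR4 = 0;
	execbuf.num_cliprects = 0;		/* no clipping requested */
	execbuf.cliprects_ptr = 0;
	ret = ioctl (fd, DRM_IOCTL_I915_GEM_EXECBUFFER, &execbuf);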
7.3.1 Detailed Execution Description

Execution of a single batch buffer requires several preparatory steps to make the objects visible to the graphics engine and resolve relocations to account for their current addresses.

A. Mapping and Relocation

Each exec_object structure in the array is examined in turn. If the object is not already bound to the GTT, it is assigned a location in the graphics address space. If no space is available in the GTT, some other object will be evicted. This may require waiting for previous execbuffer requests to complete before that object can be unmapped. With the location assigned, the pages for the object are pinned in memory using find_or_create_page and the GTT entries updated to point at the relevant pages using drm_agp_bind_pages.

Then the array of relocations is traversed. Each relocation record looks up the target object and, if the presumed offset does not match the current offset (remember that this buffer has already been assigned an address, as it must have been mapped earlier), the relocation value is computed using the current offset. If the object is currently in use by the graphics engine, writing the data out must be preceded by a delay while the object is still busy. Once it is idle, the page containing the relocation is mapped by the CPU and the updated relocation data written out.

The read_domains and write_domain entries in each relocation are used to compute the new read_domains and write_domain values for the target buffers. The actual execution of the domain changes must wait until all of the exec_object entries have been evaluated, as the complete set of domain information will not be available until then.

B. Memory Domain Resolution

After all of the new memory domain data has been pulled out of the relocations and computed for each object, the list of objects is again traversed and the new memory domains compared against the current memory domains. There are two basic operations involved here:

 * Flushing the current write domain. If the new read domains are not equal to the current write domain, then the current write domain must be flushed. Otherwise, reads will not see data present in the write domain cache. In addition, any new read domains other than the current write domain must be invalidated to ensure that the flushed data are re-read into their caches.

 * Invalidating new read domains. Any domains which were not currently used for this object must be invalidated, as old objects which were mapped at the same location may have stale data in the new domain caches.

If the CPU cache is being invalidated and some GPU cache is being flushed, then we'll have to wait for rendering to complete so that any pending GPU writes will be complete before we flush the GPU cache.

If the CPU cache is being flushed, then we use 'clflush' to get data written from the CPU.

Because the GPU caches cannot be partially flushed or invalidated, we don't actually flush them during this traversal stage. Rather, we gather the invalidate and flush bits up in the device structure. Once all of the object domain changes have been evaluated, the gathered invalidate and flush bits are examined. For any GPU flush operations, we emit a single MI_FLUSH command that performs all of the necessary flushes. We then look to see if the CPU cache was flushed. If so, we use the chipset flush magic (writing to a special page) to get the data out of the chipset and into memory.

C. Queuing Batch Buffer to the Ring

With all of the objects resident in graphics memory space, and all of the caches prepared with appropriate data, the batch buffer object can be queued to the ring. If there are clip rectangles, then the buffer is queued once per rectangle, with suitable clipping inserted into the ring just before the batch buffer.
D. Creating an IRQ Cookie

Right after the batch buffer is placed in the ring, a request to generate an IRQ is added to the ring along with a command to write a marker into memory. When the IRQ fires, the driver can look at the memory location to see where in the ring the GPU has passed. This magic cookie value is stored in each object used in this execbuffer command; it is used wherever you saw 'wait for rendering' above in this document.

E. Writing back the new object offsets

So that the application has a better idea what to use for 'presumed_offset' values later, the current object offsets are written back to the exec_object structures.

8. Other misc Intel-specific functions

To complete the driver, a few other functions were necessary.

8.1 Initialization from the X server

As the X server is currently responsible for apportioning memory between 2D and 3D, it must tell the kernel which region of the GTT aperture is available for 3D objects to be mapped into.

	struct drm_i915_gem_init {
		/**
		 * Beginning offset in the GTT to be managed by the
		 * DRM memory manager.
		 */
		uint64_t gtt_start;

		/**
		 * Ending offset in the GTT to be managed by the DRM
		 * memory manager.
		 */
		uint64_t gtt_end;
	};

	/* usage */
	init.gtt_start = <gtt_start>;
	init.gtt_end = <gtt_end>;
	ret = ioctl (fd, DRM_IOCTL_I915_GEM_INIT, &init);

The GTT aperture between gtt_start and gtt_end will be used to map objects. This also tells the kernel that the ring can be used, pulling the ring addresses from the device registers.

8.2 Pinning objects in the GTT

For scan-out buffers and the current shared depth and back buffers, we need to have them always available in the GTT, at least for now. Pinning means locking their pages in memory along with keeping them at a fixed offset in the graphics aperture. These operations are available only to root.

	struct drm_i915_gem_pin {
		/** Handle of the buffer to be pinned. */
		uint32_t handle;
		uint32_t pad;

		/** alignment required within the aperture */
		uint64_t alignment;

		/** Returned GTT offset of the buffer. */
		uint64_t offset;
	};

	/* usage */
	pin.handle = <handle>;
	pin.alignment = <alignment>;
	ret = ioctl (fd, DRM_IOCTL_I915_GEM_PIN, &pin);
	if (ret == 0)
		return pin.offset;

Pinning an object ensures that it will not be evicted from the GTT or moved. It will stay resident until destroyed or unpinned.

	struct drm_i915_gem_unpin {
		/** Handle of the buffer to be unpinned. */
		uint32_t handle;
		uint32_t pad;
	};

	/* usage */
	unpin.handle = <handle>;
	ret = ioctl (fd, DRM_IOCTL_I915_GEM_UNPIN, &unpin);

Unpinning an object makes it possible to evict this object from the GTT. It doesn't ensure that it will be evicted, just that it may.

-- 
kei...@in...
From: Jerome G. <gl...@fr...> - 2008-05-14 06:06:35
On Tue, 13 May 2008 21:35:16 +0100 (IST)
Dave Airlie <ai...@li...> wrote:

> 1) I feel there hasn't been enough open driver coverage to prove it. So
> far we have done an Intel IGD; we have a lot of code that isn't required
> for these devices, so the question is how much code exists purely to
> support the Poulsbo closed-source userspace, and why we need to live
> with it. Both radeon and nouveau developers have expressed frustration
> about the fencing internals being really hard to work with, which doesn't
> bode well for maintainability in the future.

Well, my TTM experiment brought me up to EXA with radeon; I also did several small 3D tests to see how I want to send commands. So from my experiments, here are the things that are becoming painful for me.

On some radeon hw (most newer cards with a big amount of RAM) you can't map vram beyond the aperture; well, you can, but you need to reprogram the card aperture, and that's not something you want to do. TTM's assumption is that memory accesses are done through a mapping of the buffer, and so in this situation this becomes cumbersome. We already discussed this, and the idea was to split vram, but I don't like this solution. So in the end I am more and more convinced that we should avoid mapping objects into the client's vma; I see two advantages to this: no TLB flush on the vma, and no hard-to-solve page mapping aliasing.

On the fence side, I hoped that I could have reasonable code using IRQs working reliably, but after discussion with AMD, what I was doing was obviously not recommended and prone to hard GPU lockups, which is a no-go for me. The last solution I have in mind for synchronization, i.e. knowing when the GPU is done with a buffer, could not use IRQs, at least not on all the hw I am interested in (r3xx/r4xx). Of course I don't want to busy-wait to know when the GPU is done. Also, the fence code puts too many assumptions on what we should provide. While fencing might prove useful, I think it can be better served by driver-specific ioctls than by a common infrastructure where hw obviously doesn't fit well into the scheme due to the differences between chips.

And like Stephane, I think the GPU virtual memory stuff can't be used at its best in this scheme.

That said, I also share some concerns about GEM, like the high-memory pages, but I think this one is workable with the help of kernel people. For vram, the solution discussed so far, and which I like, is to have the driver choose, based on client requests, which objects to put there, and to see vram as a cache. So we will have all objects backed by a RAM copy (which can be swapped); then it's all a matter of syncing the vram copy and the RAM copy when necessary. Domains and pread/pwrite access let you easily do this sync only on the necessary area. Suspend also becomes easier: just sync the objects whose write domain is the GPU. So, all in all, I agree that GEM might ask each driver to redo some stuff, but I think a large set of helper functions can leverage this; more importantly, I see this as freedom for each driver and the only way to cope with hw differences.

Cheers,
Jerome Glisse <gl...@fr...>
From: Keith P. <ke...@ke...> - 2008-05-15 02:24:35
On Wed, 2008-05-14 at 16:34 -0700, Allen Akin wrote:

> On Wed, May 14, 2008 at 03:48:47PM -0700, Keith Packard wrote:
> | Object mapping is really the least important part of the system; it
> | should only be necessary when your GPU is deficient, or your API so
> | broken as to require this inefficient mechanism.
>
> In the OpenGL case, object mapping wasn't originally a part of the API.
> It was added because people building hardware and apps for Intel-based
> PCs determined that it was worthwhile, and demanded it.

In a UMA environment, it seems so obvious to map objects into the application and just bypass the whole kernel API issue. That, however, ignores caching effects, which appear to dominate performance effects these days.

> This wasn't on my watch, so I can't give you the history in detail, but
> my recollection is that the primary uses were texture loading for games
> and video apps, and incremental changes to vertex arrays for games and
> rendering apps.

Most of which can be efficiently performed with a pwrite-like system where the application explicitly tells the system which portions of the object to modify. Again, it seems insane when everything is a uniform mass of pages, except for the subtle differences in cache behaviour.

> So maybe the hardware has changed sufficiently that the old reasoning
> and performance measurements are no longer valid. It would still be
> good to know for sure that eliminating low-level support for the
> mechanism won't be drastically bad for the classes of apps that use it.

I'm not sure we can (or want to) eliminate it entirely; all that I discovered was that it should be avoided, as it has negative performance consequences. Not dire, but certainly not positive either.

I don't know how old these measurements were, but certainly the gap between CPU and memory speed has been rapidly increasing for years, along with cache sizes, both of which have a fairly dramatic effect on how best to access actual memory.

-- 
kei...@in...
From: Allen A. <ak...@po...> - 2008-05-15 03:37:03
On Wed, May 14, 2008 at 05:22:06PM -0700, Keith Packard wrote:
| On Wed, 2008-05-14 at 16:34 -0700, Allen Akin wrote:
| > In the OpenGL case, object mapping wasn't originally a part of the API.
| > It was added because people building hardware and apps for Intel-based
| > PCs determined that it was worthwhile, and demanded it.
|
| In a UMA environment, it seems so obvious to map objects into the
| application and just bypass the whole kernel API issue. That, however,
| ignores caching effects, which appear to dominate performance effects
| these days.

I think the confusion arises because the mechanism is used for several purposes, some of which are likely to be dominated by cache effects on some implementations, and others that aren't. I'm thinking about the differences between piecemeal updating of the elements of a vertex array, versus grabbing an image from a video capture card or a direct read() from a file into a texture buffer. The API is intended to allow apps and drivers to make intelligent choices between cases like those. Check out BufferData() and MapBuffer() in section 2.9 of the OpenGL 2.1 spec for a discussion which specifically mentions cache effects.

| > This wasn't on my watch, so I can't give you the history in detail, but
| > my recollection is that the primary uses were texture loading for games
| > and video apps, and incremental changes to vertex arrays for games and
| > rendering apps.
|
| Most of which can be efficiently performed with a pwrite-like system
| where the application explicitly tells the system which portions of the
| object to modify.

Interfaces of that style are present in OpenGL, and predate the mapping interfaces. I know they were regarded as too slow for some apps, so the mapping interfaces were added. The early extensions were driven by vendors who didn't support UMA, so that couldn't have been the only model they were concerned about. Beyond that I'm not sure.

| > So maybe the hardware has changed sufficiently that the old reasoning
| > and performance measurements are no longer valid. It would still be
| > good to know for sure that eliminating low-level support for the
| > mechanism won't be drastically bad for the classes of apps that use it.
|
| I'm not sure we can (or want to) eliminate it entirely; all that I
| discovered was that it should be avoided, as it has negative performance
| consequences. Not dire, but certainly not positive either.
|
| I don't know how old these measurements were, but certainly the gap
| between CPU and memory speed has been rapidly increasing for years,
| along with cache sizes, both of which have a fairly dramatic effect on
| how best to access actual memory.

The first reference I can find to an object-mapping API in OpenGL is from 2001. I'm sure the vendors had implementations internally before then, but that's when things were mature enough to start standardizing. Since the functionality is present in OpenGL 2.0 (vintage 2006?), apparently someone thought it was still useful enough to carry over from OpenGL 1.X. Again, sorry I don't know the entire history on this one.

Allen
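For reference, the two update paths under discussion look like this against the OpenGL 1.5/2.1 buffer-object API (a minimal sketch; 'vbo', 'offset', 'size' and 'vertices' are illustrative):

	/* pwrite-style update: the driver performs the copy */
	glBindBuffer (GL_ARRAY_BUFFER, vbo);
	glBufferSubData (GL_ARRAY_BUFFER, offset, size, vertices);

	/* mapping-style update: the app writes through a pointer */
	glBindBuffer (GL_ARRAY_BUFFER, vbo);
	ptr = glMapBuffer (GL_ARRAY_BUFFER, GL_WRITE_ONLY);
	if (ptr != NULL) {
		memcpy ((char *) ptr + offset, vertices, size);
		glUnmapBuffer (GL_ARRAY_BUFFER);
	}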
From: Thomas H. <th...@tu...> - 2008-05-13 22:03:31
Dave Airlie wrote:
>> Dave,
>>
>> Could you list what fixes / changes you think are needed to get TTM into
>> the mainline kernel?
>
> Two main reasons:
>
> 1) I feel there hasn't been enough open driver coverage to prove it. So
> far we have done an Intel IGD; we have a lot of code that isn't required
> for these devices, so the question is how much code exists purely to
> support the Poulsbo closed-source userspace, and why we need to live
> with it. Both radeon and nouveau developers have expressed frustration
> about the fencing internals being really hard to work with, which doesn't
> bode well for maintainability in the future.

OK. So basically what I'm asking is: when we have full-featured open source drivers available that utilize TTM, either as part of DRM core or, if needed, as part of driver-specific code, do you see anything else that prevents that from being pushed? That would be very valuable to know for anyone starting porting work.

/Thomas
From: Dave A. <ai...@li...> - 2008-05-13 22:46:20
> > 1) I feel there hasn't been enough open driver coverage to prove it. So
> > far we have done an Intel IGD; we have a lot of code that isn't required
> > for these devices, so the question is how much code exists purely to
> > support the Poulsbo closed-source userspace, and why we need to live
> > with it. Both radeon and nouveau developers have expressed frustration
> > about the fencing internals being really hard to work with, which
> > doesn't bode well for maintainability in the future.
>
> OK. So basically what I'm asking is: when we have full-featured open
> source drivers available that utilize TTM, either as part of DRM core or,
> if needed, as part of driver-specific code, do you see anything else that
> prevents that from being pushed? That would be very valuable to know for
> anyone starting porting work.

I was hoping that by now one of the radeon or nouveau drivers would have adopted TTM, or at least demoed something working using it. This hasn't happened, which worries me; perhaps glisse or darktama could fill in on what limited them from doing it. The fencing internals are very very scary and seem to be a major stumbling block.

I do worry that TTM is not Linux enough. It seems you have decided that we can never do in-kernel allocations at any usable speed and punted the work into userspace, which makes life easier for Gallium as it's more like what Windows does, but I'm not sure this is a good solution for Linux.

The real question is whether TTM suits the driver writers for use in Linux desktop and embedded environments, and I think so far I'm not seeing enough positive feedback from the desktop side.

Also, wrt the i915 driver, it has too many experiments in it; the i915 users need to group together, remove the codepaths that make no sense, come up with a suitable userspace driver for it, remove all unused fencing mechanisms, etc.

Dave.

> /Thomas
From: Stephane M. <mar...@ic...> - 2008-05-14 01:01:33
On 5/14/08, Dave Airlie <ai...@li...> wrote:
>
> I was hoping that by now one of the radeon or nouveau drivers would have
> adopted TTM, or at least demoed something working using it. This hasn't
> happened, which worries me; perhaps glisse or darktama could fill in on
> what limited them from doing it. The fencing internals are very very
> scary and seem to be a major stumbling block.

Aside from the fencing code, I have some other, more general concerns with respect to using TTM on recent hardware. Although I've raised them before, it was on IRC, not really on the list.

The main issue, in my opinion, is that TTM enforces most things to be done from the kernel, and dictates how those things should be done: command checking with relocations, fence emission, memory moves... Depending on the hardware functionality available, this might be useless or even counter-productive.

Also, I'm concerned about handling chips that can do page faults in video memory. It is interesting to be able to use this feature (which was asked for by the windows guys). For example, we could have the ability to have huge textures paged in progressively at the memory manager level.

So to me the current TTM design lacks enough flexibility for recent chip features. I'm not saying all of this has to be implemented now, but it should not be prevented by the design. After all, if the memory manager is here to stay, I'd say it needs to be future-proof.

Stephane
From: Thomas H. <th...@tu...> - 2008-05-14 00:33:28
Dave Airlie wrote:
>>> 1) I feel there hasn't been enough open driver coverage to prove it. So
>>> far we have done an Intel IGD; we have a lot of code that isn't
>>> required for these devices, so the question is how much code exists
>>> purely to support the Poulsbo closed-source userspace, and why we need
>>> to live with it. Both radeon and nouveau developers have expressed
>>> frustration about the fencing internals being really hard to work
>>> with, which doesn't bode well for maintainability in the future.
>>
>> OK. So basically what I'm asking is: when we have full-featured open
>> source drivers available that utilize TTM, either as part of DRM core
>> or, if needed, as part of driver-specific code, do you see anything
>> else that prevents that from being pushed? That would be very valuable
>> to know for anyone starting porting work.
>
> I was hoping that by now one of the radeon or nouveau drivers would have
> adopted TTM, or at least demoed something working using it. This hasn't
> happened, which worries me; perhaps glisse or darktama could fill in on
> what limited them from doing it. The fencing internals are very very
> scary and seem to be a major stumbling block.

Yes, it would be good to get some details here. Exactly what parts are scary? It seems Ian Romanick has made it work fine with xgi: 122 locs including license headers. I915 fencing can be made equally short if all sample (flushing) code is removed.

> I do worry that TTM is not Linux enough. It seems you have decided that
> we can never do in-kernel allocations at any usable speed and punted the
> work into userspace, which makes life easier for Gallium as it's more
> like what Windows does, but I'm not sure this is a good solution for
> Linux.

In-kernel allocations should be really fast unless they involve changing caching policy. If they are not, it's not a design issue but an implementation one, which should be fixable. Trying to make mmap(anonymous) lightning fast when there is malloc() doesn't really make sense to me.

> The real question is whether TTM suits the driver writers for use in
> Linux desktop and embedded environments, and I think so far I'm not
> seeing enough positive feedback from the desktop side.

I actually haven't seen much feedback at all, at least not on the mailing lists. Anyway, we need to look at the alternatives, which currently is GEM. GEM, while still in development, basically brings us back to the functionality of TTM 0.1, with added paging support but without fine-grained locking and caching-policy support. I might have misunderstood things, but quickly browsing the code raises some obvious questions:

1) Some AGP chipsets don't support page addresses > 32 bits. GEM objects use GFP_HIGHUSER, and it's hardcoded into the linux swap code.

2) How will user-space mapping of IO memory (AGP apertures) work? Eviction and associated killing / refaulting of IO memory mappings?

3) How do we avoid illegal physical page aliasing with non-Intel hardware? And how are we going to get the kernel purists to accept it when they already complain about WC - UC aliasing?

4) How is VRAM incorporated in the GEM design? How do we map it and keep the mapping during eviction?

5) What's protecting i915 GEM object privates and lists in a multi-threaded environment?

6) Isn't do_mmap() strictly forbidden in new drivers? I remember seeing some severe ranting about it on the lkml?

TTM is designed to cope with most hardware quirks I've come across with different chipsets so far, including Intel UMA, Unichrome, Poulsbo, and some other ones. GEM basically leaves it up to the driver writer to reinvent the wheel.

> Also, wrt the i915 driver, it has too many experiments in it; the i915
> users need to group together, remove the codepaths that make no sense,
> come up with a suitable userspace driver for it, remove all unused
> fencing mechanisms, etc.

Agreed, but back to the real and, to me, very important question: If I embark on a new OS driver today and want to use advanced memory manager stuff, have VRAM and multiple advanced syncing mechanisms, what's my best option to get it into the kernel? Can I hook up driver-specific TTM and get it in?

/Thomas
From: Thomas H. <th...@tu...> - 2008-05-14 10:09:48
Jerome Glisse wrote:
> On Tue, 13 May 2008 21:35:16 +0100 (IST)
> Dave Airlie <ai...@li...> wrote:
>
>> 1) I feel there hasn't been enough open driver coverage to prove it. So
>> far we have done an Intel IGD; we have a lot of code that isn't required
>> for these devices, so the question is how much code exists purely to
>> support the Poulsbo closed-source userspace, and why we need to live
>> with it. Both radeon and nouveau developers have expressed frustration
>> about the fencing internals being really hard to work with, which
>> doesn't bode well for maintainability in the future.
>
> Well, my TTM experiment brought me up to EXA with radeon; I also did
> several small 3D tests to see how I want to send commands. So from my
> experiments, here are the things that are becoming painful for me.
>
> On some radeon hw (most newer cards with a big amount of RAM) you can't
> map vram beyond the aperture; well, you can, but you need to reprogram
> the card aperture, and that's not something you want to do. TTM's
> assumption is that memory accesses are done through a mapping of the
> buffer, and so in this situation this becomes cumbersome. We already
> discussed this, and the idea was to split vram, but I don't like this
> solution. So in the end I am more and more convinced that we should
> avoid mapping objects into the client's vma; I see two advantages to
> this: no TLB flush on the vma, and no hard-to-solve page mapping
> aliasing.
>
> On the fence side, I hoped that I could have reasonable code using IRQs
> working reliably, but after discussion with AMD, what I was doing was
> obviously not recommended and prone to hard GPU lockups, which is a
> no-go for me. The last solution I have in mind for synchronization,
> i.e. knowing when the GPU is done with a buffer, could not use IRQs, at
> least not on all the hw I am interested in (r3xx/r4xx). Of course I
> don't want to busy-wait to know when the GPU is done. Also, the fence
> code puts too many assumptions on what we should provide. While fencing
> might prove useful, I think it can be better served by driver-specific
> ioctls than by a common infrastructure where hw obviously doesn't fit
> well into the scheme due to the differences between chips.
>
> And like Stephane, I think the GPU virtual memory stuff can't be used
> at its best in this scheme.
>
> That said, I also share some concerns about GEM, like the high-memory
> pages, but I think this one is workable with the help of kernel people.
> For vram, the solution discussed so far, and which I like, is to have
> the driver choose, based on client requests, which objects to put there,
> and to see vram as a cache. So we will have all objects backed by a RAM
> copy (which can be swapped); then it's all a matter of syncing the vram
> copy and the RAM copy when necessary. Domains and pread/pwrite access
> let you easily do this sync only on the necessary area. Suspend also
> becomes easier: just sync the objects whose write domain is the GPU.
> So, all in all, I agree that GEM might ask each driver to redo some
> stuff, but I think a large set of helper functions can leverage this;
> more importantly, I see this as freedom for each driver and the only way
> to cope with hw differences.
>
> Cheers,
> Jerome Glisse <gl...@fr...>

Jerome, Dave, Keith,

It's hard to argue against people trying things out and finding it's not really what they want, so I'm not going to do that. The biggest argument (apart from the fencing) seems to be that people think TTM stops them from doing what they want with the hardware, although it seems like the Nouveau needs and the Intel UMA needs are quite opposite.

In an open-source community where people work on things because they want to, not being able to do what you want is a bad thing. OTOH, a stall and disagreement about what's the best thing to use is even worse. It confuses the users, and it's particularly bad for people trying to write drivers on a commercial basis.

I've looked through KeithP's mail to look for a way to use GEM for future development. Since many things will be device-dependent, I think it's possible for us to work around some issues I see, but a couple of big things remain.

1) The inability to map device memory. The design arguments and proposed solution for VRAM are not really valid. Think of this, probably not too uncommon, scenario of a single-pixel fallback composite to a scanout buffer in vram, or a texture or video frame upload:

A) Page in all GEM pages, because they've been paged out.
B) Copy the complete scanout buffer to GEM because it's dirty. Untile.
C) Write the pixel.
D) Copy the complete buffer back while tiling.

2) Reserving pages when allocating VRAM buffers is also a very bad solution, particularly on systems with a lot of VRAM and little system RAM. (Multiple-card machines?) GEM basically needs to reserve swap space when buffers are created, and put a limit on the pinned physical pages. We basically should not be able to fail memory allocation during execbuf, because we cannot recover from that.

Other things like GFP_HIGHUSER etc. are probably fixable if there is a will to do it.

So if GEM is the future, these shortcomings must IMHO be addressed. In particular, GEM should not stop people from mapping device memory directly, particularly not in view of the arguments against TTM previously outlined. This means that the dependency on SHMEMFS probably needs to be dropped and replaced with some sort of DRMFS that allows overloading of mmap and correct swap handling, addresses the caching issue and also avoids the driver do_mmap(). If we're taking another round at this, there's a need to get it more right than the old solution.

/Thomas
From: Jerome G. <gl...@fr...> - 2008-05-14 11:14:23
On Wed, 14 May 2008 12:09:06 +0200
Thomas Hellström <th...@tu...> wrote:

> Jerome, Dave, Keith,
>
> 1) The inability to map device memory. The design arguments and proposed
> solution for VRAM are not really valid. Think of this, probably not too
> uncommon, scenario of a single-pixel fallback composite to a scanout
> buffer in vram, or a texture or video frame upload:
>
> A) Page in all GEM pages, because they've been paged out.
> B) Copy the complete scanout buffer to GEM because it's dirty. Untile.
> C) Write the pixel.
> D) Copy the complete buffer back while tiling.

With pwrite/pread you give the offset and size of the things you are interested in, so for the single-pixel case it will pread a page and pwrite it once the fallback is finished. I totally agree that downloading the whole object on a fallback is to be avoided. But as long as we don't have a fallback which draws the whole screen, we are fine; and since such a fallback will be disastrous whether we map vram or not, I'm led to discard this drawback and just accept the pain for such fallbacks. Also, I am confident that we can find a more clever way in such cases, like doing the whole rendering in RAM and updating the final result, assuming that the up-to-date copy is in RAM and that vram might be out of sync.

> 2) Reserving pages when allocating VRAM buffers is also a very bad
> solution, particularly on systems with a lot of VRAM and little system
> RAM. (Multiple-card machines?) GEM basically needs to reserve swap space
> when buffers are created, and put a limit on the pinned physical pages.
> We basically should not be able to fail memory allocation during
> execbuf, because we cannot recover from that.

Well, this solves the suspend problem we were discussing at XDS, i.e. what to do with buffers. If we know that we have room to put buffers, then we don't have to worry about which buffers we are ready to lose. Given that OpenGL doesn't give any clue on that, this sounds like a good approach.

For embedded devices where every piece of RAM still matters, I guess you also have to deal with the suspend case, so you have a way to either save vram content or preserve it. I don't see any problem with GEM coping with this case too.

> Other things like GFP_HIGHUSER etc. are probably fixable if there is a
> will to do it.
>
> So if GEM is the future, these shortcomings must IMHO be addressed. In
> particular, GEM should not stop people from mapping device memory
> directly, particularly not in view of the arguments against TTM
> previously outlined.

As I said, I have come to the opinion that not mapping vram into userspace vmas sounds like a good plan. I am even thinking that avoiding all mapping and encouraging pread/pwrite is a better solution. For me, vram is temporary storage that card makers use to speed up their hw, and so it should not be directly used by userspace. Note that this does not go against having user space choose the policy for vram usage, i.e. which object to put where.

Cheers,
Jerome Glisse
From: Thomas H. <th...@tu...> - 2008-05-14 14:38:00
|
Jerome Glisse wrote:
> On Wed, 14 May 2008 12:09:06 +0200
> Thomas Hellström <th...@tu...> wrote:
>
>> 1) The inability to map device memory. The design arguments and proposed
>> solution for VRAM are not really valid. Think of this, probably not too
>> uncommon, scenario of a single-pixel fallback composite to a scanout
>> buffer in VRAM, or a texture or video frame upload:
>>
>> A) Page in all GEM pages, because they've been paged out.
>> B) Copy the complete scanout buffer to GEM because it's dirty. Untile.
>> C) Write the pixel.
>> D) Copy the complete buffer back while tiling.
>
> With pread/pwrite you give the offset and size of the things you are
> interested in, so for the single-pixel case it will pread a page and
> pwrite it once the fallback is finished. I totally agree that downloading
> the whole object on a fallback is to be avoided. But as long as we don't
> have a fallback which draws the whole screen we are fine, and since such
> a fallback will be disastrous whether we map VRAM or not, I'm led to
> discard this drawback and just accept the pain for such fallbacks.

I don't agree with you here. EXA is much faster for small composite operations, and even for small fill blits, if fallbacks are used -- even to write-combined memory, though that of course depends on the hardware. This is going to be even more pronounced with acceleration architectures like Glucose and similar, which don't have an optimized path for small hardware composite operations.

My personal feeling is that pwrites are a workaround for a workaround for a very bad decision: to avoid user-space allocators on device-mapped memory. This led to a hack to avoid caching-policy changes, which led to cache-thrashing problems, which put us in the current situation. How far are we going to follow this path before people wake up? What's wrong with the performance of good old i915tex, which even beats "classic" i915 in many cases?

Having to go through potentially (and even probably) paged-out memory to access buffers that are present in VRAM sounds like a very odd approach (to say the least) to me. Even if it's a single page, implementing per-page dirty checks for domain flushing isn't very appealing either.

> Also, I am confident that we can find a cleverer way to handle such
> cases, like doing the whole rendering in RAM and updating the final
> result, i.e. assuming that the up-to-date copy is in RAM and that VRAM
> might be out of sync.

Why should we have to, when we can do it right?

>> 2) Reserving pages when allocating VRAM buffers is also a very bad
>> solution, particularly on systems with a lot of VRAM and little system
>> RAM (multiple-card machines?). GEM basically needs to reserve swap space
>> when buffers are created and put a limit on the pinned physical pages.
>> We basically should not be able to fail memory allocation during
>> execbuf, because we cannot recover from that.
>
> Well, this solves the suspend problem we were discussing at XDS, i.e.
> what to do with buffers. If we know that we have room to put the buffers,
> then we don't need to worry about which buffers we are ready to lose.
> Given that OpenGL doesn't give any clue about that, this sounds like a
> good approach.
>
> For embedded devices, where every piece of RAM still matters, I guess you
> also have to deal with the suspend case, so you have a way to either save
> VRAM content or preserve it. I don't see any problem with GEM coping with
> this case too.

No. GEM can't cope with it.
Let's say you have a 512M system with two 1G video cards, 4G of swap space, and you want to fill both cards' video RAM with render-and-forget textures for whatever purpose. What happens? After you've generated the first, say, 300M, the system mysteriously starts to page, and when, after a couple of minutes of crawling texture-upload speeds, you're done, the system is using, and has written, almost 2G of swap. Now you want to update the textures and expect fast texsubimage...

So having a backing object that you have to access to get things into VRAM is not the way to go. The correct way to do this is to reserve, but not use, swap space. Then you can start using it on suspend, provided that the swapping system is still up (which it has to be with the current GEM approach anyway). If pwrite is used in this case, it must not dirty any backing-object pages.

/Thomas

>> Other things like GFP_HIGHUSER etc. are probably fixable if there is a
>> will to do it.
>>
>> So if GEM is the future, these shortcomings must IMHO be addressed. In
>> particular, GEM should not stop people from mapping device memory
>> directly, particularly not in view of the arguments against TTM
>> previously outlined.
>
> As I said, I have come to the opinion that not mapping VRAM into a
> userspace VMA sounds like a good plan. I am even thinking that avoiding
> all mapping and encouraging pread/pwrite is a better solution. For me,
> VRAM is temporary storage that card makers use to speed up their
> hardware, and so it should not be used directly by userspace. Note that
> this does not go against having userspace choose the policy for VRAM
> usage, i.e. which object to put where.
>
> Cheers,
> Jerome Glisse |
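A sketch of the reserve-but-don't-populate accounting Thomas argues for. The swap_reserve()/swap_unreserve() exports are purely hypothetical (no such kernel interface exists), as is the vram_bo structure:

#include <linux/errno.h>
#include <linux/types.h>

/* Hypothetical core exports: promise swap space without allocating or
 * dirtying any pages. swap_reserve() is assumed to return 0 on success. */
extern int swap_reserve(unsigned long num_pages);
extern void swap_unreserve(unsigned long num_pages);

struct vram_bo {			/* invented for the sketch */
	unsigned long num_pages;
	bool swap_reserved;		/* backing space promised, not used */
};

static int vram_bo_create(struct vram_bo *bo, unsigned long num_pages)
{
	/* Fail up front if the object could never be swapped out on
	 * suspend; never fail later, e.g. in the middle of execbuf. */
	if (swap_reserve(num_pages))
		return -ENOMEM;
	bo->num_pages = num_pages;
	bo->swap_reserved = true;
	return 0;	/* no backing pages are allocated or dirtied here */
}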
From: Eric A. <er...@an...> - 2008-05-14 17:53:41
|
On Wed, 2008-05-14 at 16:36 +0200, Thomas Hellström wrote:
>>> 2) Reserving pages when allocating VRAM buffers is also a very bad
>>> solution, particularly on systems with a lot of VRAM and little system
>>> RAM (multiple-card machines?). GEM basically needs to reserve swap space
>>> when buffers are created and put a limit on the pinned physical pages.
>>> We basically should not be able to fail memory allocation during
>>> execbuf, because we cannot recover from that.
>>
>> Well, this solves the suspend problem we were discussing at XDS, i.e.
>> what to do with buffers. If we know that we have room to put the buffers,
>> then we don't need to worry about which buffers we are ready to lose.
>> Given that OpenGL doesn't give any clue about that, this sounds like a
>> good approach.
>>
>> For embedded devices, where every piece of RAM still matters, I guess you
>> also have to deal with the suspend case, so you have a way to either save
>> VRAM content or preserve it. I don't see any problem with GEM coping with
>> this case too.
>
> No. GEM can't cope with it. Let's say you have a 512M system with two 1G
> video cards, 4G of swap space, and you want to fill both cards' video RAM
> with render-and-forget textures for whatever purpose.

Who's selling that system? Who's building that system at home?

--
Eric Anholt anholt@FreeBSD.org er...@an... eri...@in... |
From: Philipp K. K. <pk...@sp...> - 2008-05-15 18:50:59
|
Eric Anholt wrote:
>> No. GEM can't cope with it. Let's say you have a 512M system with two 1G
>> video cards, 4G of swap space, and you want to fill both cards' video RAM
>> with render-and-forget textures for whatever purpose.
>
> Who's selling that system? Who's building that system at home?

Video game consoles? According to Wikipedia, the PS3 has 256 MB of RAM vs 256 MB of VRAM.

Philipp

P.S.: Even my ColecoVision has 1 KB of RAM vs 16 KB of VRAM. |
From: Keith P. <ke...@ke...> - 2008-05-14 15:52:34
|
On Wed, 2008-05-14 at 12:09 +0200, Thomas Hellström wrote:
> 1) The inability to map device memory. The design arguments and proposed
> solution for VRAM are not really valid. Think of this, probably not too
> uncommon, scenario of a single-pixel fallback composite to a scanout
> buffer in VRAM, or a texture or video frame upload:

Nothing prevents you from mapping device memory; it's just that on a UMA device there's no difference, and there are some significant advantages to using the direct mapping. I wrote the API I needed for my device; I think it's simple enough that other devices can add the APIs they need.

But what we've learned in the last few months is that mapping *any* pages into user space is a last-resort mechanism. Mapping pages WC or UC requires inter-processor interrupts, and using normal WB pages means invoking clflush on regions written from user space. The glxgears "benchmark" demonstrates this with some clarity -- using pwrite to send batch buffers is nearly three times faster (888 fps using pwrite vs 300 fps using mmap) than mapping pages to user space and then clflush'ing them in the kernel.

> A) Page in all GEM pages, because they've been paged out.
> B) Copy the complete scanout buffer to GEM because it's dirty. Untile.
> C) Write the pixel.
> D) Copy the complete buffer back while tiling.

First off, I don't care about fallbacks; any driver using fallbacks is broken. Second, if you had to care about fallbacks on non-UMA hardware, you'd compute the pages necessary for the fallback and only map/copy those anyway.

> 2) Reserving pages when allocating VRAM buffers is also a very bad
> solution, particularly on systems with a lot of VRAM and little system
> RAM (multiple-card machines?). GEM basically needs to reserve swap space
> when buffers are created and put a limit on the pinned physical pages.
> We basically should not be able to fail memory allocation during
> execbuf, because we cannot recover from that.

As far as I know, any device using VRAM will not preserve it across suspend/resume. From my perspective, this means you don't get a choice about allocating backing store for that data. Because GEM has backing store, we can limit pinned memory to only those pages needed for the current operation, waiting to pin pages until the device is ready to execute the operation. As I said in my earlier email, that part of the kernel driver is not written yet. I was hoping to get that finished before launching into this discussion, as it is always better to argue with running code.

> This means that the dependency on SHMEMFS probably needs to be dropped
> and replaced with some sort of DRMFS that allows overloading of mmap,
> handles swapping correctly, addresses the caching issue, and also avoids
> the do_mmap() call in the driver.

Because GEM doesn't expose the use of shmfs to the user, there's no requirement that all objects use this abstraction. You could even have multiple object creation functions if that made sense in your driver.

--
kei...@in... |
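The two submission paths Keith compares, sketched from userspace. The drm_gem_mmap()/drm_gem_pwrite() wrappers are hypothetical stand-ins, and the cache-behavior comments reflect Keith's description rather than measured fact:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

int drm_gem_pwrite(int fd, uint32_t handle, const void *src,
                   uint64_t offset, uint64_t size);		/* hypothetical */
void *drm_gem_mmap(int fd, uint32_t handle, uint64_t size);	/* hypothetical */

/* Path A: map, write, unmap. Either the mapping is WC/UC (set up with
 * inter-processor interrupts) or it is WB and the kernel must clflush
 * the dirtied lines before the GPU reads them. */
static void submit_mapped(int fd, uint32_t handle,
                          const uint32_t *batch, size_t bytes)
{
	void *ptr = drm_gem_mmap(fd, handle, bytes);

	memcpy(ptr, batch, bytes);
	/* ... execbuffer ioctl; kernel flushes caches before execution ... */
}

/* Path B: pwrite. The kernel copies on the application's behalf, so no
 * user mapping, no clflush pass, and no TLB shootdown is required. */
static void submit_pwrite(int fd, uint32_t handle,
                          const uint32_t *batch, size_t bytes)
{
	drm_gem_pwrite(fd, handle, batch, 0, bytes);
	/* ... execbuffer ioctl ... */
}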
From: Ben S. <sk...@gm...> - 2008-05-14 10:48:09
|
>>> 1) I feel there hasn't been enough open driver coverage to prove it. So far
>>> we have done an Intel IGD, we have a lot of code that isn't required for
>>> these devices, so the question of how much code exists purely to support
>>> poulsbo closed source userspace there is and why we need to live with it.
>>> Both radeon and nouveau developers have expressed frustration about the
>>> fencing internals being really hard to work with which doesn't bode well for
>>> maintainability in the future.
>>
>> OK. So basically what I'm asking is that when we have full-featured open
>> source drivers available that utilize TTM, either as part of DRM core, or,
>> if needed, as part of driver-specific code, do you see anything else that
>> prevents that from being pushed? That would be very valuable to know for
>> anyone starting porting work.
>
> I was hoping that by now, one of the radeon or nouveau drivers would have
> adopted TTM, or at least demoed something working using it. This hasn't
> happened, which worries me; perhaps glisse or darktama could fill in on
> what limited them from doing it. The fencing internals are very very scary
> and seem to be a major stumbling block.

The fencing internals do seem overly complicated indeed, but that's something I'm personally OK with taking the time to figure out how to get right. Is there any good documentation around that describes them in detail?

I actually started working on nouveau/ttm again a month or so back, with the intention of actually having the work land this time. Overall, I don't have much problem with TTM and would be willing to work with it. Supporting G8x/G9x chips was the reason the work stalled again; I wasn't sure at the time what requirements we'd have from a memory manager.

The issue on G8x is that the 3D engine will refuse to render to linear surfaces, and in order to set up tiling we need to make use of a channel's page tables. The driver doesn't get any control when VRAM is allocated, so it can't set up the page tables appropriately, etc. I just had the thought that the driver-specific validation ioctl could probably handle that at the last minute, so perhaps that's also not an issue. I'll look more into G8x/ttm after I finish my current G8x work.

Another minor issue (which probably doesn't affect merging?): Nouveau makes extensive use of fence classes; we assign one fence class to each GPU channel (read: context + command submission mechanism). We have 128 of these on G80 cards; the current _DRM_FENCE_CLASSES is 8, which is insufficient even for NV1x hardware.

So overall, I'm basically fine with TTM now that I've actually made a proper attempt at using it. GEM does seem interesting; I'll also follow its development while I continue with other non-mm G80 work.

Cheers,
Ben.

> I do worry that TTM is not Linux enough; it seems you have decided that we
> can never do in-kernel allocations at any useable speed and punted the
> work into userspace, which makes life easier for Gallium as it's more like
> what Windows does, but I'm not sure this is a good solution for Linux.
>
> The real question is whether TTM suits the driver writers for use in Linux
> desktop and embedded environments, and I think so far I'm not seeing
> enough positive feedback from the desktop side.
> Also wrt the i915 driver, it has too many experiments in it; the i915
> users need to group together, remove the codepaths that make no sense,
> come up with a suitable userspace driver for it, and remove all unused
> fencing mechanisms, etc.
>
> Dave. |
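To illustrate Ben's fence-class point: assuming the TTM fence code sizes its per-class state from _DRM_FENCE_CLASSES, one class per G80 channel would look roughly like this (the nouveau_channel layout is invented for the sketch):

#define _DRM_FENCE_CLASSES 128	/* was 8; one class per GPU channel on G80 */

struct nouveau_channel {	/* invented for the sketch */
	unsigned int id;	/* 0..127 on G80 */
	unsigned int fence_class;
};

static void nouveau_channel_init_fencing(struct nouveau_channel *chan)
{
	/* Each channel is an independent command-submission stream, so
	 * each gets its own fence class, i.e. its own fence ordering. */
	chan->fence_class = chan->id;
}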
From: Thomas H. <th...@tu...> - 2008-05-14 15:20:14
|
Ben Skeggs wrote:
>>>> 1) I feel there hasn't been enough open driver coverage to prove it. So far
>>>> we have done an Intel IGD, we have a lot of code that isn't required for
>>>> these devices, so the question of how much code exists purely to support
>>>> poulsbo closed source userspace there is and why we need to live with it.
>>>> Both radeon and nouveau developers have expressed frustration about the
>>>> fencing internals being really hard to work with which doesn't bode well for
>>>> maintainability in the future.
>>>
>>> OK. So basically what I'm asking is that when we have full-featured open
>>> source drivers available that utilize TTM, either as part of DRM core, or,
>>> if needed, as part of driver-specific code, do you see anything else that
>>> prevents that from being pushed? That would be very valuable to know for
>>> anyone starting porting work.
>>
>> I was hoping that by now, one of the radeon or nouveau drivers would have
>> adopted TTM, or at least demoed something working using it. This hasn't
>> happened, which worries me; perhaps glisse or darktama could fill in on
>> what limited them from doing it. The fencing internals are very very scary
>> and seem to be a major stumbling block.
>
> The fencing internals do seem overly complicated indeed, but that's
> something I'm personally OK with taking the time to figure out how to get
> right. Is there any good documentation around that describes them in
> detail?

Yes, there is a wiki page: http://dri.freedesktop.org/wiki/TTMFencing

> I actually started working on nouveau/ttm again a month or so back, with
> the intention of actually having the work land this time. Overall, I
> don't have much problem with TTM and would be willing to work with it.
> Supporting G8x/G9x chips was the reason the work stalled again; I wasn't
> sure at the time what requirements we'd have from a memory manager.
>
> The issue on G8x is that the 3D engine will refuse to render to linear
> surfaces, and in order to set up tiling we need to make use of a
> channel's page tables. The driver doesn't get any control when VRAM is
> allocated, so it can't set up the page tables appropriately, etc. I just
> had the thought that the driver-specific validation ioctl could probably
> handle that at the last minute, so perhaps that's also not an issue. I'll
> look more into G8x/ttm after I finish my current G8x work.
>
> Another minor issue (which probably doesn't affect merging?): Nouveau
> makes extensive use of fence classes; we assign one fence class to each
> GPU channel (read: context + command submission mechanism). We have 128
> of these on G80 cards; the current _DRM_FENCE_CLASSES is 8, which is
> insufficient even for NV1x hardware.

Ouch. Yes, it should be OK to bump that as long as kmalloc doesn't complain.

> So overall, I'm basically fine with TTM now that I've actually made a
> proper attempt at using it. GEM does seem interesting; I'll also follow
> its development while I continue with other non-mm G80 work.
>
> Cheers,
> Ben.

Nice to know, Ben. Anyway, whatever happens, the fencing code will remain for some drivers, either device-specific or common, so if you find ways to simplify it or things that don't look right, please let me know.

/Thomas |
From: Jerome G. <gl...@fr...> - 2008-05-14 17:09:11
|
On Wed, 14 May 2008 16:36:54 +0200 Thomas Hellström <th...@tu...> wrote:
> I don't agree with you here. EXA is much faster for small composite
> operations, and even for small fill blits, if fallbacks are used -- even
> to write-combined memory, though that of course depends on the hardware.
> This is going to be even more pronounced with acceleration architectures
> like Glucose and similar, which don't have an optimized path for small
> hardware composite operations.
>
> My personal feeling is that pwrites are a workaround for a workaround for
> a very bad decision: to avoid user-space allocators on device-mapped
> memory. This led to a hack to avoid caching-policy changes, which led to
> cache-thrashing problems, which put us in the current situation. How far
> are we going to follow this path before people wake up? What's wrong with
> the performance of good old i915tex, which even beats "classic" i915 in
> many cases?
>
> Having to go through potentially (and even probably) paged-out memory to
> access buffers that are present in VRAM sounds like a very odd approach
> (to say the least) to me. Even if it's a single page, implementing
> per-page dirty checks for domain flushing isn't very appealing either.

I don't have numbers or benchmarks to check how fast the pread/pwrite path might be in this use, so I am just expressing my feeling, which is that we should avoid VMA TLB flushes as much as we can. I get the feeling that the kernel goes through numerous tricks to avoid TLB flushing for good reason, and I am also pretty sure that, with the number of cores growing, anything that needs CPU-wide synchronization is to be avoided. Hopefully, once I get a decent amount of time to benchmark GEM, I will check my theory.

I think a simple benchmark can be done on Intel hardware: just return FALSE in EXA PrepareAccess to force use of DownloadFromScreen, and in DownloadFromScreen use pread; then comparing this hacked Intel DDX against a normal one should already give some numbers.

> Why should we have to, when we can do it right?

Well, my point was that mapping VRAM is not right. I am not saying that I know the truth; it's just a feeling based on my experiments with TTM, on the BAR restriction stuff, and on other considerations of the same kind.

> No. GEM can't cope with it. Let's say you have a 512M system with two 1G
> video cards, 4G of swap space, and you want to fill both cards' video RAM
> with render-and-forget textures for whatever purpose.
>
> What happens? After you've generated the first, say, 300M, the system
> mysteriously starts to page, and when, after a couple of minutes of
> crawling texture-upload speeds, you're done, the system is using, and has
> written, almost 2G of swap. Now you want to update the textures and
> expect fast texsubimage...
>
> So having a backing object that you have to access to get things into
> VRAM is not the way to go. The correct way to do this is to reserve, but
> not use, swap space. Then you can start using it on suspend, provided
> that the swapping system is still up (which it has to be with the current
> GEM approach anyway). If pwrite is used in this case, it must not dirty
> any backing-object pages.

For a normal desktop I don't expect the VRAM amount to exceed the RAM amount; people with 1G of VRAM are usually hardcore gamers with 4G of RAM :).
Also, most objects in the 3D world are stored in memory; if programs are not stupid and trust GL to keep their textures, then you just have the usual RAM copy and possibly a VRAM copy, so I don't see any waste in the normal use case. Of course we can always come up with crazy, weird setups, but I am more interested in dealing well with the average Joe than dealing mostly well with every use case.

That said, I do see GPGPU as a possible user of big temporary VRAM buffers, i.e. buffers you can throw away. For that kind of thing it does make sense not to have a backing RAM/swap area. But I would rather add something to GEM, like intercepting the allocation of such buffers and not creating a backing buffer, or adding a driver-specific ioctl for that case.

Anyway, I think we need benchmarks to know which option is really best in the end. I don't have code to support my general feeling, so I might be wrong. Sadly we don't have 2^32 monkeys coding day and night on DRM to test all solutions :)

Cheers,
Jerome Glisse <gl...@fr...> |
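A sketch of the benchmark hack Jerome suggests earlier in this mail. EXA's PrepareAccess/DownloadFromScreen driver hooks are real, but drm_fd, pixmap_handle() and gem_pread() are assumptions standing in for whatever the hacked Intel DDX would use:

#include <stdint.h>
#include "exa.h"	/* X server EXA driver interface */

extern int drm_fd;					/* assumption */
extern uint32_t pixmap_handle(PixmapPtr pPix);		/* assumption */
extern int gem_pread(int fd, uint32_t handle, void *dst,
		     uint64_t offset, uint64_t size);	/* assumption */

static Bool hacked_prepare_access(PixmapPtr pPix, int index)
{
	return FALSE;	/* never map; force the DownloadFromScreen path */
}

static Bool hacked_download_from_screen(PixmapPtr pSrc, int x, int y,
					int w, int h,
					char *dst, int dst_pitch)
{
	int cpp = pSrc->drawable.bitsPerPixel / 8;
	unsigned long src_pitch = exaGetPixmapPitch(pSrc);
	int line;

	/* Read only the requested rectangle, one span per line, through
	 * pread -- no VMA mapping and hence no TLB flushing. */
	for (line = 0; line < h; line++) {
		uint64_t offset = (uint64_t)(y + line) * src_pitch +
				  (uint64_t)x * cpp;

		if (gem_pread(drm_fd, pixmap_handle(pSrc),
			      dst + line * dst_pitch, offset,
			      (uint64_t)w * cpp))
			return FALSE;
	}
	return TRUE;
}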
From: Keith P. <ke...@ke...> - 2008-05-14 19:13:52
|
On Wed, 2008-05-14 at 19:08 +0200, Jerome Glisse wrote:
> I don't have numbers or benchmarks to check how fast the pread/pwrite
> path might be in this use, so I am just expressing my feeling, which is
> that we should avoid VMA TLB flushes as much as we can.

For batch buffers, pwrite is 3X faster than map/write/unmap, at least as measured by that most estimable benchmark 'glxgears'. Take that with as much skepticism as it deserves.

--
kei...@in... |
From: Eric A. <er...@an...> - 2008-05-14 18:31:44
|
On Wed, 2008-05-14 at 02:33 +0200, Thomas Hellström wrote:
>> The real question is whether TTM suits the driver writers for use in
>> Linux desktop and embedded environments, and I think so far I'm not
>> seeing enough positive feedback from the desktop side.
>
> I actually haven't seen much feedback at all, at least not on the mailing
> lists. Anyway, we need to look at the alternatives, which currently means
> GEM.
>
> GEM, while still in development, basically brings us back to the
> functionality of TTM 0.1, with added paging support but without
> fine-grained locking and caching-policy support.
>
> I might have misunderstood things, but quickly browsing the code raises
> some obvious questions:
>
> 1) Some AGP chipsets don't support page addresses > 32 bits. GEM objects
> use GFP_HIGHUSER, and it's hardcoded into the Linux swap code.

The obvious solution here is what many DMA APIs do for IOMMUs that can't address all of memory -- keep a pool of pages within the addressable range and bounce data through them. I think the Linux kernel even has interfaces to support us in this. Since it's not going to be a very common case, we may not care about the performance. If we do find that we care about the performance, we should first attempt to get what we need into the Linux kernel so we don't have to duplicate code, and only if that fails do the duplication.

I'm pretty sure the danger of AGP chipsets versus >32-bit pages has been overstated, though. Besides the fact that you'd need to load one of these older machines with a full 4GB of memory (well, theoretically 3.5GB, but how often can you even boot a system with a 2, 1, .5GB combo?), you also need a chipset that does >32-bit addressing. At least all the AMD and Intel chipsets in the survey I did last night don't appear to have this problem, as they've either got a >32-bit chipset with a >32-bit GART, or a 32-bit chipset with a 32-bit GART. Basically all I'm worried about is ATI PCI[E]GART at this point.

http://dri.freedesktop.org/wiki/GARTAddressingLimits

<snip bits that have been covered in other mails>

> 5) What's protecting i915 GEM object privates and lists in a
> multi-threaded environment?

Nothing at the moment. That's my current project. dev->struct_mutex is the plan -- I don't want to see finer-grained locking until we show that contention on that lock is an issue. Fine-grained locking takes significant care, and there are a lot more important performance improvements to work on before then.

> 6) Isn't do_mmap() strictly forbidden in new drivers? I remember seeing
> some severe ranting about it on the lkml?

We've talked it over with Arjan, and until we can use real fds as our handles to objects, he thought it sounded OK. But apparently Al Viro is working on making it feasible for us to allocate a thousand fds. At that point the mmap/pread/pwrite/close ioctls could be replaced with the syscalls they were named for, and the kernel guys will love us.

> TTM is designed to cope with most hardware quirks I've come across with
> different chipsets so far, including Intel UMA, Unichrome, Poulsbo, and
> some other ones. GEM basically leaves it up to the driver writer to
> reinvent the wheel.

The problem with TTM is that it's designed to expose one general API for all hardware, when that's not what our drivers want. The GPU-GPU cache handling for Intel, for example, mapped the hardware so poorly that every batch just flushed everything.
Bolting on the clflush-based CPU-GPU caching management for our platform recovered a lot of performance, but we're still having to reuse buffers in userland at a memory cost, because allocating buffers is overly expensive with the general supports-everybody (but oops, it's not swappable!) object allocator.

We're trying to come at it from the other direction: implement one driver well. When someone else implements another driver and finds that there's code that should be common, make it into a support library and share it.

I actually would have liked the whole interface to userland to be driver-specific, with a support library for the parts we think other people would want, but DRI2 wants to use buffer objects for its shared-memory transport and I didn't want to rock its boat too hard, so the ioctls that should be supportable for everyone got moved to generic.

If the implementation of those ioctls in generic code doesn't work for some drivers (say, early shmfs object creation turns out to be a bad idea for VRAM drivers), I'll happily push it out to the driver.

--
Eric Anholt anholt@FreeBSD.org er...@an... eri...@in... |
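A sketch of the bounce-pool approach Eric mentions for GARTs limited to 32-bit page addresses, using standard kernel allocation and kmap helpers; the function names are invented, and a real implementation would likely reuse the kernel's existing bounce/IOMMU machinery instead:

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/string.h>

static struct page *gart_alloc_bounce_page(void)
{
	/* GFP_DMA32 guarantees a page the 32-bit-limited GART can address. */
	return alloc_page(GFP_KERNEL | GFP_DMA32);
}

static void gart_bounce_in(struct page *bounce, struct page *src)
{
	/* Copy the object's (possibly high) page through the low page
	 * before binding the low page into the GART. */
	void *d = kmap(bounce);
	void *s = kmap(src);

	memcpy(d, s, PAGE_SIZE);
	kunmap(src);
	kunmap(bounce);
}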
From: Thomas H. <th...@tu...> - 2008-05-14 19:05:09
|
Eric Anholt wrote:
> On Wed, 2008-05-14 at 02:33 +0200, Thomas Hellström wrote:
>>> The real question is whether TTM suits the driver writers for use in
>>> Linux desktop and embedded environments, and I think so far I'm not
>>> seeing enough positive feedback from the desktop side.
>>
>> I actually haven't seen much feedback at all, at least not on the
>> mailing lists. Anyway, we need to look at the alternatives, which
>> currently means GEM.
>>
>> GEM, while still in development, basically brings us back to the
>> functionality of TTM 0.1, with added paging support but without
>> fine-grained locking and caching-policy support.
>>
>> I might have misunderstood things, but quickly browsing the code raises
>> some obvious questions:
>>
>> 1) Some AGP chipsets don't support page addresses > 32 bits. GEM objects
>> use GFP_HIGHUSER, and it's hardcoded into the Linux swap code.
>
> The obvious solution here is what many DMA APIs do for IOMMUs that can't
> address all of memory -- keep a pool of pages within the addressable
> range and bounce data through them. I think the Linux kernel even has
> interfaces to support us in this. Since it's not going to be a very
> common case, we may not care about the performance. If we do find that
> we care about the performance, we should first attempt to get what we
> need into the Linux kernel so we don't have to duplicate code, and only
> if that fails do the duplication.
>
> I'm pretty sure the danger of AGP chipsets versus >32-bit pages has been
> overstated, though. Besides the fact that you'd need to load one of
> these older machines with a full 4GB of memory (well, theoretically
> 3.5GB, but how often can you even boot a system with a 2, 1, .5GB
> combo?), you also need a chipset that does >32-bit addressing. At least
> all the AMD and Intel chipsets in the survey I did last night don't
> appear to have this problem, as they've either got a >32-bit chipset
> with a >32-bit GART, or a 32-bit chipset with a 32-bit GART. Basically
> all I'm worried about is ATI PCI[E]GART at this point.
>
> http://dri.freedesktop.org/wiki/GARTAddressingLimits
>
> <snip bits that have been covered in other mails>

There will probably turn up a couple more devices or incomplete drivers, but in the long run this is a fixable problem.

>> 5) What's protecting i915 GEM object privates and lists in a
>> multi-threaded environment?
>
> Nothing at the moment. That's my current project. dev->struct_mutex is
> the plan -- I don't want to see finer-grained locking until we show that
> contention on that lock is an issue. Fine-grained locking takes
> significant care, and there are a lot more important performance
> improvements to work on before then.
>
>> 6) Isn't do_mmap() strictly forbidden in new drivers? I remember seeing
>> some severe ranting about it on the lkml?
>
> We've talked it over with Arjan, and until we can use real fds as our
> handles to objects, he thought it sounded OK. But apparently Al Viro is
> working on making it feasible for us to allocate a thousand fds. At that
> point the mmap/pread/pwrite/close ioctls could be replaced with the
> syscalls they were named for, and the kernel guys will love us.
>
>> TTM is designed to cope with most hardware quirks I've come across with
>> different chipsets so far, including Intel UMA, Unichrome, Poulsbo, and
>> some other ones. GEM basically leaves it up to the driver writer to
>> reinvent the wheel.
> The problem with TTM is that it's designed to expose one general API for
> all hardware, when that's not what our drivers want. The GPU-GPU cache
> handling for Intel, for example, mapped the hardware so poorly that
> every batch just flushed everything. Bolting on the clflush-based
> CPU-GPU caching management for our platform recovered a lot of
> performance, but we're still having to reuse buffers in userland at a
> memory cost, because allocating buffers is overly expensive with the
> general supports-everybody (but oops, it's not swappable!) object
> allocator.

Swapping drmBOs is a couple of days' implementation work plus some core kernel exports. It's just that someone needs to find the time, and the right person to talk to in the right way, to get certain swapping functions exported.

> We're trying to come at it from the other direction: implement one
> driver well. When someone else implements another driver and finds that
> there's code that should be common, make it into a support library and
> share it.
>
> I actually would have liked the whole interface to userland to be
> driver-specific, with a support library for the parts we think other
> people would want, but DRI2 wants to use buffer objects for its
> shared-memory transport and I didn't want to rock its boat too hard, so
> the ioctls that should be supportable for everyone got moved to generic.
>
> If the implementation of those ioctls in generic code doesn't work for
> some drivers (say, early shmfs object creation turns out to be a bad
> idea for VRAM drivers), I'll happily push it out to the driver.

That would basically make the rest of my issues non-issues, as it would allow us, without too much effort, to reuse much of the TTM functionality over the GEM interface, and then everybody's hopefully happy.

/Thomas |
From: Alex D. <ale...@gm...> - 2008-05-14 19:18:45
|
On Wed, May 14, 2008 at 2:30 PM, Eric Anholt <er...@an...> wrote:
> On Wed, 2008-05-14 at 02:33 +0200, Thomas Hellström wrote:
>>> The real question is whether TTM suits the driver writers for use in
>>> Linux desktop and embedded environments, and I think so far I'm not
>>> seeing enough positive feedback from the desktop side.
>>
>> I actually haven't seen much feedback at all, at least not on the
>> mailing lists. Anyway, we need to look at the alternatives, which
>> currently means GEM.
>>
>> GEM, while still in development, basically brings us back to the
>> functionality of TTM 0.1, with added paging support but without
>> fine-grained locking and caching-policy support.
>>
>> I might have misunderstood things, but quickly browsing the code raises
>> some obvious questions:
>>
>> 1) Some AGP chipsets don't support page addresses > 32 bits. GEM objects
>> use GFP_HIGHUSER, and it's hardcoded into the Linux swap code.
>
> The obvious solution here is what many DMA APIs do for IOMMUs that can't
> address all of memory -- keep a pool of pages within the addressable
> range and bounce data through them. I think the Linux kernel even has
> interfaces to support us in this. Since it's not going to be a very
> common case, we may not care about the performance. If we do find that
> we care about the performance, we should first attempt to get what we
> need into the Linux kernel so we don't have to duplicate code, and only
> if that fails do the duplication.
>
> I'm pretty sure the danger of AGP chipsets versus >32-bit pages has been
> overstated, though. Besides the fact that you'd need to load one of
> these older machines with a full 4GB of memory (well, theoretically
> 3.5GB, but how often can you even boot a system with a 2, 1, .5GB
> combo?), you also need a chipset that does >32-bit addressing. At least
> all the AMD and Intel chipsets in the survey I did last night don't
> appear to have this problem, as they've either got a >32-bit chipset
> with a >32-bit GART, or a 32-bit chipset with a 32-bit GART. Basically
> all I'm worried about is ATI PCI[E]GART at this point.

AMD PCIE and IGP GARTs support 40 bits (Dave just committed support this morning), so we should be fine on r3xx and newer PCIE cards.

Alex |
From: Keith P. <ke...@ke...> - 2008-05-14 19:10:41
|
On Wed, 2008-05-14 at 16:36 +0200, Thomas Hellström wrote: > My personal feeling is that pwrites are a workaround for a workaround > for a very bad decision Feel free to map VRAM then if you can; I didn't need to on Intel as there isn't any difference. -- kei...@in... |
From: Thomas H. <th...@tu...> - 2008-05-14 19:42:36
|
Keith Packard wrote:
> On Wed, 2008-05-14 at 16:36 +0200, Thomas Hellström wrote:
>
>> My personal feeling is that pwrites are a workaround for a workaround
>> for a very bad decision
>
> Feel free to map VRAM then if you can; I didn't need to on Intel as
> there isn't any difference.

By mapping device memory on UMA devices I'm referring to mapping through the GTT aperture, either as stolen memory, pre-bound GTT pools, or simply buffer-object memory temporarily bound to the GTT. As you've previously mentioned, this requires caching-policy changes and needs to be used with some care.

/Thomas |
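For reference, a minimal sketch of the kernel side of such a GTT-aperture mapping: remap a physical aperture range into a user VMA with a write-combining policy. Names and offsets are placeholders, and a real driver must also coordinate with bind/unbind and fault handling:

#include <linux/mm.h>

static int map_gtt_range(struct vm_area_struct *vma,
			 unsigned long aperture_base,	/* bus address of the GTT BAR */
			 unsigned long offset)		/* object offset within the aperture */
{
	unsigned long pfn = (aperture_base + offset) >> PAGE_SHIFT;

	/* The caching-policy change Thomas refers to: the aperture must
	 * not be mapped cacheable, so use WC (or UC where WC is absent). */
	vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);
	return io_remap_pfn_range(vma, vma->vm_start, pfn,
				  vma->vm_end - vma->vm_start,
				  vma->vm_page_prot);
}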