On Saturday 22 May 2004 16:04, Michel Dänzer wrote:
> On Sat, 2004-05-22 at 14:04, Nicolai Haehnle wrote:
> > It seems to me as if DRM(unlock) in drm_drv.h unlocks without checking
> > whether the caller actually holds the global lock. There is no
> > LOCK_TEST_WITH_RETURN or similar, and the helper function lock_transfer
> > has no check in it either.
> > Did I miss something, or is this intended behaviour? It certainly seems
> > strange to me.
> True. Note that the lock ioctls are only used on contention, but still.
Unless I'm mistaken, DRM(lock) is always called when a client wants the lock
for the first time (or when it needs to re-grab after it lost the lock).
This is necessary because the DRM makes sure that dev->lock.filp matches
the "calling" file. Afterwards, the ioctls are only used on contention.
The entire locking can be subverted anyway, because part of the lock is in
userspace. I believe the important thing is to make sure that the X server
can force a return into a sane locking state.
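
To illustrate the kind of check discussed above, here is a minimal userspace
sketch in C of an unlock path that refuses callers who do not hold the lock.
It is not the DRM code: struct sketch_lock, SKETCH_LOCK_HELD and both
functions are hypothetical names. The only idea borrowed from the DRM is that
the kernel remembers the owning file (as dev->lock.filp does) next to a lock
word that is shared with userspace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; not the real DRM structures. */
#define SKETCH_LOCK_HELD   0x80000000u              /* top bit: lock is held */
#define SKETCH_LOCK_CTX(w) ((w) & ~SKETCH_LOCK_HELD) /* low bits: context id */

struct sketch_lock {
    unsigned int word;  /* would live in memory shared with userspace */
    const void *filp;   /* kernel-side record of the owning file */
};

/* Take the lock on behalf of (ctx, filp). Returns false if it is contended.
 * ctx is assumed to fit below the HELD bit. */
static bool sketch_lock_take(struct sketch_lock *l, unsigned int ctx,
                             const void *filp)
{
    assert(l != NULL);
    if (l->word & SKETCH_LOCK_HELD)
        return false;
    l->word = ctx | SKETCH_LOCK_HELD;
    l->filp = filp;
    return true;
}

/* Unlock only if the caller really is the holder.
 * Returns 0 on success, -1 if the lock is free or held by someone else. */
static int sketch_unlock_checked(struct sketch_lock *l, unsigned int ctx,
                                 const void *filp)
{
    assert(l != NULL);
    if (!(l->word & SKETCH_LOCK_HELD))
        return -1;                       /* nothing to unlock */
    if (SKETCH_LOCK_CTX(l->word) != ctx || l->filp != filp)
        return -1;                       /* caller is not the holder */
    l->word = 0;
    l->filp = NULL;
    return 0;
}
```

Since the lock word is writable from userspace, a check like this cannot stop
a malicious client. It only catches an honest client that calls unlock at the
wrong time.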
> > Side question: Is killing the offending DRI client enough? When the
> > process is killed, the /dev/drm fd is closed, which should automatically
> > release the lock. On the other hand, I'm pretty sure that we can't just
> > kill a process immediately (unfortunately, I'm not familiar with process
> > [...] in the kernel). What if, for some reason, the process is in a state
> > where it can't be killed yet?
> We're screwed? :)
Looks like it...
> This sounds like an idea for you to play with, but I'm afraid it won't
> be useful very often in my experience:
> * getting rid of the offending client doesn't help with a wedged
> chip (some way to recover from that would be nice...)
> * it doesn't help if the X server itself spins with the lock held
You were right, of course, while I show my lack of experience with driver
writing. In my case I can get the X server's reset code to run, but some
way through the reset the machine finally locks up completely (no more
networking, no more disk I/O).
I'm curious though, how can a complete lockup like this be caused by the
graphics card? My guess would be that it grabs the PCI/AGP bus forever for
some reason (the dark side of bus mastering, so to speak). Is there
anything else that could be the cause?
> > Side question #2: Is it safe to release the DRM lock in the watchdog?
> > There might be races where the offending DRI client is currently
> > executing an ioctl when the watchdog fires.
> Not sure, but this might not be a problem when just killing the
> offending process?
On the other hand, it might sometimes be useful to be a little bit nicer to
the offending process (see point 4 below).
I had a go at implementing my watchdog idea for Linux, see the attached
patch. It basically works, but I couldn't test it on a system where the DRI
actually works without locking up... *sigh*
Now for some notes:
1. This only affects the DRM for Linux. I don't have an installation of BSD,
and while I know a little bit about the Linux kernel, I don't know anything
about the BSD kernel(s).
2. The timeout cannot be configured yet. I didn't find "prior art" as to how
something like it should be configured, so I'm open to input. For a Linux
driver, adding to the /proc entries seems to be the logical way to go, but
the DRI is very ioctl-centric. Maybe both?
3. Privileged processes may take the hardware lock for an infinite amount of
time. This is necessary because the X server holds the lock when VT is [...]
Currently, "privileged" means capable(CAP_SYS_ADMIN). I would prefer if it
meant "the multiplexing controller process", i.e. the one that
authenticates other processes. Unfortunately, this distinction isn't made
anywhere in the DRM as far as I can see. This means that runaway DRI
clients owned by root aren't killed by the watchdog, either.
4. Keith mentioned single-stepping through a driver, and he does have a
point. Unfortunately, I also believe that it's not that simple.
Suppose an application developer debugs a windowed OpenGL application, on
the local machine, without a dual-head setup. It may sound like a naive
thing to do, but this actually works on Windows (yes, Windows is *a lot*
more stable than Linux/BSD in that respect).
Now suppose she's got a bug in her application (e.g. bad vertex array) that
triggers a segmentation fault inside the GL driver, while the hardware lock
is held. GDB will catch that signal, so the process won't die, which in
turn means that the lock is not released. Thus the developer's machine
locks up unless the watchdog kicks in (of course, the watchdog in its
current form will also frustrate her to no end).
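
For reference, the policy from note 3 boils down to something like the
following. This is a userspace sketch of the decision only, not an excerpt
from the patch: struct sketch_holder, SKETCH_TIMEOUT_TICKS (with its made-up
value of 500) and sketch_watchdog_should_fire are hypothetical names, and the
privileged flag stands in for the result of capable(CAP_SYS_ADMIN).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical timeout; the real value is not configurable yet (note 2). */
#define SKETCH_TIMEOUT_TICKS 500ul

/* Hypothetical description of whoever currently holds the hardware lock. */
struct sketch_holder {
    bool privileged;          /* stands in for capable(CAP_SYS_ADMIN) */
    unsigned long lock_time;  /* tick count when the lock was taken */
};

/* Decide whether the watchdog should act on the current lock holder. */
static bool sketch_watchdog_should_fire(const struct sketch_holder *h,
                                        unsigned long now)
{
    assert(h != NULL);
    if (h->privileged)
        return false;  /* privileged holders may keep the lock forever */
    /* Unsigned subtraction stays correct if the tick counter wraps. */
    return (now - h->lock_time) >= SKETCH_TIMEOUT_TICKS;
}
```

Written this way, the drawback from note 3 is easy to see: a runaway client
owned by root takes the same early return as the X server.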
> Earthling Michel Dänzer | Debian (powerpc), X and DRI developer
> Libre software enthusiast | http://svcs.affero.net/rm.php?r=daenzer