From: Alexander S. <Ale...@at...> - 2001-08-31 14:07:07
Hello Pontus,

Total lockups are generally hard to solve. Yes, I am searching for a reliable method to work on them myself, but I haven't found one yet.

If you can tune features at a fine scale and trigger any positive effect, then you might have won in locating the reason. But often fine tuning is not available. Rather, you throw code with heavily different concepts at the same hardware, so "fine tuning" won't apply.

The problem is that hard lockups are more likely a sign of hardware problems than of system corruption. I want to say, a bus lockup will surely freeze your box. Expect that to happen if you provide wrong data to your adapter via DMA, or if there are sequences of actions that are not tolerated that way. Lack of locking in a multiprocessing/multithreading environment is a common reason. If you look into the X server's code, you will see multiple software- and hardware-related locks.

You can only track how it comes to the hang, but nothing more. But as usual, as soon as you start adding some debug prints, the stall might no longer happen, or you will not get all the messages up to the point where the stall happens, due to buffers in the system. And last but not least, such errors might not even all be the same, and might not really happen synchronously to the piece of software that raised the problem.

My last-but-one approach was parallel port debugging (meaning, setting a bit pattern for each code component). Just a set of LEDs and resistors soldered onto a SUB-D connector. But in my case it wasn't delivering any hints.

Other methods would be duplicating/logging DMA buffers to some external storage in the driver's code. But this again raises the timing and reproducibility problem.

Logic analyzers? Not even they are that helpful. Apart from their complexity to apply, there are always limitations on what you can track and on determining what really goes on at the interface between system and adapters.
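
On the point above about debug messages getting stuck in buffers before the stall: one way to reduce that loss is to write each message unbuffered and force it to disk before continuing. The following is only a minimal sketch of that idea, written in Python for brevity (os.write and os.fsync map directly onto the POSIX write() and fsync() calls you would use from C). The function names, the "component:step" record format and the component names are hypothetical, not taken from the DRI or XFree86 code. It also does not escape the timing problem: every fsync slows the code down and may make the lockup disappear.

```python
import os
import tempfile


def open_checkpoint_log(path):
    """Open an append-only log file (hypothetical helper).

    O_APPEND makes every write land at the current end of the file.
    """
    return os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)


def checkpoint(fd, component, step):
    """Write one short record and wait until the kernel has flushed it."""
    record = f"{component}:{step}\n".encode("ascii")
    os.write(fd, record)  # no user-space buffering: goes straight to the kernel
    os.fsync(fd)          # ask the kernel to push this file's data to the device


def last_checkpoint(path):
    """After a reboot, return the last complete record, or None.

    Only newline-terminated records count; a trailing fragment from an
    interrupted write is ignored.
    """
    with open(path, "rb") as f:
        data = f.read()
    complete = [line for line in data.split(b"\n")[:-1] if line]
    return complete[-1].decode("ascii") if complete else None


if __name__ == "__main__":
    # Demo in a temporary directory; component names are made up.
    with tempfile.TemporaryDirectory() as tmp:
        log_path = os.path.join(tmp, "checkpoints.log")
        fd = open_checkpoint_log(log_path)
        checkpoint(fd, "dma_setup", 1)
        checkpoint(fd, "ring_submit", 2)
        os.close(fd)
        print(last_checkpoint(log_path))
```

After a freeze and reboot, last_checkpoint() tells you the last record that was completely written, which is at best a hint about how far the code got. A disk with its own write cache can still lose the final record.
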
The simulator approach (big cubes that act like some hardware through ASIC rule programming, but e.g. at 1/100th the speed) isn't really an option, because of their limited availability (including the ASIC rules) and because of the incompleteness of simulating any sort of glitches in a complex system like a PC.

Concerning your description: the phenomenon sounds rather comparable to things I have seen myself recently. I am counting on you finding the solution and winning the Nobel Prize of computer science... *just kidding*

It's just a problem of information: the computer does not tell you what went wrong, even if it is obvious that something went wrong. And of course it takes ages to turn around, even with a journaling filesystem.

I'd surely like other folks to join this discussion, but there might be only a few.

Regards
AlexS.

PS: If I knew about any problems of that magnitude in common code, I wouldn't hesitate to tell anybody about it. It's even to my benefit if it's fixed in the upcoming releases of DRI and XFree86.

> -----Original Message-----
> From: Pontus Hedman [mailto:rp...@ve...]
> Sent: Friday, August 31, 2001 02:41
> To: dri...@li...
> Subject: [Dri-devel] How do I debug a lock-the-box crash?
>
>
> I'm using the DRI X from recent CVS with a 2.4.6 kernel,
> with a Rage Fury 128 card. Everything works just great,
> except that many 3D apps cause the machine to lock up
> solid, more or less at random. I'm talking keyboard
> unresponsive (no numlock or alt-sysrq-b reaction)
> and no response to pings.
>
> The most reliable way to cause the lockup is to run
> FlightGear and switch focus between its window and
> some other window rapidly.
>
> I'm at a loss as to how to even start to debug this,
> since the box locks up without any hint about what's wrong.
> Any suggestions? Or is it more likely that my
> QDI Advance 9 motherboard is a flaky piece of junk?
>
> Pontus
>
> _______________________________________________
> Dri-devel mailing list
> Dri...@li...
> http://lists.sourceforge.net/lists/listinfo/dri-devel