From: Jerome G. <j.g...@gm...> - 2005-05-22 19:00:22
|
Hi, I setup a x86 with radeon 9800 pro or xt, trying to find why it locks. I see little improvement with option no silken mouse can you test and tell me if it dones anythings for you (X -nosilk). My thought on this lockups is that it's similar to the one r200 users report, X taking 100% of CPU waiting for something. I saw a mail from Felix about a lock holding issue will try to dig in mail archive. Anyone have any idea on that ? Could it be the mouse code in xorg ? Or is it in r300_mesa or drm ? I really suspect xorg radeon code... Jerome Glisse |
From: Michel <mi...@da...> - 2005-05-22 19:09:36
|
On Sun, 2005-05-22 at 21:00 +0200, Jerome Glisse wrote: >=20 > I setup a x86 with radeon 9800 pro or xt, trying to find > why it locks. I see little improvement with option no silken > mouse can you test and tell me if it dones anythings for > you (X -nosilk). >=20 > My thought on this lockups is that it's similar to the one > r200 users report, X taking 100% of CPU waiting for > something. I saw a mail from Felix about a lock holding > issue will try to dig in mail archive. Those are common symptoms of a lockup. The same symptoms aren't necessarily due to the same cause though. > Anyone have any idea on that ? Could it be the mouse > code in xorg ? Or is it in r300_mesa or drm ? Most likely IMO. --=20 Earthling Michel D=C3=A4nzer | Debian (powerpc), X and DRI develop= er Libre software enthusiast | http://svcs.affero.net/rm.php?r=3Ddaenzer |
From: Adam K K. <ad...@vo...> - 2005-05-22 20:11:20
|
On Sun, 2005-05-22 at 21:00 +0200, Jerome Glisse wrote: > Hi, > > I setup a x86 with radeon 9800 pro or xt, trying to find > why it locks. I see little improvement with option no silken > mouse can you test and tell me if it dones anythings for > you (X -nosilk). > Sorry, that didn't do any good. Again, I may have gotten a few more minutes before it locked up, but it still locked up rather quickly. I think I remember that there's some way to increase debugging output from the drm module, but can't remember how. I have a serial console, and these lockups rarely take down the whole machine, so I can debug this further if someone wants to point me in the right direction. Adam |
From: Vladimir D. <vo...@mi...> - 2005-05-22 22:11:30
|
> few more minutes before it locked up, but it still locked up > rather quickly. > > I think I remember that there's some way to increase debugging > output from the drm module, but can't remember how. I have a > serial console, and these lockups rarely take down the whole > machine, so I can debug this further if someone wants to point > me in the right direction. insmod drm.ko debug=1 thank you ! Vladimir Dergachev > > Adam > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by Oracle Space Sweepstakes > Want to be the first software developer in space? > Enter now for the Oracle Space Sweepstakes! > http://ads.osdn.com/?ad_id=7412&alloc_id=16344&op=click > -- > _______________________________________________ > Dri-devel mailing list > Dri...@li... > https://lists.sourceforge.net/lists/listinfo/dri-devel > |
From: Aapo T. <ae...@ra...> - 2005-05-23 07:09:41
Attachments:
wfi.diff
|
On Sun, 22 May 2005 21:00:19 +0200 Jerome Glisse <j.g...@gm...> wrote: > Hi, > > I setup a x86 with radeon 9800 pro or xt, trying to find > why it locks. I see little improvement with option no silken > mouse can you test and tell me if it dones anythings for > you (X -nosilk). Try attached patch against radeon_cursor.c if you can. I doubt that actually fixes it because swcursor should of done that... -- Aapo Tahkola |
From: Michel <mi...@da...> - 2005-05-23 16:33:39
|
On Mon, 2005-05-23 at 10:10 +0300, Aapo Tahkola wrote: > On Sun, 22 May 2005 21:00:19 +0200 > Jerome Glisse <j.g...@gm...> wrote: >=20 > > I setup a x86 with radeon 9800 pro or xt, trying to find > > why it locks. I see little improvement with option no silken > > mouse can you test and tell me if it dones anythings for > > you (X -nosilk). >=20 > Try attached patch against radeon_cursor.c if you can. > I doubt that actually fixes it because swcursor should of done that... CURSOR_SWAPPING_START() already syncs the engine in current X.Org HEAD, and wait for idle is useless to harmful in the cursor position functions, at least with silken mouse. So the only thing that could make a difference is the wait for idle before showing or hiding the cursor (which BTW is superfluous in the LoadCursorImage functions, the X server explicitly hides the cursor before and shows it again after those). --=20 Earthling Michel D=C3=A4nzer | Debian (powerpc), X and DRI develop= er Libre software enthusiast | http://svcs.affero.net/rm.php?r=3Ddaenzer |
From: Adam K K. <ad...@vo...> - 2005-05-23 13:21:45
|
Vladimir Dergachev wrote: >> few more minutes before it locked up, but it still locked up >> rather quickly. >> >> I think I remember that there's some way to increase debugging >> output from the drm module, but can't remember how. I have a >> serial console, and these lockups rarely take down the whole >> machine, so I can debug this further if someone wants to point >> me in the right direction. > > > insmod drm.ko debug=1 > > thank you ! > > Vladimir Dergachev This is all the output from the drm module with debugging enabled: http://www.visualtech.com/drm.txt.gz The logging starts with X launching, and ends with UnrealTournament in a lockup. It's about 1.6 megs gzipped, and 54 megs ungzipped, so there's a whole lot of data in there, most of it probably redundant... I haven't had a chane yet to try Aapo's patch against radeon_cursor.c, but I should have some time when I get home this afternoon. Adam |
From: Vladimir D. <vo...@mi...> - 2005-05-23 14:08:49
|
>> insmod drm.ko debug=1 >> >> thank you ! >> >> Vladimir Dergachev > > > This is all the output from the drm module with debugging enabled: > > http://www.visualtech.com/drm.txt.gz Thank you ! By the looks of it the kernel buffer is overflowing with messages. In the past I found useful not to turn drm debugging on, but, rather, insert printk statements in various place in radeon code. This should also provide more information about what is actually going on. Also, it might be interesting to see what happens when you make various calls into NOPs - for example do things work better when you make texture calls into NOPs ? We would not expect the correct output, of course. best Vladimir Dergachev > > The logging starts with X launching, and ends with UnrealTournament > in a lockup. It's about 1.6 megs gzipped, and 54 megs ungzipped, so > there's a whole lot of data in there, most of it probably redundant... > > I haven't had a chane yet to try Aapo's patch against radeon_cursor.c, but > I should have some time when I get home this afternoon. > > Adam > |
From: Nicolai H. <pre...@gm...> - 2005-05-23 16:43:24
|
On Sunday 22 May 2005 21:00, Jerome Glisse wrote: > Hi, >=20 > I setup a x86 with radeon 9800 pro or xt, trying to find > why it locks. I see little improvement with option no silken > mouse can you test and tell me if it dones anythings for > you (X -nosilk). >=20 > My thought on this lockups is that it's similar to the one > r200 users report, X taking 100% of CPU waiting for > something. I saw a mail from Felix about a lock holding > issue will try to dig in mail archive. If I interpret the logs correctly, all those lockups are of the form where= =20 the R300 fails to process the ring buffer any further, i.e. the R300 locks= =20 up. This in turn causes the 3D driver or the X server (depending on the=20 exact circumstances, and probably in a rather random fashion) to wait for=20 the R300 to become idle in an endless loop. The 100% CPU usage is merely caused by the fact that we're polling the chip= =20 instead of doing proper IRQ-based wait-for-idle. > Anyone have any idea on that ? Could it be the mouse > code in xorg ? Or is it in r300_mesa or drm ? I really > suspect xorg radeon code... It is easy to blame the DDX, but the truth is, we just don't know. The=20 people seeing lockups should try to figure out whether there is a direct=20 causal connection between e.g. mouse movements and lockups. If you are in a= =20 fullscreen OpenGL applications, not moving the mouse, no popups occuring=20 from something like a panel applet, and the chip *still* locks up, it is=20 highly unlikely that the DDX is at fault. It is equally likely that the lockup is caused by, say, alignment or=20 wraparound issues of the ring buffer. Note that fglrx always submits commands in indirect buffers, which are=20 stored linearly in physical memory. We, on the other hand, always submit=20 commands into the ring buffer, which is not linear (because it wraps=20 around). Also, fglrx likes to emit NOPs into the command stream sometimes,= =20 though I haven't been able to find an exact pattern in those NOPs. We never= =20 emit NOPs (or do we?). So the fact is: We just don't know whether alignment/wraparound can cause=20 trouble. The emission of NOPs by fglrx is IMO significant evidence that=20 there *are* issues in this area, at least on some chipsets, but it could=20 just be some weird artifact of the fglrx codebase. cu, Nicolai |
From: Adam K K. <ad...@vo...> - 2005-05-23 23:33:03
|
Nicolai Haehnle wrote: >It is easy to blame the DDX, but the truth is, we just don't know. The >people seeing lockups should try to figure out whether there is a direct >causal connection between e.g. mouse movements and lockups. If you are in a >fullscreen OpenGL applications, not moving the mouse, no popups occuring >from something like a panel applet, and the chip *still* locks up, it is >highly unlikely that the DDX is at fault. > > How about if you're in a window, no mouse movement, no popups, running the Q3A demo level three times in a row followed by the UTBench demo twice, and there are no lockups? Does that tell you anything? Vladimir Dergachev wrote: > > In the past I found useful not to turn drm debugging on, but, rather, > insert printk statements in various place in radeon code. This should > also provide more information about what is actually going on. I can't make any promises. My partner already thinks I spend too much time in front of the computer :-) I'll see what I can do, though. Think a printk statement at the start and end of every function? Have any suggestions about what files to start with? Is there some consensus that this is probably a cursor problem now? Adam |
From: Michel <mi...@da...> - 2005-05-25 17:32:10
|
On Mon, 2005-05-23 at 18:45 +0200, Nicolai Haehnle wrote: >=20 > It is equally likely that the lockup is caused by, say, alignment or=20 > wraparound issues of the ring buffer. > Note that fglrx always submits commands in indirect buffers, which are=20 > stored linearly in physical memory. We, on the other hand, always submit=20 > commands into the ring buffer, which is not linear (because it wraps=20 > around). Also, fglrx likes to emit NOPs into the command stream sometimes= ,=20 > though I haven't been able to find an exact pattern in those NOPs. We nev= er=20 > emit NOPs (or do we?). >=20 > So the fact is: We just don't know whether alignment/wraparound can cause= =20 > trouble. The emission of NOPs by fglrx is IMO significant evidence that=20 > there *are* issues in this area, at least on some chipsets, but it could=20 > just be some weird artifact of the fglrx codebase. The NOPs in the ring buffer are there for alignment/performance reasons, they shouldn't affect lockups either way. --=20 Earthling Michel D=C3=A4nzer | Debian (powerpc), X and DRI develop= er Libre software enthusiast | http://svcs.affero.net/rm.php?r=3Ddaenzer |
From: Vladimir D. <vo...@mi...> - 2005-05-24 02:16:59
|
> Vladimir Dergachev wrote: > >> >> In the past I found useful not to turn drm debugging on, but, rather, >> insert printk statements in various place in radeon code. This should >> also provide more information about what is actually going on. > > > I can't make any promises. My partner already thinks I spend too much > time in front of the computer :-) I'll see what I can do, though. Think a > printk statement at the start and end of every function? Have any This is probably overkill and might not be very useful Rather try, at first, to just print a printk logging which command is being executed (r300_cmd.c) - this is not very thorough, but, maybe, there is a pattern. > suggestions about what files to start with? Is there some consensus that > this is probably a cursor problem now? It is still possible that a cursor problem is triggering a lockup caused by something else - like corrupt command sequence (well, not that easy - I put a rudimentary checker to protect against that, but it could be lack of extra end3d) Also, another cause of lockups could be that we need to insert extra NOPs to have the engine wait a certain time after some operations. I tried experimenting with this earlier with little luck, but maybe someone else would be successful. Lastly, it might be a good idea for someone comfortable with oscilloscope to try to see what could be extracted from power supply lines and such - kinda like using an encephalogram - there could be indications that something is wrong before the trigger. best Vladimir Dergachev > > Adam > > > > > ------------------------------------------------------- > This SF.Net email is sponsored by Oracle Space Sweepstakes > Want to be the first software developer in space? > Enter now for the Oracle Space Sweepstakes! > http://ads.osdn.com/?ad_id=7412&alloc_id=16344&op=click > -- > _______________________________________________ > Dri-devel mailing list > Dri...@li... > https://lists.sourceforge.net/lists/listinfo/dri-devel > |
From: Adam K K. <ad...@vo...> - 2005-05-24 16:33:41
|
Vladimir Dergachev wrote: >> Vladimir Dergachev wrote: >> >>> >>> In the past I found useful not to turn drm debugging on, but, rather, >>> insert printk statements in various place in radeon code. This should >>> also provide more information about what is actually going on. >> >> >> >> I can't make any promises. My partner already thinks I spend too much >> time in front of the computer :-) I'll see what I can do, though. >> Think a >> printk statement at the start and end of every function? Have any > > > This is probably overkill and might not be very useful > > Rather try, at first, to just print a printk logging which command is > being executed (r300_cmd.c) - this is not very thorough, but, maybe, > there is a pattern. I added a printk for each function in r300_cmdbuf.c... When Q3A locked up, and the last thing showing up in syslog was r300_pacify. So I added printk's after every line in r300_pacify :-) The last thing in syslog was: OUT_RING( CP_PACKET3( RADEON_CP_NOP, 0 ) ) OUT_RING( 0x0 ) ADVANCE_RING() So it seems to be making it all the way through r300_pacify, which had been called from r300_check_range, from r300_emit_unchecked_state. Here's the sequence: r300_emit_raw r300_emit_packet3 r300_emit_raw r300_emit_unchecked_state r300_check_range r300_emit_unchecked_state r300_check_range r300_pacify RING_LOCALS BEGIN_RING(6) OUT_RING( CP_PACKET0( R300_RB3D_DSTCACHE_CTLSTAT, 0 ) ) OUT_RING( 0xa ) OUT_RING( CP_PACKET0( 0x4f18, 0 ) ) OUT_RING( 0x3 ) OUT_RING( CP_PACKET3( RADEON_CP_NOP, 0 ) ) OUT_RING( 0x0 ) ADVANCE_RING() Does this tell us anything? Adam |
From: Adam K K. <ad...@vo...> - 2005-05-24 16:44:36
|
Adam K Kirchhoff wrote: > Vladimir Dergachev wrote: > >>> Vladimir Dergachev wrote: >>> >>>> >>>> In the past I found useful not to turn drm debugging on, but, rather, >>>> insert printk statements in various place in radeon code. This should >>>> also provide more information about what is actually going on. >>> >>> >>> >>> >>> I can't make any promises. My partner already thinks I spend too much >>> time in front of the computer :-) I'll see what I can do, though. >>> Think a >>> printk statement at the start and end of every function? Have any >> >> >> >> This is probably overkill and might not be very useful >> >> Rather try, at first, to just print a printk logging which command is >> being executed (r300_cmd.c) - this is not very thorough, but, maybe, >> there is a pattern. > > > > I added a printk for each function in r300_cmdbuf.c... When Q3A > locked up, and the last thing showing up in syslog was r300_pacify. > So I added printk's after every line in r300_pacify :-) The last > thing in syslog was: > > OUT_RING( CP_PACKET3( RADEON_CP_NOP, 0 ) ) > OUT_RING( 0x0 ) > ADVANCE_RING() > > So it seems to be making it all the way through r300_pacify, which had > been called from r300_check_range, from r300_emit_unchecked_state. > > Here's the sequence: > r300_emit_raw > r300_emit_packet3 > r300_emit_raw > r300_emit_unchecked_state > r300_check_range > r300_emit_unchecked_state > r300_check_range > r300_pacify > RING_LOCALS > BEGIN_RING(6) > OUT_RING( CP_PACKET0( R300_RB3D_DSTCACHE_CTLSTAT, 0 ) ) > OUT_RING( 0xa ) > OUT_RING( CP_PACKET0( 0x4f18, 0 ) ) > OUT_RING( 0x3 ) > OUT_RING( CP_PACKET3( RADEON_CP_NOP, 0 ) ) > OUT_RING( 0x0 ) > ADVANCE_RING() > > Does this tell us anything? > > Adam > > > By the way, this was consisten in all four lockups I've experienced since making these changes. Adam |
From: Keith W. <ke...@tu...> - 2005-05-24 16:56:17
|
Adam K Kirchhoff wrote: > Adam K Kirchhoff wrote: > >> Vladimir Dergachev wrote: >> >>>> Vladimir Dergachev wrote: >>>> >>>>> >>>>> In the past I found useful not to turn drm debugging on, but, rather, >>>>> insert printk statements in various place in radeon code. This should >>>>> also provide more information about what is actually going on. >>>> >>>> >>>> >>>> >>>> >>>> I can't make any promises. My partner already thinks I spend too much >>>> time in front of the computer :-) I'll see what I can do, though. >>>> Think a >>>> printk statement at the start and end of every function? Have any >>> >>> >>> >>> >>> This is probably overkill and might not be very useful >>> >>> Rather try, at first, to just print a printk logging which command is >>> being executed (r300_cmd.c) - this is not very thorough, but, maybe, >>> there is a pattern. >> >> >> >> >> I added a printk for each function in r300_cmdbuf.c... When Q3A >> locked up, and the last thing showing up in syslog was r300_pacify. >> So I added printk's after every line in r300_pacify :-) The last >> thing in syslog was: >> >> OUT_RING( CP_PACKET3( RADEON_CP_NOP, 0 ) ) >> OUT_RING( 0x0 ) >> ADVANCE_RING() >> >> So it seems to be making it all the way through r300_pacify, which had >> been called from r300_check_range, from r300_emit_unchecked_state. >> >> Here's the sequence: >> r300_emit_raw >> r300_emit_packet3 >> r300_emit_raw >> r300_emit_unchecked_state >> r300_check_range >> r300_emit_unchecked_state >> r300_check_range >> r300_pacify >> RING_LOCALS >> BEGIN_RING(6) >> OUT_RING( CP_PACKET0( R300_RB3D_DSTCACHE_CTLSTAT, 0 ) ) >> OUT_RING( 0xa ) >> OUT_RING( CP_PACKET0( 0x4f18, 0 ) ) >> OUT_RING( 0x3 ) >> OUT_RING( CP_PACKET3( RADEON_CP_NOP, 0 ) ) >> OUT_RING( 0x0 ) >> ADVANCE_RING() >> Does this tell >> us anything? >> >> Adam >> >> >> > > By the way, this was consisten in all four lockups I've experienced > since making these changes. Just because the CPU got to that point doesn't mean that the hardware didn't stall earlier. At very least you'd want to modify ADVANCE_RING() so that it 1) flushes the ring to hardware and 2) waits for hardware to idle, providing an indication that hardware was able to digest that packet and not get sick... Keith |
From: Nicolai H. <pre...@gm...> - 2005-05-24 17:04:52
|
On Tuesday 24 May 2005 18:33, Adam K Kirchhoff wrote: > Vladimir Dergachev wrote: >=20 > >> Vladimir Dergachev wrote: > >> > >>> > >>> In the past I found useful not to turn drm debugging on, but, rather, > >>> insert printk statements in various place in radeon code. This should > >>> also provide more information about what is actually going on.=20 > >> > >> > >> > >> I can't make any promises. My partner already thinks I spend too much > >> time in front of the computer :-) I'll see what I can do, though. =20 > >> Think a > >> printk statement at the start and end of every function? Have any > > > > > > This is probably overkill and might not be very useful > > > > Rather try, at first, to just print a printk logging which command is=20 > > being executed (r300_cmd.c) - this is not very thorough, but, maybe,=20 > > there is a pattern. >=20 >=20 > I added a printk for each function in r300_cmdbuf.c... When Q3A locked=20 > up, and the last thing showing up in syslog was r300_pacify. So I added= =20 > printk's after every line in r300_pacify :-) The last thing in syslog=20 was: >=20 > OUT_RING( CP_PACKET3( RADEON_CP_NOP, 0 ) ) > OUT_RING( 0x0 ) > ADVANCE_RING() >=20 > So it seems to be making it all the way through r300_pacify, which had=20 > been called from r300_check_range, from r300_emit_unchecked_state. >=20 > Here's the sequence: > r300_emit_raw > r300_emit_packet3 > r300_emit_raw > r300_emit_unchecked_state > r300_check_range > r300_emit_unchecked_state > r300_check_range > r300_pacify > RING_LOCALS > BEGIN_RING(6) > OUT_RING( CP_PACKET0( R300_RB3D_DSTCACHE_CTLSTAT, 0 ) ) > OUT_RING( 0xa ) > OUT_RING( CP_PACKET0( 0x4f18, 0 ) ) > OUT_RING( 0x3 ) > OUT_RING( CP_PACKET3( RADEON_CP_NOP, 0 ) ) > OUT_RING( 0x0 ) > ADVANCE_RING() > =20 >=20 > Does this tell us anything? Unfortunately, I don't think so. The thing is, all those OUT_RING and=20 ADVANCE_RING commands do not really call into the hardware immediately; all= =20 they do is write stuff to the ring buffer, but the ring buffer is just some= =20 memory area without any magic of its own. Only a call to COMMIT_RING will tell the hardware that new commands are=20 waiting in the ring buffer, and the only thing we do know is that=20 *something* in the ring buffer before the last COMMIT_RING causes the chip= =20 to hang. So another possible way to investigate this could be: =2D Call radeon_do_wait_for_idle() at the end of the COMMIT_RING macro, and= =20 define RADEON_FIFO_DEBUG (this will print out additional information when=20 wait_for_idle fails) =2D Increasingly add COMMIT_RING macros into r300_cmdbuf.c to pinpoint the= =20 exact location of the problem, if at all possible. It would be very helpful if you could single out one command we send using= =20 this procedure. Note that in the worst case (depending on the actual nature of the lockup i= n=20 hardware), those debugging changes could actually *remove* the lockup (e.g.= =20 because they remove a race condition that caused the lockup in the first=20 place). cu, Nicolai |
From: Jerome G. <j.g...@gm...> - 2005-05-24 18:00:13
|
I was doing some heavy debugging of all things that might be send (allmost every debug flag + self print) unfortunetly the overload of writting all this infos into a file cause the lock to disapear. No lock in 10min of quake3 while i get a lock in less 1 minutes without debugging. Thus as you suggest Nicolai NOP might be needed, i will try to track with commit ring. Jerome Glisse |
From: Jerome G. <j.g...@gm...> - 2005-05-24 20:54:13
|
On 5/24/05, Nicolai Haehnle <pre...@gm...> wrote: > Unfortunately, I don't think so. The thing is, all those OUT_RING and > ADVANCE_RING commands do not really call into the hardware immediately; a= ll > they do is write stuff to the ring buffer, but the ring buffer is just so= me > memory area without any magic of its own. >=20 > Only a call to COMMIT_RING will tell the hardware that new commands are > waiting in the ring buffer, and the only thing we do know is that > *something* in the ring buffer before the last COMMIT_RING causes the chi= p > to hang. >=20 > So another possible way to investigate this could be: > - Call radeon_do_wait_for_idle() at the end of the COMMIT_RING macro, and > define RADEON_FIFO_DEBUG (this will print out additional information when > wait_for_idle fails) > - Increasingly add COMMIT_RING macros into r300_cmdbuf.c to pinpoint the > exact location of the problem, if at all possible. >=20 > It would be very helpful if you could single out one command we send usin= g > this procedure. >=20 > Note that in the worst case (depending on the actual nature of the lockup= in > hardware), those debugging changes could actually *remove* the lockup (e.= g. > because they remove a race condition that caused the lockup in the first > place). >=20 Below a sample of what i get when a lockup occur. There is something that seems strange to me, i saw CP_RB_RTPR change while i am in a lockup and CP_RB_WTPR increase 6 by 6, I haven't let the things live for too much time (about 2mins before reboot) but i looks like it still process ring buffer but slowly. Anyway i must misunderstood this i have to dig up more this drm code to understand it a little more. By the way why radeon_cp_flush is disactivated ? May 24 21:33:25 localhost kernel: [drm:radeon_do_wait_for_idle] *ERROR* fai= led! May 24 21:33:25 localhost kernel: radeon_status: May 24 21:33:25 localhost kernel: RBBM_STATUS =3D 0x80010140 May 24 21:33:25 localhost kernel: CP_RB_RTPR =3D 0x0003fdf0 May 24 21:33:25 localhost kernel: CP_RB_WTPR =3D 0x00000d95 May 24 21:33:25 localhost kernel: AIC_CNTL =3D 0x00000000 May 24 21:33:25 localhost kernel: AIC_STAT =3D 0x00000004 May 24 21:33:25 localhost kernel: AIC_PT_BASE =3D 0x00000000 May 24 21:33:25 localhost kernel: TLB_ADDR =3D 0x00000000 May 24 21:33:25 localhost kernel: TLB_DATA =3D 0x00000000 |
From: Adam K K. <ad...@vo...> - 2005-05-24 18:10:41
|
Jerome Glisse wrote: >I was doing some heavy debugging of all things that might be send >(allmost every debug flag + self print) unfortunetly the overload of >writting all this infos into a file cause the lock to disapear. No lock in >10min of quake3 while i get a lock in less 1 minutes without debugging. > >Thus as you suggest Nicolai NOP might be needed, i will try to >track with commit ring. > >Jerome Glisse > > > FYI, I will say that one of my four attempts at Q3A (with the printks added to the drm driver) lasted for over 30 minutes before it locked up, which is much longer than I had previously experienced. The debugging info probably reduced the chances of a lockup in my case, too. Adam |
From: Jerome G. <j.g...@gm...> - 2005-05-24 18:29:49
|
On 5/24/05, Adam K Kirchhoff <ad...@vo...> wrote: > Jerome Glisse wrote: >=20 > FYI, I will say that one of my four attempts at Q3A (with the printks > added to the drm driver) lasted for over 30 minutes before it locked up, > which is much longer than I had previously experienced. The debugging > info probably reduced the chances of a lockup in my case, too. Ooops maybe i am not patient enought but my log was around 1giga thus i prefered stop there :) But this is definitly a clue because i have lockup= even without moving in quake 3. Thus i have a lockup with no debug and no lock with debug with what we may assume are same command (draw the same scene from the same point of view...). IMHO this tell us that this lock is probably related to = not waiting for something, maybe there is some cmd that we better wait for completion before doing anything else... Jerome Glisse |
From: Nicolai H. <pre...@gm...> - 2005-05-25 12:42:52
|
On Tuesday 24 May 2005 22:54, Jerome Glisse wrote: > On 5/24/05, Nicolai Haehnle <pre...@gm...> wrote: > > Unfortunately, I don't think so. The thing is, all those OUT_RING and > > ADVANCE_RING commands do not really call into the hardware immediately;= =20 all > > they do is write stuff to the ring buffer, but the ring buffer is just= =20 some > > memory area without any magic of its own. > >=20 > > Only a call to COMMIT_RING will tell the hardware that new commands are > > waiting in the ring buffer, and the only thing we do know is that > > *something* in the ring buffer before the last COMMIT_RING causes the=20 chip > > to hang. > >=20 > > So another possible way to investigate this could be: > > - Call radeon_do_wait_for_idle() at the end of the COMMIT_RING macro,=20 and > > define RADEON_FIFO_DEBUG (this will print out additional information=20 when > > wait_for_idle fails) > > - Increasingly add COMMIT_RING macros into r300_cmdbuf.c to pinpoint the > > exact location of the problem, if at all possible. > >=20 > > It would be very helpful if you could single out one command we send=20 using > > this procedure. > >=20 > > Note that in the worst case (depending on the actual nature of the=20 lockup in > > hardware), those debugging changes could actually *remove* the lockup=20 (e.g. > > because they remove a race condition that caused the lockup in the first > > place). > >=20 >=20 > Below a sample of what i get when a lockup occur. There is something > that seems strange to me, i saw CP_RB_RTPR change while i am in a > lockup and CP_RB_WTPR increase 6 by 6, I haven't let the things live > for too much time (about 2mins before reboot) but i looks like it > still process ring buffer but slowly. The increases of the write pointer is easily by 6 dwords is easily explaine= d=20 by radeon_do_cp_idle: This function always emits a series of 6 dwords=20 (cache flushes and stuff) before calling wait_for_idle. My understanding is= =20 that these commands make sure the chip is in a completely clean state. Are you sure the read pointer is still moving 2mins after the lockup? That= =20 would be rather surprising, to say the least. > Anyway i must misunderstood this=20 > i have to dig up more this drm code to understand it a little more. > > By the way why radeon_cp_flush is disactivated ? The only thing that calls radeon_cp_flush is radeon_cp_stop, which is never= =20 called during normal 3D operation and COMMIT_RING should take care of=20 posting the write pointer. I don't know the meaning of bit 31 of WPTR. cu, Nicolai > May 24 21:33:25 localhost kernel: [drm:radeon_do_wait_for_idle] *ERROR*=20 failed! > May 24 21:33:25 localhost kernel: radeon_status: > May 24 21:33:25 localhost kernel: RBBM_STATUS =3D 0x80010140 > May 24 21:33:25 localhost kernel: CP_RB_RTPR =3D 0x0003fdf0 > May 24 21:33:25 localhost kernel: CP_RB_WTPR =3D 0x00000d95 > May 24 21:33:25 localhost kernel: AIC_CNTL =3D 0x00000000 > May 24 21:33:25 localhost kernel: AIC_STAT =3D 0x00000004 > May 24 21:33:25 localhost kernel: AIC_PT_BASE =3D 0x00000000 > May 24 21:33:25 localhost kernel: TLB_ADDR =3D 0x00000000 > May 24 21:33:25 localhost kernel: TLB_DATA =3D 0x00000000 2 |
From: Jerome G. <j.g...@gm...> - 2005-05-25 13:05:18
|
> The increases of the write pointer is easily by 6 dwords is easily explai= ned > by radeon_do_cp_idle: This function always emits a series of 6 dwords > (cache flushes and stuff) before calling wait_for_idle. My understanding = is > that these commands make sure the chip is in a completely clean state. >=20 > Are you sure the read pointer is still moving 2mins after the lockup? Tha= t > would be rather surprising, to say the least. Yes, i saw it in two differents lockup but it takes sometimes before happen= ing. It may have happen in my other lockup if i let them more time. Jerome Glisse |
From: Vladimir D. <vo...@mi...> - 2005-05-25 15:02:07
|
> > Are you sure the read pointer is still moving 2mins after the lockup? That > would be rather surprising, to say the least. > I think I can imagine how this might be happenning. You see a lockup from the driver point of view is when the 3d engine busy bit is constantly on. The read pointer is updated by the CP engine, not the 3d engine. It could be that something would cause the CP engine to loop around sending commands to 3d engine forever. This would keep the 3d engine bit on, update the read pointer and appear to be a lockup to the driver. One way to try to make sure this does not happen is to put code in the DRM driver to control the active size of the ring buffer. Also, there might be an issue where the CP engine expects the ring buffer to be padded with NOPs in a certain way (say to have pointers always on 256 bit boundaries) - I don't think we are doing this. Michel - any chance you could comment ? Thank you ! best Vladimir Dergachev |
From: Nicolai H. <pre...@gm...> - 2005-05-25 15:34:32
|
On Wednesday 25 May 2005 17:01, Vladimir Dergachev wrote: > > Are you sure the read pointer is still moving 2mins after the lockup?=20 That > > would be rather surprising, to say the least. > > >=20 > I think I can imagine how this might be happenning. You see a lockup from > the driver point of view is when the 3d engine busy bit is constantly on. >=20 > The read pointer is updated by the CP engine, not the 3d engine. It could= =20 > be that something would cause the CP engine to loop around sending=20 > commands to 3d engine forever. This would keep the 3d engine bit on,=20 > update the read pointer and appear to be a lockup to the driver. What you're saying is, some command that we sent could be misinterpreted by= =20 the 3D engine (or we sent something that we didn't intend to send,=20 considering lack of docs etc.) as a command that takes insanely long to=20 complete. > One way to try to make sure this does not happen is to put code in the DR= M=20 > driver to control the active size of the ring buffer. That could be useful for debugging, but that's about it. The thing is, we=20 *want* to have the ring buffer full. If we didn't want that, we could just= =20 make the ring buffer smaller. But that doesn't really *solve* the problem=20 either because even very small commands can take an insane amount of time=20 to finish. In any case, it would be interesting to know how fast the RPTR still moves= =20 and if it becomes unstuck at some point. You also need to watch out for=20 when the X server finally decides to reset the CP. I believe there's a bug= =20 where the X server waits much longer than intended to do this, but the=20 reset could still mess with results if you're waiting for too long. > Also, there might be an issue where the CP engine expects the ring buffer= =20 > to be padded with NOPs in a certain way (say to have pointers always on=20 > 256 bit boundaries) - I don't think we are doing this. Yes, that's what I mentioned in an earlier mail. cu, Nicolai |
From: Vladimir D. <vo...@mi...> - 2005-05-25 15:58:41
|
On Wed, 25 May 2005, Nicolai Haehnle wrote: > On Wednesday 25 May 2005 17:01, Vladimir Dergachev wrote: >>> Are you sure the read pointer is still moving 2mins after the lockup? > That >>> would be rather surprising, to say the least. >>> >> >> I think I can imagine how this might be happenning. You see a lockup from >> the driver point of view is when the 3d engine busy bit is constantly on. >> >> The read pointer is updated by the CP engine, not the 3d engine. It could >> be that something would cause the CP engine to loop around sending >> commands to 3d engine forever. This would keep the 3d engine bit on, >> update the read pointer and appear to be a lockup to the driver. > > What you're saying is, some command that we sent could be misinterpreted by > the 3D engine (or we sent something that we didn't intend to send, > considering lack of docs etc.) as a command that takes insanely long to > complete. No - because then the read pointer would be stationary. The CP engine should not submit commands until the 3d engine busy is 0. (but it reacts faster than we can catch this). My interpretation was that CP engine just keeps submitting the same command to 3d engine, without fetching it and just increases the read pointer all the time. It might help to dump not only read pointer but also the register showing the last command executed. > >> One way to try to make sure this does not happen is to put code in the DRM >> driver to control the active size of the ring buffer. > > That could be useful for debugging, but that's about it. The thing is, we > *want* to have the ring buffer full. If we didn't want that, we could just > make the ring buffer smaller. But that doesn't really *solve* the problem > either because even very small commands can take an insane amount of time > to finish. I am thinking that there might be a bug where CP engine does something funny when the ring buffer is completely full. Maybe we need to keep a small chunk of space open so that start and end pointers are different. > > In any case, it would be interesting to know how fast the RPTR still moves > and if it becomes unstuck at some point. You also need to watch out for > when the X server finally decides to reset the CP. I believe there's a bug > where the X server waits much longer than intended to do this, but the > reset could still mess with results if you're waiting for too long. Yep. best Vladimir Dergachev |