From: F. <j_r...@ya...> - 2002-06-09 22:32:27
Attachments:
kmsg.txt
Leif,

I have had some problems with the DMA code (not just now, but since the beginning of the week). Sometimes I get lockups and crashes quite easily; other times I get bored of jumping around in UT waiting for something bad to happen. But finally I got something - please look at the attached dump - it shows the ring contents. There are two strange things in that picture:

1- The BM_COMMAND value specified in the table head is 0x400001a0, while the register says 0x40000038.

2- The value of BM_SYSTEM_MEM_ADDR in the table head and tail is 0x00534000 in both. Although this doesn't mean much, since the information pointed to by the tail is garbage left from the previous pass, I have trouble believing in coincidences.

The feeling I get from this is that we are writing to pending buffers, causing the engine to lock. But I don't see how... This only happened on my testbox (which is rather slow). Have you experienced anything like this?

I really wanted to get this straight to make a call for testers. Although on my laptop (Pentium III 700 + Mach64 4MB AGP 2x) glxgears rose by just a single fps with the new DMA, on my testbox (AMD K6 350 + Mach64 8MB AGP 2x) it went from 144 to 235 fps. Tuxracer gets about twice the fps too. UT not really, because the CPU has trouble coping with it, but e.g. on my laptop UT is much smoother. The improvements have just begun, and we're really going to need extensive testing to avoid accumulating bugs and to know the impact of the improvements on performance.

José Fonseca
From: F. <j_r...@ya...> - 2002-06-09 22:38:49
On 2002.06.09 23:32 José Fonseca wrote:
> ...
> The feeling I get from this is that we are writing to pending buffers,
> causing the engine to lock. But I don't see how... This only happened on
> my testbox (which is rather slow). Have you experienced anything like this?
> ...

I forgot to add that it would be really nice to dump the buffer corresponding to the head, so that we could analyze whether the buffer is valid or was trashed. Is there any other way to know the virtual address from the physical address of the buffer without searching for a match in the buffer lists?

This would also be nice for making an alternative MMIO code path which would also read the ring table and effectively emulate the ring buffer too (for debugging purposes only, as DMA is the way to go).

José Fonseca
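A minimal sketch of the brute-force lookup being asked about here, assuming the driver keeps its buffers in a simple array (the struct and field names are illustrative, not the actual DRM buffer layout). Without extra bookkeeping, mapping a physical address back to a virtual one really does come down to scanning the buffer list for a containing range:

    #include <stdint.h>
    #include <stddef.h>
    #include <stdio.h>

    /* Hypothetical buffer record (illustrative fields only). */
    struct dma_buf {
        uint32_t bus_addr;  /* physical address the card sees */
        void    *virt;      /* driver's virtual mapping */
        size_t   size;
    };

    /* Scan for the buffer whose physical range contains `phys`. */
    static void *phys_to_virt_lookup(const struct dma_buf *bufs, int n,
                                     uint32_t phys)
    {
        for (int i = 0; i < n; i++) {
            if (phys >= bufs[i].bus_addr &&
                phys < bufs[i].bus_addr + bufs[i].size)
                return (char *)bufs[i].virt + (phys - bufs[i].bus_addr);
        }
        return NULL;  /* not one of ours: a garbage descriptor? */
    }

    int main(void)
    {
        static char pool[2][4096];
        struct dma_buf bufs[2] = {
            { 0x00530000, pool[0], sizeof pool[0] },
            { 0x00534000, pool[1], sizeof pool[1] },
        };
        printf("%p\n", phys_to_virt_lookup(bufs, 2, 0x00534000));
        return 0;
    }

If the buffers were carved out of one physically contiguous pool, the search would collapse to offset arithmetic against the pool's base address, with no list walk at all.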
From: Leif D. <lde...@re...> - 2002-06-09 23:53:23
On Sun, 9 Jun 2002, José Fonseca wrote:

> Leif,
>
> I have had some problems with the DMA code (not just now, but since the
> beginning of the week). Sometimes I get lockups and crashes quite easily;
> other times I get bored of jumping around in UT waiting for something bad
> to happen.

Anything reproducible, or is it at different places every time?

> But finally I got something - please look at the attached dump - it shows
> the ring contents. There are two strange things in that picture:
>
> 1- The BM_COMMAND value specified in the table head is 0x400001a0, while
> the register says 0x40000038.

I think this is because the card is decrementing the byte count as it processes the buffer. You'll notice that BM_SYSTEM_MEM_ADDR also looks like it's being incremented as the buffer is processed. I don't think this indicates a problem. The register state looks like the engine locked while processing the buffer at 0x00534000.

> 2- The value of BM_SYSTEM_MEM_ADDR in the table head and tail is
> 0x00534000 in both. Although this doesn't mean much, since the information
> pointed to by the tail is garbage left from the previous pass, I have
> trouble believing in coincidences.
>
> The feeling I get from this is that we are writing to pending buffers,
> causing the engine to lock. But I don't see how... This only happened on
> my testbox (which is rather slow). Have you experienced anything like this?

I haven't had any lockups since your changes, apart from once when I was working on the BM_HOSTDATA blit changes, and I haven't had any since I committed my changes. I've done quite a bit of "testing" with various games lately, too. ;) But, as you say, there could be bugs that only crop up on a slower system (I test on a Dell Inspiron 7k laptop w/ PII 400 and a Rage LT Pro 8M AGP 2x). I should mention that I have _EXTRA_CHECKING disabled. I suppose it's possible that the checking code causes a problem, but I'm assuming that you had lockups before adding it.

One thing to be aware of is that do_release_used_buffers unconditionally frees all pending buffers, so it should only be called when the card is known to be idle and finished with all pending buffers on the ring. But I think the current code is safe in that respect. Maybe it would help in debugging if you also dump the contents of the buffer pointed to by the head (and maybe the tail if you think there might be a problem there).

BTW, did anything come before the ring dump in the log? Do you know where it was triggered from?

> ...

-- 
Leif Delgass
http://www.retinalburn.net
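A quick arithmetic check of this reading of the dump. Assuming the low bits of BM_COMMAND carry the remaining byte count and bit 30 (0x40000000) is a control flag - an inference from the values in the dump, not from documentation - the count decrement and the address increment should agree if the engine really advances both in lockstep:

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed split of BM_COMMAND: low bits = remaining byte count,
     * high bits = flags. Inferred from the dump values, not the docs. */
    #define BM_BYTE_COUNT(cmd) ((cmd) & 0x00ffffffu)

    int main(void)
    {
        uint32_t cmd_descr  = 0x400001a0;  /* BM_COMMAND in the table head */
        uint32_t cmd_reg    = 0x40000038;  /* BM_COMMAND register */
        uint32_t addr_descr = 0x00534000;  /* BM_SYSTEM_MEM_ADDR in the head */

        uint32_t processed = BM_BYTE_COUNT(cmd_descr) - BM_BYTE_COUNT(cmd_reg);

        printf("bytes processed so far: %u\n", processed);
        /* If count and address move in lockstep, the register should
         * read descriptor start + processed bytes. */
        printf("expected BM_SYSTEM_MEM_ADDR register: 0x%08x\n",
               addr_descr + processed);
        return 0;
    }

Comparing the printed value against the register in the dump is exactly the consistency question José raises in his next message.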
From: F. <j_r...@ya...> - 2002-06-10 00:21:13
On 2002.06.10 00:53 Leif Delgass wrote:
> On Sun, 9 Jun 2002, José Fonseca wrote:
>
> > I have had some problems with the DMA code (not just now, but since the
> > beginning of the week). Sometimes I get lockups and crashes quite easily;
> > other times I get bored of jumping around in UT waiting for something
> > bad to happen.
>
> Anything reproducible, or is it at different places every time?

There isn't a true pattern. Sometimes it happens when I click the mouse or keyboard, sometimes it happens by itself, and sometimes it simply doesn't happen at all. I still have no clue.

> > But finally I got something - please look at the attached dump - it
> > shows the ring contents. There are two strange things in that picture:
> >
> > 1- The BM_COMMAND value specified in the table head is 0x400001a0,
> > while the register says 0x40000038.
>
> I think this is because the card is decrementing the byte count as it
> processes the buffer. You'll notice that BM_SYSTEM_MEM_ADDR also
> looks like it's being incremented as the buffer is processed. I don't
> think this indicates a problem. The register state looks like the engine
> locked while processing the buffer at 0x00534000.

But if so, shouldn't they be incremented/decremented by the same amount?

> I haven't had any lockups since your changes, apart from once when I was
> working on the BM_HOSTDATA blit changes, and I haven't had any since I
> committed my changes. I've done quite a bit of "testing" with various
> games lately, too. ;) But, as you say, there could be bugs that only crop
> up on a slower system (I test on a Dell Inspiron 7k laptop w/ PII 400 and
> a Rage LT Pro 8M AGP 2x).

This is good to know.

> I should mention that I have _EXTRA_CHECKING disabled. I suppose it's
> possible that the checking code causes a problem, but I'm assuming that
> you had lockups before adding it.

Yes, the _EXTRA_CHECKING was already part of my attempt to hunt this down...

> One thing to be aware of is that do_release_used_buffers unconditionally
> frees all pending buffers, so it should only be called when the card is
> known to be idle and finished with all pending buffers on the ring. But I
> think the current code is safe in that respect. Maybe it would help in
> debugging if you also dump the contents of the buffer pointed to by the
> head (and maybe the tail if you think there might be a problem there).

Ok. Tomorrow I'll add an extra check there. I'll also add a wait_for_idle in ADVANCE_RING to force a sync with the engine, to see if this still happens.

I just wanted to be sure that there isn't a race condition we aren't aware of... because if there is one, it could give more headaches later than now. Anyway, I'm getting a little bored with this, so I'm already moving forward too: I'm going to have the DMA* macros hold a DMA buffer across successive calls instead of wasting a buffer for each call.

> BTW, did anything come before the ring dump in the log? Do you know
> where it was triggered from?

No, this time there wasn't anything more. The trigger is hard to determine: the problem is only noticed when waiting for buffers, so sometimes there are some messages related to timeouts.

> ...

José Fonseca
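A sketch of the debug dump José is planning here: print the buffer the head points to as the 32-bit words the engine would see, so a trashed buffer stands out from a plausible command stream. In the driver this would be a printk loop over the buffer's kernel mapping; stdio and the fixed sample buffer are stand-ins:

    #include <stdint.h>
    #include <stdio.h>

    /* Dump `n_dwords` 32-bit words, four per line, labelled with the
     * bus address the card would use (n_dwords: a multiple of 4). */
    static void dump_buffer(const uint32_t *buf, unsigned n_dwords,
                            uint32_t bus_addr)
    {
        for (unsigned i = 0; i + 3 < n_dwords; i += 4)
            printf("0x%08x: %08x %08x %08x %08x\n",
                   bus_addr + i * 4,
                   buf[i], buf[i + 1], buf[i + 2], buf[i + 3]);
    }

    int main(void)
    {
        uint32_t sample[16] = { 0xdeadbeef };  /* rest zero-filled */
        dump_buffer(sample, 16, 0x00534000);
        return 0;
    }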
From: Leif D. <lde...@re...> - 2002-06-10 01:37:53
On Mon, 10 Jun 2002, José Fonseca wrote:

> There isn't a true pattern. Sometimes it happens when I click the mouse
> or keyboard, sometimes it happens by itself, and sometimes it simply
> doesn't happen at all. I still have no clue.
>
> > I think this is because the card is decrementing the byte count as it
> > processes the buffer. You'll notice that BM_SYSTEM_MEM_ADDR also
> > looks like it's being incremented as the buffer is processed.
>
> But if so, shouldn't they be incremented/decremented by the same amount?

Maybe, but without knowing about the hardware implementation it's hard to know. Maybe the system address is only incremented in 256 byte chunks? I remember seeing BM_COMMAND seemingly being decremented by the hardware in one of your earlier tests (where there was no lockup), so that looked like normal operation to me. A quick test could be made to repeatedly sample these registers while a full DMA buffer is being processed. Of course, there will always be delay between reads, so it's hard to get a true snapshot of the state of multiple registers at a given moment in time.

> This is good to know.

btw, when did you last update from cvs? I was wondering if you had tried the blit changes I made yesterday (June 8).

> Yes, the _EXTRA_CHECKING was already part of my attempt to hunt this
> down...

That's what I figured.

> Ok. Tomorrow I'll add an extra check there. I'll also add a wait_for_idle
> in ADVANCE_RING to force a sync with the engine, to see if this still
> happens.
>
> I just wanted to be sure that there isn't a race condition we aren't
> aware of... because if there is one, it could give more headaches later
> than now.

Sure. We should definitely try to find the source of your lockups before going much further.

> No, this time there wasn't anything more. The trigger is hard to
> determine: the problem is only noticed when waiting for buffers, so
> sometimes there are some messages related to timeouts.

Yeah, a timeout there is usually just a symptom of a lockup; lots of things could potentially be the root cause.

-- 
Leif Delgass
http://www.retinalburn.net
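A sketch of the sampling test Leif suggests: poll BM_COMMAND and BM_SYSTEM_MEM_ADDR while a large buffer is in flight and log the pairs. The read functions below just simulate an engine draining the 0x1a0-byte buffer from the dump - in the driver they would be MMIO register reads - and, as Leif notes, the two reads can never be truly simultaneous, so each pair is only an approximate snapshot:

    #include <stdint.h>
    #include <stdio.h>

    /* Simulated engine state standing in for the real registers. */
    static uint32_t sim_count = 0x1a0;

    static uint32_t read_bm_command(void)
    {
        return 0x40000000u | sim_count;           /* flag bit + remaining count */
    }

    static uint32_t read_bm_system_mem_addr(void)
    {
        return 0x00534000u + (0x1a0 - sim_count); /* start + bytes consumed */
    }

    static void engine_tick(void)
    {
        if (sim_count >= 8)
            sim_count -= 8;                       /* engine consumes 8 bytes */
    }

    int main(void)
    {
        for (int i = 0; i < 10; i++) {
            uint32_t cmd  = read_bm_command();
            uint32_t addr = read_bm_system_mem_addr();
            /* A pair where the count drop and the address advance
             * disagree would hint at read-ahead buffering. */
            printf("sample %2d: BM_COMMAND=0x%08x BM_SYSTEM_MEM_ADDR=0x%08x\n",
                   i, cmd, addr);
            engine_tick();
        }
        return 0;
    }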
From: F. <j_r...@ya...> - 2002-06-11 15:33:25
Leif,

I've got it. After adding checks and checks to everything, I finally discovered it - in some circumstances (such as a regular wait_for_idle with no previous buffer submitting, for example) we were firing the GUI master without enabling the BM. It's a miracle the card even worked at all!

I'm really relieved that the problem was in the code and in no way associated with card misbehavior on slow computers!

I think that now we can do some code cleaning. If you have no objections, I would like to remove some of the code associated with BUFFER AGING=0 and/or MACH64_NO_BATCH_DISPATCH=0.

The UPDATE_RING_HEAD macro is getting monolithic, but after giving it some thought I've come to the conclusion that that code must really be there. The only liberty we have is to either leave it in a macro, inline it, or put it in separate subroutines. What I'll do is leave the most used code path in the macro, and define subroutines for the less used paths.

On 2002.06.10 02:37 Leif Delgass wrote:
> ...
> btw, when did you last update from cvs? I was wondering if you had tried
> the blit changes I made yesterday (June 8).

Yeah. I had no problem associated with them, so keep them coming!

Regards,

José Fonseca
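The shape of the bug and its fix, as a sketch. The register names are the ones from this thread, but the offsets, bit values and MMIO stubs are placeholders rather than the real mach64 definitions; the point is simply that every path that fires the GUI master must enable bus mastering first, including paths that never submitted a buffer in the current pass:

    #include <stdint.h>
    #include <stdio.h>

    #define BUS_CNTL        0x0a0        /* placeholder offsets and bits */
    #define BUS_MASTER_DIS  (1u << 6)
    #define SRC_CNTL        0x6d4
    #define SRC_BM_ENABLE   (1u << 26)

    static uint32_t regs[0x1000];        /* fake register file */
    static uint32_t rd(uint32_t r)             { return regs[r]; }
    static void     wr(uint32_t r, uint32_t v) { regs[r] = v; }

    static void dma_start(void)
    {
        /* The fix: enable bus mastering *before* firing the GUI
         * master, on every path. */
        wr(BUS_CNTL, rd(BUS_CNTL) & ~BUS_MASTER_DIS);
        wr(SRC_CNTL, rd(SRC_CNTL) | SRC_BM_ENABLE);
        /* ...then point BM_GUI_TABLE at the ring and kick off. */
    }

    int main(void)
    {
        regs[BUS_CNTL] = BUS_MASTER_DIS;  /* BM initially disabled */
        dma_start();
        printf("BUS_CNTL=0x%08x SRC_CNTL=0x%08x\n",
               rd(BUS_CNTL), rd(SRC_CNTL));
        return 0;
    }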
From: Leif D. <lde...@re...> - 2002-06-11 17:25:01
On Tue, 11 Jun 2002, José Fonseca wrote:

> Leif,
>
> I've got it. After adding checks and checks to everything, I finally
> discovered it - in some circumstances (such as a regular wait_for_idle
> with no previous buffer submitting, for example) we were firing the GUI
> master without enabling the BM. It's a miracle the card even worked at all!

Now I understand why I didn't experience the lockup. I forgot to mention that in my tree, I renamed your do_dma_flush to do_wait_for_dma (which was still called from do_dma_idle), and I made do_dma_flush call dma_start. do_dma_flush is called by the flush ioctl, which is called before waiting for idle by the DDX and Mesa driver, so this ensured that busmastering was on before waiting for idle. We still need to sort this out, because the DDFlush in the Mesa driver should flush the ring, ensuring all buffers will be completed by the hardware in finite time, but shouldn't wait for idle (only DDFinish needs to wait for idle).

A couple of other comments on your change: I think we could just unconditionally turn on busmastering in SRC_CNTL if the engine is idle in UPDATE_RING_HEAD. One write is likely to use fewer cycles than one read, and certainly less than one read followed by one write. Plus, if the engine is active, there shouldn't be a need to check if bus mastering is on and enable it. That also makes the do_wait_for_idle in dma_start unnecessary. The flush ioctl could just call UPDATE_RING_HEAD, and possibly wait to see the head advance a couple of descriptors to make sure the engine won't stop.

> I'm really relieved that the problem was in the code and in no way
> associated with card misbehavior on slow computers!
>
> I think that now we can do some code cleaning. If you have no objections,
> I would like to remove some of the code associated with BUFFER AGING=0
> and/or MACH64_NO_BATCH_DISPATCH=0.

I think that's probably fine, but I think we should see if we can salvage the pseudo-dma path. We should also consider what to do about frame aging. Also, the interrupt path is probably pretty broken now; we should see if there's any way to make that a workable alternative or remove it.

> The UPDATE_RING_HEAD macro is getting monolithic, but after giving it
> some thought I've come to the conclusion that that code must really be
> there. The only liberty we have is to either leave it in a macro, inline
> it, or put it in separate subroutines. What I'll do is leave the most used
> code path in the macro, and define subroutines for the less used paths.

Maybe, if you make the changes I suggested above, all the work done in the block where the engine is idle in UPDATE_RING_HEAD could be moved into dma_start, which could be made an inline function. At the moment, that's the only place dma_start is called, right?

> Yeah. I had no problem associated with them, so keep them coming!

-- 
Leif Delgass
http://www.retinalburn.net
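A toy model of the control flow Leif is proposing; only the names taken from the thread (UPDATE_RING_HEAD, dma_start, the flush ioctl) are real, everything else is illustrative. The idea is that dma_start runs only from the branch where the engine is already known idle, so it can enable bus mastering with one unconditional write and needs no wait_for_idle of its own:

    #include <stdbool.h>
    #include <stdio.h>

    struct ring {
        int  head, tail;   /* descriptor indices; head chases tail */
        bool engine_idle;  /* what GUI_STAT would report */
    };

    /* Called only when the engine is known idle: one unconditional
     * enable beats a read-check-write sequence. */
    static inline void dma_start(struct ring *r)
    {
        printf("enable BM, restart at descriptor %d\n", r->head);
        r->engine_idle = false;
    }

    /* Flush: make sure pending descriptors complete in finite time
     * without waiting for idle. If the card stopped at a stale
     * end-of-list marker, restart it; whether an *active* card can
     * still stop later is the open question in the thread. */
    static void flush(struct ring *r)
    {
        if (r->engine_idle && r->head != r->tail)
            dma_start(r);
    }

    int main(void)
    {
        struct ring r = { .head = 2, .tail = 5, .engine_idle = true };
        flush(&r);
        return 0;
    }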
From: F. <j_r...@ya...> - 2002-06-11 18:46:22
On 2002.06.11 18:24 Leif Delgass wrote:
> Now I understand why I didn't experience the lockup. I forgot to mention
> that in my tree, I renamed your do_dma_flush to do_wait_for_dma (which was
> still called from do_dma_idle), and I made do_dma_flush call dma_start.
> do_dma_flush is called by the flush ioctl, which is called before waiting
> for idle by the DDX and Mesa driver, so this ensured that busmastering was
> on before waiting for idle. We still need to sort this out, because the
> DDFlush in the Mesa driver should flush the ring, ensuring all buffers
> will be completed by the hardware in finite time, but shouldn't wait for
> idle (only DDFinish needs to wait for idle).

In that case we just need to ensure that the head >= the offset of the last ring entry submitted.

> A couple of other comments on your change: I think we could just

Although there are some things you can try, simplifying this is _very_ tricky, as you're going to see below...

> unconditionally turn on busmastering in SRC_CNTL if the engine is idle in

No. The engine could be active but not doing BM, e.g., if X submitted something to the FIFO.

> UPDATE_RING_HEAD. One write is likely to use fewer cycles than one read,
> and certainly less than one read followed by one write. Plus, if the

As I said in the code, what we can do is have a variable indicating whether the busmaster is ON or OFF.

> engine is active, there shouldn't be a need to check if bus mastering is
> on and enable it. That also makes the do_wait_for_idle in dma_start

The Mach64 SDK sample code calls wait_for_idle before enabling the BM, so we need to do that too.

> unnecessary. The flush ioctl could just call UPDATE_RING_HEAD, and
> possibly wait to see the head advance a couple of descriptors to make
> sure the engine won't stop.

What's the point in waiting for anything? I thought we had both agreed that we shouldn't wait for anything when submitting a buffer... (after all, we both drew the same conclusions on that matter at two distinct times).

Note that I'm quite reluctant to make optimizations where you have to make more assumptions to gain a few cycles. We already assume a lot, and it's very difficult to determine when our assumptions are wrong. If another bug like this one (which took me a whole week to solve) happens in the hands of a user, we'll be forever trying to debug it.

I also plan to have all the assumptions we already make expressed in the code (with MACH64_EXTRA_CHECKING, or a better name perhaps) so that we can debug these kinds of problems more easily when they happen.

> I think that's probably fine, but I think we should see if we can salvage
> the pseudo-dma path. We should also consider what to do about frame
> aging. Also, the interrupt path is probably pretty broken now; we should
> see if there's any way to make that a workable alternative or remove it.

I'll leave both those paths as they are for now - they are fairly contained anyway. Later on we'll decide what to do with that code.

Regarding the frame aging, we can do that by noting down the ring offset of the last buffer of each frame.

> Maybe, if you make the changes I suggested above, all the work done in
> the block where the engine is idle in UPDATE_RING_HEAD could be moved into
> dma_start, which could be made an inline function. At the moment, that's
> the only place dma_start is called, right?

Yep. But after thinking it over, it's probably better to do it all in UPDATE_RING... there isn't much point in putting 2 output instructions in a separate function where the call overhead takes more instructions than that, especially if they are called only from a single function.

> ...

José Fonseca
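A sketch of the cached-flag idea José mentions: keep the bus-master state in a driver variable so the common path costs a cached memory read rather than an MMIO read over the bus. Leif's caveat from later in the thread is built in as a comment: the card can drop bus mastering by itself on a lockup, so the cache must be invalidated on any engine reset. Names are illustrative:

    #include <stdbool.h>
    #include <stdio.h>

    static bool bm_enabled;   /* mirrors the SRC_CNTL bus-master state */

    static void hw_enable_bm(void) { printf("MMIO write: enable BM\n"); }

    static void ensure_bm(void)
    {
        if (!bm_enabled) {    /* usually answered from the variable */
            hw_enable_bm();
            bm_enabled = true;
        }
    }

    static void engine_reset(void)
    {
        /* The hardware may have disabled bus mastering on its own
         * (e.g. after a lockup), so forget the cached state. */
        bm_enabled = false;
    }

    int main(void)
    {
        ensure_bm();      /* writes */
        ensure_bm();      /* no-op, served from the cache */
        engine_reset();
        ensure_bm();      /* writes again */
        return 0;
    }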
From: Leif D. <lde...@re...> - 2002-06-11 19:32:04
Attachments:
mach64-drm-20020611.diff.gz
On Tue, 11 Jun 2002, José Fonseca wrote:

> In that case we just need to ensure that the head >= the offset of the
> last ring entry submitted.

But that means essentially waiting for idle. The ring tail is moved every time we add a buffer to the ring, so the "last ring entry submitted" is always close to the new tail and could be separated from the head by many descriptors. We just need to make sure that the card isn't going to stop, so the flush needs to restart DMA if the card is idle and ensure that it _won't_ stop if the card is active.

> Although there are some things you can try, simplifying this is _very_
> tricky, as you're going to see below...
>
> > unconditionally turn on busmastering in SRC_CNTL if the engine is idle in
>
> No. The engine could be active but not doing BM, e.g., if X submitted
> something to the FIFO.

Notice that I said if the engine is _idle_. The engine can't be active if it's idle, right? And it doesn't make sense to turn on busmastering while the card is active. This is assuming that if GUI_ACTIVE is 0, then the FIFO is also empty. The docs say that a state of idleness (GUI_ACTIVE=0) implies an empty fifo, so the FIFO check in wait_for_idle is really redundant. I'm saying we move the call to dma_start into the block in UPDATE_RING_HEAD where we already know the engine is idle from GUI_STAT, and move the two writes to kick off the transfer into dma_start as well (just to group all the writes that start a transfer in one place).

> > UPDATE_RING_HEAD. One write is likely to use fewer cycles than one read,
> > and certainly less than one read followed by one write.

Actually, I realized there are two writes (one to SRC_CNTL and one to BUS_CNTL), so reading SRC_CNTL could be better. As you say, we could keep track of this with a variable, as long as we account for the possibility of the card disabling bus mastering in the case of a lockup.

> The Mach64 SDK sample code calls wait_for_idle before enabling the BM, so
> we need to do that too.

See above. We'd already know the engine is idle when enabling busmastering.

> What's the point in waiting for anything? I thought we had both agreed
> that we shouldn't wait for anything when submitting a buffer... (after
> all, we both drew the same conclusions on that matter at two distinct
> times).

As I mentioned above, we just need to ensure that the card won't stop because it has already read an old end of list flag. We could have added several descriptors to the ring while the card is still active on an earlier descriptor which the card sees as the end of the ring. So restarting the card if it's idle doesn't ensure that it won't stop if it's active. I was just thinking that waiting to see the card advance a couple of descriptors (but not waiting for idle) might be enough to ensure that it isn't going to stop.

> Note that I'm quite reluctant to make optimizations where you have to
> make more assumptions to gain a few cycles. We already assume a lot, and
> it's very difficult to determine when our assumptions are wrong. If
> another bug like this one (which took me a whole week to solve) happens
> in the hands of a user, we'll be forever trying to debug it.

I think the assumption that the card is not active when it's idle is a safe one. ;)

> Yep. But after thinking it over, it's probably better to do it all in
> UPDATE_RING... there isn't much point in putting 2 output instructions in
> a separate function where the call overhead takes more instructions than
> that, especially if they are called only from a single function.

But doesn't inlining the function remove the overhead? I was suggesting this mainly for readability; it's not really necessary if dma_start isn't going to be used anywhere else.

I'm attaching a diff so you can see what I'm talking about. Hopefully this will remove any confusion. :)

-- 
Leif Delgass
http://www.retinalburn.net
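A small sketch of the idle test this argument rests on, assuming - as Leif reads the docs - that GUI_ACTIVE=0 in GUI_STAT implies the FIFO has already drained, which makes a separate FIFO-depth poll redundant. The bit position and the register stub are placeholders:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define GUI_ACTIVE (1u << 0)   /* placeholder bit position */

    static uint32_t gui_stat;      /* stands in for reading GUI_STAT */

    static bool engine_idle(void)
    {
        /* One register read decides it: if idle implies an empty
         * FIFO, no FIFO check is needed before enabling BM. */
        return (gui_stat & GUI_ACTIVE) == 0;
    }

    int main(void)
    {
        gui_stat = GUI_ACTIVE;
        printf("busy: idle? %d\n", engine_idle());
        gui_stat = 0;
        printf("done: idle? %d\n", engine_idle());
        return 0;
    }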
From: F. <j_r...@ya...> - 2002-06-11 20:21:53
On 2002.06.11 20:31 Leif Delgass wrote:
> But that means essentially waiting for idle. The ring tail is moved every
> time we add a buffer to the ring, so the "last ring entry submitted" is
> always close to the new tail and could be separated from the head by many
> descriptors. We just need to make sure that the card isn't going to stop,
> so the flush needs to restart DMA if the card is idle and ensure that it
> _won't_ stop if the card is active.

I see. But it's very complicated to know whether the card is going to stop or not. What we can do is have a flag which tells us whether we are sure the card will succeed or not. When we add a new buffer to the ring and, after the change, the head is before the change, then we binary-"and" the previous flag value with TRUE.

When we want to do that kind of flush, we check the value of that flag. If it's true, we can release immediately. If it's not, then we must wait_for_idle (because the card will probably stop), restart the DMA, and return.

> Notice that I said if the engine is _idle_. The engine can't be active if
> it's idle, right? ...

Right. Sorry. I didn't understand that the first time I read this. I guess that it would be safe to do what you suggested (*) but, as you also said below, using a variable is probably a better idea - after all, if it's in the cache, the speed of access is several orders of magnitude faster than doing anything over the bus.

(*) Not true... after seeing your diff I was reminded of a problem with this: note that if the card is not idle then you're going to apply the -4 entries offset to correct it, but if the card was processing the X FIFO commands, that's totally wrong and it's going to corrupt the head offset. So I reiterate what I said in my previous reply regarding this change. Nevertheless, this is a non-issue, as using a variable achieves the same or better without reordering the code.

> As I mentioned above, we just need to ensure that the card won't stop
> because it has already read an old end of list flag. We could have added
> several descriptors to the ring while the card is still active on an
> earlier descriptor which the card sees as the end of the ring. So
> restarting the card if it's idle doesn't ensure that it won't stop if

I'm not sure I understood correctly what you're saying. Note that once we restart the card, we can be sure that it won't stop until it finishes _all_ buffers we supplied up until _that_ moment.

> it's active. I was just thinking that waiting to see the card advance a
> couple of descriptors (but not waiting for idle) might be enough to
> ensure that it isn't going to stop.

I think that having a flag indicating whether the card can stop or not is more efficient. What do you think?

Another thing that we must address before this is using bigger buffers - in most of the ring dumps I see, the buffers span just one or two entries at most. There are two things: we are using a buffer for each DMA* macro set (which I'm working on now); and the client-submitted buffers can be much bigger (they can even span several primitives). Until both these things are addressed, the card is idle most of the time anyway (at least on my slow testbox).

> But doesn't inlining the function remove the overhead? I was suggesting

Yes. If you inline you'll get the same thing... (I was being naughty.. :P )

> this mainly for readability; it's not really necessary if dma_start isn't
> going to be used anywhere else.
>
> I'm attaching a diff so you can see what I'm talking about. Hopefully
> this will remove any confusion. :)

Yes, it helped!

Besides my above comment, I noticed that you did something with RING_SPACE_TEST_WITH_RETURN, but it wasn't clear what from the diff. Is that code still needed or not? I've disabled it and I've been doing fine without it.

José Fonseca
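One possible reading of the flag scheme José describes, as a sketch (his wording is compressed, so this is an interpretation, and all names are made up): at submit time, record whether the card might already have latched the old end-of-list marker; the flush path then only has to wait and restart when the flag says a stop is possible.

    #include <stdbool.h>
    #include <stdio.h>

    struct ring { int head, tail; bool may_stop; };

    static void submit_buffer(struct ring *r, int n_descr)
    {
        int old_tail = r->tail;
        r->tail += n_descr;
        /* If the head had already caught up with the old tail when we
         * patched the old end-of-list marker, the card may have latched
         * it and can stop there; otherwise the flag keeps its value. */
        if (r->head >= old_tail)
            r->may_stop = true;
    }

    static void flush(struct ring *r)
    {
        if (!r->may_stop)
            return;   /* card is guaranteed to reach the current tail */
        printf("wait_for_idle, then restart DMA if it stopped short\n");
        r->may_stop = false;
    }

    int main(void)
    {
        struct ring r = { .head = 0, .tail = 4, .may_stop = false };
        submit_buffer(&r, 2);   /* head (0) < old tail (4): flag unchanged */
        r.head = 6;             /* simulate the card catching up */
        submit_buffer(&r, 2);   /* head (6) >= old tail (6): may stop */
        flush(&r);
        return 0;
    }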
From: Leif D. <lde...@re...> - 2002-06-11 21:12:48
On Tue, 11 Jun 2002, José Fonseca wrote:

> I see. But it's very complicated to know whether the card is going to
> stop or not. What we can do is have a flag which tells us whether we are
> sure the card will succeed or not. When we add a new buffer to the ring
> and, after the change, the head is before the change, then we
> binary-"and" the previous flag value with TRUE.

This is where we have to make sure that any assumptions we make can be verified to be true. I haven't done enough testing to really determine a sure-fire way of knowing that the card won't stop yet. What I'm concerned about is that the card might be doing some read-ahead buffering that we don't know about. That's why I was thinking we might have to see the card actually advance a couple of times before determining it won't stop. The test I did with changing BM_GUI_TABLE from a buffer took a couple of descriptors to take effect.

> (*) Not true... after seeing your diff I was reminded of a problem with
> this: note that if the card is not idle then you're going to apply the -4
> entries offset to correct it, but if the card was processing the X FIFO
> commands, that's totally wrong and it's going to corrupt the head offset.
> So I reiterate what I said in my previous reply regarding this change.
> Nevertheless, this is a non-issue, as using a variable achieves the same
> or better without reordering the code.

I don't think it's a problem if the head_addr is one behind the actual position if there are 2D commands still in the FIFO (which could only happen at the final descriptor on the ring). We don't actually act on it until the card is idle. It just means that the last buffer in the ring won't be reclaimed until the card is idle. Actually, if you _did_ advance the head while the card is active, it would trigger the error check you added to freelist_get, because head would equal tail but the card would still be active.

> I'm not sure I understood correctly what you're saying. Note that once we
> restart the card, we can be sure that it won't stop until it finishes
> _all_ buffers we supplied up until _that_ moment.

I wasn't very clear here. What I mean is that if the card is idle and we restart, we should be fine. The problem is if we _only_ do that and do nothing if the card is active.

> I think that having a flag indicating whether the card can stop or not is
> more efficient. What do you think?

It depends on what's required to set a reliable flag. That would have to be done every time we advance the ring tail, whereas a flush ioctl is less frequent. We can remove the flush ioctls wherever they are followed by an idle ioctl with the current version of the idle ioctl (since it ensures _all_ buffers will complete), which would just leave the flush in DDFlush in the Mesa driver. If an app calls glFlush, it's probably not doing it very often (maybe once or twice a frame?).

> Another thing that we must address before this is using bigger buffers -
> in most of the ring dumps I see, the buffers span just one or two entries
> at most. There are two things: we are using a buffer for each DMA* macro
> set (which I'm working on now); and the client-submitted buffers can be
> much bigger (they can even span several primitives). Until both these
> things are addressed, the card is idle most of the time anyway (at least
> on my slow testbox).

The biggest problem with getting the client submitted buffers to be used more efficiently is state emits. Client-side state emits aren't secure. The current code _does_ allow multiple primitives in a buffer as long as there is no new state between them. The AllocDmaLow will use an existing vertex buffer until it's full or a state change causes a dispatch.

> Besides my above comment, I noticed that you did something with
> RING_SPACE_TEST_WITH_RETURN, but it wasn't clear what from the diff. Is
> that code still needed or not? I've disabled it and I've been doing fine
> without it.

Yeah, I meant to explain that. I wasn't quite done there; I just have the macro enabled to make sure it didn't cause problems to remove it. The macro you disabled was in the old path, where it _is_ necessary because buffers aren't added to the ring until a flush. This path is probably broken now anyway. In the new path, the only advantage I can see to having it is to reduce the number of times you call wait_ring if the ring is almost full. Actually, the macro can also cause a new pass to be started if the card is idle (via UPDATE_RING_HEAD -- this is what I meant about the flush being 'hidden' in the UPDATE_RING_HEAD macro), so it might cause less idle time in some cases. It shouldn't be required for things to work; it's more a question of whether things perform better with or without it.

-- 
Leif Delgass
http://www.retinalburn.net
From: F. <j_r...@ya...> - 2002-06-11 22:29:31
On 2002.06.11 22:12 Leif Delgass wrote:
> This is where we have to make sure that any assumptions we make can be
> verified to be true. I haven't done enough testing to really determine a
> sure-fire way of knowing that the card won't stop yet. What I'm concerned
> about is that the card might be doing some read-ahead buffering that we
> don't know about. That's why I was thinking we might have to see the card
> actually advance a couple of times before determining it won't stop. The
> test I did with changing BM_GUI_TABLE from a buffer took a couple of
> descriptors to take effect.

I've tested that before, and there didn't seem to be any significant buffering - at least with respect to the descriptor table. Unless there is some lookahead; in that case we could compare BM_GUI_TABLE instead of the head. In any case, we would need more testing to be sure.

Another idea I had is, instead of having a flag, having a bookmark - the ring offset of the last committed buffer. Whenever we commit a buffer and the head is after that bookmark, we would have to set the bookmark to the beginning of the committed buffer. When we need the card to complete, we would just have to wait (and reset the DMA if it stopped before then) for the head to reach the bookmark. Once it reached it, we could be sure that it would succeed, because the ring table wouldn't suffer any change until the end.

In any event, this will take some experimentation, and from your comments below this isn't as high a priority as condensing the state buffers or doing customized vertex buffer templates for the Mach64 vertex format.

> I don't think it's a problem if the head_addr is one behind the actual
> position if there are 2D commands still in the FIFO (which could only
> happen at the final descriptor on the ring). We don't actually act on it
> until the card is idle.

True.

> It just means that the last buffer in the ring won't be reclaimed until
> the card is idle. Actually, if you _did_ advance the head while the card
> is active, it would trigger the error check you added to freelist_get,
> because head would equal tail but the card would still be active.

I doubt it, because in that case we would wait for idle and _then_ restart DMA... (as is done now in CVS)

Please, let's not discuss this further. I think we both agree that using a variable is the best way to go, isn't it?

> I wasn't very clear here. What I mean is that if the card is idle and we
> restart, we should be fine. The problem is if we _only_ do that and do
> nothing if the card is active.

Ok.

> It depends on what's required to set a reliable flag. That would have to
> be done every time we advance the ring tail, whereas a flush ioctl is
> less frequent. We can remove the flush ioctls wherever they are followed
> by an idle ioctl with the current version of the idle ioctl (since it
> ensures _all_ buffers will complete), which would just leave the flush in
> DDFlush in the Mesa driver. If an app calls glFlush, it's probably not
> doing it very often (maybe once or twice a frame?).

I really don't know... I don't even know why a regular application (not X) would call an idle if the flush wasn't implicit...

> The biggest problem with getting the client submitted buffers to be used
> more efficiently is state emits. Client-side state emits aren't secure.
> The current code _does_ allow multiple primitives in a buffer as long as
> there is no new state between them. The AllocDmaLow will use an existing
> vertex buffer until it's full or a state change causes a dispatch.

Oh... I didn't have that impression... But even with that restriction, they are a lot smaller than I would expect. I would expect OpenGL applications to make fewer state changes than that...

> Yeah, I meant to explain that. I wasn't quite done there; I just have the
> macro enabled to make sure it didn't cause problems to remove it. The
> macro you disabled was in the old path, where it _is_ necessary because

Oops, I didn't notice that there were two of them..! That explains why, when I removed them while debugging, I still got UPDATE_RING_HEAD called without BM enabled - at the time, RING_SPACE_TEST_WITH_RETURN was another of those situations where this could happen.

> buffers aren't added to the ring until a flush. This path is probably
> broken now anyway. In the new path, the only advantage I can see to
> having it is to reduce the number of times you call wait_ring if the ring
> is almost full. Actually, the macro can also cause a new pass to be
> started if the card is idle (via UPDATE_RING_HEAD -- this is what I meant
> about the flush being 'hidden' in the UPDATE_RING_HEAD macro), so it
> might cause less idle time in some cases. It shouldn't be required for
> things to work; it's more a question of whether things perform better
> with or without it.

If it was just for that, we could just call UPDATE_RING_HEAD as the first thing in the ioctl.

Ok. That's fair enough.

José Fonseca
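One reading of the bookmark idea, ignoring ring wraparound (so an interpretation, with invented names): remember the ring offset of the last committed buffer, move the bookmark back to the start of a newly committed buffer whenever the head has already passed it, and at flush time wait for the head to reach the bookmark, restarting DMA if the engine stops first. Past the bookmark nothing in the table changes any more, so from there the card is guaranteed to drain the rest of the ring:

    #include <stdio.h>

    struct ring { int head, tail, bookmark; };

    static void commit_buffer(struct ring *r, int n_descr)
    {
        int start = r->tail;
        r->tail += n_descr;
        if (r->head >= r->bookmark)
            r->bookmark = start;   /* old bookmark already consumed */
    }

    static void wait_until_safe(struct ring *r)
    {
        while (r->head < r->bookmark) {
            /* In the driver: re-read the head from BM_GUI_TABLE and
             * restart DMA if the engine stopped at a stale end-of-list
             * marker. Here progress is just simulated. */
            r->head++;
        }
        /* head >= bookmark: the remaining descriptors are immutable,
         * so the card will finish them without further help. */
    }

    int main(void)
    {
        struct ring r = { 0, 0, 0 };
        commit_buffer(&r, 4);
        commit_buffer(&r, 2);
        wait_until_safe(&r);
        printf("head=%d bookmark=%d tail=%d\n", r.head, r.bookmark, r.tail);
        return 0;
    }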
From: Leif D. <lde...@re...> - 2002-06-12 02:49:53
On Tue, 11 Jun 2002, José Fonseca wrote:

> I've tested that before, and there didn't seem to be any significant
> buffering - at least with respect to the descriptor table. Unless there
> is some lookahead; in that case we could compare BM_GUI_TABLE instead of
> the head. In any case, we would need more testing to be sure.
>
> Another idea I had is, instead of having a flag, having a bookmark - the
> ring offset of the last committed buffer. ...
>
> In any event, this will take some experimentation, and from your comments
> below this isn't as high a priority as condensing the state buffers or
> doing customized vertex buffer templates for the Mach64 vertex format.

True. I'm not sure it's worth a bookmarking scheme if we only use it for the one place the flush ioctl will be used.

> Please, let's not discuss this further. I think we both agree that using
> a variable is the best way to go, isn't it?

Each and every processor cycle must be precisely documented and accounted for! The lives of rocket-launcher toting space marines depend on our attention to detail! I've clamped onto the throat of this bit of code like a mad dog and will continue to shake it around while foam dribbles from the corners of my mouth long after it's dead. Ahem, ... ok I'm back now, I think I blacked out for a minute there.

Anyway, I just think that in any case it would be better if we only enable bus mastering on idle (if things are going well, an active engine should be the common case). If we do that, I don't think it's really a big deal to have the extra writes. The writes could be conditional on a read, or we could use a variable instead, but I'm not sure it's worth it and it could be error prone. Now that I think about it, there's an added bit of security and safety in making sure that the block 1 registers are enabled and that src_cntl is set for gui-mastering and FIFO synchronization before starting a new DMA pass. It would probably help performance in general to find ways to reduce the number of DMA restarts we do, also.

btw, as I tried to indicate above, I can be a bit of an obstinate bugger sometimes and I'm often just thinking out loud. You might want to have a salt shaker at hand and administer a grain or two when you read my posts. 8->~ (that's me foaming at the mouth).

> I really don't know... I don't even know why a regular application (not
> X) would call an idle if the flush wasn't implicit...

I think that's probably true. With X, there is one case where it seems excessive: when uploading a new cursor image. This function has an XAA sync call in it because it writes to the cursor area of the framebuffer. The cursor image shouldn't be dependent on all 3D draw operations completing first, AFAICT. The problem is that I can't think of a clean way to handle this at the moment. For other XAA functions, we need to complete the ring because the X server can change register state that won't be accounted for in buffers already committed to the ring. The other thing, which is really what I was addressing here, is that we still need a flush ioctl for DDFlush that doesn't wait for idle, independent of whether the idle ioctl flushes or not.

> Oh... I didn't have that impression... But even with that restriction,
> they are a lot smaller than I would expect. I would expect OpenGL
> applications to make fewer state changes than that...

Changing textures is one example, and I'd imagine that happens fairly often. I should point out that a primitive won't be split across multiple vertex buffers, so that can leave some unused space as well. As you mentioned, we're going to have to revisit vertex buffers at some point in any case, both for performance and security. BTW, last time I had to boot Windows (against my will, of course), I did a quake3 timedemo. On my laptop, the current dri branch is only behind by ~4fps w/ vertex lighting and 2fps w/ lightmap lighting (approx. 82% and 87% of the Windows performance respectively). I'd say we're making good progress. My goal is to try to at least match the Windows driver, but with a more secure implementation.

-- 
Leif Delgass
http://www.retinalburn.net
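A sketch of the client-side buffering rule Leif describes: AllocDmaLow (the name is from the thread; the structure here is illustrative, not the real Mesa driver code) keeps appending vertex data to the current buffer and only dispatches it when the buffer fills up or state changes, and a primitive is never split across two buffers:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define BUF_SIZE 4096

    struct vb { char data[BUF_SIZE]; int used; };

    static struct vb current_vb;   /* the vertex buffer being filled */

    static void dispatch(struct vb *b)
    {
        printf("dispatch %d bytes\n", b->used);
        b->used = 0;
    }

    /* Hand out `bytes` of vertex space; flush first on a state change
     * or when the primitive would not fit in the remaining space. */
    static void *alloc_dma_low(int bytes, bool state_changed)
    {
        if (state_changed || current_vb.used + bytes > BUF_SIZE)
            dispatch(&current_vb);
        void *p = current_vb.data + current_vb.used;
        current_vb.used += bytes;
        return p;
    }

    int main(void)
    {
        memset(alloc_dma_low(96, false), 0, 96);  /* same state: appended */
        memset(alloc_dma_low(96, false), 0, 96);  /* still appended */
        memset(alloc_dma_low(96, true),  0, 96);  /* new texture: flushes */
        dispatch(&current_vb);                    /* end-of-frame flush */
        return 0;
    }

Every mid-frame state change (a texture switch, for instance) costs a dispatch, which would explain why the buffers in José's ring dumps stay so small.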
From: F. <j_r...@ya...> - 2002-06-12 07:34:28
On 2002.06.12 03:49 Leif Delgass wrote:
> Each and every processor cycle must be precisely documented and accounted
> for! The lives of rocket-launcher toting space marines depend on our
> attention to detail! ...

:-)

> Anyway, I just think that in any case it would be better if we only
> enable bus mastering on idle (if things are going well, an active engine
> should be the common case). If we do that, I don't think it's really a
> big deal to have the extra writes. The writes could be conditional on a
> read, or we could use a variable instead, but I'm not sure it's worth it
> and it could be error prone. Now that I think about it, there's an added
> bit of security and safety in making sure that the block 1 registers are
> enabled and that src_cntl is set for gui-mastering and FIFO
> synchronization before starting a new DMA pass. It would probably help
> performance in general to find ways to reduce the number of DMA restarts
> we do, also.

Without doubt. I hope that by joining the state buffers, speeding up the vertex buffer construction, and (if we don't manage to increase the buffer size on the client side) holding the buffers a little longer in the kernel to achieve the same effect, we will overcome this.

> btw, as I tried to indicate above, I can be a bit of an obstinate bugger
> sometimes and I'm often just thinking out loud. You might want to have a
> salt shaker at hand and administer a grain or two when you read my posts.
> 8->~ (that's me foaming at the mouth).

That's ok! I'm not too far behind you!

> Changing textures is one example, and I'd imagine that happens fairly
> often. ...
> I'd say we're making good progress.

Indeed! Especially considering that we know there are several optimizations still to be done.

> My goal is to try to at least match the Windows driver, but with a more
> secure implementation.

Me too! Ok. Enough talk and back to coding..!

José Fonseca