Thread: RE: [Kgdb-bugreport] [discuss] 2.4 kgdb SMP fixes
From: Shivram U <shi...@wi...> - 2004-01-30 19:47:38
Hi George,

> > The current code resolves this problem by sending an IPI to the other CPUs to enter the gdb_wait() state. This is on similar lines of the kdb code. When a CPU enters the master debugger it sends an IPI to the other CPUs. The other CPUs on receiving this IPI would enter the gdb_wait() function till the master CPU has quit the debugger. A new function kgdb_smp_stop() has been added to stop other CPUs when a CPU has entered the debugger.
>
> IMHO an IPI is not strong enough unless it is an NMI IPI. Otherwise you are depending on the other cpu(s) being interruptible and if you are debugging the kernel, well, there just might be a problem where they are not interruptible and will not become so for some time, if ever.

Right. The IPI is issued in the form of an NMI and is sent to all other CPUs except for the one which entered the debugger. It should be reasonably safe.

> > Fix for Instruction Pointer
> > ----------------------------
> > On hitting a breakpoint gdb does the following sequence.
> > 1. Restore the original value at the Instruction address of the breakpoint.
> > 2. Decrement the instruction pointer by 1
> > 3. Issue a single step to the kgdb stub. We enable a trap flag and after executing a single instruction the debug exception is hit.
> > 4. The remote gdb on receipt of this debug exception, reinserts the Breakpoint at the breakpoint address.
>
> I would expect most breakpoints to stop AT the instruction, not after it. Why is this being done at all?
>
> If this is to be done, then the SS should be treated just like a gdb commanded SS, i.e. the other cpu(s) should be held while the SS is being done. See the code in the mm-kgdb on this.

We would stop at the breakpoint but the EIP/RIP would point to the byte after the breakpoint instruction (INT3 instr). When we set a breakpoint at the gdb console, gdb saves the original value of the byte at the breakpoint address, and inserts the breakpoint instruction opcode at the breakpoint address. When we hit the breakpoint gdb would then put back the original value at the breakpoint address and decrement the instruction pointer to restart the instruction again.
The double fault happens when gdb doesn't do the above and continues from where the EIP/RIP is pointing to.

> > The problem faced is that when two CPUs try to enter the debugger at the same time on hitting a breakpoint, we experience a double fault. The reason being that for CPU1 steps 1 and 2 are executed. The trap flag is enabled and we return from the debugger. Now CPU2 enters and gdb possibly thinks that this is the single step debug exception and doesn't perform step 1 and 2.
>
> Clearly the first cpu should be completely in kgdb prior to doing any SS stuff. By which time the other cpus should be captured in the wait loop.

How does the kgdb-mm stub handle situations where a breakpoint is hit on both the CPUs at the same time? Which one enters the debugger? In the 2.4 kgdb code it would be possible that both of them enter the debugger (handle_exception()) and one of them acquires kgdb_lock while one waits on the lock. Now the problem arises when
1. Both cpus hit a breakpoint and enter the kgdb exception handler (handle_exception)
1. CPU 0 acquired the lock and contacted gdb. We continue at the gdb prompt.
2. gdb issues the "step" instruction and expects the debug exception to be called, wherein it would reinsert all the breakpoints.
3. However since CPU1 was waiting for the lock it would get it, and when it contacts gdb, gdb thinks it's the trap it was expecting, reinserts all breakpoints but doesn't decrement the instruction pointer of CPU1.
4. CPU1 continues at the address its instruction pointer holds and we get the double fault.

> > 2. When we detach from the debugger we need to handle the situation where another CPU already entered the debugger code and was waiting for the kgdb lock. Similarly for the 'k' (quit from the debugger) packet.
>
> I am not sure what the given kgdb does here, but usually kgdb should treat the detach as just a "c" or continue. It should then be in a state where it can handle a subsequent breakpoint and, at that point, attach again. It is up to gdb to make sure all breakpoints are cleared at this point.

Right. The detach functions as a continue. But the problem is, as mentioned above, when both/multiple CPUs try to enter the debugger at the same time. One of them is waiting for a lock while the other contacts gdb, receives the "detach" command and quits the debugger. gdb has now quit too. Now the second CPU acquires the lock and waits for commands from gdb, which is probably no longer present. However this isn't a big problem and I haven't tried to fix it for the same reason. I believe we can still reattach and then detach if "detach" is what we really want.

> I have not had any problems just aborting gdb on the host system in the middle of a kgdb session. I, in fact, do this to test new gdbs on given issues. With the mm-kgdb I can do this either when in kgdb (i.e. a breakpoint) or when not in kgdb, however, this latter case requires that I clear breakpoints first. When in a kgdb session (i.e. a breakpoint) kgdb clears all code resident breakpoints prior to its first prompt after a breakpoint, so this is not an issue at that time.

Some of the issues are probably not present in the mm-kgdb code. I believe that its code is a lot different from the 2.4 stub's. I'll try it soon to see if these problems are reproducible there. I haven't been following the recent kgdb mails that well. Is the kgdb-mm code merged with the kgdb patch available at sourceforge.net? If not, could you point me to the location I can retrieve it from?

Thanks in advance.

Best Regards,
Shivram U

> --
> George Anzinger ge...@mv...
> High-res-timers: http://sourceforge.net/projects/high-res-timers/
> Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml
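
[Editor's sketch] The breakpoint sequence described in this message (save the original byte, write INT3, then on a hit restore the byte and back the instruction pointer up by one) can be illustrated in user space. This is a minimal simulation over a byte buffer, not code from either kgdb stub or from gdb; the names `sw_breakpoint`, `bp_insert` and `bp_hit` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INT3_OPCODE 0xCC  /* x86 one-byte breakpoint instruction */

/* Hypothetical record of one debugger-inserted breakpoint. */
struct sw_breakpoint {
    size_t  addr;        /* offset of the patched byte in the code buffer */
    uint8_t saved_byte;  /* original opcode byte that INT3 replaced */
    int     inserted;    /* nonzero while INT3 is in place */
};

/* Insert: remember the original byte and overwrite it with INT3. */
static void bp_insert(uint8_t *code, struct sw_breakpoint *bp, size_t addr)
{
    bp->addr = addr;
    bp->saved_byte = code[addr];
    code[addr] = INT3_OPCODE;
    bp->inserted = 1;
}

/*
 * Handle a breakpoint trap. After INT3 executes, the reported instruction
 * pointer is one byte past the breakpoint address. If the trap belongs to
 * this breakpoint, restore the original byte and back the pointer up so
 * the original instruction is restarted. Returns the instruction pointer
 * to resume at.
 */
static size_t bp_hit(uint8_t *code, struct sw_breakpoint *bp, size_t trap_ip)
{
    assert(trap_ip > 0);
    if (bp->inserted && trap_ip - 1 == bp->addr) {
        code[bp->addr] = bp->saved_byte;
        bp->inserted = 0;
        return trap_ip - 1;
    }
    /* Not ours (or already handled): resume after the INT3 byte. */
    return trap_ip;
}
```

The second branch is the case the thread worries about: if the debugger does not recognise the trap as its own breakpoint, execution resumes one byte into the original instruction.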
From: George A. <ge...@mv...> - 2004-01-31 00:17:39
Shivram U wrote:
> Hi,
> Please find attached a set of patches for x86 and x86_64 which fix some key SMP problems encountered with the kgdb stub.
> The patches are
> 1. kgdb-x86-smpfixes.diff - Patch against 2.4.20 from kernel.org with linux-2.4.20-6-kgdb-1.6.patch applied
> 2. kgdb-x86_64-smpfixed.diff - Patch against 2.4.23 from kernel.org with linux-2.4.23-kgdb-x86_64-1.6.patch applied
>
> Attached a README.txt which describes the changes and the pending things. I haven't attached the text inline due to its size. Hope it's ok.
>
> I generally work with the SuSE kernel sources. I have attached the diff against those sources too for your reference. (ref-kgdb-x86.diff and ref-kgdb-x86_64.diff)
>
> Request your feedback for the same.
>
> Best Regards,
> Shivram U
>
> ------------------------------------------------------------------------
>
> Overview of the problems
> ------------------------
>
> Stopping other CPUs
> -------------------
> The original code relied on the nmi watchdog to stop other CPUs when a CPU has entered the debugger. (arch/x86_64/kernel/nmi.c: nmi_watchdog())
> This has two disadvantages
> 1. The nmi watchdog needn't necessarily run, so a CPU needn't enter gdb_wait() (arch/x86_64/kernel/gdbstub.c) in order to wait till the master CPU has quit the debugger
> 2. In case the CPU which has entered the debugger sends/receives data over the serial port, gdb_interrupt() (the interrupt handler for kgdb) is called on the other CPU which hasn't entered the wait state. This is the reason for the problem reported at http://www.x86-64.org/lists/discuss/msg03987.html. The easiest way to reproduce this problem is by setting a breakpoint on kfree, kmalloc etc.
>
> The current code resolves this problem by sending an IPI to the other CPUs to enter the gdb_wait() state. This is on similar lines of the kdb code. When a CPU enters the master debugger it sends an IPI to the other CPUs. The other CPUs on receiving this IPI would enter the gdb_wait() function till the master CPU has quit the debugger. A new function kgdb_smp_stop() has been added to stop other CPUs when a CPU has entered the debugger.

IMHO an IPI is not strong enough unless it is an NMI IPI. Otherwise you are depending on the other cpu(s) being interruptible and if you are debugging the kernel, well, there just might be a problem where they are not interruptible and will not become so for some time, if ever.

> Fix for Instruction Pointer
> ----------------------------
> On hitting a breakpoint gdb does the following sequence.
> 1. Restore the original value at the Instruction address of the breakpoint.
> 2. Decrement the instruction pointer by 1
> 3. Issue a single step to the kgdb stub. We enable a trap flag and after executing a single instruction the debug exception is hit.
> 4. The remote gdb on receipt of this debug exception, reinserts the Breakpoint at the breakpoint address.

I would expect most breakpoints to stop AT the instruction, not after it. Why is this being done at all?

If this is to be done, then the SS should be treated just like a gdb commanded SS, i.e. the other cpu(s) should be held while the SS is being done. See the code in the mm-kgdb on this.

> The problem faced is that when two CPUs try to enter the debugger at the same time on hitting a breakpoint, we experience a double fault. The reason being that for CPU1 steps 1 and 2 are executed. The trap flag is enabled and we return from the debugger. Now CPU2 enters and gdb possibly thinks that this is the single step debug exception and doesn't perform step 1 and 2.

Clearly the first cpu should be completely in kgdb prior to doing any SS stuff. By which time the other cpus should be captured in the wait loop.

> Now we check if we entered the debugger on a breakpoint exception (exception vector 3) and if the instruction pointer when we entered and when we are about to leave are the same and if the value at the breakpoint address is a valid breakpoint. If the above conditions are true then we fix the instruction pointer by decrementing it by one.
> The possible scenarios with this fix.
> 1. If gdb has reinserted the breakpoint at the breakpoint address then the CPU again hits the breakpoint and now gdb handles the breakpoint.
>
> 2. The second CPU exits the debugger and restarts at the decremented instruction pointer and gdb hasn't inserted the breakpoint opcode, we then lose this breakpoint hit.
>
> Additional changes
> ------------------
> 1. Made the quit from the debugger release all locks instead of just quitting
>
> Additional changes to the x86_64 code
> -------------------------------------
> 1. Added support for detach functionality similar to the x86 code.
> 2. Added support for the 'Z' and 'z' packets required for knowing about the breakpoints from within the kernel. Ported from the kgdb x86 code.
> 3. Brought the x86 nmi watchdog handling code to x86_64 stub. (This code may now be redundant for both x86 and x86_64 with the IPI changes)
>
> TODO
> ----
> 1. The gdb problem of getting confused between a single step and a breakpoint needs to be fixed ideally in gdb. In the kernel stub we merely have sanity checks which prevent the double faults observed. However the fixes in the kernel can only prevent the faults, ensuring that we can debug the kernel further. It can also mean that we might lose valuable breakpoint hits which should have been trapped by gdb. I have to try with gdb 6.0 thoroughly to see if this problem is fixed. I do recollect facing this problem with gdb 6.0 too.

Again, why do we want to stop after the breakpoint location instead of at it (i.e. prior to execution)?

> 2. When we detach from the debugger we need to handle the situation where another CPU already entered the debugger code and was waiting for the kgdb lock. Similarly for the 'k' (quit from the debugger) packet.

I am not sure what the given kgdb does here, but usually kgdb should treat the detach as just a "c" or continue. It should then be in a state where it can handle a subsequent breakpoint and, at that point, attach again. It is up to gdb to make sure all breakpoints are cleared at this point.

I have not had any problems just aborting gdb on the host system in the middle of a kgdb session. I, in fact, do this to test new gdbs on given issues. With the mm-kgdb I can do this either when in kgdb (i.e. a breakpoint) or when not in kgdb, however, this latter case requires that I clear breakpoints first. When in a kgdb session (i.e. a breakpoint) kgdb clears all code resident breakpoints prior to its first prompt after a breakpoint, so this is not an issue at that time.

--
George Anzinger ge...@mv...
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml
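
[Editor's sketch] The README's exit-time check (vector 3, instruction pointer unchanged between entry and exit, and a valid breakpoint at the address) might look roughly like the following. This is not taken from the attached diffs. It assumes that "valid breakpoint" means an address the stub recorded from a 'Z' packet; the table layout and the names `is_gdb_breakpoint` and `fixup_ip_on_exit` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BREAKPOINT_VECTOR 3  /* x86 #BP exception, raised by INT3 */

/* Assumed: addresses the stub learned about through 'Z' packets. */
static bool is_gdb_breakpoint(const uintptr_t *bp_table, size_t n,
                              uintptr_t addr)
{
    for (size_t i = 0; i < n; i++)
        if (bp_table[i] == addr)
            return true;
    return false;
}

/*
 * Decide which instruction pointer to resume at when leaving the stub.
 * If we came in on a breakpoint trap, gdb left the instruction pointer
 * untouched, and the byte before it is one of gdb's breakpoints, back
 * the pointer up by one so the original instruction is not skipped.
 */
static uintptr_t fixup_ip_on_exit(int vector,
                                  uintptr_t ip_at_entry,
                                  uintptr_t ip_at_exit,
                                  const uintptr_t *bp_table, size_t n)
{
    if (vector == BREAKPOINT_VECTOR &&
        ip_at_entry == ip_at_exit &&
        ip_at_exit > 0 &&
        is_gdb_breakpoint(bp_table, n, ip_at_exit - 1))
        return ip_at_exit - 1;
    return ip_at_exit;
}
```

As the README notes, this only prevents the fault: whether the breakpoint is then hit again or lost depends on whether gdb has reinserted the opcode by the time the CPU resumes.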
From: George A. <ge...@mv...> - 2004-01-30 21:12:40
Shivram U wrote:
> Hi George,
>
>>> The current code resolves this problem by sending an IPI to the other CPUs to enter the gdb_wait() state. This is on similar lines of the kdb code. When a CPU enters the master debugger it sends an IPI to the other CPUs. The other CPUs on receiving this IPI would enter the gdb_wait() function till the master CPU has quit the debugger. A new function kgdb_smp_stop() has been added to stop other CPUs when a CPU has entered the debugger.
>>
>>IMHO an IPI is not strong enough unless it is an NMI IPI. Otherwise you are depending on the other cpu(s) being interruptible and if you are debugging the kernel, well, there just might be a problem where they are not interruptible and will not become so for some time, if ever.
>
> Right. The IPI is issued in the form of an NMI and issued to all other CPUs except for the one which entered the debugger. It should be reasonably safe.

After the reply, I did find this. I am not sure about defining a special IPI vector for KGDB, but the end result is the same. (I just defined a function: send_NMI_all_but_self().)

>>>Fix for Instruction Pointer
>>>----------------------------
>>>On hitting a breakpoint gdb does the following sequence.
>>> 1. Restore the original value at the Instruction address of the breakpoint.
>>> 2. Decrement the instruction pointer by 1
>>> 3. Issue a single step to the kgdb stub. We enable a trap flag and after executing a single instruction the debug exception is hit.
>>> 4. The remote gdb on receipt of this debug exception, reinserts the Breakpoint at the breakpoint address.
>>
>>I would expect most breakpoints to stop AT the instruction, not after it. Why is this being done at all?
>>
>>If this is to be done, then the SS should be treated just like a gdb commanded SS, i.e. the other cpu(s) should be held while the SS is being done. See the code in the mm-kgdb on this.
>
> We would stop at the breakpoint but the EIP/RIP would point to the byte after the breakpoint instruction (INT3 instr). When we set a breakpoint at the gdb console, gdb saves the original value of the byte at the breakpoint address, and inserts the breakpoint instruction opcode at the breakpoint address. When we hit the breakpoint gdb would then put back the original value at the breakpoint address and decrement the instruction pointer to restart the instruction again.
> The double fault happens when gdb doesn't do the above and continues from where the EIP/RIP is pointing to.

When does it do this? I think this is the correct operation if the breakpoint is NOT one that gdb inserted. I.e. in that case you want to continue AFTER the breakpoint. This all seems to work correctly in the x86 case. Is this a hardware or gdb issue?

>>> The problem faced is that when two CPUs try to enter the debugger at the same time on hitting a breakpoint, we experience a double fault. The reason being that for CPU1 steps 1 and 2 are executed. The trap flag is enabled and we return from the debugger. Now CPU2 enters and gdb possibly thinks that this is the single step debug exception and doesn't perform step 1 and 2.
>>
>>Clearly the first cpu should be completely in kgdb prior to doing any SS stuff. By which time the other cpus should be captured in the wait loop.
>
> How does the kgdb-mm stub handle situations where a breakpoint is hit on both the CPUs at the same time? Which one enters the debugger?

I use a version of the spin lock. First, I don't do the preempt spinlock thing, as this kgdb was used to debug preempt stuff and having kgdb mess with the preempt count was confusing, to say the least. Also, it is not needed as we always do the irq version of the spinlock (in some cases we may separate the irq and the spinlock, but the result is the same).

So what this means is that one cpu gets through and the other hangs in the entry spinlock. When the NMI all but self is issued, that cpu will be grabbed and tucked away on its very own spinlock (each cpu has a spin lock in the function I call in_kgdb() and I think you call gdb_wait()).

From this point on all is as if the second cpu had been just anywhere. An info threads command would show it in the kgdb spinlock, so, if you care, you know where it is.

> In the 2.4 kgdb code it would be possible that both of them enter the debugger (handle_exception()) and one of them acquires kgdb_lock while one waits on the lock. Now the problem arises when
> 1. Both cpus hit a breakpoint and enter the kgdb exception handler (handle_exception)
> 1. CPU 0 acquired the lock and contacted gdb. We continue at the gdb prompt.
> 2. gdb issues the "step" instruction and expects the debug exception to be called, wherein it would reinsert all the breakpoints.

Here is where it is necessary to hold those cpus other than self. This is done by not freeing them from their spinlocks in the in_kgdb() function. See the exit code from handle_exception() in the mm-kgdb.

> 3. However since CPU1 was waiting for the lock it would get it, and when it contacts gdb, gdb thinks it's the trap it was expecting, reinserts all breakpoints but doesn't decrement the instruction pointer of CPU1.
> 4. CPU1 continues at the address its instruction pointer holds and we get the double fault.
>
>>>2. When we detach from the debugger we need to handle the situation where another CPU already entered the debugger code and was waiting for the kgdb lock. Similarly for the 'k' (quit from the debugger) packet.
>>
>>I am not sure what the given kgdb does here, but usually kgdb should treat the detach as just a "c" or continue. It should then be in a state where it can handle a subsequent breakpoint and, at that point, attach again. It is up to gdb to make sure all breakpoints are cleared at this point.
>
> Right. The detach functions as a continue. But the problem is, as mentioned above, when both/multiple CPUs try to enter the debugger at the same time. One of them is waiting for a lock while the other contacts gdb, receives the "detach" command and quits the debugger. gdb has now quit too. Now the second CPU acquires the lock and waits for commands from gdb, which is probably no longer present. However this isn't a big problem and I haven't tried to fix it for the same reason. I believe we can still reattach and then detach if "detach" is what we really want.

Right. I also suspect that it might be wise to just kill the "k" command. It really isn't needed and can lead to the above confusion. I think gdb will allow a macro by the same name as a command to exist and thus one can effectively kill a command. As I said, in most cases you can just kill gdb. I don't use the k command as I am not sure what it is trying to do behind my back and I just know that, whatever it is, it is not needed.

>>I have not had any problems just aborting gdb on the host system in the middle of a kgdb session. I, in fact, do this to test new gdbs on given issues. With the mm-kgdb I can do this either when in kgdb (i.e. a breakpoint) or when not in kgdb, however, this latter case requires that I clear breakpoints first. When in a kgdb session (i.e. a breakpoint) kgdb clears all code resident breakpoints prior to its first prompt after a breakpoint, so this is not an issue at that time.
>
> Some of the issues are probably not present in the mm-kgdb code. I believe that its code is a lot different from the 2.4 stub's.

Well, I did try to make it more a system debugger than a driver debugger, so it needed to be able to stand up to some rather hard abuse. For example, even with the NMI all but self code to gather other cpus there are cases where it can fail to pull them all in. In this case the code waits an unreasonably long time and then continues, rather than hanging forever.

> I'll try it soon to see if these problems are reproducible there. I haven't been following the recent kgdb mails that well. Is the kgdb-mm code merged with the kgdb patch available at sourceforge.net? If not, could you point me to the location I can retrieve it from? Thanks in advance.

No, we are still, I believe, rather far apart, but I think both Amit and I would like them to be a bit closer. The mm- version, at this time, is for the 2.6 kernel. For the 2.4 kernels, I have a version that I think is rather close in functionality, but you need to ask me for it. The mm version for 2.6 is on the kernel.org site:
http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.2-rc2/2.6.2-rc2-mm2/broken-out/

Note that there are several versions of 2.6 here, each with an -mm version and a broken-out directory. Choose what you think most closely matches your kernel. Also, there may be one or more kgdb patches (you will usually want them all) as Andrew usually just adds a patch rather than merging them.

--
George Anzinger ge...@mv...
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml
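
[Editor's sketch] George's point about holding the other CPUs while the master single-steps can be modelled as an exit-time decision on the trap flag. The following is a simplified user-space model under stated assumptions, not the mm-kgdb code: the `cpu_hold` structure and both function names are hypothetical, and real code would use per-CPU spinlocks rather than a flag array. The trap flag being bit 8 of EFLAGS (0x100) is an x86 fact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_TF 0x100u  /* trap flag, bit 8 of EFLAGS */
#define MAX_CPUS 8

/* Hypothetical stand-in for the per-CPU spinlocks in in_kgdb(). */
struct cpu_hold {
    bool held[MAX_CPUS];
};

/* Master CPU entered the debugger: park every CPU except itself. */
static void hold_others(struct cpu_hold *h, int ncpus, int self)
{
    assert(ncpus <= MAX_CPUS);
    for (int cpu = 0; cpu < ncpus; cpu++)
        h->held[cpu] = (cpu != self);
}

/*
 * Master CPU is leaving the debugger. If it resumes with the trap flag
 * set (a single step), keep the other CPUs parked so none of them can
 * enter the stub before the step's debug exception arrives. Returns the
 * number of CPUs released.
 */
static int release_on_exit(struct cpu_hold *h, int ncpus, int self,
                           uint32_t eflags)
{
    int released = 0;

    assert(ncpus <= MAX_CPUS);
    if (eflags & X86_EFLAGS_TF)
        return 0;
    for (int cpu = 0; cpu < ncpus; cpu++) {
        if (cpu != self && h->held[cpu]) {
            h->held[cpu] = false;
            released++;
        }
    }
    return released;
}
```

With this shape, the only trap gdb can see after issuing a step is the master CPU's own debug exception, which is the property the 2.4 stub was missing.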
From: Shivram U <shi...@wi...> - 2004-02-02 14:55:31
Hi George,

> > We would stop at the breakpoint but the EIP/RIP would point to the byte after the breakpoint instruction (INT3 instr). When we set a breakpoint at the gdb console, gdb saves the original value of the byte at the breakpoint address, and inserts the breakpoint instruction opcode at the breakpoint address. When we hit the breakpoint gdb would then put back the original value at the breakpoint address and decrement the instruction pointer to restart the instruction again.
> > The double fault happens when gdb doesn't do the above and continues from where the EIP/RIP is pointing to.
>
> When does it do this? I think this is the correct operation if the breakpoint is NOT one that gdb inserted. I.e. in that case you want to continue AFTER the breakpoint. This all seems to work correctly in the x86 case. Is this a hardware or gdb issue?

gdb or the kgdb stub.

It shouldn't be done if the breakpoint is not inserted by gdb. With the 'Z' packet, the breakpoint information is within the kernel. The patch checks if gdb inserted a breakpoint at the address and only then decrements the EIP/RIP.

> > How does the kgdb-mm stub handle situations where a breakpoint is hit on both the CPUs at the same time? Which one enters the debugger?
>
> I use a version of the spin lock. First, I don't do the preempt spinlock thing, as this kgdb was used to debug preempt stuff and having kgdb mess with the preempt count was confusing, to say the least. Also, it is not needed as we always do the irq version of the spinlock (in some cases we may separate the irq and the spinlock, but the result is the same).
>
> So what this means is that one cpu gets through and the other hangs in the entry spinlock. When the NMI all but self is issued, that cpu will be grabbed and tucked away on its very own spinlock (each cpu has a spin lock in the function I call in_kgdb() and I think you call gdb_wait()).
>
> From this point on all is as if the second cpu had been just anywhere. An info threads command would show it in the kgdb spinlock, so, if you care, you know where it is.

The patch does this in a way other than what I mentioned further below. If two CPUs enter the debugger at the same time then one waits on a kgdb lock. On the NMI the CPU is pushed to gdb_wait() (in_kgdb()). When the first CPU exits handle_exception(), the second CPU acquires the lock and enters the debugger.

> > In the 2.4 kgdb code it would be possible that both of them enter the debugger (handle_exception()) and one of them acquires kgdb_lock while one waits on the lock. Now the problem arises when
> > 1. Both cpus hit a breakpoint and enter the kgdb exception handler (handle_exception)
> > 1. CPU 0 acquired the lock and contacted gdb. We continue at the gdb prompt.
> > 2. gdb issues the "step" instruction and expects the debug exception to be called, wherein it would reinsert all the breakpoints.
>
> Here is where it is necessary to hold those cpus other than self. This is done by not freeing them from their spinlocks in the in_kgdb() function. See the exit code from handle_exception() in the mm-kgdb.

Right. I see what is being done. If the trap flag is set, the other CPUs are still in in_kgdb() and the first CPU continues with the single step. Correct? This is the reason why you wouldn't face the problem, as gdb now handles the single step properly.

> > Right. The detach functions as a continue. But the problem is, as mentioned above, when both/multiple CPUs try to enter the debugger at the same time. One of them is waiting for a lock while the other contacts gdb, receives the "detach" command and quits the debugger. gdb has now quit too. Now the second CPU acquires the lock and waits for commands from gdb, which is probably no longer present. However this isn't a big problem and I haven't tried to fix it for the same reason. I believe we can still reattach and then detach if "detach" is what we really want.
>
> Right. I also suspect that it might be wise to just kill the "k" command. It really isn't needed and can lead to the above confusion. I think gdb will allow a macro by the same name as a command to exist and thus one can effectively kill a command. As I said, in most cases you can just kill gdb. I don't use the k command as I am not sure what it is trying to do behind my back and I just know that, whatever it is, it is not needed.

Yes, currently the 'k' and 'D' behave almost the same, only that I guess the 'D' needs to ack back to gdb, else gdb waits for the ack on detach.

> > Some of the issues are probably not present in the mm-kgdb code. I believe that its code is a lot different from the 2.4 stub's.
>
> Well, I did try to make it more a system debugger than a driver debugger, so it needed to be able to stand up to some rather hard abuse. For example, even with the NMI all but self code to gather other cpus there are cases where it can fail to pull them all in. In this case the code waits an unreasonably long time and then continues, rather than hanging forever.

In this case if the IPI fails then the first CPU waits forever. Could there be any specific reason why the NMI could fail, other than that the NMI was already being processed by the other CPU? But then it would eventually be processed.

> No, we are still, I believe, rather far apart, but I think both Amit and I would like them to be a bit closer. The mm- version, at this time, is for the 2.6 kernel. For the 2.4 kernels, I have a version that I think is rather close in functionality, but you need to ask me for it. The mm version for 2.6 is on the kernel.org site:
> http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.2-rc2/2.6.2-rc2-mm2/broken-out/

Please could you send across the 2.4 version. Thanks a lot.

Best Regards,
Shivram U

> --
> George Anzinger ge...@mv...
> High-res-timers: http://sourceforge.net/projects/high-res-timers/
> Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml
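
[Editor's sketch] On the 'k' versus 'D' difference mentioned in this message: in the GDB remote serial protocol a packet is framed as `$<payload>#<checksum>`, where the checksum is the modulo-256 sum of the payload bytes written as two hex digits. A detach ('D') is answered with "OK", while 'k' has no reply. The sketch below shows only that framing and reply choice; it is not code from either stub, and `rsp_frame` and `reply_for_command` are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

/* Modulo-256 sum of the payload bytes, as the protocol defines it. */
static unsigned char rsp_checksum(const char *payload)
{
    unsigned char sum = 0;

    while (*payload)
        sum += (unsigned char)*payload++;
    return sum;
}

/*
 * Frame a payload as "$<payload>#<xx>". Returns the packet length, or
 * -1 if the output buffer is too small.
 */
static int rsp_frame(const char *payload, char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "$%s#%02x", payload,
                     rsp_checksum(payload));

    return (n < 0 || (size_t)n >= outlen) ? -1 : n;
}

/*
 * Hypothetical helper: which reply payload the stub owes gdb before it
 * resumes the kernel. NULL means no reply is sent at all.
 */
static const char *reply_for_command(char cmd)
{
    switch (cmd) {
    case 'D': return "OK";  /* detach: gdb waits for this reply */
    case 'k': return NULL;  /* kill: the protocol defines no reply */
    default:  return "";    /* empty reply: packet not supported */
    }
}
```

A stub that resumes on 'D' without sending the framed "OK" leaves gdb waiting, which matches the behaviour Shivram describes.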
From: George A. <ge...@mv...> - 2004-02-02 19:21:51
Attachments:
kgdb-2.4.20-2.0.patch
Shivram U wrote:
> Hi George,
>
>>> We would stop at the breakpoint but the EIP/RIP would point to the byte after the breakpoint instruction (INT3 instr). When we set a breakpoint at the gdb console, gdb saves the original value of the byte at the breakpoint address, and inserts the breakpoint instruction opcode at the breakpoint address. When we hit the breakpoint gdb would then put back the original value at the breakpoint address and decrement the instruction pointer to restart the instruction again.
>>> The double fault happens when gdb doesn't do the above and continues from where the EIP/RIP is pointing to.
>>
>>When does it do this? I think this is the correct operation if the breakpoint is NOT one that gdb inserted. I.e. in that case you want to continue AFTER the breakpoint. This all seems to work correctly in the x86 case. Is this a hardware or gdb issue?
>
> gdb or the kgdb stub.
>
> It shouldn't be done if the breakpoint is not inserted by gdb. With the 'Z' packet, the breakpoint information is within the kernel. The patch checks if gdb inserted a breakpoint at the address and only then decrements the EIP/RIP.

Is there some reason that gdb has lost this info and doesn't know enough to back up the PC? It seems to me that this should be done by gdb.

I guess a fundamental question is just what instruction do you want to be pending when you insert a BP at location X. I want and expect it to be the instruction at X. So I expect gdb to replace the instruction at X with a BP instruction. Then, when hit, I expect it to restore that instruction and back up the PC. I do NOT expect it to execute that instruction until I either continue or single step. If at that time I have not removed the BP at X, I expect gdb to figure out a way to effectively execute the instruction. I have seen one debugger that used an execute instruction to do it, for example. Most of the time, these days, it is done by replacing the instruction, single stepping, and then setting the BP back. But this is done on the continue or SS, not on the BP trap. AND it is done by gdb with no special knowledge by kgdb.

What am I missing here?

>>> How does the kgdb-mm stub handle situations where a breakpoint is hit on both the CPUs at the same time? Which one enters the debugger?
>>
>>I use a version of the spin lock. First, I don't do the preempt spinlock thing, as this kgdb was used to debug preempt stuff and having kgdb mess with the preempt count was confusing, to say the least. Also, it is not needed as we always do the irq version of the spinlock (in some cases we may separate the irq and the spinlock, but the result is the same).
>>
>>So what this means is that one cpu gets through and the other hangs in the entry spinlock. When the NMI all but self is issued, that cpu will be grabbed and tucked away on its very own spinlock (each cpu has a spin lock in the function I call in_kgdb() and I think you call gdb_wait()).
>>
>>From this point on all is as if the second cpu had been just anywhere. An info threads command would show it in the kgdb spinlock, so, if you care, you know where it is.
>
> The patch does this in a way other than what I mentioned further below. If two CPUs enter the debugger at the same time then one waits on a kgdb lock. On the NMI the CPU is pushed to gdb_wait() (in_kgdb()). When the first CPU exits handle_exception(), the second CPU acquires the lock and enters the debugger.
>
>>> In the 2.4 kgdb code it would be possible that both of them enter the debugger (handle_exception()) and one of them acquires kgdb_lock while one waits on the lock. Now the problem arises when
>>>1. Both cpus hit a breakpoint and enter the kgdb exception handler (handle_exception)
>>>1. CPU 0 acquired the lock and contacted gdb. We continue at the gdb prompt.
>>>2. gdb issues the "step" instruction and expects the debug exception to be called, wherein it would reinsert all the breakpoints.
>>
>>Here is where it is necessary to hold those cpus other than self. This is done by not freeing them from their spinlocks in the in_kgdb() function. See the exit code from handle_exception() in the mm-kgdb.
>
> Right. I see what is being done. If the trap flag is set, the other CPUs are still in in_kgdb() and the first CPU continues with the single step. Correct? This is the reason why you wouldn't face the problem, as gdb now handles the single step properly.

Yes.

>>> Right. The detach functions as a continue. But the problem is, as mentioned above, when both/multiple CPUs try to enter the debugger at the same time. One of them is waiting for a lock while the other contacts gdb, receives the "detach" command and quits the debugger. gdb has now quit too. Now the second CPU acquires the lock and waits for commands from gdb, which is probably no longer present. However this isn't a big problem and I haven't tried to fix it for the same reason. I believe we can still reattach and then detach if "detach" is what we really want.
>>
>>Right. I also suspect that it might be wise to just kill the "k" command. It really isn't needed and can lead to the above confusion. I think gdb will allow a macro by the same name as a command to exist and thus one can effectively kill a command. As I said, in most cases you can just kill gdb. I don't use the k command as I am not sure what it is trying to do behind my back and I just know that, whatever it is, it is not needed.
>
> Yes, currently the 'k' and 'D' behave almost the same, only that I guess the 'D' needs to ack back to gdb, else gdb waits for the ack on detach.
>
>>> Some of the issues are probably not present in the mm-kgdb code. I believe that its code is a lot different from the 2.4 stub's.
>>
>>Well, I did try to make it more a system debugger than a driver debugger, so it needed to be able to stand up to some rather hard abuse. For example, even with the NMI all but self code to gather other cpus there are cases where it can fail to pull them all in. In this case the code waits an unreasonably long time and then continues, rather than hanging forever.
>
> In this case if the IPI fails then the first CPU waits forever. Could there be any specific reason why the NMI could fail, other than that the NMI was already being processed by the other CPU? But then it would eventually be processed.

I put code in to time out this wait prior to moving to the NMI way of doing things. It still times out from time to time. I think it is related to just when the cpu(s) come up. I suspect there is a window there where either we think there is another cpu and there isn't, or it is not ready enough to respond to the NMI, or we don't recognize it when it does. It does, after all, have to get to be a task before we will recognize it as having come to the party.

>>No, we are still, I believe, rather far apart, but I think both Amit and I would like them to be a bit closer. The mm- version, at this time, is for the 2.6 kernel. For the 2.4 kernels, I have a version that I think is rather close in functionality, but you need to ask me for it. The mm version for 2.6 is on the kernel.org site:
>>http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.2-rc2/2.6.2-rc2-mm2/broken-out/
>
> Please could you send across the 2.4 version. Thanks a lot.

Attached appears to be my latest one. I am not sure where this one is WRT the dwarf2 stuff in entry.S. I think I have better in that regard and know I could do even better but for the issue of no time.

--
George Anzinger ge...@mv...
High-res-timers: http://sourceforge.net/projects/high-res-timers/ Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml |
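[Editor's sketch] George's remark above about timing out the wait for the other CPUs, rather than hanging forever, can be modelled in user space as a bounded poll. This is not the mm-kgdb code: `cpus_in_kgdb`, `wait_for_cpus`, and the spin budget are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter: each CPU that answers the NMI and parks itself
 * in its holding loop would increment this. */
static atomic_int cpus_in_kgdb;

/*
 * Wait until `expected` other CPUs have checked in, but give up after
 * `max_spins` polls instead of hanging forever.
 * Returns true if every CPU arrived, false if the wait timed out.
 */
static bool wait_for_cpus(int expected, unsigned long max_spins)
{
    assert(expected >= 0);
    for (unsigned long spins = 0; spins < max_spins; spins++) {
        if (atomic_load(&cpus_in_kgdb) >= expected)
            return true;
        /* a real stub would pause here (cpu_relax()-style) between polls */
    }
    /* one last look after the budget is spent */
    return atomic_load(&cpus_in_kgdb) >= expected;
}
```

On a timeout the caller would carry on with whichever CPUs did arrive, which matches the behaviour George describes.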
From: Luben T. <lt...@pa...> - 2004-02-03 22:22:41
|
George Anzinger wrote:
> Is there some reason that gdb has lost this info and doesn't know enough
> to back up the PC?
>
> It seems to me that this should be done by gdb. I guess a fundamental
> question is just what instruction do you want to be pending when you
> insert a BP at location X. I want and expect it to be the instruction

It is the one after "int 3" (0xCC).

> at X. So I expect gdb to replace the instruction at X with a BP
> instruction. Then, when hit, I expect it to restore that instruction
> and back up the PC. I do NOT expect it to execute that instruction
> until I either continue or single step. If at that time I have not
> removed the BP at X, I expect gdb to figure out a way to effectively
> execute the instruction. I have seen one debugger that used an execute
> instruction to do it, for example. Most of time, these days, it is done
> by replacing the instruction, single stepping, and then setting the BP
> back. But this is done on the continue or SS not on the BP trap. AND it
> is done by gdb with no special knowledge by kgdb.

I've tried to outfit 2.4.20-19.9 with kgdb-1.6, then kgdb-1.7, and am now trying kgdb-2.0 (the one for 2.4.20 which you sent).

1.6 and 1.7 do not work on SMP (HT). First there is the NMI problem, which I can circumvent by enabling NMI, and second the EIP comes out wrong. I haven't tried 2.0 yet.

I picked up 2.0 since the NMI problem seems to be resolved there, but I am not sure about the SMP EIP problem.

Here is a snippet of a conversation between the target and the dev. machine, immediately prior to the EIP problem, from the target's point of view (this uses my own kgdb debugging infrastructure):

Waiting for connection from remote gdb...
Sent: $S05p0000000000000bbf#c2
Got good: $Hc-1#09
Sent: $OK#9a
Got good: $qC#b4
Sent: $QC0000000000000bbf#2e
Got good: $qOffsets#4b
Sent: $#00
Got good: $?#3f
Sent: $S05#b8
Got good: $Hgbbf#d9
Sent: $OK#9a
Got good: $g#67
Sent: $0100000020dd3bc0001ecbf6002060c3601ecbf6601ecbf6760000007f540000e78a13c00202000060000000680000006800000068000000ffff0000ffff0000#85
Got good: $qSymbol::#5b
Sent: $#00
Got good: $Hc0#db
Sent: $OK#9a
Got good: $c#63
Connected.
Sent: $S05p0000000000008000#30
Got good: $g#67
Sent: $0100000003000000fa030000000160c3389f38c0389f38c00100000000000000e78a13c0020000006000000068000000680038c068000000ffff0000ffff0000#88
Got good: $mc0145250,1#8e
Sent: $55#6a
Got good: $mc0145250,1#8e
Sent: $55#6a
Got good: $mc0145250,1#8e
Sent: $55#6a
Got good: $mc0145250,1#8e
Sent: $55#6a
Got good: $mc0145251,1#8f
Sent: $a1#92
Got good: $mc0145251,1#8f
Sent: $a1#92
Got good: $Z0,c0145250,1#d7
Sent: $OK#9a
Got good: $c#63
Sent: $S05p0000000000008000#30
Got good: $g#67
... << more output snipped >> ...
Sent: $OK#9a
Got good: $z0,c0145250,1#f7
Sent: $OK#9a
Got good: $Hc8000#73
Sent: $OK#9a
Got good: $s#73
Sent: $S05p0000000000008000#30
Got good: $g#67
Sent: $bc08000000000000c06346c0805495f79c9e38c0b89e38c00202000020000000515214c00203000060000000680000006800000068000000ffff0000ffff0000#e0
Got good: $Z0,c0145250,1#d7
Sent: $OK#9a
Got good: $Hc0#db
Sent: $OK#9a
Got good: $c#63
Sent: $S05p0000000000008000#30
Got good: $g#67
Sent: $bc08000000000000c06346c080863af7a09e38c0b89e38c00602000020000000515214c00202000060000000680000006800000068000000ffff0000ffff0000#03
Got good: $Gbc08000000000000c06346c080863af7a09e38c0b89e38c00602000020000000505214c00202000060000000680000006800000068000000ffff0000ffff0000#49
Sent: $OK#9a
Got good: $Gbc08000000000000c06346c080863af7a09e38c0b89e38c00602000020000000505214c00202000060000000680000006800000068000000ffff0000ffff0000#49
Sent: $OK#9a
Got good: $z0,c0145250,1#f7
Sent: $OK#9a
Got good: $c#63

CPU: 1
EIP: 0060:[<000000bc>] Not tainted
EFLAGS: 00010286

EIP is at Using_Versions [] 0xbb (2.4.20-19.9smp-kgdb)
eax: f6d0ec00 ebx: f7944c80 ecx: 000001f0 edx: f6d0ec00
esi: 00000206 edi: 000001f0 ebp: c020b7d4 esp: f785de74
ds: 0068 es: 0068 ss: 0068
Process klogd (pid: 2582, stackpage=f785d000)
Stack: c36b3820 ffffffe0 00000000 f715d700 f785deac c020a8ef 00000080 000001f0
       f785c000 c025dd6e 00000000 f715dc80 f715d700 f7866824 f785dec8 c020aa2e
       f715d700 0000003a 00000000 00000000 f785def4 f785df08 c025e644 f715d700
Call Trace: [<c020a8ef>] sock_alloc_send_pskb [kernel] 0xcf (0xf785de88))
[<c025dd6e>] unix_wait_for_peer [kernel] 0xae (0xf785de98))

This was on a SMP (HT) machine with a breakpoint on kmalloc().
> +/* scan for the sequence $<data>#<checksum> */
> +void getpacket(char * buffer)
> +{
> +  unsigned char checksum;
> +  unsigned char xmitcsum;
> +  int i;
> +  int count;
> +  char ch;
> +
> +  do {
> +    /* wait around for the start character, ignore all other characters */
> +    while ((ch = (getDebugChar() & 0x7f)) != '$');
> +    checksum = 0;
> +    xmitcsum = -1;
> +
> +    count = 0;
> +
> +    /* now, read until a # or end of buffer is found */
> +    while (count < BUFMAX) {
> +      ch = getDebugChar() & 0x7f;
> +      if (ch == '#') break;
> +      checksum = checksum + ch;
> +      buffer[count] = ch;
> +      count = count + 1;
> +    }
> +    buffer[count] = 0;

Wouldn't this overwrite someone else's memory if we exited on the loop invariant being false?

-- Luben |
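[Editor's sketch] The `$<data>#<checksum>` framing that `getpacket()` scans for, and that every line of the log above follows, uses a modulo-256 sum of the payload bytes, printed as two hex digits. A minimal sketch of that sum; `packet_checksum` is a name chosen here and does not come from the stub:

```c
#include <assert.h>
#include <stddef.h>

/* Modulo-256 sum of the payload bytes that sit between '$' and '#'. */
static unsigned char packet_checksum(const char *payload)
{
    unsigned char sum = 0;

    assert(payload != NULL);
    for (const char *p = payload; *p != '\0'; p++)
        sum = (unsigned char)(sum + (unsigned char)*p);
    return sum;
}
```

Applied to the log, `OK` sums to 0x9a and `Z0,c0145250,1` to 0xd7, matching the `$OK#9a` and `$Z0,c0145250,1#d7` lines.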
From: George A. <ge...@mv...> - 2004-02-03 22:55:16
|
Luben Tuikov wrote:
> George Anzinger wrote:
>
snip
>> +/* scan for the sequence $<data>#<checksum> */
>> +void getpacket(char * buffer)
>> +{
>> +  unsigned char checksum;
>> +  unsigned char xmitcsum;
>> +  int i;
>> +  int count;
>> +  char ch;
>> +
>> +  do {
>> +    /* wait around for the start character, ignore all other characters */
>> +    while ((ch = (getDebugChar() & 0x7f)) != '$');
>> +    checksum = 0;
>> +    xmitcsum = -1;
>> +
>> +    count = 0;
>> +
>> +    /* now, read until a # or end of buffer is found */
>> +    while (count < BUFMAX) {
>> +      ch = getDebugChar() & 0x7f;
>> +      if (ch == '#') break;
>> +      checksum = checksum + ch;
>> +      buffer[count] = ch;
>> +      count = count + 1;
>> +    }
>> +    buffer[count] = 0;
>
> Wouldn't this overwrite someone else's memory if we exited on
> the loop invariant being false?

Well, sort of. The test should be for one less than BUFMAX (count < BUFMAX - 1), so that the terminating zero still lands inside the buffer. Still, I think the buffer we overwrite is the output buffer, so this would probably never be an issue. It would be nice to fix it, however.

Thanks.

--
George Anzinger ge...@mv...
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml |
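[Editor's sketch] The off-by-one discussed above can be shown in isolation. This is a cut-down user-space model, not the stub's `getpacket()`: `read_payload`, `input_stream`, the tiny `BUFMAX`, and the string-backed `getDebugChar()` are all stand-ins chosen for illustration, and the checksum handling is left out. It assumes the buffer holds exactly `BUFMAX` bytes.

```c
#include <assert.h>
#include <stddef.h>

#define BUFMAX 8   /* deliberately tiny so the bound is easy to exercise */

/* Stand-in for the serial input routine: feeds characters from a string
 * and reports '#' once the string is exhausted. */
static const char *input_stream = "";

static int getDebugChar(void)
{
    return *input_stream ? (unsigned char)*input_stream++ : '#';
}

/*
 * Read packet payload up to '#', storing at most BUFMAX - 1 characters
 * so the terminating NUL always fits inside a BUFMAX-byte buffer.
 * Returns the number of payload characters stored.
 */
static int read_payload(char *buffer)
{
    int count = 0;

    assert(buffer != NULL);
    while (count < BUFMAX - 1) {        /* "one less", as suggested above */
        char ch = (char)(getDebugChar() & 0x7f);
        if (ch == '#')
            break;
        buffer[count++] = ch;
    }
    buffer[count] = 0;                  /* count <= BUFMAX - 1: in bounds */
    return count;
}
```

With the original `count < BUFMAX` test, an over-long packet would leave `count == BUFMAX` and the final store would write one byte past the end of the buffer.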
From: Amit S. K. <ami...@em...> - 2004-02-04 05:20:39
|
I think some of the SMP panics you guys have found should go away with Tom's change of reporting an exception to gdb with a thread identifier. This way gdb knows if it expects one thread to come back with a SIGTRAP and another thread faults instead.

<snip>
> 1.6 and 1.7 do not work on SMP (HT). First it is the NMI problem, which
> I can circumvent by enabling NMI, and second I get the EIP being wrong.
> I haven't tried 2.0 yet.

Can you send a gdb log with packet dumping (set debug remote 1) for the wrong-EIP bug?

> I picked up 2.0 since NMI seems to be resolved there, but not sure about
> the SMP EIP problem.
>
> Here is a snippet of a conversation between the target and the dev.
> machine, immediately prior to the EIP problem, from the target's point of
> view (this uses my own kgdb debugging infrastructure):
>
<snip>
> Got good: $s#73
> Sent: $S05p0000000000008000#30
> Got good: $g#67
> Sent: $bc08000000000000c06346c0805495f79c9e38c0b89e38c00202000020000000515214c00203000060000000680000006800000068000000ffff0000ffff0000#e0
> Got good: $Z0,c0145250,1#d7
> Sent: $OK#9a
> Got good: $Hc0#db
> Sent: $OK#9a
> Got good: $c#63
> Sent: $S05p0000000000008000#30
> Got good: $g#67
> Sent: $bc08000000000000c06346c080863af7a09e38c0b89e38c00602000020000000515214c00202000060000000680000006800000068000000ffff0000ffff0000#03
> Got good: $Gbc08000000000000c06346c080863af7a09e38c0b89e38c00602000020000000505214c00202000060000000680000006800000068000000ffff0000ffff0000#49
> Sent: $OK#9a
> Got good: $Gbc08000000000000c06346c080863af7a09e38c0b89e38c00602000020000000505214c00202000060000000680000006800000068000000ffff0000ffff0000#49
> Sent: $OK#9a
> Got good: $z0,c0145250,1#f7
> Sent: $OK#9a
> Got good: $c#63

This is strange. Why is gdb doing a continue after removing the breakpoint? It has to do a single step after it removes the breakpoint. Have you continued here after a spurious trap shown by gdb?
I have mentioned a similar problem in another email, though the one pointed out by me doesn't result in a kernel panic. A gdb side session with packet dumping would be helpful. > CPU: 1 > EIP: 0060:[<000000bc>] Not tainted > EFLAGS: 00010286 > > EIP is at Using_Versions [] 0xbb (2.4.20-19.9smp-kgdb) > eax: f6d0ec00 ebx: f7944c80 ecx: 000001f0 edx: f6d0ec00 > esi: 00000206 edi: 000001f0 ebp: c020b7d4 esp: f785de74 > ds: 0068 es: 0068 ss: 0068 > Process klogd (pid: 2582, stackpage=f785d000) > Stack: c36b3820 ffffffe0 00000000 f715d700 f785deac c020a8ef 00000080 > 000001f0 f785c000 c025dd6e 00000000 f715dc80 f715d700 f7866824 f785dec8 > c020aa2e f715d700 0000003a 00000000 00000000 f785def4 f785df08 c025e644 > f715d700 Call Trace: [<c020a8ef>] sock_alloc_send_pskb [kernel] 0xcf > (0xf785de88)) [<c025dd6e>] unix_wait_for_peer [kernel] 0xae (0xf785de98)) > > This was on a SMP (HT) machine with a breakpoint on kmalloc(). > > > +/* scan for the sequence $<data>#<checksum> */ > > +void getpacket(char * buffer) > > +{ > > + unsigned char checksum; > > + unsigned char xmitcsum; > > + int i; > > + int count; > > + char ch; > > + > > + do { > > + /* wait around for the start character, ignore all other characters > > */ + while ((ch = (getDebugChar() & 0x7f)) != '$'); > > + checksum = 0; > > + xmitcsum = -1; > > + > > + count = 0; > > + > > + /* now, read until a # or end of buffer is found */ > > + while (count < BUFMAX) { > > + ch = getDebugChar() & 0x7f; > > + if (ch == '#') break; > > + checksum = checksum + ch; > > + buffer[count] = ch; > > + count = count + 1; > > + } > > + buffer[count] = 0; > > Wouldn't this overwrite someone else's memory if we exited on > the loop invariant being false? |
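[Editor's sketch] Amit's message opens with the idea of reporting an exception to gdb together with a thread identifier. The GDB remote protocol has a stop reply of the form `T<sig>thread:<tid>;` for this. Whether Tom's change emits exactly this form is not stated in the thread, so the sketch below is an assumption; `make_stop_reply` and its parameters are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format a framed stop reply "$T<sig>thread:<tid>;#<checksum>" into out.
 * sig is the signal number (5 = SIGTRAP), tid the reporting thread.
 * Returns the packet length, or -1 if out is too small.
 */
static int make_stop_reply(char *out, size_t outsz, unsigned sig,
                           unsigned long tid)
{
    char payload[64];
    unsigned char sum = 0;
    int n;

    assert(out != NULL);
    n = snprintf(payload, sizeof payload, "T%02xthread:%lx;", sig & 0xff, tid);
    if (n < 0 || (size_t)n >= sizeof payload)
        return -1;

    /* modulo-256 checksum over the payload, as in every packet of the log */
    for (int i = 0; i < n; i++)
        sum = (unsigned char)(sum + (unsigned char)payload[i]);

    n = snprintf(out, outsz, "$%s#%02x", payload, sum);
    if (n < 0 || (size_t)n >= outsz)
        return -1;
    return n;
}
```

With the thread named in the stop reply, gdb can tell a trap from the thread it was stepping apart from a trap raised by a different CPU.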
From: Luben T. <lt...@pa...> - 2004-02-04 17:09:43
|
Amit S. Kale wrote:
>>Here is a snippet of a conversation between the target and the dev.
>>machine, immediately prior to the EIP problem, from the target's point of
>>view (this uses my own kgdb debugging infrastructure):
>
> <snip>
>
>> Got good: $s#73
>> Sent: $S05p0000000000008000#30
>> Got good: $g#67
>> Sent: $bc08000000000000c06346c0805495f79c9e38c0b89e38c00202000020000000515214c00203000060000000680000006800000068000000ffff0000ffff0000#e0
>> Got good: $Z0,c0145250,1#d7
>> Sent: $OK#9a
>> Got good: $Hc0#db
>> Sent: $OK#9a
>> Got good: $c#63
>> Sent: $S05p0000000000008000#30
>> Got good: $g#67
>> Sent: $bc08000000000000c06346c080863af7a09e38c0b89e38c00602000020000000515214c00202000060000000680000006800000068000000ffff0000ffff0000#03
>> Got good: $Gbc08000000000000c06346c080863af7a09e38c0b89e38c00602000020000000505214c00202000060000000680000006800000068000000ffff0000ffff0000#49
>> Sent: $OK#9a
>> Got good: $Gbc08000000000000c06346c080863af7a09e38c0b89e38c00602000020000000505214c00202000060000000680000006800000068000000ffff0000ffff0000#49
>> Sent: $OK#9a
>> Got good: $z0,c0145250,1#f7
>> Sent: $OK#9a
>> Got good: $c#63
>
> This is strange. Why is gdb doing a continue after removing the breakpoint? It
> has to do a single step after it removes the breakpoint. Have you continued
> here after a spurious trap shown by gdb? I have mentioned a similar problem
> in another email, though the one pointed out by me doesn't result in a kernel
> panic.

I think that it sets TF=1, so that it gets back to kgdb after the continue. I am not 100% sure; I will have to take a look at that.

> A gdb side session with packet dumping would be helpful.

I normally turn this off, since it's the same info as shown on the target, but I will get one shortly.

Thanks,
-- Luben |
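[Editor's sketch] The quoted trace is easier to follow once EIP is picked out of the register dumps. The sketch below assumes the usual i386 layout of gdb's `g`/`G` payload (eax, ecx, edx, ebx, esp, ebp, esi, edi, eip, ...), with each register sent as 8 hex digits in little-endian byte order. `g_packet_reg32` and `I386_EIP_REGNUM` are names chosen here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed position of EIP in the i386 'g' payload (see lead-in). */
#define I386_EIP_REGNUM 8

static int hex_nibble(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode 32-bit register `regnum` from a 'g' payload (no '$', no '#xx').
 * Returns 0 on success, -1 on a short or malformed payload.
 */
static int g_packet_reg32(const char *payload, int regnum, uint32_t *value)
{
    size_t off;
    uint32_t v = 0;

    assert(payload != NULL && value != NULL && regnum >= 0);
    off = (size_t)regnum * 8;
    if (strlen(payload) < off + 8)
        return -1;
    for (int byte = 0; byte < 4; byte++) {
        int hi = hex_nibble(payload[off + 2 * byte]);
        int lo = hex_nibble(payload[off + 2 * byte + 1]);
        if (hi < 0 || lo < 0)
            return -1;
        /* first byte on the wire is the least significant one */
        v |= (uint32_t)((hi << 4) | lo) << (8 * byte);
    }
    *value = v;
    return 0;
}
```

Read this way, the last `g` reply in the trace carries `515214c0` in the EIP slot, i.e. 0xc0145251, one byte past the `Z0` breakpoint at c0145250. The `G` packets gdb sends next carry `505214c0`, i.e. 0xc0145250: gdb has backed the PC up by one before issuing `z0` and `c`.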
From: Shivram U <shi...@wi...> - 2004-02-04 02:21:14
|
Hi George,

> > It shouldnt be done if the breakpoint is not inserted by gdb. With the 'Z'
> > packet, the breakpoint information is within the kernel. The patch checks if
> > gdb inserted a breakpoint at the address and only then does decrements the
> > EIP/RIP
>
> Is there some reason that gdb has lost this info and doesn't know enough to back
> up the PC?

The check in the patch runs when we have stopped on a breakpoint and are about to continue: it tests whether the PC is still the same as before. The PC should be unchanged if the breakpoint wasn't inserted by gdb; if gdb did insert it, the PC should ideally have been decremented.

However, the problem is the following scenario:

1. CPU0 enters the debugger and CPU1 waits on a lock (both of them have hit a breakpoint at the same time).
2. CPU0 contacts gdb, and at the gdb prompt we continue.
3. Gdb decrements the PC for CPU0 and clears all breakpoints. It sets the trap flag and expects to reinsert all breakpoints when the debug exception is received after the single step.
4. Now what happens if CPU1 enters the debugger? Note that it entered the debugger on a breakpoint and not because of the single step.
5. Gdb should ideally treat the exception as a breakpoint. Instead it assumes that CPU1 entered the debugger due to the trap flag, reinserts the breakpoints, and doesn't decrement the PC for CPU1. CPU1 then executes from where it left off and we get the double fault.

The gdb code seems to maintain a lot of state information after receiving a breakpoint (gdb/infrun.c).

> It seems to me that this should be done by gdb. I guess a fundamental question
> is just what instruction do you want to be pending when you insert a BP at
> location X. I want and expect it to be the instruction at X. So I expect gdb
> to replace the instruction at X with a BP instruction. Then, when hit, I expect
> it to restore that instruction and back up the PC. I do NOT expect it to
> execute that instruction until I either continue or single step. If at that
> time I have not removed the BP at X, I expect gdb to figure out a way to
> effectively execute the instruction. I have seen one debugger that used an
> execute instruction to do it, for example. Most of time, these days, it is done
> by replacing the instruction, single stepping, and then setting the BP back.
> But this is done on the continue or SS not on the BP trap. AND it is done by
> gdb with no special knowledge by kgdb.

Right, so as mentioned above, what happens if, between the continue and the single step of one CPU, another CPU which had hit the breakpoint contacts gdb? This is exactly the problem I have been facing with the 2.4 stub. I believe the code in kgdb-mm handles this situation correctly. If I'm correct, the other CPU would not execute between the continue and the single step.

> What am I missing here?

I hope I could clarify the problem. Thanks a lot for the patch. I haven't tried it out yet, but I will soon.

Best Regards,
Shivram U |
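[Editor's sketch] The check Shivram describes, backing up EIP/RIP only when gdb planted the breakpoint via a 'Z' packet, might look roughly like the following. The patch itself is not shown in the thread, so the table, its size, and all function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BREAKPOINTS 16   /* hypothetical table size */
#define INT3_LEN 1           /* the x86 INT3 opcode (0xCC) is one byte */

/* Addresses of breakpoints gdb inserted with 'Z0' and removes with 'z0'. */
static uint32_t bp_addr[MAX_BREAKPOINTS];
static bool bp_used[MAX_BREAKPOINTS];

static bool bp_is_gdb_owned(uint32_t addr)
{
    for (size_t i = 0; i < MAX_BREAKPOINTS; i++)
        if (bp_used[i] && bp_addr[i] == addr)
            return true;
    return false;
}

/* Record a 'Z0' request; returns false if the table is full. */
static bool bp_insert(uint32_t addr)
{
    if (bp_is_gdb_owned(addr))
        return true;
    for (size_t i = 0; i < MAX_BREAKPOINTS; i++) {
        if (!bp_used[i]) {
            bp_used[i] = true;
            bp_addr[i] = addr;
            return true;
        }
    }
    return false;
}

/* Forget a breakpoint on 'z0'; returns false if it was not recorded. */
static bool bp_remove(uint32_t addr)
{
    for (size_t i = 0; i < MAX_BREAKPOINTS; i++) {
        if (bp_used[i] && bp_addr[i] == addr) {
            bp_used[i] = false;
            return true;
        }
    }
    return false;
}

/*
 * After an INT3 trap the saved PC points one byte past the breakpoint.
 * Back it up only when gdb planted that breakpoint; a compiled-in INT3
 * must resume at the following instruction.
 */
static uint32_t fixup_pc_after_int3(uint32_t trap_pc)
{
    assert(trap_pc >= INT3_LEN);
    uint32_t bp = trap_pc - INT3_LEN;
    return bp_is_gdb_owned(bp) ? bp : trap_pc;
}
```

The addresses in the tests for this sketch reuse the c0145250 breakpoint from Luben's trace; the rest is illustrative.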
From: Amit S. K. <ami...@em...> - 2004-02-04 04:55:42
|
On Tuesday 03 Feb 2004 8:46 pm, Shivram U wrote: > Hi George, > > > > It shouldnt be done if the breakpoint is not inserted by gdb. > > > > With the 'Z' > > > > > packet, the breakpoint information is within the kernel. The > > > > patch checks if > > > > > gdb inserted a breakpoint at the address and only then does > > > > decrements the > > > > > EIP/RIP > > > > Is there some reason that gdb has lost this info and doesn't know > > enough to back > > up the PC? > > The check in the patch is when we stopped on a breakpoint and when we are > about to continue if the PC is still the same as before. It should be the > same if the breakpoint wasnt inserted by gdb, but on the other hand if is > inserted by gdb it should ideally be decremented > However the problem is the following scenario > 1. CPU0 enters the debugger, CPU1 waits on a lock (both of them have hit a > breakpoint at the same time) > 2. CPU0 contacts gdb and at gdb prompt we continue. > 3. Gdb decrements the PC for CPU0 and clears all breakpoints. It sets the > trap flag and expects to reinsert all breakpoints when the debug exception > is received on a single step. > 4. Now what happens if CPU1 enters the debugger ? Note that it entered the > debugger on a breakpoint and not because of the single step > 5. Gdb should ideally treat the exception as a breakpoint, however it > assumes that CPU1 entered the debugger due to the trap flag, reinserts the > breakpoints but doesnt decrement the PC for CPU1. Here CPU1 executes from > where it left off and we get the double fault. gdb knows that control has gone to another thread. It doesn't know what happened to the thread it started to single step. Since it can't maintain two contexts it discards all this information and presents the breakpoint to a user as a spurious SIGTRAP. At this point of time the other cpu could be about to fault because of the single step or it may have faulted. It'll appear again as a SIGTRAP because gdb has forgotten previous context. 
> > gdb code seems to maintain a lot of state information after receiving a > breakpoints (gdb/infrun.c). > > > It seems to me that this should be done by gdb. I guess a > > fundamental question > > is just what instruction do you want to be pending when you > > insert a BP at > > location X. I want and expect it to be the instruction at X. So > > I expect gdb > > to replace the instruction at X with a BP instruction. Then, > > when hit, I expect > > it to restore that instruction and back up the PC. I do NOT expect it to > > execute that instruction until I either continue or single step. > > If at that > > time I have not removed the BP at X, I expect gdb to figure out a way to > > effectively execute the instruction. I have seen one debugger > > that used an > > execute instruction to do it, for example. Most of time, these > > days, it is done > > by replacing the instruction, single stepping, and then setting > > the BP back. > > But this is done on the continue or SS not on the BP trap. AND > > it is done by > > gdb with no special knowledge by kgdb. > > Right, so as mentioned above what happens if between the continue and > single stepping of one CPU another CPU which had hit the breakpoint > contacts gdb. This exactly is the problem i have been facing with the 2.4 > stub. I believe the code in kgdb-mm is correct in its handling of this > situation. If im correct between the continue and the single stepping the > other CPU would not execute. Preventing other cpus from executing during a single step causes this problem: If a user steps over a spinlock which held by other cpu, we have a deadlock. Note that a user may not actually say "next" over a lock statement. A "next" over a function call which is inlined by gcc may result in the same thing, which is single stepping through the whole function code. -Amit > > > What am I missing here? > > I hope i could clarify the problem. Thanks a lot for the patch. I havent > yet tried it out yet, but i will soon. 
> > Best Regards, > Shivram U |
From: George A. <ge...@mv...> - 2004-02-04 10:16:14
|
Amit S. Kale wrote: > On Tuesday 03 Feb 2004 8:46 pm, Shivram U wrote: > >>Hi George, >> >> >>>> It shouldnt be done if the breakpoint is not inserted by gdb. >>> >>>With the 'Z' >>> >>> >>>>packet, the breakpoint information is within the kernel. The >>> >>>patch checks if >>> >>> >>>>gdb inserted a breakpoint at the address and only then does >>> >>>decrements the >>> >>> >>>>EIP/RIP >>> >>>Is there some reason that gdb has lost this info and doesn't know >>>enough to back >>>up the PC? >> >> The check in the patch is when we stopped on a breakpoint and when we are >>about to continue if the PC is still the same as before. It should be the >>same if the breakpoint wasnt inserted by gdb, but on the other hand if is >>inserted by gdb it should ideally be decremented >> However the problem is the following scenario >>1. CPU0 enters the debugger, CPU1 waits on a lock (both of them have hit a >>breakpoint at the same time) >>2. CPU0 contacts gdb and at gdb prompt we continue. >>3. Gdb decrements the PC for CPU0 and clears all breakpoints. It sets the >>trap flag and expects to reinsert all breakpoints when the debug exception >>is received on a single step. >>4. Now what happens if CPU1 enters the debugger ? Note that it entered the >>debugger on a breakpoint and not because of the single step >>5. Gdb should ideally treat the exception as a breakpoint, however it >>assumes that CPU1 entered the debugger due to the trap flag, reinserts the >>breakpoints but doesnt decrement the PC for CPU1. Here CPU1 executes from >>where it left off and we get the double fault. > > > gdb knows that control has gone to another thread. It doesn't know what > happened to the thread it started to single step. Since it can't maintain two > contexts it discards all this information and presents the breakpoint to a > user as a spurious SIGTRAP. At this point of time the other cpu could be > about to fault because of the single step or it may have faulted. 
It'll > appear again as a SIGTRAP because gdb has forgotten previous context. > > >> gdb code seems to maintain a lot of state information after receiving a >>breakpoints (gdb/infrun.c). >> >> >>>It seems to me that this should be done by gdb. I guess a >>>fundamental question >>>is just what instruction do you want to be pending when you >>>insert a BP at >>>location X. I want and expect it to be the instruction at X. So >>>I expect gdb >>>to replace the instruction at X with a BP instruction. Then, >>>when hit, I expect >>>it to restore that instruction and back up the PC. I do NOT expect it to >>>execute that instruction until I either continue or single step. >>>If at that >>>time I have not removed the BP at X, I expect gdb to figure out a way to >>>effectively execute the instruction. I have seen one debugger >>>that used an >>>execute instruction to do it, for example. Most of time, these >>>days, it is done >>>by replacing the instruction, single stepping, and then setting >>>the BP back. >>>But this is done on the continue or SS not on the BP trap. AND >>>it is done by >>>gdb with no special knowledge by kgdb. >> >> Right, so as mentioned above what happens if between the continue and >>single stepping of one CPU another CPU which had hit the breakpoint >>contacts gdb. This exactly is the problem i have been facing with the 2.4 >>stub. I believe the code in kgdb-mm is correct in its handling of this >>situation. If im correct between the continue and the single stepping the >>other CPU would not execute. > > > Preventing other cpus from executing during a single step causes this problem: > If a user steps over a spinlock which held by other cpu, we have a deadlock. > Note that a user may not actually say "next" over a lock statement. A "next" > over a function call which is inlined by gcc may result in the same thing, > which is single stepping through the whole function code. I don't think it deadlocks. 
The single-stepping CPU will continue to step; it goes around and around the spinlock loop, but it still steps. At this point the user could do whatever is needed to eliminate the lock. For example, he might set a breakpoint just beyond the spinlock and continue. To make things like this a bit easier to spot, I have a debug patch for the spinlock code that plants "current" in the spinlock structure when a lock is taken. This makes it easy to see who has the lock. It is also possible, with the mm-kgdb, to not hold the other CPUs, or to hold only some of them, on a single step.

WRT the spinlock issue, I would also note that kgdb only knows about the single step as sent by gdb. In most cases this only happens when the user request is "si", and for internal things like moving off of a breakpoint and possibly doing conditional jumps within the context of a "si" or "n".

--
George Anzinger ge...@mv...
High-res-timers: http://sourceforge.net/projects/high-res-timers/
Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml |
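[Editor's sketch] The debug aid George mentions, planting "current" in the spinlock structure when the lock is taken, can be modelled in user space with C11 atomics. This is not his patch: `struct dbg_spinlock` and the `dbg_spin_*` functions are hypothetical, and a plain pointer stands in for the kernel's `current`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical debug spinlock that remembers its holder. */
struct dbg_spinlock {
    atomic_bool locked;
    const void *owner;   /* stands in for the kernel's "current" pointer */
};

static void dbg_spin_init(struct dbg_spinlock *l)
{
    assert(l != NULL);
    atomic_init(&l->locked, false);
    l->owner = NULL;
}

/* Try once; returns 1 and records the owner on success, 0 otherwise. */
static int dbg_spin_trylock(struct dbg_spinlock *l, const void *me)
{
    if (atomic_exchange_explicit(&l->locked, true, memory_order_acquire))
        return 0;        /* already held: leave the recorded owner alone */
    l->owner = me;       /* what a debugger inspecting the lock would read */
    return 1;
}

static void dbg_spin_lock(struct dbg_spinlock *l, const void *me)
{
    while (!dbg_spin_trylock(l, me))
        ;                /* spin until the holder releases the lock */
}

static void dbg_spin_unlock(struct dbg_spinlock *l)
{
    l->owner = NULL;
    atomic_store_explicit(&l->locked, false, memory_order_release);
}
```

When a single-stepping CPU is seen looping in the lock, printing the `owner` field from gdb shows which task to look at.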
From: Amit S. K. <ami...@em...> - 2004-02-04 11:53:25
|
On Wednesday 04 Feb 2004 3:45 pm, George Anzinger wrote: > Amit S. Kale wrote: > > On Tuesday 03 Feb 2004 8:46 pm, Shivram U wrote: > >>Hi George, > >> > >>>> It shouldnt be done if the breakpoint is not inserted by gdb. > >>> > >>>With the 'Z' > >>> > >>>>packet, the breakpoint information is within the kernel. The > >>> > >>>patch checks if > >>> > >>>>gdb inserted a breakpoint at the address and only then does > >>> > >>>decrements the > >>> > >>>>EIP/RIP > >>> > >>>Is there some reason that gdb has lost this info and doesn't know > >>>enough to back > >>>up the PC? > >> > >> The check in the patch is when we stopped on a breakpoint and when we > >> are about to continue if the PC is still the same as before. It should > >> be the same if the breakpoint wasnt inserted by gdb, but on the other > >> hand if is inserted by gdb it should ideally be decremented > >> However the problem is the following scenario > >>1. CPU0 enters the debugger, CPU1 waits on a lock (both of them have hit > >> a breakpoint at the same time) > >>2. CPU0 contacts gdb and at gdb prompt we continue. > >>3. Gdb decrements the PC for CPU0 and clears all breakpoints. It sets the > >>trap flag and expects to reinsert all breakpoints when the debug > >> exception is received on a single step. > >>4. Now what happens if CPU1 enters the debugger ? Note that it entered > >> the debugger on a breakpoint and not because of the single step > >>5. Gdb should ideally treat the exception as a breakpoint, however it > >>assumes that CPU1 entered the debugger due to the trap flag, reinserts > >> the breakpoints but doesnt decrement the PC for CPU1. Here CPU1 executes > >> from where it left off and we get the double fault. > > > > gdb knows that control has gone to another thread. It doesn't know what > > happened to the thread it started to single step. Since it can't maintain > > two contexts it discards all this information and presents the breakpoint > > to a user as a spurious SIGTRAP. 
At this point of time the other cpu > > could be about to fault because of the single step or it may have > > faulted. It'll appear again as a SIGTRAP because gdb has forgotten > > previous context. > > > >> gdb code seems to maintain a lot of state information after receiving a > >>breakpoints (gdb/infrun.c). > >> > >>>It seems to me that this should be done by gdb. I guess a > >>>fundamental question > >>>is just what instruction do you want to be pending when you > >>>insert a BP at > >>>location X. I want and expect it to be the instruction at X. So > >>>I expect gdb > >>>to replace the instruction at X with a BP instruction. Then, > >>>when hit, I expect > >>>it to restore that instruction and back up the PC. I do NOT expect it > >>> to execute that instruction until I either continue or single step. If > >>> at that > >>>time I have not removed the BP at X, I expect gdb to figure out a way to > >>>effectively execute the instruction. I have seen one debugger > >>>that used an > >>>execute instruction to do it, for example. Most of time, these > >>>days, it is done > >>>by replacing the instruction, single stepping, and then setting > >>>the BP back. > >>>But this is done on the continue or SS not on the BP trap. AND > >>>it is done by > >>>gdb with no special knowledge by kgdb. > >> > >> Right, so as mentioned above what happens if between the continue and > >>single stepping of one CPU another CPU which had hit the breakpoint > >>contacts gdb. This exactly is the problem i have been facing with the 2.4 > >>stub. I believe the code in kgdb-mm is correct in its handling of this > >>situation. If im correct between the continue and the single stepping the > >>other CPU would not execute. > > > > Preventing other cpus from executing during a single step causes this > > problem: If a user steps over a spinlock which held by other cpu, we have > > a deadlock. Note that a user may not actually say "next" over a lock > > statement. 
A "next" over a function call which is inlined by gcc may
> > result in the same thing, which is single stepping through the whole
> > function code.
>
> I don't think it deadlocks. The single stepping cpu will continue to step,
> around the spinlock, but still it steps. At this point the user could do
> what ever is needed to eliminate the lock. For example, he might set a
> break point just beyond the spin lock and continue. To make it a bit
> easier to spot things like this I have a debug patch for the spinlock code
> that plants "current" in the spinlock structure when a lock is taken. This
> makes it easy to see who has the lock. It is also possible, with the
> mm-kgdb, to not hold the other cpu or to hold only some of them on a single
> step.

If a user types a "step" command, gdb expects it to finish in a fixed amount of time. That doesn't hold for spinlocks, which can spin for a potentially unbounded time. gdb doesn't let the user get out of this state; one has to kill gdb and restart it.

-Amit

> WRT the spinlock issue, I would also note that kgdb only knows about the
> single step as sent by gdb. In most cases this only happens when the user
> request is "si" and for internal things like moving off of a break point
> and possibly doing conditional jmps within the context of a "si" or "n".
From: Luben T. <lt...@pa...> - 2004-02-04 17:04:33
Amit S. Kale wrote: > On Tuesday 03 Feb 2004 8:46 pm, Shivram U wrote: > >>Hi George, >> >> >>>> It shouldnt be done if the breakpoint is not inserted by gdb. >>> >>>With the 'Z' >>> >>> >>>>packet, the breakpoint information is within the kernel. The >>> >>>patch checks if >>> >>> >>>>gdb inserted a breakpoint at the address and only then does >>> >>>decrements the >>> >>> >>>>EIP/RIP >>> >>>Is there some reason that gdb has lost this info and doesn't know >>>enough to back >>>up the PC? >> >> The check in the patch is when we stopped on a breakpoint and when we are >>about to continue if the PC is still the same as before. It should be the >>same if the breakpoint wasnt inserted by gdb, but on the other hand if is >>inserted by gdb it should ideally be decremented >> However the problem is the following scenario >>1. CPU0 enters the debugger, CPU1 waits on a lock (both of them have hit a >>breakpoint at the same time) >>2. CPU0 contacts gdb and at gdb prompt we continue. >>3. Gdb decrements the PC for CPU0 and clears all breakpoints. It sets the >>trap flag and expects to reinsert all breakpoints when the debug exception >>is received on a single step. >>4. Now what happens if CPU1 enters the debugger ? Note that it entered the >>debugger on a breakpoint and not because of the single step >>5. Gdb should ideally treat the exception as a breakpoint, however it >>assumes that CPU1 entered the debugger due to the trap flag, reinserts the >>breakpoints but doesnt decrement the PC for CPU1. Here CPU1 executes from >>where it left off and we get the double fault. > > > gdb knows that control has gone to another thread. It doesn't know what > happened to the thread it started to single step. Since it can't maintain two > contexts it discards all this information and presents the breakpoint to a > user as a spurious SIGTRAP. At this point of time the other cpu could be > about to fault because of the single step or it may have faulted. 
It'll
> appear again as a SIGTRAP because gdb has forgotten previous context.

Yes, I think this is exactly what I'm experiencing.

So the right thing to do would seem to be for kgdb to keep per-CPU breakpoint state, so that it knows whether CPU1 is entering the debugger at breakpoint b0 on an "int 3" trap or because kgdb set the TF flag in EFLAGS. Telling the two apart would make it possible to properly correct EIP and pt_regs, and to avoid blocking the other CPUs while a breakpoint is executed.

-- Luben
From: Luben T. <lt...@pa...> - 2004-02-04 18:19:04
>>> The check in the patch is when we stopped on a breakpoint and when >>> we are >>> about to continue if the PC is still the same as before. It should be >>> the >>> same if the breakpoint wasnt inserted by gdb, but on the other hand >>> if is >>> inserted by gdb it should ideally be decremented >>> However the problem is the following scenario >>> 1. CPU0 enters the debugger, CPU1 waits on a lock (both of them have >>> hit a >>> breakpoint at the same time) >>> 2. CPU0 contacts gdb and at gdb prompt we continue. >>> 3. Gdb decrements the PC for CPU0 and clears all breakpoints. It sets >>> the >>> trap flag and expects to reinsert all breakpoints when the debug >>> exception >>> is received on a single step. >>> 4. Now what happens if CPU1 enters the debugger ? Note that it >>> entered the >>> debugger on a breakpoint and not because of the single step >>> 5. Gdb should ideally treat the exception as a breakpoint, however it >>> assumes that CPU1 entered the debugger due to the trap flag, >>> reinserts the >>> breakpoints but doesnt decrement the PC for CPU1. Here CPU1 executes >>> from >>> where it left off and we get the double fault. >> >> >> >> gdb knows that control has gone to another thread. It doesn't know >> what happened to the thread it started to single step. Since it can't >> maintain two contexts it discards all this information and presents >> the breakpoint to a user as a spurious SIGTRAP. At this point of time >> the other cpu could be about to fault because of the single step or it >> may have faulted. It'll appear again as a SIGTRAP because gdb has >> forgotten previous context. > > > Yes, I think this is exactly what I'm experiencing. > > So the right thing to do would seem to be that kgdb should have to > keep a per cpu breakpoint state, so that it would know if CPU1 > is entering breakpoint b0 on "int 3" trap or because it set the > TF flag in EFLAGS. 
Discerning those would make it possible to properly
> correct for EIP and pt_regs, and be able to not block other CPUs
> on breakpoint execution.
>

Ideally gdb would keep one more piece of information, the CPU id, but since it was written as a userspace debugging tool it doesn't need this information.

What we can do is emulate this difference in state inside kgdb, since kgdb knows the CPU id, and thus ``give'' gdb the information it is actually interested in. E.g. ``kgdb_step'' will have to become an array sized by the (max) number of CPUs (and atomic too).

Another helpful piece of info is that a trap caused by TF=1 (a step) arrives as an "int 1" trap rather than "int 3". We can use this to tell whether we are continuing or have just entered the exception handler.

-- Luben
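
A minimal sketch of the per-CPU idea described above, not code from any kgdb release. All names here (`MAX_CPUS`, `kgdb_step_pending`, `kgdb_request_step`, `kgdb_classify_trap`, `kgdb_fixup_pc`) are hypothetical, and plain `bool` flags stand in for the atomics a real stub would need. It assumes the stub, rather than gdb, backs EIP up by one byte after an "int 3":

```c
#include <stdbool.h>

#define MAX_CPUS 8              /* hypothetical; a kernel stub would use NR_CPUS */

enum trap_kind {
    TRAP_STEP_DONE,             /* "int 1": the single step we requested finished */
    TRAP_BREAKPOINT,            /* "int 3": a breakpoint instruction was hit */
    TRAP_UNEXPECTED             /* debug trap on a CPU that was never stepped */
};

/* One flag per CPU, set when gdb's 's' packet is applied to that CPU. */
static bool kgdb_step_pending[MAX_CPUS];

static void kgdb_request_step(int cpu)
{
    kgdb_step_pending[cpu] = true;
}

/* Classify an exception from its vector and the per-CPU step state. */
static enum trap_kind kgdb_classify_trap(int cpu, int vector)
{
    if (vector == 3)
        return TRAP_BREAKPOINT;
    if (vector == 1 && kgdb_step_pending[cpu]) {
        kgdb_step_pending[cpu] = false;
        return TRAP_STEP_DONE;
    }
    return TRAP_UNEXPECTED;
}

/* After "int 3", EIP points one byte past the breakpoint; a step needs no fixup. */
static unsigned long kgdb_fixup_pc(enum trap_kind kind, unsigned long eip)
{
    return kind == TRAP_BREAKPOINT ? eip - 1 : eip;
}
```

With state kept per CPU, a breakpoint hit on CPU1 while CPU0 is being stepped is classified as a breakpoint instead of being mistaken for the completion of CPU0's step.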
From: George A. <ge...@mv...> - 2004-02-04 09:13:45
Shivram U wrote: > Hi George, > > >>> It shouldnt be done if the breakpoint is not inserted by gdb. >> >>With the 'Z' >> >>>packet, the breakpoint information is within the kernel. The >> >>patch checks if >> >>>gdb inserted a breakpoint at the address and only then does >> >>decrements the >> >>>EIP/RIP >> >>Is there some reason that gdb has lost this info and doesn't know >>enough to back >>up the PC? > > > The check in the patch is when we stopped on a breakpoint and when we are > about to continue if the PC is still the same as before. It should be the > same if the breakpoint wasnt inserted by gdb, but on the other hand if is > inserted by gdb it should ideally be decremented > However the problem is the following scenario > 1. CPU0 enters the debugger, CPU1 waits on a lock (both of them have hit a > breakpoint at the same time) > 2. CPU0 contacts gdb and at gdb prompt we continue. > 3. Gdb decrements the PC for CPU0 and clears all breakpoints. It sets the > trap flag and expects to reinsert all breakpoints when the debug exception > is received on a single step. > 4. Now what happens if CPU1 enters the debugger ? Note that it entered the > debugger on a breakpoint and not because of the single step > 5. Gdb should ideally treat the exception as a breakpoint, however it > assumes that CPU1 entered the debugger due to the trap flag, reinserts the > breakpoints but doesnt decrement the PC for CPU1. Here CPU1 executes from > where it left off and we get the double fault. > > gdb code seems to maintain a lot of state information after receiving a > breakpoints (gdb/infrun.c). OK, NOW I understand the problem. It would seem that holding other cpus while single stepping would eliminate this problem. I see you agree below... > > >>It seems to me that this should be done by gdb. I guess a >>fundamental question >>is just what instruction do you want to be pending when you >>insert a BP at >>location X. I want and expect it to be the instruction at X. 
So >>I expect gdb >>to replace the instruction at X with a BP instruction. Then, >>when hit, I expect >>it to restore that instruction and back up the PC. I do NOT expect it to >>execute that instruction until I either continue or single step. >>If at that >>time I have not removed the BP at X, I expect gdb to figure out a way to >>effectively execute the instruction. I have seen one debugger >>that used an >>execute instruction to do it, for example. Most of time, these >>days, it is done >>by replacing the instruction, single stepping, and then setting >>the BP back. >>But this is done on the continue or SS not on the BP trap. AND >>it is done by >>gdb with no special knowledge by kgdb. > > > Right, so as mentioned above what happens if between the continue and > single stepping of one CPU another CPU which had hit the breakpoint contacts > gdb. This exactly is the problem i have been facing with the 2.4 stub. I > believe the code in kgdb-mm is correct in its handling of this situation. If > im correct between the continue and the single stepping the other CPU would > not execute. > > >>What am I missing here? > > > I hope i could clarify the problem. Thanks a lot for the patch. I havent > yet tried it out yet, but i will soon. Yes, thanks. > > Best Regards, > Shivram U > > > Confidentiality Notice > > The information contained in this electronic message and any attachments to this message are intended > for the exclusive use of the addressee(s) and may contain confidential or privileged information. If > you are not the intended recipient, please notify the sender at Wipro or Mai...@wi... immediately > and destroy all copies of this message and any attachments. > > -- George Anzinger ge...@mv... High-res-timers: http://sourceforge.net/projects/high-res-timers/ Preemption patch: http://www.kernel.org/pub/linux/kernel/people/rml |
From: Luben T. <lt...@pa...> - 2004-02-04 04:40:39
Hi,

I just tried kgdb-2.0 with linux-2.4.20-19.9 (Red Hat's kernel) and when I drop it into the debugger at boot time (boot options: kgdb console=kgdb), it does go into the debugger, but when gdb contacts it, it dies.

I've seen the same thing with kgdb-1.6 and kgdb-1.7 for that same kernel (2.4.20-19.9).

The problem in 1.6 and 1.7 is the way the target executes the `Hc0' command directive from the development machine. The culprit is arch/i386/kernel/i386-stub.c::i386_handle_exception(), which has the following check:

    switch (remcomInBuffer[0]) {
    case 'c':
    case 's':
        if (kgdb_contthread && kgdb_contthread != current) {
            strcpy(remcomOutBuffer, "E00");
            break;
        }
        ...

This breaks when the dev. machine has sent 'Hc0' _previously_:

Got: $Hc0#db
Sent: $OK#9a
Got: $c#63
Sent: $E00#a5

And the target machine dies. The problem is:

    case 'H':
        case 'g':
            ....
            break;
        case 'c':
            atomic_set(&kgdb_killed_or_detached, 0);
            ptr = &remcomInBuffer[2];
            hexToInt(&ptr, &threadid);
            thread = getthread(threadid);
            if (!thread && threadid > 0) {
                remcomOutBuffer[0] = 'E';
                remcomOutBuffer[1] = '\0';
                break;
            }
            kgdb_contthread = thread;
            remcomOutBuffer[0] = 'O';
            remcomOutBuffer[1] = 'K';
            remcomOutBuffer[2] = '\0';
            break;
        }

If threadid == 0 (`Hc0'), then getthread() returns NULL, but !thread && threadid > 0 is FALSE and then on `c' boom!

To correct this, the processing of `Hc0' now looks like this:

    case 'H':
        /* first get the thread */
        switch (remcomInBuffer[1]) {
        case 'g':
        case 'c':
            ptr = &remcomInBuffer[2];
            threadid = 0;
            if (hexToInt(&ptr, &threadid) > 0 && threadid)
                thread = getthread(threadid);
            else
                thread = current;
        }
        /* now do the op */
        switch (remcomInBuffer[1]) {
        case 'c':
            atomic_set(&kgdb_killed_or_detached, 0);
        case 'g':
            if (!thread) {
                remcomOutBuffer[0] = 'E';
                remcomOutBuffer[1] = '\0';
            } else {
                kgdb_contthread = thread;
                remcomOutBuffer[0] = 'O';
                remcomOutBuffer[1] = 'K';
                remcomOutBuffer[2] = '\0';
            }
            break;
        }
        break;

I quickly looked at 2.0 and I suspect that the fix for 2.0 would be something similar (add threadid = thread->pid after thread is assigned above).

> +char gdbconbuf[BUFMAX];
> +
> +static void kgdb_gdb_message(const char *s, unsigned count)
> +{
> +    int i;
> +    int wcount;
> +    char *bufptr;
> +    /*
> +     * This takes care of NMI while spining out chars to gdb
> +     */
> +    IF_SMP(in_kgdb_console=1);
> +    gdbconbuf[0] = 'O';
> +    bufptr = gdbconbuf + 1;
> +    while (count > 0) {
> +        if ((count << 1) > (BUFMAX - 2)) {
> +            wcount = (BUFMAX - 2) >> 1;
> +        } else {
> +            wcount = count;
> +        }
> +        count -= wcount;
> +        for (i = 0; i < wcount; i++) {
> +            bufptr = pack_hex_byte(bufptr, s[i]);
> +        }
> +        *bufptr = '\0';
> +        s += wcount;
> +
> +        putpacket(gdbconbuf);
> +
> +    }
> +    IF_SMP(in_kgdb_console=0);
> +}

The problem with the above function is that it doesn't reset bufptr = gdbconbuf + 1 on each while (count > 0) iteration.

My version for 1.6 and 1.7 looks like this:

    static char gdbconbuf[BUFMAX];

    void gdb_console_write(struct console *co, const char *_s, unsigned count)
    {
        unsigned long flags;
        unsigned char *s = (unsigned char *) _s;

        if (!gdb_initialized ||
            atomic_read(&kgdb_killed_or_detached) ||
            atomic_read(&kgdb_debug_skip_this))
            return;

        local_irq_save(flags);

        gdbconbuf[0] = 'O';
        while (count > 0) {
            int i;
            unsigned char *p = gdbconbuf + 1;

            for (i = 0; 0 < count && i < BUFMAX-2; i += 2, count--) {
                *p++ = hexchars[*s >> 4];
                *p++ = hexchars[*s++ & 0xf];
            }
            *p = 0;
            putpacket(gdbconbuf);
        }

        local_irq_restore(flags);
    }

I'm also experiencing the same problem Shivram is describing about the EIP not being right when 2 cpus hit a breakpoint simultaneously.

Has anyone transplanted the solution from 1.6, which Shivram posted, into 1.7 yet?

Thanks,
-- Luben
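
The console-output bug discussed in this message comes down to where the write pointer is reset. The following is a standalone sketch of encoding one chunk of an 'O' (console output) packet, with the pointer reset at the start of every chunk. It is an illustration only: `encode_console_chunk` is a hypothetical helper, and `BUFMAX` is deliberately tiny here so the chunking is easy to see; the stub's real value is larger.

```c
#include <stddef.h>

#define BUFMAX 16   /* tiny on purpose, for illustration only */

static const char hexchars[] = "0123456789abcdef";

/*
 * Encode as many bytes of s as fit into one 'O' packet payload.
 * buf receives 'O', two hex digits per input byte, and a trailing NUL.
 * Returns the number of input bytes consumed.
 */
static size_t encode_console_chunk(char buf[BUFMAX], const unsigned char *s,
                                   size_t count)
{
    char *p = buf + 1;                      /* reset for every chunk */
    size_t max_bytes = (BUFMAX - 2) / 2;    /* room left after 'O' and NUL */
    size_t n = count < max_bytes ? count : max_bytes;
    size_t i;

    buf[0] = 'O';
    for (i = 0; i < n; i++) {
        *p++ = hexchars[s[i] >> 4];
        *p++ = hexchars[s[i] & 0xf];
    }
    *p = '\0';
    return n;
}
```

A caller would loop until `count` reaches zero, sending each encoded chunk (with `putpacket()` in the stub) and advancing `s` by the returned number of bytes.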
From: Luben T. <lt...@pa...> - 2004-02-05 02:04:43
Here is some output I interlaced (target and dev. machine) together to understand more of what is going on.

This is a very simple session of setting a break on kmalloc().

I normally disable output on the gdb side (dev. machine) so as not to overwhelm the debug output.

Hope someone finds this helpful.

Legend:

N.==================================================
<Text from the development machine>
--------------------------------------------------
<Text from the target machine>

Where N >= 0, integer, and the above pattern repeats.

Interlaced output
-----------------

Waiting for connection from remote gdb...
Sent: $S05p0000000000000bcb#bf
Got good: $Hc-1#09
Sent: $OK#9a
Got good: $qC#b4
Sent: $QC0000000000000bcb#2b
Got good: $qOffsets#4b
Sent: $#00
Got good: $?#3f
Sent: $S05#b8
Got good: $Hgbcb#d6
Sent: $OK#9a
Got good: $g#67
Sent: $010000000200000000bec3f6002060c360bec3f660bec3f6760000007f540000d78a13c00202000060000000680000006800000068000000ffff0000ffff0000#ba
Got good: $qSymbol::#5b
Sent: $#00
Got good: $Hc0#db
Sent: $OK#9a
Got good: $c#63
Connected.

0.==================================================

<CTRL-C>
Program received signal SIGTRAP, Trace/breakpoint trap.
breakpoint () at kgdbstub.c:1046
1046 in kgdbstub.c
(gdb) break kmalloc
Breakpoint 1 at 0xc0145240: file slab.c, line 1557.
(gdb) c
Continuing.

-------------------------------------------------- Sent: $S05p0000000000008000#30 Got good: $g#67 Sent: $0100000003000000fa030000400c60c3389f38c0389f38c00100000000000000d78a13c0020000006000000068000000680038c068000000ffff0000ffff0000#bd Got good: $c#63 Sent: $S05p0000000000008000#30 Got good: $g#67 Sent: $0100000003000000fa030000400c60c3389f38c0389f38c00100000000000000d78a13c0020000006000000068000000680038c068000000ffff0000ffff0000#bd Got good: $c#63 Sent: $S05p0000000000008000#30 Got good: $g#67 Sent: $0100000003000000fa030000400c60c3389f38c0389f38c00100000000000000d78a13c0020000006000000068000000680038c068000000ffff0000ffff0000#bd Got good: $mc0145240,1#8d Sent: $55#6a Got good: $mc0145240,1#8d Sent: $55#6a Got good: $mc0145240,1#8d Sent: $55#6a Got good: $mc0145240,1#8d Sent: $55#6a Got good: $mc0145241,1#8e Sent: $a1#92 Got good: $mc0145241,1#8e Sent: $a1#92 Got good: $Z0,c0145240,1#d6 Sent: $OK#9a Got good: $c#63 1.================================================== [New Thread 2585] [Switching to Thread 2585] Breakpoint 1, kmalloc (size=128, flags=-149549696) at slab.c:1557 1557 { (gdb) c Continuing. 
-------------------------------------------------- Sent: $S05p0000000000000a19#63 Got good: $g#67 Sent: $bc00000080000000800d16f7800d16f76cfe86f784fe86f700000000f0010000415214c00202000060000000680000006800000068000000ffff0000ffff0000#7d Got good: $P8=405214c0#88 Sent: $#00 Got good: $Gbc00000080000000800d16f7800d16f76cfe86f784fe86f700000000f0010000405214c00202000060000000680000006800000068000000ffff0000ffff0000#c3 Sent: $OK#9a Got good: $Gbc00000080000000800d16f7800d16f76cfe86f784fe86f700000000f0010000405214c00202000060000000680000006800000068000000ffff0000ffff0000#c3 Sent: $OK#9a Got good: $z0,c0145240,1#f6 Sent: $OK#9a Got good: $Hca19#76 Sent: $OK#9a Got good: $s#73 Sent: $S05p0000000000000a19#63 Got good: $g#67 Sent: $bc00000080000000800d16f7800d16f768fe86f784fe86f700000000f0010000415214c00203000060000000680000006800000068000000ffff0000ffff0000#53 Got good: $Z0,c0145240,1#d6 Sent: $OK#9a Got good: $Hc0#db Sent: $OK#9a Got good: $c#63 2.================================================== [New Thread 2581] [Switching to Thread 2581] Breakpoint 1, kmalloc (size=4159947776, flags=240) at slab.c:1557 1557 { (gdb) c Continuing. 
-------------------------------------------------- Sent: $S05p0000000000000a15#5f Got good: $g#67 Sent: $0200000000c4f3f7007a9bf7f0000000e8bd87f700be87f7010000001c000000415214c09202000060000000680000006800000068000000ffff0000ffff0000#9d Got good: $G0200000000c4f3f7007a9bf7f0000000e8bd87f700be87f7010000001c000000405214c09202000060000000680000006800000068000000ffff0000ffff0000#e3 Sent: $OK#9a Got good: $G0200000000c4f3f7007a9bf7f0000000e8bd87f700be87f7010000001c000000405214c09202000060000000680000006800000068000000ffff0000ffff0000#e3 Sent: $OK#9a Got good: $z0,c0145240,1#f6 Sent: $OK#9a Got good: $Hca15#72 Sent: $OK#9a Got good: $s#73 Sent: $S05p0000000000000a15#5f Got good: $g#67 Sent: $0200000000c4f3f7007a9bf7f0000000e4bd87f700be87f7010000001c000000415214c09203000060000000680000006800000068000000ffff0000ffff0000#9a Got good: $Z0,c0145240,1#d6 Sent: $OK#9a Got good: $Hc0#db Sent: $OK#9a Got good: $c#63 3.================================================== Breakpoint 1, kmalloc (size=4154161812, flags=240) at slab.c:1557 1557 { (gdb) c Continuing. 
-------------------------------------------------- Sent: $S05p0000000000000a15#5f Got good: $g#67 Sent: $00000000947a9bf700000000f0000000b0bd87f7c8bd87f7010000005c000000415214c08602000060000000680000006800000068000000ffff0000ffff0000#31 Got good: $G00000000947a9bf700000000f0000000b0bd87f7c8bd87f7010000005c000000405214c08602000060000000680000006800000068000000ffff0000ffff0000#77 Sent: $OK#9a Got good: $G00000000947a9bf700000000f0000000b0bd87f7c8bd87f7010000005c000000405214c08602000060000000680000006800000068000000ffff0000ffff0000#77 Sent: $OK#9a Got good: $z0,c0145240,1#f6 Sent: $OK#9a Got good: $Hca15#72 Sent: $OK#9a Got good: $s#73 Sent: $S05p0000000000000a15#5f Got good: $g#67 Sent: $00000000947a9bf700000000f0000000acbd87f7c8bd87f7010000005c000000415214c08603000060000000680000006800000068000000ffff0000ffff0000#64 Got good: $Z0,c0145240,1#d6 Sent: $OK#9a Got good: $Hc0#db Sent: $OK#9a Got good: $c#63 4.================================================== Breakpoint 1, kmalloc (size=26, flags=240) at slab.c:1557 1557 { (gdb) c Continuing. 
-------------------------------------------------- Sent: $S05p0000000000000a15#5f Got good: $g#67 Sent: $1a0000001a00000000000000f0000000ecbd87f704be87f7010000001c000000415214c08602000060000000680000006800000068000000ffff0000ffff0000#d4 Got good: $G1a0000001a00000000000000f0000000ecbd87f704be87f7010000001c000000405214c08602000060000000680000006800000068000000ffff0000ffff0000#1a Sent: $OK#9a Got good: $G1a0000001a00000000000000f0000000ecbd87f704be87f7010000001c000000405214c08602000060000000680000006800000068000000ffff0000ffff0000#1a Sent: $OK#9a Got good: $z0,c0145240,1#f6 Sent: $OK#9a Got good: $Hca15#72 Sent: $OK#9a Got good: $s#73 Sent: $S05p0000000000000a15#5f Got good: $g#67 Sent: $1a0000001a00000000000000f0000000e8bd87f704be87f7010000001c000000415214c08603000060000000680000006800000068000000ffff0000ffff0000#aa Got good: $Z0,c0145240,1#d6 Sent: $OK#9a Got good: $Hc0#db Sent: $OK#9a Got good: $c#63 5.================================================== [New Thread 2914] [Switching to Thread 2914] Breakpoint 1, kmalloc (size=0, flags=-147792512) at slab.c:1557 1557 { (gdb) c Continuing. 
-------------------------------------------------- Sent: $S05p0000000000000b62#62 Got good: $g#67 Sent: $bc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000415214c0020200006000000068000000680011f768000000ffff0000ffff0000#cc Got good: $Gbc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000405214c0020200006000000068000000680011f768000000ffff0000ffff0000#12 Sent: $OK#9a Got good: $Gbc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000405214c0020200006000000068000000680011f768000000ffff0000ffff0000#12 Sent: $OK#9a Got good: $z0,c0145240,1#f6 Sent: $OK#9a Got good: $Hcb62#75 Sent: $OK#9a Got good: $s#73 Sent: $S05p0000000000000a15#5f Got good: $g#67 Sent: $0200000000c4f3f7007a9bf7f0000000e8bd87f700be87f7010000001c000000415214c0920200006000000068000000680001f768000000ffff0000ffff0000#db Got good: $Z0,c0145240,1#d6 Sent: $OK#9a Got good: $Hc0#db Sent: $OK#9a Got good: $c#63 6.================================================== Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 2581] 0x0000001c in ?? () (gdb) delete 1 (gdb) c Continuing. Can't send signals to this remote system. SIGSEGV not sent. Program received signal SIGTRAP, Trace/breakpoint trap. 0x0000001c in ?? () (gdb) c Continuing. 
-------------------------------------------------- Sent: $S0bp0000000000000a15#8c Got good: $g#67 Sent: $80472ff7f000000080472ff7f0000000f0bd87f78cf885f8010000001c0000001c0000008202010060000000680000006800ffff68000000ffff0000ffff0000#56 Got good: $z0,c0145240,1#f6 Sent: $OK#9a Got good: $C0b#d5 Sent: $#00 Got good: $c#63 <1>Unable to handle kernel NULL pointer dereference at virtual address 0000001c printing eip: 0000001c *pde = 00000000 Oops: 0000 Sent: $S05p0000000000000a15#5f Got good: $g#67 Sent: $80472ff7f000000080472ff7f0000000f0bd87f78cf885f8010000001c0000001c0000008202010060000000680000006800ffff68000000ffff0000ffff0000#56 Got good: $c#63 lp parport nfsd iptable_filter ip_tables nfs lockd sunrpc e1000 e100 keybdev mousedev hid input usb-uhci usbcore ext3 jbd aic7xxx sd_mod scsi_mod CPU: 1 EIP: 0060:[<0000001c>] Not tainted EFLAGS: 00010282 EIP is at Using_Versions [] 0x1b (2.4.20-19.9smp-kgdb) eax: f72f4780 ebx: 000000f0 ecx: 000000f0 edx: f72f4780 esi: 00000001 edi: 0000001c ebp: f885f88c esp: f787bdf0 ds: 0068 es: 0068 ss: 0068 Process syslogd (pid: 2581, stackpage=f787b000) Stack: c36b3638 00000000 f787a000 f79b7a00 f787be1c f885739d f886026f 0000001c 000000f0 00000001 00000000 f787be44 f8857474 00000002 00000000 f79b7a00 00000000 f7873980 00000000 f7873980 f7873980 f787be64 f886ea10 f79b7a00 Call Trace: [<f885739d>] new_handle [jbd] 0x2d (0xf787be04)) [<f886026f>] .rodata.str1.1 [jbd] 0x4f (0xf787be08)) [<f8857474>] journal_start_Rsmp_e160503d [jbd] 0x94 (0xf787be20)) [<f886ea10>] ext3_dirty_inode [ext3] 0x120 (0xf787be48)) [<c016d9b1>] __mark_inode_dirty [kernel] 0xb1 (0xf787be68)) [<c01410c0>] generic_file_write [kernel] 0x290 (0xf787be80)) [<f8869115>] ext3_file_write [ext3] 0x35 (0xf787bef4)) [<c0154c90>] do_readv_writev [kernel] 0x240 (0xf787bf18)) [<f88690e0>] ext3_file_write [ext3] 0x0 (0xf787bf48)) [<c0154dee>] sys_writev [kernel] 0x5e (0xf787bfa0)) [<c0109a3f>] system_call [kernel] 0x33 (0xf787bfc0)) Code: Bad EIP value. -- Luben |
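
Every packet in the trace above has the form `$<payload>#<checksum>`, where the checksum is the modulo-256 sum of the payload bytes written as two hex digits. Below is a small sketch for checking or producing such frames when reading a trace like this one. The names `rsp_checksum` and `rsp_frame` are made up for this illustration and are not taken from the kgdb stub.

```c
#include <stdio.h>

/* Modulo-256 sum of the payload bytes that sit between '$' and '#'. */
static unsigned char rsp_checksum(const char *payload)
{
    unsigned char sum = 0;

    while (*payload)
        sum += (unsigned char)*payload++;
    return sum;
}

/*
 * Frame payload as "$<payload>#<two hex digits>".
 * Returns 0 on success, -1 if out is too small.
 */
static int rsp_frame(const char *payload, char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "$%s#%02x", payload,
                     (unsigned)rsp_checksum(payload));

    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For instance, the payload `OK` sums to 0x4f + 0x4b = 0x9a, which matches the `$OK#9a` replies in the trace.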
From: Luben T. <lt...@pa...> - 2004-02-05 20:22:09
Yesterday I posted some output which I caught at the target machine and interlaced it with gdb output.

Now that I've been looking a bit more at it, we have hard core evidence of what is going on!

It is case 5 of break and continue:

I wrote:
[cut]
> 5.==================================================
>
> [New Thread 2914]
> [Switching to Thread 2914]
>
> Breakpoint 1, kmalloc (size=0, flags=-147792512) at slab.c:1557
> 1557 {
> (gdb) c
> Continuing.
>
> --------------------------------------------------
>
> Sent: $S05p0000000000000b62#62

Here we see thread 0xb62 = 2914 break into the debugger on hitting the break point at kmalloc().

> Got good: $g#67
> Sent: $bc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000415214c0020200006000000068000000680011f768000000ffff0000ffff0000#cc
> Got good: $Gbc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000405214c0020200006000000068000000680011f768000000ffff0000ffff0000#12
> Sent: $OK#9a
> Got good: $Gbc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000405214c0020200006000000068000000680011f768000000ffff0000ffff0000#12
> Sent: $OK#9a
> Got good: $z0,c0145240,1#f6
> Sent: $OK#9a
> Got good: $Hcb62#75
> Sent: $OK#9a
> Got good: $s#73

Here we've decremented EIP, recovered the original instruction at kmalloc()'s address and set TF=1 (the trap flag in EFLAGS) and let it all loose.

But immediately after that,

> Sent: $S05p0000000000000a15#5f

thread 0xa15 = 2581 breaks into the debugger!!! And gdb (!) continues as if it were thread 0xb62 !!!

> Got good: $g#67
> Sent: $0200000000c4f3f7007a9bf7f0000000e8bd87f700be87f7010000001c000000415214c0920200006000000068000000680001f768000000ffff0000ffff0000#db
> Got good: $Z0,c0145240,1#d6
> Sent: $OK#9a
> Got good: $Hc0#db
> Sent: $OK#9a
> Got good: $c#63

And of course we get SIGSEGV on 2581 after that.

Ideally we want gdb to recognize that it was a different thread, and do the right thing.

This was run with gdbmod (gdb 6.0) from kgdb.sourceforge.net, kgdb-1.7 on 2.4.20-19.9 (RH 9). Do we know if gdb has addressed this problem? > 6.================================================== > > Program received signal SIGSEGV, Segmentation fault. > [Switching to Thread 2581] > 0x0000001c in ?? () > (gdb) delete 1 > (gdb) c > Continuing. > Can't send signals to this remote system. SIGSEGV not sent. > > Program received signal SIGTRAP, Trace/breakpoint trap. > 0x0000001c in ?? () > (gdb) c > Continuing. > > -------------------------------------------------- > > Sent: $S0bp0000000000000a15#8c > Got good: $g#67 > Sent: $80472ff7f000000080472ff7f0000000f0bd87f78cf885f8010000001c0000001c0000008202010060000000680000006800ffff68000000ffff0000ffff0000#56 > Got good: $z0,c0145240,1#f6 > Sent: $OK#9a > Got good: $C0b#d5 > Sent: $#00 > Got good: $c#63 > <1>Unable to handle kernel NULL pointer dereference at virtual address > 0000001c > printing eip: > 0000001c > *pde = 00000000 > Oops: 0000 > Sent: $S05p0000000000000a15#5f > Got good: $g#67 > Sent: $80472ff7f000000080472ff7f0000000f0bd87f78cf885f8010000001c0000001c0000008202010060000000680000006800ffff68000000ffff0000ffff0000#56 > Got good: $c#63 > lp parport nfsd iptable_filter ip_tables nfs lockd sunrpc e1000 e100 > keybdev mousedev hid input usb-uhci usbcore ext3 jbd aic7xxx sd_mod > scsi_mod > CPU: 1 > EIP: 0060:[<0000001c>] Not tainted > EFLAGS: 00010282 > > EIP is at Using_Versions [] 0x1b (2.4.20-19.9smp-kgdb) > eax: f72f4780 ebx: 000000f0 ecx: 000000f0 edx: f72f4780 > esi: 00000001 edi: 0000001c ebp: f885f88c esp: f787bdf0 > ds: 0068 es: 0068 ss: 0068 > Process syslogd (pid: 2581, stackpage=f787b000) > Stack: c36b3638 00000000 f787a000 f79b7a00 f787be1c f885739d f886026f > 0000001c > 000000f0 00000001 00000000 f787be44 f8857474 00000002 00000000 > f79b7a00 > 00000000 f7873980 00000000 f7873980 f7873980 f787be64 f886ea10 > f79b7a00 > Call Trace: [<f885739d>] new_handle [jbd] 0x2d (0xf787be04)) > [<f886026f>] 
.rodata.str1.1 [jbd] 0x4f (0xf787be08)) > [<f8857474>] journal_start_Rsmp_e160503d [jbd] 0x94 (0xf787be20)) > [<f886ea10>] ext3_dirty_inode [ext3] 0x120 (0xf787be48)) > [<c016d9b1>] __mark_inode_dirty [kernel] 0xb1 (0xf787be68)) > [<c01410c0>] generic_file_write [kernel] 0x290 (0xf787be80)) > [<f8869115>] ext3_file_write [ext3] 0x35 (0xf787bef4)) > [<c0154c90>] do_readv_writev [kernel] 0x240 (0xf787bf18)) > [<f88690e0>] ext3_file_write [ext3] 0x0 (0xf787bf48)) > [<c0154dee>] sys_writev [kernel] 0x5e (0xf787bfa0)) > [<c0109a3f>] system_call [kernel] 0x33 (0xf787bfc0)) > > > Code: Bad EIP value. > Thanks, -- Luben |
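
The analysis in this message can be checked by decoding the packets by hand. The sketch below assumes gdb's usual i386 register order, in which register 8 is EIP (consistent with the `P8=405214c0` write in the trace), and the `S<sig>p<thread>` stop-reply form that this stub emits. The helper names `hex_val`, `g_reply_reg32` and `stop_reply_thread` are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_val(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Read 32-bit register number regno from an i386 'g' reply payload.
 * Each register is 8 hex digits in target (little-endian) byte order.
 * Returns 0 on success, -1 if the reply is too short or malformed.
 */
static int g_reply_reg32(const char *reply, int regno, uint32_t *out)
{
    size_t off = (size_t)regno * 8;
    uint32_t v = 0;
    int i;

    if (strlen(reply) < off + 8)
        return -1;
    for (i = 0; i < 4; i++) {
        int hi = hex_val(reply[off + 2 * i]);
        int lo = hex_val(reply[off + 2 * i + 1]);

        if (hi < 0 || lo < 0)
            return -1;
        v |= (uint32_t)(hi << 4 | lo) << (8 * i);
    }
    *out = v;
    return 0;
}

/* Thread id from an "S<sig>p<hex thread id>" stop reply, or 0 if absent. */
static unsigned long stop_reply_thread(const char *payload)
{
    if (strlen(payload) < 5 || payload[0] != 'S' || payload[3] != 'p')
        return 0;
    return strtoul(payload + 4, NULL, 16);
}
```

Applied to the trace, register 8 of both 'g' replies decodes to 0xc0145241, one byte past the breakpoint at 0xc0145240. gdb writes the corrected value back for thread 0xb62 but never does so for thread 0xa15.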
From: Amit S. K. <ami...@em...> - 2004-02-10 06:46:16
On Friday 06 Feb 2004 1:51 am, Luben Tuikov wrote: > Yesterday I posted some output which I caught at the target machine and > interlaced it with gdb output. > > Now that I've been looking a bit more at it, we have hard core evidence > of what is going on! > > It is case 5 of break and continue: > > I wrote: > [cut] > > > 5.================================================== > > > > [New Thread 2914] > > [Switching to Thread 2914] > > > > Breakpoint 1, kmalloc (size=0, flags=-147792512) at slab.c:1557 > > 1557 { > > (gdb) c > > Continuing. > > > > -------------------------------------------------- > > > > Sent: $S05p0000000000000b62#62 > > Here we see thread 0xb62 = 2914 break int the debugger on hitting > the break point at kmalloc(). > > > Got good: $g#67 > > Sent: > > $bc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000415214c0 > >020200006000000068000000680011f768000000ffff0000ffff0000#cc Got good: > > $Gbc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000405214c > >0020200006000000068000000680011f768000000ffff0000ffff0000#12 Sent: $OK#9a > > Got good: > > $Gbc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000405214c > >0020200006000000068000000680011f768000000ffff0000ffff0000#12 Sent: $OK#9a > > Got good: $z0,c0145240,1#f6 > > Sent: $OK#9a > > Got good: $Hcb62#75 > > Sent: $OK#9a > > Got good: $s#73 > > Here we've decremented EIP, recovered the original instruction at > kmalloc()'s address and set TF=1 (EFLAGS) (trap flags) and let it all > loose. > > But immediately after that, > > > Sent: $S05p0000000000000a15#5f > > thread 0xa15 = 2581 breaks into the debugger!!! And gdb (!) continues > as if it were thread 0xb62 !!! 
>
> > Got good: $g#67
> > Sent:
> > $0200000000c4f3f7007a9bf7f0000000e8bd87f700be87f7010000001c000000415214c0
> >920200006000000068000000680001f768000000ffff0000ffff0000#db Got good:
> > $Z0,c0145240,1#d6
> > Sent: $OK#9a
> > Got good: $Hc0#db
> > Sent: $OK#9a
> > Got good: $c#63
>
> And of course we get SIGSEGV on 2581 after that.
>
> Ideally we want gdb to recognize that it was a different
> thread, and do the right thing.

For this we want kgdb to use a T packet instead of an S packet to report a signal. That way gdb knows that the signal occurred in a different thread.

>
> This was run with gdbmod (gdb 6.0) from kgdb.sourceforge.net,
> kgdb-1.7 on 2.4.20-19.9 (RH 9).
>
> Do we know if gdb has addressed this problem?

gdb doesn't exactly address this problem. It will report a spurious SIGTRAP to the user. This will avoid a panic, though.

-Amit

>
> > 6.==================================================
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 2581]
> > 0x0000001c in ?? ()
> > (gdb) delete 1
> > (gdb) c
> > Continuing.
> > Can't send signals to this remote system. SIGSEGV not sent.
> >
> > Program received signal SIGTRAP, Trace/breakpoint trap.
> > 0x0000001c in ?? ()
> > (gdb) c
> > Continuing.
> > > > -------------------------------------------------- > > > > Sent: $S0bp0000000000000a15#8c > > Got good: $g#67 > > Sent: > > $80472ff7f000000080472ff7f0000000f0bd87f78cf885f8010000001c0000001c000000 > >8202010060000000680000006800ffff68000000ffff0000ffff0000#56 Got good: > > $z0,c0145240,1#f6 > > Sent: $OK#9a > > Got good: $C0b#d5 > > Sent: $#00 > > Got good: $c#63 > > <1>Unable to handle kernel NULL pointer dereference at virtual address > > 0000001c > > printing eip: > > 0000001c > > *pde = 00000000 > > Oops: 0000 > > Sent: $S05p0000000000000a15#5f > > Got good: $g#67 > > Sent: > > $80472ff7f000000080472ff7f0000000f0bd87f78cf885f8010000001c0000001c000000 > >8202010060000000680000006800ffff68000000ffff0000ffff0000#56 Got good: > > $c#63 > > lp parport nfsd iptable_filter ip_tables nfs lockd sunrpc e1000 e100 > > keybdev mousedev hid input usb-uhci usbcore ext3 jbd aic7xxx sd_mod > > scsi_mod > > CPU: 1 > > EIP: 0060:[<0000001c>] Not tainted > > EFLAGS: 00010282 > > > > EIP is at Using_Versions [] 0x1b (2.4.20-19.9smp-kgdb) > > eax: f72f4780 ebx: 000000f0 ecx: 000000f0 edx: f72f4780 > > esi: 00000001 edi: 0000001c ebp: f885f88c esp: f787bdf0 > > ds: 0068 es: 0068 ss: 0068 > > Process syslogd (pid: 2581, stackpage=f787b000) > > Stack: c36b3638 00000000 f787a000 f79b7a00 f787be1c f885739d f886026f > > 0000001c > > 000000f0 00000001 00000000 f787be44 f8857474 00000002 00000000 > > f79b7a00 > > 00000000 f7873980 00000000 f7873980 f7873980 f787be64 f886ea10 > > f79b7a00 > > Call Trace: [<f885739d>] new_handle [jbd] 0x2d (0xf787be04)) > > [<f886026f>] .rodata.str1.1 [jbd] 0x4f (0xf787be08)) > > [<f8857474>] journal_start_Rsmp_e160503d [jbd] 0x94 (0xf787be20)) > > [<f886ea10>] ext3_dirty_inode [ext3] 0x120 (0xf787be48)) > > [<c016d9b1>] __mark_inode_dirty [kernel] 0xb1 (0xf787be68)) > > [<c01410c0>] generic_file_write [kernel] 0x290 (0xf787be80)) > > [<f8869115>] ext3_file_write [ext3] 0x35 (0xf787bef4)) > > [<c0154c90>] do_readv_writev [kernel] 0x240 (0xf787bf18)) 
> > [<f88690e0>] ext3_file_write [ext3] 0x0 (0xf787bf48)) > > [<c0154dee>] sys_writev [kernel] 0x5e (0xf787bfa0)) > > [<c0109a3f>] system_call [kernel] 0x33 (0xf787bfc0)) > > > > > > Code: Bad EIP value. > > Thanks, |
From: Amit S. K. <ami...@em...> - 2004-02-11 07:26:54
|
On Tuesday 10 Feb 2004 12:15 pm, Amit S. Kale wrote:
> On Friday 06 Feb 2004 1:51 am, Luben Tuikov wrote:
> > Yesterday I posted some output which I caught at the target machine and
> > interlaced it with gdb output.
> >
> > Now that I've been looking a bit more at it, we have hard core evidence
> > of what is going on!
> >
> > It is case 5 of break and continue:
> >
> > I wrote:
> > [cut]
> >
> > > 5.==================================================
> > >
> > > [New Thread 2914]
> > > [Switching to Thread 2914]
> > >
> > > Breakpoint 1, kmalloc (size=0, flags=-147792512) at slab.c:1557
> > > 1557 {
> > > (gdb) c
> > > Continuing.
> > >
> > > --------------------------------------------------
> > >
> > > Sent: $S05p0000000000000b62#62
> >
> > Here we see thread 0xb62 = 2914 break into the debugger on hitting
> > the break point at kmalloc().
> >
> > > Got good: $g#67
> > > Sent: $bc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000415214c0020200006000000068000000680011f768000000ffff0000ffff0000#cc
> > > Got good: $Gbc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000405214c0020200006000000068000000680011f768000000ffff0000ffff0000#12
> > > Sent: $OK#9a
> > > Got good: $Gbc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000405214c0020200006000000068000000680011f768000000ffff0000ffff0000#12
> > > Sent: $OK#9a
> > > Got good: $z0,c0145240,1#f6
> > > Sent: $OK#9a
> > > Got good: $Hcb62#75
> > > Sent: $OK#9a
> > > Got good: $s#73
> >
> > Here we've decremented EIP, recovered the original instruction at
> > kmalloc()'s address and set TF=1 (EFLAGS) (trap flags) and let it all
> > loose.
> >
> > But immediately after that,
> >
> > > Sent: $S05p0000000000000a15#5f
> >
> > thread 0xa15 = 2581 breaks into the debugger!!! And gdb (!) continues
> > as if it were thread 0xb62 !!!
> >
> > > Got good: $g#67
> > > Sent: $0200000000c4f3f7007a9bf7f0000000e8bd87f700be87f7010000001c000000415214c0920200006000000068000000680001f768000000ffff0000ffff0000#db
> > > Got good: $Z0,c0145240,1#d6
> > > Sent: $OK#9a
> > > Got good: $Hc0#db
> > > Sent: $OK#9a
> > > Got good: $c#63
> >
> > And of course we get SIGSEGV on 2581 after that.
> >
> > Ideally we want gdb to recognize that it was a different
> > thread, and do the right thing.
>
> For this we want kgdb to use T packet instead of S packet to report a
> signal. This way gdb knows that a signal occurred in a different thread.

OOPS! S packet already reports threads. For example
$S05p0000000000000b62#62
Changing to T packet won't solve this problem.
-Amit

> > This was run with gdbmod (gdb 6.0) from kgdb.sourceforge.net,
> > kgdb-1.7 on 2.4.20-19.9 (RH 9).
> >
> > Do we know if gdb has addressed this problem?
>
> gdb doesn't exactly address this problem. It will report a spurious SIGTRAP
> to a user. This will avoid a panic, though.
>
> -Amit
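The thread ids in the stop replies quoted in this thread can be checked mechanically. What follows is a minimal sketch, not code from kgdb: it assumes the reply shape seen in the log, `$S<signal>p<thread id>#<checksum>` (the `p<thread id>` suffix is inferred from those packets only), and uses the remote protocol's checksum, the sum of the payload bytes modulo 256. The helper names `rsp_checksum` and `parse_stop_reply` are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Remote-protocol checksum: sum of the payload bytes modulo 256. */
static unsigned char rsp_checksum(const char *payload, size_t len)
{
    unsigned char sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (unsigned char)(sum + (unsigned char)payload[i]);
    return sum;
}

/*
 * Parse "$S<2 hex signal digits>p<hex thread id>#<2 hex checksum digits>".
 * This shape is an assumption taken from the packets in the log.
 * Returns 0 and fills *sig and *tid on success, -1 on a malformed packet
 * or a checksum mismatch.
 */
static int parse_stop_reply(const char *pkt, unsigned *sig, unsigned long *tid)
{
    const char *hash = strchr(pkt, '#');
    char *end;

    if (pkt[0] != '$' || hash == NULL || strlen(hash) != 3)
        return -1;

    const char *payload = pkt + 1;          /* text between '$' and '#' */
    size_t len = (size_t)(hash - payload);
    if (len < 5 || payload[0] != 'S' || payload[3] != 'p')
        return -1;

    /* Compare the computed checksum with the two hex digits after '#'. */
    if (rsp_checksum(payload, len) != (unsigned char)strtoul(hash + 1, &end, 16)
        || *end != '\0')
        return -1;

    char sigbuf[3] = { payload[1], payload[2], '\0' };
    *sig = (unsigned)strtoul(sigbuf, &end, 16);
    if (*end != '\0')
        return -1;

    /* The thread id runs from just after 'p' up to the '#'. */
    *tid = strtoul(payload + 4, &end, 16);
    return end == hash ? 0 : -1;
}
```

Fed `$S05p0000000000000b62#62` and `$S05p0000000000000a15#5f`, the parser reports signal 5 for threads 2914 and 2581 respectively, which matches the ids read off by hand in the quoted analysis. The stub does report the thread, so the missing piece is on the side that consumes the report.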
From: Amit S. K. <ami...@em...> - 2004-02-11 13:55:17
|
Hi,

This problem occurs because multiple threads hit the same breakpoint in
succession and gdb treats the second breakpoint hit as the result of its
own step operation.

I read some gdb code today (Don't try this at home :-)
I found that gdb tells a stub when to resume all threads and when not to. It
can be seen in the gdb output below.

Before the s command, gdb selects the thread to be stepped using Hc. An Hc0
packet indicates that all threads are to be resumed and the current thread is
to be single stepped. An Hc<thread id> packet indicates that only the current
thread is to be single stepped while other threads are held where they are.

GDB does the following when a user issues a step command after a
breakpoint has been hit.

<all breakpoints are removed before presenting a gdb command prompt>
1. Do a single step of the current thread where it stopped because of a
   breakpoint. Other threads are to be held where they are.
2. Reinsert all breakpoints.
3. Do a single step of the current thread and resume other threads.
4. Keep repeating step 3 until the next line of C code is reached.

I believed that gdb can always detect that an exception occurred in some other
thread. That isn't true. Gdb can detect a thread change reported by an S packet
only in step 3. Since other threads are to be held during step 1, gdb
completely ignores a change of thread when the stub comes back.

So we have to do the following:
1. When single stepping on a single thread, don't let other cpus run.
2. Keep current logic when single stepping with resume for other threads.

George's kgdb prevents other cpus from running during a single step operation
all the time. It uses several globals. I don't think something that complex
is required. How about the following simple approach?

1. When single stepping on a single thread, do not release slavecpulocks. Set
   debugger_active to 0 and debugger_step to 1.
2. At the beginning of kgdb_handle_exception, if debugger_step is 1, reset it
   to 0 and do not lock slavecpulocks.
-Amit

On Friday 06 Feb 2004 1:51 am, Luben Tuikov wrote:
> Yesterday I posted some output which I caught at the target machine and
> interlaced it with gdb output.
>
> Now that I've been looking a bit more at it, we have hard core evidence
> of what is going on!
>
> It is case 5 of break and continue:
>
> I wrote:
> [cut]
>
> > 5.==================================================
> >
> > [New Thread 2914]
> > [Switching to Thread 2914]
> >
> > Breakpoint 1, kmalloc (size=0, flags=-147792512) at slab.c:1557
> > 1557 {
> > (gdb) c
> > Continuing.
> >
> > --------------------------------------------------
> >
> > Sent: $S05p0000000000000b62#62
>
> Here we see thread 0xb62 = 2914 break into the debugger on hitting
> the break point at kmalloc().
>
> > Got good: $g#67
> > Sent: $bc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000415214c0020200006000000068000000680011f768000000ffff0000ffff0000#cc
> > Got good: $Gbc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000405214c0020200006000000068000000680011f768000000ffff0000ffff0000#12
> > Sent: $OK#9a
> > Got good: $Gbc08000000000000c06346c080dd30f7c01de5f6d81de5f60602000020000000405214c0020200006000000068000000680011f768000000ffff0000ffff0000#12
> > Sent: $OK#9a
> > Got good: $z0,c0145240,1#f6
> > Sent: $OK#9a
> > Got good: $Hcb62#75
> > Sent: $OK#9a
> > Got good: $s#73
>
> Here we've decremented EIP, recovered the original instruction at
> kmalloc()'s address and set TF=1 (EFLAGS) (trap flags) and let it all
> loose.
>
> But immediately after that,
>
> > Sent: $S05p0000000000000a15#5f
>
> thread 0xa15 = 2581 breaks into the debugger!!! And gdb (!) continues
> as if it were thread 0xb62 !!!
>
> > Got good: $g#67
> > Sent: $0200000000c4f3f7007a9bf7f0000000e8bd87f700be87f7010000001c000000415214c0920200006000000068000000680001f768000000ffff0000ffff0000#db
> > Got good: $Z0,c0145240,1#d6
> > Sent: $OK#9a
> > Got good: $Hc0#db
> > Sent: $OK#9a
> > Got good: $c#63
>
> And of course we get SIGSEGV on 2581 after that.
>
> Ideally we want gdb to recognize that it was a different
> thread, and do the right thing.
>
> This was run with gdbmod (gdb 6.0) from kgdb.sourceforge.net,
> kgdb-1.7 on 2.4.20-19.9 (RH 9).
>
> Do we know if gdb has addressed this problem?
>
> > 6.==================================================
> >
> > Program received signal SIGSEGV, Segmentation fault.
> > [Switching to Thread 2581]
> > 0x0000001c in ?? ()
> > (gdb) delete 1
> > (gdb) c
> > Continuing.
> > Can't send signals to this remote system. SIGSEGV not sent.
> >
> > Program received signal SIGTRAP, Trace/breakpoint trap.
> > 0x0000001c in ?? ()
> > (gdb) c
> > Continuing.
> >
> > --------------------------------------------------
> >
> > Sent: $S0bp0000000000000a15#8c
> > Got good: $g#67
> > Sent: $80472ff7f000000080472ff7f0000000f0bd87f78cf885f8010000001c0000001c0000008202010060000000680000006800ffff68000000ffff0000ffff0000#56
> > Got good: $z0,c0145240,1#f6
> > Sent: $OK#9a
> > Got good: $C0b#d5
> > Sent: $#00
> > Got good: $c#63
> > <1>Unable to handle kernel NULL pointer dereference at virtual address 0000001c
> > printing eip:
> > 0000001c
> > *pde = 00000000
> > Oops: 0000
> > Sent: $S05p0000000000000a15#5f
> > Got good: $g#67
> > Sent: $80472ff7f000000080472ff7f0000000f0bd87f78cf885f8010000001c0000001c0000008202010060000000680000006800ffff68000000ffff0000ffff0000#56
> > Got good: $c#63
> > lp parport nfsd iptable_filter ip_tables nfs lockd sunrpc e1000 e100
> > keybdev mousedev hid input usb-uhci usbcore ext3 jbd aic7xxx sd_mod
> > scsi_mod
> > CPU: 1
> > EIP: 0060:[<0000001c>] Not tainted
> > EFLAGS: 00010282
> >
> > EIP is at Using_Versions [] 0x1b (2.4.20-19.9smp-kgdb)
> > eax: f72f4780 ebx: 000000f0 ecx: 000000f0 edx: f72f4780
> > esi: 00000001 edi: 0000001c ebp: f885f88c esp: f787bdf0
> > ds: 0068 es: 0068 ss: 0068
> > Process syslogd (pid: 2581, stackpage=f787b000)
> > Stack: c36b3638 00000000 f787a000 f79b7a00 f787be1c f885739d f886026f 0000001c
> >        000000f0 00000001 00000000 f787be44 f8857474 00000002 00000000 f79b7a00
> >        00000000 f7873980 00000000 f7873980 f7873980 f787be64 f886ea10 f79b7a00
> > Call Trace: [<f885739d>] new_handle [jbd] 0x2d (0xf787be04))
> > [<f886026f>] .rodata.str1.1 [jbd] 0x4f (0xf787be08))
> > [<f8857474>] journal_start_Rsmp_e160503d [jbd] 0x94 (0xf787be20))
> > [<f886ea10>] ext3_dirty_inode [ext3] 0x120 (0xf787be48))
> > [<c016d9b1>] __mark_inode_dirty [kernel] 0xb1 (0xf787be68))
> > [<c01410c0>] generic_file_write [kernel] 0x290 (0xf787be80))
> > [<f8869115>] ext3_file_write [ext3] 0x35 (0xf787bef4))
> > [<c0154c90>] do_readv_writev [kernel] 0x240 (0xf787bf18))
> > [<f88690e0>] ext3_file_write [ext3] 0x0 (0xf787bf48))
> > [<c0154dee>] sys_writev [kernel] 0x5e (0xf787bfa0))
> > [<c0109a3f>] system_call [kernel] 0x33 (0xf787bfc0))
> >
> >
> > Code: Bad EIP value.
>
> Thanks,
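The two-step proposal at the top of the message above (keep the slave CPUs parked across a single-thread step, and skip re-locking them on the way back in) can be modelled in a few lines. This is only a user-space sketch of the bookkeeping, not kgdb code: `debugger_active` and `debugger_step` are the names from the proposal, while `slave_held`, `MODEL_NR_CPUS`, `model_enter` and `model_resume` are hypothetical stand-ins for slavecpulocks and for the entry and exit paths of kgdb_handle_exception.

```c
#include <stdbool.h>

#define MODEL_NR_CPUS 2   /* hypothetical; the oops in the log reports "CPU: 1" */

static int debugger_active;               /* name taken from the proposal */
static int debugger_step;                 /* name taken from the proposal */
static bool slave_held[MODEL_NR_CPUS];    /* stand-in for slavecpulocks */

/* Entry to the debugger, modelling the top of kgdb_handle_exception(). */
static void model_enter(int cpu)
{
    if (debugger_step) {
        /* Back from a held single step: the slaves are still parked,
         * so there is nothing to lock again. */
        debugger_step = 0;
    } else {
        /* Normal entry: park every CPU except the one in the debugger. */
        for (int i = 0; i < MODEL_NR_CPUS; i++)
            if (i != cpu)
                slave_held[i] = true;
    }
    debugger_active = 1;
}

/*
 * Leaving the debugger. hold_others is true when gdb asked for a step of
 * one specific thread (Hc<thread id> followed by s).
 */
static void model_resume(bool single_step, bool hold_others)
{
    debugger_active = 0;
    if (single_step && hold_others) {
        debugger_step = 1;    /* keep the slave locks held */
        return;
    }
    /* Continue, or a step with resume: let the other CPUs run. */
    for (int i = 0; i < MODEL_NR_CPUS; i++)
        slave_held[i] = false;
}
```

In this model the sequence from the log (break on CPU 0, `Hcb62` plus `s`, then `Hc0` plus `c`) keeps CPU 1 parked until the final continue, so a second thread cannot reach the removed breakpoint during the step. A real stub would also need the NMI path and memory barriers, which the sketch leaves out.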
From: Luben T. <lt...@pa...> - 2004-02-11 16:59:53
|
Amit S. Kale wrote:
> Hi,
>
> This problem occurs because multiple threads hit the same breakpoint in
> succession and gdb treats the second breakpoint hit as the result of its
> own step operation.
>
> I read some gdb code today (Don't try this at home :-)
> I found that gdb tells a stub when to resume all threads and when not to. It
> can be seen in the gdb output below.
>
> Before the s command, gdb selects the thread to be stepped using Hc. An Hc0
> packet indicates that all threads are to be resumed and the current thread is
> to be single stepped. An Hc<thread id> packet indicates that only the current
> thread is to be single stepped while other threads are held where they are.
>
> GDB does the following when a user issues a step command after a
> breakpoint has been hit.
>
> <all breakpoints are removed before presenting a gdb command prompt>
> 1. Do a single step of the current thread where it stopped because of a
>    breakpoint. Other threads are to be held where they are.
> 2. Reinsert all breakpoints.
> 3. Do a single step of the current thread and resume other threads.
> 4. Keep repeating step 3 until the next line of C code is reached.
>
> I believed that gdb can always detect that an exception occurred in some other
> thread. That isn't true. Gdb can detect a thread change reported by an S packet
> only in step 3. Since other threads are to be held during step 1, gdb
> completely ignores a change of thread when the stub comes back.
>
> So we have to do the following:
> 1. When single stepping on a single thread, don't let other cpus run.
> 2. Keep current logic when single stepping with resume for other threads.
>
> George's kgdb prevents other cpus from running during a single step operation
> all the time. It uses several globals. I don't think something that complex
> is required. How about the following simple approach?
>
> 1. When single stepping on a single thread, do not release slavecpulocks. Set
>    debugger_active to 0 and debugger_step to 1.
> 2. At the beginning of kgdb_handle_exception, if debugger_step is 1, reset it
>    to 0 and do not lock slavecpulocks.

Yes, I like this. I'll try to find time to implement this
for 1.7 for 2.4.20-19.9. I have to post 1.7 for 2.4.20-19.9.

--
Luben
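To implement the scheme agreed on above, the stub has to remember what the last Hc packet asked for before it sees `s` or `c`. The sketch below shows one way that decision could be derived from the packet payload. It is an illustration under assumptions, not the kgdb implementation: the function name `hc_hold_others` is hypothetical, and thread ids `0` and `-1` are both treated as "resume the other threads" (the thread describes `Hc0` that way; gdb's remote protocol documentation uses `-1` for all threads).

```c
#include <stdlib.h>

/*
 * Decide from an "Hc..." payload (the text between '$' and '#') whether
 * the other threads must stay held during the next step.
 * Returns 1 to hold them (Hc<thread id>), 0 to resume them (Hc0, Hc-1),
 * and -1 if the payload is not a well-formed Hc packet.
 */
static int hc_hold_others(const char *payload)
{
    char *end;
    long tid;

    if (payload[0] != 'H' || payload[1] != 'c' || payload[2] == '\0')
        return -1;

    /* Thread ids are sent in hex, as in "Hcb62" for thread 2914. */
    tid = strtol(payload + 2, &end, 16);
    if (*end != '\0')
        return -1;

    return tid > 0 ? 1 : 0;
}
```

With the packets from the log, `Hcb62` gives 1 (step thread 2914 alone, keep the other CPU parked) and `Hc0` gives 0 (release everything on the following `c`). The result is what a stub would pass as the hold flag when it handles the next `s` packet.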