Thread: [SSI-devel] sys_tgkill() and pproc_psignal() problem
Brought to you by:
brucewalker,
rogertsang
From: Vladimir R. <one...@gm...> - 2006-05-17 21:45:31
|
Hi, On a node startup I've got Oops with the following call stack: -------------------------------------------------------------------------- Stack traceback for pid 72935 0xf6a319b0 72935 1 1 0 R 0xf6a31b70 *multipathd EBP EIP Function (args) 0xf5851e48 0xc0133a4d pproc_psignal+0x8d (0xf5f8c3b0, 0x1, 0xf5851ef8, 0x100000, 0x0) 0xf5851ebc 0xc0224fd5 pvpop_sigproc+0x315 (0xf63a2e00, 0x1, 0x0, 0xf5851ef8, 0x100000) 0xf5851edc 0xc022c882 vpop_sigproc+0x32 (0xf63a2e00, 0x1, 0xf5851ef8, 0x100000, 0xfffffffd) 0xf5851fb4 0xc01354af sys_tgkill+0xaf 0xc0106ae1 syscall_call+0x7 --------------------------------------------- tracing it back found the following code fragment in pproc_psignal() function (kernel/signal.c): if (flags & VSIG_THREAD) { int tgid =3D pinfo->pi_pid; if (tgid && (tgid !=3D ts->tgid))=09=09 (*) pinfo =3D 0; pinfo->pi_pid =3D ts->tgid; (**) } where line (**) is the exact place of the Oops, and pinfo at the moment is = NULL. That code definetely has a bug, however I couldn't explain why that oops happens so rare for me - sometimes it take up to 20 hrs to run a test to reproduce it... Digging dipper, I noted that in sys_tgkill() function (kernel/signal.c) siginfo struct is initialized differently for its openssi and non-openssi versions: openssi: pinfo.pi_pid =3D tgid; regular:=09pinfo.pi_pid =3D current->tgid; In other words, openssi version passed not the calling process information (see the parameter description for vpop_sigproc() and pvpop_sigproc() funct= ions) but the called process info. Are there any reasons to do that? That explains why the condition at line (*) above is false, and pinfo is not set to zero. Always, right? Wrong. Because it's possible for a task_struct pointed by ts to get changed in the middle of the call chain Am I right here? Is it why I had the oops? Thanks Vladimir |
From: Laura R. <lau...@hp...> - 2006-05-17 23:46:01
|
Hi Vladimir, You are correct that there is a bug in the code. The code should probably read something like: if (flags & VSIG_THREAD) { int tgid = pinfo->pi_pid; if (tgid && (tgid != ts->tgid)) return -ESRCH; pinfo->pi_pid = ts->tgid; } I dont have the sandbox up to date, so cant really test this, so my assessment is only by code inspection. The sys_tgkill() is suppose to send the signal sig to a process pid in the thread group tgid. My theory for why you hit the oops is that the process pid no longer belonged to thread group tgid, so pinfo was set to 0 so that ESRCH can be return which are the base semantics. However, because of the bug, it tried to dereference pinfo which had been set to NULL. The reason why pi_pid is set to tgid is because the base code does the test "p->tgid == tgid", pvproc needs to do the same test, so we store it there so we can get to it later. I dont know what your tests are doing, but the check is suppose to solve the problem of threads exiting and pids getting reused. So maybe that's why it take hours to hit the panic since the pid has to get reused, or maybe there's some race condition with exiting thread processes getting signals. laura > > tracing it back found the following code fragment in pproc_psignal() > function (kernel/signal.c): > > if (flags & VSIG_THREAD) { > int tgid = pinfo->pi_pid; > if (tgid && (tgid != ts->tgid)) (*) > pinfo = 0; > pinfo->pi_pid = ts->tgid; (**) > } > where line (**) is the exact place of the Oops, and pinfo at the moment > is NULL. > > That code definetely has a bug, > however I couldn't explain why that oops happens so rare for me - > sometimes it take up to 20 hrs to run a test to reproduce it... > > Digging dipper, I noted that in sys_tgkill() function > (kernel/signal.c) siginfo struct is initialized differently for its > openssi and non-openssi versions: > > openssi: pinfo.pi_pid = tgid; > regular: pinfo.pi_pid = current->tgid; > > In other words, openssi version passed not the calling process information > (see the parameter description for vpop_sigproc() and pvpop_sigproc() > functions) > but the called process info. Are there any reasons to do that? > > That explains why the condition at line (*) above is false, and pinfo > is not set to zero. > Always, right? Wrong. Because it's possible for a task_struct pointed > by ts to get changed in the middle of the call chain > Am I right here? Is it why I had the oops? > > Thanks > Vladimir > > > ------------------------------------------------------- > Using Tomcat but need to do more? Need to support web services, security? > Get stuff done quickly with pre-integrated technology to make your job > easier > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo > http://sel.as-us.falkag.net/sel?cmd=k&kid0709&bid&3057&dat1642 > _______________________________________________ > ssic-linux-devel mailing list > ssi...@li... > https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel > > |
From: Vladimir R. <one...@gm...> - 2006-05-18 20:45:33
|
Laura, Thank you for your reply. My original problem might be reproduced by the following small user-space program: syscall(SYS_tgkill, getpid(), 1, 1); Instead of returning -1 it opps in the kernel. I don't understand why it took so long to be reproduced by the real apps, but it doesn't matter any more. The change you've recommended fixes the problem. Thanks again, Vladimir On 5/17/06, Laura Ramirez <lau...@hp...> wrote: > > Hi Vladimir, > > You are correct that there is a bug in the code. > The code should probably read something like: > if (flags & VSIG_THREAD) { > int tgid =3D pinfo->pi_pid; > if (tgid && (tgid !=3D ts->tgid)) > return -ESRCH; > pinfo->pi_pid =3D ts->tgid; > } > I dont have the sandbox up to date, so cant really test > this, so my assessment is only by code inspection. > The sys_tgkill() is suppose to send the signal sig to a > process pid in the thread group tgid. My theory for why > you hit the oops is that the process pid no longer belonged > to thread group tgid, so pinfo was set to 0 so that ESRCH > can be return which are the base semantics. However, because > of the bug, it tried to dereference pinfo which had been set > to NULL. The reason why pi_pid is set to tgid is because the > base code does the test "p->tgid =3D=3D tgid", pvproc needs to > do the same test, so we store it there so we can get to it later. > I dont know what your tests are doing, but the check is suppose > to solve the problem of threads exiting and pids getting reused. > So maybe that's why it take hours to hit the panic since the pid > has to get reused, or maybe there's some race condition with > exiting thread processes getting signals. > > laura > > > > > > tracing it back found the following code fragment in pproc_psignal() > > function (kernel/signal.c): > > > > if (flags & VSIG_THREAD) { > > int tgid =3D pinfo->pi_pid; > > if (tgid && (tgid !=3D ts->tgid)) (*) > > pinfo =3D 0; > > pinfo->pi_pid =3D ts->tgid; (**) > > } > > where line (**) is the exact place of the Oops, and pinfo at the moment > > is NULL. > > > > That code definetely has a bug, > > however I couldn't explain why that oops happens so rare for me - > > sometimes it take up to 20 hrs to run a test to reproduce it... > > > > Digging dipper, I noted that in sys_tgkill() function > > (kernel/signal.c) siginfo struct is initialized differently for its > > openssi and non-openssi versions: > > > > openssi: pinfo.pi_pid =3D tgid; > > regular: pinfo.pi_pid =3D current->tgid; > > > > In other words, openssi version passed not the calling process informat= ion > > (see the parameter description for vpop_sigproc() and pvpop_sigproc() > > functions) > > but the called process info. Are there any reasons to do that? > > > > That explains why the condition at line (*) above is false, and pinfo > > is not set to zero. > > Always, right? Wrong. Because it's possible for a task_struct pointed > > by ts to get changed in the middle of the call chain > > Am I right here? Is it why I had the oops? > > > > Thanks > > Vladimir > > > > > > ------------------------------------------------------- > > Using Tomcat but need to do more? Need to support web services, securit= y? > > Get stuff done quickly with pre-integrated technology to make your job > > easier > > Download IBM WebSphere Application Server v.1.0.1 based on Apache Geron= imo > > http://sel.as-us.falkag.net/sel?cmd=3Dk&kid=120709&bid&3057&dat=121642 > > _______________________________________________ > > ssic-linux-devel mailing list > > ssi...@li... > > https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel > > > > > |
From: Laura R. <lau...@hp...> - 2006-05-18 23:53:54
|
Hi Vladimir, Thanks for testing the fix. I have just checked in the change into the repository. laura Vladimir Razgulin wrote: > Laura, > Thank you for your reply. > My original problem might be reproduced by the following small > user-space program: > > syscall(SYS_tgkill, getpid(), 1, 1); > > Instead of returning -1 it opps in the kernel. > I don't understand why it took so long to be reproduced by the real > apps, but it doesn't matter any more. > The change you've recommended fixes the problem. > Thanks again, > Vladimir > > > On 5/17/06, Laura Ramirez <lau...@hp...> wrote: >> >> Hi Vladimir, >> >> You are correct that there is a bug in the code. >> The code should probably read something like: >> if (flags & VSIG_THREAD) { >> int tgid = pinfo->pi_pid; >> if (tgid && (tgid != ts->tgid)) >> return -ESRCH; >> pinfo->pi_pid = ts->tgid; >> } >> I dont have the sandbox up to date, so cant really test >> this, so my assessment is only by code inspection. >> The sys_tgkill() is suppose to send the signal sig to a >> process pid in the thread group tgid. My theory for why >> you hit the oops is that the process pid no longer belonged >> to thread group tgid, so pinfo was set to 0 so that ESRCH >> can be return which are the base semantics. However, because >> of the bug, it tried to dereference pinfo which had been set >> to NULL. The reason why pi_pid is set to tgid is because the >> base code does the test "p->tgid == tgid", pvproc needs to >> do the same test, so we store it there so we can get to it later. >> I dont know what your tests are doing, but the check is suppose >> to solve the problem of threads exiting and pids getting reused. >> So maybe that's why it take hours to hit the panic since the pid >> has to get reused, or maybe there's some race condition with >> exiting thread processes getting signals. >> >> laura >> >> >> > >> > tracing it back found the following code fragment in pproc_psignal() >> > function (kernel/signal.c): >> > >> > if (flags & VSIG_THREAD) { >> > int tgid = pinfo->pi_pid; >> > if (tgid && (tgid != ts->tgid)) (*) >> > pinfo = 0; >> > pinfo->pi_pid = ts->tgid; (**) >> > } >> > where line (**) is the exact place of the Oops, and pinfo at the moment >> > is NULL. >> > >> > That code definetely has a bug, >> > however I couldn't explain why that oops happens so rare for me - >> > sometimes it take up to 20 hrs to run a test to reproduce it... >> > >> > Digging dipper, I noted that in sys_tgkill() function >> > (kernel/signal.c) siginfo struct is initialized differently for its >> > openssi and non-openssi versions: >> > >> > openssi: pinfo.pi_pid = tgid; >> > regular: pinfo.pi_pid = current->tgid; >> > >> > In other words, openssi version passed not the calling process >> information >> > (see the parameter description for vpop_sigproc() and pvpop_sigproc() >> > functions) >> > but the called process info. Are there any reasons to do that? >> > >> > That explains why the condition at line (*) above is false, and pinfo >> > is not set to zero. >> > Always, right? Wrong. Because it's possible for a task_struct pointed >> > by ts to get changed in the middle of the call chain >> > Am I right here? Is it why I had the oops? >> > >> > Thanks >> > Vladimir >> > >> > >> > ------------------------------------------------------- >> > Using Tomcat but need to do more? Need to support web services, >> security? >> > Get stuff done quickly with pre-integrated technology to make your job >> > easier >> > Download IBM WebSphere Application Server v.1.0.1 based on Apache >> Geronimo >> > http://sel.as-us.falkag.net/sel?cmd=k&kid0709&bid&3057&dat1642 >> > _______________________________________________ >> > ssic-linux-devel mailing list >> > ssi...@li... >> > https://lists.sourceforge.net/lists/listinfo/ssic-linux-devel >> > >> > >> > > |