#28 PCD becomes an immortal process

open
nobody
None
5
2012-03-01
2012-03-01
No

After running for some time, PCD stops working properly. It becomes immortal: sending a kill signal has no effect. In this state, if I kill a managed process, pcd does not restart it. The ps output shows strange values in the TIME field. This is the captured ps output:
~ # ps -eo 'pid,ppid,user,group,stat,tty,comm,time'
PID PPID USER GROUP STAT TT COMMAND TIME
172 1 root 0 R ? pcd 27818:
754 172 root 0 Z ? framerd 29:34

framerd is the managed process I killed manually by sending a TERM signal. It became a zombie.
I'm using pcd 1.1.4.

Discussion

  • Daniel Fraile

    Daniel Fraile - 2012-04-16

    I'm very interested in using pcd in my project, but I can't use it with this bug. Could someone point me in the right direction to try to fix it?

     
  • Hai Shalom

    Hai Shalom - 2012-04-17

    It looks like pcd is running in an endless loop (the R state may indicate that).
    Does this happen all of the time? Does it happen with other processes that pcd manages?
    On which platform are you running?

    I will try to reproduce it locally. If you want to take a look yourself, open process.c and look for the SIGCHLD handler. When a process terminates, PCD should pick up this signal.

     
  • Daniel Fraile

    Daniel Fraile - 2012-04-17

    I'm running pcd on a MicroBlaze platform (a soft processor running on an FPGA device). I've seen the same behaviour, with the pcd process running but not responding, even without a zombie supervised process. What I mean is that the supervised processes are running without problems, but pcd is blocked, as you can see with ps:
    PID PPID USER GROUP STAT TT COMMAND TIME
    179 1 root 0 R ? pcd 27817:
    It always happens, sometimes a few minutes after booting the kernel and sometimes some hours later. So I'm not sure that a loop in the SIGCHLD handler is the origin of the problem. Anyway, I'll check it.

     
  • Daniel Fraile

    Daniel Fraile - 2012-04-18

    Using strace, I've found that pcd is always stuck at the select inside the IPC_wait_msg function (line 589 of ipc.c):
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(5, [], NULL, NULL, {0, 0}) = 0 (Timeout)
    read(4, "", 224) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(5, [], NULL, NULL, {0, 0}) = 0 (Timeout)
    read(4, "", 224) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}) = 0 (Timeout)
    nanosleep({0, 200000000}, NULL) = 0
    select(4, [3], NULL, NULL, {0, 0}

    I'm not sure whether this is a symptom or the origin of the problem, because select is called in a non-blocking way.
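
The pattern visible in the trace above (a `select` with a zero timeout followed by a 200 ms `nanosleep`) can be sketched as follows. This is an illustrative reconstruction from the strace output only, not the code of IPC_wait_msg; the function names `fd_readable_now` and `wait_for_data` and the `max_tries` parameter are hypothetical.

```c
#include <sys/select.h>
#include <time.h>
#include <unistd.h>

/* Returns 1 if fd is readable right now, 0 if not, -1 on error.
   The zero timeout makes select() return immediately. */
static int fd_readable_now(int fd)
{
    fd_set readfds;
    struct timeval no_wait = { 0, 0 };
    int rc;

    FD_ZERO(&readfds);
    FD_SET(fd, &readfds);

    rc = select(fd + 1, &readfds, NULL, NULL, &no_wait);
    if (rc < 0) {
        return -1;
    }
    return FD_ISSET(fd, &readfds) ? 1 : 0;
}

/* Hypothetical polling loop: check fd up to max_tries times and
   sleep 200 ms after each miss. Returns 1 if data is available,
   0 if all tries were used up, -1 on error. */
static int wait_for_data(int fd, int max_tries)
{
    const struct timespec pause_200ms = { 0, 200000000L };
    int i;

    for (i = 0; i < max_tries; i++) {
        int rc = fd_readable_now(fd);
        if (rc != 0) {
            return rc;
        }
        nanosleep(&pause_200ms, NULL);
    }
    return 0;
}
```

A loop of this shape spends most of its time inside `nanosleep`, so each `select` returning 0 with "(Timeout)" only means that nothing was readable at that instant.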

     
  • Hai Shalom

    Hai Shalom - 2012-05-21

    I cannot reproduce this here. The select calls you see are non-blocking, so pcd appears to be alive. Please let me know if you have more information.

     
  • Daniel Fraile

    Daniel Fraile - 2012-07-05

    Upgrading the kernel from 2.6.37.4 to 2.6.37.6 fixes this issue.

     
