#167 OpenSSI fails the glibc tst-basic3 test

v1.9.1
closed-fixed
Roger Tsang
5
2008-10-19
2008-07-04
John Hughes
No

The test starts some threads, waits for them to exit, and then sends a signal. It complains that the signal is not received.

$ cc -pthread tst-basic3.c
$ ./a.out
starting 20 + 1 threads
20 left
19 left
18 left
17 left
16 left
15 left
14 left
13 left
12 left
11 left
10 left
9 left
8 left
7 left
6 left
5 left
4 left
3 left
2 left
1 left
0 left
final_test has been called
Expected signal 'User defined signal 1' from child, got none

Discussion

  • John Hughes
    2008-07-04

     
    Attachments
  • John Hughes
    2008-07-04

    Full trace of what happens.

     
    Attachments
  • John Hughes
    2008-07-04

    Here's what goes wrong:

    69772 execve("./tst-basic3", ["./tst-basic3"], [/* 15 vars */]) = 0
    [...]
    69793 write(1, " 0 left\n", 8) = 8
    69793 write(1, "final_test has been called", 26) = 26
    69793 write(1, "\n", 1) = 1
    69793 kill(69773, SIGUSR1) = -1 ESRCH (No such process)
    69793 exit_group(0) = ?
    69772 <... waitpid resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0) = 69773
    69772 --- SIGCHLD (Child exited) @ 0 (0) ---
    69772 futex(0xb7f409c0, FUTEX_WAKE, 2147483647) = 0
    69772 write(2, "Expected signal \'User defined si"..., 61) = 61
    69772 exit_group(1) = ?

    I.e. the signal is being sent to the wrong process, 69773 instead of 69772. Why?

    The code is "kill (getpid(), SIGUSR1)"

    Weird: it looks like the group leader process is exiting. Here's another trace:

    File Added: zz-strace

     
  • John Hughes
    2008-07-04

    • labels: --> Process Management
     
  • John Hughes
    2008-07-04

    Excerpts from trace:

    69921 execve("./a.out", ["./a.out"], [/* 15 vars */]) = 0
    [...]
    69921 write(2, "Start main pid 69921\n", 21) = 21
    69921 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0xb7db8708) = 69922
    [ that "clone" is a fork ]
    69921 waitpid(69922, <unfinished ...>
    [...]
    69922 write(2, "test running as pid 69922\n", 26) = 26
    69922 clone(child_stack=0xb7db74c4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, parent_tidptr=0xb7db7bf8, {entry_number:6, base_addr:0xb7db7bb0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7db7bf8) = 69923
    [ that "clone" is a pthread_create ]
    [...]
    69922 munmap(0xb75b1000, 27800) = 0
    69922 futex(0xb75b0be4, FUTEX_WAKE, 2147483647) = 0
    69922 _exit(0)
    [ weird: why is 69922 exiting? I'd expect it to be 69923 ]
    69923 write(1, " 0 left in pid 69922\n", 21) = 21
    69923 write(2, "final_test has been called from "..., 42) = 42
    69923 kill(69922, SIGUSR1) = -1 ESRCH (No such process)
    69923 exit_group(0)

    It looks like the thread is running in the wrong clone.

     
  • John Hughes
    2008-07-04

    Here's a trace of the same process on a stock 2.6.11 kernel. The odd behaviour of the processes ("parent" thread exits before child) is the same.

    What's different on the non-OpenSSI kernel is the signal delivery - that works.

    2581 clone(child_stack=0xb7e9b4c4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID|CLONE_DETACHED, parent_tidptr=0xb7e9bbf8, {entry_number:6, base_addr:0xb7e9bbb0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7e9bbf8) = 2582
    2581 write(1, " 1 left in pid 2581\n", 20) = 20
    [...]
    2581 _exit(0) = ?
    2582 write(1, " 0 left in pid 2581\n", 20) = 20
    2582 write(2, "final_test has been called from "..., 41) = 41
    2582 kill(2581, SIGUSR1) = 0
    2582 --- SIGUSR1 (User defined signal 1) @ 0 (0) ---

    File Added: zz-2.6.11
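
The difference between the two traces comes down to whether the process ID still resolves once the initial thread has exited. A minimal sketch of a probe for exactly that, using signal 0 so that nothing is actually delivered, is given here. The function names and the SKETCH_MAIN macro are hypothetical; the expectation that the probe succeeds is taken from the stock-kernel trace, where kill() returned 0.

```c
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_t initial_thread;   /* set before the worker is created */

/* Runs in the second thread: once the initial thread is gone, ask the
   kernel whether our own process ID can still be signalled. */
static void *probe_own_pid(void *unused)
{
    (void)unused;
    if (pthread_join(initial_thread, NULL) != 0)
        _exit(2);
    /* Signal 0 performs the existence and permission checks only. */
    _exit(kill(getpid(), 0) == 0 ? 0 : 1);
}

/* Returns 1 if the PID was still reachable, 0 if kill() failed,
   -1 on a setup error. */
int pid_visible_after_leader_exit(void)
{
    int status;
    pthread_t t;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        initial_thread = pthread_self();
        if (pthread_create(&t, NULL, probe_own_pid, NULL) != 0)
            _exit(2);
        pthread_exit(NULL);     /* initial thread exits, process lives on */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) > 1)
        return -1;
    return WEXITSTATUS(status) == 0;
}

#ifdef SKETCH_MAIN
int main(void)
{
    printf("pid visible after leader exit: %d\n",
           pid_visible_after_leader_exit());
    return 0;
}
#endif
```

A result of 0 would match the ESRCH seen in the OpenSSI trace, while 1 matches the stock-kernel behaviour.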

     
  • John Hughes
    2008-07-04

    Trace of the same test process on a stock non-SSI 2.6.11 kernel.

     
    Attachments
  • Roger Tsang
    2008-09-14

    Looks like waitpid() / kill() missed the threads.

     
  • Roger Tsang
    2008-09-14

    Testing a fix, which also includes a feature enhancement to wait for children of other threads in the group.
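
For reference, "waiting for children of other threads in the group" can be illustrated with a small sketch. This is not the OpenSSI fix, whose code is not shown in this ticket. It only demonstrates the behaviour on a stock Linux kernel, where waitpid() without __WNOTHREAD may reap a child forked by a different thread of the same process. All names are hypothetical, and the exit status 7 is an arbitrary marker.

```c
#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct wait_result {
    pid_t child;    /* PID forked by the initial thread */
    int reaped;     /* nonzero if waitpid() returned that PID */
    int status;     /* wait status filled in by waitpid() */
};

/* Runs in a helper thread: wait for a child that this thread did not fork. */
static void *wait_in_thread(void *arg)
{
    struct wait_result *r = arg;

    r->reaped = (waitpid(r->child, &r->status, 0) == r->child);
    return NULL;
}

/* Returns 1 if the helper thread reaped the initial thread's child,
   0 if it could not, -1 on a setup error. */
int reap_child_of_other_thread(void)
{
    struct wait_result r = { .child = fork() };
    pthread_t t;

    if (r.child < 0)
        return -1;
    if (r.child == 0)
        _exit(7);               /* child: exit at once with a marker status */
    if (pthread_create(&t, NULL, wait_in_thread, &r) != 0) {
        waitpid(r.child, NULL, 0);   /* avoid leaving a zombie behind */
        return -1;
    }
    if (pthread_join(t, NULL) != 0)
        return -1;
    return r.reaped && WIFEXITED(r.status) && WEXITSTATUS(r.status) == 7;
}

#ifdef SKETCH_MAIN
int main(void)
{
    printf("reaped by other thread: %d\n", reap_child_of_other_thread());
    return 0;
}
#endif
```

The initial thread stays alive, blocked in pthread_join(), while the helper waits, so the child really does belong to another live thread of the group at the time of the waitpid() call.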

     
  • Roger Tsang
    2008-09-14

    • milestone: --> v1.9.1
    • assigned_to: nobody --> rogertsang
    • status: open --> open-accepted
     
  • Roger Tsang
    2008-10-02

    • status: open-accepted --> open-fixed
     
  • John Hughes
    2008-10-19

    Works OK for me in current CVS, thanks Roger.

     
  • Roger Tsang
    2008-10-19

    • status: open-fixed --> closed-fixed