
#109 deadlock if 2 xservers start at same time

Milestone: v1.9.1
Status: closed-works-for-me
Owner: nobody
Priority: 5
Updated: 2008-05-24
Created: 2006-02-02
Creator: John Hughes
Private: No

I've got a cluster with multiple desktop machines. If
I boot the whole cluster, letting all the machines try
to start X at about the same time, the X servers hang.

If I boot the machines one at a time then it works.

While X hangs, other processes on the affected nodes
seem to work OK.

The last messages in the Xorg log say it is trying to
"configure int10"; it seems to be talking to the video
BIOS.

strace shows nothing.

ps shows:

$ ps -lyp 264872,330299,658390
S UID    PID   PPID C PRI  NI  RSS   SZ WCHAN  TTY  TIME     CMD
S   0 264872 264788 0  66 -10 5376 2890 ssidev tty7 00:00:00 Xorg
S   0 330299 330294 0  66 -10 5396 2891 ssidev tty7 00:00:00 Xorg
S   0 658390 658308 0  66 -10 5392 2890 ssidev tty7 00:00:00 Xorg

What should I do to try and debug this?

Discussion

  • Roger Tsang

    Roger Tsang - 2006-02-03

    Logged In: YES
    user_id=1246761

    `lsof` these processes then go into kdb to get a backtrace
    of these processes. Turn on logging for kernel debug messages.

     
  • John Hughes

    John Hughes - 2006-02-03

    Logged In: YES
    user_id=166336

    Didn't get a lsof; sorry.

    The backtrace looks like:

    [0]kdb> btp 264846
    Stack traceback for pid 264846
    0xf7dd2130 264846 264803 0 1 S 0xf7dd2310 Xorg
    EBP EIP Function (args)
    0xf6cdfc94 0xc04666f6 schedule+0x386 (0xf6c0e8c0, 0xf6da5914, 0x10000000, 0x0, 0xf6da590c)
    0xf6cdfccc 0xc027d49a tok_wait+0xba (0xf7c20d7c, 0xf6da5900, 0x0, 0x2, 0x4)
    0xf6cdfd4c 0xc029607e cfstok_req+0x20e (0xf7c20e24, 0x1, 0x4, 0x3, 0x0)
    0xf6cdfd88 0xc02862f4 cfs_shared_nopage+0x84 (0xf7dff660, 0x0, 0xf6cdfdcc, 0xc0106290, 0xf6cdf000)
    0xf6cdfddc 0xc015ca6d do_no_page+0xbd (0xf7395c80, 0xf7dff660, 0x449, 0x1, 0xfffc5000)
    0xf6cdfe10 0xc015cf7b handle_mm_fault+0x1bb (0xf7395c80, 0xf7dff660, 0x449, 0x1, 0x1)
    0xf6cdfee8 0xc011f31a do_page_fault+0x23a (0x2, 0x0, 0xa004, 0x5000, 0x0)
    0xc0106383 error_code+0x2b
    Interrupt registers:
    eax = 0x00005003 ebx = 0x00000002 ecx = 0x00000000 edx = 0x0000a004
    esi = 0x00005000 edi = 0x00000000 esp = 0x00000fd4 eip = 0x000014fe
    ebp = 0x00000fdc xss = 0x00000100 xcs = 0x0000c000 eflags = 0x00033213
    xds = 0x00000000 xes = 0x00000000 origeax = 0xffffffff &regs = 0xf6cdfef0
    Interrupt from user space, end of kernel trace
    [0]kdb>

    and

    [0]kdb> btp 330200
    Stack traceback for pid 330200
    0xf6e2a0b0 330200 330148 0 1 S 0xf6e2a290 Xorg
    EBP EIP Function (args)
    0xf73ccd68 0xc04666f6 schedule+0x386 (0xf6c1d580, 0xf790ddd4, 0x10000000, 0x0, 0xf790ddcc)
    0xf73ccda0 0xc027d49a tok_wait+0xba (0xf78d537c, 0xf790ddc0, 0x0, 0x2, 0x5)
    0xf73cce20 0xc029607e cfstok_req+0x20e (0xf78d5424, 0x1, 0x4, 0x3, 0x0)
    0xf73cce5c 0xc02862f4 cfs_shared_nopage+0x84 (0xf6e7e804, 0x0, 0xf73ccea0, 0xf6e33080, 0x1)
    0xf73cceb0 0xc015ca6d do_no_page+0xbd (0xf6f04380, 0xf6e7e804, 0x42, 0x0, 0xfffc5000)
    0xf73ccee4 0xc015cf7b handle_mm_fault+0x1bb (0xf6f04380, 0xf6e7e804, 0x42, 0x0, 0x0)
    0xf73ccfbc 0xc011f31a do_page_fault+0x23a (0xb7aaccf8, 0x8241710, 0xb7aaced4, 0x10, 0x40)
    0xc0106383 error_code+0x2b
    Interrupt registers:
    eax = 0x00000042 ebx = 0xb7aaccf8 ecx = 0x08241710 edx = 0xb7aaced4
    esi = 0x00000010 edi = 0x00000040 esp = 0xbfffedcc eip = 0xb7aa9277
    ebp = 0xbfffede8 xss = 0x0000007b xcs = 0x00000073 eflags = 0x00013202
    xds = 0xb7aa007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf73ccfc4
    Interrupt from user space, end of kernel trace

     
  • Roger Tsang

    Roger Tsang - 2006-02-04

    Logged In: YES
    user_id=1246761

    It seems these two processes are waiting for another process
    to release the token.

    Do call print_inode 0xf7c20e24, the first arg passed to
    cfstok_req. Then call print_dentry and print_file to find
    out what file it is pointing to. It may give you a hint about
    which process to look for.
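
When several nodes produce backtraces like the ones above, collecting the inode addresses by hand gets tedious. The following is only a rough sketch, not part of OpenSSI or kdb: it assumes the frame layout shown in the traces in this ticket (EBP, EIP, function+offset, then the arguments in parentheses), and the helper name `first_args` is made up for illustration.

```python
import re

# Matches a kdb frame line such as
#   0xf6cdfd4c 0xc029607e cfstok_req+0x20e (0xf7c20e24, 0x1,
# and captures the function name and its first argument.
# ASSUMPTION: frames look like the ones pasted in this ticket.
FRAME_RE = re.compile(
    r"^\s*0x[0-9a-f]+\s+0x[0-9a-f]+\s+(\w+)\+0x[0-9a-f]+\s+\((0x[0-9a-f]+)"
)


def first_args(backtrace, function="cfstok_req"):
    """Return the first argument of every frame that calls `function`."""
    found = []
    for line in backtrace.splitlines():
        match = FRAME_RE.match(line)
        if match and match.group(1) == function:
            found.append(match.group(2))
    return found


sample = """\
0xf6cdfccc 0xc027d49a tok_wait+0xba (0xf7c20d7c, 0xf6da5900,
0x0, 0x2, 0x4)
0xf6cdfd4c 0xc029607e cfstok_req+0x20e (0xf7c20e24, 0x1,
0x4, 0x3, 0x0)
"""

print(first_args(sample))  # ['0xf7c20e24']
```

The addresses it returns are the values to pass to print_inode; wrapped continuation lines are ignored because only the first argument is needed.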

     
  • Roger Tsang

    Roger Tsang - 2007-04-11

    Logged In: YES
    user_id=1246761
    Originator: NO

    The original bug report is for SSI-1.9.1. Are you still having a problem in SSI-1.9.2 or later?

     
  • John Hughes

    John Hughes - 2007-04-11

    Logged In: YES
    user_id=166336
    Originator: YES

    Don't know - will try to reproduce bug this AM.

     
  • Roger Tsang

    Roger Tsang - 2007-10-12
    • status: open --> open-out-of-date
     
  • Roger Tsang

    Roger Tsang - 2008-01-02
    • status: open-out-of-date --> open-accepted
     
  • Roger Tsang

    Roger Tsang - 2008-01-02

    Logged In: YES
    user_id=1246761
    Originator: NO

    This is caused by the rc.sysinit.nodeup /tmp cleanup, which is not yet cluster-aware and removes X11 sockets and files, including those of other nodes, because /tmp is global (by default).

    One working solution I have been testing since OPENSSI-FC-2-0-0-PRE1 is to convert /tmp to a local CDSL and have rc.sysinit.nodeup skip the /tmp cleanup if /tmp is not a CDSL. Clustered applications that expect a global shared /tmp are reconfigured to use some other directory, like /cluster/tmp, on the shared filesystem.
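
The guard described above can be sketched roughly as follows. This is an illustration in Python, not the actual rc.sysinit.nodeup change (which is a shell script and is not shown in this ticket). It assumes a CDSL is visible as a symlink whose target contains the literal token `{nodenum}`. Both that check and the function names are assumptions made for the example. The sketch only lists what a cleanup would remove; it deletes nothing.

```python
import glob
import os


def is_cdsl(path):
    # ASSUMPTION: a CDSL shows up as a symlink whose target contains the
    # literal token "{nodenum}". Adjust this check to the real cluster layout.
    return os.path.islink(path) and "{nodenum}" in os.readlink(path)


def x11_cleanup_candidates(tmp="/tmp"):
    """List X11 lock files and sockets under tmp, or nothing if tmp is global.

    When tmp is not node-local, the cleanup is skipped entirely so that one
    node booting cannot remove the X11 files of another node.
    """
    if not is_cdsl(tmp):
        return []
    patterns = [".X*-lock", os.path.join(".X11-unix", "X*")]
    found = []
    for pattern in patterns:
        found.extend(glob.glob(os.path.join(tmp, pattern)))
    return sorted(found)


if __name__ == "__main__":
    for path in x11_cleanup_candidates("/tmp"):
        print("would remove", path)
```

Returning a list instead of deleting keeps the decision (is /tmp node-local?) separate from the destructive step, which makes the guard easy to check on a live cluster before enabling any removal.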

     
  • Roger Tsang

    Roger Tsang - 2008-04-20
    • status: open-accepted --> open-works-for-me
     
  • Roger Tsang

    Roger Tsang - 2008-05-24
    • status: open-works-for-me --> closed-works-for-me
     
