I've got a cluster with multiple desktop machines. If
I boot the whole cluster, letting all the machines try
to start X at about the same time, the X servers hang.
If I boot the machines one at a time, it works.
While X hangs, other processes on the affected nodes
seem to work OK.
The last messages in the Xorg log say it is trying to
"configure int10"; it seems to be talking to the video
BIOS.
strace shows nothing.
ps shows:
$ ps -lyp 264872,330299,658390
S UID    PID   PPID C PRI  NI  RSS   SZ WCHAN  TTY  TIME     CMD
S   0 264872 264788 0  66 -10 5376 2890 ssidev tty7 00:00:00 Xorg
S   0 330299 330294 0  66 -10 5396 2891 ssidev tty7 00:00:00 Xorg
S   0 658390 658308 0  66 -10 5392 2890 ssidev tty7 00:00:00 Xorg
What should I do to try and debug this?
Logged In: YES
user_id=1246761
Run `lsof` on these processes, then go into kdb to get a backtrace
of each of them. Turn on logging for kernel debug messages.
Logged In: YES
user_id=166336
Didn't get a lsof; sorry.
The backtrace looks like:
[0]kdb> btp 264846
Stack traceback for pid 264846
0xf7dd2130 264846 264803 0 1 S 0xf7dd2310 Xorg
EBP EIP Function (args)
0xf6cdfc94 0xc04666f6 schedule+0x386 (0xf6c0e8c0,
0xf6da5914, 0x10000000, 0x0, 0xf6da590c)
0xf6cdfccc 0xc027d49a tok_wait+0xba (0xf7c20d7c, 0xf6da5900,
0x0, 0x2, 0x4)
0xf6cdfd4c 0xc029607e cfstok_req+0x20e (0xf7c20e24, 0x1,
0x4, 0x3, 0x0)
0xf6cdfd88 0xc02862f4 cfs_shared_nopage+0x84 (0xf7dff660,
0x0, 0xf6cdfdcc, 0xc0106290, 0xf6cdf000)
0xf6cdfddc 0xc015ca6d do_no_page+0xbd (0xf7395c80,
0xf7dff660, 0x449, 0x1, 0xfffc5000)
0xf6cdfe10 0xc015cf7b handle_mm_fault+0x1bb (0xf7395c80,
0xf7dff660, 0x449, 0x1, 0x1)
0xf6cdfee8 0xc011f31a do_page_fault+0x23a (0x2, 0x0, 0xa004,
0x5000, 0x0)
0xc0106383 error_code+0x2b
Interrupt registers:
eax = 0x00005003 ebx = 0x00000002 ecx = 0x00000000 edx =
0x0000a004
esi = 0x00005000 edi = 0x00000000 esp = 0x00000fd4 eip =
0x000014fe
ebp = 0x00000fdc xss = 0x00000100 xcs = 0x0000c000 eflags =
0x00033213
xds = 0x00000000 xes = 0x00000000 origeax = 0xffffffff &regs
= 0xf6cdfef0
Interrupt from user space, end of kernel trace
[0]kdb>
and
[0]kdb> btp 330200
Stack traceback for pid 330200
0xf6e2a0b0 330200 330148 0 1 S 0xf6e2a290 Xorg
EBP EIP Function (args)
0xf73ccd68 0xc04666f6 schedule+0x386 (0xf6c1d580,
0xf790ddd4, 0x10000000, 0x0, 0xf790ddcc)
0xf73ccda0 0xc027d49a tok_wait+0xba (0xf78d537c, 0xf790ddc0,
0x0, 0x2, 0x5)
0xf73cce20 0xc029607e cfstok_req+0x20e (0xf78d5424, 0x1,
0x4, 0x3, 0x0)
0xf73cce5c 0xc02862f4 cfs_shared_nopage+0x84 (0xf6e7e804,
0x0, 0xf73ccea0, 0xf6e33080, 0x1)
0xf73cceb0 0xc015ca6d do_no_page+0xbd (0xf6f04380,
0xf6e7e804, 0x42, 0x0, 0xfffc5000)
0xf73ccee4 0xc015cf7b handle_mm_fault+0x1bb (0xf6f04380,
0xf6e7e804, 0x42, 0x0,
0x0)
0xf73ccfbc 0xc011f31a do_page_fault+0x23a (0xb7aaccf8,
0x8241710, 0xb7aaced4, 0x10, 0x40)
0xc0106383 error_code+0x2b
Interrupt registers:
eax = 0x00000042 ebx = 0xb7aaccf8 ecx = 0x08241710 edx =
0xb7aaced4
esi = 0x00000010 edi = 0x00000040 esp = 0xbfffedcc eip =
0xb7aa9277
ebp = 0xbfffede8 xss = 0x0000007b xcs = 0x00000073 eflags =
0x00013202
xds = 0xb7aa007b xes = 0x0000007b origeax = 0xffffffff &regs
= 0xf73ccfc4
Interrupt from user space, end of kernel trace
Logged In: YES
user_id=1246761
It seems these two processes are waiting for another process
to release the token.
Do call print_inode 0xf7c20e24, the first arg passed to
cfstok_req. Then call print_dentry and print_file to find
out what file it is pointing to. That may give you a hint about
which process to look for.
Logged In: YES
user_id=1246761
Originator: NO
The original bug report is for SSI-1.9.1. Are you still having a problem in SSI-1.9.2 or later?
Logged In: YES
user_id=166336
Originator: YES
Don't know - will try to reproduce bug this AM.
Logged In: YES
user_id=1246761
Originator: NO
This is caused by the /tmp cleanup in rc.sysinit.nodeup, which is not yet cluster-aware: because /tmp is global (by default), it removes X11 sockets and files, including those of other nodes.
One working solution I have been testing since OPENSSI-FC-2-0-0-PRE1 is to convert /tmp to a local CDSL and have rc.sysinit.nodeup skip the /tmp cleanup if /tmp is not a CDSL. Clustered applications that expect a global shared /tmp are reconfigured to use some other directory, such as /cluster/tmp on the shared filesystem.
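A rough sketch of the guard described above, for illustration only. It is not the actual rc.sysinit.nodeup code. The function names is_cdsl and clean_x11_tmp are hypothetical, and the CDSL test (a symlink whose target contains the literal string "{nodenum}") is an assumption about how CDSLs appear on disk; adjust it to whatever check your installation supports.

```shell
#!/bin/sh
# Hypothetical helper: treat a path as a CDSL if it is a symlink whose
# target contains the literal string "{nodenum}". This detection rule is
# an assumption made for the sketch.
is_cdsl() {
    [ -L "$1" ] || return 1
    case "$(readlink "$1")" in
        *'{nodenum}'*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical cleanup step: remove stale X11 locks and sockets under the
# given directory only when it is node-local (a CDSL). A shared /tmp is
# left alone so that other nodes' X servers keep their sockets.
clean_x11_tmp() {
    tmpdir=$1
    if is_cdsl "$tmpdir"; then
        rm -f "$tmpdir"/.X*-lock
        rm -rf "$tmpdir"/.X11-unix
    else
        echo "skipping X11 cleanup: $tmpdir is shared across nodes" >&2
    fi
}
```

A boot script would then call clean_x11_tmp /tmp where it previously removed the files unconditionally.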