OpenSSI Clusters for Linux / Bugs / #109 deadlock if 2 xservers start at same time

Roger Tsang - 2006-02-03

Logged In: YES
user_id=1246761

Run `lsof` on these processes, then go into kdb and get a
backtrace of each one. Also turn on logging for kernel debug messages.

John Hughes - 2006-02-03

Logged In: YES
user_id=166336

Didn't get a lsof; sorry.

The backtrace looks like:

[0]kdb> btp 264846
Stack traceback for pid 264846
0xf7dd2130 264846 264803 0 1 S 0xf7dd2310 Xorg
EBP EIP Function (args)
0xf6cdfc94 0xc04666f6 schedule+0x386 (0xf6c0e8c0, 0xf6da5914, 0x10000000, 0x0, 0xf6da590c)
0xf6cdfccc 0xc027d49a tok_wait+0xba (0xf7c20d7c, 0xf6da5900, 0x0, 0x2, 0x4)
0xf6cdfd4c 0xc029607e cfstok_req+0x20e (0xf7c20e24, 0x1, 0x4, 0x3, 0x0)
0xf6cdfd88 0xc02862f4 cfs_shared_nopage+0x84 (0xf7dff660, 0x0, 0xf6cdfdcc, 0xc0106290, 0xf6cdf000)
0xf6cdfddc 0xc015ca6d do_no_page+0xbd (0xf7395c80, 0xf7dff660, 0x449, 0x1, 0xfffc5000)
0xf6cdfe10 0xc015cf7b handle_mm_fault+0x1bb (0xf7395c80, 0xf7dff660, 0x449, 0x1, 0x1)
0xf6cdfee8 0xc011f31a do_page_fault+0x23a (0x2, 0x0, 0xa004, 0x5000, 0x0)
0xc0106383 error_code+0x2b
Interrupt registers:
eax = 0x00005003 ebx = 0x00000002 ecx = 0x00000000 edx = 0x0000a004
esi = 0x00005000 edi = 0x00000000 esp = 0x00000fd4 eip = 0x000014fe
ebp = 0x00000fdc xss = 0x00000100 xcs = 0x0000c000 eflags = 0x00033213
xds = 0x00000000 xes = 0x00000000 origeax = 0xffffffff &regs = 0xf6cdfef0
Interrupt from user space, end of kernel trace
[0]kdb>

and

[0]kdb> btp 330200
Stack traceback for pid 330200
0xf6e2a0b0 330200 330148 0 1 S 0xf6e2a290 Xorg
EBP EIP Function (args)
0xf73ccd68 0xc04666f6 schedule+0x386 (0xf6c1d580, 0xf790ddd4, 0x10000000, 0x0, 0xf790ddcc)
0xf73ccda0 0xc027d49a tok_wait+0xba (0xf78d537c, 0xf790ddc0, 0x0, 0x2, 0x5)
0xf73cce20 0xc029607e cfstok_req+0x20e (0xf78d5424, 0x1, 0x4, 0x3, 0x0)
0xf73cce5c 0xc02862f4 cfs_shared_nopage+0x84 (0xf6e7e804, 0x0, 0xf73ccea0, 0xf6e33080, 0x1)
0xf73cceb0 0xc015ca6d do_no_page+0xbd (0xf6f04380, 0xf6e7e804, 0x42, 0x0, 0xfffc5000)
0xf73ccee4 0xc015cf7b handle_mm_fault+0x1bb (0xf6f04380, 0xf6e7e804, 0x42, 0x0, 0x0)
0xf73ccfbc 0xc011f31a do_page_fault+0x23a (0xb7aaccf8, 0x8241710, 0xb7aaced4, 0x10, 0x40)
0xc0106383 error_code+0x2b
Interrupt registers:
eax = 0x00000042 ebx = 0xb7aaccf8 ecx = 0x08241710 edx = 0xb7aaced4
esi = 0x00000010 edi = 0x00000040 esp = 0xbfffedcc eip = 0xb7aa9277
ebp = 0xbfffede8 xss = 0x0000007b xcs = 0x00000073 eflags = 0x00013202
xds = 0xb7aa007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf73ccfc4
Interrupt from user space, end of kernel trace

Roger Tsang - 2006-02-04

Logged In: YES
user_id=1246761

It seems these two processes are waiting for another process
to release the token.

Do call print_inode 0xf7c20e24 (the first argument passed to
cfstok_req). Then call print_dentry and print_file to find
out what file it points to. That may give you a hint about
which process to look for.

Roger Tsang - 2007-04-11

Logged In: YES
user_id=1246761
Originator: NO

The original bug report is for SSI-1.9.1. Are you still having a problem in SSI-1.9.2 or later?

John Hughes - 2007-04-11

Logged In: YES
user_id=166336
Originator: YES

Don't know - will try to reproduce bug this AM.

Roger Tsang - 2007-10-12

status: open --> open-out-of-date

Roger Tsang - 2008-01-02

status: open-out-of-date --> open-accepted

Roger Tsang - 2008-01-02

Logged In: YES
user_id=1246761
Originator: NO

This happens because the /tmp cleanup in rc.sysinit.nodeup is not yet cluster-aware: /tmp is global by default, so the cleanup removes X11 sockets and files, including those belonging to other nodes.

One working solution I have been testing since OPENSSI-FC-2-0-0-PRE1 is to convert /tmp to a local CDSL and have rc.sysinit.nodeup skip the /tmp cleanup if /tmp is not a CDSL. Clustered applications that expect a globally shared /tmp are reconfigured to use some other directory, such as /cluster/tmp, on the shared filesystem.
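
To illustrate the guard described above, here is a rough sketch in Python of the "only clean /tmp when it is node-local" logic. It is not the actual rc.sysinit.nodeup code, which is a shell script. The function names is_node_local and clean_tmp are made up for this example, and "path is a symlink" is used as a stand-in for "path is a CDSL"; a real check would need to use whatever OpenSSI provides for recognizing a CDSL.

```python
import os
import shutil


def is_node_local(path):
    """Hypothetical CDSL check.

    Assumption: a node-local /tmp shows up as a symlink. This is only a
    stand-in for a real CDSL test.
    """
    return os.path.islink(path)


def clean_tmp(tmp_path):
    """Remove everything under tmp_path, but only if it is node-local.

    Returns the sorted names of the removed entries, or an empty list
    when the cleanup was skipped.
    """
    if not is_node_local(tmp_path):
        # Shared /tmp: it may hold X11 sockets and lock files that
        # belong to other nodes, so leave it alone.
        return []

    removed = []
    for name in sorted(os.listdir(tmp_path)):
        entry = os.path.join(tmp_path, name)
        if os.path.isdir(entry) and not os.path.islink(entry):
            shutil.rmtree(entry)   # real directory, e.g. .X11-unix
        else:
            os.unlink(entry)       # regular file, socket or symlink
        removed.append(name)
    return removed
```

With this guard, calling clean_tmp on a plain shared directory returns an empty list and leaves files such as .X0-lock in place, while calling it on a symlink to a per-node directory empties only that node's directory.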

Roger Tsang - 2008-04-20

status: open-accepted --> open-works-for-me

Roger Tsang - 2008-05-24

status: open-works-for-me --> closed-works-for-me