OpenSSI Clusters for Linux / Bugs / #154 kernel BUG @ ipc/semc:1931

John Hughes - 2008-04-14

Logged In: YES
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array *sma)
{
int nsems, i;
struct sem_semundo *un, **unp;

BUG_ON(sem_checkid(sma,semid));

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Roger Tsang - 2008-04-15

Logged In: YES
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock(). Try attached patch.
File Added: ipc_sem.c.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Roger Tsang - 2008-04-15

ipc_sem.c.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Nobody/Anonymous - 2008-04-15

Logged In: NO

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
continue;
}

- __ssi_semexit(semid, current->tgid, sma);
+ __ssi_semexit(u->semid, current->tgid, sma);
}
}

so if semid has been changed to be bad, or if u->semid was bad and has changed to be good we won't panic. I can't see how either of these conditions can happen.

Also the call path that seems to be causing problems seems to be (from the trace above):

[client node]
exit_sem
cli_ripc_semexit

[server node]
[...]
svr_ripc_semexit
ripc_semexit
ssi_semexit
__ssi_semexit

and your patch touches the

exit_sem
__ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid doesn't match? Couldn't it just indicate a sem_exit/IPC_RMID collision?

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-04-16

Logged In: YES
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU: 0
EIP: 0060:[<c01d3f29>] Not tainted VLI
EFLAGS: 00210202 (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001 ebx: c597e808 ecx: 00000001 edx: c597e808
esi: 000e800c edi: cbf682e0 ebp: d76fce6c esp: d76fcdd0
ds: 007b es: 007b ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80 d1e12ee4
d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4 defe7380
0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c c0125456
Call Trace:
[<c010694f>] show_stack+0x7f/0xa0
[<c0106b04>] show_registers+0x164/0x220
[<c0106e94>] die+0xf4/0x1c0
[<c0107015>] do_trap+0xb5/0xc0
[<c01072cc>] do_invalid_op+0xbc/0xd0
[<c01065a3>] error_code+0x2b/0x30
[<c012a319>] do_exit+0xb9/0x3b0
[<c012a68c>] do_group_exit+0x3c/0xb0
[<c01350cf>] get_signal_to_deliver+0x1ff/0x310
[<c01057c4>] do_signal+0x74/0x140
[<c0105917>] do_notify_resume+0x87/0x8c
[<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89 1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07 8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.

Related

Bugs: #1

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-04-17

Logged In: YES
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of the problem?

It's trying to unlock a lock that isn't locked.

------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU: 0
EIP: 0060:[<c046290b>] Not tainted VLI
EFLAGS: 00010202 (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001 ebx: c0750140 ecx: c0750101 edx: f7e12e08
esi: f70c2400 edi: c0753360 ebp: f7032f10 esp: f7032f10
ds: 007b es: 007b ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44 c0261dd5
02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708 00000003
f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002 f7032fec
Call Trace:
[<c010694f>] show_stack+0x7f/0xa0
[<c0106b04>] show_registers+0x164/0x220
[<c0106e94>] die+0xf4/0x1c0
[<c0107015>] do_trap+0xb5/0xc0
[<c01072cc>] do_invalid_op+0xbc/0xd0
[<c01065a3>] error_code+0x2b/0x30
[<c01cecbb>] ipc_unlock+0xb/0x10
[<c01ce77e>] ipc_drop_locks+0x1e/0x40
[<c0261dd5>] ripc_drop_locks+0x45/0x60
[<c0258708>] svr_ripc_drop_locks+0x58/0xb0
[<c020abb3>] icssvr_daemon+0x2f3/0xab0
[<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78 04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00 1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90

Related

Bugs: #1

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-04-17

Logged In: YES
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's probably a different bug.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

Roger Tsang - 2008-04-17

Logged In: YES
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest binary release? Apache uses IPC semaphores and have not run into this bug on UP/SMP.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-04-21

Logged In: YES
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add any new info when/if I find it.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-05-30

Logged In: YES
user_id=166336
Originator: YES

Well well well. I can now reproduce this bug - launch totem (gnome movie player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address 00000004
printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP
Modules linked in: radeon button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU: 1
EIP: 0060:[<c01d4b6d>] Not tainted VLI
EFLAGS: 00210246 (2.6.10-ssi-1.9.2-jh-3)
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188 ebx: 00060b37 ecx: 00000000 edx: f665d2c0
esi: f5005188 edi: f6b48b80 ebp: f6d5fdcc esp: f6d5fdb0
ds: 007b es: 007b ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80 f6d5fe5c
c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000 f6d5fe0c
f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80 f6d5fe48
Call Trace:
[<c010671f>] show_stack+0x7f/0xa0
[<c01068c4>] show_registers+0x164/0x230
[<c0106c74>] die+0xf4/0x1c0
[<c011f56d>] do_page_fault+0x48d/0x689
[<c0106383>] error_code+0x2b/0x30
[<c01d474e>] exit_sem+0x15e/0x190
[<c012a619>] do_exit+0x159/0x4f0
[<c012aa7a>] do_group_exit+0x3a/0xc0
[<c0135163>] get_signal_to_deliver+0x233/0x360
[<c0105590>] do_signal+0x70/0x150
[<c01056c7>] do_notify_resume+0x57/0x8c
[<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d 50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7 04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34

Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250 396087 1 1 1 R 0xf7e21430 *totem
EBP EIP Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88, 0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84, 0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000, 0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8, 0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010, 0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
0xc0105866 work_notifysig+0x13
[1]kdb>

Related

Bugs: #1

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-06-02

Logged In: YES
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash: Totem makes a semaphore, ups and downs it a few times; then removes it and recreates it; carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the sequence is now that of the 2nd one.

Heres the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
... so we ask the nameserver for it'd ID

ipcname_getid newid=360448, create=1
... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
... create is now done

semctl_down: IPC_RMID 360448
... now totem deletes the semaphore

freeary id=360448

cli_ipcname_rmid id=360448 service=1
... so we inform the ipc nameserver

sys_semget: key=56a4d5 nsems=1 flags=3b0
... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
... we ask for it's ID

ipcname_getid newid=393216, create=1
... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
... create done

ipc_checkid: 360448 / 32768 != 12
... later on totem exits, so we try to perform the UNDO actions, but we've got the wrong sequence.

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]

Related

Bugs: #1

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-06-02

Logged In: YES
user_id=166336
Originator: YES

Can it be as simple as that? Look at the code in freeary:

#ifdef CONFIG_SSI
for (un = sma->undo; un;) {
u = un;
un = u->id_next;
kfree(u);
}
#else

sma->undo is left pointing to free'd memory.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-06-02

Logged In: YES
user_id=166336
Originator: YES

Nah, that's not the bug - sma gets freed before freeary returns, so who cares if it has dangling pointers.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-06-03

Logged In: YES
user_id=166336
Originator: YES

Ok, here's the bug:

Someone creates a semaphore

Process (A) operates on it, creating a sem_undo structure for itself (on it's own node) and a sem_semundo for the semaphore (on the semaphore's node).

Process (B), (where A !=B) removes the semaphore, cleaning up the sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.

Someone creates a new semaphore that happens to get the same index (but a different sequence) from the original semaphore.
Process (A) exits - when we try to clean up its sem_undo structure sem_checkid fails because the sequence numbers don't match.

Attached test program that crashes the system.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-06-03

Crash node with semaphore fun

semcrash.c

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-06-03

Logged In: YES
user_id=166336
Originator: YES

File Added: semcrash.c

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-06-03

Logged In: YES
user_id=166336
Originator: YES

(of course in my message below I meant:

"Process (B), (where A != B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_undo for process A."

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-06-04

Logged In: YES
user_id=166336
Originator: YES

Ok, the patch is pretty simple - if sem_checkid fails it's not a bug, it's just a stale undo operation, which can be ignored.

(This will screw up if the semaphore sequence number wraps around during the lifetime of a process that uses semaphores that are removed and recreated, but the sequence number can go up to 2^(32-15) so I doubt that will happen often).

File Added: sem.c.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-06-04

Patch to fix crash

sem.c.patch

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-06-04

Logged In: YES
user_id=166336
Originator: YES

Actually it doesn't matter if wrapping sequence numbers give us a collision - the proc sem_undo only gets done if the semaphore has a matching sem_semundo.

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-10-19

Fixed in cvs

If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

John Hughes - 2008-10-19

assigned_to: nobody --> hughesj

status: open --> closed-fixed
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:

kernel BUG @ ipc/semc:1931

Group

Searches

Help

#154 kernel BUG @ ipc/semc:1931

Related

Discussion

Related

Related

Related

Related