Thread: [SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

ssic-linux-devel

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-04-14 07:57:11

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-04-14 08:24:13

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-04-15 01:16:35

Bugs item #1941808, was opened at 2008-04-14 03:57
Message generated for change (Comment added) made by rogertsang
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: Roger Tsang (rogertsang)
Date: 2008-04-14 21:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 04:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-04-15 13:56:09

Bugs item #1941808, was opened at 2008-04-14 00:57
Message generated for change (Comment added) made by nobody
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 06:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-14 18:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 01:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-04-16 11:22:59

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-04-17 08:09:00

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-04-17 10:21:33

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-04-17 10:56:09

Bugs item #1941808, was opened at 2008-04-14 03:57
Message generated for change (Comment added) made by rogertsang
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 06:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 06:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 04:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 07:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 09:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-14 21:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 04:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-04-21 13:38:20

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-05-30 14:30:25

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well well well.  I can now reproduce this bug - launch totem (gnome movie
player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: radeon button ac battery parport_pc parport pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random
ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU:    1
EIP:    0060:[<c01d4b6d>]    Not tainted VLI
EFLAGS: 00210246   (2.6.10-ssi-1.9.2-jh-3) 
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188   ebx: 00060b37   ecx: 00000000   edx: f665d2c0
esi: f5005188   edi: f6b48b80   ebp: f6d5fdcc   esp: f6d5fdb0
ds: 007b   es: 007b   ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80
f6d5fe5c 
       c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000
f6d5fe0c 
       f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80
f6d5fe48 
Call Trace:
 [<c010671f>] show_stack+0x7f/0xa0
 [<c01068c4>] show_registers+0x164/0x230
 [<c0106c74>] die+0xf4/0x1c0
 [<c011f56d>] do_page_fault+0x48d/0x689
 [<c0106383>] error_code+0x2b/0x30
 [<c01d474e>] exit_sem+0x15e/0x190
 [<c012a619>] do_exit+0x159/0x4f0
 [<c012aa7a>] do_group_exit+0x3a/0xc0
 [<c0135163>] get_signal_to_deliver+0x233/0x360
 [<c0105590>] do_signal+0x70/0x150
 [<c01056c7>] do_notify_resume+0x57/0x8c
 [<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d
50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7
04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 
 
Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d 
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250   396087        1  1    1   R  0xf7e21430 *totem
EBP        EIP        Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88,
0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84,
0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000,
0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8,
0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010,
0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
           0xc0105866 work_notifysig+0x13
[1]kdb> 


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-06-02 15:30:05

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:  Totem makes a
semaphore, ups and downs it a few times; then removes it and recreates it;
carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Heres the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... so we ask the nameserver for it'd ID

ipcname_getid newid=360448, create=1
   ... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
   ... create is now done


semctl_down: IPC_RMID 360448
   ... now totem deletes the semaphore

freeary id=360448

cli_ipcname_rmid id=360448 service=1
    ... so we inform the ipc nameserver


sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... we ask for it's ID

ipcname_getid newid=393216, create=1
   ... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
    ... create done


ipc_checkid: 360448 / 32768 != 12
    ... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.


------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well well well.  I can now reproduce this bug - launch totem (gnome movie
player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: radeon button ac battery parport_pc parport pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random
ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU:    1
EIP:    0060:[<c01d4b6d>]    Not tainted VLI
EFLAGS: 00210246   (2.6.10-ssi-1.9.2-jh-3) 
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188   ebx: 00060b37   ecx: 00000000   edx: f665d2c0
esi: f5005188   edi: f6b48b80   ebp: f6d5fdcc   esp: f6d5fdb0
ds: 007b   es: 007b   ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80
f6d5fe5c 
       c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000
f6d5fe0c 
       f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80
f6d5fe48 
Call Trace:
 [<c010671f>] show_stack+0x7f/0xa0
 [<c01068c4>] show_registers+0x164/0x230
 [<c0106c74>] die+0xf4/0x1c0
 [<c011f56d>] do_page_fault+0x48d/0x689
 [<c0106383>] error_code+0x2b/0x30
 [<c01d474e>] exit_sem+0x15e/0x190
 [<c012a619>] do_exit+0x159/0x4f0
 [<c012aa7a>] do_group_exit+0x3a/0xc0
 [<c0135163>] get_signal_to_deliver+0x233/0x360
 [<c0105590>] do_signal+0x70/0x150
 [<c01056c7>] do_notify_resume+0x57/0x8c
 [<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d
50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7
04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 
 
Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d 
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250   396087        1  1    1   R  0xf7e21430 *totem
EBP        EIP        Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88,
0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84,
0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000,
0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8,
0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010,
0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
           0xc0105866 work_notifysig+0x13
[1]kdb> 


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-06-02 15:33:32

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:33

Message:
Logged In: YES 
user_id=166336
Originator: YES

Can it be as simple as that?  Look at the code in freeary:

#ifdef CONFIG_SSI
        for (un = sma->undo; un;) {
                u = un;
                un = u->id_next;
                kfree(u);
        }
#else

sma->undo is left pointing to free'd memory.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:  Totem makes a
semaphore, ups and downs it a few times; then removes it and recreates it;
carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Heres the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... so we ask the nameserver for it'd ID

ipcname_getid newid=360448, create=1
   ... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
   ... create is now done


semctl_down: IPC_RMID 360448
   ... now totem deletes the semaphore

freeary id=360448

cli_ipcname_rmid id=360448 service=1
    ... so we inform the ipc nameserver


sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... we ask for it's ID

ipcname_getid newid=393216, create=1
   ... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
    ... create done


ipc_checkid: 360448 / 32768 != 12
    ... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.


------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well well well.  I can now reproduce this bug - launch totem (gnome movie
player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: radeon button ac battery parport_pc parport pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random
ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU:    1
EIP:    0060:[<c01d4b6d>]    Not tainted VLI
EFLAGS: 00210246   (2.6.10-ssi-1.9.2-jh-3) 
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188   ebx: 00060b37   ecx: 00000000   edx: f665d2c0
esi: f5005188   edi: f6b48b80   ebp: f6d5fdcc   esp: f6d5fdb0
ds: 007b   es: 007b   ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80
f6d5fe5c 
       c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000
f6d5fe0c 
       f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80
f6d5fe48 
Call Trace:
 [<c010671f>] show_stack+0x7f/0xa0
 [<c01068c4>] show_registers+0x164/0x230
 [<c0106c74>] die+0xf4/0x1c0
 [<c011f56d>] do_page_fault+0x48d/0x689
 [<c0106383>] error_code+0x2b/0x30
 [<c01d474e>] exit_sem+0x15e/0x190
 [<c012a619>] do_exit+0x159/0x4f0
 [<c012aa7a>] do_group_exit+0x3a/0xc0
 [<c0135163>] get_signal_to_deliver+0x233/0x360
 [<c0105590>] do_signal+0x70/0x150
 [<c01056c7>] do_notify_resume+0x57/0x8c
 [<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d
50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7
04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 
 
Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d 
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250   396087        1  1    1   R  0xf7e21430 *totem
EBP        EIP        Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88,
0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84,
0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000,
0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8,
0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010,
0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
           0xc0105866 work_notifysig+0x13
[1]kdb> 


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-06-02 16:09:52

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-06-02 18:09

Message:
Logged In: YES 
user_id=166336
Originator: YES

Nah, that's not the bug - sma gets freed before freeary returns, so who
cares if it has dangling pointers.




----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:33

Message:
Logged In: YES 
user_id=166336
Originator: YES

Can it be as simple as that?  Look at the code in freeary:

#ifdef CONFIG_SSI
        for (un = sma->undo; un;) {
                u = un;
                un = u->id_next;
                kfree(u);
        }
#else

sma->undo is left pointing to free'd memory.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:  Totem makes a
semaphore, ups and downs it a few times; then removes it and recreates it;
carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Heres the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... so we ask the nameserver for it'd ID

ipcname_getid newid=360448, create=1
   ... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
   ... create is now done


semctl_down: IPC_RMID 360448
   ... now totem deletes the semaphore

freeary id=360448

cli_ipcname_rmid id=360448 service=1
    ... so we inform the ipc nameserver


sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... we ask for it's ID

ipcname_getid newid=393216, create=1
   ... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
    ... create done


ipc_checkid: 360448 / 32768 != 12
    ... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.


------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well well well.  I can now reproduce this bug - launch totem (gnome movie
player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: radeon button ac battery parport_pc parport pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random
ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU:    1
EIP:    0060:[<c01d4b6d>]    Not tainted VLI
EFLAGS: 00210246   (2.6.10-ssi-1.9.2-jh-3) 
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188   ebx: 00060b37   ecx: 00000000   edx: f665d2c0
esi: f5005188   edi: f6b48b80   ebp: f6d5fdcc   esp: f6d5fdb0
ds: 007b   es: 007b   ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80
f6d5fe5c 
       c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000
f6d5fe0c 
       f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80
f6d5fe48 
Call Trace:
 [<c010671f>] show_stack+0x7f/0xa0
 [<c01068c4>] show_registers+0x164/0x230
 [<c0106c74>] die+0xf4/0x1c0
 [<c011f56d>] do_page_fault+0x48d/0x689
 [<c0106383>] error_code+0x2b/0x30
 [<c01d474e>] exit_sem+0x15e/0x190
 [<c012a619>] do_exit+0x159/0x4f0
 [<c012aa7a>] do_group_exit+0x3a/0xc0
 [<c0135163>] get_signal_to_deliver+0x233/0x360
 [<c0105590>] do_signal+0x70/0x150
 [<c01056c7>] do_notify_resume+0x57/0x8c
 [<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d
50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7
04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 
 
Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d 
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250   396087        1  1    1   R  0xf7e21430 *totem
EBP        EIP        Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88,
0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84,
0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000,
0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8,
0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010,
0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
           0xc0105866 work_notifysig+0x13
[1]kdb> 


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-06-03 10:55:36

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-06-03 12:55

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, here's the bug:

Someone creates a semaphore

Process (A) operates on it, creating a sem_undo structure for itself (on
it's own node) and a sem_semundo for the semaphore (on the semaphore's
node).

Process (B), (where A !=B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.

Someone creates a new semaphore that happens to get the same index (but a
different sequence) from the original semaphore.
Process (A) exits - when we try to clean up its sem_undo structure
sem_checkid fails because the sequence numbers don't match.

Attached test program that crashes the system.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 18:09

Message:
Logged In: YES 
user_id=166336
Originator: YES

Nah, that's not the bug - sma gets freed before freeary returns, so who
cares if it has dangling pointers.




----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:33

Message:
Logged In: YES 
user_id=166336
Originator: YES

Can it be as simple as that?  Look at the code in freeary:

#ifdef CONFIG_SSI
        for (un = sma->undo; un;) {
                u = un;
                un = u->id_next;
                kfree(u);
        }
#else

sma->undo is left pointing to free'd memory.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:  Totem makes a
semaphore, ups and downs it a few times; then removes it and recreates it;
carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Heres the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... so we ask the nameserver for it'd ID

ipcname_getid newid=360448, create=1
   ... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
   ... create is now done


semctl_down: IPC_RMID 360448
   ... now totem deletes the semaphore

freeary id=360448

cli_ipcname_rmid id=360448 service=1
    ... so we inform the ipc nameserver


sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... we ask for it's ID

ipcname_getid newid=393216, create=1
   ... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
    ... create done


ipc_checkid: 360448 / 32768 != 12
    ... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.


------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well well well.  I can now reproduce this bug - launch totem (gnome movie
player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: radeon button ac battery parport_pc parport pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random
ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU:    1
EIP:    0060:[<c01d4b6d>]    Not tainted VLI
EFLAGS: 00210246   (2.6.10-ssi-1.9.2-jh-3) 
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188   ebx: 00060b37   ecx: 00000000   edx: f665d2c0
esi: f5005188   edi: f6b48b80   ebp: f6d5fdcc   esp: f6d5fdb0
ds: 007b   es: 007b   ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80
f6d5fe5c 
       c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000
f6d5fe0c 
       f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80
f6d5fe48 
Call Trace:
 [<c010671f>] show_stack+0x7f/0xa0
 [<c01068c4>] show_registers+0x164/0x230
 [<c0106c74>] die+0xf4/0x1c0
 [<c011f56d>] do_page_fault+0x48d/0x689
 [<c0106383>] error_code+0x2b/0x30
 [<c01d474e>] exit_sem+0x15e/0x190
 [<c012a619>] do_exit+0x159/0x4f0
 [<c012aa7a>] do_group_exit+0x3a/0xc0
 [<c0135163>] get_signal_to_deliver+0x233/0x360
 [<c0105590>] do_signal+0x70/0x150
 [<c01056c7>] do_notify_resume+0x57/0x8c
 [<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d
50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7
04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 
 
Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d 
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250   396087        1  1    1   R  0xf7e21430 *totem
EBP        EIP        Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88,
0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84,
0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000,
0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8,
0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010,
0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
           0xc0105866 work_notifysig+0x13
[1]kdb> 


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-06-03 11:04:02

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:04

Message:
Logged In: YES 
user_id=166336
Originator: YES

File Added: semcrash.c

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 12:55

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, here's the bug:

Someone creates a semaphore

Process (A) operates on it, creating a sem_undo structure for itself (on
it's own node) and a sem_semundo for the semaphore (on the semaphore's
node).

Process (B), (where A !=B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.

Someone creates a new semaphore that happens to get the same index (but a
different sequence) from the original semaphore.
Process (A) exits - when we try to clean up its sem_undo structure
sem_checkid fails because the sequence numbers don't match.

Attached test program that crashes the system.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 18:09

Message:
Logged In: YES 
user_id=166336
Originator: YES

Nah, that's not the bug - sma gets freed before freeary returns, so who
cares if it has dangling pointers.




----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:33

Message:
Logged In: YES 
user_id=166336
Originator: YES

Can it be as simple as that?  Look at the code in freeary:

#ifdef CONFIG_SSI
        for (un = sma->undo; un;) {
                u = un;
                un = u->id_next;
                kfree(u);
        }
#else

sma->undo is left pointing to free'd memory.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:  Totem makes a
semaphore, ups and downs it a few times; then removes it and recreates it;
carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Heres the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... so we ask the nameserver for it'd ID

ipcname_getid newid=360448, create=1
   ... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
   ... create is now done


semctl_down: IPC_RMID 360448
   ... now totem deletes the semaphore

freeary id=360448

cli_ipcname_rmid id=360448 service=1
    ... so we inform the ipc nameserver


sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... we ask for it's ID

ipcname_getid newid=393216, create=1
   ... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
    ... create done


ipc_checkid: 360448 / 32768 != 12
    ... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.


------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well well well.  I can now reproduce this bug - launch totem (gnome movie
player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: radeon button ac battery parport_pc parport pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random
ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU:    1
EIP:    0060:[<c01d4b6d>]    Not tainted VLI
EFLAGS: 00210246   (2.6.10-ssi-1.9.2-jh-3) 
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188   ebx: 00060b37   ecx: 00000000   edx: f665d2c0
esi: f5005188   edi: f6b48b80   ebp: f6d5fdcc   esp: f6d5fdb0
ds: 007b   es: 007b   ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80
f6d5fe5c 
       c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000
f6d5fe0c 
       f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80
f6d5fe48 
Call Trace:
 [<c010671f>] show_stack+0x7f/0xa0
 [<c01068c4>] show_registers+0x164/0x230
 [<c0106c74>] die+0xf4/0x1c0
 [<c011f56d>] do_page_fault+0x48d/0x689
 [<c0106383>] error_code+0x2b/0x30
 [<c01d474e>] exit_sem+0x15e/0x190
 [<c012a619>] do_exit+0x159/0x4f0
 [<c012aa7a>] do_group_exit+0x3a/0xc0
 [<c0135163>] get_signal_to_deliver+0x233/0x360
 [<c0105590>] do_signal+0x70/0x150
 [<c01056c7>] do_notify_resume+0x57/0x8c
 [<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d
50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7
04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 
 
Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d 
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250   396087        1  1    1   R  0xf7e21430 *totem
EBP        EIP        Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88,
0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84,
0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000,
0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8,
0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010,
0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
           0xc0105866 work_notifysig+0x13
[1]kdb> 


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-06-03 11:41:23

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:41

Message:
Logged In: YES 
user_id=166336
Originator: YES

(of course in my message below I meant:

"Process (B), (where A != B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_undo for process A."

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:04

Message:
Logged In: YES 
user_id=166336
Originator: YES

File Added: semcrash.c

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 12:55

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, here's the bug:

Someone creates a semaphore

Process (A) operates on it, creating a sem_undo structure for itself (on
it's own node) and a sem_semundo for the semaphore (on the semaphore's
node).

Process (B), (where A !=B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.

Someone creates a new semaphore that happens to get the same index (but a
different sequence) from the original semaphore.
Process (A) exits - when we try to clean up its sem_undo structure
sem_checkid fails because the sequence numbers don't match.

Attached test program that crashes the system.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 18:09

Message:
Logged In: YES 
user_id=166336
Originator: YES

Nah, that's not the bug - sma gets freed before freeary returns, so who
cares if it has dangling pointers.




----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:33

Message:
Logged In: YES 
user_id=166336
Originator: YES

Can it be as simple as that?  Look at the code in freeary:

#ifdef CONFIG_SSI
        for (un = sma->undo; un;) {
                u = un;
                un = u->id_next;
                kfree(u);
        }
#else

sma->undo is left pointing to free'd memory.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:  Totem makes a
semaphore, ups and downs it a few times; then removes it and recreates it;
carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Heres the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... so we ask the nameserver for it'd ID

ipcname_getid newid=360448, create=1
   ... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
   ... create is now done


semctl_down: IPC_RMID 360448
   ... now totem deletes the semaphore

freeary id=360448

cli_ipcname_rmid id=360448 service=1
    ... so we inform the ipc nameserver


sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... we ask for it's ID

ipcname_getid newid=393216, create=1
   ... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
    ... create done


ipc_checkid: 360448 / 32768 != 12
    ... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.


------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well well well.  I can now reproduce this bug - launch totem (gnome movie
player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: radeon button ac battery parport_pc parport pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random
ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU:    1
EIP:    0060:[<c01d4b6d>]    Not tainted VLI
EFLAGS: 00210246   (2.6.10-ssi-1.9.2-jh-3) 
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188   ebx: 00060b37   ecx: 00000000   edx: f665d2c0
esi: f5005188   edi: f6b48b80   ebp: f6d5fdcc   esp: f6d5fdb0
ds: 007b   es: 007b   ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80
f6d5fe5c 
       c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000
f6d5fe0c 
       f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80
f6d5fe48 
Call Trace:
 [<c010671f>] show_stack+0x7f/0xa0
 [<c01068c4>] show_registers+0x164/0x230
 [<c0106c74>] die+0xf4/0x1c0
 [<c011f56d>] do_page_fault+0x48d/0x689
 [<c0106383>] error_code+0x2b/0x30
 [<c01d474e>] exit_sem+0x15e/0x190
 [<c012a619>] do_exit+0x159/0x4f0
 [<c012aa7a>] do_group_exit+0x3a/0xc0
 [<c0135163>] get_signal_to_deliver+0x233/0x360
 [<c0105590>] do_signal+0x70/0x150
 [<c01056c7>] do_notify_resume+0x57/0x8c
 [<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d
50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7
04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 
 
Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d 
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250   396087        1  1    1   R  0xf7e21430 *totem
EBP        EIP        Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88,
0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84,
0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000,
0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8,
0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010,
0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
           0xc0105866 work_notifysig+0x13
[1]kdb> 


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-06-04 11:47:54

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-06-04 13:47

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, the patch is pretty simple - if sem_checkid fails it's not a bug, it's
just a stale undo operation, which can be ignored.

(This will screw up if the semaphore sequence number wraps around during
the lifetime of a process that uses semaphores that are removed and
recreated, but the sequence number can go up to 2^(32-15) so I doubt that
will happen often).

File Added: sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:41

Message:
Logged In: YES 
user_id=166336
Originator: YES

(of course in my message below I meant:

"Process (B), (where A != B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_undo for process A."

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:04

Message:
Logged In: YES 
user_id=166336
Originator: YES

File Added: semcrash.c

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 12:55

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, here's the bug:

Someone creates a semaphore

Process (A) operates on it, creating a sem_undo structure for itself (on
it's own node) and a sem_semundo for the semaphore (on the semaphore's
node).

Process (B), (where A !=B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.

Someone creates a new semaphore that happens to get the same index (but a
different sequence) from the original semaphore.
Process (A) exits - when we try to clean up its sem_undo structure
sem_checkid fails because the sequence numbers don't match.

Attached test program that crashes the system.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 18:09

Message:
Logged In: YES 
user_id=166336
Originator: YES

Nah, that's not the bug - sma gets freed before freeary returns, so who
cares if it has dangling pointers.




----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:33

Message:
Logged In: YES 
user_id=166336
Originator: YES

Can it be as simple as that?  Look at the code in freeary:

#ifdef CONFIG_SSI
        for (un = sma->undo; un;) {
                u = un;
                un = u->id_next;
                kfree(u);
        }
#else

sma->undo is left pointing to free'd memory.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:  Totem makes a
semaphore, ups and downs it a few times; then removes it and recreates it;
carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Heres the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... so we ask the nameserver for it'd ID

ipcname_getid newid=360448, create=1
   ... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
   ... create is now done


semctl_down: IPC_RMID 360448
   ... now totem deletes the semaphore

freeary id=360448

cli_ipcname_rmid id=360448 service=1
    ... so we inform the ipc nameserver


sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... we ask for it's ID

ipcname_getid newid=393216, create=1
   ... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
    ... create done


ipc_checkid: 360448 / 32768 != 12
    ... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.


------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well well well.  I can now reproduce this bug - launch totem (gnome movie
player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: radeon button ac battery parport_pc parport pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random
ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU:    1
EIP:    0060:[<c01d4b6d>]    Not tainted VLI
EFLAGS: 00210246   (2.6.10-ssi-1.9.2-jh-3) 
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188   ebx: 00060b37   ecx: 00000000   edx: f665d2c0
esi: f5005188   edi: f6b48b80   ebp: f6d5fdcc   esp: f6d5fdb0
ds: 007b   es: 007b   ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80
f6d5fe5c 
       c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000
f6d5fe0c 
       f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80
f6d5fe48 
Call Trace:
 [<c010671f>] show_stack+0x7f/0xa0
 [<c01068c4>] show_registers+0x164/0x230
 [<c0106c74>] die+0xf4/0x1c0
 [<c011f56d>] do_page_fault+0x48d/0x689
 [<c0106383>] error_code+0x2b/0x30
 [<c01d474e>] exit_sem+0x15e/0x190
 [<c012a619>] do_exit+0x159/0x4f0
 [<c012aa7a>] do_group_exit+0x3a/0xc0
 [<c0135163>] get_signal_to_deliver+0x233/0x360
 [<c0105590>] do_signal+0x70/0x150
 [<c01056c7>] do_notify_resume+0x57/0x8c
 [<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d
50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7
04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 
 
Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d 
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250   396087        1  1    1   R  0xf7e21430 *totem
EBP        EIP        Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88,
0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84,
0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000,
0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8,
0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010,
0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
           0xc0105866 work_notifysig+0x13
[1]kdb> 


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-06-04 13:35:57

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-06-04 15:35

Message:
Logged In: YES 
user_id=166336
Originator: YES

Actually it doesn't matter if wrapping sequence numbers give us a
collision - the proc sem_undo only gets done if the semaphore has a
matching sem_semundo.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-04 13:47

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, the patch is pretty simple - if sem_checkid fails it's not a bug, it's
just a stale undo operation, which can be ignored.

(This will screw up if the semaphore sequence number wraps around during
the lifetime of a process that uses semaphores that are removed and
recreated, but the sequence number can go up to 2^(32-15) so I doubt that
will happen often).

File Added: sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:41

Message:
Logged In: YES 
user_id=166336
Originator: YES

(of course in my message below I meant:

"Process (B), (where A != B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_undo for process A."

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:04

Message:
Logged In: YES 
user_id=166336
Originator: YES

File Added: semcrash.c

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 12:55

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, here's the bug:

Someone creates a semaphore

Process (A) operates on it, creating a sem_undo structure for itself (on
it's own node) and a sem_semundo for the semaphore (on the semaphore's
node).

Process (B), (where A !=B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.

Someone creates a new semaphore that happens to get the same index (but a
different sequence) from the original semaphore.
Process (A) exits - when we try to clean up its sem_undo structure
sem_checkid fails because the sequence numbers don't match.

Attached test program that crashes the system.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 18:09

Message:
Logged In: YES 
user_id=166336
Originator: YES

Nah, that's not the bug - sma gets freed before freeary returns, so who
cares if it has dangling pointers.




----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:33

Message:
Logged In: YES 
user_id=166336
Originator: YES

Can it be as simple as that?  Look at the code in freeary:

#ifdef CONFIG_SSI
        for (un = sma->undo; un;) {
                u = un;
                un = u->id_next;
                kfree(u);
        }
#else

sma->undo is left pointing to free'd memory.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:  Totem makes a
semaphore, ups and downs it a few times; then removes it and recreates it;
carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Heres the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... so we ask the nameserver for it'd ID

ipcname_getid newid=360448, create=1
   ... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
   ... create is now done


semctl_down: IPC_RMID 360448
   ... now totem deletes the semaphore

freeary id=360448

cli_ipcname_rmid id=360448 service=1
    ... so we inform the ipc nameserver


sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... we ask for it's ID

ipcname_getid newid=393216, create=1
   ... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
    ... create done


ipc_checkid: 360448 / 32768 != 12
    ... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.


------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well well well.  I can now reproduce this bug - launch totem (gnome movie
player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: radeon button ac battery parport_pc parport pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random
ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU:    1
EIP:    0060:[<c01d4b6d>]    Not tainted VLI
EFLAGS: 00210246   (2.6.10-ssi-1.9.2-jh-3) 
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188   ebx: 00060b37   ecx: 00000000   edx: f665d2c0
esi: f5005188   edi: f6b48b80   ebp: f6d5fdcc   esp: f6d5fdb0
ds: 007b   es: 007b   ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80
f6d5fe5c 
       c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000
f6d5fe0c 
       f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80
f6d5fe48 
Call Trace:
 [<c010671f>] show_stack+0x7f/0xa0
 [<c01068c4>] show_registers+0x164/0x230
 [<c0106c74>] die+0xf4/0x1c0
 [<c011f56d>] do_page_fault+0x48d/0x689
 [<c0106383>] error_code+0x2b/0x30
 [<c01d474e>] exit_sem+0x15e/0x190
 [<c012a619>] do_exit+0x159/0x4f0
 [<c012aa7a>] do_group_exit+0x3a/0xc0
 [<c0135163>] get_signal_to_deliver+0x233/0x360
 [<c0105590>] do_signal+0x70/0x150
 [<c01056c7>] do_notify_resume+0x57/0x8c
 [<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d
50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7
04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 
 
Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d 
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250   396087        1  1    1   R  0xf7e21430 *totem
EBP        EIP        Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88,
0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84,
0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000,
0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8,
0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010,
0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
           0xc0105866 work_notifysig+0x13
[1]kdb> 


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

[SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931

From: SourceForge.net <no...@so...> - 2008-10-19 10:40:06

Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: IPC
Group: v1.9.3
>Status: Closed
>Resolution: Fixed
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
>Assigned to: John Hughes (hughesj)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55



----------------------------------------------------------------------

>Comment By: John Hughes (hughesj)
Date: 2008-10-19 12:39

Message:
Fixed in cvs

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-04 15:35

Message:
Logged In: YES 
user_id=166336
Originator: YES

Actually it doesn't matter if wrapping sequence numbers give us a
collision - the proc sem_undo only gets done if the semaphore has a
matching sem_semundo.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-04 13:47

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, the patch is pretty simple - if sem_checkid fails it's not a bug, it's
just a stale undo operation, which can be ignored.

(This will screw up if the semaphore sequence number wraps around during
the lifetime of a process that uses semaphores that are removed and
recreated, but the sequence number can go up to 2^(32-15) so I doubt that
will happen often).

File Added: sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:41

Message:
Logged In: YES 
user_id=166336
Originator: YES

(of course in my message below I meant:

"Process (B), (where A != B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_undo for process A."

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:04

Message:
Logged In: YES 
user_id=166336
Originator: YES

File Added: semcrash.c

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 12:55

Message:
Logged In: YES 
user_id=166336
Originator: YES

Ok, here's the bug:

Someone creates a semaphore

Process (A) operates on it, creating a sem_undo structure for itself (on
it's own node) and a sem_semundo for the semaphore (on the semaphore's
node).

Process (B), (where A !=B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.

Someone creates a new semaphore that happens to get the same index (but a
different sequence) from the original semaphore.
Process (A) exits - when we try to clean up its sem_undo structure
sem_checkid fails because the sequence numbers don't match.

Attached test program that crashes the system.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 18:09

Message:
Logged In: YES 
user_id=166336
Originator: YES

Nah, that's not the bug - sma gets freed before freeary returns, so who
cares if it has dangling pointers.




----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:33

Message:
Logged In: YES 
user_id=166336
Originator: YES

Can it be as simple as that?  Look at the code in freeary:

#ifdef CONFIG_SSI
        for (un = sma->undo; un;) {
                u = un;
                un = u->id_next;
                kfree(u);
        }
#else

sma->undo is left pointing to free'd memory.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:  Totem makes a
semaphore, ups and downs it a few times; then removes it and recreates it;
carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Heres the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... so we ask the nameserver for it'd ID

ipcname_getid newid=360448, create=1
   ... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
   ... create is now done


semctl_down: IPC_RMID 360448
   ... now totem deletes the semaphore

freeary id=360448

cli_ipcname_rmid id=360448 service=1
    ... so we inform the ipc nameserver


sys_semget: key=56a4d5 nsems=1 flags=3b0
   ... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
   ... we ask for it's ID

ipcname_getid newid=393216, create=1
   ... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
    ... create done


ipc_checkid: 360448 / 32768 != 12
    ... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.


------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well well well.  I can now reproduce this bug - launch totem (gnome movie
player) on an .mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.

Here's the trace I got from 2.6.10:

Unable to handle kernel NULL pointer dereference at virtual address
00000004
 printing eip:
c01d4b6d
*pde = 00000000
Oops: 0000 [#1]
SMP 
Modules linked in: radeon button ac battery parport_pc parport pcspkr
snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd
soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random
ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000
CPU:    1
EIP:    0060:[<c01d4b6d>]    Not tainted VLI
EFLAGS: 00210246   (2.6.10-ssi-1.9.2-jh-3) 
EIP is at ssi_semexit+0x3d/0xd0
eax: f5005188   ebx: 00060b37   ecx: 00000000   edx: f665d2c0
esi: f5005188   edi: f6b48b80   ebp: f6d5fdcc   esp: f6d5fdb0
ds: 007b   es: 007b   ss: 0068
Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250)
Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80
f6d5fe5c 
       c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000
f6d5fe0c 
       f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80
f6d5fe48 
Call Trace:
 [<c010671f>] show_stack+0x7f/0xa0
 [<c01068c4>] show_registers+0x164/0x230
 [<c0106c74>] die+0xf4/0x1c0
 [<c011f56d>] do_page_fault+0x48d/0x689
 [<c0106383>] error_code+0x2b/0x30
 [<c01d474e>] exit_sem+0x15e/0x190
 [<c012a619>] do_exit+0x159/0x4f0
 [<c012aa7a>] do_group_exit+0x3a/0xc0
 [<c0135163>] get_signal_to_deliver+0x233/0x360
 [<c0105590>] do_signal+0x70/0x150
 [<c01056c7>] do_notify_resume+0x57/0x8c
 [<c0105866>] work_notifysig+0x13/0x15
Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d
50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7
04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 
 
Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops
due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d 
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250   396087        1  1    1   R  0xf7e21430 *totem
EBP        EIP        Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88,
0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84,
0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000,
0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8,
0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010,
0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
           0xc0105866 work_notifysig+0x13
[1]kdb> 


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-21 15:38

Message:
Logged In: YES 
user_id=166336
Originator: YES

I'm having some difficulty reproducing this problem after a reboot.

I've hacked some debugging printf's into the kernel I'm using and will add
any new info when/if I find it.


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-17 12:56

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Can this be reproduced in the original compiled kernel from the latest
binary release?  Apache uses IPC semaphores and have not run into this bug
on UP/SMP.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 12:21

Message:
Logged In: YES 
user_id=166336
Originator: YES

Well, since "ripc_drop_locks" is for shared memory not semaphores it's
probably a different bug.


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES 
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?

It's trying to unlock a lock that isn't locked.


------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy
pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm
snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata
hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod
tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44
c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708
00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002
f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78
04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00
1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90


----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport
pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec
snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc
ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80
d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4
defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c
c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89
1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07
8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.


----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO 

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 				continue;
 			}
 
-			__ssi_semexit(semid, current->tgid, sma);
+			__ssi_semexit(u->semid, current->tgid, sma);
 		}
 	}

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
   exit_sem
   cli_ripc_semexit

[server node]
   [...]
   svr_ripc_semexit
   ripc_semexit
   ssi_semexit
   __ssi_semexit

and your patch touches the

   exit_sem
   __ssi_semexit

path.

I must admit I'm pretty suprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?


----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES 
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES 
user_id=166336
Originator: YES

Here's the code:

static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array
*sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));



----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541