Thread: [SSI-devel] [ ssic-linux-Bugs-1941808 ] kernel BUG @ ipc/semc:1931
From: SourceForge.net <no...@so...> - 2008-04-14 07:57:11
Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.

Category: IPC
Group: v1.9.3
Status: Open
Resolution: None
Priority: 5
Private: No
Submitted By: John Hughes (hughesj)
Assigned to: Nobody/Anonymous (nobody)
Summary: kernel BUG @ ipc/semc:1931

Initial Comment:
Seen this one a couple of times:

Kills the keyboard, eventually node dies.

Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes
(which uses semaphores).

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d447c>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at ssi_semexit+0xfc/0x110
eax: 00000001   ebx: 0005800a   ecx: 00000002   edx: e59f3f88
esi: e59f3f88   edi: 00030e83   ebp: f721fe64   esp: f721fe44
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70)
Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360
       f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000
       00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c02614f1>] ripc_semexit+0x31/0x50
 [<c0256fb3>] svr_ripc_semexit+0xa3/0x100
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55

----------------------------------------------------------------------
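A note on reading the "Code:" line of the oops in this message. On i386 kernels of this era built with verbose BUG() reporting, BUG() is, as far as I recall from include/asm-i386/bug.h, emitted as a ud2 instruction (0f 0b) followed by a 16-bit line number and a 32-bit pointer to the file name. That layout is an assumption stated from memory, not something the thread confirms. Under that assumption the bytes after the <0f> marker can be decoded with a sketch like this (the struct and function names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for the data assumed to trail a ud2 instruction. */
struct bug_trap_info {
    unsigned line;      /* 16-bit little-endian line number */
    uint32_t file_ptr;  /* 32-bit little-endian pointer to the file name */
};

/*
 * Decode the bytes starting at the faulting instruction.
 * Returns 0 on success, -1 if the buffer is too short or does not
 * start with ud2 (0f 0b).
 */
static int decode_bug_trap(const uint8_t *code, size_t len,
                           struct bug_trap_info *out)
{
    if (len < 8 || code[0] != 0x0f || code[1] != 0x0b)
        return -1;

    out->line = (unsigned)code[2] | ((unsigned)code[3] << 8);
    out->file_ptr = (uint32_t)code[4] |
                    ((uint32_t)code[5] << 8) |
                    ((uint32_t)code[6] << 16) |
                    ((uint32_t)code[7] << 24);
    return 0;
}
```

Fed the bytes 0f 0b 79 07 8f b8 49 c0 from the trace, this yields line 0x0779 = 1913 and a file-name pointer of 0xc049b88f, which agrees with the "kernel BUG at ipc/sem.c:1913!" banner.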
From: SourceForge.net <no...@so...> - 2008-04-14 08:24:13
Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

>Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES
user_id=166336
Originator: YES

Here's the code:

static inline void
__ssi_semexit(int semid, pid_t pid, struct sem_array *sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));

----------------------------------------------------------------------
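For readers without the tree to hand: in mainline 2.6.11, as far as I recall, sem_checkid() wraps ipc_checkid(), which compares the sequence number encoded in the id (id / SEQ_MULTIPLIER, with SEQ_MULTIPLIER equal to IPCMNI, i.e. 32768) against the seq stored with the object. Whether the SSI tree changes any of this is not shown in the thread, so the following is only a userspace model of that check; model_perm and the function names are made up for illustration.

```c
#include <assert.h>

/* Assumed value of SEQ_MULTIPLIER (IPCMNI) in mainline 2.6.11. */
#define MODEL_SEQ_MULTIPLIER 32768

/* Hypothetical stand-in for the seq field of struct kern_ipc_perm. */
struct model_perm {
    unsigned long seq;
};

/* Slot index of the object within the ids array. */
static int model_id_slot(int id)
{
    return id % MODEL_SEQ_MULTIPLIER;
}

/* Sequence number ("generation") encoded in the id. */
static unsigned long model_id_seq(int id)
{
    return (unsigned long)(id / MODEL_SEQ_MULTIPLIER);
}

/* Nonzero means the id is stale, the same convention as ipc_checkid(). */
static int model_checkid(const struct model_perm *perm, int id)
{
    return model_id_seq(id) != perm->seq;
}
```

If the value 0x0005800a seen in ebx and on the stack of the first trace is the semid (an inference, not something the oops states), it splits into slot 10 and sequence number 11. The BUG_ON would then fire when the sem_array handed to __ssi_semexit carries a different sequence number, for instance because the set was removed and the slot reused.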
From: SourceForge.net <no...@so...> - 2008-04-15 01:16:35
Bugs item #1941808, was opened at 2008-04-14 03:57
Message generated for change (Comment added) made by rogertsang
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

>Comment By: Roger Tsang (rogertsang)
Date: 2008-04-14 21:16

Message:
Logged In: YES
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock().  Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2008-04-15 13:56:09
Bugs item #1941808, was opened at 2008-04-14 00:57
Message generated for change (Comment added) made by nobody
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 06:55

Message:
Logged In: NO

I'm sorry Roger, I don't get the point of your patch.

I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
                                continue;
                        }

-                       __ssi_semexit(semid, current->tgid, sma);
+                       __ssi_semexit(u->semid, current->tgid, sma);
                }
        }

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.  I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
exit_sem
   cli_ripc_semexit
[server node]
[...]
svr_ripc_semexit
   ripc_semexit
      ssi_semexit
         __ssi_semexit

and your patch touches the exit_sem -> __ssi_semexit path.

I must admit I'm pretty surprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if sem_checkid doesn't
match?  Couldn't it just indicate a sem_exit/IPC_RMID collision?

----------------------------------------------------------------------
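The "bail out instead of BUG" idea from the anonymous comment can be sketched in userspace as follows. This is a model only, not SSI code: model_sem_array, model_semexit and the undo_entries field are hypothetical, locking is left out entirely, and the sequence-number encoding (id / 32768) is assumed from mainline.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed value of SEQ_MULTIPLIER (IPCMNI) in mainline 2.6.11. */
#define MODEL_SEQ_MULTIPLIER 32768

/* Hypothetical stand-in for a semaphore set; only the generation matters. */
struct model_sem_array {
    unsigned long seq;   /* generation of the set currently in this slot */
    int undo_entries;    /* pretend count of undo records still attached */
};

/*
 * Model of the proposed behaviour: if the id no longer matches the set
 * (for example because IPC_RMID removed it and the slot was reused),
 * return quietly instead of hitting BUG_ON().
 * Returns 1 if undo processing ran, 0 if the id was stale.
 */
static int model_semexit(int semid, struct model_sem_array *sma)
{
    if (sma == NULL ||
        (unsigned long)(semid / MODEL_SEQ_MULTIPLIER) != sma->seq)
        return 0;              /* stale id: nothing left to undo */

    sma->undo_entries = 0;     /* stand-in for the real undo processing */
    return 1;
}
```

Whether returning silently is safe depends on a mismatch only ever arising from a removal race, which the thread does not settle; if the mismatch has another cause, this would merely hide it.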
From: SourceForge.net <no...@so...> - 2008-04-16 11:22:59
Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

>Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c01d3f29>]    Not tainted VLI
EFLAGS: 00210202   (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001   ebx: c597e808   ecx: 00000001   edx: c597e808
esi: 000e800c   edi: cbf682e0   ebp: d76fce6c   esp: d76fcdd0
ds: 007b   es: 007b   ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80 d1e12ee4
       d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4 defe7380
       0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c c0125456
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c012a319>] do_exit+0xb9/0x3b0
 [<c012a68c>] do_group_exit+0x3c/0xb0
 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310
 [<c01057c4>] do_signal+0x74/0x140
 [<c0105917>] do_notify_resume+0x87/0x8c
 [<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89 1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07 8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.

----------------------------------------------------------------------
From: SourceForge.net <no...@so...> - 2008-04-17 08:09:00
Bugs item #1941808, was opened at 2008-04-14 09:57
Message generated for change (Comment added) made by hughesj
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541

>Comment By: John Hughes (hughesj)
Date: 2008-04-17 10:08

Message:
Logged In: YES
user_id=166336
Originator: YES

Another BUG in the semaphore code - may indicate the underlying cause of
the problem?  It's trying to unlock a lock that isn't locked.

------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU:    0
EIP:    0060:[<c046290b>]    Not tainted VLI
EFLAGS: 00010202   (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001   ebx: c0750140   ecx: c0750101   edx: f7e12e08
esi: f70c2400   edi: c0753360   ebp: f7032f10   esp: f7032f10
ds: 007b   es: 007b   ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44 c0261dd5
       02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708 00000003
       f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002 f7032fec
Call Trace:
 [<c010694f>] show_stack+0x7f/0xa0
 [<c0106b04>] show_registers+0x164/0x220
 [<c0106e94>] die+0xf4/0x1c0
 [<c0107015>] do_trap+0xb5/0xc0
 [<c01072cc>] do_invalid_op+0xbc/0xd0
 [<c01065a3>] error_code+0x2b/0x30
 [<c01cecbb>] ipc_unlock+0xb/0x10
 [<c01ce77e>] ipc_drop_locks+0x1e/0x40
 [<c0261dd5>] ripc_drop_locks+0x45/0x60
 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0
 [<c020abb3>] icssvr_daemon+0x2f3/0xab0
 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78 04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00 1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90

----------------------------------------------------------------------
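The "unlock a lock that isn't locked" reading fits what the "Code:" bytes of the spinlock oops suggest: a comparison against 0xdead4ead (ad 4e ad de), then a test of the lock byte, with two ud2 traps carrying line numbers 0x6f (111) and 0x70 (112). That looks like the usual debug-spinlock pair of checks, magic first and "is it locked" second, with the second one firing. The following is a single-threaded userspace model of those two checks, written from memory of the i386 debug spinlock rather than copied from the tree; all names are hypothetical and nothing here is atomic.

```c
#include <assert.h>

#define MODEL_SPINLOCK_MAGIC 0xdead4eadu  /* value seen in the Code: bytes */

/* Hypothetical model of a debug spinlock: 1 = unlocked, <= 0 = locked. */
struct model_spinlock {
    volatile signed char slock;
    unsigned magic;
};

enum model_unlock_result {
    MODEL_UNLOCK_OK = 0,
    MODEL_UNLOCK_BAD_MAGIC,   /* models the first check (line 111 here) */
    MODEL_UNLOCK_NOT_LOCKED   /* models the check that fired (line 112) */
};

static void model_spin_init(struct model_spinlock *l)
{
    l->slock = 1;
    l->magic = MODEL_SPINLOCK_MAGIC;
}

/* Single-threaded stand-in for taking the lock; no atomics, no spinning. */
static void model_spin_lock(struct model_spinlock *l)
{
    l->slock = 0;
}

/* Reports the condition the kernel would BUG() on instead of trapping. */
static enum model_unlock_result model_spin_unlock(struct model_spinlock *l)
{
    if (l->magic != MODEL_SPINLOCK_MAGIC)
        return MODEL_UNLOCK_BAD_MAGIC;
    if (l->slock > 0)
        return MODEL_UNLOCK_NOT_LOCKED;
    l->slock = 1;
    return MODEL_UNLOCK_OK;
}
```

In this model a second model_spin_unlock() on the same lock returns MODEL_UNLOCK_NOT_LOCKED, which is the double-unlock (or unlock-without-lock) case the oops appears to report; eax = 00000001 in the register dump is consistent with the lock byte reading as "unlocked" at that point.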
From: SourceForge.net <no...@so...> - 2008-04-17 10:21:33
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: IPC Group: v1.9.3 Status: Open Resolution: None Priority: 5 Private: No Submitted By: John Hughes (hughesj) Assigned to: Nobody/Anonymous (nobody) Summary: kernel BUG @ ipc/semc:1931 Initial Comment: Seen this one a couple of times: Kills the keyboard, eventually node dies. Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores). ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d447c>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at ssi_semexit+0xfc/0x110 eax: 00000001 ebx: 0005800a ecx: 00000002 edx: e59f3f88 esi: e59f3f88 edi: 00030e83 ebp: f721fe64 esp: f721fe44 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70) Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360 f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000 00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c02614f1>] ripc_semexit+0x31/0x50 [<c0256fb3>] 
svr_ripc_semexit+0xa3/0x100 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10 Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55 ---------------------------------------------------------------------- >Comment By: John Hughes (hughesj) Date: 2008-04-17 12:21 Message: Logged In: YES user_id=166336 Originator: YES Well, since "ripc_drop_locks" is for shared memory not semaphores it's probably a different bug. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 10:08 Message: Logged In: YES user_id=166336 Originator: YES Another BUG in the semaphore code - may indicate the underlying cause of the problem? It's trying to unlock a lock that isn't locked. ------------[ cut here ]------------ kernel BUG at include/asm/spinlock.h:112! invalid operand: 0000 [#1] SMP Modules linked in: i915 drm button ac battery parport_pc parport floppy pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c046290b>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at _spin_unlock+0x1b/0x30 eax: 00000001 ebx: c0750140 ecx: c0750101 edx: f7e12e08 esi: f70c2400 edi: c0753360 ebp: f7032f10 esp: f7032f10 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930) Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44 c0261dd5 02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708 00000003 f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002 f7032fec Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] 
do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c01cecbb>] ipc_unlock+0xb/0x10 [<c01ce77e>] ipc_drop_locks+0x1e/0x40 [<c0261dd5>] ripc_drop_locks+0x45/0x60 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10 Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78 04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00 1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90 ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-16 13:22 Message: Logged In: YES user_id=166336 Originator: YES Here's another example, this time it was going through the local exit_sem path: ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d3f29>] Not tainted VLI EFLAGS: 00210202 (2.6.11-jh-1) EIP is at exit_sem+0x229/0x2b0 eax: 00000001 ebx: c597e808 ecx: 00000001 edx: c597e808 esi: 000e800c edi: cbf682e0 ebp: d76fce6c esp: d76fcdd0 ds: 007b es: 007b ss: 0068 Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0) Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80 d1e12ee4 d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4 defe7380 0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c c0125456 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c012a319>] do_exit+0xb9/0x3b0 [<c012a68c>] do_group_exit+0x3c/0xb0 
[<c01350cf>] get_signal_to_deliver+0x1ff/0x310 [<c01057c4>] do_signal+0x74/0x140 [<c0105917>] do_notify_resume+0x87/0x8c [<c0105a86>] work_notifysig+0x13/0x15 Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89 1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07 8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8 So it's not to do with local/remote semaphores. ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2008-04-15 15:55 Message: Logged In: NO I'm sorry Roger, I don't get the point of your patch. I suppose the relevant bit is: @@ -2027,7 +2025,7 @@ namesvr_semexit_go: continue; } - __ssi_semexit(semid, current->tgid, sma); + __ssi_semexit(u->semid, current->tgid, sma); } } so if semid has been changed to be bad, or if u->semid was bad and has changed to be good we won't panic. I can't see how either of these conditions can happen. Also the call path that seems to be causing problems seems to be (from the trace above): [client node] exit_sem cli_ripc_semexit [server node] [...] svr_ripc_semexit ripc_semexit ssi_semexit __ssi_semexit and your patch touches the exit_sem __ssi_semexit path. I must admit I'm pretty surprised to see that the client/server stuff is being used - I thought everything was staying node-local. Maybe the fix is simply to bail out of __ssi_semexit if check_semid doesn't match? Couldn't it just indicate a sem_exit/IPC_RMID collision? ---------------------------------------------------------------------- Comment By: Roger Tsang (rogertsang) Date: 2008-04-15 03:16 Message: Logged In: YES user_id=1246761 Originator: NO Maybe semid changed before exit_sem() got sem_lock(). Try attached patch. 
File Added: ipc_sem.c.patch ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-14 10:24 Message: Logged In: YES user_id=166336 Originator: YES Here's the code: static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array *sma) { int nsems, i; struct sem_semundo *un, **unp; BUG_ON(sem_checkid(sma,semid)); ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 |
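The BUG_ON(sem_checkid(sma,semid)) quoted above, and the suggestion to bail out instead of panicking, can be sketched in userspace. This is only a minimal model, not the SSI code: it assumes ids are built as seq * SEQ_MULTIPLIER + slot with SEQ_MULTIPLIER equal to 32768, as in stock 2.6-era ipc/util.c, and the names model_sem_array, model_checkid and model_semexit are hypothetical. Whether 0x5800a (a value that shows up in the first trace) really is the semid is also an assumption.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: id = seq * SEQ_MULTIPLIER + slot, as in stock 2.6 ipc/util.c. */
#define SEQ_MULTIPLIER 32768

/* Hypothetical stand-in for struct sem_array; only the sequence number matters here. */
struct model_sem_array {
    int seq; /* sequence number recorded when the set was created */
};

/* Nonzero when semid does not belong to this incarnation of the slot. */
static int model_checkid(const struct model_sem_array *sma, int semid)
{
    assert(sma != NULL);
    return semid / SEQ_MULTIPLIER != sma->seq;
}

/*
 * Bail-out variant: report a stale id and skip the undo work
 * instead of treating the mismatch as a fatal kernel bug.
 * Returns 0 when undo processing may proceed, -1 when the id is stale.
 */
static int model_semexit(const struct model_sem_array *sma, int semid)
{
    if (model_checkid(sma, semid)) {
        fprintf(stderr, "semid %#x is stale, skipping undo\n", (unsigned)semid);
        return -1;
    }
    /* ... undo processing would go here ... */
    return 0;
}
```

Under these assumptions 0x5800a decodes to slot 10 with sequence 11. If the set were removed with IPC_RMID and the slot reused, the stored sequence would move on and the same id would fail the check, which is the sem_exit/IPC_RMID collision mentioned in the thread.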
From: SourceForge.net <no...@so...> - 2008-04-17 10:56:09
|
Bugs item #1941808, was opened at 2008-04-14 03:57 Message generated for change (Comment added) made by rogertsang ---------------------------------------------------------------------- >Comment By: Roger Tsang (rogertsang) Date: 2008-04-17 06:56 Message: Logged In: YES user_id=1246761 Originator: NO Can this be reproduced in the original compiled kernel from the latest binary release? Apache uses IPC semaphores and have not run into this bug on UP/SMP. |
From: SourceForge.net <no...@so...> - 2008-04-21 13:38:20
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj ---------------------------------------------------------------------- >Comment By: John Hughes (hughesj) Date: 2008-04-21 15:38 Message: Logged In: YES user_id=166336 Originator: YES I'm having some difficulty reproducing this problem after a reboot. I've hacked some debugging printf's into the kernel I'm using and will add any new info when/if I find it. |
From: SourceForge.net <no...@so...> - 2008-05-30 14:30:25
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj ---------------------------------------------------------------------- >Comment By: John Hughes (hughesj) Date: 2008-05-30 16:30 Message: Logged In: YES user_id=166336 Originator: YES Well well well. I can now reproduce this bug - launch totem (gnome movie player) on an .mp3 file, quit totem - crash! In fact I've seen this bug before - the first time on a 2.6.10 based kernel. Here's the trace I got from 2.6.10: Unable to handle kernel NULL pointer dereference at virtual address 00000004 printing eip: c01d4b6d *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: radeon button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000 CPU: 1 EIP: 0060:[<c01d4b6d>] Not tainted VLI EFLAGS: 00210246 (2.6.10-ssi-1.9.2-jh-3) EIP is at ssi_semexit+0x3d/0xd0 eax: f5005188 ebx: 00060b37 ecx: 00000000 edx: f665d2c0 esi: f5005188 edi: f6b48b80 ebp: f6d5fdcc esp: f6d5fdb0 ds: 007b es: 007b ss: 0068 Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250) Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80 f6d5fe5c c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000 f6d5fe0c f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80 f6d5fe48 Call Trace: [<c010671f>] show_stack+0x7f/0xa0 [<c01068c4>] show_registers+0x164/0x230 [<c0106c74>] die+0xf4/0x1c0 [<c011f56d>] do_page_fault+0x48d/0x689 [<c0106383>] error_code+0x2b/0x30 [<c01d474e>] exit_sem+0x15e/0x190 [<c012a619>] do_exit+0x159/0x4f0 [<c012aa7a>] do_group_exit+0x3a/0xc0 [<c0135163>] get_signal_to_deliver+0x233/0x360 [<c0105590>] do_signal+0x70/0x150 [<c01056c7>] do_notify_resume+0x57/0x8c [<c0105866>] work_notifysig+0x13/0x15 Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d 50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7 04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops due to oops @ 0xc01d4b6d eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c [1]kdb> Stack traceback for pid 396087 0xf7e21250 396087 1 1 1 R 0xf7e21430 *totem EBP EIP Function (args) 0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88, 0x5d, 0xf720c660) 0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84, 0xf7e21718) 0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000) 0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000, 0xf6d5f000) 0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8, 0xf6d5ffc4, 0x0, 0x200282) 0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010, 0xb71e37b0) 0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57 0xc0105866 work_notifysig+0x13 [1]kdb> |
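In the 2.6.10 oops above, the fault address 00000004 together with ecx: 00000000 is consistent with reading a field at offset 4 through a NULL pointer, for example after a walk over a per-set undo list found no entry for the exiting pid. That reading is an inference from the register dump, not something the thread confirms. A minimal userspace sketch of a list walk that reports "not found" instead of dereferencing the end of the list might look like this (struct model_undo and both functions are hypothetical names, not the SSI sources):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an undo entry; layout is illustrative only. */
struct model_undo {
    struct model_undo *next; /* link to the next entry, NULL at the end */
    int pid;                 /* owner of this undo entry */
};

/* Walk the list and return the entry for pid, or NULL when none exists. */
static struct model_undo *model_find_undo(struct model_undo *head, int pid)
{
    struct model_undo *un;

    for (un = head; un != NULL; un = un->next) {
        if (un->pid == pid)
            return un;
    }
    return NULL; /* caller must check before touching un->pid */
}

/* Small self-check showing the intended use of the lookup. */
static void model_selfcheck(void)
{
    struct model_undo second = { NULL, 396087 };
    struct model_undo first = { &second, 1234 };

    assert(model_find_undo(&first, 396087) == &second);
    assert(model_find_undo(&first, 42) == NULL);
    assert(model_find_undo(NULL, 42) == NULL);
}
```

The point of the sketch is only that the "not found" case has to be handled before any field of the entry is read or printed.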
From: SourceForge.net <no...@so...> - 2008-06-02 15:30:05
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: IPC Group: v1.9.3 Status: Open Resolution: None Priority: 5 Private: No Submitted By: John Hughes (hughesj) Assigned to: Nobody/Anonymous (nobody) Summary: kernel BUG @ ipc/semc:1931 Initial Comment: Seen this one a couple of times: Kills the keyboard, eventually node dies. Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores). ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d447c>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at ssi_semexit+0xfc/0x110 eax: 00000001 ebx: 0005800a ecx: 00000002 edx: e59f3f88 esi: e59f3f88 edi: 00030e83 ebp: f721fe64 esp: f721fe44 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70) Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360 f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000 00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c02614f1>] ripc_semexit+0x31/0x50 [<c0256fb3>] 
svr_ripc_semexit+0xa3/0x100 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55

----------------------------------------------------------------------

>Comment By: John Hughes (hughesj) Date: 2008-06-02 17:30

Message: Logged In: YES user_id=166336 Originator: YES

Here's the sequence of operations that causes the crash:

Totem makes a semaphore, ups and downs it a few times; then removes it and recreates it; carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the sequence is now that of the 2nd one.

Here's the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
... totem creates the semaphore
cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
... so we ask the nameserver for its ID
ipcname_getid newid=360448, create=1
... it doesn't exist so we must create it
sem_buildid (id=0, seq=11) = 360448
... create is now done
semctl_down: IPC_RMID 360448
... now totem deletes the semaphore
freeary id=360448
cli_ipcname_rmid id=360448 service=1
... so we inform the ipc nameserver
sys_semget: key=56a4d5 nsems=1 flags=3b0
... totem re-creates the semaphore
cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
... we ask for its ID
ipcname_getid newid=393216, create=1
... it doesn't exist so we must re-create it
sem_buildid (id=0, seq=12) = 393216
... create done
ipc_checkid: 360448 / 32768 != 12
... later on totem exits, so we try to perform the UNDO actions, but we've got the wrong sequence.

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1] ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-05-30 16:30 Message: Logged In: YES user_id=166336 Originator: YES Well well well. I can now reproduce this bug - launch totem (gnome movie player) on an .mp3 file, quit totem - crash! In fact I've seen this bug before - the first time on a 2.6.10 based kernel. Here's the trace I got from 2.6.10: Unable to handle kernel NULL pointer dereference at virtual address 00000004 printing eip: c01d4b6d *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: radeon button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000 CPU: 1 EIP: 0060:[<c01d4b6d>] Not tainted VLI EFLAGS: 00210246 (2.6.10-ssi-1.9.2-jh-3) EIP is at ssi_semexit+0x3d/0xd0 eax: f5005188 ebx: 00060b37 ecx: 00000000 edx: f665d2c0 esi: f5005188 edi: f6b48b80 ebp: f6d5fdcc esp: f6d5fdb0 ds: 007b es: 007b ss: 0068 Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250) Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80 f6d5fe5c c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000 f6d5fe0c f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80 f6d5fe48 Call Trace: [<c010671f>] show_stack+0x7f/0xa0 [<c01068c4>] show_registers+0x164/0x230 [<c0106c74>] die+0xf4/0x1c0 [<c011f56d>] do_page_fault+0x48d/0x689 [<c0106383>] error_code+0x2b/0x30 [<c01d474e>] exit_sem+0x15e/0x190 [<c012a619>] do_exit+0x159/0x4f0 [<c012aa7a>] do_group_exit+0x3a/0xc0 [<c0135163>] get_signal_to_deliver+0x233/0x360 [<c0105590>] do_signal+0x70/0x150 [<c01056c7>] do_notify_resume+0x57/0x8c [<c0105866>] work_notifysig+0x13/0x15 Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d 50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7 04 24 a8 14 
49 c0 89 44 24 04 e8 50 36 f5 ff 89 34

Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops due to oops @ 0xc01d4b6d
eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0
esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d
ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246
xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c
[1]kdb>
Stack traceback for pid 396087
0xf7e21250 396087 1 1 1 R 0xf7e21430 *totem
EBP EIP Function (args)
0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88, 0x5d, 0xf720c660)
0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84, 0xf7e21718)
0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000)
0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000, 0xf6d5f000)
0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8, 0xf6d5ffc4, 0x0, 0x200282)
0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010, 0xb71e37b0)
0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57
0xc0105866 work_notifysig+0x13
[1]kdb>

----------------------------------------------------------------------

Comment By: John Hughes (hughesj) Date: 2008-04-21 15:38

Message: Logged In: YES user_id=166336 Originator: YES

I'm having some difficulty reproducing this problem after a reboot. I've hacked some debugging printf's into the kernel I'm using and will add any new info when/if I find it.

----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang) Date: 2008-04-17 12:56

Message: Logged In: YES user_id=1246761 Originator: NO

Can this be reproduced in the original compiled kernel from the latest binary release? Apache uses IPC semaphores and have not run into this bug on UP/SMP.
---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 12:21 Message: Logged In: YES user_id=166336 Originator: YES Well, since "ripc_drop_locks" is for shared memory not semaphores it's probably a different bug. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 10:08 Message: Logged In: YES user_id=166336 Originator: YES Another BUG in the semaphore code - may indicate the underlying cause of the problem? It's trying to unlock a lock that isn't locked. ------------[ cut here ]------------ kernel BUG at include/asm/spinlock.h:112! invalid operand: 0000 [#1] SMP Modules linked in: i915 drm button ac battery parport_pc parport floppy pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c046290b>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at _spin_unlock+0x1b/0x30 eax: 00000001 ebx: c0750140 ecx: c0750101 edx: f7e12e08 esi: f70c2400 edi: c0753360 ebp: f7032f10 esp: f7032f10 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930) Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44 c0261dd5 02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708 00000003 f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002 f7032fec Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c01cecbb>] ipc_unlock+0xb/0x10 [<c01ce77e>] ipc_drop_locks+0x1e/0x40 [<c0261dd5>] ripc_drop_locks+0x45/0x60 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10 
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78 04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00 1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90 ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-16 13:22 Message: Logged In: YES user_id=166336 Originator: YES Here's another example, this time it was going through the local exit_sem path: ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d3f29>] Not tainted VLI EFLAGS: 00210202 (2.6.11-jh-1) EIP is at exit_sem+0x229/0x2b0 eax: 00000001 ebx: c597e808 ecx: 00000001 edx: c597e808 esi: 000e800c edi: cbf682e0 ebp: d76fce6c esp: d76fcdd0 ds: 007b es: 007b ss: 0068 Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0) Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80 d1e12ee4 d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4 defe7380 0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c c0125456 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c012a319>] do_exit+0xb9/0x3b0 [<c012a68c>] do_group_exit+0x3c/0xb0 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310 [<c01057c4>] do_signal+0x74/0x140 [<c0105917>] do_notify_resume+0x87/0x8c [<c0105a86>] work_notifysig+0x13/0x15 Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89 1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07 8f b8 49 c0 e9 05 
ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody) Date: 2008-04-15 15:55

Message: Logged In: NO

I'm sorry Roger, I don't get the point of your patch. I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
 			continue;
 		}
-		__ssi_semexit(semid, current->tgid, sma);
+		__ssi_semexit(u->semid, current->tgid, sma);
 	}
 }

so if semid has been changed to be bad, or if u->semid was bad and has changed to be good we won't panic. I can't see how either of these conditions can happen.

Also, the call path that seems to be causing problems is (from the trace above):

[client node]
exit_sem
cli_ripc_semexit
[server node]
[...]
svr_ripc_semexit
ripc_semexit
ssi_semexit
__ssi_semexit

and your patch touches the exit_sem __ssi_semexit path.

I must admit I'm pretty surprised to see that the client/server stuff is being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid doesn't match? Couldn't it just indicate a sem_exit/IPC_RMID collision?

----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang) Date: 2008-04-15 03:16

Message: Logged In: YES user_id=1246761 Originator: NO

Maybe semid changed before exit_sem() got sem_lock(). Try attached patch.

File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj) Date: 2008-04-14 10:24

Message: Logged In: YES user_id=166336 Originator: YES

Here's the code:

static inline void
__ssi_semexit(int semid, pid_t pid, struct sem_array *sma)
{
	int nsems, i;
	struct sem_semundo *un, **unp;

	BUG_ON(sem_checkid(sma,semid));

----------------------------------------------------------------------

You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 |
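The ID arithmetic visible in the debugging output of the 2008-06-02 17:30 comment (sem_buildid (id=0, seq=11) = 360448 and ipc_checkid: 360448 / 32768 != 12) can be modelled in a few lines of user-space C. This is a sketch under assumptions, not the kernel source: the multiplier 32768 is inferred from the printk output, and slot_model, build_id, check_id and replay_totem are hypothetical names.

```c
#include <stdio.h>

/* Assumed from the log line "360448 / 32768 != 12". */
#define SEQ_MULTIPLIER 32768

/* Hypothetical stand-in for one IPC slot: only the sequence number matters here. */
struct slot_model {
    int seq;
};

/* Compose a user-visible ID from a slot index and the slot's sequence number. */
static int build_id(int idx, int seq)
{
    return SEQ_MULTIPLIER * seq + idx;
}

/* Nonzero when id was handed out for an earlier occupant of the slot. */
static int check_id(const struct slot_model *s, int id)
{
    return id / SEQ_MULTIPLIER != s->seq;
}

/*
 * Replay create, IPC_RMID, re-create as in the log, then check the ID
 * that a stale undo record would still carry. Returns check_id's result.
 */
static int replay_totem(void)
{
    struct slot_model s = { 11 };
    int first = build_id(0, s.seq);    /* ID of the first semaphore */

    s.seq = 12;                        /* removed and re-created: sequence advances */
    int second = build_id(0, s.seq);   /* ID of the second semaphore */

    printf("first=%d second=%d stale=%d\n", first, second, check_id(&s, first));
    return check_id(&s, first);
}
```

In this model the undo record made for the first semaphore carries an ID whose sequence part is 11, while the slot now holds sequence 12, which is the mismatch the log reports just before the BUG.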
From: SourceForge.net <no...@so...> - 2008-06-02 15:33:32
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj

----------------------------------------------------------------------

>Comment By: John Hughes (hughesj) Date: 2008-06-02 17:33

Message: Logged In: YES user_id=166336 Originator: YES

Can it be as simple as that? Look at the code in freeary:

#ifdef CONFIG_SSI
	for (un = sma->undo; un;) {
		u = un;
		un = u->id_next;
		kfree(u);
	}
#else

sma->undo is left pointing to free'd memory.

---------------------------------------------------------------------- |
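The freeary loop quoted in the 2008-06-02 17:33 comment walks a singly linked list and frees each node, reading the link before the node is released. A user-space sketch of the same idiom, with the list head cleared afterwards, might look like this. The names set_model, undo_node, push_undo and free_undo_list are hypothetical, malloc/free stand in for the kernel allocator, and whether the real code needs to clear the head depends on whether the containing structure outlives the loop.

```c
#include <stdlib.h>

/* Hypothetical undo record, linked through id_next as in the quoted loop. */
struct undo_node {
    struct undo_node *id_next;
};

/* Hypothetical stand-in for the semaphore set that owns the undo list. */
struct set_model {
    struct undo_node *undo;
};

/* Push one empty record onto the list. Returns 0 on success, -1 on allocation failure. */
static int push_undo(struct set_model *sma)
{
    struct undo_node *n = malloc(sizeof *n);

    if (!n)
        return -1;
    n->id_next = sma->undo;
    sma->undo = n;
    return 0;
}

/* Free every record and clear the head so nothing points at freed memory. */
static int free_undo_list(struct set_model *sma)
{
    int freed = 0;
    struct undo_node *un = sma->undo;

    while (un) {
        struct undo_node *u = un;

        un = u->id_next;   /* read the link before freeing the node */
        free(u);
        freed++;
    }
    sma->undo = NULL;
    return freed;
}
```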
From: SourceForge.net <no...@so...> - 2008-06-02 16:09:52
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: IPC Group: v1.9.3 Status: Open Resolution: None Priority: 5 Private: No Submitted By: John Hughes (hughesj) Assigned to: Nobody/Anonymous (nobody) Summary: kernel BUG @ ipc/semc:1931 Initial Comment: Seen this one a couple of times: Kills the keyboard, eventually node dies. Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores). ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d447c>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at ssi_semexit+0xfc/0x110 eax: 00000001 ebx: 0005800a ecx: 00000002 edx: e59f3f88 esi: e59f3f88 edi: 00030e83 ebp: f721fe64 esp: f721fe44 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70) Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360 f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000 00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c02614f1>] ripc_semexit+0x31/0x50 [<c0256fb3>] 
svr_ripc_semexit+0xa3/0x100 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10 Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55 ---------------------------------------------------------------------- >Comment By: John Hughes (hughesj) Date: 2008-06-02 18:09 Message: Logged In: YES user_id=166336 Originator: YES Nah, that's not the bug - sma gets freed before freeary returns, so who cares if it has dangling pointers. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-02 17:33 Message: Logged In: YES user_id=166336 Originator: YES Can it be as simple as that? Look at the code in freeary: #ifdef CONFIG_SSI for (un = sma->undo; un;) { u = un; un = u->id_next; kfree(u); } #else sma->undo is left pointing to free'd memory. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-02 17:30 Message: Logged In: YES user_id=166336 Originator: YES Here's the sequence of operations that causes the crash: Totem makes a semaphore, ups and downs it a few times; then removes it and recreates it; carries on upping and downing. When totem exits it tries to undo the ops on the 1st semaphore - but the sequence is now that of the 2nd one. Heres the output of some debugging printks I stuck in my kernel: sys_semget: key=56a4d5 nsems=1 flags=3b0 ... totem creates the semaphore cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1 ... so we ask the nameserver for it'd ID ipcname_getid newid=360448, create=1 ... it doesn't exist so we must create it sem_buildid (id=0, seq=11) = 360448 ... create is now done semctl_down: IPC_RMID 360448 ... now totem deletes the semaphore freeary id=360448 cli_ipcname_rmid id=360448 service=1 ... 
so we inform the ipc nameserver sys_semget: key=56a4d5 nsems=1 flags=3b0 ... totem re-creates the semaphore cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1 ... we ask for it's ID ipcname_getid newid=393216, create=1 ... it doesn't exist so we must re-create it sem_buildid (id=0, seq=12) = 393216 ... create done ipc_checkid: 360448 / 32768 != 12 ... later on totem exits, so we try to perform the UNDO actions, but we've got the wrong sequence. ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1937! invalid operand: 0000 [#1] ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-05-30 16:30 Message: Logged In: YES user_id=166336 Originator: YES Well well well. I can now reproduce this bug - launch totem (gnome movie player) on an .mp3 file, quit totem - crash! In fact I've seen this bug before - the first time on a 2.6.10 based kernel. Here's the trace I got from 2.6.10: Unable to handle kernel NULL pointer dereference at virtual address 00000004 printing eip: c01d4b6d *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: radeon button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000 CPU: 1 EIP: 0060:[<c01d4b6d>] Not tainted VLI EFLAGS: 00210246 (2.6.10-ssi-1.9.2-jh-3) EIP is at ssi_semexit+0x3d/0xd0 eax: f5005188 ebx: 00060b37 ecx: 00000000 edx: f665d2c0 esi: f5005188 edi: f6b48b80 ebp: f6d5fdcc esp: f6d5fdb0 ds: 007b es: 007b ss: 0068 Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250) Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80 f6d5fe5c c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000 f6d5fe0c f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80 f6d5fe48 Call Trace: [<c010671f>] show_stack+0x7f/0xa0 [<c01068c4>] show_registers+0x164/0x230 
[<c0106c74>] die+0xf4/0x1c0 [<c011f56d>] do_page_fault+0x48d/0x689 [<c0106383>] error_code+0x2b/0x30 [<c01d474e>] exit_sem+0x15e/0x190 [<c012a619>] do_exit+0x159/0x4f0 [<c012aa7a>] do_group_exit+0x3a/0xc0 [<c0135163>] get_signal_to_deliver+0x233/0x360 [<c0105590>] do_signal+0x70/0x150 [<c01056c7>] do_notify_resume+0x57/0x8c [<c0105866>] work_notifysig+0x13/0x15 Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d 50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7 04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops due to oops @ 0xc01d4b6d eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c [1]kdb> Stack traceback for pid 396087 0xf7e21250 396087 1 1 1 R 0xf7e21430 *totem EBP EIP Function (args) 0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88, 0x5d, 0xf720c660) 0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84, 0xf7e21718) 0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000) 0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000, 0xf6d5f000) 0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8, 0xf6d5ffc4, 0x0, 0x200282) 0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010, 0xb71e37b0) 0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57 0xc0105866 work_notifysig+0x13 [1]kdb> ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-21 15:38 Message: Logged In: YES user_id=166336 Originator: YES I'm having some difficulty reproducing this problem after a reboot. I've hacked some debugging printf's into the kernel I'm using and will add any new info when/if I find it. 
---------------------------------------------------------------------- Comment By: Roger Tsang (rogertsang) Date: 2008-04-17 12:56 Message: Logged In: YES user_id=1246761 Originator: NO Can this be reproduced in the original compiled kernel from the latest binary release? Apache uses IPC semaphores and I have not run into this bug on UP/SMP. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 12:21 Message: Logged In: YES user_id=166336 Originator: YES Well, since "ripc_drop_locks" is for shared memory, not semaphores, it's probably a different bug. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 10:08 Message: Logged In: YES user_id=166336 Originator: YES Another BUG in the semaphore code - may indicate the underlying cause of the problem? It's trying to unlock a lock that isn't locked. ------------[ cut here ]------------ kernel BUG at include/asm/spinlock.h:112! 
invalid operand: 0000 [#1] SMP Modules linked in: i915 drm button ac battery parport_pc parport floppy pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c046290b>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at _spin_unlock+0x1b/0x30 eax: 00000001 ebx: c0750140 ecx: c0750101 edx: f7e12e08 esi: f70c2400 edi: c0753360 ebp: f7032f10 esp: f7032f10 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930) Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44 c0261dd5 02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708 00000003 f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002 f7032fec Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c01cecbb>] ipc_unlock+0xb/0x10 [<c01ce77e>] ipc_drop_locks+0x1e/0x40 [<c0261dd5>] ripc_drop_locks+0x45/0x60 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10 Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78 04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00 1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90 ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-16 13:22 Message: Logged In: YES user_id=166336 Originator: YES Here's another example, this time it was going through the local exit_sem path: ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! 
invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d3f29>] Not tainted VLI EFLAGS: 00210202 (2.6.11-jh-1) EIP is at exit_sem+0x229/0x2b0 eax: 00000001 ebx: c597e808 ecx: 00000001 edx: c597e808 esi: 000e800c edi: cbf682e0 ebp: d76fce6c esp: d76fcdd0 ds: 007b es: 007b ss: 0068 Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0) Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80 d1e12ee4 d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4 defe7380 0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c c0125456 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c012a319>] do_exit+0xb9/0x3b0 [<c012a68c>] do_group_exit+0x3c/0xb0 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310 [<c01057c4>] do_signal+0x74/0x140 [<c0105917>] do_notify_resume+0x87/0x8c [<c0105a86>] work_notifysig+0x13/0x15 Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89 1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07 8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8 So it's not to do with local/remote semaphores. ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2008-04-15 15:55 Message: Logged In: NO I'm sorry Roger, I don't get the point of your patch. 
I suppose the relevant bit is: @@ -2027,7 +2025,7 @@ namesvr_semexit_go: continue; } - __ssi_semexit(semid, current->tgid, sma); + __ssi_semexit(u->semid, current->tgid, sma); } } so if semid has been changed to be bad, or if u->semid was bad and has changed to be good we won't panic. I can't see how either of these conditions can happen. Also the call path that seems to be causing problems seems to be (from the trace above): [client node] exit_sem -> cli_ripc_semexit [server node] [...] svr_ripc_semexit -> ripc_semexit -> ssi_semexit -> __ssi_semexit and your patch touches the exit_sem -> __ssi_semexit path. I must admit I'm pretty surprised to see that the client/server stuff is being used - I thought everything was staying node-local. Maybe the fix is simply to bail out of __ssi_semexit if sem_checkid doesn't match? Couldn't it just indicate a sem_exit/IPC_RMID collision? ---------------------------------------------------------------------- Comment By: Roger Tsang (rogertsang) Date: 2008-04-15 03:16 Message: Logged In: YES user_id=1246761 Originator: NO Maybe semid changed before exit_sem() got sem_lock(). Try attached patch. File Added: ipc_sem.c.patch ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-14 10:24 Message: Logged In: YES user_id=166336 Originator: YES Here's the code: static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array *sma) { int nsems, i; struct sem_semundo *un, **unp; BUG_ON(sem_checkid(sma,semid)); ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 |
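The id arithmetic in the debugging output of the 2008-06-02 17:30 comment can be checked by hand. The following is only a sketch of that arithmetic, not the kernel code: it assumes the sequence multiplier is 32768 (as the "360448 / 32768 != 12" log line implies), and `build_id`, `check_id` and `replay_log` are hypothetical stand-ins for whatever sem_buildid and the id check actually do.

```c
#include <assert.h>
#include <stdio.h>

/* Assumption: 32768 is the sequence multiplier, as implied by the
 * "ipc_checkid: 360448 / 32768 != 12" line in the debug output. */
#define SEQ_MULTIPLIER 32768

/* Hypothetical stand-in for sem_buildid(): pack slot index and sequence. */
static int build_id(int index, int seq)
{
    return SEQ_MULTIPLIER * seq + index;
}

/* Hypothetical stand-in for the id check: nonzero means the sequence
 * encoded in the id differs from the one currently stored in the slot. */
static int check_id(int id, int slot_seq)
{
    return id / SEQ_MULTIPLIER != slot_seq;
}

/* Replays the two creations from the log and checks both ids against
 * the sequence of the second semaphore. */
static void replay_log(void)
{
    int first = build_id(0, 11);   /* semaphore created, then removed */
    int second = build_id(0, 12);  /* re-created in the same slot */

    assert(first == 360448);
    assert(second == 393216);
    printf("stale id %d fails check: %d\n", first, check_id(first, 12));
    printf("fresh id %d fails check: %d\n", second, check_id(second, 12));
}
```

Calling `replay_log()` from a `main()` shows the stale id failing the check while the fresh id passes, which matches the point in the log where the UNDO actions hit the BUG.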
From: SourceForge.net <no...@so...> - 2008-06-03 10:55:36
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj ---------------------------------------------------------------------- >Comment By: John Hughes (hughesj) Date: 2008-06-03 12:55 Message: Logged In: YES user_id=166336 Originator: YES Ok, here's the bug: Someone creates a semaphore Process (A) operates on it, creating a sem_undo structure for itself (on it's own node) and a sem_semundo for the semaphore (on the semaphore's node). Process (B), (where A !=B) removes the semaphore, cleaning up the sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A. Someone creates a new semaphore that happens to get the same index (but a different sequence) from the original semaphore. Process (A) exits - when we try to clean up its sem_undo structure sem_checkid fails because the sequence numbers don't match. Attached test program that crashes the system. |
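The scenario in the 2008-06-03 12:55 comment can be modelled in user space. What follows is a toy model only: it is not OpenSSI code and not the attached test program, and every name in it (`struct slot`, `struct undo_rec`, the `toy_*` functions) is hypothetical. It assumes the 32768 multiplier from the debug log, and it illustrates the "bail out if the id check doesn't match" idea raised in the 2008-04-15 comment, which is one possible approach rather than a confirmed fix.

```c
#include <assert.h>

#define SEQ_MULTIPLIER 32768   /* assumption, taken from the debug log */

/* Toy model of one semaphore slot on the semaphore's node. */
struct slot {
    int in_use;
    int seq;      /* sequence number of the current occupant */
    int value;    /* semaphore value */
};

/* Toy model of a per-process undo record kept on the process's node. */
struct undo_rec {
    int semid;    /* id the process used when it did its operations */
    int adjust;   /* amount to add back at exit */
};

static int next_seq = 11;

/* Create a semaphore in the slot (index 0) and return its id. */
static int toy_create(struct slot *s)
{
    s->in_use = 1;
    s->seq = next_seq++;
    s->value = 1;
    return SEQ_MULTIPLIER * s->seq;
}

/* Remove the semaphore; no process-side undo record is touched. */
static void toy_remove(struct slot *s)
{
    s->in_use = 0;
}

/* Exit-time cleanup: returns 1 if the undo was applied, 0 if it was
 * skipped because the record refers to a semaphore that is gone. */
static int toy_exit_undo(struct slot *s, const struct undo_rec *u)
{
    if (!s->in_use || u->semid / SEQ_MULTIPLIER != s->seq)
        return 0;             /* stale record: skip instead of BUG */
    s->value += u->adjust;
    return 1;
}

/* Replays the scenario: A operates, B removes, slot is reused, A exits. */
static int toy_scenario(void)
{
    struct slot s = {0, 0, 0};
    struct undo_rec a_undo;
    int new_id;

    a_undo.semid = toy_create(&s);   /* process A's semaphore */
    a_undo.adjust = 1;               /* A did a -1 with undo requested */
    s.value -= 1;

    toy_remove(&s);                  /* process B removes it */
    new_id = toy_create(&s);         /* same slot, next sequence */
    assert(new_id != a_undo.semid);  /* same index, different id */

    return toy_exit_undo(&s, &a_undo);  /* A exits with a stale record */
}
```

In this model `toy_scenario()` returns 0: the stale record is detected and skipped, and the value of the new semaphore is left alone. Whether silently skipping is acceptable in the real code, or whether the process-side record should instead be cleaned up when the semaphore is removed, is not settled by the thread.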
From: SourceForge.net <no...@so...> - 2008-06-03 11:04:02
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj ---------------------------------------------------------------------- >Comment By: John Hughes (hughesj) Date: 2008-06-03 13:04 Message: Logged In: YES user_id=166336 Originator: YES File Added: semcrash.c ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2008-04-15 15:55 Message: Logged In: NO I'm sorry Roger, I don't get the point of your patch. 
I suppose the relevant bit is: @@ -2027,7 +2025,7 @@ namesvr_semexit_go: continue; } - __ssi_semexit(semid, current->tgid, sma); + __ssi_semexit(u->semid, current->tgid, sma); } } so if semid has been changed to be bad, or if u->semid was bad and has changed to be good we won't panic. I can't see how either of these conditions can happen. Also the call path that seems to be causing problems seems to be (from the trace above): [client node] exit_sem cli_ripc_semexit [server node] [...] svr_ripc_semexit ripc_semexit ssi_semexit __ssi_semexit and your patch touches the exit_sem __ssi_semexit path. I must admit I'm pretty surprised to see that the client/server stuff is being used - I thought everything was staying node-local. Maybe the fix is simply to bail out of __ssi_semexit if check_semid doesn't match? Couldn't it just indicate a sem_exit/IPC_RMID collision? ---------------------------------------------------------------------- Comment By: Roger Tsang (rogertsang) Date: 2008-04-15 03:16 Message: Logged In: YES user_id=1246761 Originator: NO Maybe semid changed before exit_sem() got sem_lock(). Try attached patch. File Added: ipc_sem.c.patch ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-14 10:24 Message: Logged In: YES user_id=166336 Originator: YES Here's the code: static inline void __ssi_semexit(int semid, pid_t pid, struct sem_array *sma) { int nsems, i; struct sem_semundo *un, **unp; BUG_ON(sem_checkid(sma,semid)); ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 |
From: SourceForge.net <no...@so...> - 2008-06-03 11:41:23
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 Category: IPC Group: v1.9.3 Status: Open Resolution: None Priority: 5 Private: No Submitted By: John Hughes (hughesj) Assigned to: Nobody/Anonymous (nobody) Summary: kernel BUG @ ipc/semc:1931 ---------------------------------------------------------------------- >Comment By: John Hughes (hughesj) Date: 2008-06-03 13:41 Message: Logged In: YES user_id=166336 Originator: YES (of course in my message below I meant: "Process (B), (where A != B) removes the semaphore, cleaning up the sem_semundo for the semaphore, BUT NOT THE sem_undo for process A." ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 |
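The ID arithmetic behind the sem_buildid and ipc_checkid lines in the debugging output quoted in this thread can be sketched in a few lines of user-space C. This is only an illustration, not the OpenSSI kernel code: struct sem_slot, build_id, check_id and undo_on_exit are hypothetical names, and the multiplier 32768 is an assumption inferred from the "360448 / 32768 != 12" log line. The last function shows the idea floated in the thread of treating a failed ID check at exit time as a stale undo entry rather than a BUG.

```c
#include <assert.h>

/* Assumption: 32768 slots per sequence step, inferred from the
 * "360448 / 32768" line in the debugging output. */
#define SEQ_MULTIPLIER 32768

/* Hypothetical stand-in for the slot that a semaphore set occupies. */
struct sem_slot {
    int seq;    /* sequence number of the set currently in the slot */
};

/* Compose a user-visible ID from a slot index and a sequence number. */
static int build_id(int index, int seq)
{
    assert(index >= 0 && index < SEQ_MULTIPLIER);
    return SEQ_MULTIPLIER * seq + index;
}

/* Nonzero when the ID was issued for a different occupant of the slot. */
static int check_id(const struct sem_slot *slot, int id)
{
    return id / SEQ_MULTIPLIER != slot->seq;
}

/* Exit-time handling of one undo entry: returns 1 if the undo would be
 * applied, 0 if it is skipped as stale (instead of hitting a BUG). */
static int undo_on_exit(const struct sem_slot *slot, int undo_semid)
{
    if (check_id(slot, undo_semid))
        return 0;   /* set was removed and the slot reused: ignore */
    /* ... real code would adjust the semaphore values here ... */
    return 1;
}
```

With these definitions build_id(0, 11) is 360448 and build_id(0, 12) is 393216, the two IDs seen in the log. An undo entry recorded for 360448 fails check_id against a slot whose sequence is now 12, so undo_on_exit skips it.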
From: SourceForge.net <no...@so...> - 2008-06-04 11:47:54
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: IPC Group: v1.9.3 Status: Open Resolution: None Priority: 5 Private: No Submitted By: John Hughes (hughesj) Assigned to: Nobody/Anonymous (nobody) Summary: kernel BUG @ ipc/semc:1931 Initial Comment: Seen this one a couple of times: Kills the keyboard, eventually node dies. Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores). ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d447c>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at ssi_semexit+0xfc/0x110 eax: 00000001 ebx: 0005800a ecx: 00000002 edx: e59f3f88 esi: e59f3f88 edi: 00030e83 ebp: f721fe64 esp: f721fe44 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70) Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360 f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000 00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c02614f1>] ripc_semexit+0x31/0x50 [<c0256fb3>] 
svr_ripc_semexit+0xa3/0x100 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10 Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55 ---------------------------------------------------------------------- >Comment By: John Hughes (hughesj) Date: 2008-06-04 13:47 Message: Logged In: YES user_id=166336 Originator: YES Ok, the patch is pretty simple - if sem_checkid fails it's not a bug, it's just a stale undo operation, which can be ignored. (This will screw up if the semaphore sequence number wraps around during the lifetime of a process that uses semaphores that are removed and recreated, but the sequence number can go up to 2^(32-15) so I doubt that will happen often). File Added: sem.c.patch ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-03 13:41 Message: Logged In: YES user_id=166336 Originator: YES (of course in my message below I meant: "Process (B), (where A != B) removes the semaphore, cleaning up the sem_semundo for the semaphore, BUT NOT THE sem_undo for process A.") ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-03 13:04 Message: Logged In: YES user_id=166336 Originator: YES File Added: semcrash.c ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-03 12:55 Message: Logged In: YES user_id=166336 Originator: YES Ok, here's the bug: Someone creates a semaphore. Process (A) operates on it, creating a sem_undo structure for itself (on its own node) and a sem_semundo for the semaphore (on the semaphore's node). Process (B), (where A != B) removes the semaphore, cleaning up the sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.
Someone creates a new semaphore that happens to get the same index (but a different sequence) as the original semaphore. Process (A) exits - when we try to clean up its sem_undo structure sem_checkid fails because the sequence numbers don't match. Attached test program that crashes the system. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-02 18:09 Message: Logged In: YES user_id=166336 Originator: YES Nah, that's not the bug - sma gets freed before freeary returns, so who cares if it has dangling pointers. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-02 17:33 Message: Logged In: YES user_id=166336 Originator: YES Can it be as simple as that? Look at the code in freeary: #ifdef CONFIG_SSI for (un = sma->undo; un;) { u = un; un = u->id_next; kfree(u); } #else sma->undo is left pointing to freed memory. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-02 17:30 Message: Logged In: YES user_id=166336 Originator: YES Here's the sequence of operations that causes the crash: Totem makes a semaphore, ups and downs it a few times; then removes it and recreates it; carries on upping and downing. When totem exits it tries to undo the ops on the 1st semaphore - but the sequence is now that of the 2nd one. Here's the output of some debugging printks I stuck in my kernel: sys_semget: key=56a4d5 nsems=1 flags=3b0 ... totem creates the semaphore cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1 ... so we ask the nameserver for its ID ipcname_getid newid=360448, create=1 ... it doesn't exist so we must create it sem_buildid (id=0, seq=11) = 360448 ... create is now done semctl_down: IPC_RMID 360448 ... now totem deletes the semaphore freeary id=360448 cli_ipcname_rmid id=360448 service=1 ...
so we inform the ipc nameserver sys_semget: key=56a4d5 nsems=1 flags=3b0 ... totem re-creates the semaphore cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1 ... we ask for its ID ipcname_getid newid=393216, create=1 ... it doesn't exist so we must re-create it sem_buildid (id=0, seq=12) = 393216 ... create done ipc_checkid: 360448 / 32768 != 12 ... later on totem exits, so we try to perform the UNDO actions, but we've got the wrong sequence. ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1937! invalid operand: 0000 [#1] ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-05-30 16:30 Message: Logged In: YES user_id=166336 Originator: YES Well well well. I can now reproduce this bug - launch totem (gnome movie player) on an .mp3 file, quit totem - crash! In fact I've seen this bug before - the first time on a 2.6.10 based kernel. Here's the trace I got from 2.6.10: Unable to handle kernel NULL pointer dereference at virtual address 00000004 printing eip: c01d4b6d *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: radeon button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000 CPU: 1 EIP: 0060:[<c01d4b6d>] Not tainted VLI EFLAGS: 00210246 (2.6.10-ssi-1.9.2-jh-3) EIP is at ssi_semexit+0x3d/0xd0 eax: f5005188 ebx: 00060b37 ecx: 00000000 edx: f665d2c0 esi: f5005188 edi: f6b48b80 ebp: f6d5fdcc esp: f6d5fdb0 ds: 007b es: 007b ss: 0068 Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250) Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80 f6d5fe5c c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000 f6d5fe0c f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80 f6d5fe48 Call Trace: [<c010671f>] show_stack+0x7f/0xa0 [<c01068c4>] show_registers+0x164/0x230
[<c0106c74>] die+0xf4/0x1c0 [<c011f56d>] do_page_fault+0x48d/0x689 [<c0106383>] error_code+0x2b/0x30 [<c01d474e>] exit_sem+0x15e/0x190 [<c012a619>] do_exit+0x159/0x4f0 [<c012aa7a>] do_group_exit+0x3a/0xc0 [<c0135163>] get_signal_to_deliver+0x233/0x360 [<c0105590>] do_signal+0x70/0x150 [<c01056c7>] do_notify_resume+0x57/0x8c [<c0105866>] work_notifysig+0x13/0x15 Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d 50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7 04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops due to oops @ 0xc01d4b6d eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 0xf6d5fd7c [1]kdb> Stack traceback for pid 396087 0xf7e21250 396087 1 1 1 R 0xf7e21430 *totem EBP EIP Function (args) 0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88, 0x5d, 0xf720c660) 0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84, 0xf7e21718) 0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000) 0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000, 0xf6d5f000) 0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8, 0xf6d5ffc4, 0x0, 0x200282) 0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010, 0xb71e37b0) 0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57 0xc0105866 work_notifysig+0x13 [1]kdb> ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-21 15:38 Message: Logged In: YES user_id=166336 Originator: YES I'm having some difficulty reproducing this problem after a reboot. I've hacked some debugging printf's into the kernel I'm using and will add any new info when/if I find it.
---------------------------------------------------------------------- Comment By: Roger Tsang (rogertsang) Date: 2008-04-17 12:56 Message: Logged In: YES user_id=1246761 Originator: NO Can this be reproduced in the original compiled kernel from the latest binary release? Apache uses IPC semaphores and have not run into this bug on UP/SMP. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 12:21 Message: Logged In: YES user_id=166336 Originator: YES Well, since "ripc_drop_locks" is for shared memory not semaphores it's probably a different bug. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 10:08 Message: Logged In: YES user_id=166336 Originator: YES Another BUG in the semaphore code - may indicate the underlying cause of the problem? It's trying to unlock a lock that isn't locked. ------------[ cut here ]------------ kernel BUG at include/asm/spinlock.h:112! 
invalid operand: 0000 [#1] SMP Modules linked in: i915 drm button ac battery parport_pc parport floppy pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c046290b>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at _spin_unlock+0x1b/0x30 eax: 00000001 ebx: c0750140 ecx: c0750101 edx: f7e12e08 esi: f70c2400 edi: c0753360 ebp: f7032f10 esp: f7032f10 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930) Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44 c0261dd5 02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708 00000003 f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002 f7032fec Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c01cecbb>] ipc_unlock+0xb/0x10 [<c01ce77e>] ipc_drop_locks+0x1e/0x40 [<c0261dd5>] ripc_drop_locks+0x45/0x60 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10 Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78 04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00 1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90 ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-16 13:22 Message: Logged In: YES user_id=166336 Originator: YES Here's another example, this time it was going through the local exit_sem path: ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! 
invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d3f29>] Not tainted VLI EFLAGS: 00210202 (2.6.11-jh-1) EIP is at exit_sem+0x229/0x2b0 eax: 00000001 ebx: c597e808 ecx: 00000001 edx: c597e808 esi: 000e800c edi: cbf682e0 ebp: d76fce6c esp: d76fcdd0 ds: 007b es: 007b ss: 0068 Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0) Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80 d1e12ee4 d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4 defe7380 0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c c0125456 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c012a319>] do_exit+0xb9/0x3b0 [<c012a68c>] do_group_exit+0x3c/0xb0 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310 [<c01057c4>] do_signal+0x74/0x140 [<c0105917>] do_notify_resume+0x87/0x8c [<c0105a86>] work_notifysig+0x13/0x15 Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89 1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07 8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8 So it's not to do with local/remote semaphores. ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2008-04-15 15:55 Message: Logged In: NO I'm sorry Roger, I don't get the point of your patch. 
I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
                        continue;
                }
-               __ssi_semexit(semid, current->tgid, sma);
+               __ssi_semexit(u->semid, current->tgid, sma);
        }
 }

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic. I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
exit_sem
cli_ripc_semexit
[server node]
[...]
svr_ripc_semexit
ripc_semexit
ssi_semexit
__ssi_semexit

and your patch touches the exit_sem __ssi_semexit path.

I must admit I'm pretty surprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match? Couldn't it just indicate a sem_exit/IPC_RMID collision?

----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock(). Try attached patch.

File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES
user_id=166336
Originator: YES

Here's the code:

static inline void
__ssi_semexit(int semid, pid_t pid, struct sem_array *sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541
|
From: SourceForge.net <no...@so...> - 2008-06-04 13:35:57
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: IPC Group: v1.9.3 Status: Open Resolution: None Priority: 5 Private: No Submitted By: John Hughes (hughesj) Assigned to: Nobody/Anonymous (nobody) Summary: kernel BUG @ ipc/semc:1931 Initial Comment: Seen this one a couple of times: Kills the keyboard, eventually node dies. Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores). ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d447c>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at ssi_semexit+0xfc/0x110 eax: 00000001 ebx: 0005800a ecx: 00000002 edx: e59f3f88 esi: e59f3f88 edi: 00030e83 ebp: f721fe64 esp: f721fe44 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70) Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360 f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000 00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c02614f1>] ripc_semexit+0x31/0x50 [<c0256fb3>] 
svr_ripc_semexit+0xa3/0x100 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10 Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55 ---------------------------------------------------------------------- >Comment By: John Hughes (hughesj) Date: 2008-06-04 15:35 Message: Logged In: YES user_id=166336 Originator: YES Actually it doesn't matter if wrapping sequence numbers give us a collision - the proc sem_undo only gets done if the semaphore has a matching sem_semundo. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-04 13:47 Message: Logged In: YES user_id=166336 Originator: YES Ok, the patch is pretty simple - if sem_checkid fails it's not a bug, it's just a stale undo operation, which can be ignored. (This will screw up if the semaphore sequence number wraps around during the lifetime of a process that uses semaphores that are removed and recreated, but the sequence number can go up to 2^(32-15) so I doubt that will happen often). File Added: sem.c.patch ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-03 13:41 Message: Logged In: YES user_id=166336 Originator: YES (of course in my message below I meant: "Process (B), (where A != B) removes the semaphore, cleaning up the sem_semundo for the semaphore, BUT NOT THE sem_undo for process A." 
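For readers following the sequence-number discussion, here is a minimal sketch of the id arithmetic that the debug output later in this thread relies on ("sem_buildid (id=0, seq=11) = 360448" and "ipc_checkid: 360448 / 32768 != 12"). The function names build_id and check_id are hypothetical stand-ins, not the kernel's own code, and the multiplier 32768 is taken from that debug line rather than from the SSI sources.

```c
#include <assert.h>

/* Assumption: 32768 is the divisor printed by the ipc_checkid debug line
 * in this thread; the macro name is only illustrative. */
#define SEQ_MULTIPLIER 32768

/* Compose a user-visible IPC id from a table index and a sequence number. */
int build_id(int index, int seq)
{
    return SEQ_MULTIPLIER * seq + index;
}

/* Return 0 if `id` still matches the object's current sequence number,
 * non-zero if the id is stale (the object was removed and recreated). */
int check_id(int id, int current_seq)
{
    return (id / SEQ_MULTIPLIER) != current_seq;
}

/* Re-derive the values quoted in the thread's debug output. */
void check_thread_values(void)
{
    int first = build_id(0, 11);   /* id of the first semaphore  */
    int second = build_id(0, 12);  /* id after remove + recreate */

    assert(first == 360448);
    assert(second == 393216);
    assert(check_id(first, 12) != 0);   /* old id no longer matches */
    assert(check_id(second, 12) == 0);  /* new id does */
}
```

Calling check_thread_values() from a small main() is enough to try it out. The point of the sketch is only that the same index with a different sequence yields a different id, which is exactly the mismatch the check reports.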
----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:04

Message:
Logged In: YES
user_id=166336
Originator: YES

File Added: semcrash.c

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 12:55

Message:
Logged In: YES
user_id=166336
Originator: YES

Ok, here's the bug:

Someone creates a semaphore.

Process (A) operates on it, creating a sem_undo structure for itself (on
its own node) and a sem_semundo for the semaphore (on the semaphore's
node).

Process (B), (where A != B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.

Someone creates a new semaphore that happens to get the same index (but a
different sequence) as the original semaphore.

Process (A) exits - when we try to clean up its sem_undo structure
sem_checkid fails because the sequence numbers don't match.

Attached test program that crashes the system.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 18:09

Message:
Logged In: YES
user_id=166336
Originator: YES

Nah, that's not the bug - sma gets freed before freeary returns, so who
cares if it has dangling pointers.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:33

Message:
Logged In: YES
user_id=166336
Originator: YES

Can it be as simple as that?

Look at the code in freeary:

#ifdef CONFIG_SSI
        for (un = sma->undo; un;) {
                u = un;
                un = u->id_next;
                kfree(u);
        }
#else

sma->undo is left pointing to free'd memory.
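The create / operate / remove / recreate / exit sequence described in the 2008-06-03 12:55 comment can be modelled in a few lines of user-space C. This is a toy model only: it is not the attached semcrash.c (which is not reproduced in the thread) and not the kernel code. Every toy_* name, the struct layouts and the multiplier 32768 are assumptions made for illustration. The exit path returns "stale, skipped" where the original code hit BUG_ON, which mirrors the approach described in the 2008-06-04 13:47 comment.

```c
#include <assert.h>

#define SEQ_MULTIPLIER 32768  /* assumed, from the thread's ipc_checkid output */

/* Toy stand-in for one slot of the semaphore table (hypothetical layout). */
struct toy_sem {
    int in_use;
    int seq;             /* changes each time the slot is (re)created */
    int sem_side_undos;  /* models the sem_semundo list on the semaphore */
};

/* Toy stand-in for a process's sem_undo record (hypothetical layout). */
struct toy_proc_undo {
    int valid;
    int semid;           /* id captured when the process did its operation */
};

enum toy_exit_result { TOY_NOTHING_TO_UNDO, TOY_UNDO_APPLIED, TOY_STALE_SKIPPED };

/* Create (or recreate) the semaphore in slot index 0 with the given sequence. */
int toy_create(struct toy_sem *slot, int seq)
{
    slot->in_use = 1;
    slot->seq = seq;
    slot->sem_side_undos = 0;
    return SEQ_MULTIPLIER * seq;  /* index 0 */
}

/* Process A operates with undo: state is recorded on both sides. */
void toy_semop_undo(struct toy_sem *slot, struct toy_proc_undo *pu, int semid)
{
    pu->valid = 1;
    pu->semid = semid;
    slot->sem_side_undos++;
}

/* Process B removes the semaphore: only the semaphore-side list is cleaned,
 * the process-side record of A is left behind. */
void toy_rmid(struct toy_sem *slot)
{
    slot->sem_side_undos = 0;
    slot->in_use = 0;
}

/* Process exit: a sequence mismatch is treated as a stale undo and skipped. */
enum toy_exit_result toy_exit(const struct toy_sem *slot,
                              const struct toy_proc_undo *pu)
{
    if (!pu->valid)
        return TOY_NOTHING_TO_UNDO;
    if (!slot->in_use || pu->semid / SEQ_MULTIPLIER != slot->seq)
        return TOY_STALE_SKIPPED;  /* the unpatched code would BUG here */
    return TOY_UNDO_APPLIED;
}

/* Run the five steps from the comment and report what exit sees. */
enum toy_exit_result toy_run_scenario(void)
{
    struct toy_sem slot = {0};
    struct toy_proc_undo a = {0};

    int old_id = toy_create(&slot, 11);  /* someone creates a semaphore    */
    toy_semop_undo(&slot, &a, old_id);   /* process A operates on it       */
    toy_rmid(&slot);                     /* process B removes it           */
    int new_id = toy_create(&slot, 12);  /* same index, new sequence       */
    assert(new_id != old_id);
    return toy_exit(&slot, &a);          /* process A exits                */
}
```

In this model toy_run_scenario() returns TOY_STALE_SKIPPED, while a process that exits without any intervening removal gets TOY_UNDO_APPLIED. Whether the real patch does anything beyond skipping the stale record is not stated in the thread.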
----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:

Totem makes a semaphore, ups and downs it a few times; then removes it and
recreates it; carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Here's the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
... so we ask the nameserver for its ID

ipcname_getid newid=360448, create=1
... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
... create is now done

semctl_down: IPC_RMID 360448
... now totem deletes the semaphore

freeary id=360448
cli_ipcname_rmid id=360448 service=1
... so we inform the ipc nameserver

sys_semget: key=56a4d5 nsems=1 flags=3b0
... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
... we ask for its ID

ipcname_getid newid=393216, create=1
... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
... create done

ipc_checkid: 360448 / 32768 != 12
... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES
user_id=166336
Originator: YES

Well well well.

I can now reproduce this bug - launch totem (gnome movie player) on an
.mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.
Here's the trace I got from 2.6.10: Unable to handle kernel NULL pointer dereference at virtual address 00000004 printing eip: c01d4b6d *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: radeon button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000 CPU: 1 EIP: 0060:[<c01d4b6d>] Not tainted VLI EFLAGS: 00210246 (2.6.10-ssi-1.9.2-jh-3) EIP is at ssi_semexit+0x3d/0xd0 eax: f5005188 ebx: 00060b37 ecx: 00000000 edx: f665d2c0 esi: f5005188 edi: f6b48b80 ebp: f6d5fdcc esp: f6d5fdb0 ds: 007b es: 007b ss: 0068 Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250) Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80 f6d5fe5c c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000 f6d5fe0c f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80 f6d5fe48 Call Trace: [<c010671f>] show_stack+0x7f/0xa0 [<c01068c4>] show_registers+0x164/0x230 [<c0106c74>] die+0xf4/0x1c0 [<c011f56d>] do_page_fault+0x48d/0x689 [<c0106383>] error_code+0x2b/0x30 [<c01d474e>] exit_sem+0x15e/0x190 [<c012a619>] do_exit+0x159/0x4f0 [<c012aa7a>] do_group_exit+0x3a/0xc0 [<c0135163>] get_signal_to_deliver+0x233/0x360 [<c0105590>] do_signal+0x70/0x150 [<c01056c7>] do_notify_resume+0x57/0x8c [<c0105866>] work_notifysig+0x13/0x15 Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d 50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7 04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops due to oops @ 0xc01d4b6d eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 
0xf6d5fd7c [1]kdb> Stack traceback for pid 396087 0xf7e21250 396087 1 1 1 R 0xf7e21430 *totem EBP EIP Function (args) 0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88, 0x5d, 0xf720c660) 0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84, 0xf7e21718) 0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000) 0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000, 0xf6d5f000) 0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8, 0xf6d5ffc4, 0x0, 0x200282) 0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010, 0xb71e37b0) 0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57 0xc0105866 work_notifysig+0x13 [1]kdb> ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-21 15:38 Message: Logged In: YES user_id=166336 Originator: YES I'm having some difficulty reproducing this problem after a reboot. I've hacked some debugging printf's into the kernel I'm using and will add any new info when/if I find it. ---------------------------------------------------------------------- Comment By: Roger Tsang (rogertsang) Date: 2008-04-17 12:56 Message: Logged In: YES user_id=1246761 Originator: NO Can this be reproduced in the original compiled kernel from the latest binary release? Apache uses IPC semaphores and have not run into this bug on UP/SMP. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 12:21 Message: Logged In: YES user_id=166336 Originator: YES Well, since "ripc_drop_locks" is for shared memory not semaphores it's probably a different bug. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 10:08 Message: Logged In: YES user_id=166336 Originator: YES Another BUG in the semaphore code - may indicate the underlying cause of the problem? 
It's trying to unlock a lock that isn't locked. ------------[ cut here ]------------ kernel BUG at include/asm/spinlock.h:112! invalid operand: 0000 [#1] SMP Modules linked in: i915 drm button ac battery parport_pc parport floppy pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c046290b>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at _spin_unlock+0x1b/0x30 eax: 00000001 ebx: c0750140 ecx: c0750101 edx: f7e12e08 esi: f70c2400 edi: c0753360 ebp: f7032f10 esp: f7032f10 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930) Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44 c0261dd5 02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708 00000003 f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002 f7032fec Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c01cecbb>] ipc_unlock+0xb/0x10 [<c01ce77e>] ipc_drop_locks+0x1e/0x40 [<c0261dd5>] ripc_drop_locks+0x45/0x60 [<c0258708>] svr_ripc_drop_locks+0x58/0xb0 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10 Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78 04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00 1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90 ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-16 13:22 Message: Logged In: YES user_id=166336 Originator: YES Here's another example, this time it was going through the local exit_sem path: ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! 
invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d3f29>] Not tainted VLI EFLAGS: 00210202 (2.6.11-jh-1) EIP is at exit_sem+0x229/0x2b0 eax: 00000001 ebx: c597e808 ecx: 00000001 edx: c597e808 esi: 000e800c edi: cbf682e0 ebp: d76fce6c esp: d76fcdd0 ds: 007b es: 007b ss: 0068 Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0) Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80 d1e12ee4 d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4 defe7380 0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c c0125456 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c012a319>] do_exit+0xb9/0x3b0 [<c012a68c>] do_group_exit+0x3c/0xb0 [<c01350cf>] get_signal_to_deliver+0x1ff/0x310 [<c01057c4>] do_signal+0x74/0x140 [<c0105917>] do_notify_resume+0x87/0x8c [<c0105a86>] work_notifysig+0x13/0x15 Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89 1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07 8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8 So it's not to do with local/remote semaphores. ---------------------------------------------------------------------- Comment By: Nobody/Anonymous (nobody) Date: 2008-04-15 15:55 Message: Logged In: NO I'm sorry Roger, I don't get the point of your patch. 
I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
                        continue;
                }
-               __ssi_semexit(semid, current->tgid, sma);
+               __ssi_semexit(u->semid, current->tgid, sma);
        }
 }

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic. I can't see how either of these
conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
exit_sem
cli_ripc_semexit
[server node]
[...]
svr_ripc_semexit
ripc_semexit
ssi_semexit
__ssi_semexit

and your patch touches the exit_sem __ssi_semexit path.

I must admit I'm pretty surprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid
doesn't match? Couldn't it just indicate a sem_exit/IPC_RMID collision?

----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock(). Try attached patch.

File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES
user_id=166336
Originator: YES

Here's the code:

static inline void
__ssi_semexit(int semid, pid_t pid, struct sem_array *sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541
|
From: SourceForge.net <no...@so...> - 2008-10-19 10:40:06
|
Bugs item #1941808, was opened at 2008-04-14 09:57 Message generated for change (Comment added) made by hughesj You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=405834&aid=1941808&group_id=32541 Please note that this message will contain a full copy of the comment thread, including the initial issue submission, for this request, not just the latest update. Category: IPC Group: v1.9.3 >Status: Closed >Resolution: Fixed Priority: 5 Private: No Submitted By: John Hughes (hughesj) >Assigned to: John Hughes (hughesj) Summary: kernel BUG @ ipc/semc:1931 Initial Comment: Seen this one a couple of times: Kills the keyboard, eventually node dies. Possibly seeing it now 'cos I'm using the ALSA DMIX plugin on all my nodes (which uses semaphores). ------------[ cut here ]------------ kernel BUG at ipc/sem.c:1913! invalid operand: 0000 [#1] SMP Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000 CPU: 0 EIP: 0060:[<c01d447c>] Not tainted VLI EFLAGS: 00010202 (2.6.11-jh-1) EIP is at ssi_semexit+0xfc/0x110 eax: 00000001 ebx: 0005800a ecx: 00000002 edx: e59f3f88 esi: e59f3f88 edi: 00030e83 ebp: f721fe64 esp: f721fe44 ds: 007b es: 007b ss: 0068 Process icssvr_daemon (pid: 262671, threadinfo=f721f000 task=f725ed70) Stack: c07500c0 e59f3f88 0005800a dfdd8580 f721fe68 f721fe74 f7219400 c0753360 f721feb8 c02614f1 0005800a 00030e83 00000004 0004020f 00000000 00000000 00000000 00000000 00000000 0004020f 0004020f 0004020f 00100001 00000000 Call Trace: [<c010694f>] show_stack+0x7f/0xa0 [<c0106b04>] show_registers+0x164/0x220 [<c0106e94>] die+0xf4/0x1c0 [<c0107015>] do_trap+0xb5/0xc0 [<c01072cc>] do_invalid_op+0xbc/0xd0 [<c01065a3>] error_code+0x2b/0x30 [<c02614f1>] ripc_semexit+0x31/0x50 [<c0256fb3>] 
svr_ripc_semexit+0xa3/0x100 [<c020abb3>] icssvr_daemon+0x2f3/0xab0 [<c01023a5>] kernel_thread_helper+0x5/0x10 Code: 8b 80 8c 00 00 00 89 41 04 43 39 fb 7c c1 a1 90 0d 74 c0 89 46 30 89 34 24 e8 c1 d9 ff ff e9 79 ff ff ff c7 01 00 00 00 00 eb bf <0f> 0b 79 07 8f b8 49 c0 e9 3d ff ff ff 8d b4 26 00 00 00 00 55 ---------------------------------------------------------------------- >Comment By: John Hughes (hughesj) Date: 2008-10-19 12:39 Message: Fixed in cvs ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-04 15:35 Message: Logged In: YES user_id=166336 Originator: YES Actually it doesn't matter if wrapping sequence numbers give us a collision - the proc sem_undo only gets done if the semaphore has a matching sem_semundo. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-04 13:47 Message: Logged In: YES user_id=166336 Originator: YES Ok, the patch is pretty simple - if sem_checkid fails it's not a bug, it's just a stale undo operation, which can be ignored. (This will screw up if the semaphore sequence number wraps around during the lifetime of a process that uses semaphores that are removed and recreated, but the sequence number can go up to 2^(32-15) so I doubt that will happen often). File Added: sem.c.patch ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-06-03 13:41 Message: Logged In: YES user_id=166336 Originator: YES (of course in my message below I meant: "Process (B), (where A != B) removes the semaphore, cleaning up the sem_semundo for the semaphore, BUT NOT THE sem_undo for process A." 
----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 13:04

Message:
Logged In: YES
user_id=166336
Originator: YES

File Added: semcrash.c

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-03 12:55

Message:
Logged In: YES
user_id=166336
Originator: YES

Ok, here's the bug:

Someone creates a semaphore.

Process (A) operates on it, creating a sem_undo structure for itself (on
its own node) and a sem_semundo for the semaphore (on the semaphore's
node).

Process (B), (where A != B) removes the semaphore, cleaning up the
sem_semundo for the semaphore, BUT NOT THE sem_semundo for process A.

Someone creates a new semaphore that happens to get the same index (but a
different sequence) as the original semaphore.

Process (A) exits - when we try to clean up its sem_undo structure
sem_checkid fails because the sequence numbers don't match.

Attached test program that crashes the system.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 18:09

Message:
Logged In: YES
user_id=166336
Originator: YES

Nah, that's not the bug - sma gets freed before freeary returns, so who
cares if it has dangling pointers.

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:33

Message:
Logged In: YES
user_id=166336
Originator: YES

Can it be as simple as that?

Look at the code in freeary:

#ifdef CONFIG_SSI
        for (un = sma->undo; un;) {
                u = un;
                un = u->id_next;
                kfree(u);
        }
#else

sma->undo is left pointing to free'd memory.
----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-06-02 17:30

Message:
Logged In: YES
user_id=166336
Originator: YES

Here's the sequence of operations that causes the crash:

Totem makes a semaphore, ups and downs it a few times; then removes it and
recreates it; carries on upping and downing.

When totem exits it tries to undo the ops on the 1st semaphore - but the
sequence is now that of the 2nd one.

Here's the output of some debugging printks I stuck in my kernel:

sys_semget: key=56a4d5 nsems=1 flags=3b0
... totem creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
... so we ask the nameserver for its ID

ipcname_getid newid=360448, create=1
... it doesn't exist so we must create it

sem_buildid (id=0, seq=11) = 360448
... create is now done

semctl_down: IPC_RMID 360448
... now totem deletes the semaphore

freeary id=360448
cli_ipcname_rmid id=360448 service=1
... so we inform the ipc nameserver

sys_semget: key=56a4d5 nsems=1 flags=3b0
... totem re-creates the semaphore

cli_ipcname_getid: key=56a4d5 service=1, node=6 server=1
... we ask for its ID

ipcname_getid newid=393216, create=1
... it doesn't exist so we must re-create it

sem_buildid (id=0, seq=12) = 393216
... create done

ipc_checkid: 360448 / 32768 != 12
... later on totem exits, so we try to perform the UNDO actions, but
we've got the wrong sequence.

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1937!
invalid operand: 0000 [#1]

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-05-30 16:30

Message:
Logged In: YES
user_id=166336
Originator: YES

Well well well.

I can now reproduce this bug - launch totem (gnome movie player) on an
.mp3 file, quit totem - crash!

In fact I've seen this bug before - the first time on a 2.6.10 based
kernel.
Here's the trace I got from 2.6.10: Unable to handle kernel NULL pointer dereference at virtual address 00000004 printing eip: c01d4b6d *pde = 00000000 Oops: 0000 [#1] SMP Modules linked in: radeon button ac battery parport_pc parport pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sd_mod aic7xxx scsi_mod tg3 e1000 CPU: 1 EIP: 0060:[<c01d4b6d>] Not tainted VLI EFLAGS: 00210246 (2.6.10-ssi-1.9.2-jh-3) EIP is at ssi_semexit+0x3d/0xd0 eax: f5005188 ebx: 00060b37 ecx: 00000000 edx: f665d2c0 esi: f5005188 edi: f6b48b80 ebp: f6d5fdcc esp: f6d5fdb0 ds: 007b es: 007b ss: 0068 Process totem (pid: 396087, threadinfo=f6d5f000 task=f7e21250) Stack: c0732040 00070000 f6d5fdd8 c0153177 00070000 f6d5f000 f6b48b80 f6d5fe5c c01d474e 00070000 00060b37 c015dc88 0000005d f720c660 00000000 f6d5fe0c f6b48bcc f6b48bc0 f720c660 f6c4dc58 00000006 f720c660 f580ac80 f6d5fe48 Call Trace: [<c010671f>] show_stack+0x7f/0xa0 [<c01068c4>] show_registers+0x164/0x230 [<c0106c74>] die+0xf4/0x1c0 [<c011f56d>] do_page_fault+0x48d/0x689 [<c0106383>] error_code+0x2b/0x30 [<c01d474e>] exit_sem+0x15e/0x190 [<c012a619>] do_exit+0x159/0x4f0 [<c012aa7a>] do_group_exit+0x3a/0xc0 [<c0135163>] get_signal_to_deliver+0x233/0x360 [<c0105590>] do_signal+0x70/0x150 [<c01056c7>] do_notify_resume+0x57/0x8c [<c0105866>] work_notifysig+0x13/0x15 Code: c0 8b 5d 0c 89 44 24 04 e8 a1 a8 ff ff 85 c0 89 c6 74 33 8b 48 44 8d 50 44 eb 0c 8d 76 00 39 59 04 74 2b 89 ca 8b 09 85 c9 75 f3 <8b> 41 04 c7 04 24 a8 14 49 c0 89 44 24 04 e8 50 36 f5 ff 89 34 Entering kdb (current=0xf7e21250, pid 396087) on processor 1 Oops: Oops due to oops @ 0xc01d4b6d eax = 0xf5005188 ebx = 0x00060b37 ecx = 0x00000000 edx = 0xf665d2c0 esi = 0xf5005188 edi = 0xf6b48b80 esp = 0xf6d5fdb0 eip = 0xc01d4b6d ebp = 0xf6d5fdcc xss = 0xc03a0068 xcs = 0x00000060 eflags = 0x00210246 xds = 0xf665007b xes = 0x0000007b origeax = 0xffffffff &regs = 
0xf6d5fd7c [1]kdb> Stack traceback for pid 396087 0xf7e21250 396087 1 1 1 R 0xf7e21430 *totem EBP EIP Function (args) 0xf6d5fdcc 0xc01d4b6d ssi_semexit+0x3d (0x70000, 0x60b37, 0xc015dc88, 0x5d, 0xf720c660) 0xf6d5fe5c 0xc01d474e exit_sem+0x15e (0xf7e21250, 0x2b, 0x1, 0xf68b2c84, 0xf7e21718) 0xf6d5fe8c 0xc012a619 do_exit+0x159 (0x0, 0x0, 0x0, 0x9, 0xf6d5f000) 0xf6d5feac 0xc012aa7a do_group_exit+0x3a (0x9, 0x0, 0x0, 0xf6d5f000, 0xf6d5f000) 0xf6d5fedc 0xc0135163 get_signal_to_deliver+0x233 (0xf6d5ff18, 0xf6d5fef8, 0xf6d5ffc4, 0x0, 0x200282) 0xf6d5ffa4 0xc0105590 do_signal+0x70 (0xf7214580, 0x8297010, 0x8297010, 0xb71e37b0) 0xf6d5ffbc 0xc01056c7 do_notify_resume+0x57 0xc0105866 work_notifysig+0x13 [1]kdb> ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-21 15:38 Message: Logged In: YES user_id=166336 Originator: YES I'm having some difficulty reproducing this problem after a reboot. I've hacked some debugging printf's into the kernel I'm using and will add any new info when/if I find it. ---------------------------------------------------------------------- Comment By: Roger Tsang (rogertsang) Date: 2008-04-17 12:56 Message: Logged In: YES user_id=1246761 Originator: NO Can this be reproduced in the original compiled kernel from the latest binary release? Apache uses IPC semaphores and have not run into this bug on UP/SMP. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 12:21 Message: Logged In: YES user_id=166336 Originator: YES Well, since "ripc_drop_locks" is for shared memory not semaphores it's probably a different bug. ---------------------------------------------------------------------- Comment By: John Hughes (hughesj) Date: 2008-04-17 10:08 Message: Logged In: YES user_id=166336 Originator: YES Another BUG in the semaphore code - may indicate the underlying cause of the problem? 
It's trying to unlock a lock that isn't locked.

------------[ cut here ]------------
kernel BUG at include/asm/spinlock.h:112!
invalid operand: 0000 [#1]
SMP
Modules linked in: i915 drm button ac battery parport_pc parport floppy pcspkr snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc i2c_i801 i2c_core ata_piix libata hw_random ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU: 0
EIP: 0060:[<c046290b>] Not tainted VLI
EFLAGS: 00010202 (2.6.11-jh-1)
EIP is at _spin_unlock+0x1b/0x30
eax: 00000001 ebx: c0750140 ecx: c0750101 edx: f7e12e08
esi: f70c2400 edi: c0753360 ebp: f7032f10 esp: f7032f10
ds: 007b es: 007b ss: 0068
Process icssvr_daemon (pid: 197135, threadinfo=f7032000 task=f70cd930)
Stack: f7032f18 c01cecbb f7032f28 c01ce77e f7e12e08 02668001 f7032f44 c0261dd5
02668001 f7e12e08 c0750140 00000001 f7032f5c f7032f6c c0258708 00000003
f7032f5c 02668001 00000000 00000000 02668001 00000002 00000002 f7032fec
Call Trace:
[<c010694f>] show_stack+0x7f/0xa0
[<c0106b04>] show_registers+0x164/0x220
[<c0106e94>] die+0xf4/0x1c0
[<c0107015>] do_trap+0xb5/0xc0
[<c01072cc>] do_invalid_op+0xbc/0xd0
[<c01065a3>] error_code+0x2b/0x30
[<c01cecbb>] ipc_unlock+0xb/0x10
[<c01ce77e>] ipc_drop_locks+0x1e/0x40
[<c0261dd5>] ripc_drop_locks+0x45/0x60
[<c0258708>] svr_ripc_drop_locks+0x58/0xb0
[<c020abb3>] icssvr_daemon+0x2f3/0xab0
[<c01023a5>] kernel_thread_helper+0x5/0x10
Code: 1c 0c 49 c0 eb e6 8d 76 00 8d bc 27 00 00 00 00 55 89 c2 89 e5 81 78 04 ad 4e ad de b1 01 75 15 0f b6 02 84 c0 7f 04 86 0a 5d c3 <0f> 0b 70 00 1c 0c 49 c0 eb f2 0f 0b 6f 00 1c 0c 49 c0 eb e1 90

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-16 13:22

Message:
Logged In: YES
user_id=166336
Originator: YES

Here's another example, this time it was going through the local exit_sem
path:

------------[ cut here ]------------
kernel BUG at ipc/sem.c:1913!
invalid operand: 0000 [#1]
SMP
Modules linked in: smbfs i915 drm button ac battery parport_pc parport pcspkr i2c_i801 i2c_core ata_piix libata snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd soundcore snd_page_alloc ehci_hcd uhci_hcd sr_mod sd_mod mptsas mptscsih mptbase scsi_mod tg3 e1000
CPU: 0
EIP: 0060:[<c01d3f29>] Not tainted VLI
EFLAGS: 00210202 (2.6.11-jh-1)
EIP is at exit_sem+0x229/0x2b0
eax: 00000001 ebx: c597e808 ecx: 00000001 edx: c597e808
esi: 000e800c edi: cbf682e0 ebp: d76fce6c esp: d76fcdd0
ds: 007b es: 007b ss: 0068
Process firefox-bin (pid: 743423, threadinfo=d76fc000 task=df5f58b0)
Stack: c07500c0 c597e808 000e800c 00000000 d76fce00 c015d84d c165eb80 d1e12ee4
d76fc000 00000001 000b0f63 d76fc000 cfcfd42c cfcfd420 d1e12ee4 defe7380
0000000b df5f5d78 d76fce28 defe7380 defe73c8 df5f5d78 d76fce3c c0125456
Call Trace:
[<c010694f>] show_stack+0x7f/0xa0
[<c0106b04>] show_registers+0x164/0x220
[<c0106e94>] die+0xf4/0x1c0
[<c0107015>] do_trap+0xb5/0xc0
[<c01072cc>] do_invalid_op+0xbc/0xd0
[<c01065a3>] error_code+0x2b/0x30
[<c012a319>] do_exit+0xb9/0x3b0
[<c012a68c>] do_group_exit+0x3c/0xb0
[<c01350cf>] get_signal_to_deliver+0x1ff/0x310
[<c01057c4>] do_signal+0x74/0x140
[<c0105917>] do_notify_resume+0x87/0x8c
[<c0105a86>] work_notifysig+0x13/0x15
Code: 80 8c 00 00 00 89 41 04 46 3b 75 88 7c c0 a1 90 0d 74 c0 89 43 30 89 1c 24 e8 14 df ff ff e9 52 ff ff ff c7 01 00 00 00 00 eb be <0f> 0b 79 07 8f b8 49 c0 e9 05 ff ff ff 89 44 24 04 89 34 24 e8

So it's not to do with local/remote semaphores.

----------------------------------------------------------------------

Comment By: Nobody/Anonymous (nobody)
Date: 2008-04-15 15:55

Message:
Logged In: NO

I'm sorry Roger, I don't get the point of your patch.
I suppose the relevant bit is:

@@ -2027,7 +2025,7 @@ namesvr_semexit_go:
                        continue;
                }
-               __ssi_semexit(semid, current->tgid, sma);
+               __ssi_semexit(u->semid, current->tgid, sma);
        }
 }

so if semid has been changed to be bad, or if u->semid was bad and has
changed to be good we won't panic.

I can't see how either of these conditions can happen.

Also the call path that seems to be causing problems seems to be (from the
trace above):

[client node]
exit_sem
cli_ripc_semexit
[server node]
[...]
svr_ripc_semexit
ripc_semexit
ssi_semexit
__ssi_semexit

and your patch touches the

exit_sem
__ssi_semexit

path.

I must admit I'm pretty surprised to see that the client/server stuff is
being used - I thought everything was staying node-local.

Maybe the fix is simply to bail out of __ssi_semexit if check_semid doesn't
match? Couldn't it just indicate a sem_exit/IPC_RMID collision?

----------------------------------------------------------------------

Comment By: Roger Tsang (rogertsang)
Date: 2008-04-15 03:16

Message:
Logged In: YES
user_id=1246761
Originator: NO

Maybe semid changed before exit_sem() got sem_lock(). Try attached patch.
File Added: ipc_sem.c.patch

----------------------------------------------------------------------

Comment By: John Hughes (hughesj)
Date: 2008-04-14 10:24

Message:
Logged In: YES
user_id=166336
Originator: YES

Here's the code:

static inline void
__ssi_semexit(int semid, pid_t pid, struct sem_array *sma)
{
        int nsems, i;
        struct sem_semundo *un, **unp;

        BUG_ON(sem_checkid(sma,semid));

----------------------------------------------------------------------
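
Editorial sketch (not from the thread, and not the OpenSSI code): the suggestion above to bail out of __ssi_semexit on an id mismatch, instead of hitting BUG_ON, can be illustrated in user space. The sketch assumes that a semaphore id is built as seq * 32768 + slot, which is how I recall the 2.6-era ipc/util code doing it; under that assumption a value such as 000e800c from the exit_sem trace would decode as slot 12, sequence 29, if that value really is a semid. The names fake_sem_array, build_id, check_id and semexit_tolerant are hypothetical stand-ins.

```c
#include <assert.h>

/* Assumption: ids are seq * SEQ_MULTIPLIER + slot, as recalled from the
 * 2.6-era ipc/util code. The value 32768 is an assumption too. */
#define SEQ_MULTIPLIER 32768

/* Hypothetical stand-in for the kernel's struct sem_array. */
struct fake_sem_array {
    int seq; /* sequence number given to the set when its slot was filled */
};

/* Build an id from a slot index and a sequence number. */
static int build_id(int slot, int seq)
{
    assert(slot >= 0 && slot < SEQ_MULTIPLIER);
    return seq * SEQ_MULTIPLIER + slot;
}

/* Same idea as sem_checkid(): nonzero means the id no longer matches the
 * set that currently occupies the slot. */
static int check_id(const struct fake_sem_array *sma, int semid)
{
    return semid / SEQ_MULTIPLIER != sma->seq;
}

/* Hypothetical tolerant exit path: a stale id is reported to the caller
 * rather than treated as a fatal condition. */
static int semexit_tolerant(const struct fake_sem_array *sma, int semid)
{
    if (check_id(sma, semid))
        return -1; /* set removed and slot reused: nothing left to undo */
    /* ... the undo processing would go here ... */
    return 0;
}
```

If the set is removed with IPC_RMID and the slot is reused, the new occupant gets a different sequence number, so a process still holding the old id fails check_id. Whether silently returning is actually safe in the SSI remote path is exactly the open question in the thread; this only shows the mechanics of the check.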