From: Garrett C. <yan...@gm...> - 2009-10-18 23:48:47
Hi kernel.org folks,

I'm trying to track down an issue with the sem01 [1] and shmem_test_03 [2] testcases from LTP: they consistently hang on our cavium / mips32 boards when executing semop. This is the section of code in shmem_test_03.c where everything breaks down:

    static void lock_resource (int semaphore)
    {
        struct sembuf buf;

        buf.sem_op = -1;    /* Obtain resource */
        buf.sem_num = semaphore;
        buf.sem_flg = 0;

        if (semop (semid, &buf, 1) < 0)    /* <-- Hangs here indefinitely */
            sys_error ("semop (LOCK) failed", __LINE__);
    }

The first lock_resource call appears to be recursive (a glibc bug?) according to the gdb output, which doesn't make sense, but then again I'm not ruling out a `Schrödinger's cat effect' from having gdb present and observing the program.

A few datapoints:

1. I see a BUG note in the kernel.org manpage about kernel versions [2.6.1, 2.6.10] (from <http://www.kernel.org/doc/man-pages/online/pages/man2/semop.2.html>):

    BUGS

    When a process terminates, its set of associated semadj structures is used to undo the effect of all of the semaphore operations it performed with the SEM_UNDO flag. This raises a difficulty: if one (or more) of these semaphore adjustments would result in an attempt to decrease a semaphore's value below zero, what should an implementation do? One possible approach would be to block until all the semaphore adjustments could be performed. This is however undesirable since it could force process termination to block for arbitrarily long periods. Another possibility is that such semaphore adjustments could be ignored altogether (somewhat analogously to failing when IPC_NOWAIT is specified for a semaphore operation). Linux adopts a third approach: decreasing the semaphore value as far as possible (i.e., to zero) and allowing process termination to proceed immediately.

    In kernels 2.6.x, x <= 10, there is a bug that in some circumstances prevents a process that is waiting for a semaphore value to become zero from being woken up when the value does actually become zero. This bug is fixed in kernel 2.6.11.

   However, we're using 2.6.24 [with some patches backported from 2.6.25 and 2.6.26 of the kernel AFAIK], so this bug shouldn't apply to us.

2. We have ppc targets that don't run into any issues with this particular test, but the architecture is completely different, as is the glibc version (2.3.3 with NPTL support backported from 2.4 for mips32; 2.3.4 is our ppc version) -_-...

My questions for the experts are:

1. Does this sound familiar at all?
2. Do you have any suggestions for how I should diagnose this further?
3. Is there a series of additional tests I can run with a different set of syscalls or kernel APIs that may exercise similar sections of code?

I've attached a simpler version of shmem_test_03 (sem_test.c) from LTP that isolates the particular issue on our mips platform, as well as the gdb log and the script used to produce the log, as a final reference point for this issue.

Please CC me on all replies as I'm not subscribed to either the linux-mips-kernel or linux-mips mailing lists.

Many thanks!
-Garrett

Output from semctl_test --

[o:~]$ ./semctl_test
semget PASSED
semctl set (WRITE) PASSED
semctl set (READ) PASSED
semop
[o:~]$ logout

1. http://ltp.cvs.sourceforge.net/viewvc/ltp/ltp/testcases/kernel/ipc/semaphore/sem01.c?revision=1.2&view=markup
2. http://ltp.cvs.sourceforge.net/viewvc/ltp/ltp/testcases/kernel/ipc/ipc_stress/shmem_test_03.c?revision=1.7&view=markup
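
One possible way to narrow down the hanging semop call in lock_resource() is to repeat the same operation with IPC_NOWAIT, so that a semaphore that cannot be decremented yields EAGAIN instead of blocking, and then to print the kernel's view of the semaphore. The following is only a rough sketch under assumptions: it targets Linux with glibc (where the caller must define union semun), and the helper names make_sem, try_lock_resource, dump_sem and remove_sem are hypothetical and do not come from LTP or from the attached sem_test.c.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Assumption: Linux/glibc, where <sys/sem.h> leaves union semun to the caller. */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Hypothetical helper: create a private set holding one semaphore and
 * set its value to `initial`. Returns the semid, or -1 on failure. */
static int make_sem(int initial)
{
    union semun arg;
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (id < 0)
        return -1;
    arg.val = initial;
    if (semctl(id, 0, SETVAL, arg) < 0) {
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    return id;
}

/* Same operation as lock_resource(), but non-blocking.
 * Returns 0 if the semaphore was decremented, 1 if the call would have
 * blocked (errno == EAGAIN), and -1 on any other error. */
static int try_lock_resource(int id, int semaphore)
{
    struct sembuf buf;

    assert(semaphore >= 0);
    buf.sem_num = (unsigned short) semaphore;
    buf.sem_op = -1;            /* Obtain resource */
    buf.sem_flg = IPC_NOWAIT;   /* report EAGAIN instead of sleeping */

    if (semop(id, &buf, 1) == 0)
        return 0;
    return (errno == EAGAIN) ? 1 : -1;
}

/* Print the current value, the number of waiters, and the pid of the
 * last process that performed an operation on the semaphore. */
static void dump_sem(int id, int semaphore)
{
    printf("semval=%d semncnt=%d semzcnt=%d sempid=%d\n",
           semctl(id, semaphore, GETVAL),
           semctl(id, semaphore, GETNCNT),
           semctl(id, semaphore, GETZCNT),
           semctl(id, semaphore, GETPID));
}

/* Remove the semaphore set so it does not linger after the test. */
static void remove_sem(int id)
{
    semctl(id, 0, IPC_RMID);
}
```

With a semaphore created by make_sem(1), the first try_lock_resource() call should return 0 and the second should return 1. If the first call already returns 1 on the mips32 board, the initial value set through SETVAL is probably not what the test expects, and dump_sem() shows what the kernel actually holds.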