[Linux-mips-kernel] Userland `hang' with sem01 / shmem_test_03 on cavium mips32


Hi kernel.org folks,

    I'm trying to track down an issue with the sem01 [1] and
shmem_test_03 [2] testcases from LTP: they consistently hang on
our cavium / mips32 boards when executing semop. This is the section
of code in shmem_test_03.c where everything breaks down:

static void lock_resource (int semaphore)
{
        struct sembuf   buf;

        buf.sem_op = -1;                /* Obtain resource */
        buf.sem_num = semaphore;
        buf.sem_flg = 0;

        if (semop (semid, &buf, 1) < 0) /* <-- Hangs here indefinitely */
                sys_error ("semop (LOCK) failed", __LINE__);
}
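
    One way to tell whether semop is legitimately blocking (the
semaphore value really is 0) or is stuck for some other reason is to
issue the same operation with IPC_NOWAIT, which returns EAGAIN instead
of sleeping. The following is only a sketch under assumptions: it
targets Linux/glibc (where the caller must define union semun), and
make_sem and try_lock_resource are hypothetical helper names, not part
of LTP.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* On Linux/glibc the caller has to define union semun itself. */
union semun {
        int val;
        struct semid_ds *buf;
        unsigned short *array;
};

/* Hypothetical helper: create a private set holding one semaphore
 * initialised to `initial'. Returns the set id, or -1 on failure. */
static int make_sem (int initial)
{
        union semun arg;
        int id = semget (IPC_PRIVATE, 1, IPC_CREAT | 0600);

        if (id < 0)
                return -1;
        arg.val = initial;
        if (semctl (id, 0, SETVAL, arg) < 0) {
                semctl (id, 0, IPC_RMID);
                return -1;
        }
        return id;
}

/* Hypothetical non-blocking variant of lock_resource: returns 0 if the
 * resource was obtained, or -1 with errno == EAGAIN if a plain semop
 * would have had to sleep. */
static int try_lock_resource (int semid, int semaphore)
{
        struct sembuf   buf;

        assert (semaphore >= 0);
        buf.sem_op = -1;                /* Obtain resource */
        buf.sem_num = semaphore;
        buf.sem_flg = IPC_NOWAIT;       /* Fail instead of blocking */

        return semop (semid, &buf, 1);
}
```

    If try_lock_resource reports EAGAIN at the point where the test
hangs, the semaphore value is 0 and the open question becomes why
nobody released it; if it succeeds, the blocking path itself is the
more likely suspect.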

    According to the gdb output, the first lock_resource call appears
to be recursive (a glibc bug?), which doesn't make sense, but then
again I'm not ruling out a `Schrödinger's cat effect' caused by having
gdb present and observing the program.

    A few datapoints:
    1. I see a BUG note in the kernel.org manpage about kernel
versions [2.6.1, 2.6.10] (from
<http://www.kernel.org/doc/man-pages/online/pages/man2/semop.2.html>),

BUGS

       When a process terminates, its set of associated semadj
       structures is used to undo the effect of all of the semaphore
       operations it performed with the SEM_UNDO flag.  This raises a
       difficulty: if one (or more) of these semaphore adjustments
       would result in an attempt to decrease a semaphore's value
       below zero, what should an implementation do?  One possible
       approach would be to block until all the semaphore adjustments
       could be performed.  This is however undesirable since it could
       force process termination to block for arbitrarily long
       periods.  Another possibility is that such semaphore
       adjustments could be ignored altogether (somewhat analogously
       to failing when IPC_NOWAIT is specified for a semaphore
       operation).  Linux adopts a third approach: decreasing the
       semaphore value as far as possible (i.e., to zero) and allowing
       process termination to proceed immediately.

       In kernels 2.6.x, x <= 10, there is a bug that in some
       circumstances prevents a process that is waiting for a
       semaphore value to become zero from being woken up when the
       value does actually become zero.  This bug is fixed in kernel
       2.6.11.

    but we're using 2.6.24 [with some patches backported from 2.6.25
and 2.6.26 of the kernel AFAIK], so that bug should not apply here.
    2. We have ppc targets that don't run into any issues with this
particular test, but the architecture is completely different, as is
the glibc version (2.3.3 with NPTL support backported from 2.4 for
mips32; 2.3.4 is our ppc version) -_-...

    My questions for the experts are:
    1. Does this sound familiar at all?
    2. Do you have any suggestions for how I should diagnose this further?
    3. Is there a series of additional tests I can run with a
different set of syscalls or kernel APIs that may exercise similar
sections of code?
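
    To make question 3 concrete, this is the kind of additional test
I have in mind: a child blocks in semop, the parent then performs the
operation that should wake it, and an alarm keeps the test from
hanging forever. It is only a sketch under assumptions (Linux/glibc,
caller-defined union semun); do_semop and blocked_semop_wakes are
hypothetical names, and the one-second sleep is a heuristic to let the
child block first, not a guarantee.

```c
#include <assert.h>
#include <signal.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>

/* On Linux/glibc the caller has to define union semun itself. */
union semun {
        int val;
        struct semid_ds *buf;
        unsigned short *array;
};

/* Hypothetical helper: one blocking operation on semaphore 0. */
static int do_semop (int semid, short op)
{
        struct sembuf   buf;

        buf.sem_num = 0;
        buf.sem_op = op;
        buf.sem_flg = 0;
        return semop (semid, &buf, 1);
}

/* wait_op == -1: child waits to decrement a semaphore that is 0.
 * wait_op ==  0: child waits for a semaphore that is 1 to reach 0.
 * Returns 1 if the child's semop returned successfully, 0 if it did
 * not (for example, killed by the alarm), -1 on setup failure. */
static int blocked_semop_wakes (short wait_op)
{
        union semun arg;
        int semid, status, woke = 0;
        pid_t pid;

        assert (wait_op == -1 || wait_op == 0);
        semid = semget (IPC_PRIVATE, 1, IPC_CREAT | 0600);
        if (semid < 0)
                return -1;
        arg.val = (wait_op == 0) ? 1 : 0;
        if (semctl (semid, 0, SETVAL, arg) < 0 || (pid = fork ()) < 0) {
                semctl (semid, 0, IPC_RMID);
                return -1;
        }
        if (pid == 0) {
                alarm (5);      /* SIGALRM kills the child if semop never returns */
                _exit (do_semop (semid, wait_op) == 0 ? 0 : 1);
        }

        sleep (1);              /* Heuristic: give the child time to block */
        /* Release the child: +1 for a decrement waiter, -1 for a zero waiter */
        if (do_semop (semid, wait_op == 0 ? -1 : 1) < 0)
                kill (pid, SIGKILL);
        if (waitpid (pid, &status, 0) == pid)
                woke = WIFEXITED (status) && WEXITSTATUS (status) == 0;
        semctl (semid, 0, IPC_RMID);
        return woke;
}
```

    Calling blocked_semop_wakes (-1) covers the decrement path that
lock_resource uses, and blocked_semop_wakes (0) covers the
wait-for-zero path mentioned in the manpage BUGS text. A return value
of 0 on the mips32 boards but 1 on ppc would narrow things down to the
wakeup side.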

    I've attached a simpler version of shmem_test_03 (sem_test.c) from
LTP that isolates the particular issue on our mips platform, as well
as the gdb log, and script used to produce the log, as a final
reference point for this issue.
    Please CC me on all replies as I'm not subscribed to either the
linux-mips-kernel or linux-mips mailing lists.
Many thanks!
-Garrett

Output from semctl_test --
[o:~]$ ./semctl_test
semget PASSED
semctl set (WRITE) PASSED
semctl set (READ) PASSED
semop
[o:~]$ logout

1. http://ltp.cvs.sourceforge.net/viewvc/ltp/ltp/testcases/kernel/ipc/semaphore/sem01.c?revision=1.2&view=markup
2. http://ltp.cvs.sourceforge.net/viewvc/ltp/ltp/testcases/kernel/ipc/ipc_stress/shmem_test_03.c?revision=1.7&view=markup