#45 *** Grunt thread exiting with 1 still active

open
nobody
None
5
2005-07-25
2005-07-25
Marc Johnson
No

After running a network test, dynamo continuously
outputs to the console:

*** Grunt thread exiting with 1 still active
*** Grunt thread exiting with 1 still active
*** Grunt thread exiting with 1 still active

...and locks up the GUI (iometer).

Ctrl-C'ing dynamo is the only way to unlock it.

Built with iometer-2004.07.30-post.DS2, Makefile-Linux.x86_64.

Linux lablx000 2.4.21-27.EL #1 SMP Wed Dec 1 21:53:39 EST 2004 x86_64 x86_64 x86_64 GNU/Linux

Any fixes for this?

Thanks and please advise,
Marc

Discussion

  • Ming Zhang
    2005-07-27

    Logged In: YES
    user_id=896657

    Thanks for reporting. I will check this once I get time.

     
  • Ming Zhang
    2005-08-16

    Logged In: YES
    user_id=896657

    This is caused by the bug fix for #1088486. The reason is
    that network I/O differs from disk I/O. For disk I/O, the
    worker stops, waits for the current I/O to end, and then quits,
    which is pretty reasonable. For a network worker, however, it is
    possible that 1 I/O always remains unfinished for some unknown
    reason, so the worker loops there forever.

     
  • Paul Drews
    2008-12-18

    I debugged this since it was critical for my work. The problem is that when completing a network test between two endpoints, Iometer tells the endpoints to stop. The "stop" commands can never be quite simultaneous (and even if they were, subsequent protocol actions could cross in the network). An endpoint that shuts down its socket before hearing about the shutdown from the other end causes a network-protocol message regarding the shutdown to propagate to the other endpoint. Under Linux x86_64 (Fedora 9), the other endpoint gets an error 104 "connection reset by peer" on the next or any outstanding asynchronous IO, and an error 32 "broken pipe" on subsequent ones. This causes two problems in the code: (1) the error doesn't get noticed by some layers, and (2) even if it were noticed, the top-level IOGrunt code that tries to drain all the outstanding in-progress IOs sees these as partially completed and loops forever submitting new attempts to complete the IOs.
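
    The endless drain described above can be illustrated with a small, self-contained simulation. This is only a sketch and not Iometer code: the names Completion, complete_on_dead_socket, and drain are hypothetical, and the peer is assumed to have closed its socket already so that every request fails with EPIPE and transfers nothing.

    ```cpp
    #include <cerrno>

    // Hypothetical stand-in for one completed asynchronous I/O.
    struct Completion {
        long bytes;  // bytes transferred, as reported by the completion
        int  error;  // explicit error code, 0 when there is none
    };

    // Assumption: the peer has already shut down its socket, so every
    // request completes with "broken pipe" and moves no data.
    Completion complete_on_dead_socket()
    {
        return Completion{0, EPIPE};
    }

    // Simulates the drain loop. Returns the number of completions that
    // were processed before the queue emptied, or -1 if the loop was
    // still running after max_steps completions.
    int drain(int outstanding, long request_size, bool draining_flag,
              int max_steps)
    {
        int steps = 0;
        while (outstanding > 0) {
            if (steps == max_steps)
                return -1;              // give up: the loop is not converging
            ++steps;
            Completion c = complete_on_dead_socket();
            --outstanding;              // the completed request leaves the queue
            if (c.bytes < request_size && !draining_flag) {
                // A short transfer is treated as a partial I/O and the
                // remainder is resubmitted, which fails the same way.
                ++outstanding;
            }
        }
        return steps;
    }
    ```

    In this model, drain(1, 4096, false, 1000) never empties the queue and returns -1, while the same call with draining_flag set to true finishes after a single completion.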

    This is observed with the 2008-06-22-rc2 version of Iometer (latest available as of this date), built with Linux Fedora 9 X86_64. I have included a patch that fixes this, although I have not had the opportunity to test it on other OS versions. I would be grateful if this or an equivalent fix can be included in the next release:

    Index: src/IOCompletionQ.cpp

    --- src.orig/IOCompletionQ.cpp
    +++ src/IOCompletionQ.cpp
    @@ -319,6 +319,20 @@ BOOL GetQueuedCompletionStatus(HANDLE cq
    // have to considere changes there as well.
    SetLastError(cqid->element_list[i].error);
    return (FALSE);
    + } else if (cqid->element_list[i].error != 0) {
    + // Sadly, some systems overload the "read"
    + // return with the (positive) error value.
    + // Checking the explicit "error" value is
    + // a more reliable way to distinguish an
    + // actual error. Typical errors are
    + // 104: connection reset by peer
    + // 32: broken pipe.
    + // Note that it is important that ReadFile()
    + // and WriteFile() preset this error value
    + // to 0 when starting an async IO.
    + *bytes_transferred = 0;
    + SetLastError(cqid->element_list[i].error);
    + return (FALSE);
    } else {
    return (TRUE);
    }
    @@ -547,6 +561,7 @@ BOOL ReadFile(HANDLE file_handle, void *

    aiocbp = &this_cq->element_list[free_index].aiocbp;

    + memset(aiocbp, 0, sizeof(*aiocbp));
    aiocbp->aio_buf = buffer;
    aiocbp->aio_fildes = filep->fd;
    aiocbp->aio_nbytes = bytes_to_read;
    @@ -558,6 +573,7 @@ BOOL ReadFile(HANDLE file_handle, void *
    this_cq->element_list[free_index].data = lpOverlapped;
    this_cq->element_list[free_index].bytes_transferred = 0;
    this_cq->element_list[free_index].completion_key = filep->completion_key;
    + this_cq->element_list[free_index].error = 0;

    *bytes_read = 0;

    @@ -654,6 +670,7 @@ BOOL WriteFile(HANDLE file_handle, void

    aiocbp = &this_cq->element_list[free_index].aiocbp;

    + memset(aiocbp, 0, sizeof(*aiocbp));
    aiocbp->aio_buf = buffer;
    aiocbp->aio_fildes = filep->fd;
    aiocbp->aio_nbytes = bytes_to_write;
    @@ -665,6 +682,7 @@ BOOL WriteFile(HANDLE file_handle, void
    this_cq->element_list[free_index].data = lpOverlapped;
    this_cq->element_list[free_index].bytes_transferred = 0;
    this_cq->element_list[free_index].completion_key = filep->completion_key;
    + this_cq->element_list[free_index].error = 0;

    *bytes_written = 0;

    Index: src/IOGrunt.cpp

    --- src.orig/IOGrunt.cpp
    +++ src/IOGrunt.cpp
    @@ -1098,6 +1098,7 @@ void Grunt::Do_IOs()
    Target *target;
    Raw_Result *target_results; // Pointer to results for selected target.
    Raw_Result *prev_target_results;
    + draining_ios = FALSE;

    while (grunt_state != TestIdle) {
    #if defined(IOMTR_OSFAMILY_NETWARE)
    @@ -1336,6 +1337,7 @@ void Grunt::Do_IOs()
    } // while grunt_state is not TestIdle

    // Drain any outstanding I/Os from the completion queue
    + draining_ios = TRUE;
    while (outstanding_ios > 0) {
    #if defined(IOMTR_OSFAMILY_NETWARE)
    pthread_yield(); // NetWare is non-preemptive
    @@ -1366,8 +1368,15 @@ ReturnVal Grunt::Complete_IO(int timeout
    switch (io_cq->GetStatus(&bytes, &trans_id, timeout)) {
    case ReturnSuccess:
    // I/O completed. Make sure we received everything we requested.
    - if (bytes < (int)trans_slots[trans_id].size)
    - Do_Partial_IO(&trans_slots[trans_id], bytes);
    + if (bytes < (int)trans_slots[trans_id].size) {
    + if (! draining_ios) {
    + Do_Partial_IO(&trans_slots[trans_id], bytes);
    + } else {
    + // We're draining outstanding IOs, so
    + // don't initiate a new one.
    + Record_IO(&trans_slots[trans_id], 0);
    + }
    + }
    else
    Record_IO(&trans_slots[trans_id], timer_value());
    return ReturnSuccess;
    Index: src/IOGrunt.h
    ===================================================================
    --- src.orig/IOGrunt.h
    +++ src/IOGrunt.h
    @@ -196,6 +196,7 @@ class Grunt {
    int available_head;
    int available_tail;
    int outstanding_ios;
    + BOOL draining_ios;
    //
    // Operations on related I/O transaction arrays.
    void Initialize_Transaction_Arrays();
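
    The rule the IOCompletionQ.cpp hunk relies on (trust the explicit error field rather than the byte count) can be shown in isolation. The following is a simplified sketch, not the Iometer implementation: Element and get_status are hypothetical names, and errno stands in for SetLastError().

    ```cpp
    #include <cerrno>

    // Hypothetical, simplified stand-in for one completion-queue element.
    struct Element {
        long bytes_transferred;  // raw value reported by the completed I/O
        int  error;              // explicit error code, preset to 0 at submit
    };

    // Checks the explicit error before looking at the byte count.
    // On error: *bytes is set to 0, errno receives the code, returns false.
    // Otherwise: *bytes receives the byte count and the call returns true.
    bool get_status(const Element &e, long *bytes)
    {
        if (e.error != 0) {
            *bytes = 0;
            errno = e.error;
            return false;
        }
        *bytes = e.bytes_transferred;
        return true;
    }
    ```

    With this ordering, an element whose byte count happens to hold a positive error value such as 104 is still reported as a failure with zero bytes, provided the error field was reset to 0 when the I/O was submitted.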

    I also note that the "stropts" package appears to be obsolete or deprecated. In Fedora 9 it is gone completely, including its header file. It is not actually used for Linux x86_64 builds, so the following patch suffices to let Iometer build properly on Fedora 9 and later:

    Index: src/IOPerformance.h

    --- src.orig/IOPerformance.h
    +++ src/IOPerformance.h
    @@ -97,7 +97,7 @@
    #include <net/if.h>
    #endif

    -#if defined(IOMTR_OS_LINUX) || defined(IOMTR_OSFAMILY_NETWARE) || defined(IOMTR_OS_SOLARIS)
    +#if defined(IOMTR_OSFAMILY_NETWARE) || defined(IOMTR_OS_SOLARIS)
    #include <stropts.h>
    #endif
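
    A possible alternative to hard-coding the OS list, shown here only as a sketch and not as part of the patch above, is to probe for the header at compile time. This assumes a compiler that supports __has_include (C++17, or recent GCC and Clang as an extension); HAVE_STROPTS_H and have_stropts are hypothetical names introduced for illustration.

    ```cpp
    // Include <stropts.h> only when the toolchain actually ships it.
    #if defined(__has_include)
    #  if __has_include(<stropts.h>)
    #    include <stropts.h>
    #    define HAVE_STROPTS_H 1
    #  endif
    #endif

    // Fall back to "not available" when the probe failed or is unsupported.
    #ifndef HAVE_STROPTS_H
    #  define HAVE_STROPTS_H 0
    #endif

    // Lets ordinary code branch on the result of the probe.
    constexpr bool have_stropts()
    {
        return HAVE_STROPTS_H != 0;
    }
    ```

    Code that really needs the header would then be guarded with #if HAVE_STROPTS_H instead of a per-OS condition.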