Iometer

Tracker: Bugs

*** Grunt thread exiting with 1 still active - ID: 1244848
Last Update: Comment added ( pcdrews )

After running a network test, dynamo continuously
outputs to console:

*** Grunt thread exiting with 1 still active
*** Grunt thread exiting with 1 still active
*** Grunt thread exiting with 1 still active

...and locks up the GUI (iometer).

Ctrl-C'ing dynamo is the only way to unlock it.

Built with iometer-2004.07.30-post.DS2, Makefile-Linux.x86_64.

Linux lablx000 2.4.21-27.EL #1 SMP Wed Dec 1 21:53:39 EST 2004 x86_64 x86_64 x86_64 GNU/Linux

Any fixes for this?

Thanks and please advise,
Marc



Marc Johnson ( marcjohnson ) - 2005-07-25 23:25

5

Open

None

Nobody/Anonymous

None

None

Public


Comments ( 3 )

Date: 2008-12-18 06:28
Sender: pcdrews

I debugged this since it was critical for my work. The problem is that
when completing a network test between two endpoints, Iometer tells the
endpoints to stop. The "stop" commands can never be quite simultaneous
(and even if they were, subsequent protocol actions could cross in the
network). An endpoint that shuts down its socket before hearing about it
from the other end causes a network-protocol message regarding the shutdown
to propagate to the other endpoint. The other endpoint gets an error 104
"connection reset by peer" under Linux x86_64 (Fedora 9) on the next or any
outstanding asynchronous IO, and an error 32 "broken pipe" on subsequent
ones. This causes two problems in the code: (1) the error doesn't get
noticed by some layers, and (2) even if it were noticed, the top-level
IOGrunt code that tries to drain all the outstanding in-progress IOs sees
these as partially-completed and loops forever submitting new attempts to
complete the IOs.
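
The endless drain loop described above can be reproduced in miniature. What follows is a minimal sketch, not Iometer code: `FakeSlot` and `drain` are hypothetical names, and the dead peer is modeled simply as an I/O that never delivers its remaining bytes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one outstanding I/O; not Iometer's real type.
struct FakeSlot {
    int requested;    // bytes the worker asked for
    int transferred;  // bytes the (now closed) connection will ever deliver
};

// Simulates the "drain outstanding I/Os" loop.
// Returns the number of iterations used, or -1 if the loop was still
// spinning after max_iterations (the hang from the bug report).
int drain(std::vector<FakeSlot> slots, bool resubmit_partials, int max_iterations) {
    std::size_t outstanding = slots.size();
    int iterations = 0;
    while (outstanding > 0) {
        if (iterations >= max_iterations) {
            return -1;
        }
        ++iterations;
        const FakeSlot &slot = slots[outstanding - 1];
        const bool partial = slot.transferred < slot.requested;
        if (partial && resubmit_partials) {
            // Old behaviour: reissue the remainder. The peer is gone, so the
            // retry comes back short again and the count never drops.
            continue;
        }
        // Patched behaviour: while draining, record the I/O and move on.
        --outstanding;
    }
    return iterations;
}
```

With a single slot that never completes, `drain(slots, true, 1000)` gives up at the cap, while `drain(slots, false, 1000)` finishes in one pass.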

This is observed with the 2008-06-22-rc2 version of Iometer (latest
available as of this date), built with Linux Fedora 9 X86_64. I have
included a patch that fixes this, although I have not had the opportunity
to test it on other OS versions. I would be grateful if this or an
equivalent fix can be included in the next release:

Index: src/IOCompletionQ.cpp
===================================================================
--- src.orig/IOCompletionQ.cpp
+++ src/IOCompletionQ.cpp
@@ -319,6 +319,20 @@ BOOL GetQueuedCompletionStatus(HANDLE cq
// have to considere changes there as well.
SetLastError(cqid->element_list[i].error);
return (FALSE);
+ } else if (cqid->element_list[i].error != 0) {
+ // Sadly, some systems overload the "read"
+ // return with the (positive) error value.
+ // Checking the explicit "error" value is
+ // a more reliable way to distinguish an
+ // actual error. Typical errors are
+ // 104: connection reset by peer
+ // 32: broken pipe.
+ // Note that it is important that ReadFile()
+ // and WriteFile() preset this error value
+ // to 0 when starting an async IO.
+ *bytes_transferred = 0;
+ SetLastError(cqid->element_list[i].error);
+ return (FALSE);
} else {
return (TRUE);
}
@@ -547,6 +561,7 @@ BOOL ReadFile(HANDLE file_handle, void *

aiocbp = &this_cq->element_list[free_index].aiocbp;

+ memset(aiocbp, 0, sizeof(*aiocbp));
aiocbp->aio_buf = buffer;
aiocbp->aio_fildes = filep->fd;
aiocbp->aio_nbytes = bytes_to_read;
@@ -558,6 +573,7 @@ BOOL ReadFile(HANDLE file_handle, void *
this_cq->element_list[free_index].data = lpOverlapped;
this_cq->element_list[free_index].bytes_transferred = 0;
this_cq->element_list[free_index].completion_key = filep->completion_key;
+ this_cq->element_list[free_index].error = 0;

*bytes_read = 0;

@@ -654,6 +670,7 @@ BOOL WriteFile(HANDLE file_handle, void

aiocbp = &this_cq->element_list[free_index].aiocbp;

+ memset(aiocbp, 0, sizeof(*aiocbp));
aiocbp->aio_buf = buffer;
aiocbp->aio_fildes = filep->fd;
aiocbp->aio_nbytes = bytes_to_write;
@@ -665,6 +682,7 @@ BOOL WriteFile(HANDLE file_handle, void
this_cq->element_list[free_index].data = lpOverlapped;
this_cq->element_list[free_index].bytes_transferred = 0;
this_cq->element_list[free_index].completion_key = filep->completion_key;
+ this_cq->element_list[free_index].error = 0;

*bytes_written = 0;

Index: src/IOGrunt.cpp
===================================================================
--- src.orig/IOGrunt.cpp
+++ src/IOGrunt.cpp
@@ -1098,6 +1098,7 @@ void Grunt::Do_IOs()
Target *target;
Raw_Result *target_results; // Pointer to results for selected target.
Raw_Result *prev_target_results;
+ draining_ios = FALSE;

while (grunt_state != TestIdle) {
#if defined(IOMTR_OSFAMILY_NETWARE)
@@ -1336,6 +1337,7 @@ void Grunt::Do_IOs()
} // while grunt_state is not TestIdle

// Drain any outstanding I/Os from the completion queue
+ draining_ios = TRUE;
while (outstanding_ios > 0) {
#if defined(IOMTR_OSFAMILY_NETWARE)
pthread_yield(); // NetWare is non-preemptive
@@ -1366,8 +1368,15 @@ ReturnVal Grunt::Complete_IO(int timeout
switch (io_cq->GetStatus(&bytes, &trans_id, timeout)) {
case ReturnSuccess:
// I/O completed. Make sure we received everything we requested.
- if (bytes < (int)trans_slots[trans_id].size)
- Do_Partial_IO(&trans_slots[trans_id], bytes);
+ if (bytes < (int)trans_slots[trans_id].size) {
+ if (! draining_ios) {
+ Do_Partial_IO(&trans_slots[trans_id], bytes);
+ } else {
+ // We're draining outstanding IOs, so
+ // don't initiate a new one.
+ Record_IO(&trans_slots[trans_id], 0);
+ }
+ }
else
Record_IO(&trans_slots[trans_id], timer_value());
return ReturnSuccess;
Index: src/IOGrunt.h
===================================================================
--- src.orig/IOGrunt.h
+++ src/IOGrunt.h
@@ -196,6 +196,7 @@ class Grunt {
int available_head;
int available_tail;
int outstanding_ios;
+ BOOL draining_ios;
//
// Operations on related I/O transaction arrays.
void Initialize_Transaction_Arrays();
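
The order of checks in the `GetQueuedCompletionStatus()` hunk, and the reason the error field has to be reset in `ReadFile()` and `WriteFile()`, can be illustrated in isolation. This is a sketch under assumptions: `Element`, `start_io`, `is_peer_gone` and `classify` are hypothetical names that only mirror the `error` and `bytes_transferred` fields touched by the patch.

```cpp
#include <cerrno>

// Hypothetical, simplified completion-queue element.
struct Element {
    int error;               // errno-style code, 0 when the I/O succeeded
    long bytes_transferred;  // value reported by the "read"/"write" return
};

enum class Completion { Ok, Failed };

// Reset per-I/O state before the slot is reused, as the patch does in
// ReadFile() and WriteFile(); otherwise a stale error would leak through.
void start_io(Element &e) {
    e.error = 0;
    e.bytes_transferred = 0;
}

// The two errors named in the report: connection reset by peer, broken pipe.
bool is_peer_gone(int err) {
    return err == ECONNRESET || err == EPIPE;
}

// Trust the explicit error field before the byte count: a positive byte
// count may really be an error value on some systems.
Completion classify(Element &e) {
    if (e.error != 0) {
        e.bytes_transferred = 0;
        return Completion::Failed;
    }
    return Completion::Ok;
}
```

A slot that last failed with `ECONNRESET` keeps classifying as failed until `start_io()` clears it, which is why the reset matters when slots are reused.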



I also note that the "stropts" package seems to be obsolete or deprecated.
In Fedora 9 it is gone completely, including its header file. It is not
actually used for Linux x86_64 builds, so the following patch suffices to
let Iometer build properly on Fedora 9 and later:

Index: src/IOPerformance.h
===================================================================
--- src.orig/IOPerformance.h
+++ src/IOPerformance.h
@@ -97,7 +97,7 @@
#include <net/if.h>
#endif

-#if defined(IOMTR_OS_LINUX) || defined(IOMTR_OSFAMILY_NETWARE) || defined(IOMTR_OS_SOLARIS)
+#if defined(IOMTR_OSFAMILY_NETWARE) || defined(IOMTR_OS_SOLARIS)
#include <stropts.h>
#endif
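
As an alternative to listing platforms, the include could be guarded by a check for the header itself. This is not what the patch above does; it is only a sketch, assuming a compiler that supports `__has_include` (C++17, or recent GCC/Clang). The `HAVE_STROPTS_H` macro and `kHaveStropts` constant are hypothetical names.

```cpp
// Include <stropts.h> only where the header actually exists.
#if defined(__has_include)
#  if __has_include(<stropts.h>)
#    include <stropts.h>
#    define HAVE_STROPTS_H 1
#  endif
#endif

#ifndef HAVE_STROPTS_H
#  define HAVE_STROPTS_H 0
#endif

// Compile-time flag that code depending on STREAMS ioctls could test.
constexpr bool kHaveStropts = (HAVE_STROPTS_H != 0);
```

On a system without the header, such as Fedora 9 according to the note above, `kHaveStropts` would be false and the include would be skipped.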




Date: 2005-08-16 02:19
Sender: cheungming (Project Admin)

This is caused by the bug fix for #1088486. The reason is
that network I/O differs from disk I/O. For disk I/O, the
worker stops, waits for the current I/O to end, and then quits,
which is pretty reasonable. But for a network worker, it is
possible that there is always 1 I/O unfinished for some unknown
reason, and then it loops forever there.




Date: 2005-07-27 13:36
Sender: cheungming (Project Admin)

Thanks for reporting. I will check this once I get time.

