From: Mike H. <mho...@gr...> - 2005-02-22 16:33:26
Brian Paul wrote:
> Joel Welling wrote:
>
>> Hi folks;
>>   I'm encountering a problem managing the crserver processes for
>> configs in which I use the teac network mechanism; I think it will
>> also be seen in things like GM.  When I spawn the crserver I keep
>> track of its PID, and when I want it to terminate I send a TERM
>> signal to that PID.  Fair enough; the process that got that signal
>> terminates, running through the 'teardown' routine in the crserverlib
>> code.
>>   The problem is, by this time the crserver has spawned 2 child
>> processes.  I believe they are actually other threads, but I'm not
>> sure- I know they are not being created with crSpawn().  One of those
>> threads is very, very busy doing a wait loop, waiting for messages on
>> my high-bandwidth network.  That process, and the third process which
>> is its child, do *not* die when their parent (the originally spawned
>> process) gets the TERM signal.
>>   Can anyone confirm for me that these are threads, and tell me which
>> bit of code is likely to be spawning them?  Are their PIDs or thread
>> IDs getting saved anywhere?  I can modify the teardown procedure to
>> kill them if I have their names, but I can't simply kill the whole
>> process group- that kills innocent bystanders.
>
> I don't know how the crserver would be spawning any threads.  The
> crserver itself isn't even thread-safe.
>
> Are you sure the GM library isn't creating the threads?
>
> -Brian

I also haven't seen the threading behavior, but I have seen the networks
hang.  The issue is that faking a connection-based protocol on
connectionless networks (Quadrics/Myrinet/IB) has had shutdown problems
for as long as I can remember.  Sometimes things work correctly if the
application exits cleanly, so that signals to terminate get passed
around, but killing one of the applications or servers can cause the
others to hang.

I'm not sure what the best solution is.  Maybe we need to start trapping
all of the signals on the nodes to make sure the connections get shut
down (there's a sketch of that at the end of this message).  Another,
potentially better, option is to rely on the mothership's connection
brokering: if the mothership loses its connection to anyone in the
group, it sends a notification to the others that things should
terminate.  The layers other than tcp/sdp/udp would need to time out on
waitrecv and test the mothership connection (also sketched below).  This
might also fix the deadlock-detect-and-hang problem that happens
occasionally.

In general, this points to the network-layer rewrite we have talked
about off and on for the past two years.  There is some cruftiness in
writing high-speed layers.  Instead of making everything look like
TCP/IP, we might want to find a more general abstraction and map the
layers to that, something like a point-to-point layer (i.e. "send this
message to these systems", without enforcing connection semantics; see
the last sketch below).  For example, there is an MPI layer branched in
CVS that is getting painful to get working in general cases, mainly
because of its reliance on special process-creation semantics and the
enforcement of connection-based networking semantics.

The reason it's hard to get a network rewrite done is the sheer time
commitment involved, and not everyone has access to all of the different
network systems.  That said, many interconnects now support socket
interfaces and *DAPL (*=u/k/s), which might make for more simplified
porting efforts.

-Mike
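
P.S. Here are the sketches I mentioned, plus one for Joel's teardown
question.  None of this is real Chromium code; every name below is
invented for illustration.

For Joel: one guess is that the "child processes" are LinuxThreads
threads created inside the GM library, since LinuxThreads gives every
thread its own PID.  Either way, if the code that creates them records
their PIDs, teardown can kill exactly those and spare the rest of the
process group.  RecordChild/KillRecordedChildren are hypothetical:

    #include <signal.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    #define MAX_CHILDREN 8

    static pid_t childPids[MAX_CHILDREN];
    static int numChildren = 0;

    /* Call wherever the network layer forks (or LinuxThreads spawns)
     * a helper, so teardown knows whom to signal. */
    static void RecordChild(pid_t pid)
    {
        if (numChildren < MAX_CHILDREN)
            childPids[numChildren++] = pid;
    }

    /* Call from the crserverlib teardown routine: TERM each recorded
     * child and try to reap it, leaving innocent bystanders in the
     * process group alone.  (waitpid fails harmlessly for the
     * grandchild, which isn't our direct child.) */
    static void KillRecordedChildren(void)
    {
        int i;
        for (i = 0; i < numChildren; i++) {
            kill(childPids[i], SIGTERM);
            waitpid(childPids[i], NULL, 0);
        }
        numChildren = 0;
    }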
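
On trapping signals: the handler itself should only set a flag; the
receive loop checks the flag and does the actual connection teardown,
since almost nothing is safe to call from signal context.
TeardownConnections() is a stand-in for whatever closes the fabric
endpoints:

    #include <signal.h>

    static volatile sig_atomic_t shuttingDown = 0;

    static void HandleFatalSignal(int sig)
    {
        (void) sig;
        shuttingDown = 1;      /* picked up by the wait loop */
    }

    static void InstallShutdownHandlers(void)
    {
        struct sigaction sa;
        sa.sa_handler = HandleFatalSignal;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;
        sigaction(SIGTERM, &sa, NULL);
        sigaction(SIGINT,  &sa, NULL);
        sigaction(SIGHUP,  &sa, NULL);
    }

    /* The network wait loop then becomes:
     *     while (!shuttingDown) { ... wait for a message ... }
     *     TeardownConnections();
     */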
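
On the mothership idea: waitrecv stops blocking forever and instead
waits on the fabric with a timeout, checking the mothership socket
between timeouts.  FabricWaitRecv() stands in for the real teac/GM
receive call, and the single-byte 'T' notification is just a guess at a
wire format:

    #include <sys/select.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    /* Stand-in for the layer's blocking receive, modified to take a
     * timeout; returns nonzero if a message arrived. */
    extern int FabricWaitRecv(int timeoutMs);

    /* Returns 1 if we should shut down: the mothership connection
     * dropped, or it sent a terminate notification. */
    static int CheckMothership(int mothershipSock)
    {
        fd_set rfds;
        struct timeval tv = { 0, 0 };   /* poll, don't block */
        char msg;
        int n;

        FD_ZERO(&rfds);
        FD_SET(mothershipSock, &rfds);
        if (select(mothershipSock + 1, &rfds, NULL, NULL, &tv) <= 0)
            return 0;                   /* quiet socket: keep going */

        n = recv(mothershipSock, &msg, 1, 0);
        if (n <= 0)
            return 1;                   /* EOF/error: mothership gone */
        return msg == 'T';              /* hypothetical terminate op */
    }

    static void ReceiveLoop(int mothershipSock)
    {
        while (!CheckMothership(mothershipSock)) {
            if (FabricWaitRecv(500 /* ms */)) {
                /* ... dispatch the message ... */
            }
        }
        /* tear down fabric state and exit */
    }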
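
And on the point-to-point abstraction: roughly, a vtable that only
promises "deliver this message to these nodes" and "wait for a message
from anyone", with no per-peer connect/accept for fabrics that don't
have any.  All of these names are made up:

    typedef struct CRP2PInterface {
        /* One-shot init/fini instead of per-peer connect/accept. */
        int  (*init)(int myNode, int numNodes);
        void (*fini)(void);

        /* Send one message to each node in nodes[]; no streams, no
         * ordering beyond what the fabric itself guarantees. */
        int (*send)(const int *nodes, int numNodes,
                    const void *buf, unsigned int len);

        /* Wait up to timeoutMs for a message from any node; returns
         * the sender's node id, or -1 on timeout. */
        int (*recv)(void *buf, unsigned int maxLen, int timeoutMs);
    } CRP2PInterface;

The tcp/sdp/udp, GM, teac, and MPI layers would each fill in one of
these, and the connection-oriented behavior the rest of the code expects
could be emulated once above this interface instead of reinvented inside
every layer.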