From: Kevin P. <kt...@ie...> - 2003-05-15 03:25:01
Eric Barton wrote:

>Guys,
>
>I've got a few observations I'd like to share and get feedback on.
>
>1/ I think MD handles (and if them, why not all) need to be "single shot"
>   in implementations that purport to be thread-safe.
>
>   This is needed to ensure that PtlMDUnlink() can only unlink the intended
>   MD when it is racing with an incoming message that could unlink the same
>   MD. If PtlMDUnlink() loses the race, the caller should get PTL_INV_MD.
>   However if the handle happens to get re-used (say by a concurrent thread
>   doing PtlMDAttach()), completely the wrong thing will happen!
>
>   If you concur, it might be a good idea to point this out in the Portals
>   spec so implementers get the picture.
>

I agree. This is something we didn't think of when trying to make
Portals thread safe. Or at least I didn't think of it. By "single shot"
you mean adding a random number or generation count to each handle,
right?

>
>2/ I _do_ believe there is a need to allow PtlMDUnlink() to apply to the
>   MDs passed to PltMDGet() and PtlMDPut(PTL_ACK_REQ). We can't bound the
>   time that memory is "exposed" to the network otherwise. This has been
>   implemented in sandiaportals, but not documented in the spec.
>

I'm not sure about this one. Unless the target is down, a PtlPut() or
PtlGet() should complete pretty quickly. If the target is down, the END
event (in Portals 3.2/3.3) will eventually happen and indicate an error.
When the END event happens is implementation-specific, but it will
probably be after a longish timeout, when the message's retry limit is
reached.

>
>3/ Why bother with specifying error return codes that _always_ mean the
>   programmer screwed up, rather than there was some resource shortage or
>   a lost race?
>
>   For example it's highly unlikely there are any real programs that test
>   for PTL_NOINIT, and I bet most Portals implementations break if this has
>   to work "under fire" (i.e. comms racing with interfaces being brought up
>   and down and the RC is being used to determine what state it's in).
>
>   Why not core dump instead so (a) crap programmers get to know that they
>   _have_ screwed up and where, and (b) decent programmers don't have to
>   clutter their programs with unnecessary conditionals.
>
>   Maybe this is a bit tongue in cheek, but some of the advertised error
>   codes can't be tested for efficiently, and if we did this, we could get
>   the number of return codes down to 3 or 4!
>
>

This sounds logical to me. It's my understanding that Portals provides
low-level building blocks that library writers can use to make
higher-level, and friendlier, communication libraries (MPICH, Lustre,
etc.). Core dumping gets the message across and is easier to implement
than things like returning PTL_NOINIT and handling PtlFini() and
PtlNIFini() correctly.

The Portals 3.3 draft that Ron is preparing contains the following
section:

3.3 Return Codes

The API specifies return codes that indicate success or failure of a
function call. In the case where the failure is due to invalid arguments
being passed into the function, the exact behavior of an implementation
is undefined. The API suggests error codes that provide more detail
about specific invalid parameters, but an implementation is not required
to return these specific error codes. For example, an implementation is
free to allow the caller to fault when given an invalid address, rather
than return PTL_SEGV.

In addition, an implementation is free to map these return codes to
standard return codes where appropriate. For example, a Linux
kernel-space implementation may want to map Portals return codes to
POSIX-compliant return codes.
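
To make the generation-count reading of "single shot" handles from point 1 concrete, here is a minimal sketch. Everything in it is an assumption for illustration: md_handle_t, md_slot_t, md_alloc(), md_unlink(), MD_OK and MD_INVALID are hypothetical names, not part of the Portals API or of any existing implementation, and MD_INVALID only stands in for PTL_INV_MD. The sketch is also single-threaded; a thread-safe implementation would still need a lock or atomic operations around the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MD_SLOTS   4   /* hypothetical table size */
#define MD_OK      0   /* stand-in for PTL_OK */
#define MD_INVALID 1   /* stand-in for PTL_INV_MD */

/* Hypothetical handle: a slot index plus the generation it was issued under. */
typedef struct {
    uint32_t index;
    uint32_t generation;
} md_handle_t;

/* Hypothetical table entry backing one MD. */
typedef struct {
    uint32_t generation;   /* bumped every time the slot is released */
    int      in_use;
} md_slot_t;

static md_slot_t md_table[MD_SLOTS];

/* Claim the first free slot and stamp the handle with its generation. */
static int md_alloc(md_handle_t *h)
{
    assert(h != NULL);
    for (uint32_t i = 0; i < MD_SLOTS; i++) {
        if (!md_table[i].in_use) {
            md_table[i].in_use = 1;
            h->index = i;
            h->generation = md_table[i].generation;
            return MD_OK;
        }
    }
    return MD_INVALID;     /* table full */
}

/* Unlink only if the handle still names the MD it was issued for. */
static int md_unlink(md_handle_t h)
{
    if (h.index >= MD_SLOTS)
        return MD_INVALID;

    md_slot_t *s = &md_table[h.index];
    if (!s->in_use || s->generation != h.generation)
        return MD_INVALID; /* stale handle: the caller lost the race */

    s->in_use = 0;
    s->generation++;       /* every outstanding copy of h is now stale */
    return MD_OK;
}
```

With this layout, a caller that loses the race gets MD_INVALID even when a concurrent attach has already reused the same slot, because the reused slot carries a newer generation than the stale handle. A 32-bit counter can in principle wrap around, so this narrows the window rather than closing it completely.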