From: Olivier T. <ta...@us...> - 2012-12-10 21:32:14
Josh,

This is indeed a subtle point. The X10 runtime is a cooperative one, so today, if every thread in a place is running a long sequential computation, there is no opportunity for the runtime to process incoming or outgoing network messages. Moreover, a long message typically cannot be sent in full in one step (the details depend on the transport). So we have two choices here: either block until a long send is complete, or stage a long send in multiple chunks. I believe we support both options (at least for PAMI and sockets). For sockets, the behavior is controlled with the env var X10_NOWRITEBUFFER; for PAMI, it is different of course... X10RT_PAMI_BLOCKING_SEND. The default is not to block (I suspect MPI always blocks, but I have not checked).

The rationale for this default is that it typically gives you more communication-computation overlap "for free": the computation can resume even while the send is still in progress. The drawback is what you have observed. The communication is automatically broken into multiple steps, but the computation is not. With long computations *and* long sends, we do not get a good interleaving of chunks of computation and communication for free; we end up serializing computation at the sender and receiver place(s). The blocking alternative avoids this problem by splitting neither the computation nor the communication, so the ordering is closer to the programmer's intuition.

But there is a better fix (still assuming a purely cooperative runtime). By periodically calling Runtime.x10rtProbe() in the body of the long computation, you ensure that the transport gets the opportunity to send more chunks of the long messages; see the first sketch below. We have thought about using the compiler to inject these probes automatically (yield points) but have not yet decided to implement this feature. We could also abandon our cooperative paradigm and permit the runtime to preempt user threads, or give it threads of its own to handle such asynchronous services. Again, we have not committed to spending the time to implement such features yet.

Typically one can also restructure the problematic loop into two separate pieces: a first loop spreading the data and a second loop computing on it, with a blocking finish around the first loop, hence locally disabling the problematic overlap of communication and computation (second sketch below).
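For your reduction example, the probing could look something like this inside your collecting finish (untested sketch; the probe interval of 10000 iterations is arbitrary and would need tuning):

    var iter:Int = 0;
    for (var i:Int = here.id; i < n; i += Place.MAX_PLACES) {
        offer largeArray(i);
        iter++;
        // Every so often, let the runtime poll the transport so that the
        // remaining chunks of any pending long sends can go out.
        if (iter % 10000 == 0) Runtime.x10rtProbe();
    }

On place 0 this lets the worker thread interleave finishing its own share of the work with pushing out the buffered closure bytes, instead of doing one and then the other.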
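And here is a sketch of one way to do the restructuring for your example (untested; it uses a PlaceLocalHandle to do the spreading instead of an explicit first loop, since PlaceLocalHandle.make is itself a blocking collective and so plays the role of the blocking finish around the first loop):

    // The initializer captures largeArray, so this one call ships the
    // array to every place and blocks until all the transfers are done.
    val localArray = PlaceLocalHandle.make[Array[Double]](PlaceGroup.WORLD, () => largeArray);

    // The 'at async' closures below capture only the handle, so the
    // messages are small and every place starts computing promptly.
    val sum = finish(new Reducible.SumReducer[Double]()) {
        for (place in Place.places()) at(place) async {
            val a = localArray();
            for (var i:Int = here.id; i < n; i += Place.MAX_PLACES) {
                offer a(i);
            }
        }
    };

Place 0 already has largeArray locally, so the handle simply aliases it there; the important part is that all of the data movement finishes before any place starts its long computation.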
Olivier

David Cunningham/Watson/IBM@IBMUS wrote on 12/10/2012 12:07:49 PM:

> From: David Cunningham/Watson/IBM@IBMUS
> To: X10 core design <x10...@li...>
> Cc: "taw...@an..." <taw...@an...>, X10 core design <x10-co...@li...>
> Date: 12/10/2012 12:12 PM
> Subject: Re: [X10-core] sockets backend serializes places in parallel 'at async'
>
> Try setting X10_NOWRITEBUFFER=1
>
> Josh Milthorpe <jos...@an...> wrote on 12/10/2012 06:11:18 AM:
>
> > From: Josh Milthorpe <jos...@an...>
> > To: X10 core design <x10...@li...>
> > Cc: "taw...@an..." <taw...@an...>
> > Date: 12/10/2012 06:11 AM
> > Subject: [X10-core] sockets backend serializes places in parallel 'at async'
> >
> > Hi,
> >
> > we came across an interesting 'gotcha' where the sockets implementation
> > of X10RT serializes the places in a parallel 'at async' construction.
> > We run the same (fairly large) closure at all places:
> >
> >     for (place in Place.places()) at(place) async {
> >         // body
> >     }
> >
> > We found that for a certain code, when run with X10_NTHREADS=1, Place 0
> > would run 'body' to completion, and then the other places would run.
> >
> > It seems that if 'body' has a large enough environment, the sockets
> > backend can't send it all to another place in a single write. It
> > therefore saves the remaining data to be sent later (x10rt_sockets.cc
> > nonBlockingWrite 265--306) and continues on with other useful work,
> > which in this case is to complete 'body'. Once 'body' is completed at
> > Place 0, leaving the worker thread idle, it sends the pending data to the
> > other places, which then start their portions of the work.
> >
> > The attached example code (which implements a rather silly parallel sum
> > over an array of 1M elements) demonstrates the problem:
> >
> >     val largeArray = new Array[Double](n, (a:Int) => a as Double);
> >     val sum = finish(new Reducible.SumReducer[Double]()) {
> >         for (place in Place.places()) at(place) async {
> >             Console.OUT.println("starting at " + here);
> >             Console.OUT.flush();
> >             for (var i:Int = here.id; i < n; i += Place.MAX_PLACES) {
> >                 offer largeArray(i);
> >             }
> >             Console.OUT.println("done at " + here);
> >             Console.OUT.flush();
> >         }
> >     };
> >
> > When run over sockets with a small closure environment (e.g. array size
> > 50000), all places run in parallel as expected, and we observe a
> > parallel speedup. When run with a larger environment (default array
> > size of 1M) we observe parallel slowdown for 2 places, and it is
> > apparent that place 0 runs to completion before place 1:
> >
> > starting at Place(0)
> > done at Place(0)
> > starting at Place(1)
> > done at Place(1)
> >
> > When compiled with -x10rt mpi, we observe parallel speedup for 2 places
> > even for large array sizes.
> >
> > This is a kind of priority inversion, where the high priority task
> > (sharing the work among places) has to wait for the completion of a
> > lower priority task (completing the portion of the work assigned to this
> > place). Is there anything that can be done in this case to allow the
> > nonBlockingWrite to continue in parallel with other work? Or should this
> > just be documented as a 'gotcha' for the sockets version of X10RT?
> >
> > Many thanks,
> >
> > Josh
> >
> > [attachment "TestLargeAt.x10" deleted by David Cunningham/Watson/IBM]