[X10-core] sockets backend serializes places in parallel 'at async'


Hi,

We came across an interesting 'gotcha' where the sockets implementation 
of X10RT serializes the places in a parallel 'at async' construction.  
We run the same (fairly large) closure at all places:

for(place in Place.places()) at(place) async {
     // body
}

We found that for certain code run with X10_NTHREADS=1, Place 0 
would run 'body' to completion, and only then would the other places run.

It seems that if 'body' has a large enough environment, the sockets 
backend can't send it all to another place in a single write.  It 
therefore saves the remaining data to be sent later (x10rt_sockets.cc, 
nonBlockingWrite, lines 265--306) and continues with other useful work, 
which in this case is to complete 'body'.  Once 'body' is completed at 
Place 0 leaving the worker thread idle, it sends the pending data to the 
other places, which then start their portions of the work.

The attached example code (which implements a rather silly parallel sum 
over an array of 1M elements) demonstrates the problem:

         val largeArray = new Array[Double](n, (a:Int)=> a as Double);
         val sum = finish(new Reducible.SumReducer[Double]()) {
             for(place in Place.places()) at(place) async {
                 Console.OUT.println("starting at " + here);
                 Console.OUT.flush();
                 for (var i:Int=here.id; i<n; i+=Place.MAX_PLACES) {
                     offer largeArray(i);
                 }
                 Console.OUT.println("done at " + here);
                 Console.OUT.flush();
             }
         };

When run over sockets with a small closure environment (e.g. array size 
50000), all places run in parallel as expected, and we observe a 
parallel speedup.  When run with a larger environment (default array 
size of 1M) we observe a parallel slowdown for 2 places, and it is 
apparent that place 0 runs to completion before place 1 starts:

starting at Place(0)
done at Place(0)
starting at Place(1)
done at Place(1)

When compiled with -x10rt mpi, we observe parallel speedup for 2 places 
even for large array sizes.

This is a kind of priority inversion, where the high priority task 
(sharing the work among places) has to wait for the completion of a 
lower priority task (completing the portion of the work assigned to this 
place).  Is there anything that can be done in this case to allow the 
nonBlockingWrite to continue in parallel with other work? Or should this 
just be documented as a 'gotcha' for the sockets version of X10RT?

Many thanks,

Josh
