From: Josh M. <jos...@an...> - 2012-12-10 11:11:30
Hi,

we came across an interesting 'gotcha' where the sockets implementation of X10RT effectively serializes execution across places in a parallel 'at async' construct. We run the same (fairly large) closure at all places:

    for(place in Place.places()) at(place) async {
        // body
    }

We found that for one particular code, when run with X10_NTHREADS=1, Place 0 would run 'body' to completion, and only then would the other places run.

It seems that if 'body' has a large enough environment, the sockets backend can't send it all to another place in a single write. It therefore saves the remaining data to be sent later (nonBlockingWrite in x10rt_sockets.cc, lines 265-306) and continues with other useful work, which in this case is to complete 'body'. Once 'body' has completed at Place 0, leaving the worker thread idle, it sends the pending data to the other places, which then start their portions of the work.

The attached example code (which implements a rather silly parallel sum over an array of 1M elements) demonstrates the problem:

    val largeArray = new Array[Double](n, (a:Int)=> a as Double);
    val sum = finish(new Reducible.SumReducer[Double]()) {
        // largeArray is captured by the closure, so it is sent to every place
        for(place in Place.places()) at(place) async {
            Console.OUT.println("starting at " + here);
            Console.OUT.flush();
            for (var i:Int=here.id; i<n; i+=Place.MAX_PLACES) {
                offer largeArray(i);
            }
            Console.OUT.println("done at " + here);
            Console.OUT.flush();
        }
    };

When run over sockets with a small closure environment (e.g. array size 50000), all places run in parallel as expected, and we observe a parallel speedup. When run with a larger environment (default array size of 1M) we observe a parallel slowdown for 2 places, and it is apparent that place 0 runs to completion before place 1 starts:

    starting at Place(0)
    done at Place(0)
    starting at Place(1)
    done at Place(1)

When compiled with -x10rt mpi, we observe parallel speedup for 2 places even for large array sizes.
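To make the suspected mechanism concrete, here is a toy model of it in Python. It is only a sketch of the behaviour described above, not the X10RT code: the function name simulate, the 1 MiB SOCKET_BUFFER limit and the 8 bytes per Double are all assumptions chosen for illustration, and the remote places' "done" events are not modelled.

```python
from collections import deque

# Hypothetical limit on what a single non-blocking write accepts.
SOCKET_BUFFER = 1024 * 1024


def simulate(closure_bytes, n_places=2):
    """Return the event order seen when place 0 has a single worker thread."""
    events = []
    pending = deque()  # (place, bytes still unsent) for partially written messages

    # Place 0 walks the loop and tries to ship the closure to each remote place.
    for place in range(1, n_places):
        unsent = closure_bytes - min(closure_bytes, SOCKET_BUFFER)
        if unsent == 0:
            # The whole message went out at once, so the remote place can start.
            events.append(f"starting at Place({place})")
        else:
            # Partial write: the remainder is deferred until the worker is idle.
            pending.append((place, unsent))

    # The lone worker now runs its own share of the work to completion.
    events.append("starting at Place(0)")
    events.append("done at Place(0)")

    # Only once the worker is idle are the deferred writes drained.
    while pending:
        place, unsent = pending.popleft()
        while unsent > 0:
            unsent -= min(unsent, SOCKET_BUFFER)
        events.append(f"starting at Place({place})")

    return events


if __name__ == "__main__":
    for n_elements in (50_000, 1_000_000):
        # Assumes 8 bytes per Double and ignores serialization overhead.
        print(n_elements, simulate(n_elements * 8))
```

In this model the small environment fits in one write, so Place 1 starts before Place 0 finishes. The large one leaves a remainder that is flushed only after Place 0 is done, which matches the ordering in the output above.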
This is a kind of priority inversion, where the high priority task (sharing the work among places) has to wait for the completion of a lower priority task (completing the portion of the work assigned to this place).

Is there anything that can be done in this case to allow the nonBlockingWrite to continue in parallel with other work? Or should this just be documented as a 'gotcha' for the sockets version of X10RT?

Many thanks,

Josh