From: David C. <dc...@us...> - 2012-12-10 17:08:34
Try setting X10_NOWRITEBUFFER=1

Josh Milthorpe <jos...@an...> wrote on 12/10/2012 06:11:18 AM:

> From: Josh Milthorpe <jos...@an...>
> To: X10 core design <x10...@li...>
> Cc: "taw...@an..." <taw...@an...>
> Date: 12/10/2012 06:11 AM
> Subject: [X10-core] sockets backend serializes places in parallel 'at async'
>
> Hi,
>
> we came across an interesting 'gotcha': the sockets implementation
> of X10RT serializes the places in a parallel 'at async' construction.
> We run the same (fairly large) closure at all places:
>
>     for (place in Place.places()) at (place) async {
>         // body
>     }
>
> We found that, for certain code run with X10_NTHREADS=1, Place 0
> would run 'body' to completion, and only then would the other places run.
>
> It seems that if 'body' has a large enough environment, the sockets
> backend can't send it all to another place in a single write. It
> therefore saves the remaining data to be sent later (x10rt_sockets.cc,
> nonBlockingWrite, lines 265-306) and continues with other useful work,
> which in this case is to complete 'body'. Only once 'body' has completed
> at Place 0, leaving the worker thread idle, does it send the pending data
> to the other places, which then start their portions of the work.
>
> The attached example code (which implements a rather silly parallel sum
> over an array of 1M elements) demonstrates the problem:
>
>     val largeArray = new Array[Double](n, (a:Int) => a as Double);
>     val sum = finish (new Reducible.SumReducer[Double]()) {
>         for (place in Place.places()) at (place) async {
>             Console.OUT.println("starting at " + here);
>             Console.OUT.flush();
>             for (var i:Int = here.id; i < n; i += Place.MAX_PLACES) {
>                 offer largeArray(i);
>             }
>             Console.OUT.println("done at " + here);
>             Console.OUT.flush();
>         }
>     };
>
> When run over sockets with a small closure environment (e.g. array size
> 50000), all places run in parallel as expected, and we observe a
> parallel speedup.
> When run with a larger environment (the default array size of 1M) we
> observe a parallel slowdown for 2 places, and it is apparent that
> place 0 runs to completion before place 1:
>
>     starting at Place(0)
>     done at Place(0)
>     starting at Place(1)
>     done at Place(1)
>
> When compiled with -x10rt mpi, we observe parallel speedup for 2 places
> even for large array sizes.
>
> This is a kind of priority inversion, where the high-priority task
> (sharing the work among places) has to wait for the completion of a
> lower-priority task (completing the portion of the work assigned to
> this place). Is there anything that can be done in this case to allow
> the nonBlockingWrite to continue in parallel with other work? Or should
> this just be documented as a 'gotcha' for the sockets version of X10RT?
>
> Many thanks,
>
> Josh
>
> [attachment "TestLargeAt.x10" deleted by David Cunningham/Watson/IBM]
>
> _______________________________________________
> X10-core mailing list
> X10...@li...
> https://lists.sourceforge.net/lists/listinfo/x10-core