Re: [Py4j-users] Using Py4J with parallel programming
From: Barthelemy D. <ba...@cs...> - 2012-03-09 11:23:52
Hi Stephen,

Py4J cannot start a JVM itself, but there are many workarounds. If the JVM is on the same machine, you could use Python to start the JVM (with the subprocess or popen2 module… or better yet, the envoy library). Otherwise, if your JVM lives on another machine, you could just create a small Java program that accepts py4j requests and starts other JVMs.

This is a feature requested by many users, so I may try to implement something that eases this task in the future, but the way to start a JVM is so different in each case that I'm not sure how to make it generic enough…

Barthélémy

On 2012-03-07, at 2:26 PM, Stephen Molloy wrote:

> Hi Barthélémy,
> Your pointers were just what I needed to read, thanks.
>
> The problem, as you suggested, was that I was trying to use the same objects in each process, and so the multiprocessing module was trying to pickle everything to properly share it. I rewrote the code so that each process communicates with its own dedicated JVM server so as to maximise the isolation between them, and everything is working perfectly now.
>
> On a related note, is it possible to start the JVM server from within Python, and close it again when I am finished with it? If so, this would make it simpler for my colleagues to use the code, since I wouldn't have to give them instructions on how to start and stop the necessary servers.
>
> Many thanks for your help.
>
> Steve
>
> On 7 Mar 2012, at 14:04, Barthelemy Dagenais wrote:
>
>> Hi Stephen,
>>
>> Here are a few pointers:
>>
>> 1. You should be able to use multiprocessing with py4j (multi-threading certainly works). Just make sure that you create a new JavaGateway instance in each process (don't reuse an instance that was created outside the process) and that you do not share JavaObject instances (objects returned by the Java side) across processes. I think you may be doing this in your code.
>> Essentially, when you share an instance across processes, you end up sharing a socket, which is tricky and probably won't work with the way sockets are initialized by py4j.
>>
>> 2. __getnewargs__ is usually used by Python to pickle an object. If I remember correctly, pickle is used by multiprocessing to transfer/share objects across processes.
>>
>> 3. If you really want to "share" an object, you should probably save it somewhere on the Java side and request the object in each Python process.
>>
>> Hope this helps,
>> Barthélémy
>>
>> On 2012-03-06, at 5:37 PM, Stephen Molloy wrote:
>>
>>> Hi all,
>>> I have a working implementation of some Java code I inherited, wrapped up in Python using Py4J. Due to the long time for the calculations to complete, and the embarrassingly parallel nature of my problem, I am trying to use the Python multiprocessing module to speed things up.
>>>
>>> In particular, I create a pool of processes, and then submit jobs to them. Each job is almost identical: it uses Py4J to create a new Java object in the JVM server and performs the necessary calculations. The problem is that it fails very quickly inside a Py4J call.
>>>
>>> There is a lot of code involved in this, but the following is the function that fails. The class "AJDiskAnalysis" contains the call to the JVM server that you can see in the klysOutputPower() function.
>>>
>>> def klysOutputPower(Pin, AJDiskAnalObj):
>>>     """Returns the output power of the klystron for a given input power."""
>>>     AJDiskAnalObj.klysanal.setInputPower(float(Pin))
>>>     AJDiskAnalObj.simulate(disks=20)
>>>     return AJDiskAnalObj.klysanal.getSimulatedOutputPower()
>>>
>>> def powerScan(Pin, filename):
>>>     Pout, result, AJobj = [], [], []
>>>     pool = mp.Pool(processes=12)
>>>     for power in Pin:
>>>         AJobj.append(AJDiskAnalysis(DSKParsObj=DSKParser(fname=filename)))
>>>         result.append(pool.apply_async(klysOutputPower, (power, AJobj[-1])))
>>>     for i in result:
>>>         Pout.append(i.get())
>>>     return Pout
>>>
>>> if __name__ == "__main__":
>>>     Pin = arange(0, 150, 10)
>>>     Pout = powerScan(Pin, "ess-new-704.dsk")
>>>
>>> This results in the following error:
>>>
>>> Exception in thread Thread-1:
>>> Traceback (most recent call last):
>>>   File "/usr/lib64/python2.6/threading.py", line 532, in __bootstrap_inner
>>>     self.run()
>>>   File "/usr/lib64/python2.6/threading.py", line 484, in run
>>>     self.__target(*self.__args, **self.__kwargs)
>>>   File "/usr/lib64/python2.6/multiprocessing/pool.py", line 225, in _handle_tasks
>>>     put(task)
>>>   File "/usr/lib/python2.6/site-packages/py4j-0.7-py2.6.egg/py4j/java_gateway.py", line 432, in __call__
>>>     self.target_id, self.name)
>>>   File "/usr/lib/python2.6/site-packages/py4j-0.7-py2.6.egg/py4j/protocol.py", line 271, in get_return_value
>>>     raise Py4JError('An error occurred while calling %s%s%s. Trace:\n%s\n' % (target_id, '.', name, value))
>>> Py4JError: An error occurred while calling o19.__getnewargs__.
>>> Trace:
>>> py4j.Py4JException: Method __getnewargs__([]) does not exist
>>>     at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:346)
>>>     at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:355)
>>>     at py4j.Gateway.invoke(Gateway.java:247)
>>>     at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:124)
>>>     at py4j.commands.CallCommand.execute(CallCommand.java:81)
>>>     at py4j.GatewayConnection.run(GatewayConnection.java:175)
>>>     at java.lang.Thread.run(Thread.java:662)
>>>
>>> Is there some problem that prevents Py4J being used alongside the multiprocessing module, or is there something I am doing wrong?
>>> I would be very happy for any hints or tips, and will happily provide any more information if it is needed.
>>>
>>> Steve
>>>
>>> _______________________________________________
>>> Py4j-users mailing list
>>> Py4...@li...
>>> https://lists.sourceforge.net/lists/listinfo/py4j-users
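
The same-machine workaround from the reply at the top of this thread (start the JVM from Python with subprocess, use it, then stop it) could be sketched roughly as below. This is only an illustration under stated assumptions, not a Py4J feature: it assumes Python 3 and a py4j release that provides GatewayParameters, and the java command line, the jar names, the main class MyGatewayApp, and its port argument are all hypothetical placeholders for whatever launches your own GatewayServer. The helper names (managed_server, wait_for_port, jvm_command, run_with_own_jvm) are invented for this sketch.

```python
import contextlib
import socket
import subprocess
import time


@contextlib.contextmanager
def managed_server(command):
    """Start `command` as a child process and always stop it on exit."""
    proc = subprocess.Popen(command)
    try:
        yield proc
    finally:
        if proc.poll() is None:           # still running: ask it to exit
            proc.terminate()
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()               # did not exit in time: force it
                proc.wait()


def wait_for_port(port, host="127.0.0.1", timeout=10.0):
    """Return True once something accepts TCP connections on host:port."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=0.5):
                return True
        except OSError:
            time.sleep(0.1)               # server not listening yet: retry
    return False


def jvm_command(port):
    # HYPOTHETICAL command line: replace the classpath and main class with
    # whatever starts a py4j GatewayServer on `port` in your own project.
    return ["java", "-cp", "py4j.jar:myapp.jar", "MyGatewayApp", str(port)]


def run_with_own_jvm(port, job):
    """Start a dedicated JVM, run job(gateway) against it, then stop it.

    Call this inside each worker process (with a distinct port per worker)
    so that no JavaGateway or JavaObject is ever pickled across processes.
    """
    # Imported here so the helpers above work without py4j installed.
    from py4j.java_gateway import JavaGateway, GatewayParameters

    with managed_server(jvm_command(port)):
        if not wait_for_port(port):
            raise RuntimeError("JVM did not open port %d" % port)
        gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port))
        try:
            return job(gateway)
        finally:
            gateway.close()
```

With something like this, colleagues would not need separate instructions for starting and stopping the servers: each worker calls run_with_own_jvm(port, job), where job receives a gateway created inside that process and should return only plain Python values, not Java objects, so the result can be sent back through multiprocessing.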