I was hoping to bring good news but, erm, you know, it's a bug squasher's life...
I don't think the nodes are correctly detecting whether they are still connected to the driver, or, if they are, they hang somewhere. My programs block the nodes and I can't find out why... The node logs are empty.
On the driver side I get 0 nodes (of 40) after I restart the driver, and the nodes never connect back again. The log is along these lines:
2006-04-28 18:45:45,224 [ERROR][org.jppf.server.node.JPPFNodeServer.exec(307)]: Connection reset by peer
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcher.read0(Native Method)
2006-04-28 18:45:45,264 [ERROR][org.jppf.server.JPPFNIOServer.go(137)]: java.nio.DirectByteBuffer
I can zip the whole log and send it if you want.
Well, that's all my feedback for now...
Could you open a bug and attach the whole log in a zip file?
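To make the expected behaviour concrete, here is a minimal sketch of a node-side loop that keeps retrying after the driver goes away. This is not JPPF's actual code: the class name `Reconnector`, the method `connectWithRetry` and its parameters are all made up for illustration, and the connection attempt is passed in as a `Callable` so the retry logic can be tried without a real driver.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of JPPF.
public class Reconnector {
    /**
     * Calls attempt until it succeeds or maxAttempts is reached,
     * sleeping delayMillis between failed attempts. Only IOExceptions
     * are retried; the last one is rethrown when attempts run out.
     */
    static <T> T connectWithRetry(Callable<T> attempt, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                return attempt.call();
            } catch (IOException e) {
                last = e;
                if (i < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated driver: refuses the first two attempts, accepts the third.
        String result = connectWithRetry(() -> {
            if (++calls[0] < 3) {
                throw new IOException("Connection reset by peer");
            }
            return "connected";
        }, 5, 10L);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "connected after 3 attempts"
    }
}
```

With a real connection the attempt could be something like `() -> new java.net.Socket(host, port)`, where `host` and `port` are placeholders for the driver address. A node that never comes back after a driver restart would correspond to this loop giving up too early or never being entered at all.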
Got a new one...
JPPF 0.16.0 seems to be much more stable than the previous versions, and copes much better with unstable nodes.
The matrix example runs fine, but I'm getting strange behaviour with my own classes. All nodes, clients and driver are 0.16.
1) all the tasks seem to be uploaded to the driver.
2) all the tasks seem to be downloaded by the nodes.
3) the tasks never return. (They worked fine with the previous JPPF 0.13, with all the other bugs. Has anything changed as far as the client code is concerned?)
4) the nodes apparently receive the tasks (or not, but the driver tags the bundles as 'downloaded') but immediately go offline and come online again, with an EOF exception (the task doesn't read any file or input, except for what is in the dataprovider)
5) the driver keeps throwing this exception:
Any thoughts?
Strange, I thought we'd got rid of that problem with version 0.16.0. My problem is that it's very difficult to reproduce.
Could you send over some sample task code that reproduces the problem? Nothing confidential of course.
The way I see it, the nodes go offline because they probably detect their code is outdated, so they update it through the network classloader. The update includes the code that handles the socket connection with the server, which probably explains why they have to go "offline" then online again.
I know this isn't very efficient; I'm working on a way to do the update before a task bundle is first sent to the nodes. They'd still have to disconnect and reconnect, but at least the tasks wouldn't have to be resubmitted to the queue.
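The idea of updating before the first bundle is sent can be sketched roughly as follows. This is only an illustration under stated assumptions, not how JPPF does or will do it: `UpdateBeforeDispatch`, `Node`, `codeVersion` and `nextBundleFor` are hypothetical names, bundles are plain strings, and the "update" is reduced to bumping a version number. The point it shows is that a stale node gets no bundle, so nothing has to be put back on the queue after it reconnects.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical model of a driver-side queue, not JPPF code.
public class UpdateBeforeDispatch {
    /** Stand-in for a connected node; it only tracks its code version. */
    static class Node {
        int codeVersion;
        Node(int codeVersion) { this.codeVersion = codeVersion; }
    }

    private final Deque<String> queue = new ArrayDeque<>();
    private final int currentVersion;

    UpdateBeforeDispatch(int currentVersion) { this.currentVersion = currentVersion; }

    void submit(String bundle) { queue.addLast(bundle); }

    int pending() { return queue.size(); }

    /**
     * Returns the next bundle for an up-to-date node. Returns null, and
     * leaves the queue untouched, if the node had to update first or if
     * there is nothing queued.
     */
    String nextBundleFor(Node node) {
        if (node.codeVersion != currentVersion) {
            // Simulates update + reconnect; a real node would reload its classes here.
            node.codeVersion = currentVersion;
            return null;
        }
        return queue.pollFirst();
    }

    public static void main(String[] args) {
        UpdateBeforeDispatch driver = new UpdateBeforeDispatch(16); // illustrative version numbers
        driver.submit("bundle-1");
        Node stale = new Node(13);
        System.out.println(driver.nextBundleFor(stale)); // prints "null": node updates first
        System.out.println(driver.pending());            // prints "1": bundle is still queued
        System.out.println(driver.nextBundleFor(stale)); // prints "bundle-1"
    }
}
```

The trade-off is one extra round trip for an outdated node before it does any work, in exchange for never losing a bundle to a node that is about to drop its connection.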