I had to hack __init__.py and change MsgTimeout = 30, as it was timing out while loading a large file from disk in the setup routine. Given that setup() is meant for long-ish procedures, perhaps more control over the timeout is a good idea? I thought pulse_interval would have helped here, but no cigar.
Last edit: Alexander Whillas 2015-07-11
I think the correct fix for this is in asyncoro, with the patch at https://github.com/pgiri/dispy/issues/19#issuecomment-107884541. That patch should be applied to /usr/local/lib/python2.7/dist-packages/asyncoro/__init__.py (under Linux for Python 2.7, for example).
Could you try that instead? Basically, send_msg / recv_msg have a timeout that applies to the whole message. If the message is too big (and/or the network is slow), the transfer can time out even though the message is still being sent. I debated what the right fix is, but was not sure how to proceed. If the timeout covers the whole message, as it does now, it is simple and in line with what one would expect a message timeout to mean. If the timeout works as in the patch mentioned above, then it applies to the transfer of data (i.e., the function fails only if no data at all is transferred within the given timeout).
pulse_interval is simply to check for faults, and it wouldn't help in this case.
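The difference between the two timeout semantics can be sketched with a plain socket pair (this is an illustration, not dispy's asyncoro socket code; names like send_in_chunks are made up). Resetting the timeout before each chunk turns it into an inactivity timeout for the transfer, instead of a deadline for the whole message:

```python
# Sketch of the per-chunk timeout semantics discussed above, using an
# ordinary socket pair rather than dispy/asyncoro sockets.
import socket
import threading

MSG_TIMEOUT = 5  # hypothetical per-chunk timeout, in seconds
CHUNK = 1024000  # 1MB chunks, matching dispy's xfer_file read size

def send_in_chunks(sock, data):
    sent = 0
    while sent < len(data):
        # Reset the timeout for every chunk: the send only fails if no
        # progress is made for MSG_TIMEOUT seconds, no matter how big
        # the full payload is.
        sock.settimeout(MSG_TIMEOUT)
        sock.sendall(data[sent:sent + CHUNK])
        sent += CHUNK

def demo():
    a, b = socket.socketpair()
    payload = b"x" * (3 * CHUNK)  # ~3MB, spanning several chunks
    received = []

    def reader():
        remaining = len(payload)
        while remaining:
            buf = b.recv(65536)
            received.append(buf)
            remaining -= len(buf)

    t = threading.Thread(target=reader)
    t.start()
    send_in_chunks(a, payload)
    t.join()
    a.close()
    b.close()
    return b"".join(received) == payload

print(demo())  # → True: all chunks arrive intact
```

With the whole-message semantics, the single settimeout call before the loop would have to cover the entire payload, which is what makes large files time out on slow links.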
What is the size of the file being loaded? There is no need to change __init__; dispynode has an option to set this (command-line option --msg_timeout), and at the client you can set it in the client program, e.g., with:
import dispy
dispy.MsgTimeout = 30
Last edit: Giridhar Pemmasani 2015-07-11
This is very helpful. I am using the setup method to transfer multiple 30MB files to the node, and a longer dispy.MsgTimeout is required. I then have several thousand computations use the files copied over ONCE.
Can you add the line sock.settimeout(MsgTimeout) at line number 414, in the function 'xfer_file' in __init__.py, on the client (no need to copy this file to the nodes, to ease testing)? Below is the edited patch so it shows correctly:
diff --git a/py2/dispy/__init__.py b/py2/dispy/__init__.py
index 0f16dcd..4be603b 100755
--- a/py2/dispy/__init__.py
+++ b/py2/dispy/__init__.py
@@ -411,6 +411,7 @@ class _Node(object):
                 data = fd.read(1024000)
                 if not data:
                     break
+                sock.settimeout(MsgTimeout)
                 yield sock.sendall(data)
             fd.close()
             resp = yield sock.recv_msg()
As mentioned above, the timeout currently applies to sending all of the data in a file. The fix above resets it every time 1MB of data is sent. I am surprised that it takes more than 10 seconds (the default MsgTimeout) to transfer 30MB. Is the network really slow?
Last edit: Giridhar Pemmasani 2016-01-20
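A back-of-the-envelope check (my own arithmetic, not dispy code) shows why the old whole-message semantics is fragile: the required MsgTimeout scales with file size over throughput, so a loaded or slow link quickly exceeds the default.

```python
# Rough estimate of the MsgTimeout needed to send one file whole under
# the old whole-message semantics. The 2x safety margin is arbitrary.
def required_timeout(file_bytes, bytes_per_sec, safety=2.0):
    """Seconds of MsgTimeout needed to send a file of file_bytes."""
    return safety * file_bytes / bytes_per_sec

# 30MB over a fast 10MB/s link vs. a saturated 1MB/s link:
fast = required_timeout(30e6, 10e6)  # → 6.0 seconds, within a 10s default
slow = required_timeout(30e6, 1e6)   # → 60.0 seconds, far beyond it
print(fast, slow)
```

This matches the report above: with 15 such files competing for bandwidth, only a couple complete before the timeout fires.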
If you would like to refer to this comment somewhere else in this project, copy and paste the following link:
Thanks, I have added the patch to my __init__.py file. Our network is fast; I just have 15 files of 30MB each to copy over, and only two files would copy before the timeout was reached.
For a little background on my use case:
The client and nodes all have access to a shared filesystem; however, the program that runs on the nodes performs much better if the input data is copied locally. I pass a list of absolute file paths to the setup function and use the shutil.copy2() function to perform the file copying.
Do you suggest I use a different method to copy the files to the nodes?
Attached files for Python 2.7+ should fix timeout problems when transferring files. Let me know how it goes.
If you have the files on a shared file system (say, NFS), the network is likely being saturated, with files being read over the network and sent to multiple nodes simultaneously. The timeout (default 5 seconds) is for a single file to be sent to a node and acknowledged by the node (actually, from after the last fragment of data is sent until the ack is received). Apparently, in your case it takes more than 5 seconds for one file (not all files combined).
If the files are on a shared file system that the nodes also have access to, you can avoid sending them altogether. You can either have the computations access the files directly (you may have to change the computations to do so), or you can set 'dest_path' to where the files are on the shared file system (however, dispynode will save job files etc. there, so the nodes need write access to that file system). I think the latter approach may be too painful; the first one is easy if you can update the computations.
I have committed the fix to github. It may be easier for you to get the updated files from there.
Thanks, I will pull the files from github. Do I need to replace these files on the client and the nodes?
Version 4.6.5 has been released. Let me know if this does / doesn't fix the timeout issue.
I am using py2.7 and upgraded to 4.6.5. To test the file copy, I started the node and client on the same machine, used JobCluster, and passed a file path to depends. I am not sure if there is a timing issue or a file size issue, but if the text file is larger than 10MB the compute method fails. If I decrease the file size to 9MB, it is transferred properly. Any ideas?
2016-01-22 20:18:25,671 - asyncoro - poller: IOCP
2016-01-22 20:18:25,687 - dispy - dispy client at :51347
2016-01-22 20:18:25,717 - dispy - Storing fault recovery information in "_dispy_20160122201825"
2016-01-22 20:18:25,717 - dispy - Pending jobs: 0
2016-01-22 20:18:25,721 - dispy - Pending jobs: 1
2016-01-22 20:18:25,723 - dispy - Discovered 192.168.56.1:51348 (JTERHUNE) with 4 cpus
2016-01-22 20:18:25,732 - dispy - Transfer of computation "compute" to 192.168.56.1 failed: -1
2016-01-22 20:18:25,733 - dispy - Failed to setup 192.168.56.1 for compute "compute": -1
2016-01-22 20:18:25,734 - dispy - Closing node 192.168.56.1 for compute / 1453515505717
With 4.6.5 the timeout (default 5 seconds) is for transferring up to 1MB of data, so a file size of 10MB vs 9MB shouldn't matter. Could there be other issues, such as disk space on the node? If you think the timeout is the problem, you can try increasing it (with dispy.MsgTimeout = 10 before creating the cluster). If it works with a bigger timeout, you could test how long it takes to copy the file by other means. You can also look at the debug log on the node, which may give clues.
Last edit: Giridhar Pemmasani 2016-01-23
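One way to "test how long it takes to copy the file by other means", as suggested above, is to time a plain copy of a same-sized file and compare against MsgTimeout. The paths here are illustrative temporaries:

```python
# Time a plain 10MB file copy as a baseline for the transfer timeout.
import os
import shutil
import tempfile
import time

def timed_copy(src, dst):
    start = time.time()  # works on py2.7; prefer time.monotonic on py3
    shutil.copy(src, dst)
    return time.time() - start

# Create a 10MB throwaway file, matching the failing size above:
fd, src = tempfile.mkstemp()
os.write(fd, b"\0" * (10 * 1024 * 1024))
os.close(fd)
dst = src + ".copy"

elapsed = timed_copy(src, dst)
size_ok = os.path.getsize(dst) == 10 * 1024 * 1024
print(size_ok, elapsed)
```

If a plain copy over the same path takes anywhere near the 5-second default, the timeout is the likely culprit; if it completes in a fraction of a second, look elsewhere (as turned out to be the case below).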
I am not sure that the issue is due to a timeout; if it were, you should see the error message 'Could not transfer "file" ...' in the log. Based on the log you posted, the problem seems to be in sending the computation, not depends. The message
Transfer of computation "compute" to 192.168.56.1 failed: -1
indicates the server node refused to accept the computation, likely because the file size is more than max_file_size (default of 10MB). Restart the node with --max_file_size 0 (or 100MB).
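This failure mode can be caught on the client before creating the cluster with a simple pre-flight check (my own suggestion, not part of dispy): verify that no dependency exceeds the node's max_file_size, so you get a clear message instead of a bare "failed: -1".

```python
# Pre-flight check: flag dependency files larger than the node limit.
import os
import tempfile

MAX_FILE_SIZE = 10 * 1024 * 1024  # dispynode's default limit, 10MB

def check_depends(paths, limit=MAX_FILE_SIZE):
    """Return the subset of paths whose size exceeds the node limit."""
    return [p for p in paths if os.path.getsize(p) > limit]

# Demonstration with throwaway files of 9MB and 11MB, matching the
# sizes that succeeded and failed above:
small = tempfile.NamedTemporaryFile(delete=False)
small.write(b"\0" * (9 * 1024 * 1024))
small.close()
big = tempfile.NamedTemporaryFile(delete=False)
big.write(b"\0" * (11 * 1024 * 1024))
big.close()

too_big = check_depends([small.name, big.name])
print(too_big == [big.name])  # → True: only the 11MB file trips the limit
```

If the check flags anything, restart the nodes with a larger --max_file_size (or 0 for no limit) before submitting jobs.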
I have attached an example script. Hopefully you can reproduce the issue I am having. I am on Windows 7 using Python 2.7.
I am using the debug mode of PyCharm, and it looks like it is failing before the file copying. I think I have tracked the problem down to the creation of the first coroutine at line 2152 of dispy/__init__.py.
Oops, ignore my last message. I am able to transfer large files with depends after setting --max_file_size 0 on the node.
Thanks!