
Setup timeout problem

2015-07-11
2016-01-23
  • Alexander Whillas

    I had to hack __init__.py and set MsgTimeout = 30 because it was timing out when loading a large file from disk in the setup routine. Given that setup() is supposed to be for long-ish procedures, perhaps more control over the timeout is a good idea? I thought pulse_interval would have done something here, but no cigar.

     

    Last edit: Alexander Whillas 2015-07-11
  • Giridhar Pemmasani

    I think the correct fix for this is in asyncoro with the patch https://github.com/pgiri/dispy/issues/19#issuecomment-107884541

    This patch should be applied to /usr/local/lib/python2.7/dist-packages/asyncoro/__init__.py under Linux for Python 2.7, for example.

    Could you try that instead? Basically, send_msg / recv_msg have a timeout that applies to the whole message. If the message is too big (and/or the network is slow), it can time out even though the message is still being sent. I debated what the right fix is, but was not sure how to proceed. If the timeout stays as it is now, it is simple and in line with what one would expect a message timeout to mean. If the timeout is as per the patch I mentioned above, then it applies to the transfer of data (i.e., the function fails only if no data at all is transferred within the given timeout).
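    The two timeout semantics discussed above can be sketched with plain blocking sockets from the Python 3 standard library. This is only an illustration, not asyncoro's implementation; the function names recv_whole_message_deadline and recv_idle_timeout are hypothetical.

```python
import socket
import time


def recv_whole_message_deadline(sock, nbytes, timeout):
    """Whole-message semantics: all nbytes must arrive within `timeout` seconds in total."""
    deadline = time.monotonic() + timeout
    received = bytearray()
    while len(received) < nbytes:
        remaining_time = deadline - time.monotonic()
        if remaining_time <= 0:
            raise socket.timeout("message not fully received before deadline")
        # Each recv() may only use whatever is left of the overall budget.
        sock.settimeout(remaining_time)
        data = sock.recv(nbytes - len(received))
        if not data:
            raise ConnectionError("peer closed the connection early")
        received.extend(data)
    return bytes(received)


def recv_idle_timeout(sock, nbytes, timeout):
    """Idle semantics: fail only if no data at all arrives for `timeout` seconds."""
    sock.settimeout(timeout)  # applies to each recv() call separately
    received = bytearray()
    while len(received) < nbytes:
        data = sock.recv(nbytes - len(received))
        if not data:
            raise ConnectionError("peer closed the connection early")
        received.extend(data)
    return bytes(received)
```

    With a slow but steady sender, the first function can fail on a large message while the second keeps going as long as some data arrives within each timeout window.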

    pulse_interval is simply to check for faults, and it wouldn't help in this case.

    What is the size of the file being loaded? There is no need to change __init__.py; dispynode has a command-line option to set this (--msg_timeout), and at the client you can set it in the client program, e.g., with

    import dispy
    dispy.MsgTimeout = 30
    
     

    Last edit: Giridhar Pemmasani 2015-07-11
  • Jason Terhune

    Jason Terhune - 2016-01-19

    This is very helpful. I am using the setup method to transfer multiple 30MB files to the node, and a longer dispy.MsgTimeout is required. I then have several thousand computations use the files copied over ONCE.

     
  • Giridhar Pemmasani

    Can you add line

                    sock.settimeout(MsgTimeout)
    

    at line number 414 in function 'xfer_file' in the __init__.py file on the client (no need to copy this file to nodes, to ease testing). Below is the edited patch so it shows correctly:

    diff --git a/py2/dispy/__init__.py b/py2/dispy/__init__.py
    index 0f16dcd..4be603b 100755
    --- a/py2/dispy/__init__.py
    +++ b/py2/dispy/__init__.py
    @@ -411,6 +411,7 @@ class _Node(object):
                         data = fd.read(1024000)
                         if not data:
                             break
    +                    sock.settimeout(MsgTimeout)
                         yield sock.sendall(data)
                     fd.close()
                     resp = yield sock.recv_msg()
    

    As mentioned above, the timeout currently applies to sending all the data in a file. The above fix resets it every time 1MB of data is sent. I am surprised that it takes more than 10 seconds (default MsgTimeout) to transfer 30MB. Is the network really slow?
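    For readers without the dispy source at hand, the shape of the fix can be sketched with a plain standard-library socket (Python 3). This is an assumption-laden illustration, not dispy's xfer_file: the function name send_file_chunked is hypothetical, and with standard sockets the timeout already applies to each sendall() call separately, so the per-chunk settimeout() call here only mirrors the patch.

```python
import socket


def send_file_chunked(sock, path, timeout, chunk_size=1024000):
    """Send a file in chunks; the timeout bounds each chunk, not the whole file.

    Returns the number of chunks sent.
    """
    chunks = 0
    with open(path, "rb") as fd:
        while True:
            data = fd.read(chunk_size)
            if not data:
                break
            # Re-arm the timeout before every chunk, as the patch does.
            sock.settimeout(timeout)
            sock.sendall(data)
            chunks += 1
    return chunks
```

    The total transfer time can then exceed the timeout as long as every individual chunk is sent within it.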

     

    Last edit: Giridhar Pemmasani 2016-01-20
  • Jason Terhune

    Jason Terhune - 2016-01-20

    Thanks, I have added the patch to my __init__.py file. Our network is fast. I just have 15 files of 30MB each to copy over, so only two files would copy before the timeout was reached.

    For a little background on my use case:
    The client and nodes all have access to a shared filesystem; however, the program that runs on the nodes performs much better if the input data is copied locally. I pass a list of absolute file paths to the setup function and use the shutil.copy2() function to perform the file copying.

    Do you suggest I use a different method to copy the files to the nodes?

     
  • Giridhar Pemmasani

    Attached files for Python 2.7+ should fix timeout problems when transferring files. Let me know how it goes.

    If you have files on a shared file system (say, NFS), the network is likely being saturated by files being read over the network and sent to multiple nodes simultaneously. The timeout (default 5 seconds) is for a single file to be sent to a node and the node acknowledging it (actually, for the acknowledgment to be received after the last fragment of data is sent). In your case it apparently takes more than 5 seconds for one file (not all files combined).

    If the files are on a shared file system that the nodes also have access to, you can avoid sending them if that is okay. You can either have computations access the files directly (you may have to change the computations to do so), or you can set 'dest_path' to where the files are on the shared file system (however, dispynode will save job files etc. there, so the nodes need write access to that file system). I think the latter approach may be too painful. The first one is easy if you can update the computations.

     
  • Giridhar Pemmasani

    I have committed the fix to github. It may be easier for you to get the updated files from there.

     
  • Jason Terhune

    Jason Terhune - 2016-01-22

    Thanks, I will pull the files from github. Do I need to replace these files on the client and the nodes?

     
  • Giridhar Pemmasani

    Version 4.6.5 has been released. Let me know if this does / doesn't fix the timeout issue.

     
  • Jason Terhune

    Jason Terhune - 2016-01-23

    I am using py2.7 and upgraded to 4.6.5. To test the file copy, I started the node and client on the same machine, used JobCluster and passed a file path to depends. I am not sure if there is a timing issue or a file size issue, but if the text file is larger than 10MB the compute method fails. If I decrease the file size to 9MB, it is transferred properly. Any ideas?

    2016-01-22 20:18:25,671 - asyncoro - poller: IOCP
    2016-01-22 20:18:25,687 - dispy - dispy client at :51347
    2016-01-22 20:18:25,717 - dispy - Storing fault recovery information in "_dispy_20160122201825"
    2016-01-22 20:18:25,717 - dispy - Pending jobs: 0
    2016-01-22 20:18:25,721 - dispy - Pending jobs: 1
    2016-01-22 20:18:25,723 - dispy - Discovered 192.168.56.1:51348 (JTERHUNE) with 4 cpus
    2016-01-22 20:18:25,732 - dispy - Transfer of computation "compute" to 192.168.56.1 failed: -1
    2016-01-22 20:18:25,733 - dispy - Failed to setup 192.168.56.1 for compute "compute": -1
    2016-01-22 20:18:25,734 - dispy - Closing node 192.168.56.1 for compute / 1453515505717
    
     
  • Giridhar Pemmasani

    With 4.6.5 the timeout (default 5 seconds) is for transferring up to 1MB of data, so a file size of 10MB vs 9MB shouldn't matter. Is it possible there are other issues, such as disk space on the node? If you think the timeout is the problem, you can try increasing it (with dispy.MsgTimeout = 10 before creating the cluster). If it works with a bigger timeout, then you could test how long it takes to copy the file by other means. You can also look at the debug log on the node, which may give clues.

     

    Last edit: Giridhar Pemmasani 2016-01-23
  • Giridhar Pemmasani

    I am not sure that the issue is due to the timeout; if it is, you should see the error message 'Could not transfer "file" ...' in the log. Based on the log you posted, the problem seems to be in sending the computation, not depends. The message

    Transfer of computation "compute" to 192.168.56.1 failed: -1
    

    indicates the server node refused to accept the computation, likely because the file size is more than max_file_size (default of 10MB). Restart the node with --max_file_size 0 (or 100MB).
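    A quick client-side check can flag files that would exceed the node's limit before a cluster is created. This is a hypothetical helper, not part of dispy; the default of 10 * 1024 * 1024 bytes is an assumption about what "10MB" means on the node, so pass the limit that matches your node's --max_file_size setting.

```python
import os


def files_over_limit(paths, limit_bytes=10 * 1024 * 1024):
    """Return (path, size) pairs for files larger than limit_bytes."""
    oversized = []
    for path in paths:
        size = os.path.getsize(path)
        if size > limit_bytes:
            oversized.append((path, size))
    return oversized
```

    An empty result means none of the listed files is larger than the assumed limit.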

     
  • Jason Terhune

    Jason Terhune - 2016-01-23

    I have attached an example script. Hopefully you can reproduce the issue I am having. I am on Windows 7 using Python 2.7.

    I am using the debug mode of PyCharm and it looks like it is failing before the file copying. I think I have tracked the problem down to the creation of the first coroutine on line 2152 of dispy/__init__.py

     
  • Jason Terhune

    Jason Terhune - 2016-01-23

    Oops. Ignore my last message. I am able to transfer large files with depends after setting --max_file_size 0 on the node.
    Thanks!

     
