
#15 dispyscheduler not reliable

v1.0_(example)
open
nobody
None
1
2015-09-28
2015-09-20
Anonymous
No

We're trying to use dispyscheduler to manage multiple jobs but it is locking up and refusing to allocate jobs after a day or two of successful operation. We have around eight VMs running Lubuntu 14.04 with dispy 4.5.5 and asyncoro 3.5. One VM issues multiple parallel jobs and the rest service them. A dispyscheduler log is attached:

bmduser@lubuntu-box:~$ dispyscheduler.py -d -i hostname -I --max_file_size 2G
2015-09-16 12:20:18,303 - dispyscheduler - dispyscheduler version 4.5
2015-09-16 12:20:18,305 - asyncoro - poller: epoll
2015-09-16 12:20:18,306 - dispyscheduler - tcp server at 192.168.16.69:51347
2015-09-16 12:20:18,306 - dispyscheduler - scheduler at 192.168.16.69:51349
2015-09-16 12:20:18,306 - dispyscheduler - Pending jobs: 0
2015-09-16 12:33:04,932 - dispyscheduler - New computation 1442370018305: compute, /tmp/dispy/scheduler/192.168.16.19/compute_g2lHUc
2015-09-16 12:33:04,934 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_g2lHUc/job.tar (145776640)
2015-09-16 12:33:07,935 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_g2lHUc/job.tar
2015-09-16 12:33:07,939 - dispyscheduler - Discovered 192.168.16.99:51348 (LubuntuVM) with 1 cpus
2015-09-16 12:33:07,942 - dispyscheduler - Pending jobs: 1
2015-09-16 12:33:09,707 - dispyscheduler - Pending jobs: 1
2015-09-16 12:33:09,707 - dispyscheduler - Pending jobs: 0
2015-09-16 12:33:09,709 - dispyscheduler - Running job 140631016951392 on 192.168.16.99 (busy: 1 / 1)
2015-09-16 13:01:25,675 - dispyscheduler - New computation 1442370018306: compute, /tmp/dispy/scheduler/192.168.16.98/compute_FFrRi4
2015-09-16 13:01:25,677 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.98/compute_FFrRi4/job.tar (145510400)
2015-09-16 13:01:30,023 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.98/compute_FFrRi4/job.tar
2015-09-16 13:01:30,030 - dispyscheduler - Pending jobs: 1
2015-09-16 13:01:31,120 - dispyscheduler - Pending jobs: 1
2015-09-16 18:45:42,197 - dispyscheduler - Received reply for job 140631016951392 from 192.168.16.99
2015-09-16 18:45:42,212 - dispyscheduler - Removing "/tmp/dispy/scheduler/192.168.16.19/compute_g2lHUc"
2015-09-16 18:45:42,213 - dispyscheduler - Pending jobs: 1
2015-09-16 18:45:42,213 - dispyscheduler - Pending jobs: 0
2015-09-16 18:45:42,229 - dispyscheduler - Running job 140631016951032 on 192.168.16.99 (busy: 0 / 1)
2015-09-16 18:45:43,120 - dispyscheduler - Invalid computation "1442370018305" to cleanup ignored
2015-09-16 18:45:47,232 - dispyscheduler - Could not send job status to 192.168.16.98:41096
2015-09-16 20:14:33,446 - dispyscheduler - Received reply for job 140631016951032 from 192.168.16.99
2015-09-16 20:14:33,459 - dispyscheduler - Removing "/tmp/dispy/scheduler/192.168.16.98/compute_FFrRi4"
2015-09-16 20:14:33,459 - dispyscheduler - Pending jobs: 0
2015-09-16 20:14:36,304 - dispyscheduler - Could not send reply for job 140631016951032 to 192.168.16.98:41096; saving it in "/tmp/dispy/scheduler/192.168.16.98/compute_FFrRi4/_dispy_job_reply_140631016951032"
2015-09-16 20:14:36,305 - dispyscheduler - Could not save reply for job 140631016951032
2015-09-16 20:14:36,305 - dispyscheduler - Could not send node status to 192.168.16.98:41096
2015-09-16 20:42:36,938 - dispyscheduler - New computation 1442370018307: compute, /tmp/dispy/scheduler/192.168.16.19/compute_FYpBrT
2015-09-16 20:42:36,941 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_FYpBrT/job.tar (145776640)
2015-09-16 20:42:39,730 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_FYpBrT/job.tar
2015-09-16 20:42:39,736 - dispyscheduler - Pending jobs: 1
2015-09-16 20:42:39,738 - dispyscheduler - Discovered 192.168.16.104:51348 (LubuntuVM) with 1 cpus
2015-09-16 20:42:41,016 - dispyscheduler - Pending jobs: 1
2015-09-16 20:42:41,016 - dispyscheduler - Pending jobs: 0
2015-09-16 20:42:41,018 - dispyscheduler - Running job 140631017045120 on 192.168.16.104 (busy: 1 / 1)
2015-09-16 20:42:44,264 - dispyscheduler - Pending jobs: 0
2015-09-17 00:32:05,150 - dispyscheduler - Received reply for job 140631017045120 from 192.168.16.104
2015-09-17 00:32:05,166 - dispyscheduler - Removing "/tmp/dispy/scheduler/192.168.16.19/compute_FYpBrT"
2015-09-17 00:32:05,167 - dispyscheduler - Pending jobs: 0
2015-09-17 00:32:06,391 - dispyscheduler - Invalid computation "1442370018307" to cleanup ignored
2015-09-17 08:24:00,891 - dispyscheduler - New computation 1442370018308: compute, /tmp/dispy/scheduler/192.168.16.19/compute_6togDG
2015-09-17 08:24:00,893 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_6togDG/job.tar (145776640)
2015-09-17 08:24:03,972 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_6togDG/job.tar
2015-09-17 08:24:03,978 - dispyscheduler - Pending jobs: 1
2015-09-17 08:24:08,997 - dispy - Could not connect to 192.168.16.99:51348, Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dispy/init.py", line 372, in send
yield sock.connect((self.ip_addr, self.port))
timeout: timed out

2015-09-17 08:24:08,997 - dispy - Could not connect to 192.168.16.99:51348, Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dispy/init.py", line 372, in send
yield sock.connect((self.ip_addr, self.port))
timeout: timed out

2015-09-17 08:24:08,997 - dispy - Transfer of computation "compute" to 192.168.16.99 failed: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dispy/init.py", line 372, in send
yield sock.connect((self.ip_addr, self.port))
timeout: timed out

2015-09-17 08:24:08,997 - dispy - Transfer of computation "compute" to 192.168.16.99 failed: Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dispy/init.py", line 372, in send
yield sock.connect((self.ip_addr, self.port))
timeout: timed out

2015-09-17 08:24:08,998 - dispyscheduler - Failed to setup 192.168.16.99 for computation "compute"
2015-09-17 08:24:10,304 - dispyscheduler - Pending jobs: 1
2015-09-17 08:24:10,304 - dispyscheduler - Pending jobs: 0
2015-09-17 08:24:10,308 - dispyscheduler - Running job 140631017047416 on 192.168.16.104 (busy: 1 / 1)
2015-09-17 08:24:14,000 - dispy - Could not connect to 192.168.16.99:51348, Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dispy/init.py", line 372, in send
yield sock.connect((self.ip_addr, self.port))
timeout: timed out

2015-09-17 08:24:14,000 - dispy - Could not connect to 192.168.16.99:51348, Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/dispy/init.py", line 372, in send
yield sock.connect((self.ip_addr, self.port))
timeout: timed out

2015-09-17 08:30:35,136 - dispyscheduler - New computation 1442370018309: compute, /tmp/dispy/scheduler/192.168.16.19/compute_5hS98n
2015-09-17 08:30:35,138 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_5hS98n/job.tar (145776640)
2015-09-17 08:30:37,014 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_5hS98n/job.tar
2015-09-17 08:30:37,021 - dispyscheduler - Pending jobs: 1
2015-09-17 08:30:38,573 - dispyscheduler - Pending jobs: 1
2015-09-17 08:30:38,573 - dispyscheduler - Pending jobs: 0
2015-09-17 08:30:38,576 - dispyscheduler - Running job 140631024505872 on 192.168.16.99 (busy: 1 / 1)
2015-09-17 08:30:39,858 - dispyscheduler - Pending jobs: 0
2015-09-17 13:37:25,606 - dispyscheduler - Received reply for job 140631024505872 from 192.168.16.99
2015-09-17 13:37:25,619 - dispyscheduler - Removing "/tmp/dispy/scheduler/192.168.16.19/compute_5hS98n"
2015-09-17 13:37:25,620 - dispyscheduler - Pending jobs: 0
2015-09-17 13:37:25,682 - dispyscheduler - Invalid computation "1442370018309" to cleanup ignored
2015-09-17 13:48:46,728 - dispyscheduler - New computation 1442370018310: compute, /tmp/dispy/scheduler/192.168.16.19/compute_kokMG3
2015-09-17 13:48:46,731 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_kokMG3/job.tar (145776640)
2015-09-17 13:48:48,265 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_kokMG3/job.tar
2015-09-17 13:48:48,271 - dispyscheduler - Pending jobs: 1
2015-09-17 13:48:51,152 - dispyscheduler - Pending jobs: 1
2015-09-17 13:48:51,152 - dispyscheduler - Pending jobs: 0
2015-09-17 13:48:51,153 - dispyscheduler - Running job 140631017054528 on 192.168.16.99 (busy: 1 / 1)
2015-09-17 13:48:52,464 - dispyscheduler - Pending jobs: 0
2015-09-17 15:19:09,551 - dispyscheduler - Received reply for job 140631017047416 from 192.168.16.104
2015-09-17 15:19:09,566 - dispyscheduler - Removing "/tmp/dispy/scheduler/192.168.16.19/compute_6togDG"
2015-09-17 15:19:09,567 - dispyscheduler - Pending jobs: 0
2015-09-17 15:19:11,596 - dispyscheduler - Invalid computation "1442370018308" to cleanup ignored
2015-09-17 17:04:02,065 - dispyscheduler - New computation 1442370018311: compute, /tmp/dispy/scheduler/192.168.16.19/compute_0vhvSr
2015-09-17 17:04:02,067 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_0vhvSr/job.tar (145776640)
2015-09-17 17:04:03,982 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_0vhvSr/job.tar
2015-09-17 17:04:03,991 - dispyscheduler - Pending jobs: 1
2015-09-17 17:04:06,220 - dispyscheduler - Pending jobs: 1
2015-09-17 17:04:07,508 - dispyscheduler - Pending jobs: 1
2015-09-17 17:04:07,508 - dispyscheduler - Pending jobs: 0
2015-09-17 17:04:07,512 - dispyscheduler - Running job 140631017051152 on 192.168.16.104 (busy: 1 / 1)
2015-09-17 17:36:18,306 - dispyscheduler - Received reply for job 140631017054528 from 192.168.16.99
2015-09-17 17:36:18,346 - dispyscheduler - Removing "/tmp/dispy/scheduler/192.168.16.19/compute_kokMG3"
2015-09-17 17:36:18,346 - dispyscheduler - Pending jobs: 0
2015-09-17 17:36:23,171 - dispyscheduler - Invalid computation "1442370018310" to cleanup ignored
2015-09-17 18:14:30,136 - dispyscheduler - Received reply for job 140631017051152 from 192.168.16.104
2015-09-17 18:14:30,137 - dispyscheduler - Pending jobs: 0
2015-09-17 18:14:30,346 - dispyscheduler - Removing "/tmp/dispy/scheduler/192.168.16.19/compute_0vhvSr"
2015-09-17 20:37:33,679 - dispyscheduler - New computation 1442370018312: compute, /tmp/dispy/scheduler/192.168.16.19/compute_Nhactb
2015-09-17 20:37:33,686 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_Nhactb/job.tar (145776640)
2015-09-17 20:37:36,006 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_Nhactb/job.tar
2015-09-17 20:37:36,014 - dispyscheduler - Pending jobs: 1
2015-09-17 20:37:38,825 - dispyscheduler - Pending jobs: 1
2015-09-17 20:37:38,826 - dispyscheduler - Pending jobs: 0
2015-09-17 20:37:38,827 - dispyscheduler - Running job 140631024512728 on 192.168.16.99 (busy: 1 / 1)
2015-09-17 20:37:40,096 - dispyscheduler - Pending jobs: 0
2015-09-17 20:42:40,915 - dispyscheduler - New computation 1442370018313: compute, /tmp/dispy/scheduler/192.168.16.19/compute_3Vysm2
2015-09-17 20:42:40,917 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_3Vysm2/job.tar (145776640)
2015-09-17 20:42:42,373 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_3Vysm2/job.tar
2015-09-17 20:42:42,381 - dispyscheduler - Pending jobs: 1
2015-09-17 20:42:46,542 - dispyscheduler - Pending jobs: 1
2015-09-17 20:42:47,812 - dispyscheduler - Pending jobs: 1
2015-09-17 20:42:47,813 - dispyscheduler - Pending jobs: 0
2015-09-17 20:42:47,815 - dispyscheduler - Running job 140631017049216 on 192.168.16.104 (busy: 1 / 1)
2015-09-18 03:57:30,126 - dispyscheduler - Received reply for job 140631017049216 from 192.168.16.104
2015-09-18 03:57:30,137 - dispyscheduler - Removing "/tmp/dispy/scheduler/192.168.16.19/compute_3Vysm2"
2015-09-18 03:57:30,138 - dispyscheduler - Pending jobs: 0
2015-09-18 03:57:35,141 - dispyscheduler - Could not send reply for job 140631017049216 to 192.168.16.19:52148; saving it in "/tmp/dispy/scheduler/192.168.16.19/compute_3Vysm2/_dispy_job_reply_140631017049216"
2015-09-18 03:57:35,141 - dispyscheduler - Could not save reply for job 140631017049216
2015-09-18 04:58:27,958 - dispyscheduler - Received reply for job 140631024512728 from 192.168.16.99
2015-09-18 04:58:27,995 - dispyscheduler - Removing "/tmp/dispy/scheduler/192.168.16.19/compute_Nhactb"
2015-09-18 04:58:27,996 - dispyscheduler - Pending jobs: 0
2015-09-18 04:58:32,999 - dispyscheduler - Could not send reply for job 140631024512728 to 192.168.16.19:36217; saving it in "/tmp/dispy/scheduler/192.168.16.19/compute_Nhactb/_dispy_job_reply_140631024512728"
2015-09-18 04:58:32,999 - dispyscheduler - Could not save reply for job 140631024512728
2015-09-18 10:19:21,085 - dispyscheduler - New computation 1442370018314: compute, /tmp/dispy/scheduler/192.168.16.19/compute_T3tsJx
2015-09-18 10:19:21,090 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_T3tsJx/job.tar (145776640)
2015-09-18 10:19:22,490 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_T3tsJx/job.tar
2015-09-18 10:19:22,494 - dispyscheduler - node 192.168.16.99 rediscovered
2015-09-18 10:19:22,495 - dispy - Transfer of computation "compute" to 192.168.16.99 failed:
2015-09-18 10:19:22,495 - dispy - Transfer of computation "compute" to 192.168.16.99 failed:
2015-09-18 10:19:22,496 - dispyscheduler - Pending jobs: 0
2015-09-18 10:19:22,496 - dispyscheduler - Pending jobs: 1
2015-09-18 10:19:22,496 - dispyscheduler - Pending jobs: 0
2015-09-18 10:19:22,500 - dispyscheduler - Running job 140631017044520 on 192.168.16.99 (busy: 1 / 1)
2015-09-18 10:19:22,855 - dispyscheduler - Received reply for job 140631017044520 from 192.168.16.99
2015-09-18 10:19:22,857 - dispyscheduler - Pending jobs: 0
2015-09-18 10:19:22,872 - dispyscheduler - Removing "/tmp/dispy/scheduler/192.168.16.19/compute_T3tsJx"
2015-09-18 10:19:24,374 - dispyscheduler - Pending jobs: 0
2015-09-18 10:19:24,450 - asyncoro - uncaught exception in _schedule_jobs/140631016156024:
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1392, in _schedule_jobs
node = self.select_job_node()
File "/usr/local/bin/dispyscheduler.py", line 1332, in load_balance_schedule
if all((not self._clusters[cid]._jobs) for cid in node.clusters):
File "/usr/local/bin/dispyscheduler.py", line 1332, in <genexpr>
if all((not self._clusters[cid]._jobs) for cid in node.clusters):
KeyError: 1442370018314

2015-09-18 12:37:11,680 - dispyscheduler - node 192.168.16.104 rediscovered
2015-09-18 12:37:11,681 - dispyscheduler - node 192.168.16.104 rediscovered
2015-09-18 12:37:11,681 - dispyscheduler - Ignoring node 192.168.16.104
2015-09-18 12:37:11,682 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-18 12:37:11,684 - dispyscheduler - Ignoring node 192.168.16.104
2015-09-18 12:37:11,685 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-18 20:37:37,155 - dispyscheduler - New computation 1442370018315: compute, /tmp/dispy/scheduler/192.168.16.19/compute_6EVkgv
2015-09-18 20:37:37,160 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_6EVkgv/job.tar (145776640)
2015-09-18 20:37:38,491 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_6EVkgv/job.tar
2015-09-18 20:37:38,495 - dispyscheduler - node 192.168.16.104 rediscovered
2015-09-18 20:37:38,495 - dispyscheduler - Ignoring node 192.168.16.104
2015-09-18 20:37:38,495 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-18 20:37:38,496 - dispyscheduler - node 192.168.16.104 rediscovered
2015-09-18 20:37:38,496 - dispyscheduler - Ignoring node 192.168.16.104
2015-09-18 20:37:38,496 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-18 20:37:38,760 - dispyscheduler - node 192.168.16.99 rediscovered
2015-09-18 20:37:38,761 - dispyscheduler - Ignoring node 192.168.16.99
2015-09-18 20:37:38,761 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-18 20:37:38,763 - dispy - Transfer of computation "compute" to 192.168.16.99 failed:
2015-09-18 20:37:38,763 - dispy - Transfer of computation "compute" to 192.168.16.99 failed:
2015-09-18 20:37:38,768 - dispyscheduler - node 192.168.16.99 rediscovered
2015-09-18 20:37:38,769 - dispyscheduler - Ignoring node 192.168.16.99
2015-09-18 20:37:38,770 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-18 20:37:38,773 - dispy - Transfer of computation "compute" to 192.168.16.104 failed:
2015-09-18 20:37:38,773 - dispy - Transfer of computation "compute" to 192.168.16.104 failed:
2015-09-18 20:42:39,018 - dispyscheduler - New computation 1442370018316: compute, /tmp/dispy/scheduler/192.168.16.19/compute_uon2iy
2015-09-18 20:42:39,022 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_uon2iy/job.tar (145776640)
2015-09-18 20:42:40,385 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_uon2iy/job.tar
2015-09-18 20:42:40,388 - dispyscheduler - node 192.168.16.99 rediscovered
2015-09-18 20:42:40,388 - dispyscheduler - Ignoring node 192.168.16.99
2015-09-18 20:42:40,388 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-18 20:42:40,389 - dispy - Transfer of computation "compute" to 192.168.16.99 failed:
2015-09-18 20:42:40,389 - dispy - Transfer of computation "compute" to 192.168.16.99 failed:
2015-09-18 20:42:40,389 - dispyscheduler - node 192.168.16.99 rediscovered
2015-09-18 20:42:40,389 - dispyscheduler - Ignoring node 192.168.16.99
2015-09-18 20:42:40,389 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-18 20:42:40,390 - dispyscheduler - node 192.168.16.104 rediscovered
2015-09-18 20:42:40,390 - dispyscheduler - Ignoring node 192.168.16.104
2015-09-18 20:42:40,391 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-18 20:42:40,391 - dispyscheduler - node 192.168.16.104 rediscovered
2015-09-18 20:42:40,392 - dispyscheduler - Ignoring node 192.168.16.104
2015-09-18 20:42:40,392 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-18 20:42:40,392 - dispy - Transfer of computation "compute" to 192.168.16.104 failed:
2015-09-18 20:42:40,392 - dispy - Transfer of computation "compute" to 192.168.16.104 failed:
2015-09-21 09:43:59,990 - dispyscheduler - New computation 1442370018317: compute, /tmp/dispy/scheduler/192.168.16.19/compute_J8xcez
2015-09-21 09:43:59,995 - dispyscheduler - Copying file job.tar to /tmp/dispy/scheduler/192.168.16.19/compute_J8xcez/job.tar (145776640)
2015-09-21 09:44:01,537 - dispyscheduler - Copied file /tmp/dispy/scheduler/192.168.16.19/compute_J8xcez/job.tar
2015-09-21 09:44:01,541 - dispyscheduler - node 192.168.16.104 rediscovered
2015-09-21 09:44:01,541 - dispyscheduler - Ignoring node 192.168.16.104
2015-09-21 09:44:01,541 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-21 09:44:01,543 - dispyscheduler - node 192.168.16.104 rediscovered
2015-09-21 09:44:01,543 - dispyscheduler - Ignoring node 192.168.16.104
2015-09-21 09:44:01,543 - dispyscheduler - Traceback (most recent call last):
File "/usr/local/bin/dispyscheduler.py", line 378, in tcp_task
yield self.add_node(info, coro=coro)
File "/usr/local/lib/python2.7/dist-packages/asyncoro/init.py", line 3226, in _schedule
retval = coro._generator.send(coro._value)
File "/usr/local/bin/dispyscheduler.py", line 1123, in add_node
cluster = self._clusters[cid]
KeyError: 1442370018314

2015-09-21 09:44:02,799 - dispy - Transfer of computation "compute" to 192.168.16.104 failed:
2015-09-21 09:44:02,799 - dispy - Transfer of computation "compute" to 192.168.16.104 failed:

Discussion

  • Giridhar Pemmasani

    It is difficult for me to understand what is going on with this output without an appropriate explanation.

    From the log,

    2015-09-17 08:24:08,997 - dispy - Could not connect to 192.168.16.99:51348, Traceback (most recent call last):
    File "/usr/local/lib/python2.7/dist-packages/dispy/init.py", line 372, in send
    yield sock.connect((self.ip_addr, self.port))
    timeout: timed out

    But 192.168.16.99 appears to be a node, not the scheduler. Then why is the client sending data to the node (and not the scheduler)? Before that error, there is a warning from the scheduler about not being able to send job status to the client.

    2015-09-16 18:45:47,232 - dispyscheduler - Could not send job status to 192.168.16.98:41096

    Then there are issues about nodes restarting. I am guessing that either you are having network issues or the client/node is being restarted.

     
  • Anonymous

    Anonymous - 2015-09-21

    But 192.168.16.99 appears to be a node, not the scheduler. Then why is the client sending data to the node (and not
    the scheduler)? Before that error, there is a warning from the scheduler about not being able to send job status to
    the client.

    Yes, 192.168.16.99 is a node.
    The client did send data to the scheduler at 2015-09-17 08:24:00,893.
    I assume that the error at 2015-09-17 08:24:08,997 was the scheduler trying to pass the job on to node 192.168.16.99.

    Then there are issues about nodes restarting. I am guessing that either you are having network issues or the
    client/node is being restarted.

    Yes, in our environment nodes start up automatically after hours and shut down automatically before work hours the next day. Is this something that dispyscheduler cannot handle?

     
  • Giridhar Pemmasani

    For fault tolerance (nodes shutting down), the computation must be marked reentrant; i.e., SharedJobCluster must be passed the parameter 'reentrant=True'. Otherwise, the scheduler expects the node to finish the submitted job and gets stuck if the node never sends the result. You also need to use the 'zombie_interval' option to dispyscheduler. See the documentation for details.
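
    For readers following along, here is a minimal sketch of what passing 'reentrant=True' might look like on the client side. It is an illustration, not code from this thread: the function names compute, reentrant_options and run are hypothetical, and depending on the dispy version and network setup SharedJobCluster will likely also need to be told where the scheduler is.

```python
def compute(n):
    # Hypothetical job function. With reentrant=True it may be run again
    # from the start on another node, so it should be safe to repeat.
    return n * n


def reentrant_options():
    # Keyword arguments for SharedJobCluster; 'reentrant' is the
    # parameter named in the reply above.
    return {"reentrant": True}


def run(values):
    # dispy is a third-party package; it is imported here so that the
    # rest of the sketch can be loaded without it installed.
    import dispy

    cluster = dispy.SharedJobCluster(compute, **reentrant_options())
    jobs = [cluster.submit(v) for v in values]
    # Calling a job waits for it to finish and returns its result.
    results = [job() for job in jobs]
    cluster.close()
    return results
```

    Calling run([1, 2, 3]) requires a running dispyscheduler and at least one node, so it is left uncalled here.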

     
  • Anonymous

    Anonymous - 2015-09-28

    Thank you, those options help!

    The zombie_interval option isn't documented for dispyscheduler (at least not at http://dispy.sourceforge.net/dispyscheduler.html) but it does the job.

    Cheers

     
  • Giridhar Pemmasani

    Actually, I should have said 'pulse_interval' instead of 'zombie_interval'. Setting 'zombie_interval' also works (it implies 'pulse_interval', which is why it worked for you), but in your case it seems you want 'pulse_interval'.
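
    As a rough illustration of the suggestion above, the sketch below assembles the scheduler command line from the original report with a '--pulse_interval' option added. The helper name scheduler_command is hypothetical, the value 60 is an arbitrary example, and the assumption that the interval is given in seconds should be checked against the dispyscheduler documentation for the installed version.

```python
def scheduler_command(host, pulse_interval):
    # Builds an argument list based on the invocation shown in the report
    # (dispyscheduler.py -d -i hostname), adding --pulse_interval.
    # Assumption: pulse_interval is interpreted as seconds.
    return [
        "dispyscheduler.py",
        "-d",
        "-i", host,
        "--pulse_interval", str(pulse_interval),
    ]


# Example: print the command so it can be inspected or pasted into a shell.
print(" ".join(scheduler_command("hostname", 60)))
```

    The resulting list could be handed to subprocess.Popen on the scheduler VM, or the printed line used directly in a startup script.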

     
