JobScheduler / Discussion / Help: Remote Execution Timeout: Z-REMOTE-118

Decursus - 2010-07-19

We have created a chain of shell jobs that execute remotely but we often receive timeouts that cause the jobs to stop. The job logs show the below error.

Z-REMOTE-118 Separate process pid=0: No response from new process within 60s

- What is the typical root cause for these errors: poor network connectivity?
- Is there a way to increase the timeout from 60 seconds?
- What other areas should we investigate as a potential cause?

Thanks


stefan schaedlich - 2010-07-21

Typical root causes for error Z-REMOTE-118 are:
- firewall configurations.
- the workload of the remote scheduler is near 100%.
- poor network connectivity.

Currently there is no way to configure the timeout of 60 seconds.

It's difficult for us to say what the problem is in your case. Perhaps we can resolve your problem if you give us some more information:

1. Please give us some information about your environment of both schedulers (OS, memory …) and your job-configuration.

2. Could you please provide complete log files from both Job Schedulers, from start until getting the error?
To create a more detailed log please start the scheduler in debug mode with command jobscheduler.sh debug

3. Could it be that the remote-computer is behind a firewall that causes the error? Could you deactivate the firewall(s) for test purposes?

4. Please check the workload of the remote scheduler when the error occurs. Is it near 100%? Which process causes the workload?

regards
stefan


David Darden - 2010-10-27

We're having this issue as well. A few extra details:
1. This occurs regularly on one server (every few days… about 1 out of every 4 times), but we've only seen it one other time on any other machine. We have ~4 machines running the Job Scheduler. Each has a similar configuration.
2. The machines sit idle waiting for the job to kick off.
3. The server where we see the problem is Windows Server 2008 R2 64-bit, 16 cores, 76 GB RAM.
4. We can see traffic going between the servers even when we see the timeout error.
5. We are executing a shell command via an Order Job that uses a Process Class. I can provide a sample if requested.
6. Where can I send debug logs?

Thanks!

David


stefan schaedlich - 2010-10-28

Hi David,

Please send your debug logs, configuration files, samples etc. to info@sos-berlin.com.

regards
Stefan


David Darden - 2010-12-28

After working with Stefan, I found a solution to this problem (at least in our environment). The anti-virus that was installed on the server was causing the issue. Excluding the Job Scheduler directories from the real-time scanning resolved the issue. Interestingly, this issue only appeared on 1 out of 4 servers with almost identical configurations.


P Field - 2011-02-18

Hi,

I too have this problem. I have 2 hosts, both running 1.3.9.1025, host dtraflocorh179 is the supervisor and host ttraflocorh174 is the workload scheduler. If I run a remote job on ttraflocorh174 using the process class method I get the error: Z-REMOTE-118 Separate process pid=0: No response from new process within 60s

Looking in the logs on ttraflocorh174 I see:

2011-02-18 13:02:28.211 (TCP connection to 172.21.6.23:17935) SCHEDULER-717 Remote commands are not allowed for the current licence-key(s)
2011-02-18 13:02:28.211 (TCP connection to 172.21.6.23:17935) SCHEDULER-353 No immediate response from command <remote_scheduler.start_remote_task>

I had previously evaluated 1.3.8 and this didn't seem to suffer the same issue. There are no firewalls involved either. Any ideas what the cause is?

Thanks in advance

Peter
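
Editor's note: the agent-side log lines quoted in this post carry message codes (SCHEDULER-717, SCHEDULER-353) that point at the cause more directly than the Z-REMOTE-118 timeout seen on the supervisor. A minimal sketch for pulling such codes out of a log excerpt follows. The regular expression is an assumption based only on the codes that appear in this thread, and `count_message_codes` is a hypothetical helper, not part of JobScheduler.

```python
import re
from collections import Counter

# Assumed code shapes, taken from this thread: SCHEDULER-<n> and Z-REMOTE-<n>.
CODE_PATTERN = re.compile(r"\b(?:SCHEDULER|Z-REMOTE)-\d+\b")


def count_message_codes(log_lines):
    """Return a Counter of message codes found in the given log lines."""
    counts = Counter()
    for line in log_lines:
        counts.update(CODE_PATTERN.findall(line))
    return counts


# The two agent-side lines quoted in the post above.
sample = [
    "2011-02-18 13:02:28.211 (TCP connection to 172.21.6.23:17935) "
    "SCHEDULER-717 Remote commands are not allowed for the current licence-key(s)",
    "2011-02-18 13:02:28.211 (TCP connection to 172.21.6.23:17935) "
    "SCHEDULER-353 No immediate response from command <remote_scheduler.start_remote_task>",
]
codes = count_message_codes(sample)
print(sorted(codes))
```

Running the same helper over the agent's full scheduler.log around the time of a timeout may show whether the agent rejected the command outright, as SCHEDULER-717 suggests here, or never logged it at all.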


P Field - 2011-03-07

Hi Rainer,

Thanks for the quick reply. That worked perfectly.

Thanks again

Peter


Hi SOS-Team!

Sorry for resurrecting this thread, but it fits my issue exactly. I have the same Z-REMOTE-118 error when trying to execute remotely between two scheduler instances. The possible causes I have found here are anti-virus programs, licence key issues and JVM crashes. I have checked those and everything seems fine. The workload of the remote scheduler is far from 100%, as there are no jobs configured so far (besides the standard ones after installation). The network also seems to be fine, as I could add the remote scheduler in the Managed Jobs GUI.

The entries in the scheduler.log on the host-instance look like this:

2011-08-09 08:25:33.005 [ERROR]  (Task remote_execution/run_script:9457) Z-REMOTE-118  Separate process pid=0: No response from new process within 60s [zschimmer::com::object_server::Connection::Connect_operation::async_check_error_]
2011-08-09 08:25:33.020 [WARN]   (Task remote_execution/run_script:9457) SCHEDULER-280  Process terminated with exit code 1 (0x1)
2011-08-09 08:25:33.020 [WARN]   (Task remote_execution/run_script:9457) SCHEDULER-845  Task ended without processing the order. The order remains in job's order queue in the same state

The last entries from the scheduler log on the remote-instance are:

2011-08-09 08:18:54.956 [info]   (TCP connection to 192.168.100.81:4338) SCHEDULER-965  Executing command <?xml version="1.0" encoding="ISO-8859-1"?><remote_scheduler.start_remote_task tcp_port="59999"/>
2011-08-09 08:18:54.956 [info]   (TCP connection to 192.168.100.81:4338) SCHEDULER-848  Task pid=3080 started for remote scheduler
2011-08-09 08:19:54.800 [info]   (TCP connection to 192.168.100.81:4338) SCHEDULER-965  Executing command <?xml version="1.0" encoding="ISO-8859-1"?><remote_scheduler.remote_task.close process_id="25"/>

Any ideas which other reasons could be the root for this error?

Thanks in advance!
Best regards
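
Editor's note: in the remote-instance excerpt above, the time between the start_remote_task command and the remote_task.close command can be compared with the 60 s limit named in Z-REMOTE-118. The sketch below computes that gap. It assumes every log line starts with a 23-character timestamp of the form YYYY-MM-DD HH:MM:SS.mmm, as in the quoted lines; `seconds_between` is a hypothetical helper.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def seconds_between(first_line, second_line):
    """Return the elapsed seconds between the leading timestamps of two log lines."""
    # Assumption: the timestamp occupies the first 23 characters of each line.
    start = datetime.strptime(first_line[:23], TIMESTAMP_FORMAT)
    end = datetime.strptime(second_line[:23], TIMESTAMP_FORMAT)
    return (end - start).total_seconds()


started = "2011-08-09 08:18:54.956 [info]   (TCP connection to 192.168.100.81:4338) SCHEDULER-848  Task pid=3080 started for remote scheduler"
closed = "2011-08-09 08:19:54.800 [info]   (TCP connection to 192.168.100.81:4338) SCHEDULER-965  Executing command <remote_scheduler.remote_task.close process_id=\"25\"/>"

elapsed = seconds_between(started, closed)
print(f"{elapsed:.3f}")  # prints 59.844
```

A gap of roughly 60 seconds is consistent with the remote instance having received the start command and started a process, and the calling scheduler then giving up after its fixed wait. That would suggest the command channel towards the remote instance works and that the missing piece is the response from the newly started process. Whether that response travels over the port given in the tcp_port attribute is an assumption here; this thread does not confirm it.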

soskb - 2011-08-10

Hi,

Does this problem appear persistently or only from time to time?
Did you check the security configuration (e.g. allowed_hosts)?

We need the logs/scheduler.log file to analyse this issue in more detail.

regards
soskb


flexmatic - 2011-08-10

Hello,

This problem is persistent and I have entered the IP addresses in the security config.

I am going to send you the scheduler.log, scheduler.xml and the order config to this address:

Much appreciated and best regards

Last edit: SOSMP 2015-03-17


Rama R - 2015-03-16

Hello,

We are having this issue persistently and are not able to run any scripts on the remote agent at all. Both the main scheduler and the agent are AWS EC2 instances, and there are no firewall restrictions on either side. We added <allowed_host ...> security entries on both sides. Following are the logs from scheduler.out:

JobSchedulerLog4JAppender (system.out): INFO [main] (CppLogger.scala:24) - (Job my_sample2) SCHEDULER-919 Task 2104 enqueued
JobSchedulerLog4JAppender (system.out): INFO [main] (CppLogger.scala:24) - (Job my_sample2) SCHEDULER-930 Task 2104 started - cause: queue_at
JobSchedulerLog4JAppender (system.out): INFO [main] (CppLogger.scala:24) - (Task my_sample2:2104) SCHEDULER-918 state=starting (at=2015-03-12 16:38:09.934-0700)
JobSchedulerLog4JAppender (system.out): WARN [main] (CppLogger.scala:23) - SCHEDULER-261 Nothing done, event=, operations=Socket_manager(156:I 157:I 160:Ie 175:Ie 176:Ie ), Directory_observer(...SCHEDULER PATH/config/live/), at 2015-03-12 23:38:48 UTC, Connection::Connect_operation(connecting...), object_server::Connection(), at 2015-03-12 23:39:10 UTC, Xml_client_connection(<SCHEDULER AGENT="" IP:PORT=""> waiting, sending "<?xml version="1.0" encoding="ISO-8859-1"?>...er.start_remote_task tcp_port="59999" kind="process"/>"),ERROR=SCHEDULER-224 Supervisor has closed the connection [10.... [Task my_sample2:2104 starting] []
JobSchedulerLog4JAppender (system.out): WARN [main] (CppLogger.scala:23) - SCHEDULER-261 Nothing done, event=, operations=Socket_manager(156:I 157:I 160:Ie 175:Ie 176:Ie ), Connection::Connect_operation(connecting...), object_server::Connection(), at 2015-03-12 23:39:10 UTC, Directory_observer(...SCHEDULER PATH/config/live/), at 2015-03-12 23:39:49 UTC, Xml_client_connection(<SCHEDULER AGENT="" IP:PORT=""> waiting, sending "<?xml version="1.0" encoding="ISO-8859-1"?>...er.start_remote_task tcp_port="59999" kind="process"/>"),ERROR=SCHEDULER-224 Supervisor has closed the connection [10.... [Task my_sample2:2104 starting] []
JobSchedulerLog4JAppender (system.out): ERROR [main] (CppLogger.scala:22) - (Task my_sample2:2104) Z-REMOTE-118 Separate process pid=0: No response from new process within 60s [zschimmer::com::object_server::Connection::Connect_operation::async_check_error_]
JobSchedulerLog4JAppender (system.out): ERROR [main] (CppLogger.scala:22) - (Task my_sample2:2104) SCHEDULER-280 Process terminated with exit code 1 (0x1)
JobSchedulerLog4JAppender (system.out): INFO [main] (CppLogger.scala:24) - (Job my_sample2) SCHEDULER-931 state=stopping
JobSchedulerLog4JAppender (system.out): INFO [main] (CppLogger.scala:24) - (Task my_sample2:2104) SCHEDULER-918 state=closed
JobSchedulerLog4JAppender (system.out): INFO [main] (CppLogger.scala:24) - (Task my_sample2:2104) SCHEDULER-962 Protocol ends in ..SCHEDULER PATH/logs/task.my_sample2.log
JobSchedulerLog4JAppender (system.out): INFO [main] (CppLogger.scala:24) - (Job my_sample2) SCHEDULER-931 state=stopped


Rama R - 2015-03-16

Could anyone from the group help to resolve the above issue?

Thank you


SOSMP - 2015-03-17

Hi Rama

Which JobScheduler version and OS version are you using?

You may have already checked this, but for the AWS EC2 instances, did you check the Security Group settings attached to the Elastic IP or VPC? All the incoming and outgoing port ranges required by JobScheduler should be defined there.

Check these KB articles about remote execution:
- https://kb.sos-berlin.com/x/ZoQ3
- https://kb.sos-berlin.com/x/TYM3

- Rama R - 2015-03-17
  
  Hi SOSMP,
  
  I have checked with our security team once again, and they said that there is no security in place between the servers I am using. I also verified the local firewall settings; there are no rules blocking.
  
  Thank you,
  Rama
  

Rama R - 2015-03-17

Hi SOSMP, Thank you for your response.

The version of JobScheduler installed is Release 1.7.4321(1.7.4) (64 bit version). And the OS in both sides (Scheduler engine and Remote agent) is CentOS 6.0.

And with regard to AWS EC2 instance security: when I checked with the security team, I was told that there are no firewall restrictions between those two servers since both are AWS servers. To double check based on your comments, I will verify those points as well.

Thank you,
Rama


SOSMP - 2015-03-17

One more tip: check CentOS's internal firewall settings. In my experience you have to open each and every port manually.
CentOS running with a desktop also has a GUI to manage the firewall settings.

Below is a reference URL for CentOS 7, but it should also work with CentOS 6 with some changes:

http://serverfault.com/questions/616435/centos-7-firewall-configuration


Rama R - 2015-03-17

While I was testing, though I was told there were no restrictions, I have added the following rules to iptables --

ACCEPT tcp -- ip----.us-west-1.compute.internal anywhere tcp dpts:59990:59999
ACCEPT udp -- ip-**---**.us-west-1.compute.internal anywhere udp dpts:59990:59999

But this didn't help either.

Thank you,
Rama


Grant Hopwood - 2015-03-18

Have you checked connectivity using telnet or netcat?

ie: telnet <agent_ip> <agent_port> (default was 4444)

Last edit: Grant Hopwood 2015-03-18

- Rama R - 2015-03-18
  
  Yes Grant, I have checked that and the connection is successful.
  But when I do telnet or netcat from agent to scheduler engine on port 59999 or lower the connection is getting refused even though there are no security groups or firewall restrictions. -- Not really sure this is a valid test here.
  Thank you,
  Rama
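
Editor's note: telnet reports "refused" and "no answer" in ways that are easy to mix up, yet the two outcomes usually mean different things. The sketch below classifies a TCP connection attempt using only the Python standard library. `probe_tcp` is a hypothetical helper, and the host and port in the usage lines are placeholders.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Classify a TCP connection attempt as 'open', 'refused', 'timeout' or 'error'."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The target host answered, but nothing accepted the connection.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError:
        # Anything else, e.g. name resolution failure or no route to host.
        return "error"


if __name__ == "__main__":
    # Placeholder target; replace with your own agent or scheduler engine.
    print(probe_tcp("127.0.0.1", 4444, timeout=1.0))
```

As a rule of thumb, "refused" means the packets reached the other machine and nothing was listening on that port at that moment (a firewall rule that rejects rather than drops can look the same), while "timeout" is more typical of packets being silently dropped on the way. A refused connection on port 59999 while no remote task is being started may therefore say little about firewalls, which is probably why the test above felt inconclusive.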
  

SOSMP - 2015-03-19

Hi Rama

Can you please share your scheduler.xml files from Master and Agent?
Did you check whether iptables or the firewall is also configured for JobScheduler's port, i.e. 4444?

Regards
Mahendra


Hi Mahendra,

Please find the scheduler.xml for engine and agent --

At scheduler engine --

<spooler>

    <config mail_xslt_stylesheet="config/scheduler_mail.xsl" port="8080"
            tcp_port="8300">

        <params>
            <param name="scheduler.variable_name_prefix" value="SCHEDULER_PARAM_"/>
            <param name="scheduler.order.keep_order_content_on_reschedule" value="false"/>
        </params>

        <security ignore_unknown_hosts="yes">
            <allowed_host host="localhost" level="all"/>
            <allowed_host host="<IP hosting scheduler engine>" level="all"/>
            <allowed_host host="<IP of another machine>" level="all"/>
            <allowed_host host="<IP of remote agent machine>" level="all"/>
        </security>

        <plugins>
            <!-- <plugin java_class="com.sos.scheduler.engine.plugins.jetty.JettyPlugin">
                     <plugin.config/>
                 </plugin> -->
            <plugin java_class="com.sos.scheduler.engine.plugins.webservice.WebServicePlugin"/>
            <plugin java_class="com.sos.jobscheduler.tools.webservices.SOSCommandSecurityPlugin"/>
        </plugins>

        <process_classes>
            <process_class max_processes="30"/>
            <process_class max_processes="10" name="single"/>
            <process_class max_processes="10" name="multi"/>
        </process_classes>

        <http_server>
            <http.authentication>
                <http.users>
                    <http.user name="<user name>" password_md5="<password>"/>
                </http.users>
            </http.authentication>
        </http_server>

    </config>

</spooler>

At remote agent --

<spooler>

    <config supervisor="<IP of scheduler engine>:8300"
            mail_xslt_stylesheet="config/scheduler_mail.xsl"
            ip_address="<IP of remote agent>"
            port="<Port at which remote agent listens>"
            tcp_port="<Port at which remote agent listens>">

        <params>
            <param name="scheduler.variable_name_prefix" value="SCHEDULER_PARAM_"/>
            <param name="scheduler.order.keep_order_content_on_reschedule" value="false"/>
        </params>

        <security ignore_unknown_hosts="yes">
            <allowed_host host="localhost" level="all"/>
            <allowed_host host="<IP where the remote agent installed>" level="all"/>
            <allowed_host host="<IP of scheduler engine>" level="all"/>
        </security>

        <plugins>
            <!-- <plugin java_class="com.sos.scheduler.engine.plugins.jetty.JettyPlugin">
                     <plugin.config/>
                 </plugin> -->
            <plugin java_class="com.sos.scheduler.engine.plugins.webservice.WebServicePlugin"/>
            <plugin java_class="com.sos.jobscheduler.tools.webservices.SOSCommandSecurityPlugin"/>
        </plugins>

        <process_classes>
            <process_class max_processes="30"/>
            <process_class max_processes="10" name="single"/>
            <process_class max_processes="10" name="multi"/>
        </process_classes>

    </config>

</spooler>

We have used the 8300 instead of 4444 as configured in the scheduler.xml (hope this is not the issue, I found in one of the knowledge base articles that we have freedom of choosing this port) and we don't have any firewall settings between these two machines.

SOSMP - 2015-03-19

Hi Rama

Just for testing, please try the following security settings (allow everything). Once the connection test is okay, you can replace them with more secure settings.

<security ignore_unknown_hosts="yes">
    <allowed_host host="localhost" level="all"/>
    <allowed_host host="0.0.0.0" level="all"/>
</security>

Change the security settings, restart Master and Agent, and run the remote job. (Can you please share how you are executing jobs on the remote Agent: process_class, the scheduler.remote_host parameter or SSH?)

Check scheduler.log and the scheduler-instance log at the Agent. If you see messages such as connection refused, then at least Master and Agent can talk to each other. If there are no messages in the Agent's scheduler.log, then an OS or networking issue is preventing communication between Master and Agent.

Just a thought: maybe some AWS settings are preventing connections between servers on the same VPC. I can remember such a problem where servers were accessible from their public IPs but could not talk to each other. The security group attached to the VPC should allow all traffic from the INSTANCE ID. It is just a thought; you AWS gurus should be able to answer this question.

regards
SOSMP
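
Editor's note: when the security section is edited by hand on two machines, it can help to list what each scheduler.xml actually contains before restarting. A minimal sketch using Python's standard XML parser follows. The sample document is a cut-down illustration built from the settings suggested in this post, not a complete scheduler.xml, and `allowed_hosts` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Cut-down sample for illustration only; a real scheduler.xml has more sections.
SAMPLE_CONFIG = """
<spooler>
  <config tcp_port="8300">
    <security ignore_unknown_hosts="yes">
      <allowed_host host="localhost" level="all"/>
      <allowed_host host="0.0.0.0" level="all"/>
    </security>
  </config>
</spooler>
"""


def allowed_hosts(xml_text):
    """Return (host, level) pairs for every allowed_host element in the document."""
    root = ET.fromstring(xml_text.strip())
    return [(e.get("host"), e.get("level")) for e in root.iter("allowed_host")]


for host, level in allowed_hosts(SAMPLE_CONFIG):
    print(host, level)
```

A real file must be well-formed XML for this to work, so placeholder values containing angle brackets, as in the configurations pasted earlier in the thread, would need to be replaced with actual addresses first.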
- Rama R - 2015-03-19
  
  Hi SOSMP,
  
  I have tested with above mentioned changes to the security, but still there wasn't any luck.
  
  I am using process class that is defined in /config/live folder to run the jobs (shell scripts) in remote agent machine --
  
  Process class definition:
  <process_class name="devremote" max_processes="2" remote_scheduler="<IP of remote agent>:<port that is defined in scheduler.xml at agent>"/>
  Job definition:
  <job name="my_sample2" process_class="devremote">
      <script language="shell"> </script>
      <run_time time_zone="MST" repeat="00:00:30" let_run="yes"> </run_time>
  </job>
  
  -- In the agent's scheduler.log there are no entries about incoming connections from the scheduler engine, so I suspect this is an OS/network issue. However, after extensive testing and verification I found no firewall or AWS restrictions. In AWS both machines are in the same group, which allows communication among all the computers in that network. As a test, I am able to telnet to the ports specified in the configuration.
  
  Thank you,
  Rama
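
Editor's note: with a non-default port such as 8300 in play, one cheap check is that the port in the process class's remote_scheduler attribute is the same as the tcp_port the agent is configured to listen on. The sketch below compares the two. The address 10.0.0.12 and port 4444 are made-up sample values, the XML is cut down for illustration, and both helper functions are hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample values for illustration only.
PROCESS_CLASS = '<process_class name="devremote" max_processes="2" remote_scheduler="10.0.0.12:4444"/>'
AGENT_CONFIG = '<spooler><config ip_address="10.0.0.12" port="4444" tcp_port="4444"/></spooler>'


def remote_target(process_class_xml):
    """Return (host, port) from the remote_scheduler attribute of a process class."""
    element = ET.fromstring(process_class_xml)
    host, _, port = element.get("remote_scheduler").rpartition(":")
    return host, int(port)


def agent_tcp_port(scheduler_xml):
    """Return the tcp_port configured in the agent's scheduler.xml."""
    root = ET.fromstring(scheduler_xml)
    return int(root.find("config").get("tcp_port"))


host, port = remote_target(PROCESS_CLASS)
print(host, port, port == agent_tcp_port(AGENT_CONFIG))
```

A mismatch would explain an agent log with no incoming entries at all, although a successful telnet to the configured port, as reported above, makes that explanation less likely in this case.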
  