
#265 hu.sztaki.lpds.submitter.service.ssh.sshService

3.6.2
open
9 - High
2016-12-13
2016-12-09
No

Dear Zoltan,

We've been running gUSE for a few years now, and we have a critical problem (described below): when it occurs, users can no longer submit jobs to the cluster. SSH itself works perfectly between the server hosting gUSE and the cluster; we tried connecting manually and saw no problems. Our developers identified a helper class in gUSE (hu.sztaki.lpds.submitter.service.ssh.sshService). This helper class is an adapter to jsch and it retries (at most 6 times). This class hasn't changed from gUSE 3.6.5 to 3.7.0.1, so the problem cannot be solved by migrating to gUSE 3.7.0.1.

The only workaround we currently apply is to restart gUSE, but this is very annoying because it kills the running workflows/jobs; users then need to rescue them, and time is lost.
Any advice would be much appreciated.

The error we get in the catalina.log:
2016-12-08 17:12:57,164 ERROR [lsf-SUBMIT/3d45479c-196a-4228-9b58-3fe50aff731e] (sshService.java:120) - sshService.sshExec failed. Sleep. numOfRetry:1
com.jcraft.jsch.JSchException: channel is not opened.
at com.jcraft.jsch.Channel.sendChannelOpen(Channel.java:670)
at com.jcraft.jsch.Channel.connect(Channel.java:151)
at com.jcraft.jsch.Channel.connect(Channel.java:145)
at hu.sztaki.lpds.submitter.service.ssh.sshService.sshExecWithCommandResult(sshService.java:116)
at hu.sztaki.lpds.submitter.service.ssh.sshService.sshExec(sshService.java:166)
at hu.sztaki.lpds.submitter.grids.Grid_lsf.submit(Grid_lsf.java:208)
at hu.sztaki.lpds.dcibridge.threads.Submit.run(Submit.java:42)
java.lang.Exception: Could not authenticate to server aaa.domain.com with user abcd. The connection is broken.

Discussion

  • Uwe Schmitt

    Uwe Schmitt - 2016-12-09

    The "com.jcraft.jsch.JSchException: channel is not opened." line is the critical one. sshService seems to hold SSH channels, which break if the ssh process on the cluster machine dies. After this the user cannot submit any workflow until we restart gUSE.
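
    To illustrate the failure mode described in this thread (a held connection that has gone stale, plus a bounded number of retries), here is a minimal sketch of one way a retry loop could discard the broken connection and reconnect instead of retrying on the same dead one. This is not gUSE's sshService code and not the fix shipped in 3.7.0.2. ReconnectingExecutor, Connection and ConnectionFactory are hypothetical names invented for the example, and Connection only stands in for whatever wraps the real jsch session.

```java
/**
 * Hypothetical sketch, NOT gUSE code: run a command over a cached
 * connection, and on any failure drop the cached connection so the
 * next attempt opens a fresh one.
 */
public class ReconnectingExecutor {

    /** Stand-in for an SSH session wrapper (assumption, not a jsch type). */
    public interface Connection {
        String exec(String command) throws Exception;
        void close();
    }

    /** Opens a brand-new connection each time it is called. */
    public interface ConnectionFactory {
        Connection open() throws Exception;
    }

    private final ConnectionFactory factory;
    private final int maxAttempts;
    private final long sleepMillis;
    private Connection cached; // reused across calls until it fails

    public ReconnectingExecutor(ConnectionFactory factory, int maxAttempts, long sleepMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.factory = factory;
        this.maxAttempts = maxAttempts;
        this.sleepMillis = sleepMillis;
    }

    /** Runs the command, reconnecting after each failed attempt. */
    public synchronized String exec(String command) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (cached == null) {
                    cached = factory.open();
                }
                return cached.exec(command);
            } catch (Exception e) {
                last = e;
                discard(); // never retry on a connection that just failed
                if (attempt < maxAttempts) {
                    Thread.sleep(sleepMillis);
                }
            }
        }
        throw new Exception("Command failed after " + maxAttempts + " attempts", last);
    }

    /** Closes and forgets the cached connection, ignoring close errors. */
    private void discard() {
        if (cached != null) {
            try {
                cached.close();
            } catch (RuntimeException ignored) {
                // the connection is already broken; nothing more to do
            }
            cached = null;
        }
    }
}
```

    With a maximum of 6 attempts, as mentioned above for sshService, a single stale connection would cost one failed attempt and then recover, rather than blocking every later submission until a restart.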

     
  • imi

    imi - 2016-12-09

    Hi!

    This problem is fixed in version 3.7.0.2. Please upgrade your gUSE.

     
  • Scristian Alexander

    Thank you so much for the quick answer, much appreciated. We will try to follow your advice.

     
