From: <Pet...@cs...> - 2010-07-09 04:40:11
Hi Shree,

Sorry, I was wrong - it is working with the change below! I hadn't restarted the SSM, and after restarting it, it is now behaving itself.

Peter

Hi Shree,

Unfortunately the change didn't help. With the change in place, in addition to the segfault I get a message such as:

Jul 9 13:11:03 rviz1 slurmd[rviz1][29984]: done with job

but the job is still left in the queue. Creating a script named scancel reveals that it is still being called with -q <job #>. As a workaround, the script I created substitutes the parameters as -Q <job #>, and it is working for now (a sketch of such a wrapper appears after Shree's message of 6 July below).

Regards,
Peter

________________________________________
From: Kumar, Shree [shr...@hp...]
Sent: Thursday, 8 July 2010 9:58 PM
To: viz...@li...
Subject: Re: [vizstack-users] Vizstack and SLURM issue

Hi Peter,

Changing the following lines in slurmlauncher.py

----
try:
    p = subprocess.Popen(["scancel"]+["-q", str(self.schedId)], stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True)
except OSError, e:
    raise SLURMError(e.__str__)
if(p.returncode == 1):
    raise SLURMError(p.communicate()[1])
----

to

----
try:
    p = subprocess.Popen(["scancel"]+[str(self.schedId)], stdout=subprocess.PIPE, stderr=subprocess.PIPE, close_fds=True)
except OSError, e:
    raise SLURMError(e.__str__)
p.communicate()
----

should be sufficient for all cases, I think. This basically disregards any errors that happen during cleanup, but assumes that proper cleanup will be done by SLURM.

Can you try this change?

Thanks
-- Shree

-----Original Message-----
From: Kumar, Shree
Sent: Wednesday, July 07, 2010 8:55 PM
To: viz...@li...
Subject: Re: [vizstack-users] Vizstack and SLURM issue

Hi Peter,

Thanks for this investigation, and for the information about SCANCEL_VERBOSE. That looks like a good way to keep the code independent of explicit version requirements.

If you have any SSM crash logs, they would be useful for understanding any other issues in the code.

Regards
-- Shree

-----Original Message-----
From: Pet...@cs... [mailto:Pet...@cs...]
Sent: Wednesday, July 07, 2010 12:24 PM
To: viz...@li...
Subject: Re: [vizstack-users] Vizstack and SLURM issue

Hi Shree,

It appears that the segfault is due to a syntax change in scancel in the current version of SLURM. VizStack uses the "scancel -q <jobid>" syntax, which killed a job in quiet mode in SLURM 2.0.9 and earlier, but -q now relates to QOS; -Q is the current parameter for quiet mode. Both these situations could be covered by using the SCANCEL_VERBOSE environment variable.

I think the SSM was crashing due to an excess of jobs in the queue; I'll see how it goes with jobs being cancelled as they should be.

Regards,
Peter

________________________________________
From: Kumar, Shree [shr...@hp...]
Sent: Tuesday, 6 July 2010 2:37 PM
To: viz...@li...
Subject: Re: [vizstack-users] Vizstack and SLURM issue

Hi Peter,

That's the first time I have seen a SLURM segfault with VizStack. As part of the cleanup sequence, we issue a single scancel command that cancels the job. It looks like this is the first time you are seeing this error, too. Is it reproducible using the same steps that you have mentioned?

Can you also check the following?

- Start the SSM
- Start the viz-tvnc script
- Let the session start up (you may connect to the session to verify)
- Look up the SLURM queue using "squeue"
- Cancel the SLURM job using "scancel"
- Do you see a similar segfault?

I am trying to simulate the things the vsapi does here. Also, can you send me the SSM log ( /var/log/vs-ssm.log ) when it terminates? I don't like SSM crashes, since the SSM is a single point of failure!

Regards
-- Shree
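Peter's wrapper script itself is not included in the thread. A minimal sketch of what it might look like follows; the location of the real scancel binary (/usr/bin/scancel) and its placement earlier in the SSM's PATH are assumptions. It simply rewrites the legacy quiet flag -q (SLURM 2.0.9 and earlier) into -Q (SLURM 2.1.x, where -q now selects a QOS) and hands everything else to the real scancel unchanged.

----
#!/usr/bin/env python
# Hypothetical "scancel" wrapper illustrating Peter's workaround.
# It rewrites the old quiet flag "-q" to "-Q" (quiet mode in SLURM 2.1.x,
# where -q expects a QOS name) and execs the real scancel with the
# remaining arguments untouched.
import os
import sys

REAL_SCANCEL = "/usr/bin/scancel"  # assumed location of the real binary

args = ["-Q" if a == "-q" else a for a in sys.argv[1:]]
os.execv(REAL_SCANCEL, [REAL_SCANCEL] + args)
----

This is only a stop-gap: Shree's change above, which drops -q from the scancel call altogether, removes the need for any wrapper once the SSM has been restarted.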
-----Original Message-----
From: Pet...@cs... [mailto:Pet...@cs...]
Sent: Monday, July 05, 2010 2:09 PM
To: viz...@li...
Subject: [vizstack-users] Vizstack and SLURM issue

Hello,

I'm seeing an issue with VizStack 1.1-2 and SLURM running under Ubuntu 10.04. I currently have two nodes running with the distro-provided SLURM (2.1.0), configured as per the VizStack manual. I can start viz-tvnc fine and an X server will be started on a node/GPU as expected, but when terminating the session the job remains in the SLURM queue and a message such as the following appears in the syslog:

scancel[5701]: segfault at 0 ip 00007f6c22edc376 sp 00007fff1341ca08 error 4 in libc-2.11.1.so[7f6c22db6000+178000]

I can manually clear the jobs with the scancel command. The SSM seems prone to terminating when there are such jobs left in the queue. Any ideas?

Regards,
Peter

Peter Tyson
CSIRO IM&T - Advanced Scientific Computing
Gate 5 Normanby Road
Clayton Vic 3168
Ph +61 3 9545 2021
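The failing invocation can be reproduced outside VizStack with a few lines of Python mirroring the slurmlauncher.py excerpt quoted earlier in the thread. The job id below is a placeholder and would need to be replaced with a job actually sitting in the queue (taken from squeue). On SLURM 2.1.0, where -q expects a QOS name rather than meaning quiet, this appears to be the call that triggers the scancel segfault shown above.

----
# Stand-alone sketch reproducing the SSM's cleanup call, as quoted from
# slurmlauncher.py earlier in the thread. Substitute a real job id.
import subprocess

job_id = "12345"  # placeholder SLURM job id
p = subprocess.Popen(["scancel", "-q", job_id],
                     stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                     close_fds=True)
out, err = p.communicate()
print("scancel exit status: %r, stderr: %r" % (p.returncode, err))
----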