From: Robert W. L. <rw...@bu...> - 2013-06-25 16:43:26
Hi,

Alright. I have a question. Please forgive me if this turns out to be another dumb mistake.

I'm still converting my script that previously used blcr to checkpoint. When I run my job stats routine in the parent script, it sleeps 10 minutes, samples a few things, then sleeps another 10 and so on until either the job finishes or the checkpoint time is reached. If my job monitor subroutine detects that the job has "disappeared" before the expected total time before checkpointing (by way of it no longer showing up in top or ps), it returns, and another sub determines the job status and decides what to do (report error, checkpoint, etc).

In blcr, this strategy seemed to work OK because when the process finishes, it no longer shows up in top. However, in testing out dmtcp, I'm still seeing the process in top. It appears as "<defunct>". The coordinator is still running. Here's what I see in ps -ef:

    rwleach 12552 1 0 Jun24 ? 00:00:00 dmtcp_coordinator --port 0 --background --port-file /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/VCaP_control-VCaP_input_summits.bed.pad20k-std.summit-orient-to-genes.pad150.formeme.memeout-summits.ckpt2.port --ckptdir /panfs/panfs.ccr.buffalo.edu/projects/ccrstaff/rwleach/PROJECT/CRPC/MACS/VCaP_control-VCaP_input_summits.bed.pad20k-std.summit-orient-to-genes.pad150.formeme.memeout-summits.ckpt2 --tmpdir /panasas/scratch/rwleach/tmp
    rwleach 12560 12539 12 Jun24 ? 02:27:57 [DMTCP:meme.bin] <defunct>

I just checked to see what dmtcp's status of the job is:

    dmtcp_command --port 42749 --status
    DMTCP-1.2.7 (+ MTCP), Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it
    under certain conditions; see COPYING file for details.
    (Use flag "--quiet" to hide this message.)
    Port: 42749
    Status...
    NUM_PEERS=0
    RUNNING=no

Perhaps instead of using ps/top to check the status, I should be running `dmtcp_command --port 42749 --status` to determine whether to finish up. That's probably the way to go. However, how should I handle the defunct process? Does the "--exit-on-last" option clean it up or do I need to detect the defunct status on my own and issue the command `dmtcp_command --quit`?

I just tried manually quitting the coordinator and my defunct process is still hanging around:

    k10n23a[Jun 25 11:53:04]:>dmtcp_command --port 42749 --quit
    DMTCP-1.2.7 (+ MTCP), Copyright (C) 2006-2011 Jason Ansel, Michael Rieker, Kapil Arya, and Gene Cooperman
    This program comes with ABSOLUTELY NO WARRANTY.
    This is free software, and you are welcome to redistribute it
    under certain conditions; see COPYING file for details.
    (Use flag "--quiet" to hide this message.)
    k10n23a[Jun 25 DING!]:>ps -ef | grep -i DMTCP
    rwleach 10111 7810 0 12:00 pts/0 00:00:00 grep -i DMTCP
    rwleach 12560 12539 11 Jun24 ? 02:27:57 [DMTCP:meme.bin] <defunct>

Is this what's supposed to happen? If I do a ps on the parent process ID, I see my own PBS script that was copied to the node:

    ps 12539
    PID TTY STAT TIME COMMAND
    12539 ? S 0:00 /usr/bin/perl /var/spool/pbs/mom_priv/jobs/4048325.d15n41.ccr.buffalo.edu.SC

I wasn't expecting that. I thought it would belong to the dmtcp coordinator.

What should my parent script be doing with this defunct process? Do I need to detect the defunct status and call waitpid? Would that clean up the defunct listing in ps/top? Could the process be waiting on my script to grab some of its output so it can flush its buffer and exit or something? I'm not sure how to handle this situation.

So the way everything stands right now is, my script is waiting for the process to disappear from the top/ps output, but it's not going away because of... something - and I don't know what that something is.

Thanks,
Rob
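
P.S. To make the waitpid question concrete, here is a minimal sketch of what I have in mind, written in Python rather than my actual Perl script. It assumes that a <defunct> entry is simply a child that has exited but has not yet been reaped by its parent, so a non-blocking waitpid from the parent should both report the exit status and clear the entry from ps/top. The names poll_child and monitor are made up for illustration, and the child started here is a stand-in, not meme.bin running under DMTCP.

```python
import os
import sys
import time


def poll_child(pid):
    """Non-blocking check on a direct child.

    Returns None while the child is still running. Once the child has
    exited, this reaps it (which removes the <defunct> entry) and returns
    its exit code, or the negated signal number if a signal killed it.
    """
    reaped_pid, status = os.waitpid(pid, os.WNOHANG)
    if reaped_pid == 0:
        return None  # child has not changed state yet
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return -os.WTERMSIG(status)
    return None  # stopped/continued; not expected without extra flags


def monitor(pid, interval):
    """Sleep-and-sample loop: return the exit code once the child is done."""
    while True:
        code = poll_child(pid)
        if code is not None:
            return code
        # (job stats sampling would go here)
        time.sleep(interval)


if __name__ == "__main__":
    # Hypothetical stand-in child that exits with status 3.
    child_pid = os.spawnv(
        os.P_NOWAIT,
        sys.executable,
        [sys.executable, "-c", "import sys; sys.exit(3)"],
    )
    print("child exit code:", monitor(child_pid, interval=0.1))
```

If that assumption holds, the monitor loop would not need to look at ps/top at all; in my case the interval would be 600 seconds rather than the 0.1 used here. Note that waitpid only works from the process that is the actual parent, which per the ps output above is my PBS script (PID 12539), not the coordinator.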