From: Robert W. L. <rw...@bu...> - 2013-11-01 19:37:40
Hi,

I am so glad that dmtcp is around, because my blcr jobs no longer seem to be working. I have finally finished integrating dmtcp into my batch job manager, and I have some hopefully useful feedback.

Some of my code relies on customizations to our copy of dmtcp. I'm not sure whether they have been incorporated into the distribution; if they have, you may ignore this:

1. Ability to put the PID into a file (supporting restarts as well, when not restoring the PID).
2. Ability to take STDOUT and STDERR files as parameters, so that the STDOUT and STDERR of the dmtcp software can be kept separate from the outputs of the process.
3. For my own tracking purposes, I supply a new checkpoint directory upon restart. However, in order to supply the correct .dmtcp files, I need to parse them out of the script stored in the previous checkpoint directory. It would be nice if the .dmtcp checkpoint files were instead kept in the directory I provide upon restart.

If 1 is implemented, this will allow us to monitor the progress of the job and end it early if it is determined to have finished before the checkpoint time. Also, using this in combination with lsof, it allows us to automatically track the growth of output files.

If 2 is implemented, I imagine this would mean that you would no longer have to store separate output files and merge them after completion. Restarts would no longer need to supply the output file(s) for the running process. I can see why some people would prefer separate output files.

3 is not as important to me. But it would also allow me to delete previous checkpoint directories if I decide I don't need them anymore. I also wouldn't have to worry about keeping my software up to date if the format of the .sh script changed and broke my parser. I suppose an alternative to this would be to supply the .dmtcp file locations associated with each run in another output file, akin to the port and PID files.

Rob
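For illustration, the parsing workaround described in item 3 might look roughly like the following. This is only a minimal sketch, not the actual parser: it assumes the restart script mentions each checkpoint image as a path ending in `.dmtcp`, and the function name `find_checkpoint_images`, the regular expression, and the sample script text are all hypothetical. The real script format may differ, which is exactly why such a parser is fragile.

```python
import re

# Hypothetical pattern: any token ending in ".dmtcp", delimited by
# whitespace, quotes, or "=". The real script layout is an assumption.
_DMTCP_PATH = re.compile(r"""[^\s"'=]+\.dmtcp\b""")


def find_checkpoint_images(script_text):
    """Return the unique .dmtcp paths found in script_text, in first-seen order."""
    seen = []
    for match in _DMTCP_PATH.finditer(script_text):
        path = match.group(0)
        if path not in seen:
            seen.append(path)
    return seen


# Invented sample text standing in for a restart script; the paths and
# the variable name below are made up for this example.
sample = '''
ckpt_files="/scratch/job1/ckpt_a.out_1234.dmtcp /scratch/job1/ckpt_b.out_5678.dmtcp"
exec dmtcp_restart /scratch/job1/ckpt_a.out_1234.dmtcp
'''

for image in find_checkpoint_images(sample):
    print(image)
```

A separate output file listing the .dmtcp locations for each run, as suggested above, would make this kind of pattern matching unnecessary.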