
#24 Restart without user intervention

Status: open
Owner: nobody
Updated: 2013-09-21
Created: 2011-08-29
Private: No

When restarting a computation for which open files were stored as part of the image, one gets the following error message:
[10274] ERROR at connection.cpp:1054 in restore; REASON='JASSERT(jalib::Filesystem::FileExists(_path) == false) failed'
_path = /tmp/out.txt
Message:
**** File already exists! Checkpointed copy can't be restored.
****Delete the existing file and try again!
dmtcp_restart (10274): Terminating...

Although the message is clear and it is obvious to the user what to do, it would nevertheless be convenient if dmtcp_restart (and hence the automatically generated restart script) had an option that lets the user indicate that dmtcp may remove the file itself.

The context for this request is that if dmtcp is integrated into a queuing/scheduling system, a restart should proceed completely unattended. (This "integration" is done on the level of job scripts, not in the actual queue system and scheduler.)

Alternatively, it would be convenient to have, e.g., dmtcp_prepare_restart that would list all conflicting files, and optionally remove them.
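
To illustrate the idea, here is a minimal sketch of what such a helper might do. It assumes the caller already knows which paths the checkpointed process had open; it does not read them from the checkpoint image. The names `find_conflicts` and `remove_conflicts` are hypothetical and not part of DMTCP.

```python
import os
from typing import Iterable, List


def find_conflicts(paths: Iterable[str]) -> List[str]:
    """Return the paths that already exist and would block a restore."""
    return [p for p in paths if os.path.exists(p)]


def remove_conflicts(paths: Iterable[str]) -> List[str]:
    """Delete conflicting files and return the paths actually removed."""
    removed = []
    for p in find_conflicts(paths):
        # Hypothetical safety rule: skip directories and special files.
        if os.path.isfile(p):
            os.remove(p)
            removed.append(p)
    return removed
```

A job script could call `find_conflicts(["/tmp/out.txt"])` to report conflicts only, or `remove_conflicts(...)` to clear them before invoking dmtcp_restart.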

Of course, it would be possible to parse dmtcp_restart's output, and take appropriate action based on that, but this approach is much more fragile.
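
For reference, the parsing workaround might look roughly like the sketch below. It relies only on the message format quoted in this ticket (the `_path = ...` line and the "File already exists" text), which may differ between DMTCP versions; that dependence is exactly what makes the approach fragile. The function name `conflicting_path` is hypothetical.

```python
import re
from typing import Optional

# Matches a line such as "_path = /tmp/out.txt" in the error output.
# Assumption: the path is printed on its own line in this form.
_PATH_LINE = re.compile(r"^\s*_path\s*=\s*(.+?)\s*$", re.MULTILINE)


def conflicting_path(output: str) -> Optional[str]:
    """Return the path named in a 'File already exists' failure, else None."""
    if "File already exists" not in output:
        return None
    match = _PATH_LINE.search(output)
    return match.group(1) if match else None
```

A wrapper would capture the output of dmtcp_restart, pass it to `conflicting_path`, delete the returned file, and retry. Any change to the wording of the message silently breaks this.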

Discussion

  • Gene Cooperman

    Gene Cooperman - 2011-09-02

    All of the developers are travelling at the moment. We'll get back to this next week.
    Best wishes,
    - the DMTCP developers

     
  • Kapil Arya

    Kapil Arya - 2013-09-21

    Ticket moved from /p/dmtcp/feature-requests/2/

     
