When 'nice' is used together with 'dmtcp_checkpoint', the process crashes on 'Checkpoint all nodes' (c) with:
 ERROR at connectionmanager.cpp:142 in retrieve; REASON='JASSERT(i != _table.end()) failed'
fd = 5
device = anon_inode:[timerfd]
_table.size() = 3
Message: failed to find connection for fd
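For anyone trying to narrow this down, it may help to see which descriptors of the checkpointed process are anonymous inodes such as the timerfd named in the error. The following is only a minimal sketch, assuming Linux with /proc mounted; the helper names describe_fds and anon_inode_fds are made up for illustration and are not part of DMTCP.

```python
import os

def describe_fds(pid="self"):
    """Return {fd: link target} for every open descriptor of a process.

    Reads the symlinks under /proc/<pid>/fd (Linux only).
    """
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
    return result

def anon_inode_fds(pid="self"):
    """Keep only descriptors whose target is an anonymous inode,
    e.g. 'anon_inode:[timerfd]' or 'anon_inode:[eventpoll]'."""
    return {fd: target
            for fd, target in describe_fds(pid).items()
            if target.startswith("anon_inode:")}

if __name__ == "__main__":
    # Inspect the current process; pass a PID to inspect another one
    # (this requires permission to read /proc/<pid>/fd).
    for fd, target in sorted(anon_inode_fds().items()):
        print(fd, target)
```

Calling anon_inode_fds() with the PID of the running job shortly before requesting the checkpoint should show whether fd 5 is the timerfd from the message.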
I have a very long-running OpenCV job and wanted to use DMTCP so that I can recover it after an interruption. The binary is 'opencv_haartraining'. However, since it produces a high CPU load, I tried to run it with 'nice -n 20'. When I use 'c' to store a checkpoint, the process crashes with the error above. The full command is:
'bin/dmtcp_checkpoint nice -n 20 opencv_haartraining -data haarcascade -vec samples.vec -bg negatives.dat -nstages 20 -nsplits 2 -minhitrate 0.999 -maxfalsealarm 0.5 -npos 7000 -nneg 3019 -w 20 -h 20 -nonsym -mem 512 -mode ALL'
The system I am running is Fedora 17: Linux <censored> 3.4.6-2.fc17.x86_64 #1 SMP Thu Jul 19 22:54:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
Maybe this is of some interest and helps improve the software.