#17 Checkpoint all nodes fails when using nice

open
nobody
DMTCP (6)
2012-08-08
2012-08-08
Dr. Trigon
No

When 'nice' is used together with 'dmtcp_checkpoint', the process crashes on 'Checkpoint all nodes' (the 'c' command) with:

[26950] ERROR at connectionmanager.cpp:142 in retrieve; REASON='JASSERT(i != _table.end()) failed'
fd = 5
device = anon_inode:[timerfd]
_table.size() = 3
Message: failed to find connection for fd
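
The log above reports the offending descriptor as 'anon_inode:[timerfd]'. As a rough diagnostic sketch (not part of DMTCP; the helper names anon_inode_kind and list_anon_inode_fds are made up for illustration, and a Linux /proc filesystem is assumed), one could list which descriptors of a process are anonymous inodes such as timerfd:

```python
import os


def anon_inode_kind(target):
    """Return the kind of an anonymous-inode link target, or None.

    Example: 'anon_inode:[timerfd]' gives 'timerfd'.
    """
    prefix = "anon_inode:"
    if not target.startswith(prefix):
        return None
    # Strip the surrounding brackets that the kernel usually prints.
    return target[len(prefix):].strip("[]")


def list_anon_inode_fds(pid="self"):
    """Map fd number to anonymous-inode kind for the given process.

    Assumes Linux: reads the symlinks under /proc/<pid>/fd.
    """
    fd_dir = f"/proc/{pid}/fd"
    found = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor may have been closed since listdir ran.
            continue
        kind = anon_inode_kind(target)
        if kind is not None:
            found[int(name)] = kind
    return found
```

Calling list_anon_inode_fds(26950) while the job is still running (26950 being the PID shown in the log) would show whether fd 5 is indeed a timerfd; reading another user's process may require matching permissions.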

I have a very long-running OpenCV job and wanted to use DMTCP so that it can be recovered after an interruption. The binary is 'opencv_haartraining'. Since it produces a high CPU load, I tried to run it under 'nice -n 20'. When I press 'c' in the coordinator to store a checkpoint, the process crashes with the error above. The full command is:

'bin/dmtcp_checkpoint nice -n 20 opencv_haartraining -data haarcascade -vec samples.vec -bg negatives.dat -nstages 20 -nsplits 2 -minhitrate 0.999 -maxfalsealarm 0.5 -npos 7000 -nneg 3019 -w 20 -h 20 -nonsym -mem 512 -mode ALL'

The system I am running is Fedora 17: Linux <censored> 3.4.6-2.fc17.x86_64 #1 SMP Thu Jul 19 22:54:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Maybe this is of some interest and helps to improve the software.
Greetings
DrTrigon

Discussion

  • Dr. Trigon
    2012-08-08

    • summary: Checkpoint all nodes fails with nice --> Checkpoint all nodes fails when using nice