From: Nate E T. <nat...@uw...> - 2020-03-27 17:48:21
Hello again.

I forgot to note before that I get warnings after the second restart, before the segmentation fault:

    [59000] WARNING at fileconnection.cpp:355 in refill; REASON='JWARNING(false) failed'
    Message: Size of file smaller than what we expected

Appreciate any help or work you do to fix this issue!

--
Nate TeBlunthuis
PhD Candidate, Department of Communication, Community Data Science Collective
University of Washington
https://teblunthuis.cc

________________________________
From: Nate E TeBlunthuis
Sent: Monday, March 23, 2020 1:12 PM
To: dmt...@li... <dmt...@li...>
Subject: Segmentation faults with R.

Greetings,

I am fitting models using the rstanarm package (which is part of the mc-stan system for statistical modeling). I'm trying to checkpoint my models using dmtcp under a slurm scheduler. I tested checkpointing fitting toy models with dmtcp and it seemed to work just fine. I can checkpoint and resume multiple times and get a valid model in the end.

But when I try to fit larger models that use around 24G of memory, I have problems with multiple checkpoints and resumes. Strangely, I can successfully checkpoint and resume once, but after resuming from the second checkpoint, a subsequent attempt to checkpoint fails with a segmentation fault.

I am using dmtcp 3.0 installed by the managers of my cluster. I have tried using R 3.5.2 compiled with gcc as well as R 3.6.0 compiled with icc.

I'm running

    dmtcp_launch -p 2020 --rm --no-gzip --checkpoint-open-files --allow-file-overwrite $my_command

I also tried this with and without the --disable-dl-plugin flag.

Since dmtcp works fine with toy models that don't use much ram, I wonder if address space randomization could be a factor.

I'm more than happy to provide more information if it can help.

--
Nate TeBlunthuis
PhD Candidate, Department of Communication, Community Data Science Collective
University of Washington
https://teblunthuis.cc
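
One way to test the address space randomization guess from the original message is to read the kernel's ASLR setting on the compute node and, if randomization is on, retry the same launch with it disabled for that process tree only. The sketch below is a rough illustration, not a confirmed fix. It assumes a Linux node with util-linux `setarch` installed; `describe_aslr` is a hypothetical helper name introduced here, not part of dmtcp.

```shell
#!/bin/sh
# Hypothetical helper: map the kernel's randomize_va_space value to a label.
#   0 = no randomization, 1 = stack/mmap/VDSO randomized, 2 = heap (brk) as well
describe_aslr() {
    case "$1" in
        0) echo "disabled" ;;
        1) echo "partial" ;;
        2) echo "full" ;;
        *) echo "unknown" ;;
    esac
}

# Read the current setting where the file exists (Linux only).
aslr_file=/proc/sys/kernel/randomize_va_space
if [ -r "$aslr_file" ]; then
    echo "ASLR: $(describe_aslr "$(cat "$aslr_file")")"
else
    echo "ASLR: unknown"
fi

# If ASLR is on, the same launch can be retried without it (assumption: this
# only tests the hypothesis, it is not known to cure the segmentation fault):
#   setarch "$(uname -m)" -R dmtcp_launch -p 2020 --rm --no-gzip \
#       --checkpoint-open-files --allow-file-overwrite $my_command
```

`setarch -R` turns off randomization only for the command it starts and that command's children, so it needs no root access and leaves the rest of the node alone. If the second checkpoint still segfaults with randomization off, ASLR is probably not the cause.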