From: Jonathan P. <jo...@as...> - 2016-06-29 07:58:45
... and sorry, I forgot to answer your other question. Checkpointing does work well.

On 29 June 2016 08:08:49 BST, Jonathan Patterson <jo...@as...> wrote:
>
>Great, thank you.
>We're just using TCP over gigabit ethernet for the network.
>Slurm is 15.08.6, but it's not doing the checkpointing. I'm doing that
>manually. As far as slurm is concerned, there is no checkpointing.
>I'm not starting MPI jobs with dmtcp_launch - I'm not aiming to
>checkpoint the MPI jobs, it's the 1-core, low-memory simple jobs that I
>want to checkpoint, so these can be moved around to make way for the
>more complex jobs. So I'm thinking we can leave MPI out of this.
>Most jobs are running with no problems, it's just the dmtcp ones that
>*occasionally* have a problem.
>Some of the failing jobs (specific ones) complain about libdl.so (see
>below), but not all of them, if that helps. Maybe we should deal with
>that issue first?
>The other failing jobs fail simply with the message I posted before.
>
>[43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
>     filename = libirml.so.1
>     flag = 1
>Message: dlopen failed. You may also see a message 'ERROR: ld.so:'
>from libdl.so. If this happens only under DMTCP, then consider setting
>the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
>If the problem persists, please write to the DMTCP developers.
>
>[43000] NOTE at processinfo.cpp:199 in growStack; REASON='bottom-most
>page of stack (page with highest address) was
> invisible in /proc/self/maps. It is made visible again now.'
>[43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
>     filename = libcilkrts.so
>     flag = 1
>Message: dlopen failed. You may also see a message 'ERROR: ld.so:'
>from libdl.so. If this happens only under DMTCP, then consider setting
>the environment variable DMTCP_DL_PLUGIN to "0" before 'dmtcp_launch'.
>If the problem persists, please write to the DMTCP developers.
>
>[43000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>     _magicBits =
>Message: read invalid message, _magicBits mismatch. Did DMTCP
>coordinator die uncleanly?
>
>
>On 28/06/16 22:59, Jiajun Cao wrote:
>> Hi Jonathan,
>>
>> Thanks for writing to us. We're definitely glad to help you with the
>> problem. Can you provide us the following info:
>>
>> What's the interconnect of the cluster, InfiniBand, TCP?
>>
>> What versions of Slurm and MPI do you use?
>>
>> Aside from the failed jobs, are the remaining jobs successful? Can they
>> checkpoint/restart successfully?
>>
>> The log you sent is very general: it tells only that the client cannot
>> connect to the coordinator somehow. There can be various reasons for
>> that. We'll need to dig further.
>>
>> Best,
>> Jiajun
>>
>> On Tue, Jun 28, 2016 at 11:45 AM, Jonathan Patterson <jo...@as...> wrote:
>>
>>     Hello!
>>     I'm running v. 2.4.4 on CentOS 6.8, kernel
>>     2.6.32-431.20.3.el6.x86_64
>>     This is a cluster, with ~ 100 compute nodes, running slurm.
>>     Jobs are started with dmtcp_launch --rm. The idea is that
>>     jobs can be checkpointed as needed, to move them around between
>>     machines to fit jobs together to make room for high memory/specific
>>     MPI geometry jobs. This has worked well, but...
>>     Out of ~ 45,000 jobs that have run so far, ~ 100 have
>>     errors as below. I cannot find a common compute node, time, job
>>     type, user, memory usage, or any other factor - it seems that dmtcp
>>     is just randomly generating this error. This stops the job, which is
>>     a bit of a problem. No checkpointing was attempted on these jobs.
>>     Any ideas where I should look for the problem, anybody?
>>     Anything I can do to get some more debugging info? Is it the
>>     coordinator, or the dmtcp library wrapped around the running program
>>     that's generating this error?
>>     Thanks in advance...
>>
>>     [47000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid;
>>     REASON='JASSERT(strcmp ( DMTCP_MAGIC_STRING,_magicBits ) == 0) failed'
>>     _magicBits =
>>     Message: read invalid message, _magicBits mismatch. Did DMTCP
>>     coordinator die uncleanly?
>>     main-PYTHIA8-lhef (47000): Terminating...
>>     [40000] ERROR at coordinatorapi.cpp:601 in createNewConnectionBeforeFork;
>>     REASON='JASSERT(_coordinatorSocket.isValid()) failed'
>>     bash (40000): Terminating...
>>
>>     _______________________________________________
>>     Dmtcp-forum mailing list
>>     Dmt...@li...
>>     https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
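
One way to look for a common factor among the ~100 failing jobs is to group their log output by error signature, so that the libdl.so warnings and the `_magicBits` failures can be counted separately. The following is only a minimal sketch, assuming the log entries look like the ones quoted in this thread (`[pid] LEVEL at file:line in function;`). The function name `tally_signatures` and the sample text (condensed from the excerpts above) are hypothetical, and log lines that have been wrapped or prefixed with `>` by a mail client would need to be unwrapped first.

```python
import re
from collections import Counter

# Header of a log entry as quoted in this thread, for example:
# "[43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'"
# The pattern is an assumption based on those excerpts only.
ENTRY_RE = re.compile(
    r"\[(?P<pid>\d+)\]\s+(?P<level>ERROR|WARNING|NOTE)\s+at\s+"
    r"(?P<file>[\w.]+):(?P<line>\d+)\s+in\s+(?P<func>\w+);"
)


def tally_signatures(log_text):
    """Count log entries by (level, source file, line number, function)."""
    counts = Counter()
    for m in ENTRY_RE.finditer(log_text):
        key = (m["level"], m["file"], int(m["line"]), m["func"])
        counts[key] += 1
    return counts


# Hypothetical sample, condensed from the log excerpts in this thread.
sample = """\
[43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
[43000] NOTE at processinfo.cpp:199 in growStack; REASON='bottom-most page'
[43000] WARNING at dlwrappers.cpp:75 in dlopen; REASON='JWARNING(ret) failed'
[43000] ERROR at dmtcpmessagetypes.cpp:65 in assertValid; REASON='JASSERT(...) failed'
"""

if __name__ == "__main__":
    # Print the most frequent signatures first.
    for (level, src, line, func), n in tally_signatures(sample).most_common():
        print(f"{n:3d}  {level:7s} {src}:{line} in {func}")
```

Running the same tally per job, and then comparing the signatures against node, user, and start time, would show whether the dlopen warnings and the coordinator errors occur in the same jobs or in separate ones.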