From: Grigory S. <sha...@gm...> - 2020-11-23 09:44:17
Judging from the last message, "FINISHED: runRelionStep, step 2", there is no problem with the queue system. The question is whether relion actually produced any output data, since the last iteration detected and passed to self._getFileName('model', iter=iteration) is None.

Best regards,
Grigory

--------------------------------------------------------------------------------
Grigory Sharov, Ph.D.

MRC Laboratory of Molecular Biology,
Francis Crick Avenue,
Cambridge Biomedical Campus,
Cambridge CB2 0QH, UK.
tel. +44 (0) 1223 267228
e-mail: gs...@mr...


On Mon, Nov 23, 2020 at 9:33 AM Lugmayr, Wolfgang <w.l...@uk...> wrote:
>
> hi,
>
> i compared the template posted below with mine and i have the following differences in my template:
> MANDATORY = False
> #SBATCH --ntasks %_(JOB_NODES)s
>
> your template may need also to change:
> #SBATCH --nodes %_(NODES)s
>
> on our cluster the amount of mpi ntasks defines how many nodes you get. so i do not use the --nodes parameter.
>
> cheers,
> wolfgang
>
>
> ________________________________
> From: "Pablo Conesa" <pc...@cn...>
> To: "Mailing list for Scipion users" <sci...@li...>
> Sent: Monday, 23 November, 2020 09:49:29
> Subject: Re: [scipion-users] Questions about host.conf for Scipion on slurp cluster
>
> I think job went in!
>
> I see here more an issue when loading the starfile. Maybe iteration is None?
>
> https://github.com/scipion-em/scipion-em-relion/blob/support/relion/protocols/protocol_classify3d.py#L166
>
> Grigory, Jose Miguel?
>
>
> On 23/11/20 9:32, Yangyang Yi wrote:
>
> Sorry for the late reply. Here’s the log from the job begin to the job end.
> run.stdout:
>
> 00001: RUNNING PROTOCOL -----------------
> 00002: HostName: headnode.cm.cluster
> 00003: PID: 209177
> 00004: Scipion: v2.0 (2019-04-23) Diocletian
> 00005: currentDir: /ddn/users/spadm/ScipionUserData/projects/relion_benchmark
> 00006: workingDir: Runs/000347_ProtRelionClassify3D
> 00007: runMode: Continue
> 00008: MPI: 5
> 00009: threads: 2
> 00010: len(steps) 3 len(prevSteps) 0
> 00011: Starting at step: 1
> 00012: Running steps
> 00013: STARTED: convertInputStep, step 1
> 00014: 2020-11-12 13:46:13.708639
> 00015: Converting set from 'Runs/000002_ProtImportParticles/particles.sqlite' into 'Runs/000347_ProtRelionClassify3D/input_particles.star'
> 00016: convertBinaryFiles: creating soft links.
> 00017: Root: Runs/000347_ProtRelionClassify3D/extra/input -> Runs/000002_ProtImportParticles/extra
> 00018: FINISHED: convertInputStep, step 1
> 00019: 2020-11-12 13:46:48.416238
> 00020: STARTED: runRelionStep, step 2
> 00021: 2020-11-12 13:46:48.438416
> 00022: ** Submiting to queue: 'sbatch /ddn/users/spadm/ScipionUserData/projects/relion_benchmark/Runs/000347_ProtRelionClassify3D/logs/347-0-1.job'
> 00023: launched job with id 2552
> 00024: FINISHED: runRelionStep, step 2
> 00025: 2020-11-12 13:46:48.524619
> 00026: STARTED: createOutputStep, step 3
> 00027: 2020-11-12 13:46:48.973668
> 00028: Traceback (most recent call last):
> 00029:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/executor.py", line 151, in run
> 00030:     self.step._run() # not self.step.run() , to avoid race conditions
> 00031:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 237, in _run
> 00032:     resultFiles = self._runFunc()
> 00033:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 233, in _runFunc
> 00034:     return self._func(*self._args)
> 00035:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 77, in createOutputStep
> 00036:     self._fillClassesFromIter(classes3D, self._lastIter())
> 00037:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 176, in _fillClassesFromIter
> 00038:     self._loadClassesInfo(iteration)
> 00039:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 166, in _loadClassesInfo
> 00040:     self._getFileName('model', iter=iteration))
> 00041:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 841, in _getFileName
> 00042:     return self.__filenamesDict[key] % kwargs
> 00043: TypeError: %d format: a number is required, not NoneType
> 00044: Protocol failed: %d format: a number is required, not NoneType
> 00045: FAILED: createOutputStep, step 3
> 00046: 2020-11-12 13:46:48.991279
> 00047: *** Last status is failed
> 00048: ------------------- PROTOCOL FAILED (DONE 3/3)
>
> run.log:
> 2020-11-12 13:46:48.438416
> 00020: 2020-11-12 13:46:48,972 INFO: FINISHED: runRelionStep, step 2
> 00021: 2020-11-12 13:46:48,973 INFO: 2020-11-12 13:46:48.524619
> 00022: 2020-11-12 13:46:48,973 INFO: STARTED: createOutputStep, step 3
> 00023: 2020-11-12 13:46:48,973 INFO: 2020-11-12 13:46:48.973668
> 00024: 2020-11-12 13:46:49,485 ERROR: Protocol failed: %d format: a number is required, not NoneType
> 00025: 2020-11-12 13:46:49,508 INFO: FAILED: createOutputStep, step 3
> 00026: 2020-11-12 13:46:49,508 INFO: 2020-11-12 13:46:48.991279
> 00027: 2020-11-12 13:46:49,570 INFO: ------------------- PROTOCOL FAILED (DONE 3/3)
>
>
>
> On 12 November 2020, at 2:19 AM, Grigory Sharov <sha...@gm...> wrote:
>
> Hi Yangyang,
>
> I've tried your config with Scipion2 and it seems to work fine. The only problem I found was using curly quotes (“) instead of straight ones (") in the queues dictionary. Did you get the error message after the job was submitted and started to run or before?
>
> Best regards,
> Grigory
>
> --------------------------------------------------------------------------------
> Grigory Sharov, Ph.D.
>
> MRC Laboratory of Molecular Biology,
> Francis Crick Avenue,
> Cambridge Biomedical Campus,
> Cambridge CB2 0QH, UK.
> tel. +44 (0) 1223 267228
> e-mail: gs...@mr...
>
>
> On Wed, Nov 11, 2020 at 9:20 AM Yangyang Yi <yy...@si...> wrote:
>>
>> Dear Scipion users & devs,
>>
>> I am kindly asking for your advice.
>>
>> We are trying to set up Scipion-2.0 on a slurm cluster. It runs on a single machine but fails to submit jobs to the queue. The slurm cluster itself works well, and running scipion on a single node works.
>>
>> Here are our settings for host.conf:
>>
>> host.conf:
>> [localhost]
>> PARALLEL_COMMAND = mpirun -np %_(JOB_NODES)d -bynode %_(COMMAND)s
>> NAME = SLURM
>> MANDATORY = 0
>> SUBMIT_COMMAND = sbatch %_(JOB_SCRIPT)s
>> CANCEL_COMMAND = scancel %_(JOB_ID)s
>> CHECK_COMMAND = squeue -j %_(JOB_ID)s
>> SUBMIT_TEMPLATE = #!/bin/bash
>> ####SBATCH --export=ALL
>> #SBATCH -p %_(JOB_QUEUE)s
>> #SBATCH -J %_(JOB_NAME)s
>> #SBATCH -o %_(JOB_SCRIPT)s.out
>> #SBATCH -e %_(JOB_SCRIPT)s.err
>> #SBATCH --time=%_(JOB_TIME)s:00:00
>> #SBATCH --nodes=1
>> #SBATCH --ntasks=%_(JOB_NODES)d
>> #SBATCH --cpus-per-task=%_(JOB_THREADS)d
>> WORKDIR=$SLURM_JOB_SUBMIT_DIR
>> export XMIPP_IN_QUEUE=1
>> cd $WORKDIR
>> # Make a copy of node file
>> echo $SLURM_JOB_NODELIST > %_(JOB_NODEFILE)s
>> ### Display the job context
>> echo Running on host `hostname`
>> echo Time is `date`
>> echo Working directory is `pwd`
>> echo $SLURM_JOB_NODELIST
>> echo CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES
>> #################################
>> %_(JOB_COMMAND)s
>> find "$SLURM_SUBMIT_DIR" -type f -user $USER -perm 644 -exec chmod 664 {} +
>> QUEUES = {
>> “a": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>> ["NODES","1", "Nodes", "How many nodes required for all the nodes"],
>> ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
>> “b": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>> ["NODES","1", "Nodes", "How many nodes required for all the nodes"],
>> ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
>> “c": [["JOB_MEMORY", "8192", "Memory (MB)", "Select amount of memory (in megabytes) for this job"],
>> ["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>> ["NODES","1", "Nodes", "How many nodes required for all the nodes"],
>> ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]]
>> }
>> JOB_DONE_REGEX =
>>
>> And Scipion reports:
>> typeerror: %d format: a number is required, not nonetype
>>
>> Any suggestions about how to solve the problem? Thanks!
>> _______________________________________________
>> scipion-users mailing list
>> sci...@li...
>> https://lists.sourceforge.net/lists/listinfo/scipion-users
>
> --
> Pablo Conesa - Madrid Scipion team
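
P.S. A minimal sketch of the check described at the top of this message: look for relion model files in the run's extra folder and work out the last iteration from them. If none exist, the result is None, and formatting a "%d"-style filename template with None raises the TypeError seen in the traceback. In the quoted log, createOutputStep starts roughly half a second after sbatch returns, so it seems plausible that no model file existed yet at that point. The file naming (relion_it%(iter)03d_model.star) and the helper names last_iteration and model_file are assumptions for illustration only, not the plugin's actual code.

```python
import glob
import os
import re

# ASSUMPTION: naming scheme of relion model files inside the protocol's
# extra/ folder. Check the plugin's own filename templates for the real one.
MODEL_TEMPLATE = "relion_it%(iter)03d_model.star"
MODEL_GLOB = "relion_it*_model.star"
MODEL_REGEX = re.compile(r"relion_it(\d+)_model\.star$")


def last_iteration(extra_dir):
    """Return the highest iteration that has a model file, or None."""
    iterations = []
    for path in glob.glob(os.path.join(extra_dir, MODEL_GLOB)):
        match = MODEL_REGEX.search(os.path.basename(path))
        if match:
            iterations.append(int(match.group(1)))
    return max(iterations) if iterations else None


def model_file(extra_dir, iteration):
    """Build the model file path, failing early with a readable message."""
    if iteration is None:
        # Without this guard, MODEL_TEMPLATE % {"iter": None} raises
        # "TypeError: %d format: a number is required, not NoneType".
        raise RuntimeError(
            "No relion model files found in %r; did the queued job "
            "finish (or even start)?" % extra_dir)
    return os.path.join(extra_dir, MODEL_TEMPLATE % {"iter": iteration})
```

For the run in this thread one would call last_iteration("Runs/000347_ProtRelionClassify3D/extra") from the project directory. A None result would mean relion wrote no model files there, which points at the job itself (see the .out and .err files next to the job script) rather than at host.conf.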