From: Grigory S. <sha...@gm...> - 2020-11-26 11:13:23
It could also be that job scripts starting with a number are not allowed by
Slurm: Runs/000429_ProtRelionClassify2D/logs/429

In this case SUBMIT_PREFIX is required:
https://scipion-em.github.io/docs/docs/scipion-modes/host-configuration.html

Best regards,
Grigory

--------------------------------------------------------------------------------
Grigory Sharov, Ph.D.

MRC Laboratory of Molecular Biology,
Francis Crick Avenue,
Cambridge Biomedical Campus,
Cambridge CB2 0QH, UK.
tel. +44 (0) 1223 267228 <+44%201223%20267228>
e-mail: gs...@mr...

On Thu, Nov 26, 2020 at 10:58 AM Pablo Conesa <pc...@cn...> wrote:

> Just an idea!!
>
> Maybe the "working directory" is not being passed to the node, thus working
> on home and failing?
>
> On 26/11/20 11:02, Grigory Sharov wrote:
>
> Hi,
>
> the place where you start Scipion is not relevant; does the SCIPION_USER_DATA
> variable in the config file point to a shared writable location?
>
> Best regards,
> Grigory
>
> --------------------------------------------------------------------------------
> Grigory Sharov, Ph.D.
>
> MRC Laboratory of Molecular Biology,
> Francis Crick Avenue,
> Cambridge Biomedical Campus,
> Cambridge CB2 0QH, UK.
> tel. +44 (0) 1223 267228 <+44%201223%20267228>
> e-mail: gs...@mr...
>
> On Thu, Nov 26, 2020 at 8:57 AM Yangyang Yi <yy...@si...> wrote:
>
>> I have checked the job more carefully and also tried other jobs.
>> For the Relion Class3D job (I used the relion_benchmark dataset) it reported
>> “Output directory does not exist” in Runs/000429_ProtRelionClassify2D/logs/429:
>>
>> ERROR:
>> ERROR: output directory does not exist!
>> /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaab935e555]
>> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi() [0x436a8f]
>> ==================
>> ERROR:
>> ERROR: output directory does not exist!
>> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x41) [0x447f91]
>> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN11MlOptimiser17initialiseGeneralEi+0x248f) [0x5ada9f]
>> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x998) [0x4689f8]
>> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(main+0xb2f) [0x4336ff]
>> /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaab935e555]
>> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi() [0x436a8f]
>>
>> I have run some other jobs and they reported similar errors, such as
>> “Cannot open tif file” (in MotionCor2) or "forrtl: severe (24): end-of-file
>> during read, unit 5, file /proc/226944/fd/0” (in Unblur). For
>> cisTEM-unblur, it reports:
>> IOError: [Errno 2] No such file or directory:
>> 'Runs/000250_CistemProtUnblur/extra/May08_03.05.02_shifts.txt’,
>> but I found these files in my home directory where I opened Scipion.
>>
>> I suspect the raw data's location matters. My testing data is located on
>> our cluster's shared storage, owned by root, and all users can read but
>> not modify it (like /data/tutorial_data/). That data has been used for
>> software teaching and testing before, and I am sure all users can process
>> it outside Scipion. But I started Scipion in my home directory, which is
>> also on the cluster's shared storage (like /data/users/xxxlab/xxx). Is
>> there anything I should take care of?
>>
>> I will also try Scipion 3.0 to see if it works.
>>
>> On Nov 23, 2020 at 4:32 PM, Yangyang Yi <yy...@si...> wrote:
>>
>> Sorry for the late reply. Here's the log from the beginning of the job to
>> its end.
>> run.stdout:
>>
>> 00001: RUNNING PROTOCOL -----------------
>> 00002: HostName: headnode.cm.cluster
>> 00003: PID: 209177
>> 00004: Scipion: v2.0 (2019-04-23) Diocletian
>> 00005: currentDir: /ddn/users/spadm/ScipionUserData/projects/relion_benchmark
>> 00006: workingDir: Runs/000347_ProtRelionClassify3D
>> 00007: runMode: Continue
>> 00008: MPI: 5
>> 00009: threads: 2
>> 00010: len(steps) 3 len(prevSteps) 0
>> 00011: Starting at step: 1
>> 00012: Running steps
>> 00013: STARTED: convertInputStep, step 1
>> 00014: 2020-11-12 13:46:13.708639
>> 00015: Converting set from 'Runs/000002_ProtImportParticles/particles.sqlite' into 'Runs/000347_ProtRelionClassify3D/input_particles.star'
>> 00016: convertBinaryFiles: creating soft links.
>> 00017: Root: Runs/000347_ProtRelionClassify3D/extra/input -> Runs/000002_ProtImportParticles/extra
>> 00018: FINISHED: convertInputStep, step 1
>> 00019: 2020-11-12 13:46:48.416238
>> 00020: STARTED: runRelionStep, step 2
>> 00021: 2020-11-12 13:46:48.438416
>> 00022: ** Submiting to queue: 'sbatch /ddn/users/spadm/ScipionUserData/projects/relion_benchmark/Runs/000347_ProtRelionClassify3D/logs/347-0-1.job'
>> 00023: launched job with id 2552
>> 00024: FINISHED: runRelionStep, step 2
>> 00025: 2020-11-12 13:46:48.524619
>> 00026: STARTED: createOutputStep, step 3
>> 00027: 2020-11-12 13:46:48.973668
>> 00028: Traceback (most recent call last):
>> 00029:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/executor.py", line 151, in run
>> 00030:     self.step._run()  # not self.step.run(), to avoid race conditions
>> 00031:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 237, in _run
>> 00032:     resultFiles = self._runFunc()
>> 00033:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 233, in _runFunc
>> 00034:     return self._func(*self._args)
>> 00035:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 77, in createOutputStep
>> 00036:     self._fillClassesFromIter(classes3D, self._lastIter())
>> 00037:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 176, in _fillClassesFromIter
>> 00038:     self._loadClassesInfo(iteration)
>> 00039:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 166, in _loadClassesInfo
>> 00040:     self._getFileName('model', iter=iteration))
>> 00041:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 841, in _getFileName
>> 00042:     return self.__filenamesDict[key] % kwargs
>> 00043: TypeError: %d format: a number is required, not NoneType
>> 00044: Protocol failed: %d format: a number is required, not NoneType
>> 00045: FAILED: createOutputStep, step 3
>> 00046: 2020-11-12 13:46:48.991279
>> 00047: *** Last status is failed
>> 00048: ------------------- PROTOCOL FAILED (DONE 3/3)
>>
>> run.log:
>> 2020-11-12 13:46:48.438416
>> 00020: 2020-11-12 13:46:48,972 INFO: FINISHED: runRelionStep, step 2
>> 00021: 2020-11-12 13:46:48,973 INFO: 2020-11-12 13:46:48.524619
>> 00022: 2020-11-12 13:46:48,973 INFO: STARTED: createOutputStep, step 3
>> 00023: 2020-11-12 13:46:48,973 INFO: 2020-11-12 13:46:48.973668
>> 00024: 2020-11-12 13:46:49,485 ERROR: Protocol failed: %d format: a number is required, not NoneType
>> 00025: 2020-11-12 13:46:49,508 INFO: FAILED: createOutputStep, step 3
>> 00026: 2020-11-12 13:46:49,508 INFO: 2020-11-12 13:46:48.991279
>> 00027: 2020-11-12 13:46:49,570 INFO: ------------------- PROTOCOL FAILED (DONE 3/3)
>>
>> On Nov 12, 2020 at 2:19 AM, Grigory Sharov <sha...@gm...> wrote:
>>
>> Hi Yangyang,
>>
>> I've tried your config with Scipion 2 and it seems to work fine.
>> The only problem I found was using curly quotes (“) instead of straight
>> ones (") in the queues dictionary. Did you get the error message after the
>> job was submitted and started to run, or before?
>>
>> Best regards,
>> Grigory
>>
>> --------------------------------------------------------------------------------
>> Grigory Sharov, Ph.D.
>>
>> MRC Laboratory of Molecular Biology,
>> Francis Crick Avenue,
>> Cambridge Biomedical Campus,
>> Cambridge CB2 0QH, UK.
>> tel. +44 (0) 1223 267228 <+44%201223%20267228>
>> e-mail: gs...@mr...
>>
>> On Wed, Nov 11, 2020 at 9:20 AM Yangyang Yi <yy...@si...> wrote:
>>
>>> Dear Scipion users & devs,
>>>
>>> I am kindly asking for your advice.
>>>
>>> We are trying to set up Scipion 2.0 on a Slurm cluster. It runs on a
>>> single machine but fails to submit jobs to the queue. The Slurm cluster
>>> itself works well, and running Scipion on a single node works.
>>>
>>> Here are our settings in host.conf:
>>>
>>> [localhost]
>>> PARALLEL_COMMAND = mpirun -np %_(JOB_NODES)d -bynode %_(COMMAND)s
>>> NAME = SLURM
>>> MANDATORY = 0
>>> SUBMIT_COMMAND = sbatch %_(JOB_SCRIPT)s
>>> CANCEL_COMMAND = scancel %_(JOB_ID)s
>>> CHECK_COMMAND = squeue -j %_(JOB_ID)s
>>> SUBMIT_TEMPLATE = #!/bin/bash
>>> ####SBATCH --export=ALL
>>> #SBATCH -p %_(JOB_QUEUE)s
>>> #SBATCH -J %_(JOB_NAME)s
>>> #SBATCH -o %_(JOB_SCRIPT)s.out
>>> #SBATCH -e %_(JOB_SCRIPT)s.err
>>> #SBATCH --time=%_(JOB_TIME)s:00:00
>>> #SBATCH --nodes=1
>>> #SBATCH --ntasks=%_(JOB_NODES)d
>>> #SBATCH --cpus-per-task=%_(JOB_THREADS)d
>>> WORKDIR=$SLURM_JOB_SUBMIT_DIR
>>> export XMIPP_IN_QUEUE=1
>>> cd $WORKDIR
>>> # Make a copy of node file
>>> echo $SLURM_JOB_NODELIST > %_(JOB_NODEFILE)s
>>> ### Display the job context
>>> echo Running on host `hostname`
>>> echo Time is `date`
>>> echo Working directory is `pwd`
>>> echo $SLURM_JOB_NODELIST
>>> echo CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES
>>> #################################
>>> %_(JOB_COMMAND)s
>>> find "$SLURM_SUBMIT_DIR" -type f -user $USER -perm 644 -exec chmod 664 {} +
>>> QUEUES = {
>>>     “a": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>>           ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
>>>           ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
>>>     “b": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>>           ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
>>>           ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
>>>     “c": [["JOB_MEMORY", "8192", "Memory (MB)", "Select amount of memory (in megabytes) for this job"],
>>>           ["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>>           ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
>>>           ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]]
>>>     }
>>> JOB_DONE_REGEX =
>>>
>>> And Scipion reports:
>>> TypeError: %d format: a number is required, not NoneType
>>>
>>> Any suggestions about how to solve the problem? Thanks!
>>>
>>> _______________________________________________
>>> scipion-users mailing list
>>> sci...@li...
>>> https://lists.sourceforge.net/lists/listinfo/scipion-users
>>
>
> --
> Pablo Conesa - *Madrid Scipion <http://scipion.i2pc.es> team*
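As an illustration of the SUBMIT_PREFIX workaround suggested at the top of the thread: if Slurm rejects generated job scripts whose names start with a digit (e.g. 429), a prefix can be added in host.conf. A minimal sketch, assuming the key documented on the linked host-configuration page; the prefix value "scipion" is only an example:

```
[localhost]
SUBMIT_COMMAND = sbatch %_(JOB_SCRIPT)s
# Prepended to the generated job name so it no longer starts with a
# digit (e.g. 429 -> scipion429).
SUBMIT_PREFIX = scipion
```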
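The TypeError in the traceback comes from Python %-style string interpolation: `_getFileName('model', iter=iteration)` formats a filename template whose `%(iter)d`-style placeholder needs an integer, but `_lastIter()` returns None when no completed Relion iteration can be found (here, because the queued job's output never landed in the working directory). A minimal sketch of the mechanism, with a hypothetical template (not Scipion's actual filename dictionary):

```python
# Hypothetical filename template in the style of Scipion's _getFileName().
filenames = {
    "model": "Runs/000347_ProtRelionClassify3D/extra/relion_it%(iter)03d_model.star"
}

def get_file_name(key, **kwargs):
    # Same interpolation as protocol.py line 841: template % kwargs
    return filenames[key] % kwargs

# With a real iteration number the template formats fine:
print(get_file_name("model", iter=25))

# When no iteration was found, iter is None and %-formatting raises the
# TypeError seen in the log (exact wording varies across Python versions):
try:
    get_file_name("model", iter=None)
except TypeError as e:
    print(e)
```

So the "%d format" message is a secondary symptom: the primary failure is that the submitted job produced no output for createOutputStep to read.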