From: Yangyang Yi <yy...@si...> - 2020-11-26 08:57:36
I have checked the job more carefully and also tried other jobs. For the Relion Class3D job (I used the relion_benchmark dataset), it reported "output directory does not exist" in Runs/000429_ProtRelionClassify2D/logs/429:

ERROR: ERROR: output directory does not exist!
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaab935e555]
/cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi() [0x436a8f]
==================
ERROR: ERROR: output directory does not exist!
/cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x41) [0x447f91]
/cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN11MlOptimiser17initialiseGeneralEi+0x248f) [0x5ada9f]
/cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x998) [0x4689f8]
/cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(main+0xb2f) [0x4336ff]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaab935e555]
/cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi() [0x436a8f]

I have run some other jobs and they reported similar errors, such as "Cannot open tif file" (in MotionCor2) or "forrtl: severe (24): end-of-file during read, unit 5, file /proc/226944/fd/0" (in unblur). For cisTEM-unblur, it reports:

IOError: [Errno 2] No such file or directory: 'Runs/000250_CistemProtUnblur/extra/May08_03.05.02_shifts.txt'

but I found these files in my home directory, where I started Scipion.

I suspect that the location of the raw data matters. My test data is located on our cluster's shared storage, owned by root, so that everyone can read it but nobody can modify it (e.g. /data/tutorial_data/). This data has been used for software teaching and testing before, and I'm sure that all users can process it outside Scipion. However, I started Scipion in my home directory, which is also located on the cluster's shared storage (e.g. /data/users/xxxlab/xxx). Is there anything I should take care of?
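One possible reading of the cisTEM-unblur symptom, offered as an assumption rather than a confirmed diagnosis: paths such as Runs/... are relative, so they resolve against whatever directory a process happens to be running in. The sketch below uses only the Python standard library; the two directories are temporary stand-ins (hypothetical, not real Scipion paths), and `resolve` is an illustrative helper, not part of Scipion.

```python
import os
import tempfile


def resolve(relative_path, cwd):
    """Return the absolute path that relative_path points to when run from cwd."""
    return os.path.normpath(os.path.join(cwd, relative_path))


# Hypothetical stand-ins for the project directory and the launch (home) directory.
base = tempfile.mkdtemp()
project_dir = os.path.join(base, "project")
home_dir = os.path.join(base, "home")
os.makedirs(project_dir)
os.makedirs(home_dir)

rel = os.path.join("Runs", "000250_CistemProtUnblur", "extra",
                   "May08_03.05.02_shifts.txt")

# A program running in home_dir drops its output next to itself ...
written = os.path.join(home_dir, "May08_03.05.02_shifts.txt")
with open(written, "w") as fh:
    fh.write("0.0 0.0\n")

# ... while the caller looks for it relative to the project directory.
expected = resolve(rel, project_dir)
print("written file exists: ", os.path.exists(written))
print("expected file exists:", os.path.exists(expected))
```

If this is what happens on the cluster, comparing the `Working directory is` line that the job script echoes with the `currentDir` line in run.stdout would show whether the queued job starts in the project directory.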
I will also try scipion-3.0 to see if it works.

> On Nov 23, 2020, at 4:32 PM, Yangyang Yi <yy...@si...> wrote:
>
> Sorry for the late reply. Here's the log from the start of the job to its end.
> run.stdout:
>
> 00001: RUNNING PROTOCOL -----------------
> 00002: HostName: headnode.cm.cluster
> 00003: PID: 209177
> 00004: Scipion: v2.0 (2019-04-23) Diocletian
> 00005: currentDir: /ddn/users/spadm/ScipionUserData/projects/relion_benchmark
> 00006: workingDir: Runs/000347_ProtRelionClassify3D
> 00007: runMode: Continue
> 00008: MPI: 5
> 00009: threads: 2
> 00010: len(steps) 3 len(prevSteps) 0
> 00011: Starting at step: 1
> 00012: Running steps
> 00013: STARTED: convertInputStep, step 1
> 00014: 2020-11-12 13:46:13.708639
> 00015: Converting set from 'Runs/000002_ProtImportParticles/particles.sqlite' into 'Runs/000347_ProtRelionClassify3D/input_particles.star'
> 00016: convertBinaryFiles: creating soft links.
> 00017: Root: Runs/000347_ProtRelionClassify3D/extra/input -> Runs/000002_ProtImportParticles/extra
> 00018: FINISHED: convertInputStep, step 1
> 00019: 2020-11-12 13:46:48.416238
> 00020: STARTED: runRelionStep, step 2
> 00021: 2020-11-12 13:46:48.438416
> 00022: ** Submiting to queue: 'sbatch /ddn/users/spadm/ScipionUserData/projects/relion_benchmark/Runs/000347_ProtRelionClassify3D/logs/347-0-1.job'
> 00023: launched job with id 2552
> 00024: FINISHED: runRelionStep, step 2
> 00025: 2020-11-12 13:46:48.524619
> 00026: STARTED: createOutputStep, step 3
> 00027: 2020-11-12 13:46:48.973668
> 00028: Traceback (most recent call last):
> 00029:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/executor.py", line 151, in run
> 00030:     self.step._run() # not self.step.run() , to avoid race conditions
> 00031:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 237, in _run
> 00032:     resultFiles = self._runFunc()
> 00033:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 233, in _runFunc
> 00034:     return self._func(*self._args)
> 00035:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 77, in createOutputStep
> 00036:     self._fillClassesFromIter(classes3D, self._lastIter())
> 00037:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 176, in _fillClassesFromIter
> 00038:     self._loadClassesInfo(iteration)
> 00039:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 166, in _loadClassesInfo
> 00040:     self._getFileName('model', iter=iteration))
> 00041:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 841, in _getFileName
> 00042:     return self.__filenamesDict[key] % kwargs
> 00043: TypeError: %d format: a number is required, not NoneType
> 00044: Protocol failed: %d format: a number is required, not NoneType
> 00045: FAILED: createOutputStep, step 3
> 00046: 2020-11-12 13:46:48.991279
> 00047: *** Last status is failed
> 00048: ------------------- PROTOCOL FAILED (DONE 3/3)
>
> run.log:
> 2020-11-12 13:46:48.438416
> 00020: 2020-11-12 13:46:48,972 INFO: FINISHED: runRelionStep, step 2
> 00021: 2020-11-12 13:46:48,973 INFO: 2020-11-12 13:46:48.524619
> 00022: 2020-11-12 13:46:48,973 INFO: STARTED: createOutputStep, step 3
> 00023: 2020-11-12 13:46:48,973 INFO: 2020-11-12 13:46:48.973668
> 00024: 2020-11-12 13:46:49,485 ERROR: Protocol failed: %d format: a number is required, not NoneType
> 00025: 2020-11-12 13:46:49,508 INFO: FAILED: createOutputStep, step 3
> 00026: 2020-11-12 13:46:49,508 INFO: 2020-11-12 13:46:48.991279
> 00027: 2020-11-12 13:46:49,570 INFO: ------------------- PROTOCOL FAILED (DONE 3/3)
>
>
>
>> On Nov 12, 2020, at 2:19 AM, Grigory Sharov <sha...@gm...> wrote:
>>
>> Hi Yangyang,
>>
>> I've tried your config with Scipion2 and it seems to work fine. The only problem I found was using curly quotes (“) instead of straight ones (") in the queues dictionary.
>> Did you get the error message after the job was submitted and started to run or before?
>>
>> Best regards,
>> Grigory
>>
>> --------------------------------------------------------------------------------
>> Grigory Sharov, Ph.D.
>>
>> MRC Laboratory of Molecular Biology,
>> Francis Crick Avenue,
>> Cambridge Biomedical Campus,
>> Cambridge CB2 0QH, UK.
>> tel. +44 (0) 1223 267228
>> e-mail: gs...@mr...
>>
>>
>> On Wed, Nov 11, 2020 at 9:20 AM Yangyang Yi <yy...@si...> wrote:
>> Dear Scipion users & devs,
>>
>> I am kindly asking for your advice.
>>
>> We are trying to set up Scipion-2.0 on a Slurm cluster. It runs on a single machine but fails to submit jobs to the queue. The Slurm cluster itself works well, and running Scipion on a single node works.
>>
>> Here are our settings for host.conf:
>>
>> host.conf:
>> [localhost]
>> PARALLEL_COMMAND = mpirun -np %_(JOB_NODES)d -bynode %_(COMMAND)s
>> NAME = SLURM
>> MANDATORY = 0
>> SUBMIT_COMMAND = sbatch %_(JOB_SCRIPT)s
>> CANCEL_COMMAND = scancel %_(JOB_ID)s
>> CHECK_COMMAND = squeue -j %_(JOB_ID)s
>> SUBMIT_TEMPLATE = #!/bin/bash
>>     ####SBATCH --export=ALL
>>     #SBATCH -p %_(JOB_QUEUE)s
>>     #SBATCH -J %_(JOB_NAME)s
>>     #SBATCH -o %_(JOB_SCRIPT)s.out
>>     #SBATCH -e %_(JOB_SCRIPT)s.err
>>     #SBATCH --time=%_(JOB_TIME)s:00:00
>>     #SBATCH --nodes=1
>>     #SBATCH --ntasks=%_(JOB_NODES)d
>>     #SBATCH --cpus-per-task=%_(JOB_THREADS)d
>>     WORKDIR=$SLURM_JOB_SUBMIT_DIR
>>     export XMIPP_IN_QUEUE=1
>>     cd $WORKDIR
>>     # Make a copy of node file
>>     echo $SLURM_JOB_NODELIST > %_(JOB_NODEFILE)s
>>     ### Display the job context
>>     echo Running on host `hostname`
>>     echo Time is `date`
>>     echo Working directory is `pwd`
>>     echo $SLURM_JOB_NODELIST
>>     echo CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES
>>     #################################
>>     %_(JOB_COMMAND)s
>>     find "$SLURM_SUBMIT_DIR" -type f -user $USER -perm 644 -exec chmod 664 {} +
>> QUEUES = {
>>     “a": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>         ["NODES","1", "Nodes", "How many nodes required for all the nodes"],
>>         ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
>>     “b": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>         ["NODES","1", "Nodes", "How many nodes required for all the nodes"],
>>         ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
>>     “c": [["JOB_MEMORY", "8192", "Memory (MB)", "Select amount of memory (in megabytes) for this job"],
>>         ["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>         ["NODES","1", "Nodes", "How many nodes required for all the nodes"],
>>         ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]]
>>     }
>> JOB_DONE_REGEX =
>>
>> And Scipion reports:
>> typeerror: %d format: a number is required, not nonetype
>>
>> Any suggestions about how to solve the problem? Thanks!
>> _______________________________________________
>> scipion-users mailing list
>> sci...@li...
>> https://lists.sourceforge.net/lists/listinfo/scipion-users
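For readers following the traceback quoted in this thread: it ends in `self.__filenamesDict[key] % kwargs`, with `iter` coming from `self._lastIter()`. A plausible reading, stated here as an assumption and not confirmed in the thread, is that no iteration output existed yet when createOutputStep ran (less than a second after `sbatch` returned), so the iteration was None. The minimal sketch below reproduces that class of error with a hypothetical filename template; the real Scipion template and the helper names are not taken from the source.

```python
# Hypothetical template; the actual entry in Scipion's filename dictionary
# is not shown in the thread.
TEMPLATE = "relion_it%(iter)03d_model.star"


def model_file(iteration):
    """Format the template the same way as `template % kwargs` in the traceback."""
    return TEMPLATE % {"iter": iteration}


def safe_model_file(iteration):
    """Illustrative guard: return None when no iteration has been found yet."""
    if iteration is None:
        return None
    return model_file(iteration)


print(model_file(25))  # relion_it025_model.star

try:
    model_file(None)  # what happens when the last iteration is None
except TypeError as exc:
    # The exact wording of the message varies between Python versions.
    print("TypeError:", exc)

print(safe_model_file(None))  # None
```

Under that assumption, the TypeError is a symptom rather than the cause: the output step ran before the queued Relion job had written any iteration files.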