From: Grigory S. <sha...@gm...> - 2020-11-26 10:02:35
Hi,

The place where you start Scipion is not relevant. Does the SCIPION_USER_DATA variable in the config file point to a shared, writable location?

Best regards,
Grigory

--------------------------------------------------------------------------------
Grigory Sharov, Ph.D.

MRC Laboratory of Molecular Biology,
Francis Crick Avenue,
Cambridge Biomedical Campus,
Cambridge CB2 0QH, UK.
tel. +44 (0) 1223 267228
e-mail: gs...@mr...


On Thu, Nov 26, 2020 at 8:57 AM Yangyang Yi <yy...@si...> wrote:

> I have checked the job more carefully and also tried other jobs.
> For the relion Class3D job (I used the relion_benchmark dataset), it reported
> "Output Directory not exist" in Runs/000429_ProtRelionClassify2D/logs/429:
>
> ERROR:
> ERROR: output directory does not exist!
> /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaab935e555]
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi() [0x436a8f]
> ==================
> ERROR:
> ERROR: output directory does not exist!
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x41) [0x447f91]
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN11MlOptimiser17initialiseGeneralEi+0x248f) [0x5ada9f]
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x998) [0x4689f8]
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(main+0xb2f) [0x4336ff]
> /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaab935e555]
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi() [0x436a8f]
>
> I have run some other jobs and they reported similar errors, such as
> "Cannot open tif file" (in MotionCor2) or "forrtl: severe (24): end-of-file
> during read, unit 5, file /proc/226944/fd/0" (in unblur). For cisTEM-unblur,
> it reports:
> IOError: [Errno 2] No such file or directory: 'Runs/000250_CistemProtUnblur/extra/May08_03.05.02_shifts.txt'
> but I found these files in my home directory where I opened Scipion.
>
> I suspect the raw data's location matters. My testing data is located on our
> cluster's shared storage, owned by root; all users can read it but not modify
> it (like /data/tutorial_data/). That data has been used for software teaching
> and testing before, and I'm sure all users can process it outside Scipion.
> But I started Scipion in my home directory, which is also located on the
> cluster's shared storage (like /data/users/xxxlab/xxx). Is there anything I
> should take care of?
>
> I will also try scipion-3.0 to see if it works.
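One thing worth noting about those errors: the failing paths are all relative ('Runs/...'), so they only resolve correctly if the process running the step has the Scipion project directory (under SCIPION_USER_DATA) as its working directory; if a queued job starts somewhere else, the same path points at nothing. A minimal sketch of that effect, with a made-up project path (illustration only, not taken from this thread):

    import os

    # Hypothetical project location; Scipion's submit template is expected
    # to cd into the project before running the step.
    project_dir = "/data/users/xxxlab/xxx/ScipionUserData/projects/relion_benchmark"
    rel_path = "Runs/000250_CistemProtUnblur/extra/May08_03.05.02_shifts.txt"

    # Resolved against the project directory, the file would be found:
    print(os.path.join(project_dir, rel_path))

    # Resolved against whatever directory the job actually starts in
    # (os.getcwd()), the same relative path can point to a non-existent
    # location, which surfaces as "No such file or directory" or
    # "output directory does not exist".
    print(os.path.abspath(rel_path))
    print(os.path.exists(rel_path))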
>
> On 23 Nov 2020, at 4:32 PM, Yangyang Yi <yy...@si...> wrote:
>
> Sorry for the late reply. Here's the log from the job's beginning to its end.
>
> run.stdout:
>
> 00001: RUNNING PROTOCOL -----------------
> 00002: HostName: headnode.cm.cluster
> 00003: PID: 209177
> 00004: Scipion: v2.0 (2019-04-23) Diocletian
> 00005: currentDir: /ddn/users/spadm/ScipionUserData/projects/relion_benchmark
> 00006: workingDir: Runs/000347_ProtRelionClassify3D
> 00007: runMode: Continue
> 00008: MPI: 5
> 00009: threads: 2
> 00010: len(steps) 3 len(prevSteps) 0
> 00011: Starting at step: 1
> 00012: Running steps
> 00013: STARTED: convertInputStep, step 1
> 00014: 2020-11-12 13:46:13.708639
> 00015: Converting set from 'Runs/000002_ProtImportParticles/particles.sqlite' into 'Runs/000347_ProtRelionClassify3D/input_particles.star'
> 00016: convertBinaryFiles: creating soft links.
> 00017: Root: Runs/000347_ProtRelionClassify3D/extra/input -> Runs/000002_ProtImportParticles/extra
> 00018: FINISHED: convertInputStep, step 1
> 00019: 2020-11-12 13:46:48.416238
> 00020: STARTED: runRelionStep, step 2
> 00021: 2020-11-12 13:46:48.438416
> 00022: ** Submiting to queue: 'sbatch /ddn/users/spadm/ScipionUserData/projects/relion_benchmark/Runs/000347_ProtRelionClassify3D/logs/347-0-1.job'
> 00023: launched job with id 2552
> 00024: FINISHED: runRelionStep, step 2
> 00025: 2020-11-12 13:46:48.524619
> 00026: STARTED: createOutputStep, step 3
> 00027: 2020-11-12 13:46:48.973668
> 00028: Traceback (most recent call last):
> 00029:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/executor.py", line 151, in run
> 00030:     self.step._run()  # not self.step.run(), to avoid race conditions
> 00031:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 237, in _run
> 00032:     resultFiles = self._runFunc()
> 00033:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 233, in _runFunc
> 00034:     return self._func(*self._args)
> 00035:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 77, in createOutputStep
> 00036:     self._fillClassesFromIter(classes3D, self._lastIter())
> 00037:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 176, in _fillClassesFromIter
> 00038:     self._loadClassesInfo(iteration)
> 00039:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 166, in _loadClassesInfo
> 00040:     self._getFileName('model', iter=iteration))
> 00041:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 841, in _getFileName
> 00042:     return self.__filenamesDict[key] % kwargs
> 00043: TypeError: %d format: a number is required, not NoneType
> 00044: Protocol failed: %d format: a number is required, not NoneType
> 00045: FAILED: createOutputStep, step 3
> 00046: 2020-11-12 13:46:48.991279
> 00047: *** Last status is failed
> 00048: ------------------- PROTOCOL FAILED (DONE 3/3)
>
> run.log:
>
> 2020-11-12 13:46:48.438416
> 00020: 2020-11-12 13:46:48,972 INFO: FINISHED: runRelionStep, step 2
> 00021: 2020-11-12 13:46:48,973 INFO: 2020-11-12 13:46:48.524619
> 00022: 2020-11-12 13:46:48,973 INFO: STARTED: createOutputStep, step 3
> 00023: 2020-11-12 13:46:48,973 INFO: 2020-11-12 13:46:48.973668
> 00024: 2020-11-12 13:46:49,485 ERROR: Protocol failed: %d format: a number is required, not NoneType
> 00025: 2020-11-12 13:46:49,508 INFO: FAILED: createOutputStep, step 3
> 00026: 2020-11-12 13:46:49,508 INFO: 2020-11-12 13:46:48.991279
> 00027: 2020-11-12 13:46:49,570 INFO: ------------------- PROTOCOL FAILED (DONE 3/3)
>
>
> On 12 Nov 2020, at 2:19 AM, Grigory Sharov <sha...@gm...> wrote:
>
> Hi Yangyang,
>
> I've tried your config with Scipion 2 and it seems to work fine. The only
> problem I found was using curly quotes (“) instead of straight ones (")
> in the queues dictionary. Did you get the error message after the job was
> submitted and started to run, or before?
>
> Best regards,
> Grigory
>
> --------------------------------------------------------------------------------
> Grigory Sharov, Ph.D.
>
> MRC Laboratory of Molecular Biology,
> Francis Crick Avenue,
> Cambridge Biomedical Campus,
> Cambridge CB2 0QH, UK.
> tel. +44 (0) 1223 267228
> e-mail: gs...@mr...
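As for the traceback itself: the "%d format" TypeError comes from filling a %-style filename template with iter=None. The timestamps above show createOutputStep running at 13:46:48, immediately after the RELION job had only been submitted to the queue, so _lastIter() presumably had no iteration files to report yet. A minimal sketch of the mechanism (the template string below is a stand-in, not the exact one used by the plugin):

    # Stand-in for the protocol's filename dictionary; the real template lives
    # in the relion plugin, this one is only illustrative.
    filenames = {'model': 'extra/relion_it%(iter)03d_model.star'}

    def get_file_name(key, **kwargs):
        # Same idea as pyworkflow's _getFileName: a %-template filled from kwargs.
        return filenames[key] % kwargs

    print(get_file_name('model', iter=25))    # 'extra/relion_it025_model.star'
    print(get_file_name('model', iter=None))  # TypeError: %d format: a number is
                                              # required, not NoneType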
>
> On Wed, Nov 11, 2020 at 9:20 AM Yangyang Yi <yy...@si...> wrote:
>
>> Dear Scipion users & devs,
>>
>> I am kindly asking for your advice.
>>
>> We are trying to set up Scipion-2.0 on a Slurm cluster. It runs fine on a
>> single machine but fails to submit jobs to the queue. The Slurm cluster
>> itself works well, and running Scipion on a single node works.
>>
>> Here are our settings for host.conf:
>>
>> [localhost]
>> PARALLEL_COMMAND = mpirun -np %_(JOB_NODES)d -bynode %_(COMMAND)s
>> NAME = SLURM
>> MANDATORY = 0
>> SUBMIT_COMMAND = sbatch %_(JOB_SCRIPT)s
>> CANCEL_COMMAND = scancel %_(JOB_ID)s
>> CHECK_COMMAND = squeue -j %_(JOB_ID)s
>> SUBMIT_TEMPLATE = #!/bin/bash
>>     ####SBATCH --export=ALL
>>     #SBATCH -p %_(JOB_QUEUE)s
>>     #SBATCH -J %_(JOB_NAME)s
>>     #SBATCH -o %_(JOB_SCRIPT)s.out
>>     #SBATCH -e %_(JOB_SCRIPT)s.err
>>     #SBATCH --time=%_(JOB_TIME)s:00:00
>>     #SBATCH --nodes=1
>>     #SBATCH --ntasks=%_(JOB_NODES)d
>>     #SBATCH --cpus-per-task=%_(JOB_THREADS)d
>>     WORKDIR=$SLURM_JOB_SUBMIT_DIR
>>     export XMIPP_IN_QUEUE=1
>>     cd $WORKDIR
>>     # Make a copy of node file
>>     echo $SLURM_JOB_NODELIST > %_(JOB_NODEFILE)s
>>     ### Display the job context
>>     echo Running on host `hostname`
>>     echo Time is `date`
>>     echo Working directory is `pwd`
>>     echo $SLURM_JOB_NODELIST
>>     echo CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES
>>     #################################
>>     %_(JOB_COMMAND)s
>>     find "$SLURM_SUBMIT_DIR" -type f -user $USER -perm 644 -exec chmod 664 {} +
>> QUEUES = {
>>     “a": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>         ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
>>         ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
>>     “b": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>         ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
>>         ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
>>     “c": [["JOB_MEMORY", "8192", "Memory (MB)", "Select amount of memory (in megabytes) for this job"],
>>         ["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>         ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
>>         ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]]
>>     }
>> JOB_DONE_REGEX =
>>
>> And Scipion reports:
>>
>> TypeError: %d format: a number is required, not NoneType
>>
>> Any suggestions about how to solve the problem? Thanks!
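On the curly-quote point from the reply above: the “a", “b" and “c" keys in the QUEUES dictionary open with curly quotes. Assuming that is the only problem with the file, the same block with straight ASCII quotes throughout would read as follows (an untested sketch of the corrected excerpt):

    QUEUES = {
        "a": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
            ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
            ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
        "b": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
            ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
            ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
        "c": [["JOB_MEMORY", "8192", "Memory (MB)", "Select amount of memory (in megabytes) for this job"],
            ["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
            ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
            ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]]
        }

Whether this alone removes the "%d format" error would still depend on the answer to the earlier question about whether the error appears before or after the job is submitted and starts to run.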