From: Pablo C. <pc...@cn...> - 2020-11-26 10:58:07
Just an idea!! Maybe the "working directory" is not being passed to the
node, so the job runs in the home directory and fails?

On 26/11/20 11:02, Grigory Sharov wrote:
> Hi,
>
> the place where you start Scipion is not relevant. Does the
> SCIPION_USER_DATA variable in the config file point to a shared,
> writable location?
>
> Best regards,
> Grigory
>
> --------------------------------------------------------------------------------
> Grigory Sharov, Ph.D.
>
> MRC Laboratory of Molecular Biology,
> Francis Crick Avenue,
> Cambridge Biomedical Campus,
> Cambridge CB2 0QH, UK.
> tel. +44 (0) 1223 267228
> e-mail: gs...@mr...
>
>
> On Thu, Nov 26, 2020 at 8:57 AM Yangyang Yi <yy...@si...> wrote:
>
> I have checked the job more carefully and also tried other jobs.
> For the Relion Class3D job (I used the relion_benchmark dataset), it
> reported "output directory does not exist" in
> Runs/000429_ProtRelionClassify2D/logs/429:
>
> ERROR:
> ERROR: output directory does not exist!
> /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaab935e555]
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi() [0x436a8f]
> ==================
> ERROR:
> ERROR: output directory does not exist!
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x41) [0x447f91]
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN11MlOptimiser17initialiseGeneralEi+0x248f) [0x5ada9f]
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi10initialiseEv+0x998) [0x4689f8]
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi(main+0xb2f) [0x4336ff]
> /lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaab935e555]
> /cm/shared/apps/scipion/2.0/software/em/relion-3.0/bin/relion_refine_mpi() [0x436a8f]
>
> I have run some other jobs and they reported similar errors, such as
> "Cannot open tif file" (in MotionCor2) or "forrtl: severe (24):
> end-of-file during read, unit 5, file /proc/226944/fd/0" (in unblur).
> For cisTEM-unblur, it reports:
> IOError: [Errno 2] No such file or directory:
> 'Runs/000250_CistemProtUnblur/extra/May08_03.05.02_shifts.txt',
> but I found these files in my home directory, where I opened Scipion.
>
> I suspect the raw data's location matters. My testing data is located
> on our cluster's shared storage (like /data/tutorial_data/); it is
> owned by root, and all users can read but not modify it. Those data
> have been used for software teaching and testing before, so I am sure
> all users can process them outside Scipion. But I started Scipion in
> my home directory, which is also on the cluster's shared storage
> (like /data/users/xxxlab/xxx). Is there anything I should take care of?
>
> I will also try Scipion 3.0 to see if it works.
>
>> On 23 Nov 2020, at 4:32 PM, Yangyang Yi <yy...@si...> wrote:
>>
>> Sorry for the late reply. Here's the log from the start of the job to
>> the end.
>> run.stdout:
>>
>> 00001: RUNNING PROTOCOL -----------------
>> 00002: HostName: headnode.cm.cluster
>> 00003: PID: 209177
>> 00004: Scipion: v2.0 (2019-04-23) Diocletian
>> 00005: currentDir: /ddn/users/spadm/ScipionUserData/projects/relion_benchmark
>> 00006: workingDir: Runs/000347_ProtRelionClassify3D
>> 00007: runMode: Continue
>> 00008: MPI: 5
>> 00009: threads: 2
>> 00010: len(steps) 3 len(prevSteps) 0
>> 00011: Starting at step: 1
>> 00012: Running steps
>> 00013: STARTED: convertInputStep, step 1
>> 00014: 2020-11-12 13:46:13.708639
>> 00015: Converting set from 'Runs/000002_ProtImportParticles/particles.sqlite' into 'Runs/000347_ProtRelionClassify3D/input_particles.star'
>> 00016: convertBinaryFiles: creating soft links.
>> 00017: Root: Runs/000347_ProtRelionClassify3D/extra/input -> Runs/000002_ProtImportParticles/extra
>> 00018: FINISHED: convertInputStep, step 1
>> 00019: 2020-11-12 13:46:48.416238
>> 00020: STARTED: runRelionStep, step 2
>> 00021: 2020-11-12 13:46:48.438416
>> 00022: ** Submiting to queue: 'sbatch /ddn/users/spadm/ScipionUserData/projects/relion_benchmark/Runs/000347_ProtRelionClassify3D/logs/347-0-1.job'
>> 00023: launched job with id 2552
>> 00024: FINISHED: runRelionStep, step 2
>> 00025: 2020-11-12 13:46:48.524619
>> 00026: STARTED: createOutputStep, step 3
>> 00027: 2020-11-12 13:46:48.973668
>> 00028: Traceback (most recent call last):
>> 00029:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/executor.py", line 151, in run
>> 00030:     self.step._run()  # not self.step.run(), to avoid race conditions
>> 00031:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 237, in _run
>> 00032:     resultFiles = self._runFunc()
>> 00033:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 233, in _runFunc
>> 00034:     return self._func(*self._args)
>> 00035:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 77, in createOutputStep
>> 00036:     self._fillClassesFromIter(classes3D, self._lastIter())
>> 00037:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 176, in _fillClassesFromIter
>> 00038:     self._loadClassesInfo(iteration)
>> 00039:   File "/cm/shared/apps/scipion/2.0/software/lib/python2.7/site-packages/relion/protocols/protocol_classify3d.py", line 166, in _loadClassesInfo
>> 00040:     self._getFileName('model', iter=iteration))
>> 00041:   File "/cm/shared/apps/scipion/2.0/pyworkflow/protocol/protocol.py", line 841, in _getFileName
>> 00042:     return self.__filenamesDict[key] % kwargs
>> 00043: TypeError: %d format: a number is required, not NoneType
>> 00044: Protocol failed: %d format: a number is required, not NoneType
>> 00045: FAILED: createOutputStep, step 3
>> 00046: 2020-11-12 13:46:48.991279
>> 00047: *** Last status is failed
>> 00048: ------------------- PROTOCOL FAILED (DONE 3/3)
>>
>> run.log:
>> 2020-11-12 13:46:48.438416
>> 00020: 2020-11-12 13:46:48,972 INFO: FINISHED: runRelionStep, step 2
>> 00021: 2020-11-12 13:46:48,973 INFO: 2020-11-12 13:46:48.524619
>> 00022: 2020-11-12 13:46:48,973 INFO: STARTED: createOutputStep, step 3
>> 00023: 2020-11-12 13:46:48,973 INFO: 2020-11-12 13:46:48.973668
>> 00024: 2020-11-12 13:46:49,485 ERROR: Protocol failed: %d format: a number is required, not NoneType
>> 00025: 2020-11-12 13:46:49,508 INFO: FAILED: createOutputStep, step 3
>> 00026: 2020-11-12 13:46:49,508 INFO: 2020-11-12 13:46:48.991279
>> 00027: 2020-11-12 13:46:49,570 INFO: ------------------- PROTOCOL FAILED (DONE 3/3)
>>
>>
>>> On 12 Nov 2020, at 2:19 AM, Grigory Sharov <sha...@gm...> wrote:
>>>
>>> Hi Yangyang,
>>>
>>> I've tried your config with Scipion2 and it seems to work fine.
>>> The only problem I found was using curly quotes (“) instead of
>>> straight ones (") in the queues dictionary. Did you get the error
>>> message after the job was submitted and started to run, or before?
>>>
>>> Best regards,
>>> Grigory
>>>
>>> --------------------------------------------------------------------------------
>>> Grigory Sharov, Ph.D.
>>>
>>> MRC Laboratory of Molecular Biology,
>>> Francis Crick Avenue,
>>> Cambridge Biomedical Campus,
>>> Cambridge CB2 0QH, UK.
>>> tel. +44 (0) 1223 267228
>>> e-mail: gs...@mr...
>>>
>>>
>>> On Wed, Nov 11, 2020 at 9:20 AM Yangyang Yi <yy...@si...> wrote:
>>>
>>> Dear Scipion users & devs,
>>>
>>> I am kindly asking for your advice.
>>>
>>> We are trying to set up Scipion 2.0 on a Slurm cluster. It runs on a
>>> single machine but fails to submit jobs to the queue. The Slurm
>>> cluster itself works well, and running Scipion on a single node works.
>>>
>>> Here are our settings in host.conf:
>>>
>>> [localhost]
>>> PARALLEL_COMMAND = mpirun -np %_(JOB_NODES)d -bynode %_(COMMAND)s
>>> NAME = SLURM
>>> MANDATORY = 0
>>> SUBMIT_COMMAND = sbatch %_(JOB_SCRIPT)s
>>> CANCEL_COMMAND = scancel %_(JOB_ID)s
>>> CHECK_COMMAND = squeue -j %_(JOB_ID)s
>>> SUBMIT_TEMPLATE = #!/bin/bash
>>>     ####SBATCH --export=ALL
>>>     #SBATCH -p %_(JOB_QUEUE)s
>>>     #SBATCH -J %_(JOB_NAME)s
>>>     #SBATCH -o %_(JOB_SCRIPT)s.out
>>>     #SBATCH -e %_(JOB_SCRIPT)s.err
>>>     #SBATCH --time=%_(JOB_TIME)s:00:00
>>>     #SBATCH --nodes=1
>>>     #SBATCH --ntasks=%_(JOB_NODES)d
>>>     #SBATCH --cpus-per-task=%_(JOB_THREADS)d
>>>     WORKDIR=$SLURM_JOB_SUBMIT_DIR
>>>     export XMIPP_IN_QUEUE=1
>>>     cd $WORKDIR
>>>     # Make a copy of node file
>>>     echo $SLURM_JOB_NODELIST > %_(JOB_NODEFILE)s
>>>     ### Display the job context
>>>     echo Running on host `hostname`
>>>     echo Time is `date`
>>>     echo Working directory is `pwd`
>>>     echo $SLURM_JOB_NODELIST
>>>     echo CUDA_VISIBLE_DEVICES: $CUDA_VISIBLE_DEVICES
>>>     #################################
>>>     %_(JOB_COMMAND)s
>>>     find "$SLURM_SUBMIT_DIR" -type f -user $USER -perm 644 -exec chmod 664 {} +
>>> QUEUES = {
>>>     “a": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>>           ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
>>>           ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
>>>     “b": [["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>>           ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
>>>           ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]],
>>>     “c": [["JOB_MEMORY", "8192", "Memory (MB)", "Select amount of memory (in megabytes) for this job"],
>>>           ["JOB_TIME", "48", "Time (hours)", "Select the time expected (in hours) for this job"],
>>>           ["NODES", "1", "Nodes", "How many nodes required for all the nodes"],
>>>           ["QUEUE_FOR_JOBS", "N", "Use queue for jobs", "Send individual jobs to queue"]]
>>> }
>>> JOB_DONE_REGEX =
>>>
>>> And Scipion reports:
>>> TypeError: %d format: a number is required, not NoneType
>>>
>>> Any suggestions about how to solve the problem? Thanks!
>>>
>>> _______________________________________________
>>> scipion-users mailing list
>>> sci...@li...
>>> https://lists.sourceforge.net/lists/listinfo/scipion-users
>
> _______________________________________________
> scipion-users mailing list
> sci...@li...
> https://lists.sourceforge.net/lists/listinfo/scipion-users

--
Pablo Conesa - Madrid Scipion <http://scipion.i2pc.es> team
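For readers hitting the same TypeError: the traceback shows Scipion expanding a filename template with Python %-formatting (`return self.__filenamesDict[key] % kwargs`), so any `None` iteration number reproduces the reported message. A minimal sketch of that failure mode (the template string below is hypothetical, for illustration only; the exact wording of the TypeError varies slightly between Python versions):

```python
# Scipion builds output file names via %-formatting, per the traceback:
#     return self.__filenamesDict[key] % kwargs
# If no completed Relion iteration is found (e.g. because the job's
# output directory was never created on the compute node), the iteration
# number ends up as None and the %d conversion raises the TypeError.

template = "relion_it%(iter)03d_model.star"  # hypothetical template

print(template % {"iter": 25})  # a valid iteration formats fine

try:
    template % {"iter": None}   # iteration number missing -> TypeError
except TypeError as err:
    print(err)                  # e.g. "%d format: a number is required, not NoneType"
```

This is why the queue-side failure ("output directory does not exist") surfaces later as a formatting error in createOutputStep: the step looks for the last completed iteration, finds none, and formats `None` into the model filename.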