From: Hoover, D. (NIH/C. [E] <hoo...@hp...> - 2021-01-22 13:51:00
Hi all,

I am trying to launch a MotionCorr job using Scipion across multiple nodes, with multiple GPUs per node.

If I leave GPU IDs blank, it attempts to run the tasks but fails because the GPU IDs are missing (the command contains the unsubstituted placeholder -Gpu %(GPU)s).

If I set GPU IDs to "0", it finishes some of the tasks (those where the command includes -Gpu 0.0), but fails on other tasks that get non-integer GPU values (e.g. -Gpu 2.5).

If I set GPU IDs to "0 1" (there are 2 GPUs per node), the entire job fails with this error:

Traceback (most recent call last):
  File "/usr/local/apps/scipion/3.0.6/anaconda/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/apps/pw_protocol_mpirun.py", line 54, in <module>
    runProtocolMainMPI(projectPath, dbPath, protId, comm)
  File "/usr/local/apps/scipion/3.0.6/anaconda/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/protocol/protocol.py", line 2229, in runProtocolMainMPI
    executor = MPIStepExecutor(hostConfig, protocol.numberOfMpi.get() - 1,
  File "/usr/local/apps/scipion/3.0.6/anaconda/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/protocol/executor.py", line 342, in __init__
    ThreadStepExecutor.__init__(self, hostConfig, nMPI, **kwargs)
  File "/usr/local/apps/scipion/3.0.6/anaconda/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/protocol/executor.py", line 177, in __init__
    self.gpuDict[node] = list(self.gpuList[i*chunk:(i+1)*chunk])
TypeError: slice indices must be integers or None or have an __index__ method

How exactly should one designate the GPU IDs in such a situation? For example, for 2 nodes, each with 2 GPUs?

David

--
David Hoover, Ph.D.
Computational Biologist
High Performance Computing Services,
Center for Information Technology,
National Institutes of Health
12 South Dr., Rm 2N207
Bethesda, MD 20892, USA
TEL: (+1) 301-435-2986
Email: hoo...@hp...
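
The TypeError in the traceback is what Python 3 raises whenever a slice bound is a float. Below is a minimal sketch of how that can arise, assuming (this is not confirmed by the Scipion code quoted in the message) that chunk is computed by dividing the number of GPUs by the number of workers with true division. The names split_gpus, gpu_list, n_workers and integer_division are hypothetical and used only for illustration; this is not the pyworkflow implementation.

```python
def split_gpus(gpu_list, n_workers, integer_division=True):
    """Give each worker a contiguous chunk of GPU IDs.

    Hypothetical helper for illustration only; it mirrors the slicing
    expression shown in the traceback, not the actual Scipion code.
    """
    if integer_division:
        chunk = len(gpu_list) // n_workers  # int, safe as a slice bound
    else:
        chunk = len(gpu_list) / n_workers   # float in Python 3, even for 2 / 2

    gpu_dict = {}
    for i, worker in enumerate(range(1, n_workers + 1)):
        # Same slicing pattern as line 177 of executor.py in the traceback.
        gpu_dict[worker] = list(gpu_list[i * chunk:(i + 1) * chunk])
    return gpu_dict


# Integer division: four GPU IDs shared between two workers.
assignment = split_gpus([0, 1, 2, 3], 2)

# True division: the float slice bound triggers the TypeError.
try:
    split_gpus([0, 1], 2, integer_division=False)
    float_chunk_error = None
except TypeError as exc:
    float_chunk_error = str(exc)
```

Under this assumption, the fractional values seen with GPU IDs set to "0" (such as -Gpu 2.5) would be consistent with the same kind of float arithmetic, though the message alone does not establish that.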