From: Pablo C. <pc...@cn...> - 2021-01-22 15:13:51
I believe it should be:

GPUs: 0 1 0 1
MPI: 3

I also think there is a bug in this case that we missed when migrating to Python 3:

https://github.com/scipion-em/scipion-pyworkflow/blob/devel/pyworkflow/protocol/executor.py#L175

    chunk = nGpu / nThreads

should be

    chunk = int(nGpu / nThreads)

I have to test this and release a fix.

On 22/1/21 14:34, Hoover, David (NIH/CIT) [E] via scipion-users wrote:
> Hi all,
>
> I am trying to launch a MotionCorr job using Scipion across multiple
> nodes with multiple GPUs per node.
>
> If I leave GPU IDs blank, it attempts to run the tasks, but fails
> because the GPU IDs are missing (the command has -Gpu %(GPU)s).
>
> If I set GPU IDs to "0", it finishes some of the tasks (where the
> command includes -Gpu 0.0), but it fails on other tasks with
> non-integer values for the GPU (e.g. -gpu 2.5).
>
> If I set GPU IDs to "0 1" (there are 2 GPUs per node), the entire job
> fails with this error:
>
> Traceback (most recent call last):
>   File "/usr/local/apps/scipion/3.0.6/anaconda/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/apps/pw_protocol_mpirun.py", line 54, in <module>
>     runProtocolMainMPI(projectPath, dbPath, protId, comm)
>   File "/usr/local/apps/scipion/3.0.6/anaconda/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/protocol/protocol.py", line 2229, in runProtocolMainMPI
>     executor = MPIStepExecutor(hostConfig, protocol.numberOfMpi.get() - 1,
>   File "/usr/local/apps/scipion/3.0.6/anaconda/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/protocol/executor.py", line 342, in __init__
>     ThreadStepExecutor.__init__(self, hostConfig, nMPI, **kwargs)
>   File "/usr/local/apps/scipion/3.0.6/anaconda/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/protocol/executor.py", line 177, in __init__
>     self.gpuDict[node] = list(self.gpuList[i*chunk:(i+1)*chunk])
> TypeError: slice indices must be integers or None or have an __index__ method
>
> How exactly should one designate the GPU IDs for such a situation?
> For example, 2 nodes, each with 2 GPU?
>
> David

--
Pablo Conesa - *Madrid Scipion <http://scipion.i2pc.es> team*
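A minimal sketch of the chunking logic under discussion (simplified from `ThreadStepExecutor.__init__` in the traceback; the standalone function name and signature here are hypothetical). In Python 2, `nGpu / nThreads` was floor division; after the Python 3 migration it returns a float, so the slice indices in `gpuList[i*chunk:(i+1)*chunk]` raise exactly the `TypeError` shown above.

```python
def assign_gpus(gpuList, nThreads):
    """Split gpuList into one chunk of GPU ids per worker thread/MPI rank."""
    nGpu = len(gpuList)
    # Buggy under Python 3:  chunk = nGpu / nThreads  -> e.g. 2.0 (float)
    # Float slice indices then raise:
    #   TypeError: slice indices must be integers or None or have an __index__ method
    chunk = int(nGpu / nThreads)  # proposed fix; equivalently nGpu // nThreads
    gpuDict = {}
    for node in range(nThreads):
        gpuDict[node] = list(gpuList[node * chunk:(node + 1) * chunk])
    return gpuDict

# With "GPUs: 0 1 0 1" and 2 worker ranks, each rank gets a pair of GPU ids:
print(assign_gpus([0, 1, 0, 1], 2))  # {0: [0, 1], 1: [0, 1]}
```

The follow-up question of how the per-rank ids map onto physical GPUs on each node is a separate concern handled by the scheduler/protocol; this sketch only shows why integer truncation of `chunk` is required.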