ParallelStencil empowers domain scientists to write architecture-agnostic high-level code for parallel high-performance stencil computations on GPUs and CPUs. It can achieve performance close to that of CUDA C / HIP, which is typically a large improvement over the performance reached with CUDA.jl or AMDGPU.jl GPU Array programming alone. For example, a 2-D shallow ice solver presented at JuliaCon 2020 [1] achieved nearly 20 times the performance of a corresponding GPU Array programming implementation; in absolute terms, it reached 70% of the theoretical upper performance bound of the Nvidia P100 GPU used, as defined by the effective throughput metric, T_eff. ParallelStencil relies on the native kernel programming capabilities of CUDA.jl and AMDGPU.jl for high-performance computations on GPUs, and on Base.Threads for those on CPUs. It is seamlessly interoperable with ImplicitGlobalGrid.jl, which greatly simplifies the distributed parallelization of stencil-based GPU and CPU applications.
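To make the T_eff metric concrete, here is a minimal plain-Julia sketch. It assumes the usual definition of effective throughput: the data that must be moved per iteration (unknown fields are read and written, known fields are only read) divided by the time per iteration. The function name `t_eff`, its arguments, and all numbers are hypothetical and chosen for illustration only; they are not measurements and not part of the ParallelStencil API.

```julia
# Effective memory throughput T_eff in GB/s (sketch under the assumptions above).
# n_cells   : number of grid cells
# n_unknown : number of fields that are both read and written each iteration
# n_known   : number of fields that are only read each iteration
# t_it      : measured time per iteration in seconds
function t_eff(n_cells, n_unknown, n_known, t_it; T=Float64)
    # GB that must be moved per iteration: unknowns count twice (read + write)
    a_eff = (2 * n_unknown + n_known) * n_cells * sizeof(T) / 1e9
    return a_eff / t_it
end

# Hypothetical example values (not measurements)
n_cells = 4096 * 4096   # 2-D grid
t_it    = 1.0e-3        # seconds per iteration
t_peak  = 500.0         # assumed peak memory throughput of the device, GB/s

teff = t_eff(n_cells, 1, 1, t_it)
println("T_eff = ", round(teff, digits=1), " GB/s, fraction of assumed peak = ",
        round(teff / t_peak, digits=2))
```

Comparing T_eff against the peak memory throughput of the hardware gives the kind of "fraction of the upper bound" figure quoted above.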
Features
- Parallelization and optimization with one macro call
- Stencil computations with math-close notation
- Seamless interoperability with communication packages and hiding of communication behind computation
- Support for architecture-agnostic low-level kernel programming
- Module documentation callable from the Julia REPL / IJulia
- Concise single/multi-xPU miniapps
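For readers unfamiliar with the term, the following is a hand-written sketch of the kind of stencil computation these features target: one explicit 2-D heat diffusion step, parallelized on the CPU with Base.Threads. It deliberately uses only plain Julia and does not use the ParallelStencil API; the function name `diffusion_step!` and all parameter values are assumptions made for illustration.

```julia
using Base.Threads

# One explicit 2-D heat diffusion step on the inner grid points:
# T2 = T + dt * lam * (d2T/dx2 + d2T/dy2)
function diffusion_step!(T2, T, lam, dt, dx, dy)
    nx, ny = size(T)
    @threads for iy in 2:ny-1          # columns are distributed over threads
        for ix in 2:nx-1
            @inbounds T2[ix, iy] = T[ix, iy] + dt * lam * (
                (T[ix+1, iy] - 2 * T[ix, iy] + T[ix-1, iy]) / dx^2 +
                (T[ix, iy+1] - 2 * T[ix, iy] + T[ix, iy-1]) / dy^2)
        end
    end
    return nothing
end

# Hypothetical setup: a single hot cell in the middle of a 64 x 64 grid
nx, ny = 64, 64
T = zeros(nx, ny)
T[nx ÷ 2, ny ÷ 2] = 1.0
T2 = copy(T)
lam = 1.0
dx = 1.0
dy = 1.0
dt = min(dx, dy)^2 / lam / 4.1         # time step within the explicit stability limit

diffusion_step!(T2, T, lam, dt, dx, dy)
```

Each output value depends only on a fixed neighborhood of input values, which is what makes such loops straightforward to parallelize across threads or GPU threads.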