... implementation; in absolute terms, it reached 70% of the theoretical upper performance bound of the used Nvidia P100 GPU, as defined by the effective throughput metric, T_eff. ParallelStencil relies on the native kernel programming capabilities of CUDA.jl and AMDGPU.jl and on Base.Threads for high-performance computations on GPUs and CPUs, respectively. It is seamlessly interoperable with ImplicitGlobalGrid.jl, which renders the distributed parallelization of stencil-based GPU and CPU apps.