BlockSparse
Efficient GPU kernels for block-sparse matrix multiplication
... patterns to scale better. The repository implements both blocksparse and blockwise convolution and transpose-convolution primitives, with support for preparing, executing, and verifying those operations on NVIDIA GPUs. Beyond the low-level kernels, it includes wrapper code for TensorFlow integration, example scripts (e.g. a transformer on the enwik8 dataset), transformer logic built on blocksparse operations, and debugging helpers.
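
To make the idea of a block-sparse matrix multiplication concrete, here is a minimal CPU reference sketch in NumPy. It is not the repository's API or its GPU kernel: the function names (`blocksparse_matmul`, `to_dense`), the argument layout, and the choice to store nonzero blocks in row-major order of the layout mask are all assumptions made for illustration. The point is only that blocks marked zero in the layout are never touched.

```python
import numpy as np


def blocksparse_matmul(x, blocks, layout, block_size):
    """Compute y = x @ W for a block-sparse W (hypothetical reference helper).

    x:      (batch, in_blocks * block_size) dense input
    blocks: (nnz, block_size, block_size) values of the nonzero blocks,
            stored in row-major order of the nonzero entries of `layout`
    layout: (in_blocks, out_blocks) 0/1 mask marking which blocks exist
    """
    _, out_blocks = layout.shape
    y = np.zeros((x.shape[0], out_blocks * block_size), dtype=x.dtype)
    # np.argwhere yields the nonzero (row, col) block indices in row-major order.
    for k, (i, j) in enumerate(np.argwhere(layout)):
        x_slice = x[:, i * block_size:(i + 1) * block_size]
        # Only blocks present in the layout contribute any work.
        y[:, j * block_size:(j + 1) * block_size] += x_slice @ blocks[k]
    return y


def to_dense(blocks, layout, block_size):
    """Expand the block-sparse weights into a dense matrix (for checking only)."""
    in_blocks, out_blocks = layout.shape
    w = np.zeros((in_blocks * block_size, out_blocks * block_size), dtype=blocks.dtype)
    for k, (i, j) in enumerate(np.argwhere(layout)):
        w[i * block_size:(i + 1) * block_size,
          j * block_size:(j + 1) * block_size] = blocks[k]
    return w
```

A result from `blocksparse_matmul` can be checked against `x @ to_dense(blocks, layout, block_size)` with `np.allclose`. A GPU kernel would process the nonzero blocks in parallel rather than in a Python loop, but the arithmetic being skipped is the same.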