BlockSparse
Efficient GPU kernels for block-sparse matrix multiplication
...In addition to low-level kernels, it includes wrapper code for integrating with TensorFlow, example scripts (e.g. a transformer on the enwik8 dataset), transformer logic that uses blocksparse operations, and debugging helpers.
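
To make the term concrete, here is a minimal NumPy sketch of what a block-sparse matrix multiplication computes. It is an illustration only, not this library's API or its GPU implementation: the function name `block_sparse_matmul`, the `layout` mask, and the row-major storage order of `blocks` are all assumptions chosen for the example. The weight matrix is divided into square blocks, only the blocks marked nonzero in the layout are stored, and the product skips every other block.

```python
import numpy as np


def block_sparse_matmul(x, blocks, layout, block_size):
    """Compute x @ W for a W stored as dense blocks (hypothetical helper).

    layout : 2-D 0/1 array, one entry per block of W.
    blocks : array of shape (nnz, block_size, block_size) holding the
             nonzero blocks in row-major order of their layout position
             (a storage order assumed for this sketch).
    """
    n_col_blocks = layout.shape[1]
    out = np.zeros((x.shape[0], n_col_blocks * block_size), dtype=x.dtype)
    # Visit only the blocks that exist; zero blocks cost nothing.
    for block, (i, j) in zip(blocks, np.argwhere(layout != 0)):
        rows = slice(i * block_size, (i + 1) * block_size)
        cols = slice(j * block_size, (j + 1) * block_size)
        out[:, cols] += x[:, rows] @ block
    return out


def demo():
    """Check the sketch against an ordinary dense matmul."""
    rng = np.random.default_rng(0)
    bs = 4
    layout = np.array([[1, 0, 1],
                       [0, 1, 0]])
    blocks = rng.standard_normal((int(layout.sum()), bs, bs))
    x = rng.standard_normal((5, layout.shape[0] * bs))

    # Build the equivalent dense W, with zeros where the layout is 0.
    dense = np.zeros((layout.shape[0] * bs, layout.shape[1] * bs))
    for block, (i, j) in zip(blocks, np.argwhere(layout != 0)):
        dense[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs] = block

    return bool(np.allclose(block_sparse_matmul(x, blocks, layout, bs), x @ dense))


if __name__ == "__main__":
    print(demo())
```

The Python loop is only there to show the arithmetic. A GPU kernel would process the nonzero blocks in parallel, and the work saved scales with the fraction of blocks that the layout leaves out.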