...TimeSformer was influential in showing that pure transformer architectures, without convolutional backbones, can perform strongly on video classification tasks. Its flexible attention design allows experimenting with different factorizations of space-time attention (divided space-time, joint, etc.) to trade off compute, memory, and accuracy.
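To make the compute trade-off concrete, here is a rough sketch that counts query-key pairs per attention layer under joint and divided space-time attention. The function name `attention_pairs` and the clip size (8 frames of 196 patches each) are illustrative assumptions, and the count ignores the classification token, attention heads, and the MLP.

```python
def attention_pairs(num_frames: int, patches_per_frame: int) -> dict:
    """Count query-key pairs for one attention layer (hypothetical helper)."""
    tokens = num_frames * patches_per_frame
    # Joint space-time: every token attends to every token in the clip.
    joint = tokens * tokens
    # Divided: each token attends over time (same patch position, all frames)
    # and then over space (all patches within its own frame).
    divided = tokens * (num_frames + patches_per_frame)
    return {"tokens": tokens, "joint": joint, "divided": divided}


# Illustrative clip size (an assumption, not taken from the text above).
costs = attention_pairs(num_frames=8, patches_per_frame=196)
print(costs)
```

With these assumed sizes, joint attention compares 2,458,624 pairs per layer against 319,872 for the divided scheme, roughly 7.7 times as many. The gap widens as frame count or resolution grows.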
Non-local Neural Networks for Video Classification
...Efficient implementations keep memory and compute manageable so the blocks can be added without rewriting the entire backbone. The result is a practical, drop-in mechanism for upgrading purely local video models into context-aware networks with strong benchmark performance.
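The drop-in property can be illustrated with a minimal, dependency-free sketch of an embedded-Gaussian non-local block. It assumes the space-time positions have already been flattened into a list of N feature vectors. The names `non_local_block`, `w_theta`, `w_phi`, `w_g`, and `w_z` are illustrative, and a real implementation would use a tensor library with 1x1x1 convolutions rather than nested lists.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_block(x, w_theta, w_phi, w_g, w_z):
    """x: N positions by C channels. Returns x plus a context term (same shape)."""
    theta = matmul(x, w_theta)               # query embedding
    phi = matmul(x, w_phi)                   # key embedding
    g = matmul(x, w_g)                       # value embedding
    scores = matmul(theta, transpose(phi))   # N x N pairwise similarities
    attn = [softmax(r) for r in scores]      # normalise over all positions j
    y = matmul(attn, g)                      # weighted sum over every position
    z = matmul(y, w_z)                       # project back to C channels
    # Residual connection: the block only adds to the existing features.
    return [[zi + xi for zi, xi in zip(zr, xr)] for zr, xr in zip(z, x)]


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [[1.0, 0.0], [3.0, 2.0]]

# With w_z set to zero, the block leaves its input unchanged.
unchanged = non_local_block(x, identity, identity, identity, zeros)
print(unchanged == x)
```

Because of the residual connection, initialising the output projection to zero makes the block an identity mapping, which is why it can be inserted into a pretrained backbone without disturbing its initial behaviour.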