video-nonlocal-net implements Non-local Neural Networks for video understanding, adding long-range dependency modeling to 2D/3D ConvNet backbones. Non-local blocks compute attention-like responses across all positions in space-time, allowing a feature at one frame and location to aggregate information from distant frames and regions. This formulation improves action recognition and spatiotemporal reasoning, especially for classes requiring context beyond short temporal windows.

The repo provides training recipes and models for standard datasets, as well as ablations that show how many non-local blocks to insert and at which stages. Efficient implementations keep memory and compute manageable, so the blocks can be added without rewriting the entire backbone. The result is a practical, drop-in mechanism for upgrading purely local video models into context-aware networks with strong benchmark performance.
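To make the mechanism concrete, below is a minimal sketch of an embedded-Gaussian non-local block written in PyTorch-style code. The class name, reduction factor, and zero-initialized output BatchNorm are illustrative assumptions for this sketch, not a description of the repo's exact implementation.

```python
# Minimal sketch of a space-time non-local block (embedded Gaussian form).
# All names and defaults here are illustrative; the repo's own code may differ.
import torch
import torch.nn as nn


class NonLocalBlock3D(nn.Module):
    """Every space-time position attends to all other positions in the clip."""

    def __init__(self, channels, reduction=2):
        super().__init__()
        inter = channels // reduction
        # 1x1x1 convolutions produce the query/key/value embeddings (theta, phi, g).
        self.theta = nn.Conv3d(channels, inter, kernel_size=1)
        self.phi = nn.Conv3d(channels, inter, kernel_size=1)
        self.g = nn.Conv3d(channels, inter, kernel_size=1)
        # Project back to the input width; zero-init the BN scale so the block
        # starts as an identity mapping and can be inserted into a pretrained net.
        self.out = nn.Sequential(
            nn.Conv3d(inter, channels, kernel_size=1),
            nn.BatchNorm3d(channels),
        )
        nn.init.zeros_(self.out[1].weight)

    def forward(self, x):
        n, c, t, h, w = x.shape
        # Flatten space-time so attention runs over all T*H*W positions.
        q = self.theta(x).flatten(2).transpose(1, 2)   # (N, THW, C')
        k = self.phi(x).flatten(2)                     # (N, C', THW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (N, THW, C')
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities
        y = (attn @ v).transpose(1, 2).reshape(n, -1, t, h, w)
        return x + self.out(y)                         # residual connection
```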
Features
- Non-local blocks for long-range space-time dependency modeling
- Integrations with popular 2D/3D backbones for action recognition
- Reference training scripts and ablation configurations
- Memory-aware implementations suitable for multi-GPU training
- Evaluation tools for common video datasets and metrics
- Modular layers that drop into existing ConvNet architectures (see the usage sketch after this list)
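As a hedged illustration of the drop-in use case, the snippet below appends the `NonLocalBlock3D` sketch from above to one stage of a generic 3D backbone. The stage layers, channel widths, and clip shape are assumptions chosen for the example, not the repo's actual configuration.

```python
# Hypothetical drop-in: add a non-local block after one stage of an existing
# 3D ConvNet. Reuses the NonLocalBlock3D sketch defined above.
import torch
import torch.nn as nn

backbone_stage = nn.Sequential(        # stand-in for one stage of a 3D ResNet
    nn.Conv3d(256, 512, kernel_size=3, padding=1),
    nn.BatchNorm3d(512),
    nn.ReLU(inplace=True),
)
stage_with_nl = nn.Sequential(backbone_stage, NonLocalBlock3D(512))

clip = torch.randn(2, 256, 8, 14, 14)  # dummy (N, C, T, H, W) clip features
out = stage_with_nl(clip)              # same shape as the stage output: (2, 512, 8, 14, 14)
```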