DeiT (Data-efficient Image Transformers) shows that Vision Transformers can be trained competitively on ImageNet-1k without external data by using strong training recipes and knowledge distillation. Its key idea is a specialized distillation strategy—including a learnable “distillation token”—that lets a transformer learn effectively from a CNN or transformer teacher on modest-scale datasets. The project provides compact ViT variants (Tiny/Small/Base) that achieve excellent accuracy–throughput trade-offs, making transformers practical beyond massive pretraining regimes. Training involves carefully tuned augmentations, regularization, and optimization schedules to stabilize learning and improve sample efficiency. The repo offers pretrained checkpoints, reference scripts, and ablation studies that clarify which ingredients matter most for data-efficient ViT training.
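To make the distillation idea concrete, here is a minimal sketch in plain PyTorch (class and layer names are hypothetical, and the generic `nn.TransformerEncoder` stands in for the actual DeiT blocks): a learnable distillation token is appended next to the class token, gets its own prediction head, and is supervised with the teacher's hard predictions, while the class token is supervised with the ground-truth labels. This mirrors the hard-distillation variant described in the paper, not the repository's exact model code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistilledTokenSketch(nn.Module):
    """Sketch only: class + distillation tokens prepended to patch embeddings."""
    def __init__(self, embed_dim=192, num_patches=196, num_classes=1000, depth=2, num_heads=3):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.dist_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learnable distillation token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 2, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)  # stand-in for DeiT blocks
        self.head = nn.Linear(embed_dim, num_classes)       # classification head (class token)
        self.head_dist = nn.Linear(embed_dim, num_classes)  # distillation head (distillation token)

    def forward(self, patch_tokens):
        # patch_tokens: (B, num_patches, embed_dim), e.g. output of a patch-embedding layer
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        dist = self.dist_token.expand(B, -1, -1)
        x = torch.cat([cls, dist, patch_tokens], dim=1) + self.pos_embed
        x = self.blocks(x)
        return self.head(x[:, 0]), self.head_dist(x[:, 1])


def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    """Hard distillation: average CE on true labels and CE on the teacher's hard predictions."""
    ce_true = F.cross_entropy(cls_logits, labels)
    ce_teacher = F.cross_entropy(dist_logits, teacher_logits.argmax(dim=-1))
    return 0.5 * ce_true + 0.5 * ce_teacher
```

At inference time the two heads are complementary: the paper reports averaging the predictions of the class and distillation heads.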
## Features
- Data-efficient ViT training that works on ImageNet-1k from scratch
- Knowledge distillation with a dedicated distillation token
- Compact model zoo (Tiny/Small/Base) with strong accuracy–speed balance
- Clear training recipes with augmentations and regularization schedules (see the augmentation sketch after this list)
- Pretrained checkpoints and reproducible reference scripts (a loading example follows below)
- Ablations and guidelines to adapt DeiT to new datasets and tasks
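As a rough illustration of the training recipe, the snippet below sets up a DeiT-style augmentation pipeline with timm's `create_transform` and `Mixup`. The hyperparameter values (RandAugment policy, random erasing, Mixup/CutMix strengths, label smoothing) follow the recipe reported in the paper; the authoritative defaults live in the repository's training arguments, so treat these as a starting point.

```python
from timm.data import Mixup, create_transform

# Training-time transform: RandAugment + color jitter + random erasing.
train_transform = create_transform(
    input_size=224,
    is_training=True,
    color_jitter=0.4,
    auto_augment='rand-m9-mstd0.5-inc1',  # RandAugment policy string
    interpolation='bicubic',
    re_prob=0.25,                         # random erasing probability
    re_mode='pixel',
    re_count=1,
)

# Batch-level Mixup/CutMix with label smoothing.
mixup_fn = Mixup(
    mixup_alpha=0.8,
    cutmix_alpha=1.0,
    label_smoothing=0.1,
    num_classes=1000,
)
# In the training loop: images, targets = mixup_fn(images, targets)
# (targets become soft labels, so pair this with a soft-target cross-entropy loss).
```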
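Pretrained checkpoints can be loaded directly through torch.hub. The entry-point name used below (`deit_tiny_patch16_224`) is one of the models exposed by the repository's `hubconf.py`; check that file for the full list of available variants.

```python
import torch

# Load a pretrained DeiT-Tiny checkpoint from the repository's torch.hub entry points.
model = torch.hub.load('facebookresearch/deit:main', 'deit_tiny_patch16_224', pretrained=True)
model.eval()

# Dummy forward pass with an ImageNet-sized input: (batch, 3, 224, 224) -> (batch, 1000) logits.
with torch.no_grad():
    logits = model(torch.randn(1, 3, 224, 224))
print(logits.shape)
```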