MAE (Masked Autoencoders) is a self-supervised learning framework for visual representation learning based on masked image modeling. It trains a Vision Transformer (ViT) by randomly masking a high percentage of image patches (typically 75%) and reconstructing the missing content from the remaining visible patches, which forces the model to learn semantic structure and global context without labels. The encoder processes only the visible patches, while a lightweight decoder reconstructs the full image, making pretraining computationally efficient. After pretraining, the encoder serves as a powerful backbone for downstream tasks such as image classification, segmentation, and detection, delivering strong performance after fine-tuning. The repository provides pretrained models, fine-tuning scripts, evaluation protocols, and visualization tools for inspecting reconstruction quality and learned features.
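The random masking step is the heart of the method and can be sketched in a few lines. The snippet below is a minimal illustration, not the repository's exact implementation: the function name `random_masking`, the tensor shapes, and the default `mask_ratio` are assumptions made for the example. It shuffles the patch embeddings of each sample independently and keeps roughly 25% of them as the visible set.

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """Illustrative MAE-style random masking.

    x: patch embeddings of shape (batch, num_patches, dim).
    Returns the visible patches, a binary mask over all patches,
    and the indices needed to restore the original patch order.
    """
    B, N, D = x.shape
    len_keep = int(N * (1 - mask_ratio))

    # Sample random noise per patch and keep the patches with the smallest noise.
    noise = torch.rand(B, N, device=x.device)
    ids_shuffle = torch.argsort(noise, dim=1)        # random permutation per sample
    ids_restore = torch.argsort(ids_shuffle, dim=1)  # inverse permutation

    ids_keep = ids_shuffle[:, :len_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask over the original patch order: 0 = visible, 1 = masked.
    mask = torch.ones(B, N, device=x.device)
    mask[:, :len_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)
    return x_visible, mask, ids_restore
```

The reconstruction loss is then computed only on the masked positions, which is why the binary mask and the restore indices are returned alongside the visible patches.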
Features
- Masked image modeling with random high-ratio patch masking
- Efficient pretraining via encoder-decoder separation (the encoder sees only visible patches); see the sketch after this list
- Scalable Vision Transformer backbone for downstream vision tasks
- Pretrained models and fine-tuning scripts for classification, detection, and segmentation
- Visualization tools for reconstruction and representation analysis
- Self-supervised training paradigm requiring no labeled data
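To make the encoder-decoder separation concrete, here is a hedged sketch of how visible-only encoding and full-sequence decoding can fit together. The `TinyMAE` module, the layer sizes, and the omission of positional embeddings are simplifications for illustration and are not the repository's actual architecture, which is built from standard ViT blocks; `ids_restore` is the restore index produced by a masking step like the one sketched above.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Simplified MAE-style model: encoder on visible patches only,
    lightweight decoder on the full (restored) token sequence."""

    def __init__(self, dim=256, dec_dim=128, patch_pixels=768):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True), num_layers=2)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.to_pixels = nn.Linear(dec_dim, patch_pixels)  # predict raw patch pixels

    def forward(self, x_visible, ids_restore):
        # Encode only the visible patches, so cost scales with ~25% of the tokens.
        latent = self.encoder(x_visible)
        latent = self.enc_to_dec(latent)

        # Re-insert mask tokens at the masked positions, then decode the full sequence.
        B, n_visible, D = latent.shape
        n_total = ids_restore.shape[1]
        mask_tokens = self.mask_token.expand(B, n_total - n_visible, -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).expand(-1, -1, D))
        return self.to_pixels(self.decoder(full))
```

Because the decoder is small and is discarded after pretraining, only the encoder carries over to downstream classification, detection, or segmentation heads.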