JEPA
PyTorch code and models for V-JEPA self-supervised learning from video
JEPA (Joint-Embedding Predictive Architecture) captures the idea of predicting missing high-level representations rather than reconstructing pixels, aiming for robust, scalable self-supervised learning. A context encoder ingests visible regions and predicts target embeddings for masked regions produced by a separate target encoder, avoiding low-level reconstruction losses that can overfit to texture. This makes learning focus on semantics and structure, yielding features that transfer well with simple linear probes and minimal fine-tuning. The repository provides training recipes, data pipelines, and evaluation utilities for image JEPA variants and often includes ablations that illuminate which masking and architectural choices matter. ...