JEPA
PyTorch code and models for V-JEPA self-supervised learning from video
...A context encoder ingests visible regions and predicts target embeddings for masked regions produced by a separate target encoder, avoiding low-level reconstruction losses that can overfit to texture. This makes learning focus on semantics and structure, yielding features that transfer well with simple linear probes and minimal fine-tuning. The repository provides training recipes, data pipelines, and evaluation utilities for image JEPA variants and often includes ablations that illuminate which masking and architectural choices matter. ...