I-JEPA
Official codebase for I-JEPA
...A context encoder sees only the visible regions of an image and predicts the embeddings of the masked regions. The target embeddings are produced by a separate, slowly updated target encoder. This setup focuses learning on semantics instead of texture. The objective sidesteps generative pixel losses and avoids heavy negative sampling, and it produces features that transfer well under linear probing and with minimal fine-tuning. The design scales naturally with Vision Transformer backbones, accommodates flexible masking strategies, and trains stably at large batch sizes. I-JEPA’s predictions are made in embedding space, which is computationally efficient and better aligned with downstream discrimination tasks. ...
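The objective described above can be illustrated with a small, dependency-free Python sketch. It is not the code in this repository. Every name here (`encode`, `ema_update`, `embedding_loss`, `toy_step`) is hypothetical, the "encoders" are elementwise scalings standing in for Vision Transformers, and the "predictor" is just the mean of the visible embeddings. The sketch also assumes a mean squared error loss, that "slowly updated" means an exponential moving average of the context encoder's weights, and a momentum of 0.996.

```python
from typing import List, Tuple

Vector = List[float]


def encode(weights: Vector, patch: Vector) -> Vector:
    """Toy stand-in for an encoder: elementwise scaling of a patch feature vector."""
    return [w * x for w, x in zip(weights, patch)]


def ema_update(target: Vector, context: Vector, momentum: float = 0.996) -> Vector:
    """Move the target weights a small step toward the context weights (assumed EMA)."""
    return [momentum * t + (1.0 - momentum) * c for t, c in zip(target, context)]


def embedding_loss(predicted: List[Vector], targets: List[Vector]) -> float:
    """Mean squared error between predicted and target embeddings; no pixels involved."""
    total, count = 0.0, 0
    for p, t in zip(predicted, targets):
        for p_i, t_i in zip(p, t):
            total += (p_i - t_i) ** 2
            count += 1
    return total / count


def toy_step() -> Tuple[float, Vector]:
    """One illustrative step: encode visible patches, predict masked ones, update target."""
    patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    masked = {1}                      # indices hidden from the context encoder
    context_w = [0.5, 0.5]            # hypothetical context-encoder weights
    target_w = [1.0, 1.0]             # hypothetical target-encoder weights

    # The context encoder only sees the visible patches.
    visible = [encode(context_w, p) for i, p in enumerate(patches) if i not in masked]

    # Stand-in predictor: use the mean visible embedding for every masked patch.
    dim = len(context_w)
    mean_visible = [sum(v[d] for v in visible) / len(visible) for d in range(dim)]
    predicted = [mean_visible for _ in masked]

    # The target encoder provides the embeddings to be predicted (no gradient in practice).
    targets = [encode(target_w, patches[i]) for i in sorted(masked)]
    loss = embedding_loss(predicted, targets)

    # A real step would first update context_w by gradient descent on the loss;
    # this sketch skips that and only shows the slow target update.
    new_target_w = ema_update(target_w, context_w)
    return loss, new_target_w


if __name__ == "__main__":
    loss, new_target_w = toy_step()
    print("embedding-space loss:", loss)
    print("updated target weights:", new_target_w)
```

The loss compares vectors in embedding space only, so nothing in the step reconstructs pixels or draws negative samples. With momentum close to 1, the target weights change very little per step, which is what "slowly updated" is taken to mean in this sketch.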