Masked prediction is a self-supervised learning strategy where the model must predict hidden portions of input from visible portions. In I-JEPA, this forces the model to learn semantic relationships rather than merely memorizing low-level textures or colors.

The masking strategy in I-JEPA differs critically from pixel-level reconstruction approaches like MAE (Masked Autoencoders):

  • MAE: Masks patches and reconstructs exact pixels → model learns texture, color, local patterns
  • I-JEPA: Masks patches and predicts latent representations → model learns semantic content, object relationships

Because predicting abstract representations sidesteps the computational waste of pixel-level reconstruction, I-JEPA’s predictor network operates entirely in representation space:
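A minimal sketch of that contrast, with toy linear layers standing in for the actual Vision Transformer encoders and predictor (all names, shapes, and the crude pooling step are illustrative simplifications, not the reference implementation):

```python
import torch
import torch.nn as nn

# Toy stand-ins: the real I-JEPA encoders and predictor are Vision
# Transformers; linear layers keep the sketch short.
embed_dim = 256
context_encoder = nn.Linear(768, embed_dim)   # encodes visible patches
target_encoder = nn.Linear(768, embed_dim)    # EMA copy, receives no gradients
predictor = nn.Linear(embed_dim, embed_dim)   # maps context latents to targets

patches = torch.randn(8, 196, 768)            # (batch, num_patches, patch_dim)
visible_idx = torch.arange(0, 98)             # toy split: first half visible
masked_idx = torch.arange(98, 196)            # second half masked

# MAE-style loss would live in pixel space:
#   loss = mse(decoder(context), patches[:, masked_idx])
# I-JEPA's loss lives entirely in representation space:
context = context_encoder(patches[:, visible_idx])    # visible -> latents
pred = predictor(context).mean(dim=1)                 # crude pooling for brevity
with torch.no_grad():                                 # targets are frozen
    target = target_encoder(patches[:, masked_idx]).mean(dim=1)
loss = nn.functional.mse_loss(pred, target)           # latent-space MSE
loss.backward()                                       # gradients skip the targets
```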

The masking pattern is crucial—I-JEPA uses multi-block masking where large contiguous regions are masked, forcing the model to reason about objects and scenes rather than just filling in edges. If you mask individual scattered patches, the model can predict missing content from immediate neighbors without understanding semantics.
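A rough sketch of the two masking styles on a 14×14 patch grid (the block sampler below is a simplified illustration, not the paper’s exact procedure, which also controls block scale and aspect ratio):

```python
import random

def sample_block_mask(grid=14, min_side=4, max_side=8):
    """Sample one large contiguous rectangular block of patch indices."""
    h = random.randint(min_side, max_side)
    w = random.randint(min_side, max_side)
    top = random.randint(0, grid - h)
    left = random.randint(0, grid - w)
    return {r * grid + c
            for r in range(top, top + h)
            for c in range(left, left + w)}

def sample_scattered_mask(grid=14, num=48):
    """Scattered masking: isolated patches, easy to infill from neighbors."""
    return set(random.sample(range(grid * grid), num))

# Multi-block masking: a union of several large contiguous blocks,
# which cannot be infilled without reasoning about objects and scenes.
masked = set().union(*(sample_block_mask() for _ in range(4)))
visible = set(range(14 * 14)) - masked
```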

This approach leverages how Vision Transformers process images, treating patches as a sequence of tokens: the self-attention mechanism in the context encoder must identify which visible patches are semantically relevant for predicting each masked patch. For example, to predict a masked dog’s head, the model learns to attend to visible body parts rather than background patches.
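As a sketch of that token view, using a generic `nn.TransformerEncoder` as a stand-in for the ViT context encoder (dimensions and the random visible set are illustrative assumptions):

```python
import torch
import torch.nn as nn

# Stand-in context encoder: a small TransformerEncoder that runs
# self-attention over *visible* patch tokens only.
patch_dim, embed_dim, grid = 768, 256, 14
patch_embed = nn.Linear(patch_dim, embed_dim)
pos_embed = nn.Parameter(torch.zeros(grid * grid, embed_dim))
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True),
    num_layers=2,
)

patches = torch.randn(1, grid * grid, patch_dim)              # 196 patch tokens
visible_idx = torch.randperm(grid * grid)[:98].sort().values  # toy visible set
tokens = patch_embed(patches[:, visible_idx]) + pos_embed[visible_idx]
context_latents = encoder(tokens)  # attention over visible tokens decides
                                   # which of them inform each masked region
```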

The semantic learning emerges because:

  1. Representation invariance: similar images should map to nearby representations even under different masking. Unlike contrastive learning, which prevents model collapse by pulling positive pairs together and pushing negative pairs apart, I-JEPA achieves this without negative examples, relying instead on the EMA target encoder described below
  2. Information bottleneck: Predicting high-level representations rather than pixels forces the model to extract semantic features that compress well
  3. Contextual inference: Large masked regions can only be predicted accurately if the model understands object semantics and scene composition

I-JEPA maintains the target encoder as an exponential moving average (EMA) of the context encoder’s weights, ensuring the prediction targets remain stable while the context encoder and predictor learn. This allows the model to develop an increasingly sophisticated understanding of semantic relationships without the instabilities that plague end-to-end training of both encoders.
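The EMA update itself is a single in-place step per parameter; a minimal sketch (the momentum value is a typical starting point, not necessarily I-JEPA’s exact schedule):

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, momentum=0.996):
    """Nudge target weights toward context weights; no gradients flow here.

    With momentum near 1, the targets drift slowly, so the predictor
    chases a stable objective rather than a rapidly moving one.
    """
    for t, c in zip(target_encoder.parameters(), context_encoder.parameters()):
        t.mul_(momentum).add_(c, alpha=1.0 - momentum)
```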

The learned representations can then be evaluated with linear probing, which tests whether they are linearly separable, verifying that they capture semantic structure useful for downstream tasks.
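A minimal linear-probe sketch with scikit-learn; the random arrays here are stand-ins for features extracted from the frozen encoder:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# A linear probe is a single linear classifier trained on frozen features:
# if it separates the downstream classes, the representations carry
# linearly accessible semantic structure.
rng = np.random.default_rng(0)
train_x, train_y = rng.normal(size=(1000, 256)), rng.integers(0, 10, size=1000)
test_x, test_y = rng.normal(size=(200, 256)), rng.integers(0, 10, size=200)

probe = LogisticRegression(max_iter=1000)  # no hidden layers: purely linear
probe.fit(train_x, train_y)
print("linear probe accuracy:", probe.score(test_x, test_y))
```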