JEPA decomposes world modeling into three specialized neural networks that work together to learn representations through prediction. This architectural separation lets the model learn robust features while avoiding representational collapse, where every input maps to the same trivial encoding.

The context encoder processes the visible parts of the input (e.g., unmasked image patches or past video frames) into a latent representation. This encoder must extract meaningful features from partial observations that enable prediction of missing information.
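As a concrete illustration, here is a minimal sketch of encoding only the visible patches, written in PyTorch. The class name `ContextEncoder`, the dimensions, and the every-other-patch visibility mask are all hypothetical stand-ins, not the actual JEPA architecture:

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Toy stand-in for a real patch encoder (e.g., a ViT backbone)."""
    def __init__(self, patch_dim=768, latent_dim=256):
        super().__init__()
        self.proj = nn.Linear(patch_dim, latent_dim)

    def forward(self, patches):
        # patches: (batch, num_visible_patches, patch_dim)
        return self.proj(patches)

patches = torch.randn(8, 196, 768)      # all patch embeddings of one batch
visible_idx = torch.arange(0, 196, 2)   # hypothetical mask: every other patch
context_repr = ContextEncoder()(patches[:, visible_idx])  # (8, 98, 256)
```

The key point is that the encoder never sees the masked patches; it must produce features from the visible subset alone.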

The target encoder processes the masked or future portions of the input separately. Critically, this encoder's weights are an exponential moving average (EMA) of the context encoder's parameters, which yields stable, slowly evolving learning targets; see the sketch below.
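A minimal sketch of the standard EMA update follows. The momentum value `tau=0.996` is an assumed constant; real implementations often anneal it over training:

```python
import torch

@torch.no_grad()
def ema_update(target_encoder, context_encoder, tau=0.996):
    """theta_target <- tau * theta_target + (1 - tau) * theta_context."""
    for t_p, c_p in zip(target_encoder.parameters(),
                        context_encoder.parameters()):
        t_p.mul_(tau).add_(c_p, alpha=1.0 - tau)
```

Because the target encoder receives no gradients and only drifts slowly toward the context encoder, its outputs make stable regression targets, which is one of the mechanisms that discourages collapse.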

The predictor network takes the context encoder's output and attempts to predict the target encoder's representations. Predicting abstract representations rather than raw pixels avoids spending capacity on irrelevant low-level detail: the predictor models relationships in latent space instead of reconstructing every pixel.
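A hedged sketch of this step, continuing the assumptions above: the MLP predictor and the mean-squared-error loss are illustrative simplifications (actual JEPA variants use a transformer predictor conditioned on target positions, and loss choices differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim = 256
predictor = nn.Sequential(            # hypothetical lightweight predictor
    nn.Linear(latent_dim, 512),
    nn.GELU(),
    nn.Linear(512, latent_dim),
)

def jepa_loss(context_repr, target_repr):
    """Regress the target representations from context, in latent space."""
    pred = predictor(context_repr)
    # Stop-gradient: the target branch provides labels, not gradients.
    return F.mse_loss(pred, target_repr.detach())
```

Note the `detach()`: gradients flow only through the context encoder and predictor, never through the target branch.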

This three-component design enables self-supervised learning: the training signal comes from the structure of the data itself, because the model predicts one part of the input from another, without requiring labeled examples.
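Tying the pieces together, here is a sketch of one training step, reusing the hypothetical `ContextEncoder`, `predictor`, `jepa_loss`, and `ema_update` from the sketches above. The mean-pooling used to align shapes is a simplification of this sketch, not part of JEPA:

```python
import copy
import torch

context_encoder = ContextEncoder()               # from the sketch above
target_encoder = copy.deepcopy(context_encoder)  # EMA copy, never backprops
for p in target_encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.AdamW(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)

def train_step(patches, visible_idx, masked_idx):
    context_repr = context_encoder(patches[:, visible_idx])
    with torch.no_grad():                        # targets carry no gradient
        target_repr = target_encoder(patches[:, masked_idx])
    # Simplification: mean-pool so both sides match in shape.
    loss = jepa_loss(context_repr.mean(dim=1), target_repr.mean(dim=1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(target_encoder, context_encoder)  # slow-moving targets
    return loss.item()
```

No labels appear anywhere in the loop: the supervision signal is manufactured entirely by splitting each input into a visible context and a hidden target.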