Training JEPA architectures requires carefully balancing multiple objectives to prevent the model from finding trivial solutions that minimize loss without learning useful representations. This challenge is fundamental to self-supervised learning.
1. Prediction Accuracy: The primary objective is to minimize the distance between predicted and target representations in latent space. For context $x$ and target $y$, with predictor $P_\phi$ and encoders $E_\theta$ (context) and $\bar{E}_\theta$ (target):

$$\mathcal{L}_{\text{pred}} = \left\| P_\phi\big(E_\theta(x)\big) - \bar{E}_\theta(y) \right\|_2^2$$
2. Informative Representations: Without constraints, the encoders could collapse to trivial solutions such as outputting a constant vector for every input, which achieves zero prediction error but learns nothing. Regularization techniques like variance preservation force representations to use their full capacity:

$$\mathcal{L}_{\text{var}} = \frac{1}{d}\sum_{j=1}^{d} \max\!\Big(0,\; \gamma - \sqrt{\operatorname{Var}(s_j) + \epsilon}\Big)$$

where $s_j$ is the $j$-th dimension of the representations over a batch, $\gamma$ is a target standard deviation, and $\epsilon$ is a small constant for numerical stability.
3. Invariance to Nuisance Factors: The model should learn representations that are invariant to irrelevant transformations (e.g., lighting changes, camera angles) while preserving semantic content. This is typically achieved through data augmentation and contrastive objectives, though contrastive learning, which prevents collapse by pulling positive pairs together and pushing negative pairs apart, has its own limitations, such as requiring large numbers of negative examples.
4. Preventing Mode Collapse: Latent variables in JEPA enable a single model to represent multiple plausible futures, but the model might learn to ignore them and always predict the mean outcome. A regularization term encourages the model to utilize the full latent space:

$$\mathcal{L}_{\text{ent}} = -\,H\big(q(z \mid x)\big)$$

where $H\big(q(z \mid x)\big)$ is the entropy of the latent distribution. (The loss terms from points 1, 2, and 4 are sketched in code after this list.)
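The sketch below (PyTorch) shows one way the loss terms from points 1, 2, and 4 might be written. The function names and hyperparameters are illustrative, and the diagonal-Gaussian parameterization of the latent is an assumption rather than a fixed part of JEPA:

```python
import math
import torch
import torch.nn.functional as F

def prediction_loss(pred, target):
    # Point 1: squared L2 distance between predicted and target representations.
    # The target comes from the EMA encoder and is detached, so gradients flow
    # only through the context encoder and predictor.
    return F.mse_loss(pred, target.detach())

def variance_loss(reps, gamma=1.0, eps=1e-4):
    # Point 2: hinge on the per-dimension standard deviation across the batch,
    # penalizing dimensions whose std falls below the threshold gamma.
    std = torch.sqrt(reps.var(dim=0) + eps)
    return torch.relu(gamma - std).mean()

def latent_entropy_penalty(z_logvar):
    # Point 4: negative entropy of a diagonal-Gaussian latent q(z|x); minimizing
    # it maximizes entropy and discourages the predictor from ignoring z.
    # Per-dimension entropy of N(mu, sigma^2): 0.5 * (log(2*pi*e) + log sigma^2).
    entropy = 0.5 * (math.log(2 * math.pi * math.e) + z_logvar).sum(dim=-1)
    return -entropy.mean()
```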
The final training objective combines these terms with weighting coefficients:

$$\mathcal{L} = \mathcal{L}_{\text{pred}} + \lambda_{\text{var}}\,\mathcal{L}_{\text{var}} + \lambda_{\text{ent}}\,\mathcal{L}_{\text{ent}}$$

where the $\lambda$ coefficients balance the regularizers against the prediction term.
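As a rough illustration, a single training step might combine these terms as follows, reusing the helper functions sketched above. The module interfaces (a `latent_head` producing Gaussian parameters, a `predictor` taking the context representation and a sampled latent) and the weights `lam_var` and `lam_ent` are assumptions for the sketch, not a published recipe:

```python
def jepa_training_step(context, target, context_encoder, target_encoder,
                       predictor, latent_head, optimizer,
                       lam_var=25.0, lam_ent=0.1):
    # Encode the context with the online encoder and the target with the
    # EMA encoder; no gradients flow through the target branch.
    s_x = context_encoder(context)
    with torch.no_grad():
        s_y = target_encoder(target)

    # Sample a latent z ~ q(z|x), assumed here to be a diagonal Gaussian.
    z_mu, z_logvar = latent_head(s_x).chunk(2, dim=-1)
    z = z_mu + torch.randn_like(z_mu) * torch.exp(0.5 * z_logvar)

    # Predict the target representation from the context representation and z.
    pred = predictor(s_x, z)

    loss = (prediction_loss(pred, s_y)
            + lam_var * variance_loss(s_x)
            + lam_ent * latent_entropy_penalty(z_logvar))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.detach()
```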
The use of an exponential moving average (EMA) for the target encoder provides additional stability, preventing the oscillations that can occur when both encoders update simultaneously (a minimal sketch of this update follows below). Finding the right balance among these objectives remains an active area of research in world modeling.
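A minimal sketch of the EMA update, assuming both encoders share the same architecture; the momentum value is illustrative:

```python
import torch

@torch.no_grad()
def update_target_encoder(context_encoder, target_encoder, tau=0.996):
    # EMA update: the target encoder slowly tracks the online (context) encoder.
    # tau close to 1 means slow tracking; 0.996 is an illustrative value.
    for online_p, target_p in zip(context_encoder.parameters(),
                                  target_encoder.parameters()):
        target_p.mul_(tau).add_(online_p, alpha=1.0 - tau)
```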