Linear probing is an evaluation methodology that tests whether representations learned through self-supervised learning have organized semantic information in a geometrically structured way. It serves as a diagnostic tool for assessing representation quality in models like I-JEPA.
The procedure is straightforward: freeze the pre-trained encoder and train only a single linear layer on top for a downstream task, y = W f(x) + b, where f(x) is the frozen encoder output and only W and b are learned. If this simple linear classifier achieves high accuracy, the representations must be linearly separable, meaning different semantic categories occupy distinct regions in the embedding space.
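A minimal sketch of this recipe in PyTorch follows. The `encoder`, `feat_dim`, and training hyperparameters are illustrative assumptions, not I-JEPA's exact evaluation protocol; the encoder is assumed to return one pooled feature vector per image (I-JEPA's patch tokens would first be average-pooled).

```python
# Minimal linear-probing sketch in PyTorch. Names, dimensions, and
# hyperparameters are illustrative assumptions, not the official recipe.
import torch
import torch.nn as nn

def linear_probe(encoder, train_loader, feat_dim, num_classes,
                 epochs=10, device="cuda"):
    encoder.eval()                                   # freeze dropout/norm statistics
    for p in encoder.parameters():
        p.requires_grad = False                      # only W and b below are trained

    head = nn.Linear(feat_dim, num_classes).to(device)  # computes W f(x) + b
    opt = torch.optim.Adam(head.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = encoder(images)              # frozen f(x); assumed shape
                                                     # (batch, feat_dim), e.g. after
                                                     # average-pooling patch tokens
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```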
Why Linear Probing Matters:
Linear separability indicates that masked prediction successfully learned semantic structure rather than merely memorizing surface statistics. If similar objects cluster together in embedding space while different objects are well-separated, downstream tasks become dramatically simpler.
For I-JEPA representations, high linear probing accuracy demonstrates that (see the verification sketch after this list):
- The model learned to encode semantic content (object categories, scene types) rather than just low-level features (edges, textures)
- Patch representations carry class-discriminative information
- The representation space is well-organized, with smooth manifolds for each category
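These properties can be checked empirically without a full training loop: extract frozen features once, then fit a linear classifier offline. The scikit-learn sketch below assumes `features` and `labels` arrays already extracted from the frozen encoder; the helper name is hypothetical.

```python
# Offline separability check: fit a linear classifier on frozen features.
# `features` (N, feat_dim) and `labels` (N,) are assumed to come from the
# frozen encoder, e.g. average-pooled patch embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.2, stratify=labels, random_state=0
    )
    clf = LogisticRegression(max_iter=1000)  # the linear layer: W and b
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)             # high accuracy => linearly separable
```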
Comparison to Fine-tuning:
Full fine-tuning allows updating all encoder parameters, potentially achieving higher accuracy but at the risk of overfitting and losing generality. Linear probing provides a more stringent test: if representations are truly high-quality, minimal additional computation should suffice.
The gap between linear probing and fine-tuning accuracy is diagnostic (a small helper encoding these thresholds follows the list):
- Small gap (< 2-3%): Representations are excellent, encoder learned proper abstractions
- Large gap (> 10%): Representations are suboptimal, encoder needs significant adaptation
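As a sketch, this rule of thumb can be written down directly. The thresholds are the percentages quoted above, the middle band is an interpolation between the two stated regimes, and the function name is hypothetical.

```python
# Hypothetical helper encoding the rule-of-thumb gap diagnostic above.
# Accuracies are given as fractions in [0, 1]; the gap is reported in
# percentage points. Thresholds mirror the text, not canonical values.
def probe_gap_diagnostic(probe_acc: float, finetune_acc: float) -> str:
    gap = (finetune_acc - probe_acc) * 100  # gap in percentage points
    if gap <= 3:
        return "excellent: encoder learned proper abstractions"
    if gap >= 10:
        return "suboptimal: encoder needs significant adaptation"
    return "intermediate: between the two stated regimes"
```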
Connection to JEPA Architecture:
Linear separability emerges naturally from JEPA training, which requires balancing four objectives to prevent trivial solutions: the variance preservation objective ensures representations use the full embedding space rather than collapsing, while the prediction objective encourages semantic structure. Additionally, because predicting abstract representations reduces computational waste compared to pixel-level prediction, the model focuses on high-level features that are naturally more linearly separable than pixel-level patterns.
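As one concrete illustration of the variance preservation objective, here is a minimal VICReg-style variance penalty. This is an assumption about the form of the term, not necessarily the exact loss any particular JEPA implementation uses.

```python
# Sketch of a variance-preservation penalty (VICReg-style; the exact term
# in any given JEPA variant may differ). It pushes each embedding
# dimension's batch standard deviation toward a floor of 1, discouraging
# collapse onto a low-dimensional subspace.
import torch
import torch.nn.functional as F

def variance_loss(z: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # z: (batch, dim) embeddings
    std = torch.sqrt(z.var(dim=0) + eps)   # per-dimension std over the batch
    return F.relu(1.0 - std).mean()        # penalize dimensions with std < 1
```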
Linear probing thus serves as both an evaluation tool and a design signal—architectures and training procedures that produce linearly separable representations have successfully distilled semantic knowledge into a computationally efficient form for downstream applications.