BYOL is a type of SSL that uses two networks: an online network and a target network. The key thing is that Self-supervised learning generates training signals from data structure itself, so both networks process views sampled from the same data distribution. Unlike the approach in Contrastive learning prevents model collapse by pushing apart positive and negative examples, BYOL uses no negative pairs; it prevents model collapse through asymmetric prediction and slow target updates instead.

The online network consists of an encoder, a projector, and a predictor. The target network has the same encoder and projector architecture but no predictor, and its weights update via an EMA of the online network’s parameters, because Exponential Moving Average creates stable learning targets in self-supervised systems. Both process differently augmented views of the same input image, and the online network predicts the target’s output in latent space rather than in pixels, which is beneficial because Predicting abstract representations reduces computational waste compared to pixel-level prediction.
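
Here is a minimal sketch of that structure, assuming PyTorch; the tiny Flatten-plus-Linear encoder and the MLP sizes are placeholders I chose for illustration, standing in for the ResNet backbone and the paper’s projector/predictor:

```python
import copy
import torch
import torch.nn as nn

# Minimal sketch, not the official BYOL code: the encoder below is a stand-in
# for a ResNet backbone, and the MLP sizes are placeholder assumptions.
class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim=4096, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Online network: encoder -> projector -> predictor.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512))
projector = MLP(512)
predictor = MLP(256)

# Target network: copies of the encoder and projector only (no predictor),
# never updated by gradients, only by an EMA of the online parameters.
target_encoder = copy.deepcopy(encoder)
target_projector = copy.deepcopy(projector)
for p in list(target_encoder.parameters()) + list(target_projector.parameters()):
    p.requires_grad = False

@torch.no_grad()
def ema_update(online: nn.Module, target: nn.Module, tau: float = 0.996):
    """Target update: xi <- tau * xi + (1 - tau) * theta."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(tau).add_(p_online, alpha=1 - tau)
```

Copying the modules with `copy.deepcopy` and freezing their gradients reflects the point above: the target network is never trained directly, it only drifts slowly toward the online network through `ema_update`.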

GPT came up with this example that helped me understand it:

Consider a single image of a bouncing ball, augmented into two views: $v$ (cropped left, color-distorted) and $v'$ (flipped, blurred).

  • Forward Pass 1: Feed $v$ to the online network (parameters $\theta$): encoder $f_\theta$, projector $g_\theta$, predictor $q_\theta$. Feed $v'$ to the target network (parameters $\xi$): encoder $f_\xi$, projector $g_\xi$.
  • Loss Computation: Minimize the normalized L2 distance between the online prediction and the target projection, $\mathcal{L}_{\theta,\xi} = 2 - 2\,\frac{\langle q_\theta(z_\theta),\, z'_\xi \rangle}{\|q_\theta(z_\theta)\|_2 \,\|z'_\xi\|_2}$, forcing the online prediction to match the target’s stable representation.
  • Symmetric Pass: Swap the inputs, $v'$ to the online network and $v$ to the target network, to get $\tilde{\mathcal{L}}_{\theta,\xi}$, and average the two losses: $\mathcal{L}^{\mathrm{BYOL}}_{\theta,\xi} = \tfrac{1}{2}\big(\mathcal{L}_{\theta,\xi} + \tilde{\mathcal{L}}_{\theta,\xi}\big)$.
  • Updates: The online parameters $\theta$ are optimized via gradient descent on $\mathcal{L}^{\mathrm{BYOL}}_{\theta,\xi}$; the target parameters follow $\xi \leftarrow \tau\xi + (1-\tau)\theta$, which, based on Exponential Moving Average creates stable learning targets in self-supervised systems, ensures slow, stable evolution. The whole step is sketched in code below.
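
Continuing the sketch above (reusing the assumed names `encoder`, `projector`, `predictor`, `target_encoder`, `target_projector`, and `ema_update`), here is roughly what one training step on the two views could look like; `byol_loss` and `training_step` are names I made up for illustration, not the paper’s code:

```python
import torch
import torch.nn.functional as F

def byol_loss(p, z):
    """Squared L2 distance between the normalized vectors,
    which equals 2 - 2 * cosine similarity."""
    p = F.normalize(p, dim=-1)
    z = F.normalize(z, dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

def training_step(v1, v2, optimizer, tau=0.996):
    # Forward pass 1: v1 through the online branch, v2 through the target branch.
    p1 = predictor(projector(encoder(v1)))
    with torch.no_grad():                        # target branch gets no gradients
        t2 = target_projector(target_encoder(v2))

    # Symmetric pass: swap the views between the branches.
    p2 = predictor(projector(encoder(v2)))
    with torch.no_grad():
        t1 = target_projector(target_encoder(v1))

    # Average the two directional losses.
    loss = 0.5 * (byol_loss(p1, t2) + byol_loss(p2, t1))

    # Online parameters update by gradient descent on the loss...
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # ...target parameters update only through the slow EMA.
    ema_update(encoder, target_encoder, tau)
    ema_update(projector, target_projector, tau)
    return loss.item()
```

For a quick test, something like `optimizer = torch.optim.Adam([*encoder.parameters(), *projector.parameters(), *predictor.parameters()], lr=3e-4)` would do; the BYOL paper itself trains with LARS at large batch sizes.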