Extracts from BYOL
BYOL stands for Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning.
Cookbook for BYOL
First of all, the BYOL method is probably the first that removes the clustering step, introduces a predictor and a projector network, defines the continuous targets as the output of a momentum network, renormalizes each sample representation by its \(l_2\)-norm, and leverages only positive pairs. The predictor acts as a whitening operator that prevents collapse, and the momentum network can be applied only to the projector.
It first introduced self-distillation as a means to avoid collapse. It uses two networks along with a predictor to map the outputs of one network to the other. The network predicting the output is called the online or student network, while the network producing the target is called the target or teacher network. Each network receives a different view of the same image, formed by image transformations including random resizing, cropping, color jittering, and brightness alterations. The student network is updated throughout training using gradient descent. The teacher network is updated with an exponential moving average (EMA) of the weights of the online network. The slow updates induced by the exponential moving average create an asymmetry that is crucial to BYOL’s success.
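As a rough PyTorch sketch of this student/teacher setup (the tiny MLP sizes are placeholders, and \(\tau = 0.996\) is the paper's base decay, kept constant here for simplicity):

```python
import copy
import torch
import torch.nn as nn

# Illustrative student backbone; in BYOL this would be an encoder plus projector.
online_network = nn.Sequential(nn.Linear(784, 512), nn.ReLU(), nn.Linear(512, 128))
# The predictor sits only on top of the student/online branch.
predictor = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

# The teacher starts as a copy of the student and receives no gradients.
target_network = copy.deepcopy(online_network)
for p in target_network.parameters():
    p.requires_grad = False

@torch.no_grad()
def ema_update(online: nn.Module, target: nn.Module, tau: float = 0.996) -> None:
    """Slowly move teacher weights towards the student: xi <- tau*xi + (1-tau)*theta."""
    for p_online, p_target in zip(online.parameters(), target.parameters()):
        p_target.mul_(tau).add_((1.0 - tau) * p_online)
```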
Original Paper
Please check the original paper here: BYOL.
Abstract
- BYOL, a new approach to self-supervised learning.
- BYOL relies on two neural networks, referred to as the online network and the target network, that interact and learn from each other.
- From an augmented view of an image, we train the online network to predict the target network representation of the same image under a different augmented view.
- At the same time, we update the target network with a slow-moving average of the online network.
- (Back then), while state-of-the-art methods rely on negative pairs, BYOL achieves a new state of the art without them.
The online network is updated first; the target network's updates then come from the online network via EMA.
Even without negative pairs, BYOL is competitive.
The online network is trained to predict the target network's representation.
Introduction
- (Back then,) state-of-the-art contrastive methods are trained by reducing the distance between representations of different augmented views of the same image (‘positive pairs’), and increasing the distance between representations of augmented views from different images (‘negative pairs’).
- These methods need careful treatment of negative pairs by either relying on large batch sizes, memory banks or customized mining strategies to retrieve the negative pairs.
- In addition, their performance critically depends on the choice of image augmentations.
BYOL overcomes these shortcomings of previous methods: it needs no negative pairs (and hence no careful treatment of them), and it is less sensitive to the choice of hyper-parameters and image augmentations.
- BYOL achieves higher performance than state-of-the-art contrastive methods without using negative pairs.
- It iteratively bootstraps the outputs of a network to serve as targets for an enhanced representation.
- BYOL is more robust to the choice of image augmentations than contrastive methods; we suspect that not relying on negative pairs is one of the leading reasons for its improved robustness.
- We propose to directly bootstrap the representations. ==> Different from previous methods.
- In particular, BYOL uses two neural networks, referred to as online and target networks, that interact and learn from each other.
- Starting from an augmented view of an image, BYOL trains its online network to predict the target network’s representation of another augmented view of the same image.
- While this objective admits collapsed solutions, e.g., outputting the same vector for all images, we empirically show that BYOL does not converge to such solutions.
- We hypothesize that the combination of (i) the addition of a predictor to the online network and (ii) the use of a slow-moving average of the online parameters as the target network encourages encoding more and more information within the online projection and avoids collapsed solutions.
- BYOL achieves state-of-the-art results under the linear evaluation protocol on ImageNet without using negative pairs.
- Our learned representation outperforms the state of the art on semi-supervised and transfer benchmarks.
- BYOL is more resilient to changes in the batch size and in the set of image augmentations compared to its contrastive counterparts.
- In particular, BYOL suffers a much smaller performance drop than SimCLR, a strong contrastive baseline, when only using random crops as image augmentations.
Related work
- Most unsupervised methods for representation learning can be categorized as either generative or discriminative.
- Generative approaches to representation learning build a distribution over data and latent embedding and use the learned embeddings as image representations.
- Many of these approaches rely either on auto-encoding of images or on adversarial learning, jointly modelling data and representation.
- Generative methods typically operate directly in pixel space.
- This however is computationally expensive, and the high level of detail required for image generation may not be necessary for representation learning.
The intro and drawbacks of generative methods.
- Among discriminative methods, contrastive methods currently achieve state-of-the-art performance in self-supervised learning.
- Contrastive methods often require comparing each example with many other examples to work well, prompting the question of whether using negative pairs is necessary.
Contrastive methods belong to the discriminative family and are currently the strongest, while prompting the question of whether negative pairs are necessary.
- DeepCluster partially answers this question.
- While avoiding the use of negative pairs, this requires a costly clustering phase and specific precautions to avoid collapsing to trivial solutions.
Negative pairs might not be necessary under suitable settings.
- Some self-supervised methods are not contrastive but rely on using auxiliary handcrafted prediction tasks to learn their representation.
- Yet, even with suitable architectures, these methods are being outperformed by contrastive methods.
Even these other (non-contrastive, handcrafted-task) methods are still outperformed by contrastive methods.
- Our approach has some similarities with Predictions of Bootstrapped Latents (PBL).
- PBL jointly trains the agent’s history representation and an encoding of future observations.
- Unlike PBL, BYOL uses a slow-moving average of its representation to provide its targets, and does not require a second network.
BYOL resembles PBL but, unlike PBL, uses a slow-moving average of its representation to provide targets.
- The idea of using a slow-moving average target network to produce stable targets for the online network was inspired by deep RL.
- Target networks stabilize the bootstrapping updates provided by the Bellman equation, making them appealing to stabilise the bootstrap mechanism in BYOL.
- While most RL methods use fixed target networks, BYOL uses a weighted moving average of previous networks in order to provide smoother changes in the target representation.
The origin of the moving weights.
- In the semi-supervised …. Among these methods, mean teacher also uses a slow-moving average network, called teacher, to produce targets for an online network, called student.
- In contrast, BYOL introduces an additional predictor on top of the online network, which prevents collapse.
The use of predictor.
- Finally, in self-supervised learning, MoCo uses a slow-moving average network (momentum encoder) to maintain consistent representations of negative pairs drawn from a memory bank.
- Instead, BYOL uses a moving average network to produce prediction targets as a means of stabilizing the bootstrap step.
Method
- Many such approaches cast the prediction problem directly in representation space: the representation of an augmented view of an image should be predictive of the representation of another augmented view of the same image.
- However, predicting directly in representation space can lead to collapsed representations.
- Contrastive methods circumvent this problem by reformulating the prediction problem into one of discrimination. … and the representations of augmented views of different images.
- In this work, we thus tasked ourselves to find out whether these negative examples are indispensable to prevent collapsing while preserving high performance.
Cross-view prediction: the representation of one augmented view of an image can predict the representation of another augmented view of the same image, but the price is that it can always predict itself, leading to trivial solutions (collapse). So our method tests whether negative pairs are really indispensable while preserving high performance.
- To prevent collapse, a straightforward solution is to use a fixed randomly initialized network to produce the targets for our predictions.
- While avoiding collapse, it empirically does not result in very good representations.
- It is interesting to note that the representation obtained using this procedure can already be much better than the initial fixed representation.
- This experimental finding is the core motivation for BYOL: from a given representation, referred to as target, we can train a new, potentially enhanced representation, referred to as online, by predicting the target representation.
- We can expect to build a sequence of representations of increasing quality by iterating this procedure, using subsequent online networks as new target networks for further training.
- In practice, BYOL generalizes this bootstrapping procedure by iteratively refining its representation, but using a slowly moving exponential average of the online network as the target network instead of fixed checkpoints.
Empirically, a fixed target network is not enough, but it already works reasonably well, so the new method is built on this idea: from a given representation (the target), train a new, enhanced representation (the online one) to predict the target's representation.
Description of BYOL
- BYOL’s goal is to learn a representation \(y_\theta\) which can then be used for downstream tasks. ==> representation.
- BYOL uses two neural networks to learn: the online and target networks.
- The online network is defined by a set of weights \(\theta\) and is comprised of three stages: an encoder \(f_\theta\), a projector \(g_\theta\) and a predictor \(q_\theta\).
- The target network has the same architecture as the online network, but uses a different set of weights \(\xi\).
- The target network provides the regression targets to train the online network, and its parameters \(\xi\) are an exponential moving average of the online parameters \(\theta\).
- Given a target decay rate \(\tau\in [0, 1]\), after each training step we perform the following update,
\[\xi \leftarrow \tau\xi + (1-\tau)\theta.\]
Online: encoder \(f_\theta\) + projector \(g_\theta\) + predictor \(q_\theta\); the target network has the same architecture but different parameters, which come from the equation above based on the online parameters.
- Given a set of images \(D\), an image \(x\sim D\) sampled uniformly from \(D\), and two distributions of image augmentations \(\mathcal{T}\) and \(\mathcal{T'}\), BYOL produces two augmented views \(v \triangleq t(x)\) and \(v'\triangleq t'(x)\) from \(x\) by applying respectively image augmentations \(t \sim \mathcal{T}\) and \(t' \sim \mathcal{T'}\). ==> Two augmentation pipelines with the same set of operations are sampled; since their sampled parameters almost certainly differ, the two pipelines produce two different views.
- From the first augmented view \(v\), the online network outputs a representation \(y_\theta \triangleq f_\theta(v)\) and a projection \(z_\theta\triangleq g_\theta(y_\theta)\).
- The target network outputs \(y'_\xi \triangleq f_\xi(v')\) and the target projection \(z'_\xi \triangleq g_\xi(y'_\xi)\) from the second augmented view \(v'\).
- We then output a prediction \(q_\theta(z_\theta)\) of \(z'_\xi\) and \(l_2\)-normalize both \(q_\theta(z_\theta)\) and \(z'_\xi\) to
\[\overline{q_\theta}(z_\theta) \triangleq q_\theta(z_\theta)/\vert\vert q_\theta(z_\theta)\vert\vert_2\]and
\[\overline{z'_\xi} \triangleq z'_\xi/\vert\vert z'_\xi\vert\vert_2.\]
- Note that this predictor is only applied to the online branch, making the architecture asymmetric between the online and target pipelines.
- Finally we define the following mean squared error between the normalized predictions and target projections,
\[\mathcal{L}_{\theta, \xi} \triangleq \vert\vert\overline{q_\theta}(z_\theta) - \overline{z'_\xi}\vert\vert_2^2 = 2 - 2\cdot \frac{\langle q_\theta(z_\theta), z'_\xi\rangle}{\vert\vert q_\theta(z_\theta)\vert\vert_2 \cdot \vert\vert z'_\xi\vert\vert_2}.\]
- We symmetrize the loss above by separately feeding \(v'\) to the online network and \(v\) to the target network to compute the other term \(\widetilde{\mathcal{L}}_{\theta, \xi}\). ==> Symmetrically swap the views fed to the two networks, then add the two losses to form the final loss.
- At each training step, we perform a stochastic optimization step to minimize
\[\mathcal{L}_{\theta, \xi}^{\text{BYOL}} = \mathcal{L}_{\theta, \xi} + \widetilde{\mathcal{L}}_{\theta, \xi}\]with respect to \(\theta\) only, but not \(\xi\):
\[\theta \leftarrow \text{optimizer}(\theta, \nabla_\theta\mathcal{L}_{\theta, \xi}^{\text{BYOL}}, \eta),\\ \xi \leftarrow \tau\xi + (1-\tau)\theta,\]where optimizer is an optimizer and \(\eta\) is a learning rate (see the training-step sketch after this list).
- At the end of training, we only keep the encoder \(f_\theta\).
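Putting the pieces above together, here is a hedged PyTorch sketch of a single BYOL training step. The encoder/projector/predictor modules, the augmentation callables `t` and `t_prime`, and the optimizer are assumed to be supplied by the caller, and the constant \(\tau\) stands in for the paper's scheduled decay:

```python
import torch
import torch.nn.functional as F

def regression_loss(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """||p/|p| - z/|z|||^2 = 2 - 2 * cosine_similarity(p, z), per sample."""
    p = F.normalize(p, dim=-1)
    z = F.normalize(z, dim=-1)
    return 2 - 2 * (p * z).sum(dim=-1)

def byol_training_step(x, t, t_prime, f_online, g_online, q_online,
                       f_target, g_target, optimizer, tau=0.996):
    v, v_prime = t(x), t_prime(x)                      # two augmented views

    # Online branch: representation -> projection -> prediction.
    p1 = q_online(g_online(f_online(v)))
    p2 = q_online(g_online(f_online(v_prime)))

    # Target branch: no gradients flow through it.
    with torch.no_grad():
        z1 = g_target(f_target(v))
        z2 = g_target(f_target(v_prime))

    # Symmetrized loss: predict the v' targets from v and vice versa.
    loss = (regression_loss(p1, z2) + regression_loss(p2, z1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # updates theta only

    # EMA update of the target parameters: xi <- tau*xi + (1-tau)*theta.
    with torch.no_grad():
        for po, pt in zip(list(f_online.parameters()) + list(g_online.parameters()),
                          list(f_target.parameters()) + list(g_target.parameters())):
            pt.mul_(tau).add_((1.0 - tau) * po)
    return loss.detach()
```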
Implementation details
- Image augmentations:
- BYOL uses the same set of image augmentations as in SimCLR.
- First a random patch of the image is selected and resized to \(224\times 224\) with a random horizontal flip, followed by a color distortion consisting of a random sequence of brightness, contrast, saturation, and hue adjustments, and an optional grayscale conversion. Finally, Gaussian blur and solarization are applied to the patches (a torchvision sketch follows after this list).
- Architecture
- ResNet-50 is used as the base parametric encoder for both \(f_\theta\) and \(f_\xi\).
- The representation \(y\) corresponds to the output of the final average pooling layer, which has a feature dimension of \(2048\).
- Contrary to SimCLR, the output of the projection MLP \(g_\theta\) is not batch normalized. The predictor \(q_\theta\) uses the same architecture as \(g_\theta\) (see the MLP sketch after this list).
- Optimization
- LARS optimizer with a cosine decay learning rate schedule, without restarts, over 1000 epochs, with a warm-up period of 10 epochs.
- The base learning rate is set to \(0.2\) and scaled linearly with the batch size (learning rate = \(0.2 \times \text{BatchSize}/256\)); see the schedule sketch after this list.
- A global weight decay parameter of \(1.5\cdot 10^{-6}\) while excluding the biases and batch normalization parameters from both LARS adaptation and weight decay.
- A batch size of 4096
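A hedged torchvision sketch of an augmentation pipeline along these lines; the probabilities and parameter ranges here are taken from the usual SimCLR/BYOL recipes, but since this note does not list them, treat them as assumptions:

```python
import torchvision.transforms as T

# Illustrative BYOL-style view augmentation; magnitudes/probabilities are placeholders.
def make_view_transform(blur_p=1.0, solarize_p=0.0):
    return T.Compose([
        T.RandomResizedCrop(224),
        T.RandomHorizontalFlip(),
        T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.RandomApply([T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0))], p=blur_p),
        T.RandomSolarize(threshold=128, p=solarize_p),
        T.ToTensor(),
    ])

# The two view distributions T and T' differ only in the blur/solarization probabilities.
t, t_prime = make_view_transform(1.0, 0.0), make_view_transform(0.1, 0.2)
```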
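A hedged sketch of the encoder, projector, and predictor. The 4096-dimensional hidden layer and 256-dimensional output follow the sizes reported in the paper; if reproducing from this note alone, treat them as assumptions:

```python
import torch.nn as nn
import torchvision

def mlp(in_dim=2048, hidden_dim=4096, out_dim=256):
    # Two-layer MLP: linear -> BN -> ReLU -> linear; the final output is not batch-normalized.
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )

# Encoder: ResNet-50 trunk up to the final average pooling (2048-d representation y).
encoder = torchvision.models.resnet50()
encoder.fc = nn.Identity()        # drop the classification head

projector = mlp()                 # g_theta
predictor = mlp(in_dim=256)       # q_theta, same architecture as the projector
```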
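A hedged sketch of the learning-rate rule. The linear batch-size scaling, 10-epoch warm-up, and cosine decay are as stated above; the helper below is illustrative arithmetic only, and a LARS optimizer implementation is assumed to be supplied elsewhere:

```python
import math

def byol_learning_rate(step, total_steps, batch_size, base_lr=0.2, warmup_steps=0):
    """Linearly scaled peak LR with warm-up and cosine decay (no restarts)."""
    max_lr = base_lr * batch_size / 256
    if warmup_steps and step < warmup_steps:
        return max_lr * step / warmup_steps            # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * max_lr * (1 + math.cos(math.pi * progress))  # cosine decay

# Example: batch size 4096 gives a peak learning rate of 0.2 * 4096 / 256 = 3.2.
peak_lr = byol_learning_rate(step=0, total_steps=1, batch_size=4096)
```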