# Review of Mixup: Beyond Empirical Risk Minimization

Technical review of Mixup: Beyond Empirical Risk Minimization

Link to original paper: https://arxiv.org/pdf/1710.09412.pdf

Good to know terms:

(i) **Empirical Risk Minimization (ERM)** → Choosing the hypothesis that minimizes the average error over a sample (training) set, as a proxy for the error over the whole domain.

(ii) **Regularization** → A technique that reduces error by constraining the model to fit the training set appropriately, thereby avoiding overfitting.

(iii) **Vicinal Risk Minimization (VRM)** → In VRM, human knowledge is required to describe a vicinity or neighborhood around each example in the training data.

**SUMMARY**:

Large neural networks have driven breakthroughs in computer vision, speech recognition, and reinforcement learning. They are mostly trained to minimize the average error over the training data, a procedure known as ERM. However, large deep networks trained this way exhibit two major issues: memorization of the training data and sensitivity to adversarial examples. This paper introduces *mixup*, a technique designed to alleviate both concerns.

Learning theory states that the convergence of ERM is guaranteed as long as the size of the learning machine does not increase with the number of training examples. This is not the case for our typical neural networks, whose size scales linearly with the number of training examples.

This contradiction challenges the suitability of ERM for training our neural networks. First, ERM allows the network to simply memorize the training data, even in the presence of strong regularization, rather than generalize from it. Second, a neural network trained with ERM changes its predictions drastically when evaluated on examples just outside the training distribution, also known as *adversarial examples*.

The alternative, *data augmentation*, is the method of choice for training on similar but different examples, and is based on the VRM principle. In VRM, human knowledge is required to describe the vicinity around each example in the training data. For example, in image classification we can define the vicinity of an image as the set of its horizontal reflections and slight rotations. Data augmentation assumes that the examples in the vicinity of an image share the same class, and it does not model vicinity relations across examples from different classes.
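As a concrete illustration of this idea, here is a minimal NumPy sketch of drawing a virtual example from such a vicinity. The function name, the choice of horizontal reflection as the only vicinity operation, and the toy "image" are my own illustrative assumptions, not details from the paper:

```python
import numpy as np

def vicinal_sample(x, y, rng):
    """Draw one virtual example from the vicinity of (x, y).

    The vicinity here is just the set of horizontal reflections of
    the image; the label is assumed unchanged, since data
    augmentation keeps the class fixed within a vicinity."""
    if rng.random() < 0.5:
        x = np.fliplr(x)  # horizontal reflection
    return x, y

rng = np.random.default_rng(0)
img = np.arange(12, dtype=float).reshape(3, 4)  # toy 3x4 "image"
aug, label = vicinal_sample(img, 1, rng)
```

Note that the virtual example keeps the original label: the vicinity never crosses class boundaries, which is exactly the limitation *mixup* addresses.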

In layman’s terms, say an image of a cat is labeled 0 and an image of a dog is labeled 1. In *mixup*, the label space changes from the binary set {0, 1} to the continuous interval [0, 1]. The neural network no longer predicts whether the image is a cat or a dog; instead it predicts how much the image looks like a cat versus a dog. To summarize, in *mixup*, linear interpolation of feature vectors leads to linear interpolation of the associated targets.
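The interpolation described above can be sketched in a few lines of NumPy. The paper samples the mixing coefficient λ from a Beta(α, α) distribution; the function name, the one-hot target encoding, and the all-zeros/all-ones toy images below are my own illustrative choices:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Mix one pair of examples: the same linear interpolation is
    applied to the inputs and to their one-hot targets, with the
    weight lam drawn from Beta(alpha, alpha)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1 - lam) * x2
    y = lam * y1 + (1 - lam) * y2
    return x, y

# cat = class 0, dog = class 1, encoded as one-hot targets
cat_img, cat_y = np.zeros((3, 4)), np.array([1.0, 0.0])
dog_img, dog_y = np.ones((3, 4)), np.array([0.0, 1.0])

x, y = mixup(cat_img, cat_y, dog_img, dog_y, alpha=0.2)
# y now sums to 1 and reads as "how much cat vs. how much dog"
```

Because the same λ weights both the pixels and the targets, the mixed label directly encodes the blend of the two images, which is the "linear interpolation of features leads to linear interpolation of targets" property in one line of code.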