Where do we implement tricks to help train very deep neural networks?

Posted by Chun-Min Jen on August 24, 2020

Introduction to Convolutional Neural Networks (CNNs)

In this blog post, I will introduce a great variety of tricks that can help with training really deep networks.

To begin with, let's discuss activation functions. We already know the Sigmoid activation: it takes x as input and outputs Sigma(x). What happens when we apply backpropagation through Sigma(x)? We get the gradient of the loss with respect to the output of the Sigmoid, dL/dSigma(x). Then, using the chain rule, we can compute dL/dx, which is dL/dSigma multiplied by dSigma/dx. For the Sigmoid, the derivative dSigma/dx is Sigma(x) multiplied by (1 - Sigma(x)). What's wrong with that? The problem is that when the value of Sigma is close to either zero or one, the gradient vanishes. What does vanishing mean? When dSigma/dx is very close to zero, the gradients that flow back to all of the previous layers via the chain rule are multiplied by dSigma/dx and become close to zero as well, so the earlier parameters stop updating. That is the problem of so-called vanishing gradients. Another problem is that the output of the Sigmoid is not zero-centered. Remember that neural networks prefer inputs with zero mean and unit variance; that is, they require their inputs to be normalized. The last problem is that the exponential is computationally expensive when a very deep network contains millions of neurons.

There is another activation function, the hyperbolic tangent or tanh. Although tanh is zero-centered, which is a plus, it still saturates pretty much like the Sigmoid, so replacing Sigmoid with tanh won't help that much.

There is one more activation function, called ReLU, or rectified linear unit. ReLU simply takes the maximum of x and zero. It is fast to compute, its gradient does not vanish for positive x, and in practice it provides faster convergence. However, ReLU has problems too. The first is that it is not zero-centered. The second is that the activation is zero on the whole negative axis. That means that if we're unlucky during initialization, a neuron can end up with weights that always give a zero activation. And if we're unlucky enough, this neuron will never update, because for the part where x is less than zero the gradient is zero. This is the so-called dying ReLU problem: the neuron is never activated and never updated, because the gradient of its activation is zero.

Don't feel discouraged, because we can easily fix this. How do we handle the dying ReLU issue? We can use the Leaky ReLU activation, which adds a little bit of slope on the negative axis, where the regular ReLU had a zero activation. Leaky ReLU looks like max(ax, x). With Leaky ReLU the neuron won't die: even if we're unlucky with a close-to-zero initialization, we still have a small non-zero gradient that keeps changing the weights, so the neuron stays alive. One caveat is the slope parameter a in Leaky ReLU. The slope a should not be as large as one (or greater), because a slope of one turns Leaky ReLU into a linear activation, and stacking linear activations just gives another linear function, which doesn't work. Okay, now we know how activation functions behave.
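
To make this concrete, here is a minimal NumPy sketch of these activations and of the Sigmoid gradient (the function names are my own, not from any particular library). Evaluating the Sigmoid gradient far from zero shows the vanishing effect directly.

```python
import numpy as np

def sigmoid(x):
    # Sigma(x) = 1 / (1 + e^(-x)); saturates near 0 and 1
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # dSigma/dx = Sigma(x) * (1 - Sigma(x)); vanishes when Sigma is near 0 or 1
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    # max(0, x): cheap to compute, non-saturating for x > 0
    return np.maximum(0.0, x)

def leaky_relu(x, a=0.01):
    # max(a*x, x) for a small slope a; keeps a non-zero gradient for x < 0
    return np.where(x > 0, x, a * x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid_grad(x))   # nearly 0 at x = -10 and x = 10: vanishing gradients
print(relu(x))
print(leaky_relu(x))
```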

Let's look at weight initialization. Perhaps we can start with all zeros. Consider a simple example where we have four inputs and three neurons, Sigma 1, Sigma 2, and Sigma 3, each with a Sigmoid activation. Let's look at how backpropagation computes dL/dw2, the gradient of the loss with respect to w2. First, we take the gradient of the loss with respect to the output neuron, Sigma 1, to get dL/dSigma 1. Then we take the derivative of the Sigmoid activation with respect to its input, which gives Sigma 1 multiplied by (1 - Sigma 1). Finally, we take the derivative of the weighted sum of the inputs, x1 through x4, with respect to w2, which is simply Sigma 2. If we look at the same update rule for the weight w3, the derivative stays the same, except that Sigma 2 is replaced with Sigma 3. What does this mean? First, with zero initialization Sigma 2 and Sigma 3 are the same; they just crunch the same numbers. That means w2 and w3, which both start at zero, will get the same updates, and therefore change in exactly the same way. Continuing this argument with the chain rule, we can show that Sigma 2 and Sigma 3 will always receive the same updates to their weights. We end up with two neurons that are exactly the same, and we won't be able to learn complex representations with them. This is called the symmetry problem, and we need to break that symmetry.
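
A short NumPy sketch, assuming a tiny 4-input network with two hidden Sigmoid neurons and a squared-error loss (the shapes and loss are my choice, not from the original example), shows the symmetry: with all-zero weights the two hidden neurons receive identical gradients, so they can never become different.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny network: 4 inputs -> 2 hidden sigmoid neurons -> 1 output sigmoid neuron
x = np.array([0.5, -1.0, 2.0, 0.3])
W1 = np.zeros((2, 4))     # zero initialization: both hidden neurons compute the same thing
w2 = np.zeros(2)          # output weights (the w2 and w3 of the text)
y = 1.0                   # target

# Forward pass
h = sigmoid(W1 @ x)       # both hidden activations equal sigmoid(0) = 0.5
out = sigmoid(w2 @ h)

# Backward pass for the squared-error loss L = 0.5 * (out - y)^2
d_out = (out - y) * out * (1.0 - out)
d_w2 = d_out * h                                   # gradient for the output weights
d_W1 = (d_out * w2 * h * (1.0 - h))[:, None] * x   # gradient for the hidden weights

print(d_w2)   # both entries are equal, so the output weights stay equal after every update
print(d_W1)   # both rows are equal too: the two hidden neurons remain copies of each other
```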

How can we break the symmetry among multiple neurons? Maybe we can start with small random numbers, right? But how small? Let's draw a value from the standard normal distribution and multiply it by 0.03. Will that be small enough? We already know that linear models work best when inputs are normalized. A neuron is a linear combination of its inputs followed by an activation, and every neuron's output is used by the consecutive layers. That means it would be great if we normalized the outputs of the neurons as well. Let's take a neuron output before activation: a linear combination of the inputs x. The expected value of each input x is zero, because we normalize our inputs, so they have zero mean. Since we generate the weights independently of the inputs, the expected value of the linear combination is zero as well.

The variance, on the other hand, is a different story, because it can grow with consecutive layers, and a growing variance becomes a problem when we stack many layers: empirically, it hurts convergence for deep networks. Let's go back to the linear combination from above and look at its variance. The variance of the linear combination can be split into a sum of variances, provided that the weights are independent and identically distributed (we generated them that way) and that the x's are mostly uncorrelated. Then, using the fact that the weights are generated independently of the inputs, the variance of each product w*x splits into three summands: E[w]^2 * Var(x), E[x]^2 * Var(w), and Var(x) * Var(w). The first two vanish, because we generate the weights with zero mean and the normalized inputs have zero mean as well. That leaves a sum over the inputs of Var(x) * Var(w). If we assume that all inputs share the same variance and all weights share the same variance, this sum becomes Var(x) multiplied by n * Var(w).
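
A quick simulation (a sketch with an arbitrary layer width, using only the linear combinations and no activations, to isolate the variance effect) shows how fast the variance explodes when n * Var(w) is much larger than one:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 512                          # neurons per layer (arbitrary choice)
h = rng.standard_normal(n)       # normalized inputs: zero mean, unit variance

# Push the signal through 20 linear layers whose weights have variance 1,
# so n * Var(w) = 512 >> 1 and the output variance grows with every layer.
for layer in range(1, 21):
    W = rng.standard_normal((n, n))
    h = W @ h
    if layer % 5 == 0:
        print(f"layer {layer}: output variance ~ {h.var():.3e}")
```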

Now, let's summarize. The variance of the output translates into the variance of the input, multiplied by n and by the variance of the weights. If the product n * Var(w) is greater than one and we stack a lot of hidden layers, the variance of the outputs grows with each consecutive layer. So what do we want? We want this product to be one. How do we make it one? Let's scale the weights by a constant a, so that Var(aw) = a^2 * Var(w). To make n * Var(aw) equal to one, for weights drawn from a standard normal distribution with variance one, we need a to be one over the square root of n. A closely related scheme is Xavier initialization, which multiplies standard-normal weights by the square root of 2 divided by the square root of the number of inputs plus the number of outputs of the hidden layer. The initialization for ReLU neurons, by He et al., multiplies by the square root of 2 over the square root of the number of inputs.
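
Here is how those two scaling rules might look in NumPy (a sketch; the helper names are mine). Repeating the previous experiment with He-scaled weights and ReLU keeps the variance on the order of one instead of letting it explode.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Xavier initialization: scale standard-normal weights by sqrt(2 / (n_in + n_out))
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / (n_in + n_out))

def he_init(n_in, n_out):
    # He et al. initialization for ReLU layers: scale by sqrt(2 / n_in)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

n = 512
h = rng.standard_normal(n)        # normalized input
for layer in range(20):
    z = he_init(n, n) @ h         # pre-activation
    h = np.maximum(0.0, z)        # ReLU
print(f"pre-activation variance at layer 20: {z.var():.3f}")   # stays on the order of 1
```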

We now know how to initialize our network to constrain the variance at the start. But what if the variance drifts as training goes on? After many updates we no longer control it, because anything can happen to the weights. There is a technique known as batch normalization that controls the mean and variance of the outputs before the activation. Let h denote the neuron output before activation. The first step is to give h zero mean and unit variance, by subtracting the mean of the neuron outputs and dividing by the square root of their variance. Afterward, we multiply the normalized output by gamma, which gives it a new variance of gamma squared, and then we add beta, so that the new mean is beta. Where do mu and sigma come from? We estimate them on the current training batch, and we can do that on every step of backpropagation. But what do we do at test time? During testing we use an exponential moving average over the training batches. How does it work? We take the current running values of mu and sigma squared, multiply them by 1 minus alpha, where alpha is a small number between 0 and 1, and add the current batch mean or variance multiplied by alpha. Doing this over all training batches leaves us with a moving average of these values, which in practice works better at test time. What about gamma and beta? Normalization is a differentiable operation, so we can learn them with backpropagation; it's not a problem.
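
A minimal sketch of this forward pass, assuming the pre-activations h come as a (batch, features) matrix and using a class name of my own, might look like this:

```python
import numpy as np

class BatchNorm1D:
    """Sketch of batch normalization for pre-activations h of shape (batch, features)."""

    def __init__(self, num_features, alpha=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)        # learned scale (trained by backprop)
        self.beta = np.zeros(num_features)        # learned shift (trained by backprop)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.alpha = alpha                        # weight of the current batch in the moving average
        self.eps = eps                            # small constant for numerical stability

    def forward(self, h, training=True):
        if training:
            mu, var = h.mean(axis=0), h.var(axis=0)
            # Exponential moving average over training batches, used later at test time
            self.running_mean = (1 - self.alpha) * self.running_mean + self.alpha * mu
            self.running_var = (1 - self.alpha) * self.running_var + self.alpha * var
        else:
            mu, var = self.running_mean, self.running_var
        h_hat = (h - mu) / np.sqrt(var + self.eps)   # zero mean, unit variance
        return self.gamma * h_hat + self.beta        # new mean beta, new variance gamma^2

bn = BatchNorm1D(3)
batch = np.random.default_rng(0).normal(5.0, 2.0, size=(32, 3))
out = bn.forward(batch, training=True)
print(out.mean(axis=0).round(3), out.var(axis=0).round(3))   # roughly 0 and 1
```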

There is one more regularization technique, known as dropout, which is used to reduce overfitting. It works like this. During training, each neuron is kept active with probability p; if we sample the neurons independently, each of them becomes inactive, with its output and the gradients flowing through it set to zero, with probability 1 minus p. In this way we sample a sub-network on every training iteration and update only a subset of the parameters. During testing, all neurons are present, but their outputs are multiplied by p to maintain the scale of the inputs to the consecutive layers. Why does it work like that? During training, a neuron was present with probability p, and the consecutive layer multiplied its output by the weight w. What happens during testing? If we don't change the weight w, the consecutive layer will see an expected value that is too large. If we compute the real expected contribution of that neuron, it is the following: with probability p its output was multiplied by w, and with probability 1 minus p it was nullified, meaning the neuron was not active. So the expected contribution is p times w, and we need to replace the weights w with pw at test time. The authors of dropout say that it is similar to having an ensemble of an exponentially large number of smaller networks: on every iteration of backpropagation during training we sample one of these networks, whereas during testing all the neurons in the network are used.
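
A small NumPy sketch of this scheme (keep with probability p during training, scale by p at test time; the function names are mine, and many modern implementations use the equivalent "inverted" variant that instead scales by 1/p during training):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(h, p=0.8):
    # Keep each neuron with probability p; dropped neurons output zero
    mask = rng.random(h.shape) < p
    return h * mask

def dropout_test(h, p=0.8):
    # All neurons are present at test time, so scale the outputs by p
    # to keep the expected input of the next layer unchanged: E = p*h + (1-p)*0
    return h * p

h = rng.standard_normal(10)
print(dropout_train(h))   # a random subset of activations is zeroed on every call
print(dropout_test(h))    # deterministic, rescaled activations
```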

One more technique used in modern convolutional neural networks is data augmentation. Modern models have millions of parameters, as we will see later, but data sets are not that huge. To tackle this problem, we can generate new examples by applying distortions such as flips, rotations, color shifts, scaling, and so on. You can see this in the image with the cats: when we distort the image, the cats are still cats, yet we get new variations that help our deep neural network generalize better. Remember that convolutional neural networks are invariant to translation, so there is no need to add translation distortions.
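
As a small illustration (a sketch with a made-up helper name, operating on an (H, W, 3) image with values in [0, 1]), flips and color shifts can be generated with a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image):
    # Apply simple label-preserving distortions: a random horizontal flip
    # and a random per-channel color scaling.
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                            # horizontal flip
    out = out * rng.uniform(0.8, 1.2, size=(1, 1, 3))    # color shift
    return np.clip(out, 0.0, 1.0)

image = rng.random((32, 32, 3))        # stand-in for a real training image
print(augment(image).shape)            # same shape, but a new training example
```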

In conclusion, we have reviewed activation functions, weight initialization, and a bunch of techniques that help to train better networks. What are the takeaways? First, always use the ReLU activation, because it doesn't saturate and it converges faster. Use He et al. initialization, which scales the weights by the square root of two divided by the square root of the number of inputs, to keep the variances under control as well. Try adding batch normalization or dropout; backpropagation may converge better. Or try augmenting your training data.