Why CNN (if MLP Is Available)?

Posted by Chun-Min Jen on October 29, 2020

Introduction to Convolutional Neural Networks (CNN)

This blog post is about how to solve computer vision tasks with neural networks. We already know about the Multi-Layer Perceptron (MLP), which can have many hidden layers. Beyond the MLP, we will introduce a new kind of layer specifically designed for image input. What is an image input? Take a gray-scale image. It is actually a matrix of pixels, or picture elements. The dimensions of this matrix are called the image resolution, for example 300 by 300. Each pixel stores its brightness, or intensity, ranging from 0 to 255: dark colors correspond to values near zero, and light colors are close to 255. Color images store pixel intensities for three different channels: red, green, and blue.
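As a quick illustration (a sketch of my own, using NumPy and an arbitrary 300 by 300 resolution), a gray-scale image is just a 2-D array of intensities, and a color image adds a channel axis:

```python
import numpy as np

# A gray-scale image is a 2-D matrix of pixel intensities in [0, 255].
gray = np.random.randint(0, 256, size=(300, 300), dtype=np.uint8)
print(gray.shape)              # (300, 300): the image resolution
print(gray.min(), gray.max())  # values between 0 (black) and 255 (white)

# A color image stores one intensity matrix per channel: red, green, blue.
color = np.random.randint(0, 256, size=(300, 300, 3), dtype=np.uint8)
print(color.shape)             # (300, 300, 3)
```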

Neural networks like normalized inputs. A normalized image is just the matrix of pixel values divided by 255 with 0.5 subtracted, so the inputs are roughly zero-mean. What do we do next? We already know about the MLP, right? So what if we use it for this task? We take our pixels as input nodes, and for each of these pixels we train a weight W. Our perceptron takes all of these inputs, multiplies them by the weights, adds a bias term, and passes the result through an activation function. It seems like we can apply an MLP to images, right? But it actually doesn't work as well as you might imagine. Let's look at an example. Say we want to train a cat detector. On a training image where the cat sits in the lower-right corner, the red weights change during back-propagation to better detect the cat. On the other hand, if the cat sits in the upper-left corner, the green weights change instead. What is the problem here? The problem is that we learn the same cat features separately for different locations, so we don't fully utilize the training set: the red weights are trained only on images where the cat appears in that corner, and the same holds for the green weights. What about test images where cats appear in yet other spots? If an MLP is employed, our neurons won't be ready for that.
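Here is a minimal sketch of that normalization step, dividing by 255 and subtracting 0.5 so the inputs are roughly zero-mean (the normalize helper is just an illustrative name of mine):

```python
import numpy as np

def normalize(image):
    """Scale pixel intensities from [0, 255] to roughly [-0.5, 0.5]."""
    return image.astype(np.float32) / 255.0 - 0.5

image = np.random.randint(0, 256, size=(300, 300), dtype=np.uint8)
x = normalize(image)
print(x.min(), x.max())   # values now lie in [-0.5, 0.5]
```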

Fortunately, we have convolutions. By definition, a convolution is a dot product of a kernel (or filter) with a patch of the image of the same size, also called a local receptive field. Let's see how it works. We have an input, which can be an image, and we have a sliding window. We multiply the patch extracted by the window, the local receptive field, element-wise by the kernel and sum the result. Then we slide the window across the image and take this dot product with the kernel at every possible location. Let's see an example. We have an original image and a kernel with an eight in the center and minus ones everywhere else. How does it work? When the patch is a solid fill, that is, when all pixels of the patch have the same value, the dot product sums to zero, which corresponds to black. This kernel therefore acts as an edge detector: whenever the image contains an edge rather than a solid fill, we get a non-zero activation. Another example is a sharpening filter. It has a five in the center and minus ones to the north, west, east, and south, so its entries do not sum to zero and it does not behave like an edge detector. For solid fills it outputs the same color, but around an edge it adds a little extra intensity, because it is somewhat similar to the edge detection kernel. That's why we perceive the result as sharper. Last but not least, a simple kernel that just averages the inputs over a patch loses detail and acts like a blur.
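To make this concrete, below is a rough sketch of a naive 'valid' convolution in NumPy, tried on the three kernels described above (edge detection, sharpening, and an averaging blur); the conv2d helper and the solid-fill test image are my own illustrative choices:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2-D convolution: dot product of the kernel with every patch."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]      # local receptive field
            out[i, j] = np.sum(patch * kernel)     # dot product with the kernel
    return out

edge    = np.array([[-1, -1, -1], [-1,  8, -1], [-1, -1, -1]])  # entries sum to 0
sharpen = np.array([[ 0, -1,  0], [-1,  5, -1], [ 0, -1,  0]])  # entries sum to 1
blur    = np.ones((3, 3)) / 9.0                                 # average over the patch

solid = np.full((5, 5), 100.0)          # a solid fill
print(conv2d(solid, edge)[0, 0])        # 0.0   -> black, no edge detected
print(conv2d(solid, sharpen)[0, 0])     # 100.0 -> same color preserved
print(conv2d(solid, blur)[0, 0])        # 100.0 -> average of the patch
```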

A convolution is actually similar to correlation. Take an image where a backslash is painted. When we convolve it with a kernel that also looks like a backslash, we get a pair of non-zero dot products, corresponding to two locations of the sliding window, and zeros everywhere else. If we take a different image where the slash is not a backslash but a forward slash, and convolve it with the same backslash kernel, then the activations in the output become much weaker. What do we see here? If we take the maximum activation of the convolutional output, it is two for the first example and one for the second. In effect, we have built a simple classifier that tells backslashes from forward slashes in the image.
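Assuming a tiny 3 by 3 image and a 2 by 2 backslash kernel (my own illustrative sizes), a short sketch reproduces those maxima of two and one, using the same kind of naive 'valid' convolution as above:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2-D convolution (same helper as in the previous sketch)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

backslash_kernel = np.array([[1, 0],
                             [0, 1]])
backslash_image  = np.eye(3)             # a painted backslash
forward_image    = np.fliplr(np.eye(3))  # a painted forward slash

print(conv2d(backslash_image, backslash_kernel).max())  # 2.0 -> backslash detected
print(conv2d(forward_image,  backslash_kernel).max())   # 1.0 -> weaker response
```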

Another interesting property of the convolution is translation equivariance. It means that translating the input and then applying the convolution gives the same result as applying the convolution first and then translating the output. Let's take the backslash example and move the backslash to a different position in the image. After the kernel is applied, the convolution result contains the same numbers as before; the only difference is that these numbers are translated. Since only the positions change, the maximum stays the same as it was without the translation. Therefore, if we take the maximum of these outputs, our simple classifier becomes invariant to translation.
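The same kind of sketch can check this claim: we paint a backslash on a slightly larger 5 by 5 canvas (an assumption of mine, so the pattern stays away from the border), shift it, and confirm that the strongest activation keeps its value and only changes position:

```python
import numpy as np

def conv2d(image, kernel):
    """Naive 'valid' 2-D convolution (same helper as above)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

kernel = np.array([[1, 0],
                   [0, 1]])            # backslash pattern

image = np.zeros((5, 5))
image[[0, 1, 2], [0, 1, 2]] = 1        # backslash in the top-left corner
shifted = np.roll(image, shift=(1, 1), axis=(0, 1))  # same backslash, moved down-right

out_original = conv2d(image, kernel)
out_shifted  = conv2d(shifted, kernel)

# The strongest activation has the same value; only its location moves.
print(out_original.max(), np.unravel_index(out_original.argmax(), out_original.shape))
print(out_shifted.max(),  np.unravel_index(out_shifted.argmax(),  out_shifted.shape))
```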

How does a single convolutional layer in a neural network work? First, we have an input, which can be an image, and we surround it with so-called padding. The padding is added in order to keep the spatial dimensions of the convolution output the same as those of the input. Okay, let's see how it works. We take the first 3x3 patch from the padded image and compute a dot product with our kernel, whose weights W1 to W9 (3x3) are the parameters we need to train. For this patch, the output is W6 + W8 + W9 + b, where b is a bias term. Finally, we apply an activation function, which can be a sigmoid. After completing the convolution on the first patch, we slide the kernel window by one pixel so that we get a different neuron. The step by which we move the window is called the stride. In this example, we have a stride of one, and the second output is W5 + W7 + W8 + b. Note that W8 was already used in the first convolution and is now re-used: the weight W8 is shared with the second patch of the image. Then we apply the sigmoid activation to the second output as well. If we continue sliding the kernel window with a stride of one, from the top left down to the bottom right, we obtain a so-called feature map. This feature map has the same spatial dimensions as the input image, which is 3x3. Consequently, the layer has 10 parameters in total: the nine weights W1 to W9 plus the bias term b.
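Putting these pieces together, here is a minimal sketch of one convolutional layer forward pass: zero padding, a shared 3 by 3 kernel (W1 to W9), a bias b, a stride of one, and a sigmoid activation, so the feature map keeps the same 3 by 3 shape as the input and the layer has ten trainable parameters. The conv_layer helper and the random inputs are illustrative only:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_layer(image, weights, bias, stride=1):
    """One convolutional layer: zero padding, shared weights, bias, sigmoid."""
    kh, kw = weights.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))  # zero padding
    out_h = (padded.shape[0] - kh) // stride + 1
    out_w = (padded.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = padded[i * stride:i * stride + kh, j * stride:j * stride + kw]
            feature_map[i, j] = sigmoid(np.sum(patch * weights) + bias)
    return feature_map

image = np.random.rand(3, 3)           # the 3x3 input from the example
weights = np.random.randn(3, 3)        # W1..W9, shared across all patches
bias = 0.1                             # b

print(conv_layer(image, weights, bias).shape)  # (3, 3): same dimensions as the input
# Trainable parameters: 9 weights + 1 bias = 10
```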

How does back-propagation work for convolutional neural networks? Let's use a simple example to answer this question. The image contains 9 pixels (3x3). Suppose this time the kernel window is 2x2 (instead of the 3x3 kernel above); without padding, the overall number of output neurons is 4. Take the first patch of the image. The kernel weight that corresponds to W4 in this patch, let's denote it B. When we slide the kernel window to the right, the next patch uses parameter A in place of W4. Eventually we realize that there are four different locations where W4 is used again and again. Let's pretend that this weight is not a single W4 but four different parameters, A, B, C and D. How will back-propagation work then? We compute the gradients dL/dA, dL/dB, and so forth. During back-propagation, we make a step in the direction opposite to the gradient, right? Looking at these update rules, you can see that we are updating A, B, C and D with the same kind of rule, but A, B, C and D all refer to the same parameter, W4. Why is that? Because this weight is shared across the convolutional layer. That means we are effectively changing the value of W4 in four steps, and in each step a gradient is applied. Hence, the overall gradient of W4 is equal to the sum of the four individual gradients with respect to A, B, C and D. That's how back-propagation works for a convolutional layer. In brief, in the forward pass the same shared weight is re-used at every window location, and in the backward pass we sum up the corresponding gradients (for A, B, C and D) to get the overall update of W4 as the kernel window slides across the entire input image. In every convolutional layer, the same kernel window is applied to the input image and produces the output neurons (one neuron per convolution operation, i.e., per inner product of a patch with the kernel). By following this principle, the weight parameters are shared, so the layer needs far fewer parameters and the model trains more effectively.
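This gradient-summing argument can be verified numerically with a tiny sketch (my own setup: a 3 by 3 input, a 2 by 2 kernel, no activation, and a loss that is simply the sum of the four output neurons). The gradient of the shared weight W4, here kernel[1, 1], equals the sum of the four per-patch contributions that played the roles of A, B, C and D:

```python
import numpy as np

np.random.seed(0)
image  = np.random.rand(3, 3)      # 3x3 input
kernel = np.random.rand(2, 2)      # 2x2 kernel; kernel[1, 1] plays the role of W4
bias   = 0.0

def loss(kernel):
    """Forward pass: 'valid' convolution (four neurons), loss = sum of outputs."""
    total = 0.0
    for i in range(2):
        for j in range(2):
            total += np.sum(image[i:i + 2, j:j + 2] * kernel) + bias
    return total

# Analytic gradient of W4: sum of the four per-patch contributions (A, B, C, D).
grad_w4_analytic = sum(image[i + 1, j + 1] for i in range(2) for j in range(2))

# Numerical gradient of W4 as a sanity check.
eps = 1e-6
bumped = kernel.copy()
bumped[1, 1] += eps
grad_w4_numeric = (loss(bumped) - loss(kernel)) / eps

print(grad_w4_analytic, grad_w4_numeric)   # the two values agree
```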

Let's recall the cat problem mentioned earlier. When the cat appears in different regions of different images, how can we construct one effective model that learns the same cat features everywhere? An MLP cannot solve this problem, but a convolutional neural network can. With a convolutional layer, we train the same cat features wherever the cat is located in the input image. Look at the example: we have a 300 by 300 input, an output of the same size, and a 5 by 5 convolutional kernel. In the convolutional layer, we only need 26 (= 5x5 + 1) parameters to train. What if we instead used a fully connected layer, where each output is a perceptron over all inputs? In that case we would need about eight billion parameters, which is far too many. The convolutional layer can be viewed as a special case of a fully connected layer in which all the weights outside each neuron's local receptive field equal zero and all kernel parameters are shared among neurons.
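A quick back-of-the-envelope calculation confirms both counts: 26 parameters for the shared 5 by 5 kernel versus roughly eight billion weights for a fully connected layer mapping a 300 by 300 input to a 300 by 300 output (biases ignored on the fully connected side):

```python
# Convolutional layer: one shared 5x5 kernel plus one bias.
conv_params = 5 * 5 + 1
print(conv_params)                 # 26

# Fully connected layer: every one of the 300*300 outputs is connected
# to every one of the 300*300 inputs.
fc_params = (300 * 300) * (300 * 300)
print(fc_params)                   # 8_100_000_000, i.e. about eight billion
```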

To wrap it up, we have introduced why the convolutional layer matters and how it works better than a fully connected layer for images. This convolutional layer will be used as a building block for large and deep networks.