How is CNN architecture built up from scratch?

Posted by Chun-Min Jen on September 28, 2020

Introduction to Convolutional Neural Network (CNN)

In this blog post, I will discuss one more useful type of layer, and at the end we will build our first fully working neural network for images.

Before we dive into the details, let’s look at how we deal with color images. A color image has three input channels: Red (R), Green (G) and Blue (B). Because of these three channels, the image is represented as a tensor instead of a matrix. A tensor is a multidimensional array; here it has a width (W), a height (H) and a channel dimension Cin, e.g. Cin = 3 for the RGB channels.
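For instance, here is a minimal NumPy sketch of that idea (the 32 by 32 size and the random pixel values are just placeholders standing in for a real photo):

```python
import numpy as np

# A stand-in 32x32 RGB image: height x width x channels (H, W, Cin).
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)
print(image.shape)  # (32, 32, 3) -- three input channels: R, G, B
```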

How do we apply convolutions? The convolutional kernel becomes a tensor as well, of size Wk by Hk by Cin. To apply a convolution, we first extract a volumetric patch from the image and take the inner (dot) product of that patch with the kernel tensor. This gives a single output value of the corresponding feature map. Then we slide the window to the adjacent column (right next to the first, leftmost one), extract the next volumetric patch, and obtain a different convolution output at a different location of the input image. To extract richer information from the image and build depth in the output volume, we need more volumetric kernels. That means we train Cout kernels, each of size Wk by Hk by Cin. With a stride of 1 and enough zero padding, we get W by H by Cout output neurons. In effect, we can slice the output volume along its depth: every depth slice of the convolved volume is one feature map, produced by one convolutional kernel. The total number of trainable parameters is (Wk × Hk × Cin + 1) × Cout, where the plus one accounts for the bias term of each kernel.
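As a quick sanity check, here is a small PyTorch sketch of the output volume and the parameter count (the channel counts, the 3 by 3 kernel, and the 28 by 28 input are illustrative choices, not values from the post):

```python
import torch
import torch.nn as nn

# Illustrative sizes: a 3-channel input, 16 output feature maps, 3x3 kernels.
C_in, C_out, Wk, Hk = 3, 16, 3, 3

conv = nn.Conv2d(in_channels=C_in, out_channels=C_out,
                 kernel_size=(Hk, Wk), stride=1, padding=1)  # padding keeps W and H unchanged

x = torch.randn(1, C_in, 28, 28)   # a batch of one 28x28 RGB image
out = conv(x)
print(out.shape)                   # torch.Size([1, 16, 28, 28]) -> W by H by C_out

# Each kernel has Wk x Hk x C_in weights plus one bias; there are C_out kernels.
n_params = (Wk * Hk * C_in + 1) * C_out
print(n_params)                                     # 448
print(sum(p.numel() for p in conv.parameters()))    # matches: 448
```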

But is one convolutional layer good enough? Say the neurons of the first convolutional layer look at patches of the image of size 3 by 3. What if an object of interest is bigger than that? Then it looks like we need a second convolutional layer on top of the first. The first 3 by 3 convolutional layer has a local receptive field of 3 by 3. If we put a second convolutional layer on top of it, the neurons of the second layer actually have a receptive field of 5 by 5, because of the underlying neurons and their receptive fields. Let’s look at what happens if we stack N convolutional layers. For simplicity, consider one-dimensional inputs. A 1 by 3 convolutional layer has a receptive field of 1 by 3. When we add a second convolutional layer of the same size, the receptive field grows to 1 by 5. After the fourth layer, it is 1 by 9. Can you derive a formula from this? If we stack N convolutional layers with the same 3 by 3 kernel size, the receptive field at layer N will be 2N + 1 by 2N + 1. What does that mean? It means we would need to stack a lot of convolutional layers to be able to identify objects as big as the input image. For a 300 by 300 image, we would need 150 convolutional layers. What if we need to grow the receptive field faster? We can increase the stride of our convolutional layer, which also reduces the output dimensions. Let’s see how it works for a 2 by 2 convolution with stride 2. We are effectively splitting the image into non-overlapping patches (colored pink, red, yellow and blue in the figure). If we use a single “backslash” kernel, we get the results 7, 9, 4 and 6. That’s how this convolution works. If we add a second convolutional layer of the same 2 by 2 size, those layers will effectively double their receptive field, because we use a stride of 2.
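The receptive-field arithmetic above can be verified with a small helper; this is a sketch of the standard recursion, not code from the original post:

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field of a stack of identical conv layers (1-D case).

    Standard recursion: r_n = r_{n-1} + (kernel_size - 1) * jump_{n-1},
    where the 'jump' (distance between adjacent outputs, measured in
    input pixels) is multiplied by the stride at every layer.
    """
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf

# Stride 1, kernel 3: the receptive field grows as 2N + 1.
print([receptive_field(n) for n in (1, 2, 3, 4)])   # [3, 5, 7, 9]
print(receptive_field(150))                          # 301 -- enough to cover 300 pixels

# Stride 2, kernel 2: the receptive field doubles with every layer.
print([receptive_field(n, kernel_size=2, stride=2) for n in (1, 2, 3, 4)])  # [2, 4, 8, 16]
```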

But how do we maintain translation invariance? We introduce a new layer called a pooling layer. This layer works like a convolutional layer, but it doesn’t have a kernel; instead, it takes the maximum or the average of its inputs. Let’s look at an example. We have a 200 by 200 by 64 input volume; take a single depth slice from that volume and apply 2 by 2 max pooling with stride 2. How does max pooling work? We take the maximum value from the first patch, and that is our output; in this case, it is 6. Then we slide the pooling window to the next patch and take its maximum to get 8. That’s exactly how max pooling works. If you look at the resulting feature map, it means we downsample our image. As a consequence, we lose some details, but the image stays roughly the same. Notice one more thing: when we apply the pooling layer, we do it depth-wise, which means we don’t change the number of output channels; we only change the spatial dimensions. So the volume of 200 by 200 by 64 becomes a volume of 100 by 100 by 64.
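Here is a short PyTorch sketch of that behavior (the patch values are made up for illustration; only the shapes match the example above):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2)

# The 200 x 200 x 64 volume from the text (PyTorch is channels-first: N, C, H, W).
x = torch.randn(1, 64, 200, 200)
print(pool(x).shape)   # torch.Size([1, 64, 100, 100]) -- channels unchanged, resolution halved

# The same operation on a tiny single-channel patch, to see the maxima being picked:
patch = torch.tensor([[[[1., 6., 2., 8.],
                        [3., 4., 7., 5.],
                        [0., 2., 1., 3.],
                        [9., 1., 4., 2.]]]])
print(pool(patch))     # tensor([[[[6., 8.], [9., 4.]]]])
```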

But how does backpropagation work for the max pooling layer? Strictly speaking, taking a maximum is not a differentiable function. Despite that, we apply a simple heuristic to make it work. Look at the patch over which the max pooling layer takes its maximum, and pick a neuron that does not have the maximum activation. If we change that neuron’s value a little bit, the variation will not change the maximum over the patch; the maximum stays the same, in this case 8. That means there is no gradient with respect to the non-maximum neurons of the patch, since varying them slightly doesn’t affect the output. However, what happens if we change the neuron that provides the maximum value to the max pooling layer? If we change that specific neuron’s value, the maximum changes linearly with it. That means that for the maximum neuron of the patch, we have a gradient of 1.

Let’s put it all together into a simple convolutional neural network, developed in 1998 by Yann LeCun for handwritten digit recognition on the MNIST dataset. This dataset contains 10 classes of handwritten digits, ranging from 0 to 9. So how does it work? We take our input, a grayscale image of size 32 by 32. We apply the first convolutional layer with 5 by 5 convolutions and learn six different kernels. Then we apply a pooling layer, so that we lose some details and gain some translation invariance. The pooling layer effectively halves the resolution of the image, which becomes 14 by 14 by 6; the number of output channels doesn’t change. Then we add one more convolutional layer with the same kernel size, denoted 5 by 5 by 6, and learn 16 of these kernels. What do we do next? We apply one more pooling layer, which leaves a 5 by 5 by 16 volume. We could go on and on, but at some point we have to stop. Finally, we use a classifier that takes those features and produces probability outputs for the digits. For the digit classification, we use a stack of fully connected layers: a fully connected layer of 120 neurons, then 84, and finally 10 neurons with a softmax function applied to the outputs. So what can we see from this diagram? Neurons of deep convolutional layers learn complex representations that can be used as features for classification with an MLP. The first convolutional/pooling part is effectively an automatic feature extractor: it extracts features that are useful for the classification performed by the MLP.
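A PyTorch sketch of this architecture might look as follows. Note that the ReLU activations and max pooling here are modern substitutes for LeNet-5’s original tanh units and subsampling layers, and the softmax is left to the loss function (e.g. nn.CrossEntropyLoss), so this is an approximation of the 1998 network rather than a faithful reproduction:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """A sketch of the LeNet-5-style architecture described above."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32x1 -> 28x28x6 (six 5x5 kernels)
            nn.ReLU(),
            nn.MaxPool2d(2, 2),               # 28x28x6 -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14x6 -> 10x10x16 (sixteen 5x5x6 kernels)
            nn.ReLU(),
            nn.MaxPool2d(2, 2),               # 10x10x16 -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(5 * 5 * 16, 120),
            nn.ReLU(),
            nn.Linear(120, 84),
            nn.ReLU(),
            nn.Linear(84, num_classes),       # softmax is applied by the loss function
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
digits = torch.randn(4, 1, 32, 32)            # a batch of four 32x32 grayscale images
print(model(digits).shape)                    # torch.Size([4, 10]) -- one score per digit class
```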

Let’s take the task of human face recognition. If you use a convolutional neural network for that task, you can see that different convolutional layers fire when they see different patches of the image. The first convolutional layer produces large activations when it sees edges at different angles. The second convolutional layer uses those edges with different directions to learn more complex things, like a human nose or a human eye. The third convolutional layer in turn uses the representations that the second convolutional layer has learned: having the concepts of an eye, a nose, or a throat, it can put them together and learn the representation of a human face. What have we done so far? We have used convolutional, pooling, and fully connected layers to build our first network for handwritten digit recognition.