Introduction to Convolutional Neural Networks (CNNs)
Here, we will take a quick look at computer vision problems that successfully utilize convolutional networks. So far, we have examined the image classification task, which takes an image as an input and produces a class label as an output. We will now review two more computer vision tasks beyond image classification.
The first one is semantic segmentation, where we have an image as an input and, as an output, we need to give a class label for each pixel of that image. For example, which pixels correspond to water, which pixels correspond to a duck, and which to grass. The second is image classification plus localization. In this task, we not only need to say which object we see in the image, but also where we see it. For that, we need to define a bounding box that contains the object of interest.
Let’s start with semantic segmentation. For this task, we need to classify each pixel of our image. So what do we do when we have an image as an input? We stack convolutional layers, right? And we keep the same width and height as our input image, because we will need to classify each pixel. And what do we do next? Usually we add pooling layers. But in this particular task, that is not easy to do, because pooling effectively downsamples our image, so our per-pixel classification won’t be crisp but pixelated. And we don’t want that. So let’s maintain the width and height of our volumes as we stack more and more convolutional layers. The final layer, however, will be different: it will have a number of output channels equal to the number of classes that we need for our segmentation. For example, each depth slice will be responsible for a different class, such as water, duck, or grass. In the end, for every pixel we take the values along the depth of that final volume (the width and height are kept unchanged) and apply a softmax over those output channels. This is a rather naive approach: we stack convolutional layers and add a per-pixel softmax. We go deep, but we don’t add pooling, and going deep without pooling is computationally too expensive.

So let’s add pooling, which acts like downsampling. We have an image as an input, then the first convolutional layer followed by one pooling layer. After adding the pooling layer, we reduce the width and height of our volume and, in the meantime, increase the depth. Then we have one more convolutional layer plus one more pooling layer, and then we stack one more convolutional layer. But wait a second: we need to classify each pixel, and right now the width and height of our volume are significantly reduced because of the two pooling layers. We need to undo the pooling somehow. For that task, we will use a special layer that does upsampling, and after upsampling we will use convolutional layers to learn a transformation back to the original resolution. We add one more upsampling layer and one more convolutional layer. And that is how we get our semantic segmentation of the input pixels.
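To make this concrete, here is a minimal encoder-decoder sketch in PyTorch. The layer sizes, the choice of nearest-neighbor upsampling, and the three example classes (water, duck, grass) are illustrative assumptions, not part of any specific published architecture.

```python
import torch
import torch.nn as nn

num_classes = 3  # e.g. water, duck, grass

segmentation_net = nn.Sequential(
    # Encoder: convolutions keep width/height, pooling halves them.
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # H x W    -> H/2 x W/2
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                              # H/2 x W/2 -> H/4 x W/4
    nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
    # Decoder: upsampling doubles width/height, convolutions refine.
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(32, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="nearest"),
    nn.Conv2d(16, num_classes, kernel_size=1),    # per-pixel class scores
)

x = torch.randn(1, 3, 64, 64)        # a dummy RGB image
scores = segmentation_net(x)         # shape: (1, num_classes, 64, 64)
probs = scores.softmax(dim=1)        # per-pixel softmax over the classes
```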
How do we do that unpooling? The easiest way is to fill in nearest-neighbor values. For instance, we have a 2x2 input, and we replace each cell of that input with a 2x2 patch of the same value. In this way, we get a pixelated output, which is not crisp, so it’s not the best way to go. Another technique is called max unpooling. Let’s look at our architecture: we have corresponding pairs of downsampling and upsampling layers, which do the same thing but in reverse order. Let’s use that correspondence. What if we remember which element was the maximum during pooling and fill that position during unpooling? Let’s look at an example. We have a 4x4 input, and we apply max pooling of 2x2 with stride 2, remembering which neurons gave us the maximum activations. Then we move through the rest of the network, and at some point we have to do the unpooling: we need to produce a 4x4 output out of a 2x2 input. Instead of filling in nearest-neighbor values, we put these values in the locations where we had the maximum activations during the corresponding pooling, and fill the rest with zeros. This way, we get a crisper image, and it actually works better.
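Here is a small sketch of both unpooling variants using PyTorch’s built-in layers; the 4x4 input values are made up for illustration.

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 2., 6., 3.],
                    [3., 5., 2., 1.],
                    [1., 2., 2., 1.],
                    [7., 3., 4., 8.]]]])

# Max pooling 2x2 with stride 2, remembering which position won in each window.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
pooled, indices = pool(x)                      # pooled = [[5, 6], [7, 8]]

# Nearest-neighbor unpooling: every cell becomes a 2x2 patch of the same value.
nearest = nn.Upsample(scale_factor=2, mode="nearest")(pooled)

# Max unpooling: put each value back at the remembered position and fill the
# rest with zeros.
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)
restored = unpool(pooled, indices)             # 4x4 again, sparse but crisp
```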
The previous approaches are not data-driven. Imagine that our objects are round, not square, and we don’t use that information: neither nearest-neighbor unpooling nor max unpooling is aware of it. We can, however, use the data to learn better upsampling. Remember that we can replace a max pooling layer with a convolutional layer that has a bigger stride. What if we apply convolutions for the unpooling as well? Let’s see how it might work. We have a 2x2 input, and we somehow need to produce a 4x4 output. Let’s use a 3x3 convolutional filter for that. How does it work? We take the convolutional filter, multiply it by the value of one chosen input cell, and add the result to the output, moving with a stride of 2 so that we double the resolution we had in the input. Then we move to the next pixel of the input, take the same kernel (or filter) weights, multiply them by that pixel’s value, and add the result to the output as well. But what do we do in the positions where our stamped filters intersect? We simply take the sum of those values, and it still works.
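Below is a toy numpy sketch of this “stamp and sum the overlaps” idea; the filter values are random stand-ins for weights that would normally be learned. Note that with a 3x3 filter and stride 2, a 2x2 input formally produces a 5x5 output; in frameworks such as PyTorch, nn.ConvTranspose2d uses padding and output_padding to trim this to the desired 4x4.

```python
import numpy as np

def transposed_conv2d(inp, kernel, stride=2):
    """For every input cell, scale the kernel by that cell's value and stamp
    it onto the output with the given stride, summing where stamps overlap."""
    k = kernel.shape[0]
    h, w = inp.shape
    out = np.zeros(((h - 1) * stride + k, (w - 1) * stride + k))
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + k,
                j * stride:j * stride + k] += inp[i, j] * kernel
    return out

inp = np.array([[1., 2.],
                [3., 4.]])
kernel = np.random.randn(3, 3)        # stand-in for learned weights
print(transposed_conv2d(inp, kernel).shape)   # (5, 5); crop/pad to get 4x4
```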
Let’s move on to the classification plus localization task. Here, we need to find a bounding box that localizes our object. Let’s parameterize the bounding box with four numbers: x, y, w, and h, where x and y are the coordinates of the upper left corner of the box, and w and h are its width and height. We can use regression for those four parameters. Let’s see how it might work. We have a classification network, which is a bunch of convolutional layers followed by a multilayer perceptron, and we train it using the cross-entropy (log) loss. But do we need a second network to do the bounding box regression? Actually, we can reuse those convolutional layers for our new task and train a new fully connected layer that predicts the bounding box parameters, using the mean squared error (MSE) for that. But how do we train such a network when we have two different losses? We take the sum of those losses, and that gives us the final loss through which we propagate gradients during backpropagation.
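A minimal PyTorch sketch of this two-headed setup is below; the backbone layers, the number of classes, and the equal 1:1 weighting of the two losses are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ClassifyAndLocalize(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        self.backbone = nn.Sequential(            # shared convolutional layers
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.class_head = nn.Linear(32 * 4 * 4, num_classes)  # class scores
        self.bbox_head = nn.Linear(32 * 4 * 4, 4)             # x, y, w, h

    def forward(self, images):
        features = self.backbone(images)
        return self.class_head(features), self.bbox_head(features)

model = ClassifyAndLocalize()
images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 20, (8,))
boxes = torch.rand(8, 4)

class_scores, box_preds = model(images)
loss = nn.CrossEntropyLoss()(class_scores, labels) \
     + nn.MSELoss()(box_preds, boxes)   # sum of the two losses
loss.backward()                         # gradients flow into both heads
```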
In this blog post, we took a sneak peek into other computer vision problems that successfully utilize convolutional neural networks. This post concludes our introduction to neural networks for images.