What are modern CNN architectures?

Posted by Chun-Min Jen on September 29, 2020

Introduction to Convolutional Neural Networks (CNNs)

In this blog post, I will give an overview of modern convolutional neural network architectures.

Let’s first look at the ImageNet classification dataset, which has 1,000 classes distributed over 1 million labeled photos. The human top 5 error rate on this dataset is roughly 5%. Why is it not zero? Because some classes are genuinely hard to tell apart, as examples from the dataset show. A quail and a partridge, for instance, look almost identical, so it is not obvious how a computer could distinguish between them. The first breakthrough happened in 2012, when a deep convolutional neural network (AlexNet) was applied to the ImageNet dataset for the first time. It significantly reduced the top 5 error from 26% to 15%. This network combines 11 by 11, 5 by 5, and 3 by 3 convolution kernels (or filters), max pooling, dropout, data augmentation, ReLU activations, and SGD with momentum. It has about 60 million parameters, and training the model takes 6 days on 2 GPUs.
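To make those ingredients concrete, here is a minimal AlexNet-style sketch in PyTorch. This is my own illustration, not the original implementation: the channel counts and the single fully connected layer are simplified assumptions, but the kernel sizes, max pooling, dropout, ReLU activations, and SGD with momentum match the list above.

```python
import torch
import torch.nn as nn

# A rough AlexNet-style stack: 11x11, 5x5, and 3x3 convolutions with ReLU,
# max pooling, and dropout before the classifier. Channel sizes are illustrative.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(192, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Dropout(0.5),
    nn.Linear(256 * 6 * 6, 1000),   # 1,000 ImageNet classes
)

# SGD with momentum, as used to train the original network.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(1, 3, 224, 224)     # one 224x224 RGB image
print(model(x).shape)               # torch.Size([1, 1000])
```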

The next breakthrough was the VGG architecture in 2015. VGG is very similar to AlexNet: it uses convolutional layers followed by pooling layers, just like the LeNet architecture going back to 1998, but it contains many more filters. As a consequence, VGG reduced the ImageNet top 5 error down to 8% for a single model. The training of this architecture is similar to AlexNet, but it adds multi-scale cropping as data augmentation. VGG has 138 million parameters, and it trains on 4 GPUs for 2 to 3 weeks.
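The “convolutions followed by pooling” pattern can be sketched as a reusable block. The helper below is my own simplified illustration in PyTorch, not the actual VGG code; the number of convolutions per block and the channel counts are assumptions chosen for readability.

```python
import torch.nn as nn

def vgg_block(in_channels, out_channels, num_convs=2):
    """A VGG-style block: several 3x3 convolutions with ReLU, then 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU()]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# Stacking such blocks halves W x H at every step while the filter count grows,
# e.g. 64 -> 128 -> 256, which is where the large parameter count comes from.
features = nn.Sequential(
    vgg_block(3, 64),
    vgg_block(64, 128),
    vgg_block(128, 256),
)
```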

In 2015, the Inception architecture appeared. It is not similar to AlexNet; instead, it is built from the Inception block, which was introduced in GoogLeNet, also known as Inception V1. It reduced the ImageNet top 5 error to 5.6% for a single model. Inception is a really complex and deep model. It uses batch normalization, image distortions as augmentation, and RMSProp for gradient descent. Only 25 million parameters are needed, but it trains on 8 GPUs for 2 weeks.

Before we dive into the details of how the Inception block works, let’s have a look at 1 by 1 convolutions. Such 1 by 1 convolutions capture interactions of the input channels within one pixel of the feature map. They can reduce the number of channels without hurting the quality of the model, because different channels are often correlated. A 1 by 1 convolution therefore works like dimensionality reduction, with a ReLU activation added on top; generally speaking, the number of output channels is smaller than the number of input channels. All operations inside an Inception block use stride 1 and enough padding to keep the spatial dimensions, so every branch outputs a feature map with the same W x H as the input. At the end, the four different feature maps are concatenated on depth, like a layered cake: we stack all of those feature maps along the channel dimension. One of the branches passes the input through just a 1 by 1 convolution, so the input contributes directly to the output. In summary, inside the Inception block, 1 by 1 convolutions are used to reduce the number of channels before the more expensive 5 by 5 and 3 by 3 convolutions and after the pooling branch.
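Below is a minimal sketch of such an Inception block in PyTorch. It is a simplified version of the idea, not the exact GoogLeNet code, and the branch channel counts (c1, c3, c5, cp) and reduction sizes (red3, red5) are illustrative names I chose for this example.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Simplified Inception block: four parallel branches concatenated on depth."""
    def __init__(self, in_ch, c1=64, c3=128, c5=32, cp=32, red3=96, red5=16):
        super().__init__()
        # Branch 1: plain 1x1 convolution of the input.
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU())
        # Branch 2: 1x1 channel reduction, then 3x3 convolution (padding keeps W x H).
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, red3, 1), nn.ReLU(),
                                nn.Conv2d(red3, c3, 3, padding=1), nn.ReLU())
        # Branch 3: 1x1 channel reduction, then 5x5 convolution.
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, red5, 1), nn.ReLU(),
                                nn.Conv2d(red5, c5, 5, padding=2), nn.ReLU())
        # Branch 4: 3x3 max pooling with stride 1, then 1x1 convolution.
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, cp, 1), nn.ReLU())

    def forward(self, x):
        # Every branch preserves W x H, so the outputs can be stacked on depth.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

block = InceptionBlock(192)
y = block(torch.randn(1, 192, 28, 28))
print(y.shape)  # torch.Size([1, 256, 28, 28]) -- 64 + 128 + 32 + 32 channels
```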

Why does the Inception block work better? In a simple neural network architecture, every convolutional layer has a kernel of one fixed size. But if you use sliding windows at different scales, say 5 by 5, 3 by 3, and 1 by 1, you can use all of those features at the same time and learn better representations. Let’s first try to replace the 5 by 5 convolutions, because they are currently the most expensive part of our Inception block. We replace each of them with two layers of 3 by 3 convolutions, which, as we already know, have an effective receptive field of 5 by 5. Another technique known in computer vision is filter decomposition. It is known that a Gaussian blur filter can be decomposed into two one-dimensional filters: we first blur the source horizontally, then blur that result vertically, and the output is identical to applying the 2D Gaussian blur to the input. Let’s borrow the same idea and apply it to our Inception block. After replacing each 5 by 5 convolution with two 3 by 3 convolutions, the 3 by 3 convolutions become the most expensive part of the block. So let’s replace each 3 by 3 layer with one 1 by 3 layer followed by a 3 by 1 layer. What we are actually doing is decomposing each 3 by 3 convolution into a series of one-dimensional convolutions, that is, replacing each 3 by 3 convolutional layer with two layers of one-dimensional convolutions. This is the final state of our Inception block, and this block is the one employed in the Inception V3 architecture.
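Here is a small sketch of that factorization idea in PyTorch, again a simplified illustration rather than the actual Inception V3 code; the channel count `ch` is an arbitrary assumption.

```python
import torch
import torch.nn as nn

ch = 64  # illustrative channel count

# One 5x5 convolution ...
conv5x5 = nn.Conv2d(ch, ch, kernel_size=5, padding=2)

# ... replaced by two stacked 3x3 convolutions with the same 5x5 receptive field.
two_3x3 = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
)

# A 3x3 convolution decomposed into one-dimensional 1x3 and 3x1 convolutions.
factorized_3x3 = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=(1, 3), padding=(0, 1)), nn.ReLU(),
    nn.Conv2d(ch, ch, kernel_size=(3, 1), padding=(1, 0)), nn.ReLU(),
)

x = torch.randn(1, ch, 28, 28)
for m in (conv5x5, two_3x3, factorized_3x3):
    print(m(x).shape)  # the 28x28 spatial size is preserved in every case

# Weight counts show why the factorized versions are cheaper (ignoring biases):
# one 5x5: 25*ch*ch, two 3x3: 18*ch*ch, 1x3 followed by 3x1: 6*ch*ch.
```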

Another architecture that appeared in 2015 is ResNet. It introduces residual connections, and it reduces the top 5 ImageNet error down to 4.5% for a single model and 3.5% for an ensemble. ResNet has 152 layers, including a few expensive 7 by 7 convolutional layers, with 3 by 3 convolutions in the rest. ResNet uses batch normalization as well as max and average pooling. It has 60 million parameters, and it trains on 8 GPUs for 2 to 3 weeks. So what is the residual connection in this architecture? What we actually do is create the output channels by adding a small delta, modeled as F(x), to the original input channels. That delta F(x) is computed by a weight layer, followed by a ReLU activation, followed by one more weight layer. This way we can stack thousands of layers and the gradients do not vanish, thanks to the residual connection: because we always add only a small delta to the input channels, the gradient flows better during backpropagation.
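A minimal sketch of that residual connection in PyTorch is shown below. It is a simplified basic block under the assumption that the channel count and spatial size do not change; the real ResNet also has blocks that change stride and width.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(x + F(x)),
    where F is conv -> batch norm -> ReLU -> conv -> batch norm."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # The identity shortcut lets gradients flow straight through the addition,
        # which is what keeps very deep stacks of these blocks trainable.
        return torch.relu(x + self.f(x))

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```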

To summarize, we have seen that stacking more convolutional and pooling layers, as in AlexNet and VGG, effectively reduces the error. But we cannot keep doing that forever; we need to invent new kinds of building blocks, such as the Inception block or residual connections, and make use of them. We have probably also noticed that training one of these neural networks takes a lot of time.