This is my first blog post in a few months, and I have missed writing!
I have recently completed a project involving satellite image classification and it has inspired me to write a post.
Today, I will be talking about convolutional neural networks, which have gained a lot of attention, especially for computer vision and image classification. They are highly proficient at identifying objects, faces, and traffic signs, and they power vision in self-driving cars and robots. Many of the advanced vision technologies you see out there (robots, machines, self-driving cars, to name a few) are probably using this type of neural network to achieve their goals. Guess what? You will actually learn how they work and get a rough idea of what is going on. I will try to explain it assuming that you have just a rough idea of what machine learning and deep learning are.
As stated earlier, Convolutional Neural Networks (CNNs) are a type of neural network. A neural network is simply a "mechanism" vaguely inspired by the biological neural networks that constitute animal brains. The neural network itself is not an algorithm, but rather a framework for many different machine learning algorithms to work together and process complex data inputs. Such systems "learn" to perform tasks by considering examples, generally without being programmed with any task-specific rules. In our case, the goal of a CNN is to use images to detect specific objects, extract relevant information (as in face recognition), perform image segmentation, and so on.
The images below show object detection and image segmentation using a CNN. Notice how accurate the results are.
Example of object detection. Source: https://arxiv.org/pdf/1506.01497v3.pdf
Separating roads from the rest in satellite images. Source: my Machine Learning project
Before diving into how a CNN actually works (intuitively, without detailing the mathematics), we first need to understand how an image is represented in the computer, that is, how the neural network sees an input image.
When a computer takes an image as input, it sees an array of pixel values. For example, let's say we have an image in JPG format with size 480 x 480. The representative array will be 480 x 480 x 3 (3 refers to the RGB values). Each of these numbers is a value from 0 to 255 that describes the pixel intensity at that point (assuming an 8-bit image, which is the standard). These numbers, while meaningless to us when we perform image classification, are the only inputs available to the computer.
Example of a 2D array given to the computer. Source: https://www.slideshare.net
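To make this concrete, here is a minimal sketch of the same layout in Python. I use plain nested lists so it runs without any extra library, and a tiny 2 x 2 image instead of 480 x 480. The helper names `shape` and `total_values` are my own, chosen for illustration only.

```python
# A tiny 2 x 2 RGB "image" stored as nested lists: rows -> pixels -> [R, G, B].
# A decoded 480 x 480 image has the same layout, just with many more pixels.
image = [
    [[255, 0, 0], [0, 255, 0]],      # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]],  # blue pixel, white pixel
]


def shape(img):
    """Return (height, width, channels) of a nested-list image."""
    return (len(img), len(img[0]), len(img[0][0]))


def total_values(height, width, channels=3):
    """Count how many numbers the computer receives for an image of this size."""
    return height * width * channels


print(shape(image))            # (2, 2, 3)
print(total_values(480, 480))  # 691200
```

So a 480 x 480 colour image is nothing more than 691,200 integers between 0 and 255 as far as the computer is concerned.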
Now that we know how a computer sees an image, we want to extract information from it. For human beings, telling a dog from a cat is a fairly easy task. For a computer it is much harder, because all it has to work with is an array of numbers.
What this neural network will do is take the image and pass it through a series of convolutional, nonlinear, pooling and fully connected layers to get a certain output. The output can be a single class or a set of probabilities over the classes that best describe the image (examples of classes are human, dog, cat, horse, etc.). In the example below, we see that the neural network predicts that this image represents a cat with a probability of 82%, which is a fairly confident prediction. However, you will see that the network starts out inaccurate and has to "learn" how to correctly label an image. The final goal would be to predict it with 99-100% accuracy.
Example of how a computer sees an image and trying to label it. Source: http://cs231n.github.io/classification/
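To give an idea of what "a probability for each class" looks like, here is a small sketch. It assumes the last layer produces one raw score per class and that those scores are turned into probabilities with the softmax function, which is a common choice but not something I have introduced in this post. The class names and scores are made up for illustration.

```python
import math


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the highest score avoids overflow and does not change the result.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


classes = ["cat", "dog", "horse"]  # hypothetical classes
scores = [3.0, 1.0, 0.2]           # hypothetical raw outputs of the last layer

probabilities = softmax(scores)
best_class = classes[probabilities.index(max(probabilities))]

for name, p in zip(classes, probabilities):
    print(f"{name}: {p:.2f}")
print("prediction:", best_class)
```

The predicted class is simply the one with the highest probability, here "cat".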
Architecture Details of CNN
Neural networks are made of layers which contain neurons. The particular architecture of the layers is a very important concept, because different types of layers learn in different ways.
In this section, I will describe the basic architecture used by CNN. Of course, a number of optimization techniques can be applied, but we will stick to the basics.
Extracting the features - Convolution Layer
Convolution is one of the main building blocks of a CNN. The term convolution refers to the mathematical combination of two functions to produce a third function. It is widely used in the field of signal processing.
In the case of a CNN, the convolution is performed on the input data with a filter (also called a kernel) to produce a result called a feature map.
We execute a convolution by sliding the filter over the input. At every location, the filter and the patch of the input underneath it are multiplied element by element, and the products are summed into a single value of the feature map.
In the animation below, you can see the convolution operation: the filter (the yellow square) slides over our input (the green square), and the sum computed at each position goes into the feature map (the red square). Note that the filter shown is 3x3. You can use other dimensions, but 3x3 is a popular choice.
Convolution in images
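The sliding operation can be sketched in a few lines of plain Python. This is a minimal version under my own assumptions: a single-channel input, no padding, and a stride of 1. Like most deep learning libraries, it does not flip the filter, so strictly speaking it is a cross-correlation. The function name `convolve2d` and the input values are chosen for illustration.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Element-wise multiply the filter with the patch under it, then sum.
            total = 0
            for a in range(k_h):
                for b in range(k_w):
                    total += image[i + a][j + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

print(convolve2d(image, kernel))  # [[4, 3, 4], [2, 4, 3], [2, 3, 4]]
```

A 5x5 input and a 3x3 filter give a 3x3 feature map, because the filter only fits in 3 positions along each axis.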
As you may notice, the value at a given location of the feature map depends only on its local neighbours in the input. This is actually one key property of a CNN: by using only neighbouring values, it can analyze a particular object independently of where it is in the image. For example, if we shift our input image, the neighbouring elements are shifted as well, so the same analysis simply applies at the shifted position.
Introducing non-linearity: ReLU Activation Function
The Rectified Linear Unit (ReLU) is a non-linear function used after the convolution layer. It is extremely important to remember that this function is not linear. Most of the problems we deal with in practice are not linear (otherwise life would be so easy), and applying a non-linear function is what lets the network model non-linear relationships as well.
The ReLU function is shown below:
This function acts as an element-wise operation (applied per pixel) and replaces all negative pixel values in the feature map with zero.
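In code, ReLU is about as simple as an operation can get. Here is a minimal sketch on a nested-list feature map. The values are made up for illustration.

```python
def relu(feature_map):
    """Replace every negative value in the feature map with zero."""
    return [[max(0, value) for value in row] for row in feature_map]


feature_map = [
    [-3, 5],
    [2, -1],
]

print(relu(feature_map))  # [[0, 5], [2, 0]]
```

Positive values pass through unchanged, and the output keeps the same dimensions as the input.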
Reducing computation while maintaining what is important: Pooling
If we wish to reduce the dimensionality of individual feature maps and yet keep the crucial information, we use the Pooling layer after the Convolution layer. Different types of Pooling exist: Average, Sum, Maximum, etc.
For Max Pooling, we first specify a spatial neighborhood (such as a 2×2 window) and then pick out the largest element of the rectified feature map within that window. If we take the average of the elements instead of the largest one, it is called Average Pooling, and when we take the sum of the elements in the window, we call it Sum Pooling.
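Here is a minimal sketch of the three pooling variants. I am assuming non-overlapping windows (the window moves by its own size each step), which is a common setup but only one of the possible choices. The function name `pool2d` and the input values are mine.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Reduce each non-overlapping size x size window to a single value."""
    reducers = {
        "max": max,
        "sum": sum,
        "average": lambda window: sum(window) / len(window),
    }
    reduce_window = reducers[mode]
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            # Collect the values inside the current window.
            window = [feature_map[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


rectified = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

print(pool2d(rectified, mode="max"))      # [[6, 8], [3, 4]]
print(pool2d(rectified, mode="sum"))      # [[13, 21], [8, 8]]
print(pool2d(rectified, mode="average"))  # [[3.25, 5.25], [2.0, 2.0]]
```

With a 2x2 window, a 4x4 map shrinks to 2x2, so the next layer has four times fewer values to process.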
Summing up what we have seen
So far I have shown you the architecture of a CNN and the various operations it uses, without explicitly explaining how the result is computed. This leads to the following question:
How does it compute the result?
As I stated earlier, neural networks are made of layers, in which we have nodes representing neurons (similar to the brain). Between those layers exist edges (some sort of connections, again similar to the brain) that carry weights or filter values. Those values are updated repeatedly during training to produce better results, using what we call backpropagation.
Before training starts, the weights or filter values placed in the connections are randomly initialized. Let's assume we have a training set with thousands of images of dogs and cats, where each image has a label saying which animal is in the picture. The final goal is to adjust those weights so that the label predicted by the network matches the label given in the training set.
An overview of the steps that the neural network takes is:
1. Initialize all filters and parameters / weights with random values (as discussed above).
2. The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
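3. Calculate the error at the output, that is, how far the output probabilities are from the label given in the training set.
4. Use Backpropagation to calculate the gradients of the error with respect to all weights in the network and use (stochastic) gradient descent to update all filter values / weights and parameter values to minimize the output error.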
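The training loop described above can be sketched on a much smaller model. To keep it short, this example is not a CNN: it trains a single sigmoid neuron on a made-up two-feature dataset, using the cross-entropy error, which is my own choice for the illustration. The structure of the loop (random initialization, forward pass, error, gradient, update) is the same idea, and all names and values are hypothetical.

```python
import math
import random


def predict(weights, bias, x):
    """Forward pass: weighted sum followed by a sigmoid, giving a probability."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def total_error(weights, bias, data):
    """Cross-entropy error summed over the whole training set."""
    error = 0.0
    for x, y in data:
        p = predict(weights, bias, x)
        error -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return error


def train(data, epochs=200, learning_rate=0.5, seed=0):
    rng = random.Random(seed)
    # Initialize all weights with random values.
    weights = [rng.uniform(-0.5, 0.5) for _ in data[0][0]]
    bias = rng.uniform(-0.5, 0.5)
    error_before = total_error(weights, bias, data)
    for _ in range(epochs):
        for x, y in data:
            # Forward pass on one training example.
            p = predict(weights, bias, x)
            # Gradient of the cross-entropy error with respect to the weighted sum.
            gradient = p - y
            # Stochastic gradient descent: move each weight against its gradient.
            weights = [w - learning_rate * gradient * xi
                       for w, xi in zip(weights, x)]
            bias -= learning_rate * gradient
    return weights, bias, error_before, total_error(weights, bias, data)


# Toy training set: label 1 when the first feature dominates, 0 otherwise.
data = [([0.9, 0.1], 1), ([0.8, 0.3], 1), ([0.2, 0.7], 0), ([0.1, 0.9], 0)]

weights, bias, error_before, error_after = train(data)
print("error before training:", round(error_before, 3))
print("error after training:", round(error_after, 3))
```

The error after training is lower than before, which is exactly what "learning" means here. In a real CNN the same loop updates the filter values of the convolution layers as well, and a library computes the gradients for you.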
In summary, a CNN is a type of neural network that is based on the idea of convolution. By using a convolution layer along with other types of layers, it can learn how to perform object segmentation or detection. However, CNNs do not stop at images! They are also widely used in natural language processing.
If you are interested in creating your own convolutional neural networks, I highly recommend the Keras library. It is a very intuitive Python library that uses TensorFlow and can help you learn neural networks in a practical way.
Ahmed Ahres, 22.