Convolutional Neural Networks (CNNs) are frequently preferred in computer vision applications because of their successful results on object recognition and classification tasks. CNNs are composed of many neurons stacked together. Computing convolutions across neurons requires a lot of computation, so pooling operations are often used to reduce the size of network layers. Convolutional approaches make it possible to learn many complex features of our data with simple computations. By performing many matrix multiplications and summations on our input, we can arrive at an answer to our question.
I always hear how great CNNs are. When do they fail?
It’s true that CNNs have shown great success in solving object recognition and classification problems. However, they aren’t perfect. If a CNN is shown an object in an unfamiliar orientation, or with parts in places it’s not used to, its prediction will likely fail.
For example, if you turn a face upside down, the network will no longer be able to recognize the eyes, nose, and mouth, or the spatial relationships between them. Similarly, if you alter specific regions of the face (e.g., switch the positions of the eyes and nose), the network will still recognize it as a face, even though it no longer looks like a real one. CNNs learn statistical patterns in images, but they don’t learn fundamental concepts about what makes something actually look like a face.
Theorizing about why CNNs fail to learn concepts, Geoffrey Hinton, often called the “Godfather of AI,” focused on the pooling operation used to shrink the size and computation requirements of the network. He lamented:
“The pooling operation used in convolutional neural networks is a big mistake, and the fact that it works so well is a disaster!”
The pooling layer was destroying information and making it impossible for networks to learn higher-level concepts. So he set out to develop a new architecture that didn’t rely so heavily on this operation.
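To make the information loss concrete, here is a minimal NumPy sketch. It is only an illustration: the helper max_pool_2x2 and the toy 4x4 feature maps are assumptions made for this example, not part of any particular CNN library. Two feature maps with the same activation in different positions end up identical after pooling.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D array with even side lengths."""
    h, w = x.shape
    # Group pixels into 2x2 blocks, then keep only the largest value per block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# Two toy 4x4 feature maps (hypothetical data): the same strong activation
# sits at different positions inside the top-left 2x2 block.
a = np.zeros((4, 4))
b = np.zeros((4, 4))
a[0, 0] = 9.0
b[1, 1] = 9.0

pooled_a = max_pool_2x2(a)
pooled_b = max_pool_2x2(b)

print(pooled_a.shape)                      # (2, 2): only a quarter of the values remain
print(np.array_equal(pooled_a, pooled_b))  # True: the position inside the block is lost
```

After pooling, the network can no longer tell which of the two inputs it saw, which is the kind of discarded spatial detail the quote is complaining about.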
The result: Capsule Networks
Hinton and Sabour borrowed ideas from neuroscience that suggest the brain is organized into modules called capsules. These capsules are particularly good at handling features of objects like pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc.
The brain, they theorize, must have a mechanism for routing low-level visual information to what it believes is the best capsule for handling it. Capsule networks and dynamic routing algorithms have been proposed as solutions to problems where convolutional neural network models are inadequate.
Capsules represent the various properties of a particular entity that is present in the image. These properties are called instantiation parameters and include position, size, orientation, deformation, velocity, albedo, hue, texture, etc. One very special property is the existence of the instantiated entity in the image. An obvious way to represent its existence is by using a separate logistic unit, whose output is the probability that the entity exists. Capsule networks instead use the length of the capsule’s output vector for this, and rely on an iterative routing-by-agreement mechanism to get better results than CNNs. In the classic CNN model, such attributes of the object in the image are not captured, because the average or max-pooling layer discards information as it reduces the size of the feature maps.
Somewhere in the image there is a lip, a nose, and an eye, but a convolutional neural network can’t tell whether each one is where it should be. Misplaced features don’t faze a traditional network!
In deep neural networks, activation functions are simple mathematical operations applied to the output of layers. They are used to approximate non-linear relationships that exist in data. Activation layers typically act on scalar values—for example, normalizing each element in a vector so that it falls between 0 and 1.
In Capsule Networks, a special type of activation function called a squash function is used to normalize the magnitude of vectors, rather than the scalar elements themselves.
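The difference between the two kinds of activation can be sketched in a few lines of NumPy. The squash formula below follows the one given in the original capsule network paper, v = (|s|^2 / (1 + |s|^2)) * (s / |s|). The function names, the eps term, and the example vector are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(x):
    """Element-wise activation: each scalar is mapped into (0, 1) independently."""
    return 1.0 / (1.0 + np.exp(-x))

def squash(s, axis=-1, eps=1e-8):
    """Vector activation: rescale s so its length lies in [0, 1), keeping its direction."""
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)          # short vectors -> near 0, long vectors -> near 1
    return scale * s / np.sqrt(sq_norm + eps)  # eps (an assumption) avoids division by zero

s = np.array([3.0, 4.0])       # a vector of length 5
v = squash(s)

print(sigmoid(s))              # each element squeezed on its own
print(np.linalg.norm(v))       # about 0.9615 (= 25 / 26)
print(v / np.linalg.norm(v))   # about [0.6, 0.8], the same direction as s
```

The direction of the vector, which carries the pose information, is preserved. Only its length changes, so the length can be read as the probability that the entity is present.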
The outputs from these squash functions tell us how to route data through various capsules that are trained to learn different concepts. The properties of each object in the image are expressed in the vectors being routed. For example, the activations of a face may route different parts of an image to capsules that understand eyes, noses, mouths, and ears.
Now, the next step is crucial:
Just like the layers at different levels of deep CNNs learn different semantic attributes of images (content, texture, style, etc.), capsules can be organized into different levels as well. Capsules at one level learn about the shapes of objects, make predictions, and pass those on to higher-level capsules, which learn about orientations. When multiple predictions agree, a higher-level capsule becomes active. This process is described as dynamic routing, which I will talk about in more detail now.
So, let’s create a step-by-step capsule architecture for classifying the MNIST dataset:
The first layer is a classic convolution layer. In the second layer, called the primary capsule layer, a convolution is performed and the squash function is applied. Each primary capsule receives a small region of the image as input (called its receptive field), and it tries to detect the presence and pose of a particular pattern, for example a circle.
Capsules in higher layers (called routing capsules) detect larger and more complex objects, such as the number 8, made up of two circles. Then they use a novel squashing function to guarantee that their output vectors have a length between 0 and 1.
A standard convolution layer with 256 filters of size 9x9 is applied before the primary capsule layer; on a 28x28 MNIST image this yields a 20x20x256 output. A new convolution with 32 channels and a stride of 2 is then applied in the primary capsule layer. What separates it from other convolutions is the squash function, which is applied to produce the output of the primary capsules.
This gives a 6x6 output, with 32 channels of 8-dimensional capsule vectors at each position. The third layer, DigitCaps, takes these vectors as input and computes its own output vectors (one 16-dimensional vector per digit class in the original paper’s configuration) with a dynamic routing algorithm, the routing-by-agreement algorithm. Routing-by-agreement runs for a few iterations, each of which measures the agreement between predictions and updates the routing accordingly.
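The layer sizes can be checked with a little arithmetic. The sketch below assumes 28x28 MNIST inputs and unpadded 9x9 kernels in both convolutions, with stride 1 and then stride 2, as in the original capsule network paper. The helper name conv_output_size is made up for this example.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Side length of an unpadded ('valid') convolution output."""
    return (input_size - kernel_size) // stride + 1

image = 28  # MNIST images are 28x28

# Assumed configuration: 9x9 kernels, stride 1 for the first convolution,
# stride 2 for the primary capsule convolution.
conv1 = conv_output_size(image, kernel_size=9, stride=1)
primary = conv_output_size(conv1, kernel_size=9, stride=2)

channels, capsule_dim = 32, 8
num_primary_capsules = primary * primary * channels

print(conv1, primary, num_primary_capsules)  # 20 6 1152
```

So the 6x6 grid with 32 channels corresponds to 1152 primary capsules, each an 8-dimensional vector, that are passed on to the routing step.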
The dynamic routing is defined in the CapsuleLayer(layers.Layer) class in capsulelayers.py. Thanks to this calculation step, the vectors are short in areas where the object is not present in the image, while in areas where it is detected, the individual dimensions of the vector vary depending on the object’s attributes.
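To show the idea behind routing-by-agreement without the Keras machinery, here is a minimal NumPy sketch. It is not the code from capsulelayers.py: the function names, the toy prediction vectors, and the choice of three iterations are assumptions for illustration, and the learned weight matrices that produce the predictions are left out.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-8):
    """Rescale vectors so their length lies in [0, 1), keeping direction."""
    sq_norm = np.sum(np.square(s), axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def softmax(x, axis):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def dynamic_routing(u_hat, num_iterations=3):
    """u_hat has shape (num_in, num_out, dim_out); u_hat[i, j] is the vector
    that lower capsule i predicts for higher capsule j."""
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                 # routing logits, initially uniform
    for _ in range(num_iterations):
        c = softmax(b, axis=1)                      # each lower capsule splits its vote
        s = np.sum(c[:, :, None] * u_hat, axis=0)   # weighted sum per higher capsule
        v = squash(s)                               # higher capsule outputs
        b = b + np.sum(u_hat * v[None, :, :], axis=-1)  # reward agreement (dot product)
    return v, c

# Toy predictions (hypothetical): 4 lower capsules, 2 higher capsules, 2 dimensions.
u_hat = np.zeros((4, 2, 2))
u_hat[:, 0] = [1.0, 0.0]                                     # all four agree on capsule 0
u_hat[:, 1] = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]  # they disagree on capsule 1

v, c = dynamic_routing(u_hat)
print(np.linalg.norm(v, axis=-1))  # roughly 0.92 for capsule 0, and 0.0 for capsule 1
print(c[0])                        # most of the routing weight goes to capsule 0
```

The capsule whose incoming predictions agree ends up with a long output vector, while the capsule that receives conflicting predictions stays near zero. This is the behavior described above: short vectors where nothing consistent is detected, long ones where it is.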
You can also find all the work here.
On the MultiMNIST dataset, in which handwritten digits overlap by 80%, the Capsule Network performs impressively well, especially when compared to the CNN model.
Compared to the CNN, training the capsule network is slower because of its computational complexity. Here’s a look at 50-epoch training time on various hardware and on a cloud server:
To use Google Colab support, the most appealing option, please read the Google Colab Free GPU Tutorial!
✔️Capsule networks achieve the highest accuracy on the MNIST dataset when compared to other state-of-the-art techniques.
✔️It’s successful with smaller datasets. (By forcing the model to learn the feature variant in a capsule, it can extrapolate possible variants more effectively with less training data.)
✔️The routing-by-agreement algorithm allows us to distinguish objects in overlapping images.
✔️It’s easier to interpret the image with activation vectors.
✔️Capsule networks provide equivariance and maintain information such as the hue, pose, albedo, texture, deformation, velocity, and location of the object.