Capsule Networks: A new and attractive AI architecture🚨

By Ayyüce Kızrak

Convolutional Neural Networks (CNN) are frequently preferred in computer vision applications because of their successful results on object recognition and classification tasks. CNNs are composed of many neurons stacked together. Computing convolutions across neurons require a lot of computation, so pooling processes are often used to reduce the size of network layers. Convolutional approaches make it possible to learn many complex features of our data with simple computations. By performing many matrix multiplications and summations on our input, we can arrive at an answer to our question.

Basic Convolutional Neural Network

I always hear how great CNNs are. When do they fail?

It’s true that CNNs have shown great success in solving object recognition and classification problems. However, they aren’t perfect. If a CNN is shown an object in an orientation it is unfamiliar with or where objects appear in places that it’s not used to, it’s prediction task will likely fail.

For example, if you turn a face upside down, the network will no longer be able to recognize eyes, a nose, a mouth, and the spatial relationship between the two. Similarly, if you alter specific regions of the face (i.e. switch the positions of the eyes and nose), the network will be able to recognize the face but it is no longer a real face. CNNs learn statistical patterns in images, but they don’t learn fundamental concepts about what makes something actually look like a face.

Geoffrey Hinton and Sara Sabour

Theorizing about why CNNs fail to learn concepts, Geoffrey Hinton, the Father of AI, focused on the pooling operation used to shrink the size and computation requirements of the network. He lamented:

“The pooling operation used in convolutional neural networks is a big mistake, and the fact that it works so well is a disaster!”

The pooling layer was destroying information and making it impossible for networks to learn higher-level concepts. So he set out to develop a new architecture that didn’t rely so heavily on this operation.

The result: Capsule Networks

Hinton and Sabour borrowed ideas from neuroscience that suggest the brain is organized into modules called capsules. These capsules are particularly good at handling features of objects like pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc.

The brain, they theorize, must have a mechanism for routing low-level visual information to what it believes is the best capsule for handling it. Capsule networks and dynamic routing algorithms have been proposed as solutions to problems where convolutional neural network models are inadequate.

Portrait de femme au col d`hermine (Olga)

Capsules represent the various features of a particular entity that are present in the image. One very special feature is the existence of the instantiated entity in the image. The instantiated entity is a parameter such as position, size, orientation, deformation, velocity, albedo, hue, texture, etc. An obvious way to represent its existence is by using a separate logistic unit, whose output is the probability that the entity exists [1]. To get better results than CNNs, we should use an iterative routing-by-agreement mechanism. These features are called instantiation parameters. In the classic CNN model, such attributes of the object in the image are not obtained. The average / max-pooling layer reduces the size of a set of information while the size is reduced.

Well, somewhere there is a lip, nose, and eye, but the convolutional neural network can’t decide where it should be and where it is. With traditional networks, misplaced features don’t faze it!
Squash function

In deep neural networks, activation functions are simple mathematical operations applied to the output of layers. They are used to approximate non-linear relationships that exist in data. Activation layers typically act on scalar values—for example, normalizing each element in a vector so that it falls between 0 and 1.

In Capsule Networks, a special type of activation function called a squash function is used to normalize the magnitude of vectors, rather than the scalar elements themselves.

algorithm of routing-by-agreement

The outputs from these squash functions tell us how to route data through various capsules that are trained to learn different concepts. The properties of each object in the image are expressed in the vectors routing them. For example, the activations of a face may route different parts of an image to capsules that understand eyes, noses, mouths, and ears.

Angle estimation as performed by capsule networks

Now, the next step is crucial:

Just like the layers at different levels of deep CNNs learn different semantic attributes of images (content, texture, style, etc.), capsules can be organized into different levels as well. Capsules at one level make predictions, learn about the shapes of objects, and pass those on to higher level capsules, which learn about orientations. When multiple predictions agree, a higher-level capsule becomes active. This process is described as dynamic routing, which I will talk about in more detail now.

So, let’s create a step-by-step capsule architecture for classifying the MNIST dataset:

The first layer has a classic convolution layer. In the second layer, a convolution process is performed in the layer called the primary capsule, where the squash function is applied. Each primary capsule receives a small region of the image as input (called its receptive field), and it tries to detect the presence and pose of a particular pattern—for example, a circle.

Capsules in higher layers (called routing capsules) detect larger and more complex objects, such as the number 8, made up of two circles. Then they use a novel squashing function to guarantee these vectors have a length between 0 and 1.

A standard convolution layer is applied before the primary capsule layer and an output of 9x9x256 is obtained. A new convolution process with 32 channels is applied in the primary capsule layer with a stride of 2. However, this feature that separates it from other convolution processes is the function of squashing. Lastly, this gives the output of the primary capsules.

Convolution process of the primary capsule

This gives a 6x6 output. However, in the capsule layer, a dynamic routing algorithm is implemented, such that 32 outputs of these 8-length outputs DigitCaps vectors are obtained as a result of the capsule layer with the third layer of dynamic routing (routing-by-agreement algorithm). The routing-by-agreement algorithm includes a few iterations of agreement (detection, and routing) update.

Capulel layer

The dynamic routing in is defined in the class CapsuleLayer (layers.Layer) function. Thanks to this calculation step, the vector values are small in areas where the object is not present in the image, while the dimensions of the vector in the detected areas vary depending on the attribute.

You can also find all the work here.

When testing with a 10,000-image test dataset, we achieved 99.61% accuracy for the MNIST dataset and 92.22% accuracy for the FASHION MNIST dataset have been obtained. Yeah!! 😎

For the MultiMNIST dataset, with 80% overlapping handwritten numbers, the performance of the Capsule Network appears to be impressively good when the data overlap, especially when compared to the CNN model.

Capsule network outputs for the MultiMNIST dataset

As compared to the CNN, the training time for the capsule network is slower because of its computational complexity. Here’s a look at 50-epoch training time on various hardware and on a cloud server:

For sure :)
To use Google Colab support, the most appealing option, please read the Google Colab Free GPU Tutorial!

✔️Capsule networks have the highest success in the MNIST dataset when compared to other state-of-the-art techniques.

✔️It’s successful with smaller datasets. (By forcing the model to learn the feature variant in a capsule, it can extrapolate possible variants more effectively with less training data.)

✔️The routing-by-agreement algorithm allows us to distinguish objects in overlapping images.

✔️It’s easier to interpret the image with activation vectors.

✔️Capsule networks maintain information such as equivariance, hue, pose, albedo, texture, deformation, speed, and location of the object.

❌CIFAR10 does not succeed with the dataset compared to state of the art models.

❌Have not been tested on very large datasets.

❌Due to the routing-by-agreement algorithm, more time is required for training the model.

The application of the capsule network model with different routing algorithms shows that it’s a subject that needs more experimentation and is still developing.

“ Convolutional neural networks are doomed” Geoffrey Hinton

In the example shown above, when only the direction of the image of Kim Kardashian is changed, prediction accuracy drops significantly. In the image on the right, we can easily tell that an eye and her mouth are incorrectly placed and that this isn’t what a person is assumed to look like. Yet, we see a .90 prediction score.

A perfectly-trained CNN has some obstacles with this approach. Aside from being easily tricked by images with features in the incorrect locations, a CNN is also easily confused when viewing an image in a different orientation.

Beyond any doubt, CNNs can be affected by adversarial attacks. This is an important constraint that can cause security problems, especially when we embed a latent pattern into an object to make it look like something else. However, as we’ve learned, we can solve this problem with Capsule Networks!

✨The models of Capsule Networks are worthwhile and promising!

📝Our academic paper about Capsule Networks:
Recognition of Sign Language Using Capsule Networks
Hearing and speech impaired persons continue to communicate with the help of lip-reading or hand and face movements (i.e. sign language). Capsule Networks can help in ensuring that disabled people can not only participate fully in life but also increase the quality of life through healthy and effective communication with other people.

In this work; digits of the sign language were recognized with 94.2% validation accuracy by Capsule Networks.

Sign Language Digits Dataset, Arda Mavi ve Zeynep Dikle

[1] Sabour, S., Frosst, N. ve Hinton, G.E., “Dynamic Routing Between Capsules’’, arXiv preprint arXiv:1710.09829, 2017.

[2]Hinton, G. E., Krizhevsky A. ve Wang, S. D. “Transforming Auto-encoders.” International Conference on Artificial Neural Networks. Springer, Berlin, Heidelberg, 2011.

[3] CSC2535: 2013 Advanced Machine Learning Taking Inverse Graphics Seriously, Geoffrey Hinton Department of Computer Science University of Toronto, 2013.

[4] Capsule Network Implementation of MNIST dataset (in Turkish explanation, Deep Learning Türkiye, kapsul-agi-capsule-network.

[5] Recognition of Sign Language Using Capsule Networks, Fuat Beşer, Merve Ayyüce KIZRAK, Bülent BOLAT, Tülay YILDIRIM,

Editor’s Note: Interested in learning more about state-of-the-art machine learning models, tools, frameworks, and more? Check out Heartbeat’s growing library of machine learning resources to get started.